
Understanding Natural Language Processing with Me!

Hi, I'm Shiyi. Welcome to my technical blog. Please check out this page for a more detailed account of my journey learning Natural Language Processing; here I will only document the gist. I will present everything I have learned so far, including important concepts, necessary code snippets, and more. I am by no means an expert in this subject, but I have gone through enough study and training in the related fields and subfields to have a good grasp of what's important.

Areas that I have dabbled in:

→ General Linguistics
→ Symbolic Computational Linguistics
→ Statistical Natural Language Processing
→ State-of-the-Art Large Language Modeling

The Subject Matter

What do we mean by Natural Language Processing? A little googling and reading quickly shows that natural language processing covers a set of solutions to various tasks involving natural human language. The most common ones are listed below (a toy sketch of the first task follows the list):

→ Sentiment analysis
→ Machine translation
→ Word-sense disambiguation
→ Named-entity recognition
→ Topic modeling
→ Text classification
→ Document classification
→ Question answering
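
To make the first of these tasks concrete, here is a minimal sentiment-analysis sketch, assuming a bag-of-words representation and a naive Bayes classifier; the four training sentences and their labels are invented, and scikit-learn is just one convenient toolkit for the job.

  # Toy sentiment analysis: bag-of-words counts fed to a naive Bayes classifier.
  # The tiny training set below is invented purely for illustration.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  train_texts = [
      "I loved this movie, it was wonderful",
      "what a great and touching story",
      "terrible plot and awful acting",
      "I hated every minute of it",
  ]
  train_labels = ["pos", "pos", "neg", "neg"]

  # The vectorizer turns each text into word counts; the classifier estimates
  # P(label) and P(word | label) from those counts.
  model = make_pipeline(CountVectorizer(), MultinomialNB())
  model.fit(train_texts, train_labels)

  print(model.predict(["a wonderful and touching movie"]))  # expected: ['pos']
  print(model.predict(["awful acting, I hated it"]))        # expected: ['neg']

Swapping in more data, better features, or a pretrained model changes the accuracy, not the shape of the pipeline.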

A Little Bit Of History

The history of Computational Linguistics dates back to the 1940s and 1950s. In other words, the field behind ChatGPT, and behind the many forms of AI now woven into nearly every aspect of our lives, had its very first ancestral ideas not so long ago. It is still a fairly young field with endless possibilities left to explore.

Before diving in, first we have to ask ourselves what exactly is artificial intelligence (AI)?

According to the definition from John McCarthy's 2004 paper, as cited on IBM's website:

🤖️ "It is the science and engineering of making intelligent machines,
 especially intelligent computer programs. It is related to the similar
 task of using computers to understand human intelligence, but AI does
 not have to confine itself to methods that are biologically observable."

So if part of the task is to understand human intelligence, we need to know how we as humans gain knowledge and how intelligence, or the brain, really works, both physiologically and psychologically:

💡 Two Important Sources of Knowledge: Rationalism and Empiricism.

  The first holds that knowledge comes from reasoning and logic, while the
  second holds that it comes from experience and experimentation.

Below are some important notes on the historical timeline of Computational Linguistics and NLP, and on how the field started from one of these two principles and gradually transitioned to the other (from rationalism / computationalism to empiricism / connectionism, although computationalism is not always purely symbolic and also incorporates empirical evidence):

Noisy Channel Model

Shannon used metaphors like the noisy channel and decoding to explain the transmission of language, and he gave the first probabilistic estimate of the entropy of English.
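
In modern notation (symbols are my own, not Shannon's), the noisy-channel idea says: given an observed, possibly corrupted output o, recover the most probable intended source w; Bayes' rule splits the problem into a channel model and a source (language) model. The second line is the standard per-symbol entropy that Shannon estimated for English:

  \hat{w} = \arg\max_{w} P(w \mid o) = \arg\max_{w} P(o \mid w)\, P(w)

  H(X) = -\sum_{x} p(x) \log_2 p(x)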

Instrumental phonetics and the sound spectrograph laid the groundwork for speech recognition research.

Foundational Insights: 1940s - 1950s

Chomsky's work on finite-state machines and grammar paved the way for formal language theory and for transformational and context-free grammars.

Turing's model of algorithmic computation led to the McCulloch-Pitts neuron, a simplified model of the neuron that could be described in propositional logic, and then to Kleene's work on regular expressions and finite-state automata.

Probabilistic models like Markov processes helped automate and formalize natural language processing, providing a probabilistic framework for understanding language structure.
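
As a toy illustration of the Markov idea applied to language, here is a bigram model that estimates P(next word | current word) from raw counts; the miniature corpus and the absence of smoothing are my own simplifications, purely for illustration:

  # Minimal bigram (first-order Markov) model over words.
  # The toy corpus is invented; real models need far more data and smoothing.
  from collections import defaultdict

  corpus = "the cat sat on the mat . the dog sat on the rug .".split()

  bigram_counts = defaultdict(lambda: defaultdict(int))  # counts of prev -> curr
  context_counts = defaultdict(int)                      # counts of prev
  for prev, curr in zip(corpus, corpus[1:]):
      bigram_counts[prev][curr] += 1
      context_counts[prev] += 1

  def p_next(prev, curr):
      """Maximum-likelihood estimate of P(curr | prev)."""
      if context_counts[prev] == 0:
          return 0.0
      return bigram_counts[prev][curr] / context_counts[prev]

  print(p_next("the", "cat"))  # "the" occurs 4 times, once before "cat" -> 0.25
  print(p_next("sat", "on"))   # "sat" is always followed by "on" -> 1.0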

The Merging of Two Cultures

Language theory: involves the study of parsing algorithms, formal language theory, and generative syntax. The Transformations and Discourse Analysis Project (TDAP) was one of the earliest complete parsing systems.

Artificial intelligence (AI): the term was coined in 1956 by John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester. Early systems relied on keyword search, pattern matching, and basic reasoning. By the late 1960s, more formal logical systems had been developed.
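
To give a feel for what keyword search and pattern matching meant in practice, here is a toy dialogue sketch in that early spirit; the patterns and canned responses are my own invention, not taken from any historical system:

  # Toy keyword/pattern-matching "dialogue" in the spirit of 1960s systems.
  # Patterns and canned responses are invented for illustration only.
  import re

  rules = [
      (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
      (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
      (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
  ]

  def respond(utterance):
      for pattern, template in rules:
          match = pattern.search(utterance)
          if match:
              return template.format(*match.groups())
      return "Please tell me more."  # default when no keyword matches

  print(respond("I feel tired today"))     # -> "Why do you feel tired today?"
  print(respond("It rained because ..."))  # -> "Is that the real reason?"

Everything here is surface-level string matching; there is no model of meaning, which is precisely the limitation later statistical and neural approaches set out to address.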

Paradigms Develop

The Bayesian approach was used to address optical character recognition issues, with Bledsoe and Browning creating a text-recognition system using a large dictionary.

Mosteller and Wallace applied Bayesian methods to authorship attribution, most famously to the disputed Federalist Papers.
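
The shared core of these Bayesian approaches can be written in one line (notation mine): choose the class c (a character, an author) that maximizes the posterior given the observed document d. Under a feature-independence simplification of the kind later called naive Bayes, the likelihood factorizes over the individual words w_i:

  \hat{c} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(d \mid c)\, P(c)
          \approx \arg\max_{c} P(c) \prod_{i} P(w_i \mid c)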

The first online corpus, the Brown corpus of American English, was created in 1963-1964; it contained 1 million words drawn from 500 different texts. Researchers at IBM and Carnegie Mellon University developed speech recognition algorithms using techniques such as the Hidden Markov Model (HMM) and the analogy of a noisy channel and decoding.
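
As a toy illustration of the HMM-plus-decoding idea, here is a tiny Viterbi decoder; the hidden states, observations, and probabilities are invented (a weather-and-activity example), and real speech recognizers of course operate over acoustic features and far larger models:

  # Toy Hidden Markov Model decoded with the Viterbi algorithm.
  # States, observations, and probabilities are invented for illustration.
  states = ["Rainy", "Sunny"]
  start_p = {"Rainy": 0.6, "Sunny": 0.4}
  trans_p = {
      "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
      "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
  }
  emit_p = {
      "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
      "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
  }

  def viterbi(observations):
      """Return the most probable hidden state sequence for the observations."""
      # best[t][s] = (probability of best path ending in s at time t, previous state)
      best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
      for obs in observations[1:]:
          column = {}
          for s in states:
              prob, prev = max(
                  (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
              )
              column[s] = (prob, prev)
          best.append(column)
      # Backtrack from the best final state using the stored previous states.
      last = max(states, key=lambda s: best[-1][s][0])
      path = [last]
      for column in reversed(best[1:]):
          path.append(column[path[-1]][1])
      return list(reversed(path))

  print(viterbi(["walk", "shop", "clean"]))  # expected: ['Sunny', 'Rainy', 'Rainy']

The decoder keeps, for every time step and state, the probability of the best path ending there plus a back-pointer; that dynamic-programming trick is what made HMM decoding practical.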

Empiricism Redux

Kaplan and Kay's work on finite-state phonology and syntax led to the two-level morphology model, which became influential in computational morphology.

IBM's research in speech recognition, built on Hidden Markov Models (HMMs), and the Statistical Machine Translation (SMT) work that grew out of it, reintroduced empiricism to computational linguistics.

Connectionist approaches, based on neural networks, predated the neural language models we use today. Modern neural language models, like the transformers behind GPT or BERT, differ significantly in their complexity and in their capacity to learn from vast datasets.

The transition to probabilistic techniques was more a result of computational needs than a direct evolution.