Intro to NLP Part I: Resource & Summary
Today we are going to talk about NLP! I have listed some resources that are helpful for starting a learning journey in this field. I have also written a summary of some of the recent topics and models in NLP. This is the first part of the NLP series. See other posts here.
Resources
Courses
The first three are graduate-level NLP courses with a focus on deep learning applications in NLP and state-of-the-art models (e.g. attention, Transformers, BERT, multilingual models). The last two are more introductory courses.
- Stanford CS224N Natural Language Processing with Deep Learning, by Christopher Manning
- CMU CS11-737 Multilingual NLP (2020), by Graham Neubig
- CMU CS11-747 Neural Nets for NLP (2020), by Graham Neubig
- Stanford CS224U Natural Language Understanding
- Stanford CS124 From Languages to Information, by Dan Jurafsky
Textbooks
- Natural Language Processing by Jacob Eisenstein
- Introduction to Information Retrieval by Christopher Manning, Prabhakar Raghavan and Hinrich Schütze
- Speech and Language Processing by Dan Jurafsky and James H. Martin
- Neural Network Methods in Natural Language Processing by Yoav Goldberg
- A Primer on Neural Network Models for Natural Language Processing by Yoav Goldberg
I am currently working through CS224N and CS11-747 while reading Speech and Language Processing and A Primer on Neural Network Models for Natural Language Processing.
Summary
Chris Manning mentioned in CS224N that NLP can be roughly divided into more traditional NLP (up until approximately 2012) and the more “neural nets style” representations and models that took over from 2013 onward. Today we are going to focus on the latter: some of the recent models and breakthroughs in NLP.
Word Embedding
Word embedding is a dense representation of words in a vector space. The biggest jump in NLP when moving from sparse-input linear models to neural-network-based models is representing each feature as a dense vector rather than as its own unique dimension (one-hot encoding). One advantage of word embeddings over one-hot encoding is the ability to generalize: semantically similar words have similar vector representations and are therefore close to each other in the vector space. Additionally, embeddings can reveal hidden semantic relationships between words through their relative positions in that space. One notable example is that v(“king”)-v(“man”)+v(“woman”) should be close to v(“queen”).
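To make this concrete, here is a minimal sketch of that analogy using the gensim library and its downloadable pre-trained Google News word2vec vectors (my own choice of library and model, not something prescribed by the courses above):

```python
import gensim.downloader as api

# Download (large, ~1.6 GB on first run) and load pre-trained word2vec vectors
wv = api.load("word2vec-google-news-300")

# v("king") - v("man") + v("woman") should land near v("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```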
There are two main approaches to learning word embeddings: count-based and context-based. Count-based methods use word frequency counts and a co-occurrence matrix; the main idea is that words appearing in the same contexts are likely to have similar meanings. Context-based methods turn the problem into a predictive one: predict the center word given its context words (Continuous Bag-of-Words, CBOW) or predict whether a word appears in the context of a given target word (the Skip-gram model).
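To illustrate the two context-based variants, here is a small sketch that trains both on a toy corpus with gensim (assuming gensim >= 4.0; the corpus and hyperparameters are invented for illustration). The `sg` flag switches between CBOW and Skip-gram:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW: predict the center word from its surrounding context words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram: predict the context words from the center word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])
print(skipgram.wv["cat"][:5])
```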
Word2Vec and GloVe
Word2Vec (Mikolov et al., 2013) is a popular technique for learning word embeddings with a shallow two-layer neural net. It was introduced by a team of researchers at Google led by Tomas Mikolov. Global Vectors (GloVe) is another model that combines count-based matrix factorization with the local context-window approach of the skip-gram model. We will talk about the details of Word2Vec and GloVe in our next post.
Language Models
Language models are models that assign probabilities to sequences of words. They are widely used in applications such as speech recognition, machine translation, question answering, and natural language generation. There are many types of language models, and most fall into two categories: statistical language modeling and neural language modeling.
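Formally (standard notation, not tied to any particular course above), a language model factorizes the probability of a word sequence with the chain rule:

```latex
% Each word is conditioned on all of the words that precede it
P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
```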
Statistical language modeling
Statistical language modeling learns the probability of word occurrence from text. It mostly works at the word level: for example, a language model can predict the next word given the previous n words. It may look at the context of a short sequence of words, or it may work at the sentence or paragraph level.
Some common statistical language modeling types are:
- Unigram: A unigram model evaluates each word/term independently, without looking at the conditioning context.
- N-gram: An n-gram is simply a sequence of n words. For example, “my book” and “watch out” are 2-grams (bigrams), “this is a” is a 3-gram (trigram), and so on. An n-gram language model estimates the probability of the next word given the previous n-1 words; a minimal bigram example follows below.
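As a minimal sketch of the idea (the toy corpus and helper function below are invented for illustration), a bigram model can be estimated by simple counting:

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a large text collection
tokens = "the cat sat on the mat . the dog sat on the rug .".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # count("the cat") = 1, count("the") = 4 -> 0.25
```

In practice such counts are combined with smoothing (e.g. add-one or Kneser-Ney) so that unseen n-grams do not get zero probability.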
There are also more complex language models, such as RNN language models and Seq2Seq models, and, more recently, the pre-trained language models (a small BERT demo follows the list below):
- ELMo
- BERT
- T5
- GPT-3
- Attention mechanism (the key building block behind Transformer-based models such as BERT, T5, and GPT-3)
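As a quick taste of what a pre-trained model can do, here is a sketch that asks a BERT checkpoint to fill in a masked word, using the Hugging Face transformers library (my own choice of library and checkpoint, not something this post depends on):

```python
from transformers import pipeline

# Load a pre-trained masked-language-model pipeline (downloads the checkpoint on first use)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidate tokens for the [MASK] position
for prediction in fill_mask("Language models assign a [MASK] to a sequence of words."):
    print(prediction["token_str"], round(prediction["score"], 3))
```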
Up Next: We will talk about word embeddings and the state-of-the-art language models in more detail in our next post. Stay tuned!