Natural language processing (NLP) has long been a fundamental area in computer science. However, its trajectory changed dramatically with the introduction of word embeddings. Before embeddings, NLP relied primarily on rule-based and statistical approaches that treated words as discrete, unrelated tokens. With word embeddings, computers gained the ability to capture word meaning through vector space representations. In this article, you will learn about:
- How word embeddings convert words into dense vectors
- How to use pretrained word embeddings
- How to train your own word embeddings
- How word embeddings are used in modern language models
Let’s get started!
Word Embeddings in Language Models
Photo by Satoshi Hirayama. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Understanding Word Embeddings
- Using Pretrained Word Embeddings
- Training Word2Vec with Gensim
- Training Word2Vec with PyTorch
- Embeddings in Transformer Models
Understanding Word Embeddings
Word embeddings represent words as dense vectors in a continuous space, where semantically similar words are positioned close to each other. The core principle is that words appearing in similar contexts should have similar vector representations. This concept gained prominence through models like Word2Vec, GloVe, FastText, and ELMo.
Word embedding models are typically trained using unsupervised learning because the ideal vector representation for each word is unknown (otherwise, we could use it directly). The objective is to learn word co-occurrence patterns from the training corpus.
Word2Vec, introduced by the paper “Efficient Estimation of Word Representations in Vector Space”, pioneered this approach. It uses a neural network to predict words based on local context and comes in two variants:
- Continuous Bag of Words (CBOW): Predicts the target word given its context
- Skip-gram: Predicts the context words given the target word
Skip-gram generally performs better for smaller datasets and rare words, while CBOW is faster and more effective for larger datasets. Word2Vec demonstrated that computers could understand semantic relationships between words by showing that embedding vectors could satisfy equations like “king – man + woman ≈ queen”.
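To make that arithmetic concrete, here is a minimal sketch that performs it directly on raw vectors. It assumes you have downloaded the glove.6B.50d.txt file described in the next section, and it searches the vocabulary by cosine similarity (a simplified, slower version of what libraries such as gensim do for you):

import numpy as np

# Load GloVe vectors from the text file (one word plus its numbers per line)
vec = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        vec[word] = np.array(values, dtype=np.float32)

def cosine(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman" should land near "queen"
target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vec[w], target))
print(best)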
GloVe (Global Vectors for Word Representation) takes a different approach. Rather than using a neural network, it constructs and factorizes a word co-occurrence matrix to obtain embeddings. GloVe combines the strengths of:
- Global matrix factorization methods (like latent semantic analysis)
- Local context window methods (like Word2Vec)
The resulting embeddings capture both semantic and syntactic relationships between words and often outperform Word2Vec on tasks requiring broader semantic understanding.
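As an illustration of the first ingredient, here is a minimal sketch of counting word co-occurrences with a symmetric context window on a toy corpus. The real GloVe algorithm goes further: it weights these counts and fits word vectors so that their dot products approximate the log counts, but the counting step below is where it starts:

from collections import Counter

# Toy corpus and a symmetric context window of 2 words on each side
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
]
window = 2

cooccur = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooccur[(word, tokens[j])] += 1

print(cooccur[("quick", "brown")])  # 2: "brown" appears next to "quick" in both sentences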
FastText improved upon Word2Vec by representing each word as a bag of character n-grams and learning a vector for each n-gram; a word’s vector is the sum of its n-gram vectors. This approach captures subword information, lets the model build vectors for out-of-vocabulary words, and provides better performance for morphologically rich languages.
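For example, here is a minimal sketch of decomposing a word into FastText-style character n-grams, with the boundary markers < and > that FastText adds around each word:

def char_ngrams(word, n_min=3, n_max=4):
    # Wrap the word in boundary markers and collect all n-grams of each length
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']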
ELMo, a more recent model, uses a deep bi-directional LSTM to generate context-dependent word vectors. Unlike previous models, ELMo’s word vectors are not fixed but vary based on context. While less commonly used today after the emergence of large language models, ELMo’s core idea that word meaning should depend on context forms the foundation of all modern language models.
Using Pretrained Word Embeddings
You can easily use pretrained word embeddings from popular libraries. Here’s an example using the gensim library with GloVe embeddings:
from gensim.models import KeyedVectors

# Load pretrained GloVe embeddings
model = KeyedVectors.load_word2vec_format("glove.6B.50d.txt", binary=False, no_header=True)

# Find similar words
similar_words = model.most_similar("king")
print(similar_words)
print()

# Word analogies
result = model.most_similar(positive=["king", "woman"], negative=["man"])
print(result)
To run this code, you need to download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ and extract the file glove.6B.50d.txt from the zip file glove.6B.zip. This file contains the trained vectors for 400,000 words from a training corpus of 6 billion words.
When you run this code, you will see the following output:
[('prince', 0.8236179351806641), ('queen', 0.7839043140411377), ('ii', 0.7746230363845825), ('emperor', 0.7736247777938843), ('son', 0.766719400882721), ('uncle', 0.7627150416374207), ('kingdom', 0.7542160749435425), ('throne', 0.7539913654327393), ('brother', 0.7492411136627197), ('ruler', 0.7434253692626953)]
[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]
The first output shows that “king” is most similar to “prince” under this embedding model. And the second output shows that “queen” is the closest word to “king + woman – man”.
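Besides most_similar, the loaded KeyedVectors object also lets you query the cosine similarity between a specific pair of words. A short example, using the same model loaded above:

# Cosine similarity between specific word pairs
print(model.similarity("king", "queen"))    # relatively high
print(model.similarity("king", "banana"))   # much lower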
Training Word2Vec with Gensim
Gensim provides a simple interface to train your own Word2Vec model. Here’s how:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Prepare your text data
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
    # ... more sentences
]

# Preprocess the sentences
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

# Train the model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,   # dimension of the word vectors
    window=5,          # context window size
    min_count=1,       # ignore words with frequency < min_count
    workers=4,         # number of CPU cores to use
    sg=0,              # 0 for CBOW, 1 for Skip-gram
)

# Save the model
model.save("word2vec.model")

# Use the model
model = Word2Vec.load("word2vec.model")
vector = model.wv["quick"]                      # get the vector for a word
similar_words = model.wv.most_similar("quick")
print(similar_words)
Running this code will not give you a good model. For a useful embedding, you need a large corpus to train on. Rather than expanding the Python list sentences by hand, you would rewrite the code to read sentences from files on disk. Assuming you do that, Gensim will train a Word2Vec model and save it to the file word2vec.model. Once it is trained, you can load it back and use it to get the vector for a word, as shown in the code above.
Training Word2Vec with PyTorch
You can also implement Word2Vec from scratch using PyTorch. Here’s a basic implementation:
import torch
import torch.nn as nn
import torch.optim as optim

class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        out = self.linear(embeds)
        return out

# Prepare your text data
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over the lazy fox",
    # ... more sentences
]

# Create a dataset for training
skipgram_size = 2
dataset = []
vocab = set()
for sentence in sentences:
    tokens = sentence.split()
    vocab.update(tokens)
    for i in range(len(tokens)):
        # clamp the left boundary at 0 to avoid an empty slice from a negative index
        context = tokens[max(0, i - skipgram_size):i] + tokens[i + 1:i + skipgram_size + 1]
        target = tokens[i]
        dataset.append((context, target))

vocab_to_idx = {word: idx for idx, word in enumerate(sorted(vocab))}
vocab_size = len(vocab)

# Training setup
embedding_dim = 50
model = Word2VecModel(vocab_size, embedding_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    for context, target in dataset:
        context_idx = [vocab_to_idx[word] for word in context]
        target_idx = [vocab_to_idx[target]] * len(context)
        optimizer.zero_grad()
        output = model(torch.tensor(target_idx))
        loss = criterion(output, torch.tensor(context_idx))
        loss.backward()
        optimizer.step()

# Save the model
torch.save(model.state_dict(), "word2vec.pt")
This code trains a skip-gram Word2Vec model. In this model, the training data is a window of words from the text corpus. You should do some preprocessing to keep the vocabulary clean, for example removing punctuation and converting all words to lowercase. Pay attention to how the variables context and target are used: for a window such as “the quick brown fox jumps” in the example above, the model is fed the center word and asked to predict each of the other words in the same window. The loss function for training is the cross-entropy loss.
This example probably will not give you a good model because you need a larger corpus and more epochs to train. However, notice that the model has an embedding layer and a linear layer. The embedding layer, created with nn.Embedding, holds the word embedding matrix you are interested in.
Also notice that the embedding layer is just a numerical matrix. You need a lookup table, such as vocab_to_idx in the code above, to convert a word to an index, and then use the index to retrieve the embedding vector. The lookup table should be saved together with the model, because the embeddings are unusable if you cannot convert words to the correct indices.
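For example, here is a minimal sketch of looking up a trained vector and saving the lookup table alongside the weights; it reuses model and vocab_to_idx from the code above:

import torch

# Pull one row of the embedding matrix: the vector for "quick"
idx = vocab_to_idx["quick"]
with torch.no_grad():
    vector = model.embeddings.weight[idx]
print(vector.shape)   # torch.Size([50]), i.e. embedding_dim

# Save the lookup table together with the weights so the indices stay valid
torch.save(
    {"state_dict": model.state_dict(), "vocab_to_idx": vocab_to_idx},
    "word2vec_with_vocab.pt",
)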
Embeddings in Transformer Models
From the above example, you learned that word embeddings can be trained, and that you can create an nn.Embedding layer for that purpose. In fact, most modern language models use this method. Let’s look at the BERT model as an example.
from transformers import BertModel, BertConfig

config = BertConfig()
model = BertModel(config=config)
print(model)
print(model.embeddings.word_embeddings.state_dict())
When you run this, you will see:
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
OrderedDict({'weight': tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0373, -0.0254, -0.0057,  ...,  0.0262, -0.0122,  0.0050],
        [-0.0222,  0.0076,  0.0077,  ...,  0.0085,  0.0052,  0.0209],
        ...,
        [-0.0253, -0.0047,  0.0141,  ..., -0.0262, -0.0303, -0.0488],
        [-0.0029, -0.0301, -0.0286,  ..., -0.0130, -0.0312, -0.0125],
        [ 0.0507, -0.0257, -0.0376,  ...,  0.0087, -0.0076,  0.0027]])})
The BERT model is sophisticated and has many components. The word embedding layer is named word_embeddings. Once you have created the model, you can refer to it with model.embeddings.word_embeddings. From its parameters, you can see that the vocabulary has 30,522 tokens and each embedding vector has dimension 768. The second print statement dumps the embedding matrix, which you should expect to have shape (30522, 768).
In a previous post, you learned that language models need a tokenizer to split the input text into tokens. The tokenizer also assigns a token ID to each token, and this token ID is the row index into the embedding matrix. When you feed an input text to the model, you actually feed a sequence of token IDs. The embedding layer is usually the first layer of the model: it converts the sequence of token IDs into a sequence of embedding vectors by replacing each token ID with the corresponding row of the embedding matrix.
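As a minimal sketch of that pipeline, the code below tokenizes a sentence and looks up the corresponding rows of the embedding matrix. It assumes you download the pretrained bert-base-uncased checkpoint rather than using the randomly initialized model created above:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer turns text into a sequence of token IDs (row indices),
# including the special [CLS] and [SEP] tokens
inputs = tokenizer("Hello, world!", return_tensors="pt")
token_ids = inputs["input_ids"]
print(token_ids)

# Look up the embedding vector for each token ID
with torch.no_grad():
    vectors = model.embeddings.word_embeddings(token_ids)
print(vectors.shape)   # (1, sequence length, 768)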
Further Readings
Below are some further readings on the topic:
Summary
In this article, you learned about word embeddings and their applications. In particular, you learned that:
- Word embeddings represent words as dense vectors in a continuous space, with semantically similar words positioned nearby
- Pretrained word embeddings are readily available through popular libraries
- You can train custom word embeddings using either Gensim or PyTorch
- Modern transformer models utilize learned embeddings through the nn.Embedding layer
- Embeddings are essential for capturing semantic relationships between words