Text embeddings are numerical representations of text that capture semantic meaning in a way that machines can understand and process. These embeddings have revolutionized natural language processing by enabling computers to work with text more meaningfully than traditional bag-of-words or one-hot encoding approaches.
In the following, you’ll explore how to generate high-quality text embeddings using transformer models from the Hugging Face Hub. In particular, you will learn:
- What text embeddings are
- How to generate text embeddings from a BERT model
- How to generate higher-quality embeddings
Let’s get started!
Text Embedding Generation with Transformers
Overview
This post is divided into three parts; they are:
- Understanding Text Embeddings
- Other Techniques to Generate Embeddings
- How to Get a High-Quality Text Embedding?
Understanding Text Embeddings
Text embeddings represent text as numerical vectors. A trivial way to do this is to take all words in a dictionary and assign a unique number to each word. Then, you can represent each word as a one-hot vector, or a sentence as a bag-of-words vector: the number in each position indicates how many times the corresponding word appears in the sentence.
A dictionary has thousands of words, so a one-hot vector is large and sparse. A dense vector, in which each element is a floating-point number instead of a boolean, is far more compact. However, what value should each element take? That is not easy to decide by hand, but it can be learned. Examples include Word2Vec, GloVe, and FastText. An interesting property of dense word vectors is that they place semantically similar words close together in the vector space. You can use the vectors to measure the semantic similarity between words and perform word arithmetic, such as "king − man + woman = queen".
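As a quick illustration of that word arithmetic, here is a minimal sketch using the gensim library and its downloadable GloVe vectors. Gensim is not used elsewhere in this post, so treat it as an optional aside and install it separately (pip install gensim):

```python
# Optional sketch: word-vector arithmetic with pre-trained GloVe vectors via gensim
import gensim.downloader as api

# Download and load 100-dimensional GloVe vectors (Wikipedia + Gigaword)
vectors = api.load("glove-wiki-gigaword-100")

# "king" - "man" + "woman" should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```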
One step further, you want to represent a sentence as a vector. This is harder than simply adding up the word vectors of the words in the sentence, because the meaning of a word depends on its context. For example, "bear" can be a verb or a noun; a static word vector cannot distinguish between the two senses, yet the distinction matters for the meaning of the sentence. Representing the semantic meaning of a whole sentence as a vector is very useful for many NLP tasks.
Transformer models can generate such contextual embeddings by processing the entire sequence of words at once. The representation of a word in the embedding depends on its context within the text. This allows for much richer representations that can capture nuances like polysemy (words with multiple meanings).
Training a transformer model to generate embeddings is computationally expensive and difficult because it requires a high-quality dataset and a complex training process. Fortunately, we can use pre-trained models to generate embeddings if we merely want to create a vector to represent a text’s semantic meaning.
Let’s see how you can generate embeddings for sentences using a pre-trained BERT model, which is known to create high-quality contextual embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Define some example sentences
sentences = [
    "The cat sat on the mat.",
    "The dog slept on the floor.",
    "I love natural language processing."
]

def get_embeddings(sentences, model, tokenizer):
    """Function to get embeddings for a batch of sentences"""
    # Tokenize input and get model output
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Use the CLS token embedding as the sentence embedding
    sentence_embeddings = model_output.last_hidden_state[:, 0, :]

    # Convert torch tensor to numpy array for easier handling
    return sentence_embeddings.numpy()

# Get embeddings for our example sentences
embeddings = get_embeddings(sentences, model, tokenizer)
print(f"Embedding shape: {embeddings.shape}")
print(f"First 5 dimensions of the sentences' embeddings:\n{np.round(embeddings[:, :5], 3)}")
```
In this example, you use a pre-trained BERT model to generate embeddings for three example sentences. You need both the tokenizer and the model from BERT: the tokenizer splits each sentence into sub-word tokens, and the model generates the contextual embeddings. Both are created using the "auto classes" from the transformers library; you only need to specify the pre-trained model name, bert-base-uncased.
The base BERT model has 12 layers and a hidden dimension of 768. It is uncased, meaning the input text is treated as case-insensitive. Because the hidden dimension is 768, the generated embedding for each sentence is a vector of 768 dimensions.
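If you want to confirm these numbers yourself, you can inspect the loaded model's configuration. A quick check, using the model object created above:

```python
# Architecture details reported by the pre-trained model's configuration
print(model.config.num_hidden_layers)  # 12 layers
print(model.config.hidden_size)        # 768 hidden dimensions
```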
The get_embeddings() function takes a list of sentences, a model, and a tokenizer, and returns an embedding for each sentence. The way it works is straightforward, but note that the sentence embedding is extracted from the first token of the model output:
```python
...
sentence_embeddings = model_output.last_hidden_state[:, 0, :]
```
The first token is the [CLS] token, a special token the tokenizer adds to the beginning of each sequence. The model is trained to use it to represent the whole input, so you can think of it as a summary of the entire sentence. In the tokenizer call, you set truncation=True to avoid sending a sequence that is too long for the model, and return_tensors="pt" to get PyTorch tensors, which is what the model expects.
Finally, at the end of the function, you convert the embeddings to a NumPy array so they are detached from PyTorch and easier to handle. The output of the above code is:
```
Embedding shape: (3, 768)
First 5 dimensions of the sentences' embeddings:
[[-0.364 -0.053 -0.367 -0.03  -0.461]
 [-0.276 -0.043 -0.613  0.175 -0.309]
 [-0.042  0.043 -0.253 -0.35  -0.374]]
```
You can verify that the length of each context embedding vector is 768.
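You can also see that these embeddings are contextual by comparing the output vector of the same word used in different sentences. Below is a minimal sketch, reusing the model and tokenizer loaded above, that extracts the vector of "bear" used as a verb and as a noun; the two vectors differ because the surrounding context differs:

```python
# Compare contextual embeddings of the same word in two different sentences
import torch

examples = ["I cannot bear the noise.", "A bear walked into the camp."]
vectors = []
with torch.no_grad():
    for sentence in examples:
        encoded = tokenizer(sentence, return_tensors="pt")
        output = model(**encoded).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
        vectors.append(output[tokens.index("bear")])

# The two vectors are not identical because the contexts differ
cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"Similarity of 'bear' (verb) vs 'bear' (noun): {cos.item():.3f}")
```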
Other Techniques to Generate Embeddings
While using the [CLS] token embedding is a common approach, it is not the only one.
Mean Pooling
Recall that BERT is a transformer model: it takes a sequence of tokens as input and produces a sequence of vectors as output. If you can use the [CLS] prefix token for the embedding, you can also take the average of all the output tokens. This technique is called mean pooling, and it may provide a better representation of the sentence.
Let’s see how you can modify the previous code to use mean pooling:
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Define some example sentences
sentences = [
    "The cat sat on the mat.",
    "The dog slept on the floor.",
    "I love natural language processing."
]

def get_embeddings(sentences, model, tokenizer):
    """Function to get embeddings for a batch of sentences with mean pooling"""
    # Tokenize input and get model output
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Extract the attention mask and output sequence
    attention_mask = encoded_input["attention_mask"]
    output_seq = model_output.last_hidden_state

    # Mean pooling: take the average of all token embeddings
    mask = attention_mask.unsqueeze(-1).expand(output_seq.size()).float()
    sum_embeddings = (output_seq * mask).sum(1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    mean_pooled = sum_embeddings / sum_mask

    # Convert torch tensor to numpy array for easier handling
    return mean_pooled.numpy()

# Get embeddings with mean pooling
embeddings = get_embeddings(sentences, model, tokenizer)
print(f"Embedding shape: {embeddings.shape}")
print(f"First 5 dimensions of the sentences' embeddings:\n{np.round(embeddings[:, :5], 3)}")
```
The key difference is in the get_embeddings() function. First, you make use of the attention mask from the tokenizer output. It is a binary tensor that indicates which tokens are real tokens (1) and which are padding tokens (0). Its shape is (batch size, sequence length), but the model output has the shape (batch size, sequence length, hidden dimension). Therefore, you use unsqueeze(-1) to add an extra dimension at the end of the attention mask and expand it to match the shape of the model output.
Then, the sum of all embedding vectors is computed by multiplying the model output sequence by the expanded attention mask, so positions where the mask is 0 contribute nothing to the sum. The sum is taken along the second dimension (i.e., axis=1), which corresponds to the sequence length.
The average is then computed by dividing the sum by the sum of the mask. Since each mask value is either 1 or 0, the sum of the mask is the number of non-padding tokens in the sequence. To avoid division by zero, you use torch.clamp() to ensure the denominator is at least 1e-9. If the broadcasting is hard to visualize, see the small standalone sketch after the output below.
The output of the above code is:
```
Embedding shape: (3, 768)
First 5 dimensions of the sentences' embeddings:
[[-0.182 -0.266 -0.219  0.211  0.285]
 [-0.056 -0.208 -0.281  0.223  0.417]
 [ 0.428  0.355 -0.182 -0.048  0.142]]
```
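As promised, here is a tiny standalone sketch of the masking arithmetic with toy shapes. It is not part of the original example; it just walks through the same steps with numbers small enough to print:

```python
# Toy example of the mean-pooling arithmetic with small, readable shapes
import torch

output_seq = torch.arange(24, dtype=torch.float32).reshape(1, 4, 6)  # (batch=1, seq_len=4, hidden=6)
attention_mask = torch.tensor([[1, 1, 1, 0]])                        # last token is padding

mask = attention_mask.unsqueeze(-1).expand(output_seq.size()).float()  # now (1, 4, 6)
sum_embeddings = (output_seq * mask).sum(1)     # padded position contributes zeros
sum_mask = torch.clamp(mask.sum(1), min=1e-9)   # 3 real tokens in each hidden position
print(sum_embeddings / sum_mask)                # average over the first three token vectors only
```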
Mean pooling is often believed to provide better sentence embeddings than the [CLS] token alone, especially for tasks like semantic similarity and information retrieval.
Using Sentence Transformers
BERT is a general-purpose model, and the one used in the previous examples is a base model that is meant to be paired with a task-specific "head". The [CLS] token, for example, was proposed in the original paper for a classification task. Therefore, it may not be the best choice for generating sentence embeddings. You may find that the embedding vectors do not exhibit the properties you expect; for example, the cosine similarity between sentences may not reflect their semantic similarity.
Indeed, nothing prevents you from fine-tuning BERT or any other transformer model to produce better sentence embeddings. But if you do not want to go through that hassle, you can use the Sentence Transformers library, which provides models specifically fine-tuned for generating high-quality sentence embeddings and hosts its pre-trained models on the Hugging Face Hub.
Sentence Transformers is a separate Python library. You can install it with:
```
pip install sentence-transformers
```
Let’s see how to use them:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Define some example sentences
sentences = [
    "The cat sat on the mat.",
    "The dog slept on the floor.",
    "I love natural language processing."
]

# Load a pre-trained model and generate embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Print the shape and a preview of the embeddings
print(f"Embedding shape: {embeddings.shape}")
print(f"First 5 dimensions of the sentences' embeddings:\n{np.round(embeddings[:, :5], 3)}")

# Calculate cosine similarity between the first two sentences
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Cosine similarity between '{sentences[0]}' and '{sentences[1]}': {np.round(similarity[0][0], 3)}")
```
The code is shorter because the model from the Sentence Transformers library handles tokenization and embedding generation in one step. Note that a Sentence Transformers model differs from a model instantiated with the transformers library. You must ensure the model name is supported by the Sentence Transformers library, or pick one of the "original" pre-trained models listed in the library's documentation.
The model used in the example is all-MiniLM-L6-v2. It is small, so it runs faster and requires less memory, and it outputs a 384-dimensional embedding. To see why a specialized sentence embedding model is better, you can compare the cosine similarity between the first two sentences:
$$
\cos(\theta_{\mathbf{a}, \mathbf{b}}) = \frac{ \mathbf{a} \cdot \mathbf{b} }{ \vert \mathbf{a} \vert \vert \mathbf{b} \vert }
$$
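If you want to compute this directly, the formula translates to a couple of lines of NumPy. A minimal sketch, using the embeddings produced above:

```python
# Cosine similarity computed directly from the formula
import numpy as np

a, b = embeddings[0], embeddings[1]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos_sim), 3))
```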
Scikit-learn implements cosine similarity as the function cosine_similarity(), which accepts two matrices as input and computes the cosine similarity between every pair of rows across the two matrices. Hence, if you have two single vectors, you must wrap each in a list.
The output of the above code is:
```
Embedding shape: (3, 384)
First 5 dimensions of the sentences' embeddings:
[[ 0.13  -0.016 -0.037  0.058 -0.06 ]
 [ 0.01  -0.01  -0.039  0.14  -0.006]
 [ 0.039 -0.078  0.055  0.     0.036]]
Cosine similarity between 'The cat sat on the mat.' and 'The dog slept on the floor.': 0.408
```
If you want to compare the similarity between all pairs of sentences, you can pass the full embedding matrix as both arguments:
```python
...
print(cosine_similarity(embeddings, embeddings).round(3))
```
This will give you a symmetric 3×3 matrix with all diagonal elements being 1. The off-diagonal elements are the cosine similarity between the sentences.
If you compare the embedding results from the examples above, you will find that the sentence transformer model distinguishes much more clearly between the pair of the first two sentences (0.408) and the pair of the last two (-0.028). In contrast, the first example (using only the [CLS] token) does not provide a good distinction (0.941 vs 0.792). Hence, you can see that the output from the sentence transformer model is of higher quality.
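To reproduce the [CLS]-token numbers cited above, you can recompute the first example's embeddings and pass them to cosine_similarity(). A minimal sketch, reusing the sentences list defined earlier (variable names prefixed with bert_ to avoid clashing with the SentenceTransformer objects):

```python
# Pairwise cosine similarities for the [CLS]-token embeddings from the first example
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import torch

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

encoded = bert_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    cls_embeddings = bert_model(**encoded).last_hidden_state[:, 0, :].numpy()

# Off-diagonal entries are the pairwise similarities to compare against
print(cosine_similarity(cls_embeddings, cls_embeddings).round(3))
```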
How to Get a High-Quality Text Embedding?
Sentence embeddings are generated by deep learning models, most commonly transformer models. The quality of the embedding depends heavily on the quality of the model and its training data.
Larger models, such as BERT and RoBERTa, are generally better than smaller ones, such as DistilBERT, trading speed and memory usage for quality. A model trained or fine-tuned for a specific task or domain will also likely provide better embeddings than a general-purpose model when used in that domain. For example, a model trained on a corpus of medical text will likely embed medical text better than a general-purpose model.
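For example, to trade a little quality for speed, you could swap the checkpoint in the earlier examples for a smaller one such as distilbert-base-uncased. A minimal sketch; the rest of the pipeline stays unchanged:

```python
# Load a smaller pre-trained model; the earlier get_embeddings() function works as-is
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
```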
Also, note that the tokenizer plays an important role in embedding quality. Transformer models operate on sequences of tokens, so a tokenizer that splits a sentence into sub-words that retain semantic meaning helps the model generate better embeddings. At one extreme, a tokenizer could emit every single character as a token, but then much of the word-level information is harder for the model to recover from the sequence. A tokenizer with a larger vocabulary, whose tokens are more likely to be meaningful words, helps the model understand the context, but it also makes the model larger.
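You can inspect how a sub-word tokenizer splits a sentence with its tokenize() method. A quick sketch using the BERT tokenizer:

```python
# Inspect how a sub-word tokenizer splits a sentence into tokens
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization handles uncommon words gracefully."))
# prints sub-word pieces such as 'token', '##ization'; exact splits depend on the vocabulary
```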
Summary
In this post, you have seen how text embeddings allow you to compare text by semantic meaning. Good text embeddings help a computer understand text and perform NLP tasks. Specifically, you have learned:
- The different kinds of text embeddings
- How sentence embeddings can capture the semantic meaning into a context vector
- Various techniques to generate text embeddings from the BERT model
- Using the Sentence Transformers library to generate high-quality sentence embeddings