Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. By combining the strengths of retrieval systems with generative models, RAG systems can produce more accurate, factual, and contextually relevant responses. This approach is particularly valuable when dealing with domain-specific knowledge or when up-to-date information is required.
In this post, you will explore how to build a basic RAG system using models from the Hugging Face library. You’ll build each system component, from document indexing to retrieval and generation, and implement a complete end-to-end solution. Specifically, you will learn:
- The RAG architecture and its components
- How to build a document indexing and retrieval system
- How to implement a transformer-based generator
Let’s get started!
Building RAG Systems with Transformers
Photo by Tina Nord. Some rights reserved.
Overview
This post is divided into five parts:
- Understanding the RAG architecture
- Building the Document Indexing System
- Implementing the Retrieval System
- Implementing the Generator
- Building the Complete RAG System
Understanding the RAG Architecture
A RAG system consists of two main components:
- Retriever: Responsible for finding relevant documents or passages from a knowledge base given a query.
- Generator: Uses the retrieved documents and the original query to generate a coherent and informative response.
Each of these components has many fine details. You need RAG because the generator alone (i.e., the language model) tends to produce plausible-sounding but inaccurate or fabricated statements, known as hallucinations. The retriever therefore supplies relevant context to keep the generator grounded.
This approach combines generative models’ broad language understanding capabilities with the ability to access specific information from a knowledge base. This results in responses that are both fluent and factually accurate.
Let’s implement each component of a RAG system step by step.
Building the Document Indexing System
The first step in creating a RAG system is to build a document indexing system. This system must encode documents into dense vector representations and store them in a database. Then, you can retrieve documents based on contextual similarity, which means you need to be able to search by vector similarity metrics rather than exact matches. This is a key point: not all database systems can be used to build a document indexing system.
Of course, you could collect documents, encode them into vector representations, and keep them in memory. When retrieval is requested, you could compute the similarity one by one to find the closest match. However, checking each vector in a loop is inefficient and not scalable. FAISS is a library that is optimized for this task. To install FAISS, you can compile it from source or use the pre-compiled version from PyPI:
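For example, the pre-built CPU-only package can typically be installed with pip (there is also a faiss-gpu package for systems with a CUDA-capable GPU):

pip install faiss-cpu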
In the following, you’ll create a language model to encode documents into dense vector representations and store them in a FAISS index for efficient retrieval:
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def generate_embedding(docs, model, tokenizer):
    # Tokenize each text and convert to PyTorch tensors
    inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Embedding defined as mean pooling of all tokens
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state

    expanded_mask = attention_mask.unsqueeze(-1).expand(embeddings.shape).float()
    sum_embeddings = torch.sum(embeddings * expanded_mask, axis=1)
    sum_mask = torch.clamp(expanded_mask.sum(axis=1), min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    # Convert to numpy array
    return mean_embeddings.cpu().numpy()

# Sample document collection
documents = [
    "Transformers are a type of deep learning model introduced in the paper 'Attention "
    "Is All You Need'.",
    "BERT (Bidirectional Encoder Representations from Transformers) is a "
    "transformer-based model designed to understand the context of a word based on "
    "its surroundings.",
    "GPT (Generative Pre-trained Transformer) is a transformer-based model designed for "
    "natural language generation tasks.",
    "T5 (Text-to-Text Transfer Transformer) treats every NLP problem as a text-to-text "
    "problem, where both the input and output are text strings.",
    "RoBERTa is an optimized version of BERT with improved training methodology and more "
    "training data.",
    "DistilBERT is a smaller, faster version of BERT that retains 97% of its language "
    "understanding capabilities.",
    "ALBERT reduces the parameters of BERT by sharing parameters across layers and using "
    "embedding factorization.",
    "XLNet is a generalized autoregressive pretraining method that overcomes the "
    "limitations of BERT by using permutation language modeling.",
    "ELECTRA uses a generator-discriminator architecture for more efficient pretraining.",
    "DeBERTa enhances BERT with disentangled attention and an enhanced mask decoder."
]

# Generate embeddings for all documents, then create FAISS index for efficient similarity search
document_embeddings = generate_embedding(documents, model, tokenizer)
dimension = document_embeddings.shape[1]   # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)       # Using L2 (Euclidean) distance
index.add(document_embeddings)             # Add embeddings to the index
print(f"Created index with {index.ntotal} documents")
The key part of this code is the generate_embedding() function. It takes a list of documents, encodes them through the model, and returns a dense vector representation of each document using mean pooling over all token embeddings. A document does not need to be long or complete; a sentence or a paragraph is expected, because the model has a limited context window. Moreover, as you will see later, a very long document is not ideal for RAG.
You used a pre-trained Sentence Transformer model, sentence-transformers/all-MiniLM-L6-v2, which is specifically designed for generating sentence embeddings. You do not keep the original documents in the FAISS index; you only keep the embedding vectors, and you pre-build an L2-distance index over these vectors for efficient similarity search.
You may modify this code for different implementations of the RAG system. For example, the dense vector representation here is obtained by mean pooling, but you could instead use only the first token, since the tokenizer prepends the [CLS] token to each sentence and the model is supposed to produce a context embedding over this special token. Moreover, L2 distance is used here because you declared the FAISS index with the L2 metric. FAISS does not offer a cosine similarity metric directly, but L2 distance and cosine similarity are closely related. Note that, with normalized vectors,
$$
\begin{aligned}
\Vert \mathbf{x} - \mathbf{y} \Vert_2^2
&= (\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y}) \\
&= \mathbf{x}^\top \mathbf{x} - 2 \mathbf{x}^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y} \\
&= 2 - 2 \mathbf{x}^\top \mathbf{y} \\
&= 2 - 2 \cos \theta
\end{aligned}
$$
Therefore, squared L2 distance is equivalent to cosine distance when the vectors are normalized: as dissimilarity increases, the squared L2 distance grows from 0 to 4 while the cosine similarity falls from +1 to -1, so both produce the same ranking. If you intended to use cosine similarity, you should modify the code as follows:
...
import numpy as np  # needed for the vector normalization below

document_embeddings = generate_embedding(documents, model, tokenizer)
normalized = document_embeddings / np.linalg.norm(document_embeddings, axis=1, keepdims=True)
index.add(normalized)
Essentially, you scaled each embedding vector to unit length. Remember to normalize the query embedding in the same way before searching the index, so that the distances remain comparable.
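If you want to try the first-token alternative mentioned above, a minimal sketch of a variant embedding function might look like this (a hypothetical helper reusing the same model and tokenizer, not part of the original code):

def generate_cls_embedding(docs, model, tokenizer):
    # Tokenize and run the encoder exactly as in generate_embedding()
    inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the embedding of the first ([CLS]) token of each document
    cls_embeddings = outputs.last_hidden_state[:, 0, :]
    return cls_embeddings.cpu().numpy()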
Implementing the Retrieval System
With the documents indexed, let’s see how you can retrieve some of the most relevant documents for a given query:
...
def retrieve_documents(query, index, documents, k=3):
    # Generate embedding for the query
    query_embedding = generate_embedding(query, model, tokenizer)  # 1xD matrix
    # Search the index for similar documents
    distances, indices = index.search(query_embedding, k)  # 1xk matrices
    # Return the retrieved documents and their distances
    retrieved_docs = [(documents[idx], float(distances[0][i]))
                      for i, idx in enumerate(indices[0])]
    return retrieved_docs

# Example query
query = "What is BERT?"
retrieved_docs = retrieve_documents(query, index, documents)

# Print the retrieved documents
print(f"Query: {query}\n")
for i, (doc, distance) in enumerate(retrieved_docs):
    print(f"Document {i+1} (Distance: {distance:.4f}):")
    print(doc)
    print()
If you run this code, you will see the following output:
Query: What is BERT?
Document 1 (Distance: 23.7060):
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed to understand the context of a word based on its surroundings.

Document 2 (Distance: 28.0794):
RoBERTa is an optimized version of BERT with improved training methodology and more training data.

Document 3 (Distance: 29.5908):
DistilBERT is a smaller, faster version of BERT that retains 97% of its language understanding capabilities.
In the retrieve_documents() function, you provide the query string, the FAISS index, and the document collection. You then generate the embedding for the query just like you did for the documents. Then, you leverage the search() method of the FAISS index to find the k most similar documents to the query embedding. The search() method returns two arrays:

- distances: The distances between the query embedding and the indexed embeddings. Since this is how you defined the index, these are the L2 distances.
- indices: The indices of the indexed embeddings that are most similar to the query embedding, matching the distances array.
You can use these arrays to retrieve the most similar documents from the original collection. Here, you use the indices to look up the documents in the list. Afterward, you print the retrieved documents along with their distances from the query in the embedding space, in increasing order of distance, i.e., decreasing relevance.
Note that the document’s context vector is supposed to represent the entire document. Therefore, the distance between the query and the document may be large if the document contains a lot of information. Ideally, you want the documents to be focused and concise. If you have a long text, you may want to split it into multiple documents to make the RAG system more accurate.
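To make that concrete, here is a minimal, hypothetical helper for splitting a long text into overlapping word-based chunks before indexing; the chunk size and overlap are arbitrary illustrative values, not something prescribed by this post:

def split_into_chunks(text, chunk_size=100, overlap=20):
    # Split a long text into overlapping chunks of roughly chunk_size words
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk can then be indexed as its own document, e.g.:
# documents.extend(split_into_chunks(long_text))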
This retrieval system forms the first component of our RAG architecture. Given a user query, it allows us to find relevant information from our knowledge base. There are many other ways to implement the same functionality, but this highlights the key idea of vector search.
Implementing the Generator
Next, let’s implement the generator component of our RAG system.
This is essentially a prompt engineering problem. When the user provides a query, you first retrieve the most relevant documents with the retriever and build a new prompt that includes both the user’s query and the retrieved documents as context. Then, you use a pre-trained language model to generate a response from this new prompt.
Here is how you can implement it:
...
from transformers import AutoModelForSeq2SeqLM

gen_tokenizer = AutoTokenizer.from_pretrained("t5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_response(query, retrieved_docs, max_length=150):
    # Combine the query and retrieved documents into a single prompt
    context = "\n".join(retrieved_docs)
    prompt = f"question: {query} context: {context}"

    # Generate a response
    inputs = gen_tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = gen_model.generate(
            inputs.input_ids,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=2
        )
    response = gen_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Generate a response for the example query
response = generate_response(query, [doc for doc, score in retrieved_docs])
print("Generated Response:")
print(response)
This is the generator component of our RAG system. You instantiate the pre-trained T5 model (the small version here, but you can pick a larger one, or a different model that fits your system). T5 is a sequence-to-sequence model that generates a new sequence from a given sequence. If you use a different kind of model, such as a causal LM, you may need to change the prompt format to get good results.
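For illustration, here is a rough sketch of what a causal-LM generator could look like, using GPT-2 purely as an example; this is an assumption for demonstration, not part of the original code, and the prompt format is one arbitrary choice among many:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical swap: a decoder-only (causal LM) generator such as GPT-2
causal_tokenizer = AutoTokenizer.from_pretrained("gpt2")
causal_model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_response_causal(query, retrieved_docs, max_new_tokens=100):
    # A causal LM continues its prompt, so phrase the task as text to be completed
    context = "\n".join(retrieved_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    inputs = causal_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = causal_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=causal_tokenizer.eos_token_id,
        )
    # Keep only the newly generated continuation, not the echoed prompt
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    return causal_tokenizer.decode(new_tokens, skip_special_tokens=True)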
In the generate_response() function, you combine the query and the retrieved documents into a single prompt, then use the T5 model to generate a response. You can adjust the generation parameters to improve the output; in the code above, only beam search is used for simplicity. The model’s output is then decoded into a text string as the response. Since you combined multiple documents into a single prompt, be careful that the prompt does not exceed the model’s context window.
The generator leverages the information from the retrieved documents to produce a fluent and factually accurate response. The model behaves vastly differently when you just pose the query without context.
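As a quick check, you can compare the grounded and ungrounded behavior by reusing the function defined above; passing an empty context list here is just an illustrative shortcut:

# Response grounded in the retrieved documents
grounded = generate_response(query, [doc for doc, _ in retrieved_docs])
# Response from the bare query, with no retrieved context
ungrounded = generate_response(query, [])
print("With context:   ", grounded)
print("Without context:", ungrounded)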
Building the Complete RAG System
That’s all you need to build a basic RAG system. Let’s create a function to wrap up the retrieval and generation components:
...

def rag_pipeline(query, documents, retriever_k=3, max_length=150):
    retrieved_docs = retrieve_documents(query, index, documents, k=retriever_k)
    # Pass only the document text (not the distances) to the generator
    docs = [doc for doc, distance in retrieved_docs]
    response = generate_response(query, docs, max_length=max_length)
    return response, retrieved_docs
Then you can use the RAG pipeline in a loop to generate responses for a set of queries:
...
# Example queries
queries = [
    "What is BERT?",
    "How does GPT work?",
    "What is the difference between BERT and GPT?",
    "What is a smaller version of BERT?"
]

# Run the RAG pipeline for each query
for query in queries:
    response, retrieved_docs = rag_pipeline(query, documents)
    print(f"Query: {query}")
    print()
    print("Retrieved Documents:")
    for i, (doc, distance) in enumerate(retrieved_docs):
        print(f"Document {i+1} (Distance: {distance:.4f}):")
        print(doc)
        print()
    print("Generated Response:")
    print(response)
    print("-" * 20)
You can see that the queries are answered one by one in a loop. The set of documents, however, is prepared in advance and reused for all queries. This is how a RAG system typically works.
The complete code of all the above is as follows:
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSeq2SeqLM

# Model to use in retriever
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Model to use in generator
gen_tokenizer = AutoTokenizer.from_pretrained("t5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_embedding(docs, model, tokenizer):
    # Tokenize each text and convert to PyTorch tensors
    inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Embedding defined as mean pooling of all tokens
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state

    expanded_mask = attention_mask.unsqueeze(-1).expand(embeddings.shape).float()
    sum_embeddings = torch.sum(embeddings * expanded_mask, axis=1)
    sum_mask = torch.clamp(expanded_mask.sum(axis=1), min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    # Convert to numpy array
    return mean_embeddings.cpu().numpy()

def retrieve_documents(query, index, documents, k=3):
    # Generate embedding for the query
    query_embedding = generate_embedding(query, model, tokenizer)  # 1xD matrix
    # Search the index for similar documents
    distances, indices = index.search(query_embedding, k)  # 1xk matrices
    # Return the retrieved documents and their distances
    retrieved_docs = [(documents[idx], float(distances[0][i]))
                      for i, idx in enumerate(indices[0])]
    return retrieved_docs

def generate_response(query, retrieved_docs, max_length=150):
    # Combine the query and retrieved documents into a single prompt
    if retrieved_docs:
        context = "\n".join(retrieved_docs)
        prompt = f"question: {query} context: {context}"
    else:
        prompt = f"question: {query}"

    # Generate a response
    inputs = gen_tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = gen_model.generate(
            inputs.input_ids,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=2
        )
    response = gen_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def rag_pipeline(query, documents, retriever_k=3, max_length=150):
    retrieved_docs = retrieve_documents(query, index, documents, k=retriever_k)
    docs = [doc for doc, distance in retrieved_docs]
    response = generate_response(query, docs, max_length=max_length)
    return response, retrieved_docs

# Sample document collection
documents = [
    "Transformers are a type of deep learning model introduced in the paper 'Attention "
    "Is All You Need'.",
    "BERT (Bidirectional Encoder Representations from Transformers) is a "
    "transformer-based model designed to understand the context of a word based on "
    "its surroundings.",
    "GPT (Generative Pre-trained Transformer) is a transformer-based model designed for "
    "natural language generation tasks.",
    "T5 (Text-to-Text Transfer Transformer) treats every NLP problem as a text-to-text "
    "problem, where both the input and output are text strings.",
    "RoBERTa is an optimized version of BERT with improved training methodology and more "
    "training data.",
    "DistilBERT is a smaller, faster version of BERT that retains 97% of its language "
    "understanding capabilities.",
    "ALBERT reduces the parameters of BERT by sharing parameters across layers and using "
    "embedding factorization.",
    "XLNet is a generalized autoregressive pretraining method that overcomes the "
    "limitations of BERT by using permutation language modeling.",
    "ELECTRA uses a generator-discriminator architecture for more efficient pretraining.",
    "DeBERTa enhances BERT with disentangled attention and an enhanced mask decoder."
]

# Generate embeddings for all documents, then create FAISS index for efficient similarity search
document_embeddings = generate_embedding(documents, model, tokenizer)
dimension = document_embeddings.shape[1]   # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)       # Using L2 (Euclidean) distance
index.add(document_embeddings)             # Add embeddings to the index
print(f"Created index with {index.ntotal} documents")

# Example queries
queries = [
    "What is BERT?",
    "How does GPT work?",
    "What is the difference between BERT and GPT?",
    "What is a smaller version of BERT?"
]
# Run the RAG pipeline for each query
for query in queries:
    response, retrieved_docs = rag_pipeline(query, documents)
    print(f"Query: {query}")
    print()
    print("Retrieved Documents:")
    for i, (doc, distance) in enumerate(retrieved_docs):
        print(f"Document {i+1} (Distance: {distance:.4f}):")
        print(doc)
        print()
    print("Generated Response:")
    print(response)
    print("-" * 20)
This code is self-contained. All the documents and queries are defined in the code. This is a starting point, and you may extend it for new features, such as saving the indexed documents in a file that you can load later without re-indexing every time.
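As one example of such an extension, FAISS can serialize an index to disk with faiss.write_index() and load it back with faiss.read_index(). The file names below are arbitrary, and the raw document texts still need to be stored separately (here as JSON), since the index holds only the vectors:

import json
import faiss

# Persist the FAISS index and the raw documents (file names are arbitrary)
faiss.write_index(index, "docs.index")
with open("documents.json", "w") as f:
    json.dump(documents, f)

# Later: reload both without re-computing any embeddings
index = faiss.read_index("docs.index")
with open("documents.json") as f:
    documents = json.load(f)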
Summary
This post explored building a Retrieval-Augmented Generation (RAG) system using transformer models from the Hugging Face library. We’ve implemented each system component, from document indexing to retrieval and generation, and combined them into a complete end-to-end solution.
RAG systems represent a powerful approach to enhancing the capabilities of language models by grounding them in external knowledge. By retrieving relevant information and incorporating it into the generation process, they can produce more accurate, factual, and contextually relevant responses.