You’ve likely used ChatGPT, Gemini, or Grok, which demonstrate how large language models can exhibit human-like intelligence. While creating a clone of these large language models at home is unrealistic and unnecessary, understanding how they work helps demystify their capabilities and recognize their limitations.
All these modern large language models are decoder-only transformers. Surprisingly, their architecture is not overly complex. While you may not have extensive computational power and memory, you can still create a smaller language model that mimics some capabilities of the larger ones. By designing, building, and training such a scaled-down version, you’ll better understand what the model is doing, rather than simply viewing it as a black box labeled “AI.”
In this 10-part crash course, you’ll learn through examples how to build and train a transformer model from scratch using PyTorch. The mini-course focuses on model architecture, while advanced optimization techniques, though important, are beyond our scope. We’ll guide you from data collection through to running your trained model. Each lesson covers a specific transformer component, explaining its role, design parameters, and PyTorch implementation. By the end, you’ll have explored every aspect of the model and gained a comprehensive understanding of how transformer models work.
Let’s get started.
Building Transformer Models from Scratch with PyTorch (10-day Mini-Course)
Photo by Caleb Jack. Some rights reserved.
Who Is This Mini-Course For?
Before we begin, let’s make sure you’re in the right place. The list below provides general guidelines on whom this course is designed for. Don’t worry if you don’t match these points exactly—you might just need to brush up on certain areas to keep up.
- Developers with some coding experience. You should be comfortable writing Python code and setting up your development environment (a prerequisite). You don’t need to be an expert coder, but you should be able to install packages and write scripts without hesitation.
- Developers with basic machine learning knowledge. You should have a general understanding of machine learning models and feel comfortable using them. You don’t need to be an expert, but you should not be afraid to learn more about them.
- Developers familiar with PyTorch. This project is based on PyTorch. To keep it concise, we will not cover the basics of PyTorch. You are not required to be a PyTorch expert, but you should be able to read and understand PyTorch code and, more importantly, know how to consult the PyTorch documentation whenever you encounter a function you are not familiar with.
This mini-course is not a textbook on transformers or LLMs. Instead, it serves as a project-based guide that takes you step by step from a developer with minimal experience to one who can confidently demonstrate how a transformer model is created.
Mini-Course Overview
This mini-course is divided into 10 parts.
Each lesson is designed to take about 30 minutes for the average developer. While some lessons may be completed more quickly, others might require more time if you choose to explore them in depth.
You can progress at your own pace. We recommend following a comfortable schedule of one lesson per day over ten days to allow for proper absorption of the material.
The topics you will cover over the next 10 lessons are as follows:
- Lesson 1: Getting the Data
- Lesson 2: Train a Tokenizer for Your Language Model
- Lesson 3: Positional Encoding
- Lesson 4: Grouped Query Attention
- Lesson 5: Causal Mask
- Lesson 6: Mixture of Expert Models
- Lesson 7: RMS Norm and Skip Connection
- Lesson 8: The Complete Transformer Model
- Lesson 9: Training the Model
- Lesson 10: Using the Model
This journey will be both challenging and rewarding.
While it requires dedication through reading, research, and programming, the hands-on experience you’ll gain in building a transformer model will be invaluable.
Post your results in the comments; I’ll cheer you on!
Hang in there; don’t give up.
You can download the code of this post here.
Lesson 01: Getting the Data
We are building a language model using transformer architecture. A language model is a probabilistic representation of human language that predicts the likelihood of words appearing in a sequence. Rather than being manually constructed, these probabilities are learned from data. Therefore, the first step in building a language model is to collect a large corpus of text that captures the natural patterns of language use.
There are numerous sources of text data available. Project Gutenberg is an excellent source of free text data, offering a wide variety of books across different genres. Here’s how you can download text data from Project Gutenberg to your local directory:
import os
import requests

DATASOURCE = {
    "memoirs_of_grant": "https://www.gutenberg.org/ebooks/4367.txt.utf-8",
    "frankenstein": "https://www.gutenberg.org/ebooks/84.txt.utf-8",
    "sleepy_hollow": "https://www.gutenberg.org/ebooks/41.txt.utf-8",
    "origin_of_species": "https://www.gutenberg.org/ebooks/2009.txt.utf-8",
    "makers_of_many_things": "https://www.gutenberg.org/ebooks/28569.txt.utf-8",
    "common_sense": "https://www.gutenberg.org/ebooks/147.txt.utf-8",
    "economic_peace": "https://www.gutenberg.org/ebooks/15776.txt.utf-8",
    "the_great_war_3": "https://www.gutenberg.org/ebooks/29265.txt.utf-8",
    "elements_of_style": "https://www.gutenberg.org/ebooks/37134.txt.utf-8",
    "problem_of_philosophy": "https://www.gutenberg.org/ebooks/5827.txt.utf-8",
    "nights_in_london": "https://www.gutenberg.org/ebooks/23605.txt.utf-8",
}

for filename, url in DATASOURCE.items():
    if not os.path.exists(f"{filename}.txt"):
        response = requests.get(url)
        with open(f"{filename}.txt", "wb") as f:
            f.write(response.content)
This code downloads each book as a separate text file. Since Project Gutenberg provides pre-cleaned text, we only need to extract the book contents and store them as a list of strings in Python:
# Read and preprocess the text
def preprocess_gutenberg(filename):
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()

    # Find the start and end of the actual content
    start = text.find("*** START OF THE PROJECT GUTENBERG EBOOK")
    start = text.find("\n", start) + 1
    end = text.find("*** END OF THE PROJECT GUTENBERG EBOOK")

    # Extract the main content
    text = text[start:end].strip()

    # Basic preprocessing
    # Remove multiple newlines and spaces
    text = "\n".join(line.strip() for line in text.split("\n") if line.strip())
    return text

def get_dataset_text():
    all_text = []
    for filename in DATASOURCE:
        text = preprocess_gutenberg(f"{filename}.txt")
        all_text.append(text)
    return all_text

text = get_dataset_text()
The preprocess_gutenberg() function removes the Project Gutenberg header and footer from each book and joins the lines into a single string. The get_dataset_text() function applies this preprocessing to all books and returns a list of strings, where each string represents a complete book.
Your Task
Try running the code above! While this small collection of books would typically be insufficient for training a production-ready language model, it serves as an excellent starting point for learning. Notice that the books in the DATASOURCE dictionary span various genres. Can you think about why having diverse genres is important when building a language model?
In the next lesson, you will learn how to convert the textual data into numbers.
Lesson 02: Train a Tokenizer for Your Language Model
Computers operate on numbers, so text must be converted into numerical form for processing. In a language model, we assign numbers to “tokens,” and these thousands of distinct tokens form the model’s vocabulary.
A simple approach would be to open a dictionary and assign a number to each word. However, this naive method cannot handle unseen words effectively. A better approach is to train an algorithm that processes input text and breaks it down into tokens. This algorithm, called a tokenizer, splits text efficiently and can handle unseen words.
There are several approaches to training a tokenizer. Byte-pair encoding (BPE) is one of the most popular methods used in modern LLMs. Let's use the tokenizers library to train a BPE tokenizer on the text we collected in the previous lesson:
import tokenizers

tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = tokenizers.decoders.ByteLevel()

VOCAB_SIZE = 10000
trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=["[pad]", "[eos]"],
    show_progress=True
)
text = get_dataset_text()
tokenizer.train_from_iterator(text, trainer=trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[pad]"), pad_token="[pad]")

# Save the trained tokenizer
tokenizer.save("gutenberg_tokenizer.json", pretty=True)
This example creates a small BPE tokenizer with a vocabulary size of 10,000. Production LLMs typically use vocabularies that are orders of magnitude larger for better language coverage. Even for this toy project, training a tokenizer takes time as it analyzes character collocations to form words. It’s recommended to save the tokenizer as a JSON file, as shown above, so you can easily reload it later:
tokenizer = tokenizers.Tokenizer.from_file("gutenberg_tokenizer.json")
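To confirm the tokenizer behaves as expected, you can round-trip a sentence through it. This quick check is not part of the original listing, and the sample sentence is arbitrary:

encoded = tokenizer.encode("It is a truth universally acknowledged.")
print(encoded.tokens)               # subword tokens produced by BPE
print(encoded.ids)                  # integer IDs the model will consume
print(tokenizer.decode(encoded.ids))  # recovers a human-readable string close to the input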
Your Task
Besides BPE, WordPiece is another common tokenization algorithm. Try creating a WordPiece version of the tokenizer above.
Why is a vocabulary size of 10,000 insufficient for a good language model? Research the number of words in a typical English dictionary and explain the implications for language modeling.
In the next lesson, you’ll learn about positional encoding.
Lesson 03: Positional Encoding
Unlike recurrent neural networks, transformer models process entire sequences simultaneously. However, this parallel processing means they lack inherent understanding of token order. Since token position is crucial for understanding context, transformer models incorporate positional encodings into their input processing to capture this sequential information.
While several positional encoding methods exist, Rotary Positional Encoding (RoPE) has emerged as the most widely used approach. RoPE operates by applying rotational transformations to the embedded token vectors. Each token is represented as a vector, and the encoding process involves multiplying pairs of vector elements by a $2\times 2$ rotation matrix:
$$
\mathbf{\hat{x}}_m = \mathbf{R}_m\mathbf{x}_m = \begin{bmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \cos(m\theta_i)
\end{bmatrix} \mathbf{x}_m
$$
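Here $m$ is the token position and $\theta_i$ is a per-dimension frequency. The formula above does not spell out $\theta_i$; in the implementation below it follows the usual RoPE schedule, where $d$ is the per-head embedding dimension:

$$
\theta_i = N^{-2i/d}, \qquad i = 0, 1, \ldots, \tfrac{d}{2}-1, \qquad N = 10000
$$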
To implement RoPE, you can use the following PyTorch code:
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=1024):
        super().__init__()
        N = 10000
        inv_freq = 1. / (N ** (torch.arange(0, dim, 2).float() / dim))
        position = torch.arange(max_seq_len).float()
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.size(1)
        cos = self.cos[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin[:seq_len].view(1, seq_len, 1, -1)
        return apply_rotary_pos_emb(x, cos, sin)

sequence = torch.randn(1, 10, 4, 128)
rope = RotaryPositionalEncoding(128)
new_sequence = rope(sequence)
The RotaryPositionalEncoding module implements the positional encoding mechanism for input sequences. Its __init__ function pre-computes sine and cosine values for all possible positions and dimensions, while the forward function applies the rotation matrix to transform the input.
An important implementation detail is the use of register_buffer in the __init__ function to store the sine and cosine values. This registers the tensors as non-trainable buffers: they are not updated during training, but PyTorch still moves them with the model across computing devices (e.g., to the GPU) and includes them during model serialization.
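You can verify this behavior yourself; the following quick check (not part of the original listing) shows that the module has no trainable parameters, yet its buffers are part of the saved state:

rope = RotaryPositionalEncoding(128)
print(len(list(rope.parameters())))    # 0: buffers are not trainable parameters
print(list(rope.state_dict().keys()))  # ['cos', 'sin']: buffers are still serialized with the model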
Your Task
Experiment with the code provided above. Earlier, we learned that RoPE applies to embedded token vectors in a sequence. Take a closer look at the input tensor sequence used to test the RotaryPositionalEncoding module: why is it a 4D tensor? While the last dimension (128) represents the embedding size, can you identify what the first three dimensions (1, 10, 4) represent in the context of transformer architecture?
In the next lesson, you will learn about the attention block.
Lesson 04: Grouped Query Attention
The signature component of a transformer model is its attention mechanism. When processing a sequence of tokens, the attention mechanism builds connections between tokens to understand their context.
The attention mechanism predates transformer models, and several variants have evolved over time. In this lesson, you will learn to implement Grouped Query Attention (GQA).
A transformer model begins with a sequence of embedded tokens, which are essentially vectors. The modern attention mechanism computes an output sequence based on three input sequences: query, key, and value. These three sequences are derived from the input sequence through different projections:
batch_size, seq_len, hidden_dim = x.shape

q_proj = nn.Linear(hidden_dim, num_heads * head_dim)
k_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim)
v_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim)
out_proj = nn.Linear(num_heads * head_dim, hidden_dim)

q = q_proj(x).view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(batch_size, seq_len, num_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(batch_size, seq_len, num_kv_heads, head_dim).transpose(1, 2)
output = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
output = output.transpose(1, 2).reshape(batch_size, seq_len, hidden_dim).contiguous()
output = out_proj(output)  # project the attention output back to the hidden dimension
The projection is performed by a fully-connected neural network layer that operates on the input tensor's last dimension. As shown above, the projection's output is reshaped using view() and then transposed. The input tensor x is 3D, and the view() function transforms it into a 4D tensor by splitting the last dimension into two: the attention heads and the head dimension. The transpose() function then swaps the sequence length dimension with the attention head dimension.

In the resulting 4D tensor, the attention operation only involves the last two dimensions. The actual attention computation is performed using PyTorch's built-in scaled_dot_product_attention() function. The result is then reshaped back into a 3D tensor and projected to the original dimension.
This architecture is called grouped query attention because it uses different numbers of heads for queries versus keys and values. Typically, the number of query heads is a multiple of the number of key-value heads.
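For example, with the head counts used later in this mini-course (an illustrative calculation, not extra model code):

num_heads = 8                            # query heads
num_kv_heads = 4                         # key/value heads
num_groups = num_heads // num_kv_heads   # 2: each key/value head is shared by 2 query heads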
Since we will use this attention mechanism a lot, let's create a class for it:
import torch.nn.functional as F

class GQA(nn.Module):
    def __init__(self, hidden_dim, num_heads, num_kv_heads, dropout=0.1):
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = hidden_dim // num_heads
        self.num_groups = num_heads // num_kv_heads
        self.dropout = dropout
        self.q_proj = nn.Linear(hidden_dim, self.num_heads * self.head_dim)
        self.k_proj = nn.Linear(hidden_dim, self.num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(hidden_dim, self.num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(self.num_heads * self.head_dim, hidden_dim)

    def forward(self, q, k, v, mask=None, rope=None):
        q_batch_size, q_seq_len, hidden_dim = q.shape
        k_batch_size, k_seq_len, hidden_dim = k.shape
        v_batch_size, v_seq_len, hidden_dim = v.shape

        # projection: split the last dimension into (heads, head_dim)
        q = self.q_proj(q).view(q_batch_size, q_seq_len, -1, self.head_dim)
        k = self.k_proj(k).view(k_batch_size, k_seq_len, -1, self.head_dim)
        v = self.v_proj(v).view(v_batch_size, v_seq_len, -1, self.head_dim)

        # apply rotary positional encoding while the tensors are still laid out as
        # (batch, seq, heads, head_dim), the layout expected by the RoPE module from Lesson 3
        if rope:
            q = rope(q)
            k = rope(k)

        # move the head dimension before the sequence dimension for attention
        q = q.transpose(1, 2).contiguous()
        k = k.transpose(1, 2).contiguous()
        v = v.transpose(1, 2).contiguous()

        # compute grouped query attention (dropout only during training)
        output = F.scaled_dot_product_attention(q, k, v, attn_mask=mask,
                                                dropout_p=self.dropout if self.training else 0.0,
                                                enable_gqa=True)
        output = output.transpose(1, 2).reshape(q_batch_size, q_seq_len, hidden_dim).contiguous()
        output = self.out_proj(output)
        return output
The forward() function includes two optional arguments: mask and rope. The rope argument expects a module that applies rotary positional encoding, which was covered in the previous lesson. The mask argument will be explained in the next lesson.
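As a quick smoke test (assuming the hidden size and head counts used later in this mini-course), you can run a random batch through the class and confirm that the output shape matches the input:

gqa = GQA(hidden_dim=768, num_heads=8, num_kv_heads=4)
x = torch.randn(2, 16, 768)   # (batch, seq, hidden)
out = gqa(x, x, x)            # self-attention: query, key, and value are the same sequence
print(out.shape)              # torch.Size([2, 16, 768])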
Your Task
Consider why this implementation is called grouped query attention. The original transformer architecture uses multihead attention. How would you modify this grouped query attention implementation to create a multihead attention mechanism?
In the next lesson, you’ll learn about masking in attention operations.
Lesson 05: Causal Mask
A key characteristic of decoder-only transformer models is the use of causal masks in their attention layers. A causal mask is a matrix applied during attention score calculation to prevent the model from attending to future tokens. Specifically, a query token $i$ can only attend to key tokens $j$ where $j \leq i$.
With query and key sequences of length $N$, the causal mask is a square matrix of shape $(N, N)$. The element $(i,j)$ indicates whether query token $i$ can attend to the key token $j$.
In a boolean mask matrix, the element $(i,j)$ is True when $j \le i$, making all elements on and below the diagonal True. However, we typically use a floating-point matrix instead, because it can simply be added to the attention score matrix before the softmax normalization. In this case, elements where $j \le i$ are set to 0, and all other elements are set to $-\infty$.
Creating such a causal mask is straightforward in PyTorch:
mask = torch.triu(torch.full((N, N), float('-inf')), diagonal=1)
This creates a matrix of shape $(N, N)$ filled with $-\infty$, then uses the triu() function to zero out all elements on and below the diagonal, creating an upper-triangular matrix.
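For example, with $N = 4$ the mask looks like this:

N = 4
print(torch.triu(torch.full((N, N), float('-inf')), diagonal=1))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])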
Applying the mask in attention is straightforward:
output = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, enable_gqa=True)
In some cases, you might need to mask additional elements, such as padding tokens in the sequence. This can be done by setting the corresponding elements to $-\infty$ in the mask tensor. While the example above shows a 2D tensor, when using both causal and padding masks, you'll need to create a 3D tensor. In this case, each element in the batch has its own mask, and the first dimension of the mask tensor should match the batch dimension of the input tensors q, k, and v.
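As a starting point (with made-up token IDs for illustration; combining this with the causal mask is the task below), a per-position padding mask can be built directly from the token IDs:

pad_id = tokenizer.token_to_id("[pad]")
ids = torch.tensor([[15, 92, 7, pad_id],
                    [23, 31, pad_id, pad_id]])   # (B, N) token IDs
is_pad = (ids == pad_id)                         # True where a position is padding
pad_mask = torch.zeros(ids.shape).masked_fill(is_pad, float('-inf'))  # -inf blocks attention to these keys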
Your Task
Given the scaled_dot_product_attention() call above and a tensor q of shape $(B, H, N, D)$ containing some padding tokens, how would you create a mask tensor of shape $(B, N, N)$ that combines both causal and padding masks to: (1) prevent attention to future tokens and (2) mask all attention operations involving padding tokens?
In the next lesson, you will learn about the MLP sublayer.
Lesson 06: Mixture of Expert Models
Transformer models consist of stacked transformer blocks, where each block contains an attention sublayer and an MLP sublayer. The attention sublayer implements a multi-head attention mechanism, while the MLP sublayer is a feed-forward network.
The MLP sublayer introduces non-linearity to the model and is where much of the model’s “intelligence” resides. To enhance the model’s capabilities, you can either increase the size of the feed-forward network or employ a more sophisticated architecture like Mixture of Experts (MoE).
MoE is a recent innovation in transformer models. It consists of multiple parallel MLP sublayers with a router that selects a subset of them to process the input. The final output is a weighted sum of the outputs from the selected MLP sublayers. Many modern large language models use SwiGLU as their MLP sublayer, which combines three linear transformations with a SiLU activation function. Here’s how to implement it:
class SwiGLU(nn.Module):
    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, intermediate_dim)
        self.up = nn.Linear(hidden_dim, intermediate_dim)
        self.down = nn.Linear(intermediate_dim, hidden_dim)
        self.act = nn.SiLU()

    def forward(self, x):
        x = self.act(self.gate(x)) * self.up(x)
        x = self.down(x)
        return x
For example, in a system with 8 MLP sublayers, the router processes each input token using a linear layer to produce 8 scores. The top 2 scoring sublayers are selected to process the input, and their outputs are combined using weighted summation.
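The routing step is just a top-k followed by a softmax. Here is a standalone illustration for a single token with 8 experts (random scores, not tied to the model):

scores = torch.randn(8)                       # router logits for one token
top_vals, top_idx = torch.topk(scores, k=2)   # the 2 highest-scoring experts
weights = F.softmax(top_vals, dim=-1)         # mixing weights over the selected experts
print(top_idx, weights)                       # the weights sum to 1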
Since PyTorch doesn’t yet provide a built-in MoE layer, you need to implement it yourself. Here’s an implementation:
class MoELayer(nn.Module):
    def __init__(self, hidden_dim, intermediate_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Create expert networks
        self.experts = nn.ModuleList([
            SwiGLU(hidden_dim, intermediate_dim)
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden_states):
        batch_size, seq_len, hidden_dim = hidden_states.shape

        # Reshape for expert processing, then compute routing probabilities
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        # shape of router_logits: (batch_size * seq_len, num_experts)
        router_logits = self.router(hidden_states_reshaped)

        # Select top-k experts, then softmax output probabilities will sum to 1
        # output shape: (batch_size * seq_len, k)
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        top_k_probs = F.softmax(top_k_logits, dim=-1)

        # Allocate output tensor
        output = torch.zeros(batch_size * seq_len, hidden_dim,
                             device=hidden_states.device, dtype=hidden_states.dtype)

        # Process through selected experts
        unique_experts = torch.unique(top_k_indices)
        for i in unique_experts:
            expert_id = int(i)
            # token_mask (boolean tensor) = which token of the input should use this expert
            # token_mask shape: (batch_size * seq_len,)
            mask = (top_k_indices == expert_id)
            token_mask = mask.any(dim=1)
            assert token_mask.any(), f"Expecting some tokens using expert {expert_id}"

            # select tokens, apply the expert, then add to the output
            expert_input = hidden_states_reshaped[token_mask]
            expert_weight = top_k_probs[mask].unsqueeze(-1)         # shape: (N, 1)
            expert_output = self.experts[expert_id](expert_input)   # shape: (N, hidden_dim)
            output[token_mask] += expert_output * expert_weight

        # Reshape back to original shape
        output = output.view(batch_size, seq_len, hidden_dim)
        return output
The forward() method first uses the router to generate top_k_indices and top_k_probs. Based on these indices, it selects and applies the corresponding experts to process the input. The results are combined using weighted summation with top_k_probs. The input is a 3D tensor of shape (batch_size, seq_len, hidden_dim), and since each token in a sequence can be processed by different experts, the method uses masking to correctly apply the weighted sum.
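As a quick check (using the hidden size from the model configuration in Lesson 8 as an assumption), you can confirm that the MoE layer preserves the input shape:

moe = MoELayer(hidden_dim=768, intermediate_dim=3072, num_experts=8, top_k=2)
x = torch.randn(2, 16, 768)   # (batch, seq, hidden)
print(moe(x).shape)           # torch.Size([2, 16, 768]), same shape as the input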
Your Task
Models like DeepSeek V2 incorporate a shared expert in their MoE architecture. It is an expert that processes every input regardless of routing. Can you modify the code above to include a shared expert?
In the next lesson, you will learn about normalization layers.
Lesson 07: RMS Norm and Skip Connections
A transformer is a deep learning model: it stacks many transformer blocks, and each block contains multiple operations. Such deep models are prone to the vanishing gradient problem, so normalization layers are added to mitigate this issue and stabilize training.
The two most common normalization layers in transformer models are Layer Norm and RMS Norm. We will use RMS Norm because it has fewer parameters. Using the built-in RMS Norm layer in PyTorch is straightforward:
rms_norm = nn.RMSNorm(hidden_dim)
output_rms = rms_norm(x)
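For reference, RMS Norm divides each vector by its root mean square and applies a learned per-dimension gain $\boldsymbol{\gamma}$, which is its only set of parameters:

$$
\mathrm{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \boldsymbol{\gamma}
$$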
There are two ways to use RMS Norm in a transformer model: pre-norm and post-norm. In pre-norm, you apply RMS Norm before the attention and feed-forward sublayers, while in post-norm, you apply it after. This difference becomes clear when considering the skip connections. Here’s an example of a decoder-only transformer block with pre-norm:
class DecoderLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads, num_kv_heads, moe_experts, moe_topk, dropout=0.1):
        super().__init__()
        self.self_attn = GQA(hidden_dim, num_heads, num_kv_heads, dropout)
        self.mlp = MoELayer(hidden_dim, 4 * hidden_dim, moe_experts, moe_topk)
        self.norm1 = nn.RMSNorm(hidden_dim)
        self.norm2 = nn.RMSNorm(hidden_dim)

    def forward(self, x, mask=None, rope=None):
        # self-attention sublayer
        out = self.norm1(x)
        out = self.self_attn(out, out, out, mask, rope)
        x = out + x
        # MLP sublayer
        out = self.norm2(x)
        out = self.mlp(out)
        return out + x
Each transformer block contains an attention sublayer (implemented using the GQA class from Lesson 4) and a feed-forward sublayer (implemented using the MoELayer class from Lesson 6), along with two RMS Norm layers.
In the forward() method, we first normalize the input before applying the attention sublayer. Then, for the skip connection, we add the original unnormalized input to the attention sublayer's output. In a post-norm approach, we would instead apply attention to the unnormalized input and then normalize the tensor after the skip connection. Research has shown that the pre-norm approach provides more stable training.
Your Task
From the description above, how would you modify the code to make it a post-norm transformer block?
In the next lesson, you will learn to create the complete transformer model.
Lesson 08: The Complete Transformer Model
So far, you have created all the building blocks of the transformer model. You can build a complete transformer model by stacking these blocks together. Before doing that, let’s list out the design parameters by creating a dictionary for the model configuration:
model_config = {
    "num_layers": 8,
    "num_heads": 8,
    "num_kv_heads": 4,
    "hidden_dim": 768,
    "moe_experts": 8,
    "moe_topk": 3,
    "max_seq_len": 512,
    "vocab_size": len(tokenizer.get_vocab()),
    "dropout": 0.1,
}
The number of transformer blocks and the hidden dimension directly determine the model size. You can think of them as the “depth” and “width” of the model respectively. For each transformer block, you need to specify the number of attention heads (and in GQA, the number of key-value heads). Since we’re using the MoE model, you also need to define the total number of experts and the top-k value. Note that the MLP sublayer (implemented as SwiGLU) typically sets the intermediate dimension to 4 times the hidden dimension, so you don’t need to specify this separately.
The remaining hyperparameters don’t affect the model size: the maximum sequence length (which the rotary positional encoding depends on), the vocabulary size (which determines the embedding matrix dimensions), and the dropout rate used during training.
With these, you can create a transformer model. Let's call it TextGenerationModel:
class TextGenerationModel(nn.Module):
    def __init__(self, num_layers, num_heads, num_kv_heads, hidden_dim,
                 moe_experts, moe_topk, max_seq_len, vocab_size, dropout=0.1):
        super().__init__()
        self.rope = RotaryPositionalEncoding(hidden_dim // num_heads, max_seq_len)
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.decoders = nn.ModuleList([
            DecoderLayer(hidden_dim, num_heads, num_kv_heads, moe_experts, moe_topk, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.RMSNorm(hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids, mask=None):
        x = self.embedding(ids)
        for decoder in self.decoders:
            x = decoder(x, mask, self.rope)
        x = self.norm(x)
        return self.out(x)

model = TextGenerationModel(**model_config)
In this model, we create a single rotary position encoding module that’s reused across all transformer blocks. Since it’s a constant module, we only need one instance. The model begins with an embedding layer that converts token IDs into embedding vectors. These vectors are then processed through a series of transformer blocks. The output from the final transformer block remains a sequence of embedding vectors, which we normalize and project to vocabulary-sized logits using a linear layer. These logits represent probability distributions for predicting the next token in the sequence.
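As an optional sanity check, you can count the parameters of the model you just created; the exact number depends on the configuration above:

n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params / 1e6:.1f}M")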
Your Task
The model is now complete. However, consider this question: Why does the forward() method accept a mask as an optional argument? If we're using a causal mask, wouldn't it make more sense to generate it internally within the model?
In the next lesson, you will learn to train the model.
Lesson 09: Training the Model
Now that you’ve built a model, let’s learn how to train it. In lesson 1, you prepared the dataset for training. The next step is to wrap the dataset as a PyTorch Dataset object:
class GutenbergDataset(torch.utils.data.Dataset):
    def __init__(self, text, tokenizer, seq_len=512):
        self.seq_len = seq_len
        # Encode the entire text
        self.encoded = tokenizer.encode(text).ids

    def __len__(self):
        return len(self.encoded) - self.seq_len

    def __getitem__(self, idx):
        chunk = self.encoded[idx:idx + self.seq_len + 1]  # +1 for target
        x = torch.tensor(chunk[:-1])
        y = torch.tensor(chunk[1:])
        return x, y

BATCH_SIZE = 32
text = "\n".join(get_dataset_text())
dataset = GutenbergDataset(text, tokenizer, seq_len=model_config["max_seq_len"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
This dataset is designed for model pre-training, where the task is to predict the next token in a sequence. The dataset object is a Python iterable that produces pairs of (x, y), where x is a fixed-length sequence of token IDs and y is the same sequence shifted by one position, i.e., the next token at each position. Since the training targets (y) are derived from the input data itself, this approach is called self-supervised learning.
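You can inspect one sample to confirm the shift between input and target (a quick check, not part of the original listing):

x, y = dataset[0]
print(x.shape, y.shape)   # both torch.Size([512])
print(x[1:6])
print(y[0:5])             # same values: y is x shifted left by one token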
Depending on your hardware, you can optimize the training speed and memory usage. If you have a GPU with limited memory, you can load the model onto the GPU and use half-precision (bfloat16) to reduce memory consumption. Here’s how:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device).to(torch.bfloat16)
If you still encounter an out-of-memory error, you may want to reduce the model size or the batch size.
You need to write a training loop to train the model. In PyTorch, it may look like this:
import torch.optim as optim
import tqdm

N_EPOCHS = 2
LR = 0.0005
WARMUP_STEPS = 2000
CLIP_NORM = 6.0

optimizer = optim.AdamW(model.parameters(), lr=LR)
loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.token_to_id("[pad]"))

# Learning rate scheduling
warmup_scheduler = optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=WARMUP_STEPS)
cosine_scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=N_EPOCHS * len(dataloader) - WARMUP_STEPS, eta_min=0)
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[WARMUP_STEPS])

print(f"Training for {N_EPOCHS} epochs with {len(dataloader)} steps per epoch")
best_loss = float('inf')

for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss = 0

    progress_bar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{N_EPOCHS}")
    for x, y in progress_bar:
        x = x.to(device)
        y = y.to(device)

        # Create causal mask
        mask = create_causal_mask(x.shape[1], device, torch.bfloat16)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(x, mask.unsqueeze(0))

        # Compute loss
        loss = loss_fn(outputs.view(-1, outputs.shape[-1]), y.view(-1))

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            model.parameters(), CLIP_NORM, error_if_nonfinite=True
        )
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.item()

        # Show loss in tqdm
        progress_bar.set_postfix(loss=loss.item())

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{N_EPOCHS}; Avg loss: {avg_loss:.4f}")

    # Save checkpoint if loss improved
    if avg_loss < best_loss:
        best_loss = avg_loss
        torch.save(model.state_dict(), "textgen_model.pth")
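The loop relies on a create_causal_mask() helper that is not shown in this excerpt. A minimal version, following the triu() pattern from Lesson 5, might look like this:

def create_causal_mask(seq_len, device, dtype):
    # -inf above the diagonal, 0 on and below it
    return torch.triu(
        torch.full((seq_len, seq_len), float('-inf'), device=device, dtype=dtype),
        diagonal=1
    )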
While this training loop might differ from what you’ve used for other models, it follows best practices for training transformers. The code uses a cosine learning rate scheduler with a warm-up period—the learning rate gradually increases during warm-up and then decreases following a cosine curve.
To prevent gradient explosion, we implement gradient clipping, which stabilizes training by limiting drastic changes in model parameters.
The model functions as a next-token predictor, outputting a probability distribution over the entire vocabulary. Since this is essentially a classification task (predicting which token comes next), we use cross-entropy loss for training.
The training progress is monitored using tqdm, which displays the loss for each epoch. The model’s parameters are saved whenever the loss improves, ensuring we keep the best performing version.
Your Task
The training loop above runs for only two epochs. Consider why this number is relatively small, and what factors might make additional epochs unnecessary for this particular task.
In the next lesson, you will learn to use the model.
Lesson 10: Using the Model
After training the model, you can use it to generate text. To optimize performance, disable gradient computation in PyTorch. Additionally, since some modules like dropout behave differently during training and inference, switch the model to evaluation mode before use.
Let’s create a function for text generation that can be called multiple times to generate different samples:
def generate_text(model, tokenizer, prompt, max_length=100, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device

    # Encode the prompt, set tensor to batch size of 1
    input_ids = torch.tensor(tokenizer.encode(prompt).ids).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_length):
            # Get model predictions for the next token as the last element of the output
            outputs = model(input_ids)
            next_token_logits = outputs[:, -1, :] / temperature
            # Sample from the distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            # Append to input_ids
            input_ids = torch.cat([input_ids, next_token], dim=1)
            # Stop if we predict the end token
            if next_token[0].item() == tokenizer.token_to_id("[eos]"):
                break

    return tokenizer.decode(input_ids[0].tolist())

# Test the model with some prompts
test_prompts = [
    "Once upon a time,",
    "We the people of the",
    "In the beginning was the",
]

print("\nGenerating sample texts:")
for prompt in test_prompts:
    generated = generate_text(model, tokenizer, prompt)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated}")
    print("-" * 80)
The generate_text() function implements probabilistic sampling for token generation. Although the model outputs logits representing a probability distribution over the vocabulary, it doesn't always select the most probable token. Instead, it uses the softmax function to convert the logits into probabilities and samples from them. The temperature parameter controls the sampling distribution: lower values make the model more conservative by emphasizing likely tokens, while higher values make it more creative by reducing the probability differences between tokens.
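You can see the effect of temperature by applying softmax to the same logits at different temperatures (a standalone illustration with arbitrary numbers):

logits = torch.tensor([2.0, 1.0, 0.5])
for t in (0.5, 1.0, 2.0):
    print(t, F.softmax(logits / t, dim=-1))  # lower temperature concentrates probability on the top token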
The function takes a partial sentence as the prompt string and generates a sequence of tokens using the model. Although the model is trained with batches, this function uses a batch size of 1 for simplicity. The final output is returned as a decoded string.
Your Task
Look at the code above: Why does the function need to determine the model’s device at the beginning?
The current implementation uses a simple sampling approach. An advanced technique called nucleus sampling (or top-p sampling) considers only the most likely tokens whose cumulative probability exceeds a threshold $p$. How would you modify the code to implement nucleus sampling?
This is the last lesson.
The End! (Look How Far You Have Come)
You made it. Well done!
Take a moment and look back at how far you have come.
- You discovered what transformer models are and how their architecture is organized.
- You learned how to build a transformer model from scratch.
- You learned how to train and use a transformer model.
Don’t make light of this; you have come a long way in a short time. This is just the beginning of your transformer model journey. Keep practicing and developing your skills.
Summary
How did you do with the mini-course?
Did you enjoy this crash course?
Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.