In this article, you will learn how quantization shrinks large language models and how to convert an FP16 checkpoint into an efficient GGUF file you can share and run locally.
Topics we will cover include:
- What precision types (FP32, FP16, 8-bit, 4-bit) mean for model size and speed
- How to use huggingface_hub to fetch a model and authenticate
- How to convert to GGUF with llama.cpp and upload the result to Hugging Face
And away we go.
Quantizing LLMs Step-by-Step: Converting FP16 Models to GGUF
Image by Author
Introduction
Large language models like LLaMA, Mistral, and Qwen have billions of parameters that demand a lot of memory and compute power. For example, running LLaMA 7B in full precision can require over 12 GB of VRAM, making it impractical for many users. You can check the details in this Hugging Face discussion. Don’t worry about what “full precision” means yet; we’ll break it down soon. The main idea is this: these models are too big to run on standard hardware without help. Quantization is that help.
Quantization allows independent researchers and hobbyists to run large models on personal computers by shrinking the size of the model without severely impacting performance. In this guide, we’ll explore how quantization works, what different precision formats mean, and then walk through quantizing a sample FP16 model into a GGUF format and uploading it to Hugging Face.
What Is Quantization?
At a very basic level, quantization is about making a model smaller without breaking it. Large language models are made up of billions of numerical values called weights. These numbers control how strongly different parts of the network influence each other when producing an output. By default, these weights are stored using high-precision formats such as FP32 or FP16, which means every number takes up a lot of memory, and when you have billions of them, things get out of hand very quickly. Take a single number like 2.31384. In FP32, that one number alone uses 32 bits of memory. Now imagine storing billions of numbers like that. This is why a 7B model can easily take around 28 GB in FP32 and about 14 GB even in FP16. For most laptops and GPUs, that’s already too much.
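To make those sizes concrete, here is a quick back-of-the-envelope calculation. It is only a rough sketch: real checkpoints also store embeddings, metadata, and sometimes mixed precisions, so actual file sizes vary a little.

# Approximate memory footprint of a 7B-parameter model at different precisions
params = 7_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name}: ~{gigabytes:.0f} GB")

# Prints roughly: FP32 ~28 GB, FP16 ~14 GB, 8-bit ~7 GB, 4-bit ~4 GB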
Quantization fixes this by saying: we don’t actually need that much precision. Instead of storing 2.31384 exactly, we store something close to it using fewer bits. Maybe it becomes 2.3, or a nearby integer value under the hood. The number is slightly less accurate, but the model behaves almost the same in practice. Neural networks can tolerate these small errors because the final output depends on billions of calculations, not a single number. Small differences average out, much like image compression reduces file size without ruining how the image looks. And the payoff is huge: a model that needs 14 GB in FP16 can often run in about 7 GB with 8-bit quantization, or even around 4 GB with 4-bit quantization. This is what makes it possible to run large language models locally instead of relying on expensive servers.
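To see what “storing something close to it using fewer bits” looks like in practice, here is a minimal NumPy sketch of symmetric 8-bit quantization. It is a toy illustration of the idea only, not the exact scheme llama.cpp uses (formats like Q8_0 and Q4_0 quantize weights in small blocks, each with its own scale).

import numpy as np

weights = np.array([2.31384, -0.5172, 0.0049, 1.25], dtype=np.float32)

# Map the float range onto signed 8-bit integers in [-127, 127]
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per weight instead of 4
restored = quantized.astype(np.float32) * scale         # what the model effectively computes with

print(quantized)  # e.g. [127 -28   0  69]
print(restored)   # close to the originals, e.g. [ 2.3138 -0.5101  0.      1.2572]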
After quantizing, we often store the model in a unified file format. One popular format is GGUF, created by Georgi Gerganov (author of llama.cpp). GGUF is a single-file format that includes both the quantized weights and useful metadata. It’s optimized for quick loading and inference on CPUs or other lightweight runtimes. GGUF also supports multiple quantization types (like Q4_0, Q8_0) and works well on CPUs and low-end GPUs. Hopefully, this clarifies both the concept and the motivation behind quantization. Now let’s move on to writing some code.
Step-by-Step: Quantizing a Model to GGUF
1. Installing Dependencies and Logging In to Hugging Face
Before downloading or converting any model, we need to install the required Python packages and authenticate with Hugging Face. We’ll use huggingface_hub, Transformers, and SentencePiece. This ensures we can access public or gated models without errors:
!pip install -U huggingface_hub transformers sentencepiece -q

from huggingface_hub import login
login()
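Calling login() in a notebook prompts you to paste an access token from your Hugging Face account settings. If you prefer a non-interactive setup, you can pass the token directly; the snippet below assumes you have already stored it in an environment variable named HF_TOKEN (for example via Colab’s Secrets panel).

import os
from huggingface_hub import login

# Assumes HF_TOKEN was set beforehand; adjust to however you store your token
login(token=os.environ["HF_TOKEN"])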
2. Downloading a Pre-trained Model
We will pick a small FP16 model from Hugging Face. Here we use TinyLlama 1.1B, which is small enough to run in Colab but still gives a good demonstration. Using Python, we can download it with huggingface_hub:
from huggingface_hub import snapshot_download

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
snapshot_download(
    repo_id=model_id,
    local_dir="model_folder",
    local_dir_use_symlinks=False
)
This command saves the model files into the model_folder directory. You can replace model_id with any Hugging Face model ID that you want to quantize. (If needed, you can also use AutoModel.from_pretrained with torch.float16 to load it first, but snapshot_download is straightforward for grabbing the files.)
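For reference, the transformers-based alternative mentioned above would look roughly like the sketch below (using AutoModelForCausalLM for a chat model). It loads the full FP16 weights into memory, which snapshot_download avoids, so it is mainly useful if you also want to try the model in Python before converting.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Save a local FP16 copy that the conversion script can read
model.save_pretrained("model_folder")
tokenizer.save_pretrained("model_folder")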
3. Setting Up the Conversion Tools
Next, we clone the llama.cpp repository, which contains the conversion scripts. In Colab:
!git clone https://github.com/ggml-org/llama.cpp
!pip install -r llama.cpp/requirements.txt -q
This gives you access to convert_hf_to_gguf.py. The Python requirements ensure you have all needed libraries to run the script.
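If you want to check which output types and options the script supports in the version you just cloned, you can print its help text:

!python3 llama.cpp/convert_hf_to_gguf.py --help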
4. Converting the Model to GGUF with Quantization
Now, run the conversion script, specifying the input folder, output filename, and quantization type. We will use q8_0 (8-bit quantization). This will roughly halve the memory footprint of the model:
!python3 llama.cpp/convert_hf_to_gguf.py /content/model_folder \
  --outfile /content/tinyllama-1.1b-chat.Q8_0.gguf \
  --outtype q8_0
Here /content/model_folder is where we downloaded the model, /content/tinyllama-1.1b-chat.Q8_0.gguf is the output GGUF file, and the --outtype q8_0 flag means “quantize to 8-bit.” The script loads the FP16 weights, converts them into 8-bit values, and writes a single GGUF file. This file is now much smaller and ready for inference with GGUF-compatible tools.
Output:

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/content/tinyllama-1.1b-chat.Q8_0.gguf: n_tensors = 201, total_size = 1.2G
Writing: 100% 1.17G/1.17G [00:26<00:00, 44.5Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /content/tinyllama-1.1b-chat.Q8_0.gguf
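The log already reports the tensor count and total size. If you want to inspect the file programmatically, llama.cpp also provides a small gguf Python package (install it with pip install gguf if it is not already in your environment). The snippet below is a sketch based on its GGUFReader class; attribute names may differ slightly between versions.

from gguf import GGUFReader

reader = GGUFReader("/content/tinyllama-1.1b-chat.Q8_0.gguf")
print(len(reader.tensors), "tensors")        # should match the n_tensors value logged above
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type, tensor.shape)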
You can verify the output:
!ls -lh /content/tinyllama-1.1b-chat.Q8_0.gguf
You should see a file of around 1.1 GB (the log above reports 1.17 GB), roughly half the size of the original FP16 checkpoint, which is about 2.2 GB.
-rw-r--r-- 1 root root 1.1G Dec 30 20:23 /content/tinyllama-1.1b-chat.Q8_0.gguf
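The conversion script handles 8-bit (q8_0) directly, but its --outtype choices do not include 4-bit formats such as the Q4_0 mentioned earlier. If you want an even smaller file, the usual route is to export an FP16 GGUF and then run llama.cpp’s quantization tool on it. The commands below are a rough sketch, assuming a CMake build and the llama-quantize binary name used by recent llama.cpp releases; details can vary between versions.

# Export an FP16 GGUF, then quantize it down to roughly 4 bits per weight
!python3 llama.cpp/convert_hf_to_gguf.py /content/model_folder \
  --outfile /content/tinyllama-1.1b-chat.F16.gguf \
  --outtype f16

# Build llama.cpp and run its quantization tool (binary path may differ by version)
!cmake -B llama.cpp/build llama.cpp
!cmake --build llama.cpp/build --config Release -j
!./llama.cpp/build/bin/llama-quantize \
  /content/tinyllama-1.1b-chat.F16.gguf \
  /content/tinyllama-1.1b-chat.Q4_0.gguf Q4_0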
5. Uploading the Quantized Model to Hugging Face
Finally, you can publish the GGUF model so others can easily download and use it. We will upload it with the huggingface_hub Python library:
from huggingface_hub import HfApi

api = HfApi()
repo_id = "kanwal-mehreen18/tinyllama-1.1b-gguf"
api.create_repo(repo_id, exist_ok=True)
api.upload_file(
    path_or_fileobj="/content/tinyllama-1.1b-chat.Q8_0.gguf",
    path_in_repo="tinyllama-1.1b-chat.Q8_0.gguf",
    repo_id=repo_id
)
This creates a new repository (if it doesn’t exist) and uploads your quantized GGUF file. Anyone can now load it with llama.cpp, llama-cpp-python, or Ollama. You can access the quantized GGUF file that we created in the kanwal-mehreen18/tinyllama-1.1b-gguf repository on the Hugging Face Hub.
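As a quick sanity check of the published file, here is a hedged sketch using llama-cpp-python (pip install llama-cpp-python). Its Llama.from_pretrained helper downloads a GGUF from the Hub by repository ID and file name and loads it for local inference.

from llama_cpp import Llama

# Downloads the GGUF from the Hub (cached locally) and loads it for CPU inference
llm = Llama.from_pretrained(
    repo_id="kanwal-mehreen18/tinyllama-1.1b-gguf",
    filename="tinyllama-1.1b-chat.Q8_0.gguf",
)
output = llm("Q: What is quantization? A:", max_tokens=64)
print(output["choices"][0]["text"])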
Wrapping Up
By following the steps above, you can take any supported Hugging Face model, quantize it (e.g. to 4-bit or 8-bit), and save it as GGUF. Then push it to Hugging Face to share or deploy. This makes it easier than ever to compress and use large language models on everyday hardware.
