Building LLM Applications with Hugging Face Endpoints and FastAPI
Introduction
FastAPI is a modern, high-performance web framework for building APIs with Python. It simplifies the job of building efficient HTTP APIs and integrating them with AI and machine learning models, including language models like those available on the Hugging Face Hub.
This article illustrates how to combine Hugging Face Endpoints and FastAPI to build an API endpoint that interacts with a large language model (LLM) hosted on Hugging Face. The FastAPI server listens for incoming requests containing text prompts, which are forwarded to the Hugging Face model using the Requests library. The language model processes the input prompt and returns a generated response, which is then sent back to the client (our local machine). Combining FastAPI's efficient handling of HTTP requests with Hugging Face's powerful LLMs lets developers quickly build AI-powered applications that respond to user prompts with natural language generation.
Step-by-Step Example
First, install the following packages on your local environment:
```sh
pip install fastapi uvicorn requests
```
Let's briefly review each of these packages:
- FastAPI: enables API development in Python over HTTP and integrates well with AI models like Hugging Face's language models. It provides an endpoint-based infrastructure to which requests, such as prompts, can be sent.
- Uvicorn: an ASGI (Asynchronous Server Gateway Interface) server used to run FastAPI applications. It handles asynchronous operations with high performance, making it suitable for serving multiple concurrent requests, as is common in production AI applications.
- Requests: a handy HTTP library for making web service requests in Python. It hides much of the complexity of HTTP interactions, making it easy to send GET, POST, and other kinds of HTTP requests.
The next step is to create a Python script file called app.py, containing the instructions to call the Hugging Face API and initialize FastAPI. Note that for HF_API_KEY you'll need to replace the placeholder string with your own Hugging Face API token, which you can obtain after registering on the Hugging Face website.

The app.py file starts as follows:
```python
import requests
from fastapi import FastAPI
from pydantic import BaseModel

HF_API_URL = "https://api-inference.huggingface.co/models/facebook/opt-1.3b"
HF_API_KEY = "your_huggingface_api_key"  # Replace with your actual API key

class PromptRequest(BaseModel):
    prompt: str

app = FastAPI()
```
The HF_API_URL variable points to an example language model served through the Hugging Face Inference API: the opt-1.3b model, a large-scale transformer-based model built by Meta AI (formerly Facebook AI Research, FAIR). The model is reachable via HTTP requests at https://api-inference.huggingface.co/models/facebook/opt-1.3b; it expects a prompt as input and generates a text response based on that prompt, as we will see shortly.
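Before wiring this into FastAPI, it can be useful to confirm that your token and the model URL work. The following standalone sketch (using the same request pattern and placeholder key as the endpoint we build below) queries the Inference API directly:

```python
import requests

HF_API_URL = "https://api-inference.huggingface.co/models/facebook/opt-1.3b"
HF_API_KEY = "your_huggingface_api_key"  # Replace with your actual API key

# Send a prompt straight to the Hugging Face Inference API
response = requests.post(
    HF_API_URL,
    headers={"Authorization": f"Bearer {HF_API_KEY}"},
    json={"inputs": "Once upon a time"},
)
print(response.status_code)
print(response.json())
```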
Let's continue the code for our app.py script:
```python
@app.post("/generate")
async def generate_text(request: PromptRequest):
    prompt = request.prompt
    headers = {"Authorization": f"Bearer {HF_API_KEY}"}
    payload = {"inputs": prompt}

    # Forward the prompt to the Hugging Face Inference API
    response = requests.post(HF_API_URL, json=payload, headers=headers)

    if response.status_code == 200:
        return {"response": response.json()}
    else:
        return {"error": response.text}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Now it's time to run FastAPI locally on our machine. To do so, we execute the following command in our terminal, after which the FastAPI application will start running at http://127.0.0.1:8000.
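Because app.py calls uvicorn.run() in its __main__ block, running the script directly with Python is enough; invoking Uvicorn explicitly is a common alternative:

```sh
python app.py
# or, equivalently:
uvicorn app:app --host 0.0.0.0 --port 8000
```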
Now we are ready to test the API locally. Make sure you run the following code from the same machine, with the server still running:
```python
import requests

url = "http://127.0.0.1:8000/generate"
data = {"prompt": "Once upon a time"}
response = requests.post(url, json=data)

# Print the response
print(response.json())
```
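If you prefer the command line, the same request can also be sent with curl while the server is running:

```sh
curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time"}'
```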
This code sends a POST request to the FastAPI endpoint containing the prompt "Once upon a time." We expect to receive a response generated by the Hugging Face language model the endpoint points to, namely a continuation of that prompt.
Once everything has been set up correctly, the output should look similar to this:
```json
{
  "response": [
    {
      "generated_text": "Once upon a time, in a land far, far away…"
    }
  ]
}
```
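Keep in mind that the very first call may fail while the hosted model warms up: the serverless Inference API typically returns an error (often a 503 status) until the model is loaded, in which case our endpoint returns the {"error": ...} payload instead. A minimal, hypothetical client-side retry loop could look like this:

```python
import time
import requests

url = "http://127.0.0.1:8000/generate"
data = {"prompt": "Once upon a time"}

# Retry a few times in case the upstream model is still loading
for attempt in range(5):
    result = requests.post(url, json=data).json()
    if "response" in result:
        print(result)
        break
    print(f"Attempt {attempt + 1} failed, retrying in 10 seconds:", result.get("error"))
    time.sleep(10)
```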
Wrapping Up
Building LLM applications with Hugging Face and FastAPI is a powerful way to combine cutting-edge AI models with efficient web frameworks. By following the steps outlined in this article, you can create a seamless pipeline for generating text responses and deploying AI-powered APIs. Whether you’re prototyping or scaling to production, this approach offers a robust foundation for integrating natural language generation into your applications.
Once you are familiar with this setup, the next step toward deploying real-world language model applications into production with Hugging Face and FastAPI is to host them on services such as AWS, GCP, Azure, or Heroku.