Introduction
You don’t always need a heavy wrapper, a big client class, or dozens of lines of boilerplate to call a large language model. Sometimes one well-crafted line of Python does all the work: send a prompt, receive a response. That kind of simplicity can speed up prototyping or embedding LLM calls inside scripts or pipelines without architectural overhead.
In this article, you’ll see ten Python one-liners that call and interact with LLMs, covering hosted cloud APIs, local model servers, and a couple of handy tricks like streaming and asynchronous calls.
Each snippet comes with a brief explanation and a link to official documentation, so you can verify what’s happening under the hood. By the end, you’ll know not only how to drop in fast LLM calls but also understand when and why each pattern works.
Setting Up
Before dropping in the one-liners, there are a few things to prepare so they run smoothly:
Install required packages (only once):
pip install openai anthropic google-generativeai requests httpx
Ensure your API keys are set in environment variables, never hard-coded in your scripts. For example:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="claude-yourkey"
export GOOGLE_API_KEY="your_google_key"
For local setups (Ollama, LM Studio, vLLM), you need the model server running locally and listening on the correct port (for instance, Ollama’s default REST API runs at http://localhost:11434).
All one-liners assume you use the right model name and that the model is either accessible via cloud or locally. With that in place, you can paste each one-liner directly into your Python REPL or script and get a response, subject to quota or local resource limits.
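If you want a quick sanity check before pasting any of the cloud examples, a short snippet like the one below (a minimal sketch, nothing provider-specific) confirms the keys are actually visible to Python:

import os

# Verify that the expected API keys are present in the environment
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")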
Hosted API One-Liners (Cloud Models)
Hosted APIs are the easiest way to start using large language models. You don’t have to run a model locally or worry about GPU memory; just install the client library, set your API key, and send a prompt. These APIs are maintained by the model providers themselves, so they’re reliable, secure, and frequently updated.
The following one-liners show how to call some of the most popular hosted models directly from Python. Each example sends a simple message to the model and prints the generated response.
1. OpenAI GPT Chat Completion
OpenAI’s API gives access to GPT models like GPT-4o and GPT-4o-mini. The SDK handles everything from authentication to response parsing.
from openai import OpenAI; print(OpenAI().chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user","content":"Explain vector similarity"}]).choices[0].message.content)
What it does: It creates a client, sends a message to GPT-4o-mini, and prints the model’s reply.
Why it works: The openai Python package wraps the REST API cleanly. You only need your OPENAI_API_KEY set as an environment variable.
Documentation: OpenAI Chat Completions API
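If you later want more control, the same call expands naturally. The sketch below uses illustrative parameter values (the temperature and token cap are not requirements of the API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain vector similarity"}],
    temperature=0.3,  # illustrative: lower values give more deterministic answers
    max_tokens=200,   # illustrative: caps the length of the reply
)
print(response.choices[0].message.content)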
2. Anthropic Claude
Anthropic’s Claude models (Claude 3, Claude 3.5 Sonnet, etc.) are known for their long context windows and detailed reasoning. Their Python SDK follows a similar chat-message format to OpenAI’s.
from anthropic import Anthropic; print(Anthropic().messages.create(model="claude-3-5-sonnet-latest", max_tokens=500, messages=[{"role":"user","content":"How does chain of thought prompting work?"}]).content[0].text)
What it does: Initializes the Claude client, sends a message, and prints the text of the first response block.
Why it works: The .messages.create() method uses a standard message schema (role + content) plus a required max_tokens value, and returns structured output that’s easy to extract.
Documentation: Anthropic Claude API Reference
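Spread over multiple lines, the same request is easier to extend. This sketch adds an illustrative system prompt (optional) and makes the required max_tokens explicit:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,  # required by the Messages API
    system="You are a concise technical tutor.",  # optional, illustrative system prompt
    messages=[{"role": "user", "content": "How does chain of thought prompting work?"}],
)
print(message.content[0].text)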
3. Google Gemini
Google’s Gemini API (via the google-generativeai library) makes it simple to call multimodal and text models with minimal setup. The key difference is that Gemini’s API treats every prompt as “content generation,” whether it’s text, code, or reasoning.
import os, google.generativeai as genai; genai.configure(api_key=os.getenv("GOOGLE_API_KEY")); print(genai.GenerativeModel("gemini-1.5-flash").generate_content("Describe retrieval-augmented generation").text)
What it does: Calls the Gemini 1.5 Flash model to describe retrieval-augmented generation (RAG) and prints the returned text.
Why it works: GenerativeModel() sets the model name, and generate_content() handles the prompt/response flow. You just need your GOOGLE_API_KEY configured.
Documentation: Google Gemini API Quickstart
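To tune the output, google-generativeai also accepts a generation config. The sketch below keeps the same gemini-1.5-flash model and uses illustrative values for temperature and output length:

import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-flash")
result = model.generate_content(
    "Describe retrieval-augmented generation",
    generation_config=genai.types.GenerationConfig(
        temperature=0.4,        # illustrative: controls randomness
        max_output_tokens=256,  # illustrative: caps response length
    ),
)
print(result.text)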
4. Mistral AI (REST request)
Mistral provides a simple chat-completions REST API. You send a list of messages and receive a structured JSON response in return.
import requests; print(requests.post("https://api.mistral.ai/v1/chat/completions", headers={"Authorization":"Bearer YOUR_MISTRAL_API_KEY"}, json={"model":"mistral-tiny","messages":[{"role":"user","content":"Define fine-tuning"}]}).json()["choices"][0]["message"]["content"])
What it does: Posts a chat request to Mistral’s API and prints the assistant message.
Why it works: The endpoint accepts an OpenAI-style messages array and returns choices -> message -> content.
Check out the Mistral API reference and quickstart.
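In practice you would read the key from the environment rather than hard-coding it. A minimal sketch, assuming you have exported a MISTRAL_API_KEY variable:

import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('MISTRAL_API_KEY')}"},
    json={"model": "mistral-tiny",
          "messages": [{"role": "user", "content": "Define fine-tuning"}]},
    timeout=30,  # avoid hanging forever on a slow connection
)
print(resp.json()["choices"][0]["message"]["content"])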
5. Hugging Face Inference API
If you host a model or use a public one on Hugging Face, you can call it with a single POST. The text-generation task returns generated text in JSON.
import requests; print(requests.post("https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2", headers={"Authorization":"Bearer YOUR_HF_TOKEN"}, json={"inputs":"Write a haiku about data"}).json()[0]["generated_text"])
What it does: Sends a prompt to a hosted model on Hugging Face and prints the generated text.
Why it works: The Inference API exposes task-specific endpoints; for text generation, it returns a list with generated_text.
Documentation: Inference API and Text Generation task pages.
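The same endpoint also accepts an optional parameters object for the text-generation task. The sketch below uses illustrative values and assumes your token is exported as HF_TOKEN:

import os
import requests

resp = requests.post(
    "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2",
    headers={"Authorization": f"Bearer {os.getenv('HF_TOKEN')}"},
    json={"inputs": "Write a haiku about data",
          "parameters": {"max_new_tokens": 60, "temperature": 0.7}},  # illustrative values
)
print(resp.json()[0]["generated_text"])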
Local Model One-Liners
Running models on your machine gives you privacy and control. You avoid network latency and keep data local. The tradeoff is setup: you need the server running and a model pulled. The one-liners below assume you have already started the local service.
6. Ollama (Local Llama 3 or Mistral)
Ollama exposes a simple REST API on localhost:11434. Use /api/generate for prompt-style generation or /api/chat for chat turns.
import requests; print(requests.post("http://localhost:11434/api/generate", json={"model":"llama3","prompt":"What is vector search?"}).text)
What it does: Sends a generate request to your local Ollama server and prints the raw response text (by default, a stream of newline-delimited JSON chunks).
Why it works: Ollama runs a local HTTP server with endpoints like /api/generate and /api/chat. You must have the app running and the model pulled first. See official API documentation.
7. LM Studio (OpenAI-Compatible Endpoint)
LM Studio can serve local models behind OpenAI-style endpoints such as /v1/chat/completions. Start the server from the Developer tab, then call it like any OpenAI-compatible backend.
import requests; print(requests.post("http://localhost:1234/v1/chat/completions", json={"model":"phi-3","messages":[{"role":"user","content":"Explain embeddings"}]}).json()["choices"][0]["message"]["content"])
What it does: Calls a local chat completion and prints the message content.
Why it works: LM Studio exposes OpenAI-compatible routes alongside its own enhanced REST API, and recent releases add /v1/responses support. Check the docs if your local build uses a different route.
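Because the routes are OpenAI-compatible, you can also point the openai client at LM Studio instead of using raw requests. A minimal sketch, assuming the default port 1234 and a loaded phi-3 model:

from openai import OpenAI

# Local server: LM Studio ignores the API key, but the client requires a value
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
response = client.chat.completions.create(
    model="phi-3",
    messages=[{"role": "user", "content": "Explain embeddings"}],
)
print(response.choices[0].message.content)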
8. vLLM (Self-Hosted LLM Server)
vLLM provides a high-performance server with OpenAI-compatible APIs. You can run it locally or on a GPU box, then call /v1/chat/completions.
import requests; print(requests.post("http://localhost:8000/v1/chat/completions", json={"model":"mistral","messages":[{"role":"user","content":"Give me three LLM optimization tricks"}]}).json()["choices"][0]["message"]["content"])
What it does: Sends a chat request to a vLLM server and prints the first response message.
Why it works: vLLM implements OpenAI-compatible Chat and Completions APIs, so any OpenAI-style client or plain requests call works once the server is running. Check the documentation.
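The same OpenAI-client trick works for vLLM. This sketch assumes the server is running on port 8000 and serving a model registered under the name used below:

from openai import OpenAI

# vLLM does not require an API key unless the server was started with one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mistral",  # must match the model name the server was launched with
    messages=[{"role": "user", "content": "Give me three LLM optimization tricks"}],
)
print(response.choices[0].message.content)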
Handy Tricks and Tips
Once you know the basics of sending requests to LLMs, a few neat tricks make your workflow faster and smoother. These final two examples demonstrate how to stream responses in real-time and how to execute asynchronous API calls without blocking your program.
9. Streaming Responses from OpenAI
Streaming allows you to print each token as it is generated by the model, rather than waiting for the full message. It’s perfect for interactive apps or CLI tools where you want output to appear instantly.
from openai import OpenAI; [print(c.choices[0].delta.content or "", end="") for c in OpenAI().chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user","content":"Stream a poem"}], stream=True)]
What it does: Sends a prompt to GPT-4o-mini and prints tokens as they arrive, simulating a “live typing” effect.
Why it works: The stream=True flag in OpenAI’s API returns partial events. Each chunk contains a delta.content field, which this one-liner prints as it streams in.
Documentation: OpenAI Streaming Guide.
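The list comprehension keeps it on one line, but the same logic reads more naturally as a loop. Here is an equivalent sketch:

from openai import OpenAI

stream = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Stream a poem"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply in delta.content
    print(chunk.choices[0].delta.content or "", end="")
print()  # final newline once the stream ends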
10. Async Calls with httpx
Asynchronous calls let you query models without blocking your app, which makes them ideal for firing multiple requests simultaneously or integrating LLMs into web servers.
import asyncio, httpx; print(asyncio.run(httpx.AsyncClient().post("https://api.mistral.ai/v1/chat/completions", headers={"Authorization":"Bearer TOKEN"}, json={"model":"mistral-tiny","messages":[{"role":"user","content":"Hello"}]})).json()["choices"][0]["message"]["content"])
What it does: Posts a chat request to Mistral’s API asynchronously, then prints the model’s reply once complete.
Why it works: The httpx library supports async I/O, so network calls don’t block the main thread. This pattern is handy for lightweight concurrency in scripts or apps.
Documentation: Async Support.
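The real payoff of async is firing several requests at once. The sketch below reuses the same Mistral endpoint (key assumed in a MISTRAL_API_KEY environment variable) and sends two prompts concurrently with asyncio.gather:

import asyncio
import os

import httpx

async def ask(client: httpx.AsyncClient, prompt: str) -> str:
    # Post one chat request and return the assistant's reply text
    resp = await client.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.getenv('MISTRAL_API_KEY')}"},
        json={"model": "mistral-tiny",
              "messages": [{"role": "user", "content": prompt}]},
    )
    return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        # Both requests are in flight at the same time
        answers = await asyncio.gather(ask(client, "Hello"), ask(client, "Define embeddings"))
        for answer in answers:
            print(answer)

asyncio.run(main())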
Wrapping Up
Each of these one-liners is more than a quick demo; it’s a building block. You can turn any of them into a function, wrap them inside a command-line tool, or build them into a backend service. The same code that fits on one line can easily expand into production workflows once you add error handling, caching, or logging.
If you want to explore further, check the official documentation for detailed parameters like temperature, max tokens, and streaming options; each provider maintains a reliable reference, linked in the sections above.
The real takeaway is that Python makes working with LLMs both accessible and flexible. Whether you’re running GPT-4o in the cloud or Llama 3 locally, you can reach production-grade results with just a few lines of code.