In this article, you will learn how to build an end-to-end sentiment analysis pipeline using Scikit-LLM and open-source large language models served through the Groq API.
Topics we will cover include:
- How Scikit-LLM bridges classical scikit-learn pipelines with modern large language model API calls.
- How to set up Scikit-LLM with a Groq backend and prepare the IMDB Movie Reviews dataset for inference.
- How to build, run, and evaluate a zero-shot sentiment classification pipeline using scikit-learn-compatible syntax.
Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM
Introduction
Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text — for instance, TF-IDF frequencies or token embeddings — to feed into classical models such as logistic regression, ensembles, or support vector machines.
With the rise of large language models (LLMs), the rules of the game have changed somewhat: pre-trained models can now handle language tasks through zero-shot or few-shot reasoning, as one component of a machine learning workflow. Scikit-LLM is a Python library built for this purpose: it bridges the gap between classical machine learning and modern LLM API calls. In this article, we will use Scikit-LLM with models served by Groq to build an end-to-end pipeline for sentiment analysis (a domain-specific form of text classification), achieving reasonably fast inference with open-source models. From preprocessing to inference, we will work with a realistically sized dataset: the IMDB movie reviews dataset.
Prerequisites, Setup, and Obtaining the Dataset
To run the code shown in this tutorial, you'll need to install the Scikit-LLM library (pip install scikit-llm), along with pandas and scikit-learn, which the later snippets import.
Once installed, the first step is to configure the API credentials. In other words, we need to "connect" Scikit-LLM to an endpoint, namely an LLM API provider like Groq. Make sure you register on Groq and generate an API key at https://console.groq.com/keys, then copy and paste it into the code below:
from skllm.config import SKLLMConfig

# 1. Point Scikit-LLM to Groq's OpenAI-compatible endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-GOES-HERE")
By default, Scikit-LLM's GPT backend sends its requests to the OpenAI API. The set_gpt_url function redirects those requests to any OpenAI-compatible endpoint; here we point it at Groq's URL: https://api.groq.com/openai/v1.
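Hard-coding the key is fine for a quick experiment, but a common alternative is to read it from an environment variable. The sketch below is one way to do that; the variable name GROQ_API_KEY and the helpers load_api_key and configure_groq are assumptions made for illustration, not names prescribed by Scikit-LLM or Groq.

```python
import os


def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before running the pipeline."
        )
    return key


def configure_groq(api_key: str) -> None:
    """Point Scikit-LLM at Groq's OpenAI-compatible endpoint."""
    # Imported lazily so load_api_key can be used without Scikit-LLM installed.
    from skllm.config import SKLLMConfig

    SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")
    SKLLMConfig.set_openai_key(api_key)
```

With the variable exported in your shell, calling configure_groq(load_api_key()) would take the place of the two hard-coded configuration lines.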
The next stage of the process is importing the IMDB Movie Reviews dataset — which has about 50K instances — and preparing it for the sentiment analysis pipeline we will build. Instances consist of a text review labeled with a sentiment, which can be positive or negative (this is a binary classification problem, solvable with models like logistic regression, for instance).
For convenience, we read the dataset from a publicly available GitHub repository version in CSV format:
import pandas as pd
from sklearn.model_selection import train_test_split

# Fetch a large, realistically sized dataset (IMDB Movie Reviews, 50,000 rows)
# We read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
print("Downloading dataset...")
df = pd.read_csv(url)
print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free-tier API, sending 50,000 requests
# will likely trigger quota limits. Thus, we use 500 rows to demonstrate the pipeline.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]  # Labels are 'positive' or 'negative'

# Split into training (used only to register the zero-shot labels) and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Note that we sampled only 500 rows for demonstration purposes: every review we classify costs an API request, so running the full dataset would take much longer and would likely hit free-tier rate limits. You can freely change the sample size, n=500, to suit your needs.
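To get a feel for why the sample size matters, a rough lower bound on run time can be computed from a requests-per-minute cap. This is a back-of-the-envelope sketch: it assumes one API request per classified review, and the figure of 30 requests per minute is a placeholder rather than Groq's actual quota, so substitute the limit shown for your own account.

```python
def estimated_minutes(n_requests: int, requests_per_minute: int) -> float:
    """Lower bound on wall-clock minutes when the API caps requests per minute."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return n_requests / requests_per_minute


# ASSUMPTION: 30 requests/minute is a placeholder, not Groq's actual quota.
ASSUMED_RPM = 30

# 100 = test split of the 500-row sample; 50,000 = classifying the full dataset.
for n_reviews in (100, 50_000):
    minutes = estimated_minutes(n_reviews, ASSUMED_RPM)
    print(f"{n_reviews:>6} reviews -> at least {minutes:,.1f} min ({minutes / 60:.1f} h)")
```

Under that assumed cap, classifying the full dataset would take more than a day, while the 100-review test split finishes in minutes.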
Building the Sentiment Analysis Pipeline
Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model setup or training, inference, and evaluation. For a predictive, text-based scenario like ours, preprocessing typically entails cleaning and normalizing the text. Scikit-learn provides an elegant class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:
from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and normalizing whitespace."""
    series = pd.Series(texts).astype(str)
    # Replace HTML tags like <br /> with a space
    cleaned = series.str.replace(r"<[^>]+>", " ", regex=True)
    # Trim the ends and collapse repeated whitespace into single spaces
    cleaned = cleaned.str.strip().str.replace(r"\s+", " ", regex=True)
    return cleaned.tolist()

# Wrap the cleaning function so it can be used inside a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)
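To see what the two cleaning steps do to a single review, here is a standard-library mirror of the same regular expressions. The helper clean_one and the sample string are made up for illustration; the pipeline itself keeps using the pandas-based function.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
SPACE_PATTERN = re.compile(r"\s+")    # runs of whitespace


def clean_one(text: str) -> str:
    """Standard-library mirror of the two cleaning steps, for a single string."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return SPACE_PATTERN.sub(" ", without_tags.strip())


raw = "  A <b>great</b> film!<br /><br />Loved   it.  "
print(clean_one(raw))  # A great film! Loved it.
```

Replacing tags with a space rather than an empty string keeps words on either side of a <br /> from being glued together.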
Now we combine this preprocessing object with a model instance to create the Pipeline. Once defined, the pipeline orchestrates the whole process of preparing the data and passing it to the model at both the training and inference stages. Although we use the term "training", no weights are updated: we rely on a pre-trained model served by Groq for zero-shot classification, so fitting only tells the classifier which labels it may choose from.
from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # The custom_url:: prefix routes requests to the endpoint set via set_gpt_url;
    # llama-3.1-8b-instant is Groq's Llama 3.1 8B model
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For zero-shot classification, fit() doesn't train the LLM.
# It simply registers the unique labels present in y_train (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)
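The idea that fitting a zero-shot classifier only records labels can be made concrete with a toy stand-in. The class below is a conceptual sketch and not Scikit-LLM's implementation: ToyZeroShotClassifier, its prompt wording, and the keyword-based fake_llm stub are all hypothetical, and no API is called.

```python
from typing import Callable, Iterable, List


class ToyZeroShotClassifier:
    """Illustrative stand-in for a zero-shot LLM classifier (not Scikit-LLM's code)."""

    def __init__(self, ask_llm: Callable[[str], str]):
        # ask_llm is a hypothetical callable: prompt text in, raw model reply out.
        self.ask_llm = ask_llm

    def fit(self, X: Iterable[str], y: Iterable[str]) -> "ToyZeroShotClassifier":
        # No weights are updated: "fitting" only records the candidate labels.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X: Iterable[str]) -> List[str]:
        predictions = []
        for text in X:
            prompt = (
                f"Classify the text into one of {self.classes_}. "
                f"Answer with the label only.\nText: {text}"
            )
            reply = self.ask_llm(prompt).strip().lower()
            # Fall back to the first label if the reply is not a known label.
            predictions.append(reply if reply in self.classes_ else self.classes_[0])
        return predictions


def fake_llm(prompt: str) -> str:
    """Keyword-based stub standing in for a real API call."""
    review = prompt.split("Text: ", 1)[1].lower()
    return "positive" if "loved" in review else "negative"


clf = ToyZeroShotClassifier(fake_llm).fit(
    ["Loved it.", "Hated it."], ["positive", "negative"]
)
print(clf.classes_)                                      # ['negative', 'positive']
print(clf.predict(["Loved it.", "Dull and too long."]))  # ['positive', 'negative']
```

The training texts are never inspected in fit, which is why a small training split is enough in the zero-shot setting.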
Once we have run the pipeline to “fit” the model, we use it once more for inference. Both steps use familiar scikit-learn syntax. Besides evaluating the model pipeline’s performance, we also display a few example predictions:
from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")
# Run predictions through the pipeline (each review is sent to the LLM)
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on the held-out reviews
print("\n----- Classification Report -----")
print(classification_report(y_test, predictions))

# Display a few side-by-side examples
print("\n----- Sample Predictions -----")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate the review for display purposes
    short_review = review[:100] + "..."
    print(f"Review: {short_review}")
    print(f"Actual: {actual} | Predicted: {predicted}\n")
Here’s the detailed output — execution of the above code may take a few minutes to complete:
----- Classification Report -----
              precision    recall  f1-score   support

    negative       0.95      0.97      0.96        60
    positive       0.95      0.93      0.94        40

    accuracy                           0.95       100
   macro avg       0.95      0.95      0.95       100
weighted avg       0.95      0.95      0.95       100

----- Sample Predictions -----
Review: I saw mommy...well, she wasn't exactly kissing Santa Clause; he has his hand on her thigh and wicked...
Actual: negative | Predicted: negative

Review: This entry is certainly interesting for series fans (like myself), but yet it is mostly incomprehens...
Actual: negative | Predicted: negative

Review: Ingrid Bergman (Cleo Dulaine) has never been so beautiful. Gary Cooper as "Cleent" so perfectly cast...
Actual: positive | Predicted: positive
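The figures in the classification report follow from simple counts of correct and incorrect predictions. The sketch below recomputes them by hand; the counts (58 of 60 negatives and 37 of 40 positives classified correctly) are inferred from the rounded report rather than printed by the article, so treat them as an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Per-class precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# ASSUMPTION: counts inferred from the rounded report, not taken from the article.
neg_correct, neg_missed = 58, 2  # negatives predicted negative / positive
pos_correct, pos_missed = 37, 3  # positives predicted positive / negative

# A false positive for one class is a miss of the other class.
negative = precision_recall_f1(tp=neg_correct, fp=pos_missed, fn=neg_missed)
positive = precision_recall_f1(tp=pos_correct, fp=neg_missed, fn=pos_missed)
accuracy = (neg_correct + pos_correct) / 100

print("negative:", [f"{v:.4f}" for v in negative])
print("positive:", [f"{v:.4f}" for v in positive])
print("accuracy:", accuracy)
```

Rounded to two decimals, these values line up with the precision, recall, and F1 columns of the report.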
With 95% accuracy on the 100 test reviews, and without any task-specific training, our pipeline is doing a solid job at classifying sentiment. Well done!
Wrapping Up
This article walked you through defining an end-to-end pipeline for sentiment classification using Scikit-LLM and freely available, pre-trained LLMs from API endpoints like Groq. This is a versatile approach to using classic scikit-learn syntax in novel, LLM-driven machine learning applications.
