In this article, you will learn what cuML is, and how it can significantly speed up the training of machine learning models through GPU acceleration.
Topics we will cover include:
- The aim and distinctive features of cuML.
- How to prepare datasets and train a machine learning model for classification with cuML in a scikit-learn-like fashion.
- How to easily compare results with an equivalent conventional scikit-learn model, in terms of classification accuracy and training time.
Let’s not waste any more time.

A Hands-On Introduction to cuML for GPU-Accelerated Machine Learning Workflows
Image by Editor | ChatGPT
Introduction
This article offers a hands-on Python introduction to cuML, a library from RAPIDS AI (NVIDIA’s open-source suite of GPU-accelerated data science tools) for GPU-accelerated machine learning across widely used models. Together with its DataFrame-oriented sibling, cuDF, cuML has gained popularity among practitioners who need scalable, production-ready machine learning solutions.
The hands-on tutorial below uses cuML together with cuDF for GPU-accelerated dataset management in a DataFrame format. For an introduction to cuDF, check out this related article.
About cuML: An “Accelerated Scikit-Learn”
RAPIDS cuML (short for CUDA Machine Learning) is an open-source library that accelerates scikit-learn–style machine learning on NVIDIA GPUs. It provides drop-in replacements for many popular algorithms, often reducing training and inference times on large datasets — without major code changes or a steep learning curve for those familiar with scikit-learn.
Its three most distinctive features are:
- cuML follows a scikit-learn-like API, easing the transition from CPU to GPU for machine learning with minimal code changes
- It covers a broad set of techniques — all GPU-accelerated — including regression, classification, ensemble methods, clustering, and dimensionality reduction
- Through tight integration with the RAPIDS ecosystem, cuML works hand-in-hand with cuDF for data preprocessing, as well as with related libraries to facilitate end-to-end, GPU-native pipelines
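To make the first point concrete, here is a minimal sketch of the scikit-learn-style API using cuML’s KMeans on a synthetic dataset (the sample size, feature count, and number of clusters below are arbitrary choices for illustration):

from cuml.cluster import KMeans
from cuml.datasets import make_blobs

# Generate a synthetic dataset directly on the GPU (sizes chosen arbitrarily)
X, _ = make_blobs(n_samples=100_000, n_features=10, centers=5, random_state=42)

# Same constructor arguments and fit/predict workflow as sklearn.cluster.KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])

Aside from the import paths, this reads just like the scikit-learn equivalent, which is what makes incremental migration to the GPU straightforward.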
Hands-On Introductory Example
To illustrate the basics of cuML for building GPU-accelerated machine learning models, we will use a fairly large dataset that is easily accessible via a public URL in Jason Brownlee’s repository: the Adult income dataset. It is a slightly class-imbalanced dataset intended for binary classification, namely predicting whether an adult’s income is high (above $50K) or low (at or below $50K) based on a set of demographic and socio-economic features. Our goal, therefore, is to build a binary classification model.
IMPORTANT: To run the code below on Google Colab or a similar notebook environment, make sure you change the runtime type to GPU; otherwise, a warning will be raised indicating cuDF cannot find the specific CUDA driver library it utilizes.
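Before going further, a quick way to confirm that the runtime actually exposes a GPU is to query the NVIDIA driver. The snippet below assumes the nvidia-smi utility is available in the environment (it is on Colab GPU runtimes); in a notebook cell, running !nvidia-smi achieves the same thing.

import subprocess

# List the visible GPUs; this fails with FileNotFoundError on a CPU-only runtime
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)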
We start by importing the necessary libraries for our scenario:
import cudf
import cuml
from cuml.model_selection import train_test_split as gpu_train_test_split
from cuml.linear_model import LogisticRegression as cuLogReg
from IPython.display import display

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import time
Note that, in addition to the cuML modules and functions used to split the dataset and train a logistic regression classifier, we have also imported their classical scikit-learn counterparts. This is not mandatory for using cuML (which works independently of plain scikit-learn); we import the equivalent scikit-learn components purely for comparison later in the example.
Next, we load the dataset into a cuDF dataframe optimized for GPU usage:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/adult-all.csv"

# Column names (they are not included in the dataset's CSV file we will read)
cols = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"
]

df = cudf.read_csv(url, header=None, names=cols)
display(df.head())
Once the data is loaded, we identify the target variable and convert it into binary (1 for high income, 0 for low income):
df["income"] = df["income"].str.strip()
df["income"] = (df["income"] == ">50K").astype("int32")
This dataset combines numeric features with a slight predominance of categorical ones. Most scikit-learn models — including decision trees and logistic regression — do not natively handle string-valued categorical features, so they require encoding. A similar pattern applies to cuML; hence, we will select a small number of features to train our classifier and one-hot encode the categorical ones.
# Feature selection (let's say based on domain expertise!)
features = ["age", "education_num", "hours_per_week", "workclass", "occupation", "sex"]
X = df[features]
y = df["income"]

# One-hot encode categorical features
X_enc = cudf.get_dummies(X, drop_first=True)
print("Encoded feature shape:", X_enc.shape)
So far, we have used cuDF much as we would use pandas in a classical scikit-learn workflow; the cuML-specific parts come into play next.
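As an aside, if you ever need a CPU-side copy of one of these GPU objects (for instance, to use a plotting library that expects pandas), cuDF DataFrames can be converted explicitly. The snippet below is only an illustrative aside and is not required for the rest of the tutorial:

# Copy the one-hot encoded GPU DataFrame back to host memory as pandas
X_enc_cpu = X_enc.to_pandas()
print(type(X_enc_cpu))   # <class 'pandas.core.frame.DataFrame'>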
Now comes the interesting part. We will split the dataset into training and test sets and train a logistic regression classifier twice, using both CUDA GPU (cuML) and standalone scikit-learn. We will then compare both the classification accuracy and the time taken to train each model. Here’s the complete code for the model training and comparison:
# MODEL 1: GPU (cuML) train-test split and training
t0 = time.time()
X_train, X_test, y_train, y_test = gpu_train_test_split(X_enc, y, test_size=0.2, random_state=42)

model_gpu = cuLogReg(max_iter=1000)
model_gpu.fit(X_train, y_train)
gpu_time = time.time() - t0

acc_gpu = model_gpu.score(X_test, y_test)
print(f"cuML Logistic Regression accuracy: {acc_gpu:.4f}, time: {gpu_time:.3f} sec")

# MODEL 2: Scikit-learn and Pandas-driven train-test split and model training
df_pd = pd.read_csv(url, header=None, names=cols)
df_pd["income"] = df_pd["income"].str.strip()
df_pd["income"] = (df_pd["income"] == ">50K").astype("int32")

X_pd = df_pd[features]
y_pd = df_pd["income"]
X_pd = pd.get_dummies(X_pd, drop_first=True)

t0 = time.time()
X_train_pd, X_test_pd, y_train_pd, y_test_pd = train_test_split(X_pd, y_pd, test_size=0.2, random_state=42)

model_cpu = LogisticRegression(max_iter=1000)
model_cpu.fit(X_train_pd, y_train_pd)
cpu_time = time.time() - t0

acc_cpu = model_cpu.score(X_test_pd, y_test_pd)
print(f"scikit-learn Logistic Regression accuracy: {acc_cpu:.4f}, time: {cpu_time:.3f} sec")
The results are quite interesting. They should look something like:
cuML Logistic Regression accuracy: 0.8014, time: 0.428 sec
scikit-learn Logistic Regression accuracy: 0.8097, time: 15.184 sec
As we can observe, the model trained with cuML achieved very similar classification performance to its classical scikit-learn counterpart, but it trained over an order of magnitude faster: about 0.5 seconds compared to roughly 15 seconds for the scikit-learn classifier. Your exact numbers will vary with hardware, drivers, and library versions.
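Since the dataset is class-imbalanced, accuracy alone does not tell the whole story. As an optional extra step (not part of the comparison above), the sketch below reuses scikit-learn’s classification_report on the cuML model’s predictions, copying them to host NumPy arrays first:

from sklearn.metrics import classification_report

# Predict on the GPU, then move predictions and labels back to host NumPy arrays
y_pred = model_gpu.predict(X_test)
print(classification_report(y_test.to_numpy(), y_pred.to_numpy()))

The per-class precision and recall figures reveal how well the minority (high-income) class is handled, which accuracy can mask.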
Wrapping Up
This article provided a gentle, hands-on introduction to the cuML library for enabling GPU-boosted construction of machine learning models for classification, regression, clustering, and more. Through a simple comparison, we showed how cuML can help build effective models with significantly enhanced training efficiency.