10 Python One-Liners Every Machine Learning Practitioner Should Know
Introduction
Developing machine learning systems entails a well-established lifecycle, consisting of a series of stages from data preparation and preprocessing to modeling, validation, deployment to production, and continuous maintenance. Needless to say, a significant amount of coding effort is involved across these stages, often in the Python language. But did you know that with a few tips and hacks, the Python language can help simplify code workflows, thereby turbocharging the overall process of building machine learning solutions?
This article presents 10 one-liners — single lines of code that undertake meaningful tasks compactly and efficiently — constituting practical ways to prepare, build, and validate machine learning systems. These one-liners are intended to help machine learning engineers, data scientists, and practitioners in general simplify and streamline the machine learning lifecycle processes.
The code examples below assume the prior definition of key variables like datasets, training and test subsets, models, and so on. Likewise, they assume that the necessary imports of classes, library modules, etc., have been made; these are omitted for the sake of clarity and to keep the focus on the one-liners being illustrated.
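For instance, a minimal setup under those assumptions might look like the following sketch; the dataset and variable names (df, X_train, and so on) are illustrative, and any tabular dataset would work equally well.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load an example dataset as a Pandas DataFrame named df
data = load_breast_cancer(as_frame=True)
df = data.frame
X, y = data.data, data.target

# Split into the training and test subsets assumed by the one-liners below
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)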
1. Downsampling a Large Dataset
Testing a machine learning workflow on a very large dataset is usually easier if a small subset can be sampled. This one-liner does precisely that: it downsamples 1000 instances from a full dataset contained in a Pandas DataFrame named df, without the need for an iterative control structure that would otherwise slow the sampling down.
df_small = df.sample(n=1000, random_state=42)
The efficiency gain is more notable when the original dataset is larger.
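As a side note, if a relative sample size is more convenient than a fixed count, the frac argument of sample() offers the same one-line convenience; this small variation is a sketch of that idea.

# Sample 1% of the rows rather than a fixed number of instances
df_small = df.sample(frac=0.01, random_state=42)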
2. Feature Scaling and Model Training Together
What could be more efficient than encapsulating one stage of the machine learning workflow into a single line of code? Of course, encapsulating two stages in just one line! A great example is this one-liner, which uses scikit-learn’s make_pipeline() function alongside fit() to define and apply a two-stage feature scaling and model training pipeline: all in a single, simple line of code.
pipe = make_pipeline(StandardScaler(), Ridge()).fit(X_train, y_train)
The above example uses a ridge regression model, hence the use of the Ridge class as the second argument in the pipeline.
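Because the returned object is an already fitted pipeline, it can be used right away; for example, this short usage sketch (assuming X_test and y_test are defined) scores it on held-out data.

# The pipeline applies the scaler fitted on the training data before scoring
r2 = pipe.score(X_test, y_test)  # for Ridge, score() returns the R^2 coefficient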
3. Simple Model Training on the Fly
Of course, another handy and very commonly used one-liner is the one that initializes and trains a specific type of machine learning model in the same instruction. Unlike the previous example that instantiated a pipeline object to encapsulate both the data scaling and model training stages, this seemingly less ambitious approach is preferred if you have an already preprocessed dataset and simply want to train a model directly without additional overhead, or when you want to instantiate multiple models for comparison and benchmarking.
clf = LogisticRegression().fit(X_train, y_train)
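Building on the benchmarking idea mentioned above, a dictionary comprehension can fit and score several candidate models in a couple of compact lines; this is a hypothetical sketch that assumes DecisionTreeClassifier has also been imported.

# Fit two candidate classifiers and collect their test accuracies
models = {name: m.fit(X_train, y_train) for name, m in
          [('logreg', LogisticRegression(max_iter=1000)), ('tree', DecisionTreeClassifier(random_state=42))]}
scores = {name: m.score(X_test, y_test) for name, m in models.items()}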
4. Model Hyperparameter Tuning
Chances are, you may have needed to manually set up some model hyperparameters, especially in highly customizable models like decision trees and ensembles. Using one hyperparameter setting or another can significantly affect model performance, and when the optimal settings are unknown, it is best to try multiple possible configurations and find the best one. Fortunately, this tuning or search process can also be implemented in a very compact fashion.
This example one-liner applies grid search, a common hyperparameter tuning strategy, to train three “versions” of a support vector machine model by using different values of C, the key hyperparameter in this family of models. The hyperparameter tuning process is performed alongside cross-validation to rigorously determine which of the trained model versions is the most promising; hence, we specify the number of cross-validation folds with cv=3.
best = GridSearchCV(model, {'C': [0.1, 1, 10]}, cv=3).fit(X_train, y_train).best_params_
The result returned is the best hyperparameter setting found.
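Beyond best_params_, the fitted GridSearchCV object also exposes the best cross-validated score and a model refitted on the full training set, which are often what you need next; a brief sketch:

search = GridSearchCV(model, {'C': [0.1, 1, 10]}, cv=3).fit(X_train, y_train)
print(search.best_score_)            # mean cross-validated score of the best setting
best_model = search.best_estimator_  # estimator refitted on all of X_train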
5. Cross-Validation Scoring
Speaking of cross-validation, here’s another handy one-liner that instantly evaluates the robustness of a previously trained machine learning model — i.e., its accuracy and ability to generalize to unseen data — using k-fold cross-validation. Recall that this approach averages evaluation results for all folds; hence, the arithmetic mean is applied at the end of the process:
score = cross_val_score(model, X, y, cv=5).mean()
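If the spread across folds matters as much as the average, a small extension of the same call reports both; a quick sketch:

scores = cross_val_score(model, X, y, cv=5)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")  # average and variability across folds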
6. Informative Predictions: Putting Together Class Probabilities and Class Predictions
In classification models, or classifiers, test instances are assigned to a class by calculating the probability of belonging to each possible class and then selecting the class with the highest probability. During this process, you may sometimes want to have a holistic view of both the class probabilities and the assigned class for every test instance.
This one-liner helps you do so by creating a DataFrame object that contains multiple class probability columns (one per class), plus a final column added via the assign() method that contains the assigned class. The code assumes you have previously trained a model for multi-class classification, for instance, a decision tree:
preds_df = pd.DataFrame(model.predict_proba(X_test), columns=model.classes_).assign(pred_class=model.predict(X_test))
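A handy follow-up is inspecting the model’s least confident predictions; this sketch (an illustrative extension, not part of the original one-liner) ranks rows by their top class probability.

# Rows with the lowest maximum probability are the most ambiguous cases
probs = preds_df.drop(columns='pred_class')
uncertain = preds_df.assign(confidence=probs.max(axis=1)).nsmallest(10, 'confidence')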
7. Predictions and ROC AUC Evaluation
There are multiple ways to evaluate a model by determining the ROC curve and the area under the curve (AUC), with the following one being arguably the most concise approach to directly obtain the AUC:
roc_auc = roc_auc_score(y_true, model.predict_proba(X_test)[:,1])
This example is for a binary classifier. The [:,1] slice selects the probabilities for the positive class (the second column) from the output of model.predict_proba(X_test).
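For multi-class classifiers, roc_auc_score also supports one-vs-rest averaging through its multi_class argument; a sketch of that variant, passing the full probability matrix rather than a single column:

# Multi-class variant: one-vs-rest averaging over the full probability matrix
roc_auc_ovr = roc_auc_score(y_true, model.predict_proba(X_test), multi_class='ovr')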
8. Getting Multiple Evaluation Metrics
Why not take advantage of Python’s multiple assignment capabilities to calculate several evaluation metrics for a classification model in one go? Here’s how to do it for precision, recall, and the F1 score.
precision, recall, f1 = precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred)
While there’s an alternative approach, the classification_report() function, to obtain these three metrics and print them in a tabular report, this one-liner might be preferred when you need direct access to the raw metric values for further use later on, e.g., for comparisons, debugging, etc.
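For reference, the report-style alternative mentioned above looks like this; it prints a formatted table instead of returning raw values.

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))  # per-class precision, recall, and F1 in tabular form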
9. Displaying Confusion Matrices as a DataFrame
Presenting the confusion matrix as a labeled DataFrame object, rather than just printing it, can significantly ease the interpretation of evaluation results, giving a glimpse of how predictions align with the true classes. This example does so for a binary classifier:
cm_df = pd.DataFrame(confusion_matrix(y_true, y_pred), index=['Actual 0','Actual 1'], columns=['Pred 0','Pred 1'])
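The same idea generalizes beyond the binary case: reusing the classifier’s own class labels avoids hardcoding the row and column names, as in this small variation.

labels = model.classes_
cm_df = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=labels), index=labels, columns=labels)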
10. Sorting Feature Importance
This last one-liner again makes use of Python’s built-in capabilities to make otherwise lengthy code very compact, particularly for populating a list iteratively. In this case, for a trained model like a random forest ensemble, we rank the feature names by their corresponding importance weights. This gives us a quick understanding of which features are most relevant for making predictions.
sorted_features = [f for _, f in sorted(zip(model.feature_importances_, feature_names), reverse=True)]
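If you want the importance values alongside the feature names rather than the names alone, an equally compact pandas alternative keeps both:

# A labeled, descending view of feature importances
importances = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)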
Wrapping Up
This article presented 10 one-liners — single lines of code designed to undertake meaningful tasks in a compact and efficient fashion — for machine learning practitioners: they offer practical shortcuts to prepare, train, and validate machine learning models.