As data scientists with Python programming skills, we use Scikit-Learn a lot. It's a machine learning package that is usually among the first taught to new users and can be used right through to production. However, much of what is taught is basic implementation, and Scikit-Learn contains many secrets that can improve our data workflow.
This article will discuss seven secrets from Scikit-Learn you probably didn’t know. Without further ado, let’s get into it.
1. Probability Calibration
Many classification models in machine learning provide a probability output for each class. The problem is that these probability estimates are not necessarily well calibrated, which means they do not reflect the actual likelihood of the outcome.
For example, your model might output a 95% probability for the "fraud" class, yet only 70% of those predictions turn out to be correct. Probability calibration aims to adjust the predicted probabilities so that they reflect the actual likelihood.
There are a few calibration methods, although the most common are sigmoid calibration (Platt scaling) and isotonic regression. The following code uses Scikit-Learn to calibrate a classifier.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

svc = SVC(probability=False)

# In recent Scikit-Learn versions the parameter is `estimator` (formerly `base_estimator`)
calibrated_svc = CalibratedClassifierCV(estimator=svc, method='sigmoid', cv=5)
calibrated_svc.fit(X_train, y_train)
probabilities = calibrated_svc.predict_proba(X_test)
You can change the model as long as it provides a probability or decision-function output. The method parameter lets you switch between "sigmoid" and "isotonic".
For example, here is a Random Forest classifier with isotonic calibration.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
calibrated_rf = CalibratedClassifierCV(estimator=rf, method='isotonic', cv=5)
calibrated_rf.fit(X_train, y_train)
probabilities = calibrated_rf.predict_proba(X_test)
If your model's probability estimates do not reflect reality, consider calibrating your classifier.
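To check whether calibration actually helped, you can compare the calibration curve and the Brier score of the raw and calibrated classifiers. Below is a minimal sketch of that check; it assumes a binary target and reuses the calibrated_rf model and train/test split from the examples above.

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Predicted probability of the positive class on the test set
probs = calibrated_rf.predict_proba(X_test)[:, 1]

# Fraction of actual positives vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
print(frac_pos)
print(mean_pred)

# A lower Brier score indicates better-calibrated probabilities
print("Brier score:", brier_score_loss(y_test, probs))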
2. Feature Union
The next secret we will explore is the feature union. If you don't know about it, FeatureUnion is a Scikit-Learn class that provides a way to combine multiple transformer objects into a single transformer.
It's a valuable class when we want to perform multiple transformations and feature extractions on the same dataset and use them in parallel for our machine learning modeling.
Let’s see how they work in the following code.
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("select_best", SelectKBest(k=1))
])

pipeline = Pipeline([
    ("features", combined_features),
    ("svm", SVC())
])

pipeline.fit(X_train, y_train)
In the code above, we combined two transformers, dimensionality reduction with PCA and selection of the single best feature with SelectKBest, into one transformer with FeatureUnion. Wrapping the union in a Pipeline lets it run as a single step of the modeling process.
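To confirm that the transformers really run in parallel and have their outputs concatenated, you can inspect the shape of the combined output. A minimal sketch, assuming the combined_features object and training data from above:

# PCA contributes 2 components and SelectKBest 1 feature, so we expect 3 columns
X_combined = combined_features.fit_transform(X_train, y_train)
print(X_combined.shape)  # (n_samples, 3)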
It's also possible to chain feature unions if you want finer control over feature manipulation and preprocessing. Here is an example of the previous method we discussed with an additional feature union.
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# First FeatureUnion
first_union = FeatureUnion([
    ("pca", PCA(n_components=5)),
    ("select_best", SelectKBest(k=5))
])

# Second FeatureUnion
second_union = FeatureUnion([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaled", StandardScaler())
])

pipeline = Pipeline([
    ("first_union", first_union),
    ("second_union", second_union),
    ("svm", SVC())
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
It’s an excellent methodology for those who need extensive preprocessing at the beginning of the machine learning modeling process.
3. Feature Agglomeration
The next secret we will explore is feature agglomeration. It is a dimensionality reduction method from Scikit-Learn that uses hierarchical clustering to merge similar features.
Because it reduces dimensionality, feature agglomeration is useful when there are many features and some of them are strongly correlated with each other. Being based on hierarchical clustering, it merges features according to the linkage criterion and the distance metric we set.
Let’s see how it works in the following code.
from sklearn.cluster import FeatureAgglomeration

agglo = FeatureAgglomeration(n_clusters=10)
X_reduced = agglo.fit_transform(X)
We set the number of output features via the n_clusters parameter. Let's see how we change the distance metric to cosine.
# The default 'ward' linkage only supports Euclidean distance,
# so a non-Euclidean metric needs another linkage method
agglo = FeatureAgglomeration(metric='cosine', linkage='average')
We can also change the linkage method with the following code.
agglo = FeatureAgglomeration(linkage='average')
Then, we can also change the function used to aggregate the merged features into the new feature.
import numpy as np

agglo = FeatureAgglomeration(pooling_func=np.median)
Try experimenting with the feature agglomeration to acquire the best dataset for your modeling.
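If you want to see which original features were merged together, the fitted object exposes a labels_ attribute holding the cluster assignment of every feature. A minimal sketch, assuming the fitted agglo object from above:

from collections import defaultdict

# Cluster label assigned to each original feature
print(agglo.labels_)

# Group the original feature indices by the cluster they were merged into
groups = defaultdict(list)
for feature_idx, cluster in enumerate(agglo.labels_):
    groups[cluster].append(feature_idx)
print(dict(groups))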
4. Predefined Split
The predefined split is a Scikit-Learn class used for a custom cross-validation strategy. It lets us specify exactly which samples go into the training set and which into the test set. It's a valuable method when we want to split our data in a specific way and the standard K-fold or stratified K-fold is insufficient.
Let’s try out predefined split using the code below.
from sklearn.model_selection import PredefinedSplit, GridSearchCV

# -1 keeps a sample in training only, 0 assigns it to the test fold
test_fold = [-1 if i < 100 else 0 for i in range(len(X))]
ps = PredefinedSplit(test_fold)

# Replace 'parameter' with a real hyperparameter of your model
param_grid = {'parameter': [1, 10, 100]}
grid_search = GridSearchCV(model, param_grid, cv=ps)
grid_search.fit(X, y)
In the example above, we set the data-splitting schema by assigning the first hundred samples to training and the rest to the test fold.
The splitting strategy depends on your requirements; changing it is simply a matter of building the test_fold array differently.
test_fold = [-1 if i < 80 else 0 for i in range(len(X))]
ps = PredefinedSplit(test_fold)
This class gives you full control over the data-splitting process, so try it out to see if it offers benefits to you.
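You can verify that the split behaves as intended by iterating over it directly, since PredefinedSplit exposes the same split() interface as the other cross-validators. A minimal sketch, assuming the ps object built above:

for train_index, test_index in ps.split():
    print("train size:", len(train_index), "test size:", len(test_index))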
5. Warm Start
Have you ever trained a machine learning model on an extensive dataset and wanted to train it in batches? Or are you using online learning, which requires incremental training on streaming data? If you find yourself in these situations, you don't want to retrain the model from the beginning.
This is where a warm start could help you.
Warm start is a parameter available in many Scikit-Learn models that allows us to reuse the last trained solution when fitting the model again. This method is valuable when we don't want to retrain our model from scratch.
For example, the code below shows the warm start process when we add more trees to the model and retrain it without starting from the beginning.
from sklearn.ensemble import GradientBoostingClassifier

# 100 trees
model = GradientBoostingClassifier(n_estimators=100, warm_start=True)
model.fit(X_train, y_train)

# Add 50 trees
model.n_estimators += 50
model.fit(X_train, y_train)
It’s also possible to do batch training with the warm start feature.
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(max_iter=1000, warm_start=True)

# Train on first batch
model.fit(X_batch_1, y_batch_1)

# Continue training on second batch
model.fit(X_batch_2, y_batch_2)
Experiment with the warm start to keep your model up to date without sacrificing training time.
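A common pattern is to grow the ensemble in small increments and watch the validation score, stopping once it stops improving. A minimal sketch of that idea, assuming the same train/test split as above:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=50, warm_start=True)
model.fit(X_train, y_train)

# Grow the ensemble in steps of 25 trees, reusing the trees already fitted
for n in range(75, 201, 25):
    model.n_estimators = n
    model.fit(X_train, y_train)
    print(n, "trees, test accuracy:", model.score(X_test, y_test))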
6. Incremental Learning
And speaking of incremental learning, we can use Scikit-Learn to do that, too. As mentioned above, incremental learning — or online learning — is a machine learning training process in which we sequentially introduce new data.
It’s often used when our dataset is extensive, or the data is expected to come in over time. It’s also used when we expect data distribution to change over time, so constant retraining is required, but not from scratch.
In this case, several algorithms from Scikit-Learn provide incremental learning support through the partial_fit method, which allows model training to take place in batches.
Let’s look at a code example.
from sklearn.linear_model import SGDClassifier
import numpy as np

# All class labels must be passed to the first partial_fit call
classes = np.unique(y_train)

model = SGDClassifier()
for batch_X, batch_y in data_stream:
    model.partial_fit(batch_X, batch_y, classes=classes)
The incremental learning will keep running as long as the loop continues.
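A useful habit with incremental learning is to score each incoming batch before training on it ("test-then-train"), so you can track how the model holds up as the stream evolves. A minimal sketch, assuming the model, classes, and data_stream objects from above:

scores = []
fitted = False
for batch_X, batch_y in data_stream:
    # Score the new batch first, then learn from it
    if fitted:
        scores.append(model.score(batch_X, batch_y))
    model.partial_fit(batch_X, batch_y, classes=classes)
    fitted = True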
It’s also possible to perform incremental learning not only for model training but also for preprocessing.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
for batch_X, batch_y in data_stream:
    # Update the scaler's running statistics, then transform the batch
    scaler.partial_fit(batch_X)
    batch_X = scaler.transform(batch_X)
    model.partial_fit(batch_X, batch_y, classes=classes)
If your modeling requires incremental learning, try the partial_fit method from Scikit-Learn.
7. Accessing Experimental Features
Not every class and function from Scikit-Learn has been released in the stable version. Some are still experimental, and we must enable them before using them.
If we want to use such a feature, we need to check whether it is still experimental and import the corresponding enabler from Scikit-Learn's experimental module.
Let’s see an example code below.
# Enable the experimental feature
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
As of the time this article was written, the IterativeImputer class is still in the experimental phase, and we need to import the enabler before we can use the class.
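Once enabled, the imputer is used like any other transformer. Below is a minimal sketch on a small array with missing values, using the imputer object created above:

import numpy as np

X_missing = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0]])
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)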
Another feature that is still in the experimental phase is the halving search methodology.
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.model_selection import HalvingGridSearchCV
If you find useful features in Scikit-Learn but are unable to access them, they might be in the experimental phase, so try to access them by importing the enabler.
Conclusion
Scikit-Learn is a popular library that is used in many machine learning implementations. There are so many features in the library that there are undoubtedly many you are unaware of. To review, the seven secrets we covered in this article were:
- Probability Calibration
- Feature Union
- Feature Agglomeration
- Predefined Split
- Warm Start
- Incremental Learning
- Accessing Experimental Features
I hope this has helped!