10 NumPy One-Liners to Simplify Feature Engineering
When building machine learning models, most developers focus on model architectures and hyperparameter tuning. However, the real competitive advantage comes from crafting representative features that help your model understand the underlying patterns in your data.
While libraries like Pandas and Scikit-learn provide excellent tools for this task, NumPy’s vectorized operations can make feature engineering both faster and more elegant.
In this article, we’ll explore 10 powerful NumPy one-liners that can simplify your feature engineering workflow. These techniques use NumPy’s broadcasting, advanced indexing, and mathematical functions to create new features efficiently.
1. Robust Scaling with Median Absolute Deviation
Standard scaling works well for normally distributed data, but it breaks down when outliers are present. A single extreme value can completely skew your normalization, making your features less useful for machine learning models.
This is especially problematic in domains like finance or web analytics where outliers often contain important information. Median Absolute Deviation (MAD) scaling provides a robust alternative that can tolerate a large fraction of outliers without breaking down.
import numpy as np
# Sample data with outliers
data = np.array([1, 200, 3, 10, 4, 50, 6, 9, 3, 100])

# One-liner: Robust scaling using MAD
scaled = (data - np.median(data)) / np.median(np.abs(data - np.median(data)))
print(scaled)
Output:
[-1.44444444 42.77777778 -1.          0.55555556 -0.77777778  9.44444444 -0.33333333  0.33333333 -1.         20.55555556]
This one-liner works by first centering the data around the median (data - np.median(data)), then dividing by the MAD. The MAD is the median of the absolute deviations from the median. This gives you a robust measure of scale that outliers can’t corrupt, while still preserving the relative importance of extreme values.
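For comparison, a standard z-score on the same array shows the problem MAD scaling avoids; this is a small illustrative check, not part of the one-liner itself.

# Standard scaling for comparison: the outliers (200 and 100) inflate both the
# mean and the standard deviation, compressing the typical values toward zero
z_scored = (data - data.mean()) / data.std()
print(z_scored)

With standard scaling, the ordinary values end up bunched together in a narrow band near zero, while MAD scaling keeps them spread out on a meaningful scale.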
2. Binning Continuous Variables with Quantiles
Converting continuous variables into categorical bins is essential for many algorithms and can help capture non-linear relationships. Equal-width binning often creates imbalanced groups, especially with skewed data. With quantile-based binning, however, you get roughly the same number of samples in each bin.
This technique is particularly useful when you need to discretize variables for tree-based models or when creating interpretable features for business stakeholders.
# Sample continuous data (e.g., customer ages)
ages = np.array([18, 25, 35, 22, 45, 67, 23, 29, 34, 56, 41, 38, 52, 28, 33])

# One-liner: Create 4 equal-frequency bins
binned = np.digitize(ages, np.percentile(ages, [25, 50, 75]))
print(binned)
Output:
[0 0 2 0 3 3 0 1 2 3 2 2 3 1 1]
The np.percentile() function calculates the quantile boundaries (25th, 50th, 75th percentiles), then np.digitize() assigns each value to its appropriate bin, labeled 0 through 3. This approach automatically handles skewed distributions and creates meaningful groups regardless of the underlying data distribution.
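To confirm the bins really are close to equal-frequency, a quick count per bin (a small sanity check, not part of the original snippet) is enough:

# Count how many samples landed in each bin
print(np.bincount(binned))

With the 15 ages above, each of the four bins should end up with roughly 3 to 4 samples.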
3. Polynomial Features Without Loops
Polynomial features help capture non-linear relationships between variables. Traditional approaches often involve nested loops or complex library calls.
Creating polynomial features is important when you suspect interaction effects between variables, such as the relationship between temperature and humidity affecting crop yields, or how price and quality together influence customer satisfaction.
# Original features (e.g., temperature, humidity)
X = np.array([[20, 65], [25, 70], [30, 45], [22, 80]])

# One-liner: Generate degree-2 polynomial features
poly_features = np.column_stack([X[:, [i, j]].prod(axis=1) for i in range(X.shape[1]) for j in range(i, X.shape[1])])
print(poly_features)
Output:
[[ 400 1300 4225]
 [ 625 1750 4900]
 [ 900 1350 2025]
 [ 484 1760 6400]]
This list comprehension creates all possible polynomial combinations by iterating through column pairs and computing their products, and np.column_stack() assembles them into a feature matrix. The result includes both squared terms (x₁², x₂²) and interaction terms (x₁x₂), giving your model access to non-linear relationships.
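In practice, you usually keep the original columns alongside these terms; a simple way to assemble the full design matrix (an extra step beyond the one-liner above) is shown below.

# Stack the raw features next to the degree-2 polynomial terms
design_matrix = np.column_stack([X, poly_features])
print(design_matrix.shape)  # (4, 5): 2 original columns plus 3 polynomial columns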
4. Lag Features for Time Series
Time series analysis often requires features that capture temporal dependencies. Lag features let your model access historical values, which is essential for forecasting and anomaly detection.
Creating lag features usually requires loops and careful index management. This vectorized approach generates all desired lags simultaneously while handling edge cases automatically.
# Time series data (e.g., daily sales)
sales = np.array([100, 98, 120, 130, 74, 145, 110, 140, 65, 105, 135])

# One-liner: Create lag-1, lag-2, and lag-3 features
lags = np.column_stack([np.roll(sales, shift) for shift in [1, 2, 3]])[3:]
print(lags)
Output:
[[120  98 100]
 [130 120  98]
 [ 74 130 120]
 [145  74 130]
 [110 145  74]
 [140 110 145]
 [ 65 140 110]
 [105  65 140]]
The np.roll() function shifts array elements by the specified number of positions. The list comprehension creates multiple shifted versions, and np.column_stack() combines them into a feature matrix. Slicing with [3:] removes the initial rows where lagged values would be invalid, ensuring clean training data.
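If you are on NumPy 1.20 or newer, an equivalent lag matrix can also be built with sliding_window_view instead of np.roll(); this is an alternative sketch, not the approach used above.

from numpy.lib.stride_tricks import sliding_window_view

# Each length-3 window holds [t-3, t-2, t-1]; reversing the columns puts lag 1
# first, and dropping the last window keeps only rows with a known target value
lags_alt = sliding_window_view(sales, 3)[:-1, ::-1]
print(np.array_equal(lags, lags_alt))  # True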
5. One-Hot Encoding Without pandas
One-hot encoding is essential for handling categorical variables in machine learning. While pandas provides convenient methods, a pure NumPy implementation can be faster and more memory-efficient for large datasets.
This approach is particularly valuable when working with high-cardinality categorical features.
# Categorical data (e.g., product categories)
categories = np.array([0, 1, 2, 1, 0, 2, 3, 1])

# One-liner: One-hot encode
one_hot = (categories[:, None] == np.arange(categories.max() + 1)).astype(int)
print(one_hot)
Output:
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 1 0 0]]
This technique uses broadcasting to compare each category value against all possible categories. The [:, None] reshapes the array to enable broadcasting, and np.arange(categories.max() + 1) creates the comparison range. The boolean result is converted to integers, creating a binary matrix where each row represents one sample and each column represents one category.
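An equally compact alternative, assuming the categories are already dense integer codes starting at 0 (as they are here), is to index into an identity matrix:

# Row k of the identity matrix is the one-hot vector for category k
one_hot_alt = np.eye(categories.max() + 1, dtype=int)[categories]
print(np.array_equal(one_hot, one_hot_alt))  # True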
6. Distance Features from Coordinates
Geospatial features often require distance calculations from reference points. This is common in location-based models for delivery optimization, real estate pricing, or demographic analysis.
Computing distances efficiently is crucial when dealing with large datasets of coordinates, and this vectorized approach scales well to millions of data points.
# Coordinate data
locations = np.array([[40.7128, -74.0060], [34.0522, -118.2437],
                      [41.8781, -87.6298], [29.7604, -95.3698]])
reference = np.array([39.7392, -104.9903])

# One-liner: Calculate Euclidean distances from reference point
distances = np.sqrt(((locations - reference) ** 2).sum(axis=1))
print(distances)
Output:
[30.99959263 14.42201722 17.4917653  13.86111358]
This uses NumPy’s broadcasting to subtract the reference point from all locations simultaneously. The squared differences are summed along axis 1, and the square root gives Euclidean distances. For more precise geographic distances, you could extend this to use the haversine formula.
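As a sketch of that extension, assuming each row is a [latitude, longitude] pair in degrees and using a mean Earth radius of 6,371 km, a vectorized haversine distance looks like this:

# Haversine (great-circle) distance in kilometres from the reference point
lat1, lon1 = np.radians(locations).T
lat2, lon2 = np.radians(reference)
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
km_distances = 2 * 6371 * np.arcsin(np.sqrt(a))
print(km_distances)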
7. Interaction Features Between Variable Pairs
Feature interactions often help understand hidden patterns that individual features miss. This is common in domains like marketing (price × quality), medicine (drug interactions), or finance (volatility × volume).
Creating all pairwise interactions manually is tedious and error-prone. This vectorized approach generates them systematically and efficiently.
# Sample features (e.g., price, quality, brand_score)
features = np.array([[10, 8, 7], [15, 9, 6], [12, 7, 8], [20, 10, 9]])

# One-liner: Create all pairwise interactions
interactions = np.array([features[:, i] * features[:, j] for i in range(features.shape[1]) for j in range(i+1, features.shape[1])]).T
print(interactions)
Output:
[[ 80  70  56]
 [135  90  54]
 [ 84  96  56]
 [200 180  90]]
The nested comprehension generates all unique pairs of features (avoiding duplicates like feature 1 × feature 2 and feature 2 × feature 1). The .T transposes the result so each row represents a sample and each column represents an interaction term. This systematic approach ensures you don’t miss important feature combinations.
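The same interaction matrix can also be built without a Python-level loop by letting np.triu_indices() enumerate the column pairs; this variant is an alternative to the comprehension above:

# Indices of all unique column pairs (i < j), then one vectorized multiply
i, j = np.triu_indices(features.shape[1], k=1)
interactions_alt = features[:, i] * features[:, j]
print(np.array_equal(interactions, interactions_alt))  # True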
8. Rolling Window Statistics
Rolling statistics smooth noisy data and capture local trends. This is essential for time series analysis, signal processing, and creating features that represent recent behavior rather than historical averages.
Traditional approaches often involve loops or complex pandas operations. This convolution-based method is both elegant and efficient.
# Noisy signal data (e.g., stock prices, sensor readings)
signal = np.array([10, 27, 12, 18, 11, 19, 20, 26, 12, 19, 25, 31, 28])
window_size = 4

# One-liner: Create rolling mean features
rolling_mean = np.convolve(signal, np.ones(window_size)/window_size, mode='valid')
print(rolling_mean)
Output:
[16.75 17.   15.   17.   19.   19.25 19.25 20.5  21.75 25.75]
Convolution naturally implements rolling window operations. The np.ones(window_size)/window_size creates a uniform averaging kernel, and mode='valid' ensures the output only includes positions where the window fully overlaps with the data. This approach extends easily to other window functions like Gaussian or exponential weighting.
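For example, swapping the uniform kernel for a Gaussian one only changes the weights passed to np.convolve(); the sketch below assumes a kernel spanning roughly ±2 standard deviations across the window.

# Gaussian-weighted rolling mean: points near the window centre count more
kernel = np.exp(-0.5 * np.linspace(-2, 2, window_size) ** 2)
kernel /= kernel.sum()  # normalize so the weights sum to 1
gaussian_mean = np.convolve(signal, kernel, mode='valid')
print(gaussian_mean)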
9. Outlier Indicator Features
Rather than removing outliers, creating features that flag their presence can provide valuable information to your model. This is particularly useful in fraud detection, quality control, or any domain where anomalies are meaningful.
This approach preserves the information content of outliers while preventing them from dominating your model’s training process.
# Data with potential outliers (e.g., transaction amounts)
amounts = np.array([25, 30, 28, 32, 500, 29, 31, 27, 33, 26])

# One-liner: Create outlier indicator features
outlier_flags = ((amounts < np.percentile(amounts, 5)) | (amounts > np.percentile(amounts, 95))).astype(int)
print(outlier_flags)
Output:

[1 0 0 0 1 0 0 0 0 0]

This technique uses the 5th and 95th percentiles as outlier thresholds, flagging any values outside this range. The boolean result is converted to integers, creating binary features that indicate anomalous observations. You can adjust the percentile thresholds based on your domain knowledge and the acceptable false positive rate.
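A common alternative to fixed percentile cut-offs is Tukey's IQR rule, which flags anything more than 1.5 interquartile ranges outside the middle half of the data; a minimal sketch:

# IQR-based outlier flags (Tukey's fences)
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
iqr_flags = ((amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)).astype(int)
print(iqr_flags)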
10. Frequency Encoding for Categorical Variables
Frequency encoding replaces categorical values with their occurrence counts, which can be more informative than arbitrary label encoding. This is particularly useful when category frequency correlates with your target variable.
# Categorical data (e.g., product categories)
categories = np.array(['Electronics', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics', 'Home', 'Books'])

# One-liner: Frequency encode
unique_cats, counts = np.unique(categories, return_counts=True)
freq_encoded = np.array([counts[np.where(unique_cats == cat)[0][0]] for cat in categories])
print(freq_encoded)
Output:

[3 3 3 1 3 3 1 3]

This approach first uses np.unique() to find all unique categories and their counts. Then, for each original category value, it looks up the corresponding frequency count. The result is a numerical feature where each value represents how often that category appears in the dataset, providing the model with information about category popularity or rarity.
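The per-element lookup loop can also be eliminated by asking np.unique() for the inverse mapping, which scales better to large arrays; this is an equivalent alternative rather than the version shown above.

# return_inverse gives, for every element, the index of its unique value,
# so counts[inverse] is the frequency of each original element
unique_cats, inverse, counts = np.unique(categories, return_inverse=True, return_counts=True)
freq_encoded_alt = counts[inverse]
print(np.array_equal(freq_encoded, freq_encoded_alt))  # True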
Best Practices for Feature Engineering
When you’re creating new, more representative features, keep the following in mind:
Memory efficiency: When working with large datasets, consider the memory implications of feature engineering. Some operations can significantly increase your dataset size.
Feature selection: More features aren’t always better. Use techniques like correlation analysis or feature importance to select the most relevant engineered features.
Validation: Always validate your engineered features on a holdout set to ensure they improve model performance and don’t cause overfitting.
Domain knowledge: The best engineered features often come from understanding your domain. These NumPy techniques are tools to implement your domain insights efficiently.
Conclusion
These NumPy one-liners are practical solutions to common feature engineering challenges.
Whether you’re working with time series, geospatial data, or traditional tabular datasets, these techniques will help you build more efficient and maintainable feature engineering pipelines. The key is knowing when to use each approach and how to combine them to extract the maximum signal from your data.
Remember that the best feature engineering technique is the one that helps your model learn the patterns specific to your problem domain. Use these one-liners as building blocks, but always validate their effectiveness through proper cross-validation and domain expertise. Happy feature engineering!