Data science was known as statistical analysis before it got its name, because that was the only way to extract information from data. With recent advances in technology, machine learning models have been introduced to expand our capability to understand data. There are a lot of machine learning models that you can use, but you are not required to learn all of them. What matters most is learning what these new tools can do for you.
In this 7-part crash course, you will learn from examples how to perform a data science project with the most common machine learning models. This mini-course is focused on the core of data science. It is assumed that you gathered the data and made it ready to use. This mini-course is intended for practitioners who are already comfortable with programming in Python, and willing to learn about the common tools for data science such as pandas and scikit-learn. A machine learning engineer’s goal is to create the model; data scientists should aim for explaining the data using the machine learning model as the tool. You will see how these tools can help, and how to draw a quantitatively supported statement from the data you have. Let’s get started.
Next-Level Data Science (7-day Mini-Course)
Photo by geraldo stanislas. Some rights reserved.
Who Is This Mini-Course For?
Before we start, let’s ensure you are in the right place. The list below provides some general guidelines as to who this course was designed for. Don’t panic if you don’t match these points exactly; you might just need to brush up in one area or another to keep up.
- Developers who know how to write a little code. This means that it is not a big deal for you to get things done with Python, and you know how to set up the ecosystem on your workstation (a prerequisite). It does not mean you’re a wizard coder, but you’re not afraid to install packages and write scripts.
- Developers who know a little machine learning. This means you know about some basic machine learning models and are not afraid to use them. It does not mean that you are an expert in all models, but you can tell the strengths and weaknesses of a model.
- Developers who know a bit about data science tools. Using a Jupyter notebook is common in data science. Handling data in Python is easier if you use the library pandas. The list goes on. You are not required to be an expert in any library, but being comfortable invoking the different libraries and writing code to manipulate data is all you need.
This mini-course is not a textbook on data science. Rather, it is a project guideline that takes you step-by-step from a developer with minimal knowledge to a developer who can confidently demonstrate how a data science project can be done.
Mini-Course Overview
This mini-course is divided into 7 parts.
Each lesson was designed to take the average developer about 30 minutes. You might finish some much sooner; for others, you may choose to go deeper and spend more time.
You can complete each part as quickly or as slowly as you like. A comfortable schedule may be to complete one lesson per day over seven days. Highly recommended.
The topics you will cover over the next 7 lessons are as follows:
- Lesson 1: Getting the Data
- Lesson 2: Find the Numeric Columns for Linear Regression
- Lesson 3: Performing Linear Regression
- Lesson 4: Interpreting Factors
- Lesson 5: Feature Selection
- Lesson 6: Decision Tree
- Lesson 7: Random Forest and Probability
This is going to be a lot of fun.
You’ll have to do some work, though: a little reading, research and programming. You want to learn how to finish a data science project, right?
Post your results in the comments; I’ll cheer you on!
Hang in there; don’t give up.
Lesson 01: Getting the Data
The dataset we will use for this mini-course is the “All Countries Dataset” that is available on Kaggle:
This dataset describes almost all countries’ demographic, economic, geographic, health, and political data. The most well-known dataset of this type is the CIA World Factbook. Scraping the World Factbook would give you more comprehensive and up-to-date data, but using this dataset in CSV format saves you the trouble of building a web scraper.
After downloading this dataset from Kaggle (you may need to sign up for an account to do so), you will find the CSV file All Countries.csv. Let’s check this dataset with pandas.
```python
import pandas as pd

df = pd.read_csv("All Countries.csv")
df.info()
```
The above code will print a table to the screen, like the following:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 64 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   country              194 non-null    object
 1   country_long         194 non-null    object
 2   currency             194 non-null    object
 3   capital_city         194 non-null    object
 4   region               194 non-null    object
 5   continent            194 non-null    object
 6   demonym              194 non-null    object
 7   latitude             194 non-null    float64
 8   longitude            194 non-null    float64
 9   agricultural_land    193 non-null    float64
...
 62  political_leader     187 non-null    object
 63  title                187 non-null    object
dtypes: float64(48), int64(6), object(10)
memory usage: 97.1+ KB
```
In the above, you see the basic information of the dataset. For example, at the top, you can see that there are 194 entries (rows) in this CSV file, and the table tells you there are 64 columns (indexed by number 0 to 63). Some columns are numeric, such as latitude, and some are not, such as capital_city. The data type “object” in pandas usually means it is a string type. You can also see that there are some missing values: for example, agricultural_land has only 193 non-null values over 194 entries, meaning there is one row with a missing value in this column.
Let’s look at the dataset in more detail, such as by taking the first five rows as a sample.
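One way, for example, is the head() function:

```python
# show the first five rows of the dataset
print(df.head(5))
```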
This will show you the first five rows of the dataset in a tabular form.
Your Task
This is the basic exploration of a dataset. But using the head() function may not always be appropriate (e.g., when the input data are sorted). There is also a tail() function for a similar purpose. However, running df.sample(5) is usually more helpful, as it randomly samples 5 rows. Try this function. Also, as you can see from the output above, the columns are clipped to the screen width. How would you modify the above code to show all columns of the sample?
Hint: There is a to_string() function in pandas, and you can also adjust the general print option display.max_columns.
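A sketch of one possible answer (there are other ways to do it):

```python
# randomly sample 5 rows and print every column as a plain string
print(df.sample(5).to_string())

# alternatively, tell pandas not to clip the number of columns displayed
pd.set_option("display.max_columns", None)
print(df.sample(5))
```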
In the next lesson, you will see how to prepare your data for linear regression.
Lesson 02: Find the Numeric Columns for Linear Regression
Let’s jump into one of the simplest tasks: predicting the GDP of a country based on some other factors using linear regression. But before you use the data, it is important to make sure there is no bad data involved. For example, if you are going to use linear regression, all numbers must be valid so that addition and multiplication are possible. This means NaN (“not a number”) or infinity should not exist. Often, NaN is used to denote a missing value.
It is easy to fill in missing values in a dataset. For example, in pandas, you can fill all missing values (NaN) with zero:
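For example (assigning to a new DataFrame here, so the original df stays untouched for the steps below):

```python
# replace every NaN in the dataset with zero
df_filled = df.fillna(0)
print(df_filled.isnull().sum().sum())  # no missing values remain
```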
But why zero? Actually, the best value to fill in depends on the column. Sometimes a predefined value is suitable. Sometimes it is better to fill in the average of the other, non-missing data.
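For instance, here is a sketch of filling one column with the mean of its non-missing values, using agricultural_land (which we saw has a single missing value):

```python
# compute the mean over the non-missing values, then use it to fill the gap
col_mean = df["agricultural_land"].mean()
print(df["agricultural_land"].fillna(col_mean))
```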
Another approach is to ignore any columns with missing values. The set of all columns with no missing values can be found by counting the number of null or NaN values:
```python
print(df.isnull().sum().sort_values(ascending=False).to_string())
```
You will see that the above prints:
```
internally_displaced_persons       121
central_government_debt_pct_gdp     74
hiv_incidence                       61
energy_imports_pct                  56
...
urban_population_under_5m            0
rural_land                           0
urban_land                           0
country                              0
```
You can list out all columns with no missing values by looking for the index of those rows with value 0:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
print(df_null_count[df_null_count == 0].index)
```
The data used by linear regression must be numeric. We can find those columns by checking the columns reported by describe(), which computes the basic statistics of the numeric columns:
```python
print(df.describe().columns)
```
Combining these, we can list out all the columns that are both numeric and without missing values, using set intersection:
```python
print(list(set(df.describe().columns) & set(df_null_count[df_null_count == 0].index)))
```
Your Task
Look at the set of columns above: GDP is missing. This actually makes sense if you look at the data in the CSV file: there is one country without a GDP value (which makes sense, as that data may simply not be available). Can you find out which country that is? How can you find that out in pandas? Next, let’s find the columns with 3 or fewer missing values and then remove those countries with any missing values in those columns. How can you do that in Python? There should be a simple way that takes only a few lines to shortlist the pandas DataFrame.
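As a hint for the first question, here is a sketch using boolean indexing on the null mask:

```python
# the country (or countries) where the gdp column is missing
print(df[df["gdp"].isnull()]["country"])
```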
In the next lesson, you will run linear regression from the numeric columns that you short-listed above.
Lesson 03: Performing Linear Regression
Let’s start from the DataFrame. We will find the numeric columns with 3 or fewer missing values from the entire dataset:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
good_cols = list(set(df_null_count[df_null_count <= 3].index) & set(df.describe().columns))
print(good_cols)

df_cleaned = df.dropna(axis="index", how="any", subset=good_cols).copy()
print(df_cleaned)
```
Let’s focus on the columns listed in good_cols. How well do you think population can predict GDP? After all, a country with more people should have a higher GDP.
To find out, we can use scikit-learn to build a linear model:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
X = df_cleaned[["population"]]
Y = df_cleaned["gdp"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
This score (the last line printed) is the $R^2$ of the linear regression. The best value is 1.0, and if the predictor X is independent of Y, it will be 0.0. We got 0.34 here. Not a very high score. Let’s try to add some more columns to X to see if more predictors work better:
```python
model = LinearRegression(fit_intercept=True)
X = df_cleaned[["population", "rural_population", "median_age", "life_expectancy"]]
Y = df_cleaned["gdp"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
The $R^2$ score increases to 0.66, which is much better than before. You can also see the coefficients of the linear regression. The rural population has a negative coefficient, which means the larger the rural population, the lower the GDP of the country.
Your Task
Nothing stops you from using all numeric columns in the DataFrame for linear regression. Can you try that? What is the $R^2$ in this case? Which factors are positively correlated to GDP? How can you find that out?
In the next lesson, you will learn how to interpret the coefficients from the linear regression model.
Lesson 04: Interpreting Factors
Let’s try to run a linear regression for life expectancy with all factors. Remember that we found the usable columns by identifying the numeric ones with few or no missing values:
```python
df_null_count = df.isnull().sum().sort_values(ascending=False)
good_cols = list(set(df_null_count[df_null_count <= 3].index) & set(df.describe().columns))
print(good_cols)
```
This shows that the columns used are the following:
```
['renewable_energy_consumption_pct', 'rural_land', 'urban_population_under_5m', 'women_parliament_seats_pct',
 'electricity_access_pct', 'gdp', 'rural_population', 'birth_rate', 'population_female', 'fertility_rate',
 'urban_land', 'nitrous_oxide_emissions', 'press', 'democracy_score', 'life_expectancy', 'urban_population',
 'agricultural_land', 'longitude', 'methane_emissions', 'population', 'internet_pct', 'population_male',
 'hospital_beds', 'land_area', 'median_age', 'net_migration', 'latitude', 'death_rate', 'forest_area',
 'co2_emissions']
```
Then we can prepare the predictors as everything except the target (life expectancy):
```python
model = LinearRegression(fit_intercept=True)
X = df_cleaned[[x for x in good_cols if x != "life_expectancy"]]
Y = df_cleaned["life_expectancy"]
model.fit(X, Y)
print(model.coef_)
print(model.intercept_)
print(model.score(X, Y))
```
It is easier to identify which coefficient corresponds to which column by matching:
```python
for col, coef in zip(X.columns, model.coef_):
    print("%s: %.3e" % (col, coef))
```
Some coefficients are negative; those factors contribute negatively to life expectancy. For example, a higher death rate contributes negatively to life expectancy, which makes sense. Some coefficients are very small, such as net_migration, which is on the order of $10^{-6}$; you can essentially consider it to be zero, i.e., that feature has no effect on the target.
Your Task
Since some features have no effect, why don’t you remove them from the regression? How can you do that automatically? Hint: Write a loop to add the “best feature” in each iteration and compare the increase in $R^2$ score.
In the next lesson, you will learn how to find the best feature subset automatically.
Lesson 05: Feature Selection
In the previous lesson, you predicted life expectancy with all the factors available. Let’s refine the regression model to make it “explainable”: say, find the top 5 factors affecting life expectancy. There are many ways to select features. Sequential feature selection is one of them and probably the simplest to understand: greedily add the feature that most improves the score, one at a time, until the target number of features is reached. Let’s try this out:
```python
from sklearn.feature_selection import SequentialFeatureSelector

# Initializing the linear regression model
model = LinearRegression(fit_intercept=True)

# Perform sequential feature selection
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
X = df_cleaned[[x for x in good_cols if x != "life_expectancy"]]
Y = df_cleaned["life_expectancy"]
sfs.fit(X, Y)   # uses a default of cv=5
selected_feature = list(X.columns[sfs.get_support()])
print("Feature selected for highest predictability:", selected_feature)
```
These are the five best features to use, as suggested by the sequential feature selector. Let’s build the model again and look at the coefficients:
```python
model = LinearRegression(fit_intercept=True)
X = df_cleaned[selected_feature]
Y = df_cleaned["life_expectancy"]
model.fit(X, Y)
print(model.score(X, Y))
for col, coef in zip(X.columns, model.coef_):
    print("%s: %.3e" % (col, coef))
print("Intercept:", model.intercept_)
```
This shows:
```
0.9248375749867905
electricity_access_pct: 3.798e-02
birth_rate: 1.319e-01
press: 3.290e-01
median_age: 9.035e-01
death_rate: -1.118e+00
Intercept: 51.251243580962864
```
This says life expectancy increases with access to electricity, and also with median age? Of course: intuitively, a country with high life expectancy will have a high median age. That’s a weakness of the regression model: the algorithm cannot identify “data leakage,” in which an unreasonable predictor involved in the model renders the model unhelpful.
This is the art of data science: Clean up the input data carefully and smartly before running the algorithm to avoid garbage-in-garbage-out.
Let’s convert GDP, land area, and some other columns into their “per-capita” versions, and rerun the feature selector:
```python
per_capita = ["gdp", "land_area", "forest_area", "rural_land", "agricultural_land",
              "urban_land", "population_male", "population_female",
              "urban_population", "rural_population"]
for col in per_capita:
    df_cleaned[col] = df_cleaned[col] / df_cleaned["population"]

col_to_use = per_capita + [
    "nitrous_oxide_emissions", "methane_emissions", "fertility_rate",
    "hospital_beds", "internet_pct", "democracy_score", "co2_emissions",
    "women_parliament_seats_pct", "press", "electricity_access_pct",
    "renewable_energy_consumption_pct"]

model = LinearRegression(fit_intercept=True)
sfs = SequentialFeatureSelector(model, n_features_to_select=6)
X = df_cleaned[col_to_use]
Y = df_cleaned["life_expectancy"]
sfs.fit(X, Y)   # uses a default of cv=5
selected_feature = list(X.columns[sfs.get_support()])
print("Feature selected for highest predictability:", selected_feature)
```
Now let’s check the coefficients of the linear regression using the selected features. Rerunning the previous code, you will get:
```
0.7854421025889131
gdp: 1.076e-04
forest_area: -2.357e+01
fertility_rate: -2.155e+00
internet_pct: 3.464e-02
press: 3.032e-01
electricity_access_pct: 6.548e-02
Intercept: 66.44197315903226
```
This shows that GDP (per capita) is the strongest predictor of life expectancy (which makes sense, since a richer country should have better health care). Also, forest area is a negative factor for life expectancy, or it may be an indicator of urbanization. Press freedom and access to the internet and electricity are all positively correlated with life expectancy, since they reflect how well developed the society is.
Your Task
This lesson shows you that data science is not a robotic process; you need intuition to handle and preprocess the data to make the model work better. One thing we didn’t do here is normalize the data before regression: GDP per capita is a dollar amount while other factors are percentages, which causes the exaggerated disparity in the resulting coefficients. Can you try to rescale these factors and rerun the code above? Does it change the feature set selected? Does it change the $R^2$ score of the linear regression model?
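As a starting point, one way to rescale is scikit-learn’s StandardScaler; this is only a sketch, and other scalers would work too:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# standardize every predictor to zero mean and unit variance,
# keeping the result as a DataFrame with the original column names
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
```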
In the next lesson, you will learn about decision trees.
Lesson 06: Decision Tree
If linear regression is the first model a data scientist would try on any task, the decision tree would be the second. It is another model that is simple and easy to understand, and it works better on a different class of problems: classification.
Let’s try to understand whether countries in the Northern and Southern Hemispheres are different. First, we need to create a label in the dataset:
```python
df_cleaned["north"] = df_cleaned["latitude"] > 0
```
Now, let’s train a simple decision tree model as a classifier for that new column, based on the selected columns we used in the previous lesson. In scikit-learn, the syntax is almost identical to linear regression:
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
The score of a decision tree classifier is the mean accuracy. Before we discuss this accuracy, let’s see how many countries fall on each side of the equator in this dataset.
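One way is to count the values of the north label created above:

```python
# how many countries are north vs. south of the equator
print(df_cleaned["north"].value_counts())
```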
You get:
```
north
True     147
False     40
Name: count, dtype: int64
```
If there were an equal number of countries in the Northern and Southern Hemispheres, a random guess would give 50% accuracy. Here the data is imbalanced: if the model always predicts Northern Hemisphere, the accuracy will be about 78% (147 out of 187). Therefore, this model is only slightly better than a wild guess.
It doesn’t mean the model is useless. Here we used the model to show that the features are not strong enough to classify a country; in other words, there is no significant difference between countries in the Northern and Southern Hemispheres if we just look at these features.
Your Task
You can actually visualize the decision tree to see which factors are used. Scikit-learn has a plot function for that, but using the Python module dtreeviz is better. Try out the code below. Which factors are used in the model?
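Here is a sketch of how that can look. The first part uses scikit-learn’s own plot_tree(); the second assumes the dtreeviz 2.x API, so the exact call may differ with your installed version:

```python
import matplotlib.pyplot as plt
from sklearn import tree

# scikit-learn's built-in plot: each node shows the feature it splits on
tree.plot_tree(model, feature_names=list(X.columns),
               class_names=["south", "north"], filled=True)
plt.show()

# dtreeviz (assuming its 2.x API) gives a richer, annotated visualization;
# refit on integer labels to keep the class handling simple
import dtreeviz

model_int = tree.DecisionTreeClassifier(max_depth=3).fit(X, Y.astype(int))
viz = dtreeviz.model(model_int, X_train=X, y_train=Y.astype(int),
                     feature_names=list(X.columns),
                     target_name="north", class_names=["south", "north"])
viz.view()
```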
In the next lesson, you will expand a decision tree into a random forest.
Lesson 07: Random Forest and Probability
If you have tried a decision tree, you can replicate the tree into a forest to improve the accuracy. There are many ways to replicate a tree into a forest. For example, you can train multiple trees using a resampled dataset (i.e., pick a subset of rows randomly for each tree). You can also train trees using a random subset of features (i.e., columns).
Building a random forest would be trivial if you do not want too much fine-tuning:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=5, max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
This shows that using 5 trees instead of 1 deteriorates the accuracy slightly. That is the nature of a random forest: each tree does not use all the data for training, so there is no guarantee that a random forest will be better than a decision tree. But it also confirms what we learned before: there is probably not much difference between countries in the Northern and Southern Hemispheres.
Visualizing a random forest requires visualizing each tree one by one. You can find the decision trees of the forest in the list model.estimators_.
The random forest created above is an ensemble of decision trees that “vote” for the final result. Scikit-learn has another implementation that builds the forest using a gradient boosting algorithm. You don’t need to know the difference in detail because the function syntax is the same:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=5, max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"]
model.fit(X, Y)
print(model.score(X, Y))
```
While decision trees and random forests are used as classifiers in this tutorial, the models do not return a clear-cut classification result. Especially in the case of GradientBoostingClassifier, the underlying algorithm assumes a numerical output. Therefore, the native output of the model is the probability of each predicted class. You can find the probabilities this way:
```python
print(model.predict_proba(X))
```
This gives you a row of probabilities for each row of input. Normally you care about the class with the highest probability, which you can get with predict():
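For example, the following prints the predicted class for each country:

```python
# True means the model predicts the country is in the Northern Hemisphere
print(model.predict(X))
```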
You can now tell how confident, on average, the model is in predicting whether a country is in the Northern or Southern Hemisphere by computing the average probability of its predictions:
```python
import numpy as np

print(np.mean(model.predict_proba(X)[range(len(X)), model.predict(X).astype(int)]))
```
The above picks the predicted class from the model, matches it with the corresponding probability value, and then calculates the average. You now have an argument that the model sees no difference between the Northern and Southern Hemispheres, because the value above is not any better than a wild guess.
Your Task
Scikit-learn is not the go-to library for gradient boosting classifiers. The more common library of choice is XGBoost. How would you rewrite the classifier above with XGBoost? How do you set the hyperparameters n_estimators and max_depth in the case of XGBoost?
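As a starting point, here is a sketch using XGBoost’s scikit-learn-compatible wrapper, XGBClassifier; note that it expects numeric class labels, hence the astype(int):

```python
from xgboost import XGBClassifier

# same hyperparameters as the scikit-learn version above
model = XGBClassifier(n_estimators=5, max_depth=3)
X = df_cleaned[col_to_use]
Y = df_cleaned["north"].astype(int)
model.fit(X, Y)
print(model.score(X, Y))
```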
This was the final lesson.
The End! (Look How Far You Have Come)
You made it. Well done!
Take a moment and look back at how far you have come.
- You discovered how scikit-learn can help you finish a data science project.
- You learned how to use machine learning models to interpret data.
- You experimented with linear regression and decision tree models, and saw how simple models like these are still useful.
Don’t make light of this; you have come a long way in a short time. This is just the beginning of your data science journey. Keep practicing and developing your skills.
Summary
How did you do with the mini-course?
Did you enjoy this crash course?
Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.