Data Preparation for ML#
Before training any ML model, you will have to invest some time in data preparation. There are usually two steps:
Data Cleaning is largely task-agnostic and usually has to happen before any training. This step should be applied iteratively during EDA.
Data Transformations are usually specific to the ML task/algorithm. The goal is to convert the data into a form suitable for that task/algorithm.
Let’s start by importing packages:
import pandas as pd
import numpy as np
from sklearn import preprocessing, impute
Data Cleaning#
This section shows some techniques for dealing with noisy datasets. In general, there are two approaches:
Throwing noisy instances away
Applying techniques to the data that reduce the noise
Usually, throwing data away is not the best option: acquiring data (and additional samples) is expensive, so we try to salvage as much as possible. Of course, salvaging only goes so far, and in many cases some samples are beyond recovery.
Handling Duplications#
In many cases, the data handed to us contains duplicated instances (for example, a single measurement recorded several times). Such duplicates can introduce unwanted biases.
df = pd.DataFrame(
{
'column 1': ['Looping'] * 3 + ['Functions'] * 4,
'column 2': [10, 10, 22, 23, 23, 24, 24]
}
)
df
column 1 | column 2 | |
---|---|---|
0 | Looping | 10 |
1 | Looping | 10 |
2 | Looping | 22 |
3 | Functions | 23 |
4 | Functions | 23 |
5 | Functions | 24 |
6 | Functions | 24 |
df.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
df.drop_duplicates()
column 1 | column 2 | |
---|---|---|
0 | Looping | 10 |
2 | Looping | 22 |
3 | Functions | 23 |
5 | Functions | 24 |
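By default, duplicates are detected across all columns and the first occurrence is kept. As a quick sketch, the subset and keep arguments (both standard pandas options) let you restrict the comparison to selected columns and keep a different occurrence:
# consider only 'column 1' when looking for duplicates, and keep the last occurrence
df.drop_duplicates(subset=['column 1'], keep='last')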
Handling missing or incorrect data#
We often receive noisy datasets containing incorrect or missing measurements. There are many ways to approach this problem.
Dropping instances or attributes#
As discussed earlier, removing information should be done cautiously; let’s review some examples:
df = pd.DataFrame(
data=np.arange(15, 30).reshape(5, 3),
index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
columns=['store1', 'store2', 'store3']
)
df
store1 | store2 | store3 | |
---|---|---|---|
apple | 15 | 16 | 17 |
banana | 18 | 19 | 20 |
kiwi | 21 | 22 | 23 |
grapes | 24 | 25 | 26 |
mango | 27 | 28 | 29 |
Let’s add some noise:
df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc to avoid chained assignment
df
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 | NaN |
banana | 18.0 | 19.0 | 20.0 | NaN | NaN |
kiwi | 21.0 | 22.0 | 23.0 | NaN | NaN |
grapes | 24.0 | 25.0 | 26.0 | NaN | NaN |
mango | 27.0 | 28.0 | 29.0 | NaN | NaN |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 | NaN |
oranges | NaN | NaN | NaN | NaN | NaN |
df.isnull()
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | False | False | False | False | True |
banana | False | False | False | True | True |
kiwi | False | False | False | True | True |
grapes | False | False | False | True | True |
mango | False | False | False | True | True |
watermelon | False | False | False | False | True |
oranges | True | True | True | True | True |
df.notnull()
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | True | True | True | True | False |
banana | True | True | True | False | False |
kiwi | True | True | True | False | False |
grapes | True | True | True | False | False |
mango | True | True | True | False | False |
watermelon | True | True | True | True | False |
oranges | False | False | False | False | False |
df.isnull().sum()
store1 1
store2 1
store3 1
store4 5
store5 7
dtype: int64
df['store4'].dropna()
apple 20.0
watermelon 18.0
Name: store4, dtype: float64
df.dropna()
store1 | store2 | store3 | store4 | store5 |
---|---|---|---|---|
(The result is an empty DataFrame: every row contains at least one NaN, because store5 has no observed values at all.)
df.dropna(how='all')
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 | NaN |
banana | 18.0 | 19.0 | 20.0 | NaN | NaN |
kiwi | 21.0 | 22.0 | 23.0 | NaN | NaN |
grapes | 24.0 | 25.0 | 26.0 | NaN | NaN |
mango | 27.0 | 28.0 | 29.0 | NaN | NaN |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 | NaN |
df.dropna(how='all', axis=1)
store1 | store2 | store3 | store4 | |
---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 |
banana | 18.0 | 19.0 | 20.0 | NaN |
kiwi | 21.0 | 22.0 | 23.0 | NaN |
grapes | 24.0 | 25.0 | 26.0 | NaN |
mango | 27.0 | 28.0 | 29.0 | NaN |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 |
oranges | NaN | NaN | NaN | NaN |
df.dropna(thresh=5, axis=1)
store1 | store2 | store3 | |
---|---|---|---|
apple | 15.0 | 16.0 | 17.0 |
banana | 18.0 | 19.0 | 20.0 |
kiwi | 21.0 | 22.0 | 23.0 |
grapes | 24.0 | 25.0 | 26.0 |
mango | 27.0 | 28.0 | 29.0 |
watermelon | 15.0 | 16.0 | 17.0 |
oranges | NaN | NaN | NaN |
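The subset argument of dropna (a standard pandas option) is useful when only some columns are critical. A quick sketch that drops the rows missing a value in store4:
# drop only the rows that have no measurement for store4
df.dropna(subset=['store4'])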
Replacing values#
This is the easiest option if you have domain knowledge about how incorrect or missing values should be treated.
df = pd.DataFrame(
{
'column 1': [200., 3000., np.nan, 3000., 234., 444., np.nan, 332., 3332.],
'column 2': range(9)
}
)
df
column 1 | column 2 | |
---|---|---|
0 | 200.0 | 0 |
1 | 3000.0 | 1 |
2 | NaN | 2 |
3 | 3000.0 | 3 |
4 | 234.0 | 4 |
5 | 444.0 | 5 |
6 | NaN | 6 |
7 | 332.0 | 7 |
8 | 3332.0 | 8 |
df.replace(to_replace=np.nan, value=0.0)
column 1 | column 2 | |
---|---|---|
0 | 200.0 | 0 |
1 | 3000.0 | 1 |
2 | 0.0 | 2 |
3 | 3000.0 | 3 |
4 | 234.0 | 4 |
5 | 444.0 | 5 |
6 | 0.0 | 6 |
7 | 332.0 | 7 |
8 | 3332.0 | 8 |
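replace also accepts a dictionary, which is convenient when domain knowledge tells you how several distinct values should be rewritten. A small sketch (the assumption that 3000.0 is a data-entry error standing for 300.0 is purely illustrative):
# map a suspected data-entry error and missing values to more sensible numbers
df.replace({3000.: 300., np.nan: 0.0})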
Filling in missing data#
Let’s start with some basic techniques:
df = pd.DataFrame(
data=np.arange(15, 30).reshape(5, 3),
index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
columns=['store1', 'store2', 'store3']
)
df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc to avoid chained assignment
df
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 | NaN |
banana | 18.0 | 19.0 | 20.0 | NaN | NaN |
kiwi | 21.0 | 22.0 | 23.0 | NaN | NaN |
grapes | 24.0 | 25.0 | 26.0 | NaN | NaN |
mango | 27.0 | 28.0 | 29.0 | NaN | NaN |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 | NaN |
oranges | NaN | NaN | NaN | NaN | NaN |
Let’s fill missing values with a constant:
filled_df = df.fillna(0)
filled_df
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 | 0.0 |
banana | 18.0 | 19.0 | 20.0 | 0.0 | 0.0 |
kiwi | 21.0 | 22.0 | 23.0 | 0.0 | 0.0 |
grapes | 24.0 | 25.0 | 26.0 | 0.0 | 0.0 |
mango | 27.0 | 28.0 | 29.0 | 0.0 | 0.0 |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 | 0.0 |
oranges | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Filling with a constant can, however, introduce bias; compare the column means before and after filling:
df.mean()
store1 20.0
store2 21.0
store3 22.0
store4 19.0
store5 NaN
dtype: float64
filled_df.mean()
store1 17.142857
store2 18.000000
store3 18.857143
store4 5.428571
store5 0.000000
dtype: float64
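A quick way to reduce this bias is to fill each column with its own mean. A minimal pandas sketch; note that store5 stays NaN because it has no observed values to average:
# column-wise mean imputation: each store's missing entries get that store's mean
df.fillna(df.mean())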
There are also more sophisticated options. For example, let's fill each fruit (row) with the mean of its observed stores, using sklearn's SimpleImputer on the transposed frame:
imp = impute.SimpleImputer(strategy='mean')
# drop the all-NaN 'oranges' row; it has no observed values to average
# (SimpleImputer also drops the all-NaN 'oranges' column of df.T, so the shapes match)
filled_df = df.copy().dropna(how='all')
# transpose so each fruit becomes a column, impute per fruit, then transpose back
filled_df[:] = imp.fit_transform(df.T).T
filled_df
store1 | store2 | store3 | store4 | store5 | |
---|---|---|---|---|---|
apple | 15.0 | 16.0 | 17.0 | 20.0 | 17.0 |
banana | 18.0 | 19.0 | 20.0 | 19.0 | 19.0 |
kiwi | 21.0 | 22.0 | 23.0 | 22.0 | 22.0 |
grapes | 24.0 | 25.0 | 26.0 | 25.0 | 25.0 |
mango | 27.0 | 28.0 | 29.0 | 28.0 | 28.0 |
watermelon | 15.0 | 16.0 | 17.0 | 18.0 | 16.5 |
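sklearn also provides model-based imputers. Below is a hedged sketch using KNNImputer, which estimates each missing entry from the most similar rows; the all-NaN row and column are dropped first because they carry no information:
# keep only rows/columns that contain at least one observed value
knn_input = df.drop(columns='store5').dropna(how='all')
knn = impute.KNNImputer(n_neighbors=2)
# impute store4 from the two most similar fruits and restore the row/column labels
pd.DataFrame(knn.fit_transform(knn_input), index=knn_input.index, columns=knn_input.columns)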
Data Transformations#
Let’s review some techniques that should help you to transform data into a form that is required by the ML task/algorithm.
Discretization#
Discretization (also known as quantization or binning) partitions continuous features into discrete values. Some datasets and algorithms benefit from it.
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)
category
[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64, right]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]
pd.value_counts(category)
(118, 125] 5
(125, 135] 3
(135, 160] 3
(160, 200] 1
dtype: int64
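With pd.cut you choose the bin edges yourself; pd.qcut instead builds equal-frequency bins, so each bin holds roughly the same number of samples. A small sketch on the same height list:
# quartile-based binning: 4 bins with approximately equal counts
equal_freq = pd.qcut(height, q=4)
pd.Series(equal_freq).value_counts()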
Luckily, we do not have to reinvent the wheel, as sklearn already has some transformations in place for this case. For example:
df = pd.DataFrame(np.array([[-3., 5., 15], [0., 6., 14], [6., 3., 11]]))
df
0 | 1 | 2 | |
---|---|---|---|
0 | -3.0 | 5.0 | 15.0 |
1 | 0.0 | 6.0 | 14.0 |
2 | 6.0 | 3.0 | 11.0 |
preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit_transform(df)
array([[0., 1., 1.],
[1., 1., 1.],
[2., 0., 0.]])
Encoding Variables#
Often, features are given not as continuous values but as categorical ones. ML algorithms do not always support such a representation, so we might need to apply an encoding:
df = pd.DataFrame(
{
'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
'votes': range(6, 12, 1)
}
)
df
gender | votes | |
---|---|---|
0 | female | 6 |
1 | female | 7 |
2 | male | 8 |
3 | unknown | 9 |
4 | male | 10 |
5 | female | 11 |
pd.get_dummies(df['gender'])
female | male | unknown | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 1 | 0 |
5 | 1 | 0 | 0 |
Or with the help of sklearn:
preprocessing.OneHotEncoder().fit_transform(df[['gender']]).toarray()
array([[1., 0., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]])
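One-hot encoding adds one column per category. If a single integer column is enough (for example, for tree-based models), sklearn's OrdinalEncoder maps each category to an integer code instead; a short sketch:
# each category becomes an integer code (categories are sorted: female=0, male=1, unknown=2)
preprocessing.OrdinalEncoder().fit_transform(df[['gender']])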
Standardization (mean removal and variance scaling) and non-linear transformations#
Many ML algorithms have requirements on the distribution of input features. Transforming a feature's distribution can also significantly improve convergence speed or final performance. Let's review some techniques:
from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df.head()
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
Let's standardize the features (zero mean, unit variance):
preprocessing.StandardScaler().fit_transform(df)[:10]
array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],
[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
[-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],
[-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],
[-1.02184904, 1.24920112, -1.34022653, -1.3154443 ],
[-0.53717756, 1.93979142, -1.16971425, -1.05217993],
[-1.50652052, 0.78880759, -1.34022653, -1.18381211],
[-1.02184904, 0.78880759, -1.2833891 , -1.3154443 ],
[-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
[-1.14301691, 0.09821729, -1.2833891 , -1.44707648]])
Or scale features to a given range (here [0, 1]):
preprocessing.MinMaxScaler().fit_transform(df)[:10]
array([[0.22222222, 0.625 , 0.06779661, 0.04166667],
[0.16666667, 0.41666667, 0.06779661, 0.04166667],
[0.11111111, 0.5 , 0.05084746, 0.04166667],
[0.08333333, 0.45833333, 0.08474576, 0.04166667],
[0.19444444, 0.66666667, 0.06779661, 0.04166667],
[0.30555556, 0.79166667, 0.11864407, 0.125 ],
[0.08333333, 0.58333333, 0.06779661, 0.08333333],
[0.19444444, 0.58333333, 0.08474576, 0.04166667],
[0.02777778, 0.375 , 0.06779661, 0.04166667],
[0.16666667, 0.45833333, 0.08474576, 0. ]])
Or map features onto a different distribution (here, a uniform distribution via quantiles):
preprocessing.QuantileTransformer(n_quantiles=10).fit_transform(df)[:10]
array([[0.22222222, 0.83333333, 0.11111111, 0.11111111],
[0.11111111, 0.44444444, 0.11111111, 0.11111111],
[0.07407407, 0.66666667, 0.08333333, 0.11111111],
[0.05555556, 0.57575758, 0.22222222, 0.11111111],
[0.16666667, 0.88888889, 0.11111111, 0.11111111],
[0.33333333, 0.93055556, 0.24183007, 0.25423729],
[0.05555556, 0.77777778, 0.11111111, 0.23728814],
[0.16666667, 0.77777778, 0.22222222, 0.11111111],
[0.01851852, 0.33333333, 0.11111111, 0.11111111],
[0.11111111, 0.57575758, 0.22222222, 0. ]])
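Another option for making skewed features more Gaussian-like is a power transform. A brief sketch with PowerTransformer, which applies the Yeo-Johnson transform by default and also standardizes the result:
# power-transform each feature to reduce skewness, then scale to zero mean and unit variance
preprocessing.PowerTransformer().fit_transform(df)[:5]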
Exercises#
Add new features as combinations of the existing ones and run a simple EDA to analyze patterns#
Often it is helpful to add complexity to a model by considering nonlinear features of the input data. Review sklearn's polynomial features (preprocessing.PolynomialFeatures) and apply them to the Titanic dataset.
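As a hint, here is a minimal sketch of preprocessing.PolynomialFeatures on a toy array; applying it to the Titanic columns is left to you:
# degree-2 polynomial expansion of two features a and b: [1, a, b, a^2, a*b, b^2]
toy = np.array([[2., 3.], [4., 5.]])
preprocessing.PolynomialFeatures(degree=2).fit_transform(toy)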
# load the dataset
titanic_url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic_data = pd.read_csv(titanic_url)
# TODO: your answer here
Resources#
- scikit-learn User Guide, "6.3. Preprocessing data". URL: https://scikit-learn.org/stable/modules/preprocessing.html
- scikit-learn User Guide, "6.4. Imputation of missing values". URL: https://scikit-learn.org/stable/modules/impute.html
- PacktPublishing, "Hands-On Exploratory Data Analysis with Python". URL: https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python