Data Preparation for ML#

Before training any ML model, you will have to invest some time in data preparation. There are usually two steps:

  • Data Cleaning is fairly general and usually needs to be done before training. It should be applied iteratively during EDA.

  • Data Transformations are usually specific to the ML task/algorithm. The goal is to convert the data into a form suitable for that task/algorithm.

✏️ The examples are inspired by [Pac], [scia], and [scib].

Let’s start by importing packages:

import pandas as pd
import numpy as np
from sklearn import preprocessing, impute

Data Cleaning#

This section will show you some techniques for dealing with noisy datasets. In general, there are two approaches:

  • Throwing noisy instances away

  • Applying some techniques to data to reduce the noise

Usually, throwing data away is not the best option because acquiring data (and additional samples) is expensive, so we try to salvage as much of it as possible. Of course, this should only be done to some extent, because in many cases some samples are beyond recovery.

Handling Duplicates#

In many cases, the data handed to us contains duplicated instances (for example, a single measurement recorded several times). Such duplicates can introduce unwanted bias.

df = pd.DataFrame(
    {
        'column 1': ['Looping'] * 3 + ['Functions'] * 4,
        'column 2': [10, 10, 22, 23, 23, 24, 24]
    }
)
df
column 1 column 2
0 Looping 10
1 Looping 10
2 Looping 22
3 Functions 23
4 Functions 23
5 Functions 24
6 Functions 24
df.duplicated()
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
df.drop_duplicates()
column 1 column 2
0 Looping 10
2 Looping 22
3 Functions 23
5 Functions 24
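
drop_duplicates keeps the first occurrence by default; its keep and subset arguments control which rows count as duplicates. A small sketch using the same DataFrame:

# keep the last occurrence of each duplicated row instead of the first
df.drop_duplicates(keep='last')

# consider only 'column 1' when deciding what counts as a duplicate
df.drop_duplicates(subset=['column 1'])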

Handling missing or incorrect data#

Very often we receive a noisy dataset with some incorrect or missing measurements. There are many ways to approach the problem.

Dropping instances or attributes#

As discussed earlier, removing information should be done cautiously; let’s review some examples:

df = pd.DataFrame(
    data=np.arange(15, 30).reshape(5, 3),
    index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
    columns=['store1', 'store2', 'store3']
)
df
store1 store2 store3
apple 15 16 17
banana 18 19 20
kiwi 21 22 23
grapes 24 25 26
mango 27 28 29

Let’s add some noise:

df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc instead of chained indexing for assignment
df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
oranges NaN NaN NaN NaN NaN
df.isnull()
store1 store2 store3 store4 store5
apple False False False False True
banana False False False True True
kiwi False False False True True
grapes False False False True True
mango False False False True True
watermelon False False False False True
oranges True True True True True
df.notnull()
store1 store2 store3 store4 store5
apple True True True True False
banana True True True False False
kiwi True True True False False
grapes True True True False False
mango True True True False False
watermelon True True True True False
oranges False False False False False
df.isnull().sum()
store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64
df['store4'].dropna()
apple         20.0
watermelon    18.0
Name: store4, dtype: float64
df.dropna()
store1 store2 store3 store4 store5
df.dropna(how='all')
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
df.dropna(how='all', axis=1)
store1 store2 store3 store4
apple 15.0 16.0 17.0 20.0
banana 18.0 19.0 20.0 NaN
kiwi 21.0 22.0 23.0 NaN
grapes 24.0 25.0 26.0 NaN
mango 27.0 28.0 29.0 NaN
watermelon 15.0 16.0 17.0 18.0
oranges NaN NaN NaN NaN
df.dropna(thresh=5, axis=1)
store1 store2 store3
apple 15.0 16.0 17.0
banana 18.0 19.0 20.0
kiwi 21.0 22.0 23.0
grapes 24.0 25.0 26.0
mango 27.0 28.0 29.0
watermelon 15.0 16.0 17.0
oranges NaN NaN NaN
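
dropna can also be restricted to specific columns with the subset argument, for example keeping only the rows where store4 has a value:

# drop only the rows that are missing a value in 'store4'
df.dropna(subset=['store4'])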

Replacing values#

This is the easiest option if you have domain knowledge about how wrong or missing values should be treated.

df = pd.DataFrame(
    {
        'column 1': [200., 3000., np.nan, 3000., 234., 444., np.nan, 332., 3332.],
        'column 2': range(9)
    }
)
df
column 1 column 2
0 200.0 0
1 3000.0 1
2 NaN 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 NaN 6
7 332.0 7
8 3332.0 8
df.replace(to_replace=np.nan, value=0.0)
column 1 column 2
0 200.0 0
1 3000.0 1
2 0.0 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 0.0 6
7 332.0 7
8 3332.0 8
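
replace also accepts a dictionary, which is handy when domain knowledge says that different values need different treatment. A small sketch (treating 3332 as a suspicious sentinel value is purely illustrative):

# replace missing values with 0 and a suspicious sentinel value with NaN
df.replace({np.nan: 0.0, 3332.: np.nan})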

Filling in missing data#

Let’s start with some basic techniques:

df = pd.DataFrame(
    data=np.arange(15, 30).reshape(5, 3),
    index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
    columns=['store1', 'store2', 'store3']
)
df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc instead of chained indexing for assignment
df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
oranges NaN NaN NaN NaN NaN

Let’s fill missing values with a constant:

filled_df = df.fillna(0)
filled_df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 0.0
banana 18.0 19.0 20.0 0.0 0.0
kiwi 21.0 22.0 23.0 0.0 0.0
grapes 24.0 25.0 26.0 0.0 0.0
mango 27.0 28.0 29.0 0.0 0.0
watermelon 15.0 16.0 17.0 18.0 0.0
oranges 0.0 0.0 0.0 0.0 0.0

However, this might introduce some bias:

df.mean()
store1    20.0
store2    21.0
store3    22.0
store4    19.0
store5     NaN
dtype: float64
filled_df.mean()
store1    17.142857
store2    18.000000
store3    18.857143
store4     5.428571
store5     0.000000
dtype: float64
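
A simple way to avoid shifting the column statistics is to fill each column with its own mean:

# fill each column's missing entries with that column's mean
# (store5 is entirely missing, so it stays NaN)
df.fillna(df.mean())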

There are also more sophisticated options. For example, let’s use scikit-learn’s SimpleImputer to fill each missing value with the mean of its row (the mean across stores for that fruit):

imp = impute.SimpleImputer(strategy='mean')

# drop the all-NaN 'oranges' row; SimpleImputer silently drops all-NaN columns of df.T
filled_df = df.copy().dropna(how='all')
# transpose so each fruit becomes a column, impute column-wise, transpose back
filled_df[:] = imp.fit_transform(df.T).T
filled_df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 17.0
banana 18.0 19.0 20.0 19.0 19.0
kiwi 21.0 22.0 23.0 22.0 22.0
grapes 24.0 25.0 26.0 25.0 25.0
mango 27.0 28.0 29.0 28.0 28.0
watermelon 15.0 16.0 17.0 18.0 16.5
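
An even more sophisticated option is sklearn’s KNNImputer, which fills a missing entry from the values of the most similar rows. A minimal sketch, dropping the all-NaN row and column first since they carry no information:

# drop the fully missing row and column, then impute each remaining gap
# from the 2 most similar rows (Euclidean distance on the observed values)
knn = impute.KNNImputer(n_neighbors=2)

df_knn = df.drop(columns='store5').dropna(how='all')
df_knn[:] = knn.fit_transform(df_knn)
df_knn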

Data Transformations#

Let’s review some techniques that help transform data into the form required by the ML task/algorithm.

Discretization#

Discretization (also known as quantization or binning) partitions continuous features into discrete values. Some datasets may benefit from it.

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]

category = pd.cut(height, bins)
category
[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64, right]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]
pd.value_counts(category)
(118, 125]    5
(125, 135]    3
(135, 160]    3
(160, 200]    1
dtype: int64
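
pd.cut also accepts human-readable labels for the bins, and pd.qcut chooses the bin edges from quantiles so that each bin receives roughly the same number of samples. A small sketch (the label names are just illustrative):

# name the bins instead of showing the intervals
pd.cut(height, bins, labels=['short', 'average', 'tall', 'very tall'])

# four equal-frequency bins with edges computed from the data itself
pd.qcut(height, q=4)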

Luckily, we do not have to reinvent the wheel, as sklearn already has some transformations in place for this case. For example:

df = pd.DataFrame(np.array([[-3., 5., 15], [0., 6., 14], [6., 3., 11]]))
df
0 1 2
0 -3.0 5.0 15.0
1 0.0 6.0 14.0
2 6.0 3.0 11.0
preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit_transform(df)
array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])

Encoding Variables#

Often features are given not as continuous values but as categorical ones. Not all ML algorithms support such a representation directly, so we might need to apply an encoding:

df = pd.DataFrame(
    {
        'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
        'votes': range(6, 12, 1)
    }
)
df
gender votes
0 female 6
1 female 7
2 male 8
3 unknown 9
4 male 10
5 female 11
pd.get_dummies(df['gender'])
female male unknown
0 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
4 0 1 0
5 1 0 0

Or with the help of sklearn:

preprocessing.OneHotEncoder().fit_transform(df[['gender']]).toarray()
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
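
If a single integer code per category is enough (for example, for tree-based models), preprocessing.OrdinalEncoder can be used instead:

# each category becomes an integer code (female=0, male=1, unknown=2)
preprocessing.OrdinalEncoder().fit_transform(df[['gender']])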

Standardization (mean removal and variance scaling) and non-linear transformations#

Many ML algorithms have requirements on the distribution of input features. Transforming a feature’s distribution can also be beneficial, as it may significantly improve convergence speed or lead to better performance. Let’s review some techniques:

from sklearn.datasets import load_iris

iris_data = load_iris(as_frame=True)
df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Let’s standardize features:

preprocessing.StandardScaler().fit_transform(df)[:10]
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
       [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
       [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
       [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
       [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
       [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])

Or scale features to a given range:

preprocessing.MinMaxScaler().fit_transform(df)[:10]
array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ]])

Or map features to a different distribution:

preprocessing.QuantileTransformer(n_quantiles=10).fit_transform(df)[:10]
array([[0.22222222, 0.83333333, 0.11111111, 0.11111111],
       [0.11111111, 0.44444444, 0.11111111, 0.11111111],
       [0.07407407, 0.66666667, 0.08333333, 0.11111111],
       [0.05555556, 0.57575758, 0.22222222, 0.11111111],
       [0.16666667, 0.88888889, 0.11111111, 0.11111111],
       [0.33333333, 0.93055556, 0.24183007, 0.25423729],
       [0.05555556, 0.77777778, 0.11111111, 0.23728814],
       [0.16666667, 0.77777778, 0.22222222, 0.11111111],
       [0.01851852, 0.33333333, 0.11111111, 0.11111111],
       [0.11111111, 0.57575758, 0.22222222, 0.        ]])
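
When the data contains outliers, preprocessing.RobustScaler (which centers with the median and scales by the interquartile range) is often a safer choice, and preprocessing.PowerTransformer maps features towards a Gaussian-like shape; a brief sketch on the same iris features:

# scaling based on the median and interquartile range, robust to outliers
preprocessing.RobustScaler().fit_transform(df)[:5]

# Yeo-Johnson transform towards a Gaussian-like distribution
preprocessing.PowerTransformer().fit_transform(df)[:5]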

Exercises#

Try adding some new features as combinations of the available ones to a dataset, then do a simple EDA to analyze patterns.#

Often it is helpful to add complexity to a model by considering nonlinear features of the input data. Review the implementation of polynomial features and apply it to the Titanic dataset.
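
As a hint, here is a minimal sketch of what preprocessing.PolynomialFeatures does on a tiny made-up array; applying it to the Titanic columns is left to you:

# degree-2 expansion of two toy columns: [a, b] -> [1, a, b, a^2, a*b, b^2]
toy = np.array([[2., 3.], [4., 5.]])
preprocessing.PolynomialFeatures(degree=2).fit_transform(toy)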

# load the dataset
titanic_url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic_data = pd.read_csv(titanic_url)

# TODO: your answer here

Resources#

scia

6.3. Preprocessing data. URL: https://scikit-learn.org/stable/modules/preprocessing.html.

scib

6.4. Imputation of missing values. URL: https://scikit-learn.org/stable/modules/impute.html.

Pac

PacktPublishing. Hands-On Exploratory Data Analysis with Python, published by Packt. URL: https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python.