Data Preparation for ML#

Before training any ML model, you will have to invest some time in data preparation. There are usually two steps:

  • Data Cleaning is fairly general and usually needs to be done before training. It should be applied iteratively during EDA.

  • Data Transformations are usually specific to the ML task/algorithm. The goal is to convert the data into a form suitable for that task/algorithm.

✏️ The examples are inspired by [Pac], [scia], and [scib].

Let’s start by importing packages:

import pandas as pd
import numpy as np
from sklearn import preprocessing, impute

Data Cleaning#

This section will show you some techniques for dealing with noisy datasets. In general, there are two approaches:

  • Throwing noisy instances away

  • Applying some techniques to data to reduce the noise

Usually, throwing data away is not the best option because acquiring data (and additional samples) is expensive, so we try to salvage as much of it as possible. Of course, this should only be done to some extent, because in many cases some samples are beyond recovery.

Handling Duplicates#

In many cases, the data handed to us contains duplicated instances (for example, a single measurement recorded several times). Such duplicates can introduce unwanted bias.

df = pd.DataFrame(
    {
        'column 1': ['Looping'] * 3 + ['Functions'] * 4,
        'column 2': [10, 10, 22, 23, 23, 24, 24]
    }
)
df
column 1 column 2
0 Looping 10
1 Looping 10
2 Looping 22
3 Functions 23
4 Functions 23
5 Functions 24
6 Functions 24
df.duplicated()
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
df.drop_duplicates()
column 1 column 2
0 Looping 10
2 Looping 22
3 Functions 23
5 Functions 24
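
drop_duplicates keeps the first occurrence by default; its keep and subset arguments control which rows count as duplicates. A small sketch using the same DataFrame:

# keep the last occurrence of each duplicated row instead of the first
df.drop_duplicates(keep='last')

# consider only 'column 1' when deciding what counts as a duplicate
df.drop_duplicates(subset=['column 1'])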

Handling missing or incorrect data#

Very often we receive a noisy dataset with some incorrect or missing measurements. There are many ways to approach the problem.

Dropping instances or attributes#

As discussed earlier, removing information should be done cautiously; let’s review some examples:

df = pd.DataFrame(
    data=np.arange(15, 30).reshape(5, 3),
    index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
    columns=['store1', 'store2', 'store3']
)
df
store1 store2 store3
apple 15 16 17
banana 18 19 20
kiwi 21 22 23
grapes 24 25 26
mango 27 28 29

Let’s add some noise:

df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc instead of chained indexing for assignment
df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
oranges NaN NaN NaN NaN NaN
df.isnull()
store1 store2 store3 store4 store5
apple False False False False True
banana False False False True True
kiwi False False False True True
grapes False False False True True
mango False False False True True
watermelon False False False False True
oranges True True True True True
df.notnull()
store1 store2 store3 store4 store5
apple True True True True False
banana True True True False False
kiwi True True True False False
grapes True True True False False
mango True True True False False
watermelon True True True True False
oranges False False False False False
df.isnull().sum()
store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64
df['store4'].dropna()
apple         20.0
watermelon    18.0
Name: store4, dtype: float64
df.dropna()
store1 store2 store3 store4 store5
df.dropna(how='all')
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
df.dropna(how='all', axis=1)
store1 store2 store3 store4
apple 15.0 16.0 17.0 20.0
banana 18.0 19.0 20.0 NaN
kiwi 21.0 22.0 23.0 NaN
grapes 24.0 25.0 26.0 NaN
mango 27.0 28.0 29.0 NaN
watermelon 15.0 16.0 17.0 18.0
oranges NaN NaN NaN NaN
df.dropna(thresh=5, axis=1)
store1 store2 store3
apple 15.0 16.0 17.0
banana 18.0 19.0 20.0
kiwi 21.0 22.0 23.0
grapes 24.0 25.0 26.0
mango 27.0 28.0 29.0
watermelon 15.0 16.0 17.0
oranges NaN NaN NaN
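
dropna can also be restricted to specific columns with the subset argument, for example keeping only the rows where store4 has a value:

# drop only the rows that are missing a value in 'store4'
df.dropna(subset=['store4'])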

Replacing values#

This is the easiest option if you have domain knowledge about how wrong or missing values should be treated.

df = pd.DataFrame(
    {
        'column 1': [200., 3000., np.nan, 3000., 234., 444., np.nan, 332., 3332.],
        'column 2': range(9)
    }
)
df
column 1 column 2
0 200.0 0
1 3000.0 1
2 NaN 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 NaN 6
7 332.0 7
8 3332.0 8
df.replace(to_replace=np.nan, value=0.0)
column 1 column 2
0 200.0 0
1 3000.0 1
2 0.0 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 0.0 6
7 332.0 7
8 3332.0 8
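
replace also accepts a dictionary, which is handy when domain knowledge says that different values need different treatment. A small sketch (treating 3332 as a suspicious sentinel value is purely illustrative):

# replace missing values with 0 and a suspicious sentinel value with NaN
df.replace({np.nan: 0.0, 3332.: np.nan})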

Filling in missing data#

Let’s start with some basic techniques:

df = pd.DataFrame(
    data=np.arange(15, 30).reshape(5, 3),
    index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
    columns=['store1', 'store2', 'store3']
)
df['store4'] = np.nan
df.loc['watermelon'] = np.arange(15, 19)
df.loc['oranges'] = np.nan
df['store5'] = np.nan
df.loc['apple', 'store4'] = 20.  # use .loc instead of chained indexing for assignment
df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
oranges NaN NaN NaN NaN NaN

Let’s fill missing values with a constant:

filled_df = df.fillna(0)
filled_df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 0.0
banana 18.0 19.0 20.0 0.0 0.0
kiwi 21.0 22.0 23.0 0.0 0.0
grapes 24.0 25.0 26.0 0.0 0.0
mango 27.0 28.0 29.0 0.0 0.0
watermelon 15.0 16.0 17.0 18.0 0.0
oranges 0.0 0.0 0.0 0.0 0.0

However, this might introduce some bias:

df.mean()
store1    20.0
store2    21.0
store3    22.0
store4    19.0
store5     NaN
dtype: float64
filled_df.mean()
store1    17.142857
store2    18.000000
store3    18.857143
store4     5.428571
store5     0.000000
dtype: float64
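
A simple way to avoid shifting the column statistics is to fill each column with its own mean:

# fill each column's missing entries with that column's mean
# (store5 is entirely missing, so it stays NaN)
df.fillna(df.mean())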

There are also more sophisticated options. For example, let’s use scikit-learn’s SimpleImputer to fill each missing value with the mean of its row (the mean across stores for that fruit):

imp = impute.SimpleImputer(strategy='mean')

# drop the all-NaN 'oranges' row; SimpleImputer silently drops all-NaN columns of df.T
filled_df = df.copy().dropna(how='all')
# transpose so each fruit becomes a column, impute column-wise, transpose back
filled_df[:] = imp.fit_transform(df.T).T
filled_df
store1 store2 store3 store4 store5
apple 15.0 16.0 17.0 20.0 17.0
banana 18.0 19.0 20.0 19.0 19.0
kiwi 21.0 22.0 23.0 22.0 22.0
grapes 24.0 25.0 26.0 25.0 25.0
mango 27.0 28.0 29.0 28.0 28.0
watermelon 15.0 16.0 17.0 18.0 16.5
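
An even more sophisticated option is sklearn’s KNNImputer, which fills a missing entry from the values of the most similar rows. A minimal sketch, dropping the all-NaN row and column first since they carry no information:

# drop the fully missing row and column, then impute each remaining gap
# from the 2 most similar rows (Euclidean distance on the observed values)
knn = impute.KNNImputer(n_neighbors=2)

df_knn = df.drop(columns='store5').dropna(how='all')
df_knn[:] = knn.fit_transform(df_knn)
df_knn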

Data Transformations#

Let’s review some techniques that help transform data into the form required by the ML task/algorithm.

Discretization#

Discretization (also known as quantization or binning) partitions continuous features into discrete values. Some datasets may benefit from it.

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]

category = pd.cut(height, bins)
category
[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64, right]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]
pd.value_counts(category)
(118, 125]    5
(125, 135]    3
(135, 160]    3
(160, 200]    1
dtype: int64
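
pd.cut also accepts human-readable labels for the bins, and pd.qcut chooses the bin edges from quantiles so that each bin receives roughly the same number of samples. A small sketch (the label names are just illustrative):

# name the bins instead of showing the intervals
pd.cut(height, bins, labels=['short', 'average', 'tall', 'very tall'])

# four equal-frequency bins with edges computed from the data itself
pd.qcut(height, q=4)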

Luckily, we do not have to reinvent the wheel, as sklearn already has some transformations in place for this case. For example:

df = pd.DataFrame(np.array([[-3., 5., 15], [0., 6., 14], [6., 3., 11]]))
df
0 1 2
0 -3.0 5.0 15.0
1 0.0 6.0 14.0
2 6.0 3.0 11.0
preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit_transform(df)
array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])

Encoding Variables#

Often features are given not as continuous values but as categorical ones. Not all ML algorithms support such a representation directly, so we might need to apply an encoding:

df = pd.DataFrame(
    {
        'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
        'votes': range(6, 12, 1)
    }
)
df
gender votes
0 female 6
1 female 7
2 male 8
3 unknown 9
4 male 10
5 female 11
pd.get_dummies(df['gender'])
female male unknown
0 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
4 0 1 0
5 1 0 0

Or with the help of sklearn:

preprocessing.OneHotEncoder().fit_transform(df[['gender']]).toarray()
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
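
If a single integer code per category is enough (for example, for tree-based models), preprocessing.OrdinalEncoder can be used instead:

# each category becomes an integer code (female=0, male=1, unknown=2)
preprocessing.OrdinalEncoder().fit_transform(df[['gender']])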

Standardization (mean removal and variance scaling) and non-linear transformations#

Many ML algorithms have requirements on the distribution of input features. Transforming a feature’s distribution can also be beneficial, as it may significantly improve convergence speed or lead to better performance. Let’s review some techniques:

from sklearn.datasets import load_iris

iris_data = load_iris(as_frame=True)
df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Let’s standardize features:

preprocessing.StandardScaler().fit_transform(df)[:10]
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
       [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
       [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
       [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
       [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
       [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])

Or scale features to a given range:

preprocessing.MinMaxScaler().fit_transform(df)[:10]
array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ]])

Or map features to a different distribution:

preprocessing.QuantileTransformer(n_quantiles=10).fit_transform(df)[:10]
array([[0.22222222, 0.83333333, 0.11111111, 0.11111111],
       [0.11111111, 0.44444444, 0.11111111, 0.11111111],
       [0.07407407, 0.66666667, 0.08333333, 0.11111111],
       [0.05555556, 0.57575758, 0.22222222, 0.11111111],
       [0.16666667, 0.88888889, 0.11111111, 0.11111111],
       [0.33333333, 0.93055556, 0.24183007, 0.25423729],
       [0.05555556, 0.77777778, 0.11111111, 0.23728814],
       [0.16666667, 0.77777778, 0.22222222, 0.11111111],
       [0.01851852, 0.33333333, 0.11111111, 0.11111111],
       [0.11111111, 0.57575758, 0.22222222, 0.        ]])
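
When the data contains outliers, preprocessing.RobustScaler (which centers with the median and scales by the interquartile range) is often a safer choice, and preprocessing.PowerTransformer maps features towards a Gaussian-like shape; a brief sketch on the same iris features:

# scaling based on the median and interquartile range, robust to outliers
preprocessing.RobustScaler().fit_transform(df)[:5]

# Yeo-Johnson transform towards a Gaussian-like distribution
preprocessing.PowerTransformer().fit_transform(df)[:5]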

Exercises#

Try adding some new features as combinations of the available ones to a dataset, then do a simple EDA to analyze patterns.#

Often it is helpful to add complexity to a model by considering nonlinear features of the input data. Review the implementation of polynomial features and apply it to the Titanic dataset.
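
As a hint, here is a minimal sketch of what preprocessing.PolynomialFeatures does on a tiny made-up array; applying it to the Titanic columns is left to you:

# degree-2 expansion of two toy columns: [a, b] -> [1, a, b, a^2, a*b, b^2]
toy = np.array([[2., 3.], [4., 5.]])
preprocessing.PolynomialFeatures(degree=2).fit_transform(toy)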

# load the dataset
titanic_url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic_data = pd.read_csv(titanic_url)

# TODO: your answer here

Resources#

scia

6.3. Preprocessing data. URL: https://scikit-learn.org/stable/modules/preprocessing.html.

scib

6.4. Imputation of missing values. URL: https://scikit-learn.org/stable/modules/impute.html.

Pac

PacktPublishing. Hands-On Exploratory Data Analysis with Python, published by Packt. URL: https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python.