Lecture 4: Model Improvements and Pipelines#

In this lecture we are going to build upon the knowledge of the last week(s). We will combine data preprocessing steps with ML models and create data pipelines. In addition, a few tips for model improvement (class imbalance) will be discussed.

This week you will learn:

  • Something about Random Forests and deep learning.

  • How to combine data-preprocessing and modelling in a data pipeline.

  • How to apply different models to a practical problem.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from generate_dataset import generate_dataset

To illustrate where we currently are, I am reposting the schema below from last week. In this class we will zoom in further on classification.

[Figure: overview schema of ML methods from last week]

In this lecture we will discuss and apply several ML-algorithms.

The main models of this lecture are:

  • Logistic Regression (see previous lecture)

  • Random Forest (an ensemble of decision trees)

  • Neural Network

I will briefly discuss them.

The slide below was shown in last week’s presentation. A reminder that logistic regression is a linear classifier.

The second model is the Random Forest, which is basically a collection of decision trees. A classical decision tree is shown below. The question is “should I play badminton?”

A single decision tree is unstable. For this reason we grow many of them and then average the predictions. A Random Forest is not a linear classifier and can thus be very useful.

And finally we will use Neural Networks. The image below shows a single hidden layer, but the number of hidden layers can of course vary (deep learning). The mathematical idea is that we can approximate almost any function by adding hidden layers: the more layers we add, the more complex the functions we can represent. This allowed for lots of breakthroughs in many fields (e.g. computer vision and NLP) in the last ten years.
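To make the non-linearity point a bit more concrete, here is a small standalone illustration on a toy dataset (sklearn's make_moons, not part of this lecture's data), reusing the imports from above: a linear model typically struggles with the curved class boundary, while a random forest or a small neural network handles it well.

from sklearn.datasets import make_moons

# toy data with a curved (non-linear) class boundary
X_demo, y_demo = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

for demo_model in [LogisticRegression(),
                   RandomForestClassifier(random_state=0),
                   MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)]:
    demo_model.fit(X_tr, y_tr)
    print(type(demo_model).__name__, demo_model.score(X_te, y_te))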

Data Pipeline#

To illustrate the use of data pipelines, we are going to generate data using a function that I created (see generate_dataset.py). The task will be a classification task, and the data will contain both numerical and categorical features. Unlike last week, the data is generated with the sklearn function make_classification (wrapped inside generate_dataset).



X_train, X_test, y_train, y_test = generate_dataset()

For our data pipeline, we need to know which features are categorical and which are numerical. The two types of variables require different steps in the preprocessing pipeline. Categorical features need to be numerically encoded (e.g., one-hot encoding). Numerical features need (in many cases) to be standardized (see lecture 2).
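As a tiny illustration of both preprocessing steps on toy data (not our generated dataset): one-hot encoding turns a categorical column into binary indicator columns, while standardization rescales a numerical column to zero mean and unit variance.

# toy columns, purely to show what the two transformers do
toy_cat = np.array([['red'], ['green'], ['red'], ['blue']])
toy_num = np.array([[10.0], [20.0], [30.0], [40.0]])

print(OneHotEncoder().fit_transform(toy_cat).toarray())   # one indicator column per colour
print(StandardScaler().fit_transform(toy_num).ravel())    # zero mean, unit variance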

categorical_cols = [column for column in X_train.columns if 'categorical' in column]
numerical_cols = [column for column in X_train.columns if 'numerical' in column]

We will now create a pipeline using ColumnTransformer and Pipeline from sklearn.

# create ColumnTransformer, and pass the column names to transform in each step

def make_pipeline(categorical_cols=[], numerical_cols=[], classifier=LogisticRegression()):
    preprocessor = ColumnTransformer(
        [
            ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols),
            ('scale', StandardScaler(), numerical_cols)
        ]
    )

    clf = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', classifier)
        ]
    )

    return clf
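As a quick usage example, the returned pipeline behaves like any single sklearn estimator: we can call fit and score on it directly.

# fit and score the pipeline with a single model (sanity check)
pipe = make_pipeline(categorical_cols, numerical_cols, LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))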

Now that we have defined our pipeline, let’s see how different models will perform on our generated data. We define the models that we want to test in the list below.

models_to_run = [
    LogisticRegression(multi_class='multinomial'),
    RandomForestClassifier(max_depth=2),
    RandomForestClassifier(max_depth=5),
    RandomForestClassifier(max_depth=10),
    RandomForestClassifier(max_depth=20),
    RandomForestClassifier(max_depth=80),
    MLPClassifier(hidden_layer_sizes=2),
    MLPClassifier(hidden_layer_sizes=6),
    MLPClassifier(hidden_layer_sizes=10),
    MLPClassifier(hidden_layer_sizes=20),
    MLPClassifier(hidden_layer_sizes=40)
]
    

The standard classification metric used to evaluate models in sklearn pipelines (the .score method) is the Accuracy: the fraction of correct predictions out of all data points. This metric works fine in generic situations, but it might not be the best choice when the distribution of the target y is uneven. That situation is normally referred to as class imbalance (an unbalanced target).
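A toy example (made-up labels, not our data) shows why accuracy can be misleading when the target is unbalanced:

from sklearn.metrics import accuracy_score

# 90% of the labels belong to class 0; a "model" that always predicts 0
# already reaches 90% accuracy while never finding class 1
y_true_toy = np.array([0] * 90 + [1] * 10)
y_pred_toy = np.zeros(100, dtype=int)
print(accuracy_score(y_true_toy, y_pred_toy))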

We will quickly assess our target distribution to see if it is balanced or unbalanced before we continue.

plt.pie(np.bincount(y_train) / len(y_train), labels=np.unique(y_train))
[Output: pie chart of the target distribution in y_train; the three classes 0, 1 and 2 have roughly equal shares]

The distribution is approximately even, so the Accuracy will suffice. We will now run the models on our data, store the classification results (the Accuracy!) and show them using a pandas DataFrame.

results = {}
for model in models_to_run:

    # create instance of the class, feed the model to be tested
    clf = make_pipeline(categorical_cols, numerical_cols, model)
    clf.fit(X_train, y_train)

    # store results in a dictionary 
    results[str(model)] = clf.score(X_test, y_test)
pd.DataFrame.from_dict(results, orient='index', columns=['score'])
/home/jan/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
score
LogisticRegression(multi_class='multinomial') 0.488
RandomForestClassifier(max_depth=2) 0.588
RandomForestClassifier(max_depth=5) 0.776
RandomForestClassifier(max_depth=10) 0.780
RandomForestClassifier(max_depth=20) 0.768
RandomForestClassifier(max_depth=80) 0.768
MLPClassifier(hidden_layer_sizes=2) 0.468
MLPClassifier(hidden_layer_sizes=6) 0.624
MLPClassifier(hidden_layer_sizes=10) 0.668
MLPClassifier(hidden_layer_sizes=20) 0.712
MLPClassifier(hidden_layer_sizes=40) 0.728

There are big differences between the performances, and the random forest and the neural network seem to be the big winners. Clearly, there is something to gain from upgrading from the simple logistic regression to a RF or NN. So what exactly are the weak and strong points of these algorithms, and how do we know when to use which one?

Logistic Regression#

  • Advantages: interpretable results; a good choice in cases where you want to understand relationships.

  • Disadvantages: often not able to capture complex relationships and does not work well out of the box if non-linearities are present.

Random Forest#

  • Advantages: works very well out of the box, is resilient against overfitting, and is able to capture complex (non-linear) relationships. Almost no data preprocessing is needed!

  • Disadvantages: there are many hyperparameters to tune, and it is not as interpretable as logistic regression.

Neural Network#

  • Advantages: can theoretically model any relationship and can achieve very strong performance, if tuned properly.

  • Disadvantages: very easy to overfit, designing a model can take a long time, and there is almost no interpretability.

Wine Dataset#

So, let’s see how well our pipeline performs on some real-world examples. Our first stop is the famous ‘wine’ dataset, where the goal is to predict the quality of wines given several features.

wine = pd.read_csv('../../static/data/winequality-red.csv')
y_wine = wine['quality']
X_wine = wine.drop(['quality'], axis=1)

This dataset is actually a regression problem, but we will turn it into a classification problem by grouping the target. We will create two groups: ‘medium’ with all ratings up to and including 6, and ‘excellent’ containing all ratings of 7 and higher.

# Create Classification version of target variable
y_wine = pd.Series([1 if x >= 7 else 0 for x in y_wine])
y_wine = y_wine.replace({0: 'medium', 1: 'excellent'})

So what do our features look like?

X_wine.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4

Do we have any categorical feature hiding in our dataset that we would want to one-hot-encode with our pipeline?

Remember that a categorical feature often has a low number of unique values.

X_wine.nunique()
fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
dtype: int64

The output above shows the number of unique values per feature. If the number of unique values is low, it can be an indication that it is better to treat the variable as categorical (i.e., make it a string).

The dataset has roughly 1,600 rows (or observations), so in this case I would say that it is fine to treat all the variables as numerical.
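For completeness, this is what the conversion would look like if we did want to treat a low-cardinality column as categorical (a sketch, not applied in the rest of this lecture):

# casting a column to strings would route it through the OneHotEncoder branch of the pipeline
X_wine_alt = X_wine.copy()
X_wine_alt['free sulfur dioxide'] = X_wine_alt['free sulfur dioxide'].astype(str)
print(X_wine_alt['free sulfur dioxide'].dtype)   # object, i.e. strings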

To check the class balance, we plot the distribution of the target again.

labels_wine, counts_wine = np.unique(y_wine, return_counts=True)
plt.pie(counts_wine / len(y_wine), labels=labels_wine, autopct='%1.1f%%')
plt.title('Distribution of wine labels')
plt.axis('equal')
plt.show()
[Figure: pie chart ‘Distribution of wine labels’ showing the shares of ‘medium’ and ‘excellent’]

The distribution of the labels is off: we can get about 86% accuracy by always predicting ‘medium’. Therefore we need a metric that somehow corrects for this. We find our corrected metric in something called the ‘F1 score’. We will also try to ‘fix’ the class imbalance by adding a parameter to our pipeline.
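To see why the F1 score helps, here is a toy example with made-up labels (not the actual wine split): a “model” that always predicts ‘medium’ gets a high accuracy, but a clearly lower weighted F1 score, because the ‘excellent’ class is never found.

from sklearn.metrics import accuracy_score, f1_score

# toy labels: 86 'medium' vs 14 'excellent', predictions are always 'medium'
y_true_toy = ['medium'] * 86 + ['excellent'] * 14
y_pred_toy = ['medium'] * 100
print(accuracy_score(y_true_toy, y_pred_toy))                     # 0.86
print(f1_score(y_true_toy, y_pred_toy, average='weighted'))       # noticeably lower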

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, test_size=.50, random_state=15)

We don’t have categorical features in this dataset, so we don’t need to process them. We standardize all of our features!

categorical_cols = []
numerical_cols = X_train_wine.columns
from sklearn.metrics import f1_score

results = {}
for model in models_to_run:

    # create instance of the class, feed the model to be tested
    clf = make_pipeline(categorical_cols, numerical_cols, model)
    clf.fit(X_train_wine, y_train_wine)

    # store results in a dictionary 
    results[str(model)] = f1_score(y_test_wine, clf.predict(X_test_wine), average='weighted')
pd.DataFrame.from_dict(results, orient='index', columns=['score'])
/home/jan/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
score
LogisticRegression(multi_class='multinomial') 0.868228
RandomForestClassifier(max_depth=2) 0.822008
RandomForestClassifier(max_depth=5) 0.875235
RandomForestClassifier(max_depth=10) 0.897810
RandomForestClassifier(max_depth=20) 0.906959
RandomForestClassifier(max_depth=80) 0.905877
MLPClassifier(hidden_layer_sizes=2) 0.826647
MLPClassifier(hidden_layer_sizes=6) 0.866706
MLPClassifier(hidden_layer_sizes=10) 0.874667
MLPClassifier(hidden_layer_sizes=20) 0.869089
MLPClassifier(hidden_layer_sizes=40) 0.874326

An interesting observation is that the Neural Network does not seem to do much better than the Random Forest. Of course, this might change with proper hyperparameter tuning (see next week!), but something else should be mentioned: on tabular data (data that fits into an Excel spreadsheet), Neural Networks are often not the best choice, whereas a Random Forest tends to perform well.

Our minority class (excellent wines!) only has an occurrence of 13%. Luckily, sklearn’s random forest algorithm has a single parameter that we can use to correct for this. This parameter is called class_weight, and we set it to 'balanced' in order to tackle the class imbalance.
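What does 'balanced' do? Each class receives a weight that is inversely proportional to its frequency, n_samples / (n_classes * count_per_class). A small sketch with toy counts (not the exact wine numbers):

# with counts 86 vs 14, the rare class gets the larger weight
toy_counts = np.array([86, 14])
print(toy_counts.sum() / (len(toy_counts) * toy_counts))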

To add it to our pipeline, we do the following:

# create ColumnTransformer, and pass the column names to transform in each step

def make_pipeline(categorical_cols=[], numerical_cols=[], classifier=LogisticRegression()):

    if isinstance(classifier, RandomForestClassifier):
        setattr(classifier, 'class_weight', 'balanced')

    preprocessor = ColumnTransformer(
        [
            ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols),
            ('scale', StandardScaler(), numerical_cols)
        ]
    )

    clf = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', classifier)
        ]
    )

    return clf
from sklearn.metrics import f1_score

results = {}
for model in models_to_run:

    # create instance of the class, feed the model to be tested
    clf = make_pipeline(categorical_cols, numerical_cols, model)
    clf.fit(X_train_wine, y_train_wine)

    # store results in a dictionary 
    results[str(model)] = f1_score(y_test_wine, clf.predict(X_test_wine), average='weighted')
pd.DataFrame.from_dict(results, orient='index', columns=['score'])
/home/jan/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
score
LogisticRegression(multi_class='multinomial') 0.868228
RandomForestClassifier(class_weight='balanced', max_depth=2) 0.825516
RandomForestClassifier(class_weight='balanced', max_depth=5) 0.862949
RandomForestClassifier(class_weight='balanced', max_depth=10) 0.896311
RandomForestClassifier(class_weight='balanced', max_depth=20) 0.895385
RandomForestClassifier(class_weight='balanced', max_depth=80) 0.900999
MLPClassifier(hidden_layer_sizes=2) 0.850417
MLPClassifier(hidden_layer_sizes=6) 0.863810
MLPClassifier(hidden_layer_sizes=10) 0.855434
MLPClassifier(hidden_layer_sizes=20) 0.861475
MLPClassifier(hidden_layer_sizes=40) 0.875574

Unfortunately, the parameter did not change our model for the better. This can of course always happen in Machine Learning, which is sometimes more of an art than a science. Several other methods to combat class imbalance could be tried (oversampling, SMOTE, …), but an increase in performance is never guaranteed.
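As an illustration of one of those alternatives, here is a sketch of simple random oversampling on the training set (SMOTE itself would need the separate imbalanced-learn package):

from sklearn.utils import resample

# duplicate minority-class training rows until both classes are equally frequent
is_excellent = (y_train_wine == 'excellent')
X_minority, X_majority = X_train_wine[is_excellent], X_train_wine[~is_excellent]
X_minority_up = resample(X_minority, replace=True, n_samples=len(X_majority), random_state=0)

X_train_balanced = pd.concat([X_majority, X_minority_up])
y_train_balanced = pd.Series(['medium'] * len(X_majority) + ['excellent'] * len(X_minority_up),
                             index=X_train_balanced.index)
print(y_train_balanced.value_counts())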

In Practice: MNIST Dataset#

We will turn to a second problem: the classification of handwritten digits. This is a computer vision task, and we know that Logistic Regression and Random Forest are ill-equipped for this task. Instead, we will add a convolutional neural network to our pipeline.

from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras import Sequential
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.layers import (
    Conv2D,
    Dense,
    Dropout,
    Flatten,
    MaxPool2D
)

(X_digits_train, y_digits_train), (X_digits_test, y_digits_test) = load_data()

So what does our data look like?

X_digits_train[:, :, :]
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)
X_digits_train.shape
(60000, 28, 28)

It is important to understand that this dataset is fundamentally different from the datasets that we have seen so far. Before, we were working with N * P datasets, where N is the number of data points and P is the number of features. Our current dataset is N * H * W: N is still the number of data points, but H refers to the height of the image and W to its width. Every cell in this array holds a number that determines its colour (0 = black).
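We can quickly verify the value range of the pixels:

# pixel values are 8-bit grey levels: 0 is black, 255 is the brightest value
print(X_digits_train.dtype, X_digits_train.min(), X_digits_train.max())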

Let’s see if we can plot some images.

plt.figure(figsize=(6,6)) # specifying the overall figure size

for i in range(4):
    plt.subplot(2,2,i+1)    # a 2 x 2 grid of images
    plt.imshow(X_digits_train[i])

plt.show()
[Figure: 2 x 2 grid with the first four handwritten digit images from the training set]

So the task is clear. We have these handwritten numbers and we will see if we can classify them correctly.

The models that we have seen so far don’t do a good job on data like this. They work well on (somewhat) independent features, not on images with strong local correlations. We will, however, compare their (LR, RF, …) performance to our new convolutional neural network.

To create our convolutional neural network, we will turn to the TensorFlow/Keras library. For your understanding: TensorFlow is an ML/AI library that is optimized for mathematical operations, and Keras is a library that runs on top of TensorFlow for Deep Learning/Neural Networks. This library allows us to completely customize our neural network to our needs. A proper introduction to TensorFlow is outside the scope of this course, but in practice we stack layers after each other. The final layer returns 10 values: a probability for each possible output (the digits 0 through 9).

# defining the model
inp_shape = X_digits_train.shape[1:]

def create_model():

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=inp_shape + (1,)))
    model.add(MaxPool2D((2, 2)))
    model.add(Conv2D(48, (3, 3), activation='relu'))
    model.add(MaxPool2D((2, 2)))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(500, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
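If you want to inspect the stacked layers, you can build the model once and print a summary (optional check):

# build the model and show the layer stack with output shapes and parameter counts
model = create_model()
model.summary()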

We wrap our model in KerasClassifier and then add it to our models_to_run.

nn_classifier = KerasClassifier(build_fn=create_model, verbose=0)
models_to_run = [
    LogisticRegression(multi_class='multinomial'),
    RandomForestClassifier(max_depth=20),
    MLPClassifier(hidden_layer_sizes=20),
    nn_classifier
]

Our convolutional network is able to handle images, but the other models cannot. These models expect an N * P input (two dimensions) and not an N * H * W input (three dimensions). For those models we will average over the third dimension to get to the N * P format. We will, of course, lose information in the process.
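A quick shape check of this averaging trick: each 28 x 28 image collapses to 28 values.

# (60000, 28, 28) -> (60000, 28): one mean value per image row remains
print(X_digits_train.shape, '->', np.mean(X_digits_train, axis=2).shape)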

# for the models that cannot handle images, reduce the dimensions with a FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

def mean_over_second_image_dimension(img):
    return np.mean(img, axis=2)

transformer = FunctionTransformer(mean_over_second_image_dimension)


def make_pipeline_mnist(classifier=LogisticRegression()):

    clf = Pipeline(
        steps=[
            ('classifier', classifier)
        ]
    )

    # check if the instance is a KerasClassifier; if not, add a first step to reduce
    # the dimensions
    if not isinstance(classifier, KerasClassifier):
        clf.steps.insert(0, ('reduce_dims', transformer))  # insert as first step


    return clf

Let’s run the pipelines again and store our results.

results = {}
for model in models_to_run:
    # create instance of the class, feed the model to be tested
    clf = make_pipeline_mnist(model)
    clf.fit(X_digits_train, y_digits_train)

    # store results in a dictionary 
    results[str(model)] = clf.score(X_digits_test, y_digits_test)
pd.DataFrame.from_dict(results, orient='index', columns=['score'])
/home/jan/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
2022-10-25 17:29:35.653981: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-25 17:29:35.725709: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-10-25 17:29:35.742910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3609600000 Hz
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [31], line 5
      2 for model in models_to_run:
      3     # create instance of the class, feed the model to be tested
      4     clf = make_pipeline_mnist(model)
----> 5     clf.fit(X_digits_train, y_digits_train)
      7     # store results in a dictionary 
      8     results[str(model)] = clf.score(X_digits_test, y_digits_test)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/sklearn/pipeline.py:382, in Pipeline.fit(self, X, y, **fit_params)
    380     if self._final_estimator != "passthrough":
    381         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 382         self._final_estimator.fit(Xt, y, **fit_params_last_step)
    384 return self

[... Keras/TensorFlow-internal stack frames omitted ...]

    ValueError: Input 0 of layer sequential is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: (32, 28, 28)

The run above actually fails for the convolutional network: the Keras model expects a 4-D input of shape (batch, height, width, channels), while our pipeline feeds it the raw 3-D arrays of shape (batch, 28, 28). A possible fix is sketched below. Once the input has the extra channel dimension, the convolutional network can be expected to do a much better job than the other algorithms on this task. And there are some tricks to upgrade the performance of the other models as well… see the exercise!
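A minimal sketch of that fix, keeping the rest of the pipeline as defined above (the names add_channel_dimension and channel_transformer are introduced here purely for illustration):

# Sketch of a possible fix (not run above): Conv2D expects inputs of shape
# (batch, height, width, channels), so we add a channel axis for the Keras model.
# In practice you would probably also rescale the pixel values, e.g. img / 255.0.
def add_channel_dimension(img):
    return np.expand_dims(img, axis=-1)      # (N, 28, 28) -> (N, 28, 28, 1)

channel_transformer = FunctionTransformer(add_channel_dimension)

def make_pipeline_mnist(classifier=LogisticRegression()):
    clf = Pipeline(steps=[('classifier', classifier)])
    if isinstance(classifier, KerasClassifier):
        # the CNN gets the extra channel axis
        clf.steps.insert(0, ('reshape', channel_transformer))
    else:
        # the other models still get the averaged 2-D (N, P) input
        clf.steps.insert(0, ('reduce_dims', transformer))
    return clf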

Exercise / bonus tutorial#

Our previous method for reducing the dimensionality of the MNIST inputs, in order to prepare them for a Random Forest, is probably not the best. Instead, we will try a different approach.

In fact, we want to perform dimensionality reduction. This can also be achieved with the unsupervised ML method Principal Component Analysis (PCA). In this exercise, we will use this method to prepare our image data for a Random Forest.
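For intuition, here is a tiny standalone illustration of PCA on random toy data (this is not the exercise solution):

from sklearn.decomposition import PCA

# project 10-dimensional toy points onto their 2 leading principal components
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 10))
print(PCA(n_components=2).fit_transform(toy).shape)   # (100, 2)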

  1. Load PCA from sklearn.

from sklearn.decomposition import PCA
  2. Replace the mean_over_second_image_dimension function by a reshape_image function. It should use np.reshape() and replace the 28 x 28 format by the 784 x 1 format.

# define a FunctionTransformer that flattens each image
from sklearn.preprocessing import FunctionTransformer

# def mean_over_second_image_dimension(img):
#     return np.mean(img, axis=2)

def reshape_image(img):
    # Your code here
    pass
  3. Add both the transformation step from the previous cell and PCA to the pipeline below. For PCA, use the parameter n_components=20.

def make_pipeline_mnist(classifier=LogisticRegression()):

    clf = Pipeline(
        steps=[
            ('classifier', classifier)
        ]
    )

    # check if the instance is a KerasClassifier; if not, add a first step to reduce
    # the dimensions
    if not isinstance(classifier, KerasClassifier):
        # your code here
        pass


    return clf
  4. Run the new pipeline on all the previous models.

  5. What is the gain in performance? ;-)