Exploratory Data Analysis (EDA) Wine Quality dataset#

We will analyze the well-known wine dataset using our newly gained skills in this part. Each wine is described with several attributes obtained by physicochemical tests and by its quality (from 1 to 10). This dataset is perfect for many ML tasks such as:

  • Testing Outlier detection algorithms that can detect the few excellent or poor wines

  • Modeling the relationship between wine quality and wine attributes (using regression or classification)

  • Attribute (feature) selection techniques to remove unimportant features

✏️ The example is inspired by [Pac].

Let’s start by importing packages:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Let’s load our data (there are two datasets for red and white wines):

red_wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
white_wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

# load data to DF
df_red = pd.read_csv(red_wine_url, delimiter=";")
df_white = pd.read_csv(white_wine_url, delimiter=";")

Let’s check red wine data:

df_red.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
df_red.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

Let’s check white wine data:

df_white.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
df_white.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000

As we will later combine our datasets into a single one, let’s add an extra column to indicate category:

df_white['wine_category'] = 'white'
df_red['wine_category'] = 'red'

A colleague in our department, a wine expert, gave us a tip that scaling the quality from 1 to 10 might probably be too granular for certain types of analysis, and it might be OK to divide wines into three quality categories - low, medium, high:

df_red['quality_label'] = df_red['quality'].apply(lambda value: ('low' if value <= 5 else 'medium') if value <= 7 else 'high')
df_red['quality_label'] = pd.Categorical(df_red['quality_label'], categories=['low', 'medium', 'high'])

df_white['quality_label'] = df_white['quality'].apply(lambda value: ('low' if value <= 5 else 'medium') if value <= 7 else 'high')
df_white['quality_label'] = pd.Categorical(df_white['quality_label'], categories=['low', 'medium', 'high'])

Red Wine Dataset Analysis#

Let’s get familiar with the red wine dataset with some visualizations that we already know:

_, axes = plt.subplots(1, 2, figsize=(20, 5))

sns.countplot(data=df_red, x='quality', ax=axes[0])
sns.countplot(data=df_red, x='quality_label', ax=axes[1]);
../../_images/eda_15_0.png

Well, the dataset is highly imbalanced when it comes to quality attribute.

Let’s check some relationships between attributes:

sns.pairplot(df_red.drop(columns=['quality']), hue="quality_label", height=3);
../../_images/eda_17_0.png

The plot is somehow hard to read. Let’s try to focus on an individual attribute:

_, axes = plt.subplots(1, 2, figsize=(10, 5))

sns.violinplot(data=df_red, y="alcohol", x="quality_label", ax=axes[0])
sns.violinplot(data=df_red, y="alcohol", x="quality", ax=axes[1]);
../../_images/eda_19_0.png

There are no obvious gotchas. Maybe we will be more lucky with the correlation plot:

sns.heatmap(df_red.corr(method="pearson"), annot=True);
/tmp/ipykernel_18914/527241991.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  sns.heatmap(df_red.corr(method="pearson"), annot=True);
../../_images/eda_21_1.png

There are definitely some interactions, we can study them, for example, with jointplot:

sns.jointplot(x='alcohol',y='sulphates',data=df_red, kind='reg');
../../_images/eda_23_0.png

Let’s analyze relationship between attributes and the quality:

columns = list(filter(lambda col: col not in {'quality', 'wine_category', 'quality_label'}, df_red.columns))
chunks = [columns[i:i + 6] for i in range(0, len(columns), 6)]
for columns_names in chunks:
    sns.pairplot(df_red, x_vars=columns_names, y_vars='quality', kind="reg")
../../_images/eda_25_0.png ../../_images/eda_25_1.png

White Wine Dataset Analysis#

Let’s run the same analysis for the white vine:

_, axes = plt.subplots(1, 2, figsize=(20, 5))

sns.countplot(data=df_white, x='quality', ax=axes[0])
sns.countplot(data=df_white, x='quality_label', ax=axes[1]);
../../_images/eda_27_0.png
sns.pairplot(df_white.drop(columns=['quality']), hue="quality_label", height=3);
../../_images/eda_28_0.png
_, axes = plt.subplots(1, 2, figsize=(10, 5))

sns.violinplot(data=df_white, y="alcohol", x="quality_label", ax=axes[0])
sns.violinplot(data=df_white, y="alcohol", x="quality", ax=axes[1]);
../../_images/eda_29_0.png
sns.heatmap(df_white.corr(method="pearson"), annot=True);
/tmp/ipykernel_18914/455693947.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  sns.heatmap(df_white.corr(method="pearson"), annot=True);
../../_images/eda_30_1.png
sns.jointplot(x='alcohol',y='sulphates',data=df_white, kind='reg');
../../_images/eda_31_0.png
columns = list(filter(lambda col: col not in {'quality', 'wine_category', 'quality_label'}, df_white.columns))
chunks = [columns[i:i + 6] for i in range(0, len(columns), 6)]
for columns_names in chunks:
    sns.pairplot(df_white, x_vars=columns_names, y_vars='quality', kind="reg")
../../_images/eda_32_0.png ../../_images/eda_32_1.png

Not surprisingly enough, white wine has a bit of different properties. Let’s combine datasets for additional analysis.

Analyze Combined Data#

Let’s concatenate datasets together:

df_wines = pd.concat([df_red, df_white])

# re-shuffle records just to randomize data points
df_wines = df_wines.sample(frac=1.0, random_state=42).reset_index(drop=True)
df_wines.head(10)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_category quality_label
0 7.0 0.17 0.74 12.8 0.045 24.0 126.0 0.99420 3.26 0.38 12.2 8 white high
1 7.7 0.64 0.21 2.2 0.077 32.0 133.0 0.99560 3.27 0.45 9.9 5 red low
2 6.8 0.39 0.34 7.4 0.020 38.0 133.0 0.99212 3.18 0.44 12.0 7 white medium
3 6.3 0.28 0.47 11.2 0.040 61.0 183.0 0.99592 3.12 0.51 9.5 6 white medium
4 7.4 0.35 0.20 13.9 0.054 63.0 229.0 0.99888 3.11 0.50 8.9 6 white medium
5 7.2 0.53 0.14 2.1 0.064 15.0 29.0 0.99323 3.35 0.61 12.1 6 red medium
6 7.5 0.27 0.31 17.7 0.051 33.0 173.0 0.99900 3.09 0.64 10.2 5 white low
7 6.8 0.11 0.27 8.6 0.044 45.0 104.0 0.99454 3.20 0.37 9.9 6 white medium
8 9.0 0.44 0.49 2.4 0.078 26.0 121.0 0.99780 3.23 0.58 9.2 5 red low
9 7.1 0.23 0.30 2.6 0.034 62.0 148.0 0.99121 3.03 0.56 11.3 7 white medium

Just by comparing the quality, you can notice a different distributions. For example, the proportion of 5 and 6 is very different for red and white:

_, axes = plt.subplots(1, 2, figsize=(20, 5))

sns.countplot(data=df_wines, x="quality", hue="wine_category", ax=axes[0])
sns.countplot(data=df_wines, x="quality_label", hue="wine_category", ax=axes[1]);
../../_images/eda_36_0.png
sns.pairplot(df_wines.drop(columns=['quality']), hue="wine_category", height=3);
../../_images/eda_37_0.png

It seems that separating type of wines might be an easy task to do. Let’s check some interactions of combined dataset vs. quality:

sns.pairplot(df_wines.drop(columns=['quality']), hue="quality_label", height=3);
Error in callback <function flush_figures at 0x7f6dc92dd160> (for post_execute):
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib_inline/backend_inline.py:126, in flush_figures()
    123 if InlineBackend.instance().close_figures:
    124     # ignore the tracking, just draw and close all figures
    125     try:
--> 126         return show(True)
    127     except Exception as e:
    128         # safely show traceback if in IPython, else raise
    129         ip = get_ipython()

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib_inline/backend_inline.py:90, in show(close, block)
     88 try:
     89     for figure_manager in Gcf.get_all_fig_managers():
---> 90         display(
     91             figure_manager.canvas.figure,
     92             metadata=_fetch_figure_metadata(figure_manager.canvas.figure)
     93         )
     94 finally:
     95     show._to_draw = []

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/IPython/core/display_functions.py:298, in display(include, exclude, metadata, transient, display_id, raw, clear, *objs, **kwargs)
    296     publish_display_data(data=obj, metadata=metadata, **kwargs)
    297 else:
--> 298     format_dict, md_dict = format(obj, include=include, exclude=exclude)
    299     if not format_dict:
    300         # nothing to display (e.g. _ipython_display_ took over)
    301         continue

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/IPython/core/formatters.py:178, in DisplayFormatter.format(self, obj, include, exclude)
    176 md = None
    177 try:
--> 178     data = formatter(obj)
    179 except:
    180     # FIXME: log the exception
    181     raise

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/decorator.py:232, in decorate.<locals>.fun(*args, **kw)
    230 if not kwsyntax:
    231     args, kw = fix(args, kw, sig)
--> 232 return caller(func, *(extras + args), **kw)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/IPython/core/formatters.py:222, in catch_format_error(method, self, *args, **kwargs)
    220 """show traceback on failed format call"""
    221 try:
--> 222     r = method(self, *args, **kwargs)
    223 except NotImplementedError:
    224     # don't warn on NotImplementedErrors
    225     return self._check_return(None, args[0])

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/IPython/core/formatters.py:339, in BaseFormatter.__call__(self, obj)
    337     pass
    338 else:
--> 339     return printer(obj)
    340 # Finally look for special method names
    341 method = get_real_method(obj, self.print_method)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/IPython/core/pylabtools.py:151, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
    148     from matplotlib.backend_bases import FigureCanvasBase
    149     FigureCanvasBase(fig)
--> 151 fig.canvas.print_figure(bytes_io, **kw)
    152 data = bytes_io.getvalue()
    153 if fmt == 'svg':

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/backend_bases.py:2338, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)
   2334 try:
   2335     # _get_renderer may change the figure dpi (as vector formats
   2336     # force the figure dpi to 72), so we need to set it again here.
   2337     with cbook._setattr_cm(self.figure, dpi=dpi):
-> 2338         result = print_method(
   2339             filename,
   2340             facecolor=facecolor,
   2341             edgecolor=edgecolor,
   2342             orientation=orientation,
   2343             bbox_inches_restore=_bbox_inches_restore,
   2344             **kwargs)
   2345 finally:
   2346     if bbox_inches and restore_bbox:

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/backend_bases.py:2204, in FigureCanvasBase._switch_canvas_and_return_print_method.<locals>.<lambda>(*args, **kwargs)
   2200     optional_kws = {  # Passed by print_figure for other renderers.
   2201         "dpi", "facecolor", "edgecolor", "orientation",
   2202         "bbox_inches_restore"}
   2203     skip = optional_kws - {*inspect.signature(meth).parameters}
-> 2204     print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
   2205         *args, **{k: v for k, v in kwargs.items() if k not in skip}))
   2206 else:  # Let third-parties do as they see fit.
   2207     print_method = meth

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/_api/deprecation.py:410, in delete_parameter.<locals>.wrapper(*inner_args, **inner_kwargs)
    400     deprecation_addendum = (
    401         f"If any parameter follows {name!r}, they should be passed as "
    402         f"keyword, not positionally.")
    403     warn_deprecated(
    404         since,
    405         name=repr(name),
   (...)
    408                  else deprecation_addendum,
    409         **kwargs)
--> 410 return func(*inner_args, **inner_kwargs)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:520, in FigureCanvasAgg.print_png(self, filename_or_obj, metadata, pil_kwargs, *args)
    471 @_api.delete_parameter("3.5", "args")
    472 def print_png(self, filename_or_obj, *args,
    473               metadata=None, pil_kwargs=None):
    474     """
    475     Write the figure to a PNG file.
    476 
   (...)
    518         *metadata*, including the default 'Software' key.
    519     """
--> 520     self._print_pil(filename_or_obj, "png", pil_kwargs, metadata)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:467, in FigureCanvasAgg._print_pil(self, filename_or_obj, fmt, pil_kwargs, metadata)
    462 """
    463 Draw the canvas, then save it using `.image.imsave` (to which
    464 *pil_kwargs* and *metadata* are forwarded).
    465 """
    466 FigureCanvasAgg.draw(self)
--> 467 mpl.image.imsave(
    468     filename_or_obj, self.buffer_rgba(), format=fmt, origin="upper",
    469     dpi=self.figure.dpi, metadata=metadata, pil_kwargs=pil_kwargs)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/matplotlib/image.py:1656, in imsave(fname, arr, vmin, vmax, cmap, format, origin, dpi, metadata, pil_kwargs)
   1654 pil_kwargs.setdefault("format", format)
   1655 pil_kwargs.setdefault("dpi", (dpi, dpi))
-> 1656 image.save(fname, **pil_kwargs)

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/PIL/Image.py:2284, in Image.save(self, fp, format, **params)
   2281     filename = fp.name
   2283 # may mutate self!
-> 2284 self._ensure_mutable()
   2286 save_all = params.pop("save_all", False)
   2287 self.encoderinfo = params

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/PIL/Image.py:599, in Image._ensure_mutable(self)
    597 def _ensure_mutable(self):
    598     if self.readonly:
--> 599         self._copy()
    600     else:
    601         self.load()

File ~/miniconda3/envs/ds-academy-development/lib/python3.9/site-packages/PIL/Image.py:593, in Image._copy(self)
    591 def _copy(self):
    592     self.load()
--> 593     self.im = self.im.copy()
    594     self.pyaccess = None
    595     self.readonly = 0

KeyboardInterrupt: 
_, axes = plt.subplots(1, 2, figsize=(10, 5))

sns.violinplot(data=df_wines, y="alcohol", x="quality_label", hue="wine_category", ax=axes[0])
sns.violinplot(data=df_wines, y="alcohol", x="quality", hue="wine_category", ax=axes[1]);
../../_images/eda_40_0.png
sns.heatmap(df_wines.corr(method="pearson"), annot=True);
../../_images/eda_41_0.png
columns = list(filter(lambda col: col not in {'quality', 'wine_category', 'quality_label'}, df_wines.columns))
chunks = [columns[i:i + 6] for i in range(0, len(columns), 6)]
for columns_names in chunks:
    sns.pairplot(df_wines, x_vars=columns_names, y_vars='quality', kind="reg")
../../_images/eda_42_0.png ../../_images/eda_42_1.png

To revisit ML tasks mentioned earlier - it is tough to tell if it is better to solve those tasks using individual datasets or the combined one based on the analysis. My bet would be to do it on individual ones, but it depends on what you want to solve and which approach you take. As always, you need to experiment. I can imagine a scenario where, for example, the smaller white wines dataset can benefit from additional data from the red wines.

✏️ Several libraries aim to automate EDA to some extent if you want to get some results as quickly as possible. One of them is pandas-profiling. An example of automated analysis is Census Dataset EDA.

Exercises#

What sorts of passengers were more likely to survive the Titanic crash?#

For the exercise, you will use the famous Titanic dataset. The sinking of the Titanic is one of the most infamous shipwrecks in history.

Unfortunately, there weren’t enough lifeboats for everyone onboard on Titanic, resulting in 1502 out of 2224 passengers and crew deaths. While some elements of luck were involved in surviving, some groups of people seemed more likely to survive than others.

# load the dataset
titanic_url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic_data = pd.read_csv(titanic_url)

# TODO: your answer here

Resources#

Pac

PacktPublishing. Packtpublishing/hands-on-exploratory-data-analysis-with-python: hands-on exploratory data analysis with python, published by packt. URL: https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python.