Introduction to Pandas#

Pandas is the most popular Python library for tabular data structures. You can think of Pandas as a compelling version of Excel.

Pandas allows us to work with the less structured data available in many forms. Pandas introduces more flexibility (e.g., attaching labels to data, working with missing data, etc.) and brings additional operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.)

Pandas Series and DataFrame objects build on the NumPy array structure and provide efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

✏️ The example is inspired by [Tom] and [Van17].

Let’s import pandas with the alias pd:

import numpy as np
import pandas as pd

pd.__version__
'1.5.1'

Pandas Series#

A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them.

Creating Series#

By default, Series are labeled with indices starting from 0. For example:

pd.Series(data=[-5, 1.3, 21, 6, 3])
0    -5.0
1     1.3
2    21.0
3     6.0
4     3.0
dtype: float64

However, you can add a custom index:

pd.Series(data=[-5, 1.3, 21, 6, 3], index=['a', 'b', 'c', 'd', 'e'])
a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64

You can create a Series from a dictionary:

pd.Series(data={'a': 10, 'b': 20, 'c': 30})
a    10
b    20
c    30
dtype: int64

Or from an ndarray:

pd.Series(data=np.random.randn(3))
0   -0.723800
1   -0.730294
2    0.397899
dtype: float64

Series Characteristics#

You can access the index labels of your Series using the .index attribute:

s = pd.Series(data=np.random.randn(3))
s.index
RangeIndex(start=0, stop=3, step=1)

You can access the underlying data array using .to_numpy():

s.to_numpy()
array([-2.87624963, -1.63405871, -0.88819511])

Indexing and Slicing Series#

Series are very much like ndarrays in this regard (in fact, Series can be passed to most NumPy functions):

s = pd.Series(data=range(5), index=['A', 'B', 'C', 'D', 'A'])
s
A    0
B    1
C    2
D    3
A    4
dtype: int64
s[0]
0
s[0:3]
A    0
B    1
C    2
dtype: int64

Series are also like dictionaries in that we can access values using index labels:

s[["B", "D", "C"]]
B    1
D    3
C    2
dtype: int64
"A" in s
True

Series do allow for non-unique indexing, but be careful because indexing operations won’t return unique values:

s["A"]
A    0
A    4
dtype: int64

Finally, we can also do boolean indexing with Series:

s[s >= 1]
B    1
C    2
D    3
A    4
dtype: int64

Series Operations#

Unlike ndarrays, operations between Series (+, -, /, *) align values based on their LABELS (not their position in the structure). The resulting index will be the sorted union of the two indexes. This gives you the flexibility to run operations on Series regardless of their labels.

s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])
s1
A    0
B    1
C    2
D    3
dtype: int64
s2 = pd.Series(data=range(10, 14), index=["B", "C", "D", "E"])
s2
B    10
C    11
D    12
E    13
dtype: int64
s1 + s2
A     NaN
B    11.0
C    13.0
D    15.0
E     NaN
dtype: float64

We can also perform standard operations on a series, like multiplying or squaring. NumPy also accepts Series as an argument to most functions because series are built off numpy arrays:

s1 ** 2
A    0
B    1
C    4
D    9
dtype: int64
np.exp(s1)
A     1.000000
B     2.718282
C     7.389056
D    20.085537
dtype: float64

Finally, like arrays, Series have many built-in methods for various operations:

s1.mean()
1.5

Pandas DataFrames#

DataFrames are just Series stuck together. Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data:

Creating DataFrames#

Dataframes can be created using pd.DataFrame(). Like series, index and column labels of dataframes are labeled starting from 0 by default:

pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9

We can use the index and columns arguments to give them labels:

pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=["R1", "R2", "R3"], columns=["C1", "C2", "C3"])
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9

There are so many ways to create dataframes. For example, you can use dictionary:

pd.DataFrame({"C1": [1, 2, 3], "C2": ['A', 'B', 'C']}, index=["R1", "R2", "R3"])
C1 C2
R1 1 A
R2 2 B
R3 3 C

Indexing and Slicing#

There are several main ways to select data from a DataFrame:

Method

Syntax

Output

Select column

df[col_label]

Series

Select row slice

df[row_1_int:row_2_int]

DataFrame

Select row/column by label

df.loc[row_label(s), col_label(s)]

Object for single selection, Series for one row/column, otherwise DataFrame

Select row/column by integer

df.iloc[row_int(s), col_int(s)]

Object for single selection, Series for one row/column, otherwise DataFrame

Select by row integer & column label

df.loc[df.index[row_int], col_label]

Object for single selection, Series for one row/column, otherwise DataFrame

Select by row label & column integer

df.loc[row_label, df.columns[col_int]]

Object for single selection, Series for one row/column, otherwise DataFrame

Select by boolean

df[bool_vec]

Object for single selection, Series for one row/column, otherwise DataFrame

Indexing with []#

df = pd.DataFrame(
    {
        "Name": ["Tom", "Mike", "Tiffany"],
        "Language": ["Python", "Python", "R"],
        "Courses": [5, 4, 7]
    }
)
df
Name Language Courses
0 Tom Python 5
1 Mike Python 4
2 Tiffany R 7

Select columns by single labels, lists of labels, or slices:

df['Name']
0        Tom
1       Mike
2    Tiffany
Name: Name, dtype: object
df[['Name', 'Language']]
Name Language
0 Tom Python
1 Mike Python
2 Tiffany R
df[df['Courses'] > 5]
Name Language Courses
2 Tiffany R 7
df[df['Name'] == "Tom"]
Name Language Courses
0 Tom Python 5
df[(df['Courses'] > 4) & (df['Language'] == 'Python')]
Name Language Courses
0 Tom Python 5

Indexing with .loc and .iloc#

Pandas created the methods .loc[] and .iloc[] as more flexible alternatives for accessing data from a dataframe. Use df.iloc[] for indexing with integers and df.loc[] for indexing with labels.

# returns a series
df.iloc[0]
Name           Tom
Language    Python
Courses          5
Name: 0, dtype: object
# slicing returns a dataframe
df.iloc[0:2]
Name Language Courses
0 Tom Python 5
1 Mike Python 4
# returns the indexed object
df.iloc[2, 1]
'R'

Now let’s look at .loc which accepts labels as references to rows/columns:

df.loc[:, 'Name']
0        Tom
1       Mike
2    Tiffany
Name: Name, dtype: object

Reading/Writing Data From External Sources#

Pandas facilitates reading directly from a url - pd.read_csv() accepts urls as input:

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()
country food_category consumption co2_emmission
0 Argentina Pork 10.51 37.20
1 Argentina Poultry 38.66 41.53
2 Argentina Beef 55.48 1712.00
3 Argentina Lamb & Goat 1.56 54.63
4 Argentina Fish 4.36 6.96

Notice that we are using function .head() to visualize just a fraction of rows taken from the top.

Let’s save data to a .csv file. You will likely be loading local .csv files more often.

out_path = '../../data/food_consumption.csv'
df.to_csv(out_path, sep='\t', encoding='utf-8', index=False)

Let’s read it back:

df = pd.read_csv(out_path, index_col=0, parse_dates=True, sep='\t')
df.head()
food_category consumption co2_emmission
country
Argentina Pork 10.51 37.20
Argentina Poultry 38.66 41.53
Argentina Beef 55.48 1712.00
Argentina Lamb & Goat 1.56 54.63
Argentina Fish 4.36 6.96

DataFrames Operations#

Similar to Series, DataFrames have built-in functions for performing most common operations, e.g., .max(), .idxmin(), .sort_values(), .value_counts(), etc. They’re all documented in the Pandas documentation.

Let’s try a few of them:

df.max()
food_category    Wheat and Wheat Products
consumption                        430.76
co2_emmission                      1712.0
dtype: object
df['co2_emmission'].max()
1712.0
df['co2_emmission'].idxmin()
'Argentina'
df.sort_values(by='co2_emmission').head()
food_category consumption co2_emmission
country
Togo Soybeans 0.01 0.0
Paraguay Soybeans 0.00 0.0
Saudi Arabia Pork 0.00 0.0
Cambodia Lamb & Goat 0.00 0.0
Croatia Soybeans 0.00 0.0
df['food_category'].value_counts()
Pork                        130
Poultry                     130
Beef                        130
Lamb & Goat                 130
Fish                        130
Eggs                        130
Milk - inc. cheese          130
Wheat and Wheat Products    130
Rice                        130
Soybeans                    130
Nuts inc. Peanut Butter     130
Name: food_category, dtype: int64

Applying Custom Functions#

There might be cases when you want to apply a function not built-in to Pandas. For this, we also have methods:

  • df.apply(), applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)

  • df.applymap(), applies a function element-wise (for functions that accept/return single values at a time)

  • series.apply()/series.map(), same as above but for Pandas series

df[['consumption', 'co2_emmission']].apply(np.sqrt)
consumption co2_emmission
country
Argentina 3.241913 6.099180
Argentina 6.217717 6.444377
Argentina 7.448490 41.376322
Argentina 1.249000 7.391211
Argentina 2.088061 2.638181
... ... ...
Bangladesh 4.680812 5.586591
Bangladesh 4.179713 1.824829
Bangladesh 13.104579 14.824304
Bangladesh 0.781025 0.519615
Bangladesh 0.848528 1.126943

1430 rows × 2 columns

def length_of_category(x):
    return len(x)


df[['food_category']].applymap(length_of_category)
food_category
country
Argentina 4
Argentina 7
Argentina 4
Argentina 11
Argentina 4
... ...
Bangladesh 18
Bangladesh 24
Bangladesh 4
Bangladesh 8
Bangladesh 23

1430 rows × 1 columns

Grouping#

Often we are interested in examining specific groups in our data. df.groupby() allows us to group our data based on a variable(s):

dfg = df.groupby(by='food_category')

The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to:

dfg.groups
{'Beef': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Eggs': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Fish': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Lamb & Goat': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Milk - inc. cheese': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Nuts inc. Peanut Butter': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Pork': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Poultry': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Rice': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Soybeans': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Wheat and Wheat Products': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...]}

Let’s access a group using the .get_group() method:

dfg.get_group('Beef').head()
food_category consumption co2_emmission
country
Argentina Beef 55.48 1712.00
Australia Beef 33.86 1044.85
Albania Beef 22.50 694.30
Iceland Beef 13.36 412.26
New Zealand Beef 22.49 693.99

The usual thing to do however, is to apply aggregate functions to the groupby object:

dfg.mean()
consumption co2_emmission
food_category
Beef 12.123385 374.102000
Eggs 8.163769 7.499615
Fish 17.287077 27.601692
Lamb & Goat 2.600154 91.056769
Milk - inc. cheese 125.774692 179.153846
Nuts inc. Peanut Butter 4.137231 7.323000
Pork 16.123692 57.070077
Poultry 21.219231 22.793538
Rice 29.375154 37.591615
Soybeans 0.860538 0.387308
Wheat and Wheat Products 71.549538 13.644462

We can apply multiple functions using .aggregate():

dfg.aggregate(['mean', 'sum', 'count'])
consumption co2_emmission
mean sum count mean sum count
food_category
Beef 12.123385 1576.04 130 374.102000 48633.26 130
Eggs 8.163769 1061.29 130 7.499615 974.95 130
Fish 17.287077 2247.32 130 27.601692 3588.22 130
Lamb & Goat 2.600154 338.02 130 91.056769 11837.38 130
Milk - inc. cheese 125.774692 16350.71 130 179.153846 23290.00 130
Nuts inc. Peanut Butter 4.137231 537.84 130 7.323000 951.99 130
Pork 16.123692 2096.08 130 57.070077 7419.11 130
Poultry 21.219231 2758.50 130 22.793538 2963.16 130
Rice 29.375154 3818.77 130 37.591615 4886.91 130
Soybeans 0.860538 111.87 130 0.387308 50.35 130
Wheat and Wheat Products 71.549538 9301.44 130 13.644462 1773.78 130

Combining Datasets#

Some of the most exciting analyses come from combining different data sources. These operations can involve a straightforward concatenation of two datasets to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.

Simple Concatenation with pd.concat#

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate:

ser1 = pd.Series(['A', 'B', 'C'])
ser2 = pd.Series(['D', 'E', 'F'])
pd.concat([ser1, ser2])
0    A
1    B
2    C
0    D
1    E
2    F
dtype: object

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0):

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pd.concat([df1, df2])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
0 1 2 3
1 4 5 6
2 7 8 9

You can also concatenate columns:

pd.concat([df1, df2], axis=1)
0 1 2 0 1 2
0 1 2 3 1 2 3
1 4 5 6 4 5 6
2 7 8 9 7 8 9

Merge and Join#

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function:

df1 = pd.DataFrame(
    {
        'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
    }
)
df1
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
df2 = pd.DataFrame(
    {
        'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
        'hire_date': [2004, 2008, 2012, 2014]
    }
)
df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
One-to-one joins#

Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation:

df3 = pd.merge(df1, df2)
df3
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
Many-to-one joins#

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:

df4 = pd.DataFrame(
    {
        'group': ['Accounting', 'Engineering', 'HR'],
        'supervisor': ['Carly', 'Guido', 'Steve']
    }
)
df4
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
pd.merge(df3, df4)
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
Many-to-many joins#

If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.

df5 = pd.DataFrame(
    {
        'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
        'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']
    }
)
df5
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
pd.merge(df1, df5)
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization

Exercises#

In this practice exercise, we’ll investigate different foods’ carbon footprints. We’ll leverage a dataset compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Import the dataset as a data frame named df from this URL:

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()
country food_category consumption co2_emmission
0 Argentina Pork 10.51 37.20
1 Argentina Poultry 38.66 41.53
2 Argentina Beef 55.48 1712.00
3 Argentina Lamb & Goat 1.56 54.63
4 Argentina Fish 4.36 6.96

What is the maximum co2_emmission in the dataset and which food type and country does it belong to?#

# TODO: your answer here

How many countries produce more than 1000 Kg CO2/person/year for at least one food type?#

# TODO: your answer here

What is the total emissions of all other (non-meat) products in the dataset combined?#

# TODO: your answer here

Resources#

Tom

TomasBeuzen. Tomasbeuzen/python-programming-for-data-science: content from the university of british columbia's master of data science course dsci 511. URL: https://github.com/TomasBeuzen/python-programming-for-data-science.

Van17

Jacob T. Vanderplas. Python Data Science Handbook: Essential Tools for working with data. O'Reilly, 2017.