Introduction to Pandas#

Pandas is the most popular Python library for tabular data structures. You can think of Pandas as a compelling version of Excel.

Pandas allows us to work with the less structured data available in many forms. Pandas introduces more flexibility (e.g., attaching labels to data, working with missing data, etc.) and brings additional operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.)

Pandas Series and DataFrame objects build on the NumPy array structure and provide efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

✏️ The example is inspired by [Tom] and [Van17].

Let’s import pandas with the alias pd:

import numpy as np
import pandas as pd

pd.__version__

'1.5.1'

Pandas Series#

A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them.

Creating Series#

By default, Series are labeled with indices starting from 0. For example:

pd.Series(data=[-5, 1.3, 21, 6, 3])

  -5.0
   1.3
  21.0
   6.0
   3.0
dtype: float64

However, you can add a custom index:

pd.Series(data=[-5, 1.3, 21, 6, 3], index=['a', 'b', 'c', 'd', 'e'])

a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64

You can create a Series from a dictionary:

pd.Series(data={'a': 10, 'b': 20, 'c': 30})

a    10
b    20
c    30
dtype: int64

Or from an ndarray:

pd.Series(data=np.random.randn(3))

 -0.723800
 -0.730294
  0.397899
dtype: float64

Series Characteristics#

You can access the index labels of your Series using the .index attribute:

s = pd.Series(data=np.random.randn(3))
s.index

RangeIndex(start=0, stop=3, step=1)

You can access the underlying data array using .to_numpy():

s.to_numpy()

array([-2.87624963, -1.63405871, -0.88819511])

Indexing and Slicing Series#

Series are very much like ndarrays in this regard (in fact, Series can be passed to most NumPy functions):

s = pd.Series(data=range(5), index=['A', 'B', 'C', 'D', 'A'])
s

A    0
B    1
C    2
D    3
A    4
dtype: int64

s[0]

s[0:3]

A    0
B    1
C    2
dtype: int64

Series are also like dictionaries in that we can access values using index labels:

s[["B", "D", "C"]]

B    1
D    3
C    2
dtype: int64

"A" in s

True

Series do allow for non-unique indexing, but be careful because indexing operations won’t return unique values:

s["A"]

A    0
A    4
dtype: int64

Finally, we can also do boolean indexing with Series:

s[s >= 1]

B    1
C    2
D    3
A    4
dtype: int64

Series Operations#

Unlike ndarrays, operations between Series (+, -, /, *) align values based on their LABELS (not their position in the structure). The resulting index will be the sorted union of the two indexes. This gives you the flexibility to run operations on Series regardless of their labels.

s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])
s1

A    0
B    1
C    2
D    3
dtype: int64

s2 = pd.Series(data=range(10, 14), index=["B", "C", "D", "E"])
s2

B    10
C    11
D    12
E    13
dtype: int64

s1 + s2

A     NaN
B    11.0
C    13.0
D    15.0
E     NaN
dtype: float64

We can also perform standard operations on a series, like multiplying or squaring. NumPy also accepts Series as an argument to most functions because series are built off numpy arrays:

s1 ** 2

A    0
B    1
C    4
D    9
dtype: int64

np.exp(s1)

A     1.000000
B     2.718282
C     7.389056
D    20.085537
dtype: float64

Finally, like arrays, Series have many built-in methods for various operations:

s1.mean()

1.5

Pandas DataFrames#

DataFrames are just Series stuck together. Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data:

Creating DataFrames#

Dataframes can be created using pd.DataFrame(). Like series, index and column labels of dataframes are labeled starting from 0 by default:

pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

	0	1	2
0	1	2	3
1	4	5	6
2	7	8	9

We can use the index and columns arguments to give them labels:

pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=["R1", "R2", "R3"], columns=["C1", "C2", "C3"])

	C1	C2	C3
R1	1	2	3
R2	4	5	6
R3	7	8	9

There are so many ways to create dataframes. For example, you can use dictionary:

pd.DataFrame({"C1": [1, 2, 3], "C2": ['A', 'B', 'C']}, index=["R1", "R2", "R3"])

	C1	C2
R1	1	A
R2	2	B
R3	3	C

Indexing and Slicing#

There are several main ways to select data from a DataFrame:

Method	Syntax	Output
Select column	`df[col_label]`	Series
Select row slice	`df[row_1_int:row_2_int]`	DataFrame
Select row/column by label	`df.loc[row_label(s), col_label(s)]`	Object for single selection, Series for one row/column, otherwise DataFrame
Select row/column by integer	`df.iloc[row_int(s), col_int(s)]`	Object for single selection, Series for one row/column, otherwise DataFrame
Select by row integer & column label	`df.loc[df.index[row_int], col_label]`	Object for single selection, Series for one row/column, otherwise DataFrame
Select by row label & column integer	`df.loc[row_label, df.columns[col_int]]`	Object for single selection, Series for one row/column, otherwise DataFrame
Select by boolean	`df[bool_vec]`	Object for single selection, Series for one row/column, otherwise DataFrame

Indexing with []#

df = pd.DataFrame(
    {
        "Name": ["Tom", "Mike", "Tiffany"],
        "Language": ["Python", "Python", "R"],
        "Courses": [5, 4, 7]
    }
)
df

	Name	Language	Courses
0	Tom	Python	5
1	Mike	Python	4
2	Tiffany	R	7

Select columns by single labels, lists of labels, or slices:

df['Name']

      Tom
     Mike
  Tiffany
Name: Name, dtype: object

df[['Name', 'Language']]

	Name	Language
0	Tom	Python
1	Mike	Python
2	Tiffany	R

df[df['Courses'] > 5]

	Name	Language	Courses
2	Tiffany	R	7

df[df['Name'] == "Tom"]

	Name	Language	Courses
0	Tom	Python	5

df[(df['Courses'] > 4) & (df['Language'] == 'Python')]

	Name	Language	Courses
0	Tom	Python	5

Indexing with .loc and .iloc#

Pandas created the methods .loc[] and .iloc[] as more flexible alternatives for accessing data from a dataframe. Use df.iloc[] for indexing with integers and df.loc[] for indexing with labels.

# returns a series
df.iloc[0]

Name           Tom
Language    Python
Courses          5
Name: 0, dtype: object

# slicing returns a dataframe
df.iloc[0:2]

	Name	Language	Courses
0	Tom	Python	5
1	Mike	Python	4

# returns the indexed object
df.iloc[2, 1]

'R'

Now let’s look at .loc which accepts labels as references to rows/columns:

df.loc[:, 'Name']

      Tom
     Mike
  Tiffany
Name: Name, dtype: object

Reading/Writing Data From External Sources#

Pandas facilitates reading directly from a url - pd.read_csv() accepts urls as input:

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()

	country	food_category	consumption	co2_emmission
0	Argentina	Pork	10.51	37.20
1	Argentina	Poultry	38.66	41.53
2	Argentina	Beef	55.48	1712.00
3	Argentina	Lamb & Goat	1.56	54.63
4	Argentina	Fish	4.36	6.96

Notice that we are using function .head() to visualize just a fraction of rows taken from the top.

Let’s save data to a .csv file. You will likely be loading local .csv files more often.

out_path = '../../data/food_consumption.csv'
df.to_csv(out_path, sep='\t', encoding='utf-8', index=False)

Let’s read it back:

df = pd.read_csv(out_path, index_col=0, parse_dates=True, sep='\t')
df.head()

	food_category	consumption	co2_emmission
country
Argentina	Pork	10.51	37.20
Argentina	Poultry	38.66	41.53
Argentina	Beef	55.48	1712.00
Argentina	Lamb & Goat	1.56	54.63
Argentina	Fish	4.36	6.96

DataFrames Operations#

Similar to Series, DataFrames have built-in functions for performing most common operations, e.g., .max(), .idxmin(), .sort_values(), .value_counts(), etc. They’re all documented in the Pandas documentation.

Let’s try a few of them:

df.max()

food_category    Wheat and Wheat Products
consumption                        430.76
co2_emmission                      1712.0
dtype: object

df['co2_emmission'].max()

1712.0

df['co2_emmission'].idxmin()

'Argentina'

df.sort_values(by='co2_emmission').head()

	food_category	consumption	co2_emmission
country
Togo	Soybeans	0.01	0.0
Paraguay	Soybeans	0.00	0.0
Saudi Arabia	Pork	0.00	0.0
Cambodia	Lamb & Goat	0.00	0.0
Croatia	Soybeans	0.00	0.0

df['food_category'].value_counts()

Pork                        130
Poultry                     130
Beef                        130
Lamb & Goat                 130
Fish                        130
Eggs                        130
Milk - inc. cheese          130
Wheat and Wheat Products    130
Rice                        130
Soybeans                    130
Nuts inc. Peanut Butter     130
Name: food_category, dtype: int64

Applying Custom Functions#

There might be cases when you want to apply a function not built-in to Pandas. For this, we also have methods:

df.apply(), applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)
df.applymap(), applies a function element-wise (for functions that accept/return single values at a time)
series.apply()/series.map(), same as above but for Pandas series

df[['consumption', 'co2_emmission']].apply(np.sqrt)

	consumption	co2_emmission
country
Argentina	3.241913	6.099180
Argentina	6.217717	6.444377
Argentina	7.448490	41.376322
Argentina	1.249000	7.391211
Argentina	2.088061	2.638181
...	...	...
Bangladesh	4.680812	5.586591
Bangladesh	4.179713	1.824829
Bangladesh	13.104579	14.824304
Bangladesh	0.781025	0.519615
Bangladesh	0.848528	1.126943

1430 rows × 2 columns

def length_of_category(x):
    return len(x)


df[['food_category']].applymap(length_of_category)

	food_category
country
Argentina	4
Argentina	7
Argentina	4
Argentina	11
Argentina	4
...	...
Bangladesh	18
Bangladesh	24
Bangladesh	4
Bangladesh	8
Bangladesh	23

1430 rows × 1 columns

Grouping#

Often we are interested in examining specific groups in our data. df.groupby() allows us to group our data based on a variable(s):

dfg = df.groupby(by='food_category')

The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to:

dfg.groups

{'Beef': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Eggs': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Fish': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Lamb & Goat': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Milk - inc. cheese': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Nuts inc. Peanut Butter': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Pork': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Poultry': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Rice': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Soybeans': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Wheat and Wheat Products': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...]}

Let’s access a group using the .get_group() method:

dfg.get_group('Beef').head()

	food_category	consumption	co2_emmission
country
Argentina	Beef	55.48	1712.00
Australia	Beef	33.86	1044.85
Albania	Beef	22.50	694.30
Iceland	Beef	13.36	412.26
New Zealand	Beef	22.49	693.99

The usual thing to do however, is to apply aggregate functions to the groupby object:

dfg.mean()

	consumption	co2_emmission
food_category
Beef	12.123385	374.102000
Eggs	8.163769	7.499615
Fish	17.287077	27.601692
Lamb & Goat	2.600154	91.056769
Milk - inc. cheese	125.774692	179.153846
Nuts inc. Peanut Butter	4.137231	7.323000
Pork	16.123692	57.070077
Poultry	21.219231	22.793538
Rice	29.375154	37.591615
Soybeans	0.860538	0.387308
Wheat and Wheat Products	71.549538	13.644462

We can apply multiple functions using .aggregate():

dfg.aggregate(['mean', 'sum', 'count'])

	consumption			co2_emmission
	mean	sum	count	mean	sum	count
food_category
Beef	12.123385	1576.04	130	374.102000	48633.26	130
Eggs	8.163769	1061.29	130	7.499615	974.95	130
Fish	17.287077	2247.32	130	27.601692	3588.22	130
Lamb & Goat	2.600154	338.02	130	91.056769	11837.38	130
Milk - inc. cheese	125.774692	16350.71	130	179.153846	23290.00	130
Nuts inc. Peanut Butter	4.137231	537.84	130	7.323000	951.99	130
Pork	16.123692	2096.08	130	57.070077	7419.11	130
Poultry	21.219231	2758.50	130	22.793538	2963.16	130
Rice	29.375154	3818.77	130	37.591615	4886.91	130
Soybeans	0.860538	111.87	130	0.387308	50.35	130
Wheat and Wheat Products	71.549538	9301.44	130	13.644462	1773.78	130

Combining Datasets#

Some of the most exciting analyses come from combining different data sources. These operations can involve a straightforward concatenation of two datasets to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.

Simple Concatenation with pd.concat#

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate:

ser1 = pd.Series(['A', 'B', 'C'])
ser2 = pd.Series(['D', 'E', 'F'])
pd.concat([ser1, ser2])

  A
  B
  C
  D
  E
  F
dtype: object

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0):

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pd.concat([df1, df2])

	0	1	2
0	1	2	3
1	4	5	6
2	7	8	9
0	1	2	3
1	4	5	6
2	7	8	9

You can also concatenate columns:

pd.concat([df1, df2], axis=1)

	0	1	2	0	1	2
0	1	2	3	1	2	3
1	4	5	6	4	5	6
2	7	8	9	7	8	9

Merge and Join#

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function:

df1 = pd.DataFrame(
    {
        'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
    }
)
df1

	employee	group
0	Bob	Accounting
1	Jake	Engineering
2	Lisa	Engineering
3	Sue	HR

df2 = pd.DataFrame(
    {
        'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
        'hire_date': [2004, 2008, 2012, 2014]
    }
)
df2

	employee	hire_date
0	Lisa	2004
1	Bob	2008
2	Jake	2012
3	Sue	2014

One-to-one joins#

Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation:

df3 = pd.merge(df1, df2)
df3

	employee	group	hire_date
0	Bob	Accounting	2008
1	Jake	Engineering	2012
2	Lisa	Engineering	2004
3	Sue	HR	2014

Many-to-one joins#

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:

df4 = pd.DataFrame(
    {
        'group': ['Accounting', 'Engineering', 'HR'],
        'supervisor': ['Carly', 'Guido', 'Steve']
    }
)
df4

	group	supervisor
0	Accounting	Carly
1	Engineering	Guido
2	HR	Steve

pd.merge(df3, df4)

	employee	group	hire_date	supervisor
0	Bob	Accounting	2008	Carly
1	Jake	Engineering	2012	Guido
2	Lisa	Engineering	2004	Guido
3	Sue	HR	2014	Steve

Many-to-many joins#

If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.

df5 = pd.DataFrame(
    {
        'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
        'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']
    }
)
df5

	group	skills
0	Accounting	math
1	Accounting	spreadsheets
2	Engineering	coding
3	Engineering	linux
4	HR	spreadsheets
5	HR	organization

pd.merge(df1, df5)

	employee	group	skills
0	Bob	Accounting	math
1	Bob	Accounting	spreadsheets
2	Jake	Engineering	coding
3	Jake	Engineering	linux
4	Lisa	Engineering	coding
5	Lisa	Engineering	linux
6	Sue	HR	spreadsheets
7	Sue	HR	organization

Exercises#

In this practice exercise, we’ll investigate different foods’ carbon footprints. We’ll leverage a dataset compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Import the dataset as a data frame named df from this URL:

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()

	country	food_category	consumption	co2_emmission
0	Argentina	Pork	10.51	37.20
1	Argentina	Poultry	38.66	41.53
2	Argentina	Beef	55.48	1712.00
3	Argentina	Lamb & Goat	1.56	54.63
4	Argentina	Fish	4.36	6.96

What is the maximum co2_emmission in the dataset and which food type and country does it belong to?#

# TODO: your answer here

How many countries produce more than 1000 Kg CO2/person/year for at least one food type?#

# TODO: your answer here

What is the total emissions of all other (non-meat) products in the dataset combined?#

# TODO: your answer here

Resources#

Tom: TomasBeuzen. Tomasbeuzen/python-programming-for-data-science: content from the university of british columbia's master of data science course dsci 511. URL: https://github.com/TomasBeuzen/python-programming-for-data-science.
Van17: Jacob T. Vanderplas. Python Data Science Handbook: Essential Tools for working with data. O'Reilly, 2017.

Data Science Academy

Introduction to Pandas

Contents