Introduction to Pandas
Contents
Introduction to Pandas#
Pandas is the most popular Python library for tabular data structures. You can think of Pandas as a compelling version of Excel.
Pandas allows us to work with the less structured data available in many forms. Pandas introduces more flexibility (e.g., attaching labels to data, working with missing data, etc.) and brings additional operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.)
Pandas Series and DataFrame objects build on the NumPy array structure and provide efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
Let’s import pandas with the alias pd:
import numpy as np
import pandas as pd
pd.__version__
'1.5.1'
Pandas Series#
A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them.
Creating Series#
By default, Series are labeled with indices starting from 0. For example:
pd.Series(data=[-5, 1.3, 21, 6, 3])
0 -5.0
1 1.3
2 21.0
3 6.0
4 3.0
dtype: float64
However, you can add a custom index:
pd.Series(data=[-5, 1.3, 21, 6, 3], index=['a', 'b', 'c', 'd', 'e'])
a -5.0
b 1.3
c 21.0
d 6.0
e 3.0
dtype: float64
You can create a Series from a dictionary:
pd.Series(data={'a': 10, 'b': 20, 'c': 30})
a 10
b 20
c 30
dtype: int64
Or from an ndarray:
pd.Series(data=np.random.randn(3))
0 -0.723800
1 -0.730294
2 0.397899
dtype: float64
Series Characteristics#
You can access the index labels of your Series using the .index attribute:
s = pd.Series(data=np.random.randn(3))
s.index
RangeIndex(start=0, stop=3, step=1)
You can access the underlying data array using .to_numpy():
s.to_numpy()
array([-2.87624963, -1.63405871, -0.88819511])
Indexing and Slicing Series#
Series are very much like ndarrays in this regard (in fact, Series can be passed to most NumPy functions):
s = pd.Series(data=range(5), index=['A', 'B', 'C', 'D', 'A'])
s
A 0
B 1
C 2
D 3
A 4
dtype: int64
s[0]
0
s[0:3]
A 0
B 1
C 2
dtype: int64
Series are also like dictionaries in that we can access values using index labels:
s[["B", "D", "C"]]
B 1
D 3
C 2
dtype: int64
"A" in s
True
Series do allow for non-unique indexing, but be careful because indexing operations won’t return unique values:
s["A"]
A 0
A 4
dtype: int64
Finally, we can also do boolean indexing with Series:
s[s >= 1]
B 1
C 2
D 3
A 4
dtype: int64
Series Operations#
Unlike ndarrays, operations between Series (+, -, /, *) align values based on their LABELS (not their position in the structure). The resulting index will be the sorted union of the two indexes. This gives you the flexibility to run operations on Series regardless of their labels.
s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])
s1
A 0
B 1
C 2
D 3
dtype: int64
s2 = pd.Series(data=range(10, 14), index=["B", "C", "D", "E"])
s2
B 10
C 11
D 12
E 13
dtype: int64
s1 + s2
A NaN
B 11.0
C 13.0
D 15.0
E NaN
dtype: float64
We can also perform standard operations on a series, like multiplying or squaring. NumPy also accepts Series as an argument to most functions because series are built off numpy arrays:
s1 ** 2
A 0
B 1
C 4
D 9
dtype: int64
np.exp(s1)
A 1.000000
B 2.718282
C 7.389056
D 20.085537
dtype: float64
Finally, like arrays, Series have many built-in methods for various operations:
s1.mean()
1.5
Pandas DataFrames#
DataFrames are just Series stuck together. Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data:
Creating DataFrames#
Dataframes can be created using pd.DataFrame(). Like series, index and column labels of dataframes are labeled starting from 0 by default:
pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
We can use the index and columns arguments to give them labels:
pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=["R1", "R2", "R3"], columns=["C1", "C2", "C3"])
C1 | C2 | C3 | |
---|---|---|---|
R1 | 1 | 2 | 3 |
R2 | 4 | 5 | 6 |
R3 | 7 | 8 | 9 |
There are so many ways to create dataframes. For example, you can use dictionary:
pd.DataFrame({"C1": [1, 2, 3], "C2": ['A', 'B', 'C']}, index=["R1", "R2", "R3"])
C1 | C2 | |
---|---|---|
R1 | 1 | A |
R2 | 2 | B |
R3 | 3 | C |
Indexing and Slicing#
There are several main ways to select data from a DataFrame:
Method |
Syntax |
Output |
---|---|---|
Select column |
|
Series |
Select row slice |
|
DataFrame |
Select row/column by label |
|
Object for single selection, Series for one row/column, otherwise DataFrame |
Select row/column by integer |
|
Object for single selection, Series for one row/column, otherwise DataFrame |
Select by row integer & column label |
|
Object for single selection, Series for one row/column, otherwise DataFrame |
Select by row label & column integer |
|
Object for single selection, Series for one row/column, otherwise DataFrame |
Select by boolean |
|
Object for single selection, Series for one row/column, otherwise DataFrame |
Indexing with []#
df = pd.DataFrame(
{
"Name": ["Tom", "Mike", "Tiffany"],
"Language": ["Python", "Python", "R"],
"Courses": [5, 4, 7]
}
)
df
Name | Language | Courses | |
---|---|---|---|
0 | Tom | Python | 5 |
1 | Mike | Python | 4 |
2 | Tiffany | R | 7 |
Select columns by single labels, lists of labels, or slices:
df['Name']
0 Tom
1 Mike
2 Tiffany
Name: Name, dtype: object
df[['Name', 'Language']]
Name | Language | |
---|---|---|
0 | Tom | Python |
1 | Mike | Python |
2 | Tiffany | R |
df[df['Courses'] > 5]
Name | Language | Courses | |
---|---|---|---|
2 | Tiffany | R | 7 |
df[df['Name'] == "Tom"]
Name | Language | Courses | |
---|---|---|---|
0 | Tom | Python | 5 |
df[(df['Courses'] > 4) & (df['Language'] == 'Python')]
Name | Language | Courses | |
---|---|---|---|
0 | Tom | Python | 5 |
Indexing with .loc and .iloc#
Pandas created the methods .loc[] and .iloc[] as more flexible alternatives for accessing data from a dataframe. Use df.iloc[] for indexing with integers and df.loc[] for indexing with labels.
# returns a series
df.iloc[0]
Name Tom
Language Python
Courses 5
Name: 0, dtype: object
# slicing returns a dataframe
df.iloc[0:2]
Name | Language | Courses | |
---|---|---|---|
0 | Tom | Python | 5 |
1 | Mike | Python | 4 |
# returns the indexed object
df.iloc[2, 1]
'R'
Now let’s look at .loc which accepts labels as references to rows/columns:
df.loc[:, 'Name']
0 Tom
1 Mike
2 Tiffany
Name: Name, dtype: object
Reading/Writing Data From External Sources#
Pandas facilitates reading directly from a url - pd.read_csv()
accepts urls as input:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()
country | food_category | consumption | co2_emmission | |
---|---|---|---|---|
0 | Argentina | Pork | 10.51 | 37.20 |
1 | Argentina | Poultry | 38.66 | 41.53 |
2 | Argentina | Beef | 55.48 | 1712.00 |
3 | Argentina | Lamb & Goat | 1.56 | 54.63 |
4 | Argentina | Fish | 4.36 | 6.96 |
Notice that we are using function .head() to visualize just a fraction of rows taken from the top.
Let’s save data to a .csv file. You will likely be loading local .csv files more often.
out_path = '../../data/food_consumption.csv'
df.to_csv(out_path, sep='\t', encoding='utf-8', index=False)
Let’s read it back:
df = pd.read_csv(out_path, index_col=0, parse_dates=True, sep='\t')
df.head()
food_category | consumption | co2_emmission | |
---|---|---|---|
country | |||
Argentina | Pork | 10.51 | 37.20 |
Argentina | Poultry | 38.66 | 41.53 |
Argentina | Beef | 55.48 | 1712.00 |
Argentina | Lamb & Goat | 1.56 | 54.63 |
Argentina | Fish | 4.36 | 6.96 |
DataFrames Operations#
Similar to Series, DataFrames have built-in functions for performing most common operations, e.g., .max(), .idxmin(), .sort_values(), .value_counts(), etc. They’re all documented in the Pandas documentation.
Let’s try a few of them:
df.max()
food_category Wheat and Wheat Products
consumption 430.76
co2_emmission 1712.0
dtype: object
df['co2_emmission'].max()
1712.0
df['co2_emmission'].idxmin()
'Argentina'
df.sort_values(by='co2_emmission').head()
food_category | consumption | co2_emmission | |
---|---|---|---|
country | |||
Togo | Soybeans | 0.01 | 0.0 |
Paraguay | Soybeans | 0.00 | 0.0 |
Saudi Arabia | Pork | 0.00 | 0.0 |
Cambodia | Lamb & Goat | 0.00 | 0.0 |
Croatia | Soybeans | 0.00 | 0.0 |
df['food_category'].value_counts()
Pork 130
Poultry 130
Beef 130
Lamb & Goat 130
Fish 130
Eggs 130
Milk - inc. cheese 130
Wheat and Wheat Products 130
Rice 130
Soybeans 130
Nuts inc. Peanut Butter 130
Name: food_category, dtype: int64
Applying Custom Functions#
There might be cases when you want to apply a function not built-in to Pandas. For this, we also have methods:
df.apply()
, applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)df.applymap()
, applies a function element-wise (for functions that accept/return single values at a time)series.apply()
/series.map()
, same as above but for Pandas series
df[['consumption', 'co2_emmission']].apply(np.sqrt)
consumption | co2_emmission | |
---|---|---|
country | ||
Argentina | 3.241913 | 6.099180 |
Argentina | 6.217717 | 6.444377 |
Argentina | 7.448490 | 41.376322 |
Argentina | 1.249000 | 7.391211 |
Argentina | 2.088061 | 2.638181 |
... | ... | ... |
Bangladesh | 4.680812 | 5.586591 |
Bangladesh | 4.179713 | 1.824829 |
Bangladesh | 13.104579 | 14.824304 |
Bangladesh | 0.781025 | 0.519615 |
Bangladesh | 0.848528 | 1.126943 |
1430 rows × 2 columns
def length_of_category(x):
return len(x)
df[['food_category']].applymap(length_of_category)
food_category | |
---|---|
country | |
Argentina | 4 |
Argentina | 7 |
Argentina | 4 |
Argentina | 11 |
Argentina | 4 |
... | ... |
Bangladesh | 18 |
Bangladesh | 24 |
Bangladesh | 4 |
Bangladesh | 8 |
Bangladesh | 23 |
1430 rows × 1 columns
Grouping#
Often we are interested in examining specific groups in our data. df.groupby()
allows us to group our data based on a variable(s):
dfg = df.groupby(by='food_category')
The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to:
dfg.groups
{'Beef': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Eggs': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Fish': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Lamb & Goat': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Milk - inc. cheese': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Nuts inc. Peanut Butter': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Pork': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Poultry': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Rice': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Soybeans': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...], 'Wheat and Wheat Products': ['Argentina', 'Australia', 'Albania', 'Iceland', 'New Zealand', 'USA', 'Uruguay', 'Luxembourg', 'Brazil', 'Kazakhstan', 'Sweden', 'Bermuda', 'Denmark', 'Finland', 'Ireland', 'Greece', 'France', 'Canada', 'Norway', 'Hong Kong SAR. China', 'French Polynesia', 'Israel', 'Switzerland', 'Netherlands', 'Kuwait', 'United Kingdom', 'Austria', 'Oman', 'Italy', 'Bahamas', 'Portugal', 'Malta', 'Armenia', 'Slovenia', 'Chile', 'Venezuela', 'Belgium', 'Germany', 'Russia', 'Croatia', 'Belarus', 'Spain', 'Paraguay', 'New Caledonia', 'South Africa', 'Barbados', 'Lithuania', 'Turkey', 'Estonia', 'Mexico', 'Costa Rica', 'Bolivia', 'Ecuador', 'Panama', 'Czech Republic', 'Romania', 'Colombia', 'Maldives', 'Cyprus', 'Serbia', 'United Arab Emirates', 'Algeria', 'Ukraine', 'Pakistan', 'Swaziland', 'Latvia', 'Bosnia and Herzegovina', 'Fiji', 'South Korea', 'Poland', 'Saudi Arabia', 'Botswana', 'Macedonia', 'Hungary', 'Trinidad and Tobago', 'Tunisia', 'Egypt', 'Mauritius', 'Bulgaria', 'Morocco', 'Slovakia', 'Niger', 'Kenya', 'Jordan', 'Japan', 'Georgia', 'Grenada', 'El Salvador', 'Cuba', 'China', 'Honduras', 'Taiwan. ROC', 'Angola', 'Jamaica', 'Namibia', 'Belize', 'Malaysia', 'Zimbabwe', 'Guatemala', 'Uganda', ...]}
Let’s access a group using the .get_group() method:
dfg.get_group('Beef').head()
food_category | consumption | co2_emmission | |
---|---|---|---|
country | |||
Argentina | Beef | 55.48 | 1712.00 |
Australia | Beef | 33.86 | 1044.85 |
Albania | Beef | 22.50 | 694.30 |
Iceland | Beef | 13.36 | 412.26 |
New Zealand | Beef | 22.49 | 693.99 |
The usual thing to do however, is to apply aggregate functions to the groupby object:
dfg.mean()
consumption | co2_emmission | |
---|---|---|
food_category | ||
Beef | 12.123385 | 374.102000 |
Eggs | 8.163769 | 7.499615 |
Fish | 17.287077 | 27.601692 |
Lamb & Goat | 2.600154 | 91.056769 |
Milk - inc. cheese | 125.774692 | 179.153846 |
Nuts inc. Peanut Butter | 4.137231 | 7.323000 |
Pork | 16.123692 | 57.070077 |
Poultry | 21.219231 | 22.793538 |
Rice | 29.375154 | 37.591615 |
Soybeans | 0.860538 | 0.387308 |
Wheat and Wheat Products | 71.549538 | 13.644462 |
We can apply multiple functions using .aggregate():
dfg.aggregate(['mean', 'sum', 'count'])
consumption | co2_emmission | |||||
---|---|---|---|---|---|---|
mean | sum | count | mean | sum | count | |
food_category | ||||||
Beef | 12.123385 | 1576.04 | 130 | 374.102000 | 48633.26 | 130 |
Eggs | 8.163769 | 1061.29 | 130 | 7.499615 | 974.95 | 130 |
Fish | 17.287077 | 2247.32 | 130 | 27.601692 | 3588.22 | 130 |
Lamb & Goat | 2.600154 | 338.02 | 130 | 91.056769 | 11837.38 | 130 |
Milk - inc. cheese | 125.774692 | 16350.71 | 130 | 179.153846 | 23290.00 | 130 |
Nuts inc. Peanut Butter | 4.137231 | 537.84 | 130 | 7.323000 | 951.99 | 130 |
Pork | 16.123692 | 2096.08 | 130 | 57.070077 | 7419.11 | 130 |
Poultry | 21.219231 | 2758.50 | 130 | 22.793538 | 2963.16 | 130 |
Rice | 29.375154 | 3818.77 | 130 | 37.591615 | 4886.91 | 130 |
Soybeans | 0.860538 | 111.87 | 130 | 0.387308 | 50.35 | 130 |
Wheat and Wheat Products | 71.549538 | 9301.44 | 130 | 13.644462 | 1773.78 | 130 |
Combining Datasets#
Some of the most exciting analyses come from combining different data sources. These operations can involve a straightforward concatenation of two datasets to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.
Simple Concatenation with pd.concat#
Pandas has a function, pd.concat()
, which has a similar syntax to np.concatenate
:
ser1 = pd.Series(['A', 'B', 'C'])
ser2 = pd.Series(['D', 'E', 'F'])
pd.concat([ser1, ser2])
0 A
1 B
2 C
0 D
1 E
2 F
dtype: object
By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0):
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pd.concat([df1, df2])
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
You can also concatenate columns:
pd.concat([df1, df2], axis=1)
0 | 1 | 2 | 0 | 1 | 2 | |
---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 4 | 5 | 6 |
2 | 7 | 8 | 9 | 7 | 8 | 9 |
Merge and Join#
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge
function:
df1 = pd.DataFrame(
{
'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
}
)
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2 = pd.DataFrame(
{
'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]
}
)
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
One-to-one joins#
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation:
df3 = pd.merge(df1, df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
Many-to-one joins#
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:
df4 = pd.DataFrame(
{
'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']
}
)
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3, df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
Many-to-many joins#
If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.
df5 = pd.DataFrame(
{
'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']
}
)
df5
group | skills | |
---|---|---|
0 | Accounting | math |
1 | Accounting | spreadsheets |
2 | Engineering | coding |
3 | Engineering | linux |
4 | HR | spreadsheets |
5 | HR | organization |
pd.merge(df1, df5)
employee | group | skills | |
---|---|---|---|
0 | Bob | Accounting | math |
1 | Bob | Accounting | spreadsheets |
2 | Jake | Engineering | coding |
3 | Jake | Engineering | linux |
4 | Lisa | Engineering | coding |
5 | Lisa | Engineering | linux |
6 | Sue | HR | spreadsheets |
7 | Sue | HR | organization |
Exercises#
In this practice exercise, we’ll investigate different foods’ carbon footprints. We’ll leverage a dataset compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.
Import the dataset as a data frame named df from this URL:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()
country | food_category | consumption | co2_emmission | |
---|---|---|---|---|
0 | Argentina | Pork | 10.51 | 37.20 |
1 | Argentina | Poultry | 38.66 | 41.53 |
2 | Argentina | Beef | 55.48 | 1712.00 |
3 | Argentina | Lamb & Goat | 1.56 | 54.63 |
4 | Argentina | Fish | 4.36 | 6.96 |
What is the maximum co2_emmission in the dataset and which food type and country does it belong to?#
# TODO: your answer here
How many countries produce more than 1000 Kg CO2/person/year for at least one food type?#
# TODO: your answer here
What is the total emissions of all other (non-meat) products in the dataset combined?#
# TODO: your answer here
Resources#
- Tom
TomasBeuzen. Tomasbeuzen/python-programming-for-data-science: content from the university of british columbia's master of data science course dsci 511. URL: https://github.com/TomasBeuzen/python-programming-for-data-science.
- Van17
Jacob T. Vanderplas. Python Data Science Handbook: Essential Tools for working with data. O'Reilly, 2017.