Acquiring Data
Most analyses start by importing data into your environment. The data typically come from one of the following sources:

Local files (which you have already worked with)
Databases
APIs
✏️ The example is inspired by [Rei21].
Luckily for us, Python is well suited to accessing all of these data sources thanks to its wealth of open-source libraries. Once we have access, we can easily import the data into NumPy, pandas, or other libraries and apply what we have learned about data wrangling. The following basic examples should give you an idea.
Let’s start by importing pandas:
import pandas as pd
We have mostly worked with local files (CSV) so far, but a lot of business data is stored in databases. You usually need to know some SQL basics (which are out of the scope of this course) to access such data.
Getting data from a database into pandas is straightforward. As an example, let’s use SQLite, a file-based database engine that Python supports out of the box, together with the Chinook dataset:
# let's download data locally
!wget https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite -O ../../data/Chinook_Sqlite.sqlite
--2022-10-25 17:26:49-- https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response...
302 Found
Location: https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite [following]
--2022-10-25 17:26:49-- https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...
185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response...
200 OK
Length: 1067008 (1,0M) [application/octet-stream]
Saving to: ‘../../data/Chinook_Sqlite.sqlite’
../../data/Chinook_ 100%[===================>] 1,02M --.-KB/s in 0,1s
2022-10-25 17:26:50 (7,87 MB/s) - ‘../../data/Chinook_Sqlite.sqlite’ saved [1067008/1067008]
import sqlite3 as sql

# open a connection to the SQLite file downloaded above
conn = sql.connect('../../data/Chinook_Sqlite.sqlite')

# run a query and load the result directly into a DataFrame
df = pd.read_sql_query("SELECT * FROM Invoice LIMIT 25", conn)
df.head()
|   | InvoiceId | CustomerId | InvoiceDate | BillingAddress | BillingCity | BillingState | BillingCountry | BillingPostalCode | Total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2009-01-01 00:00:00 | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | 1.98 |
| 1 | 2 | 4 | 2009-01-02 00:00:00 | Ullevålsveien 14 | Oslo | None | Norway | 0171 | 3.96 |
| 2 | 3 | 8 | 2009-01-03 00:00:00 | Grétrystraat 63 | Brussels | None | Belgium | 1000 | 5.94 |
| 3 | 4 | 14 | 2009-01-06 00:00:00 | 8210 111 ST NW | Edmonton | AB | Canada | T6G 2C7 | 8.91 |
| 4 | 5 | 23 | 2009-01-11 00:00:00 | 69 Salem Street | Boston | MA | USA | 2113 | 13.86 |
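Two practical notes on this pattern: pass user-supplied values through the driver’s parameter binding (never string formatting, which invites SQL injection), and close the connection when you are done. Here is a minimal self-contained sketch; it builds a tiny in-memory table with hypothetical rows instead of using the Chinook file, so it runs anywhere:

```python
import sqlite3 as sql
import pandas as pd

# build a tiny in-memory database so the example needs no file
conn = sql.connect(":memory:")
conn.execute("CREATE TABLE Invoice (InvoiceId INTEGER, BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO Invoice VALUES (?, ?, ?)",
    [(1, "Germany", 1.98), (2, "Norway", 3.96), (3, "Germany", 5.94)],
)

# pass values via `params` -- pandas forwards them to the driver,
# which escapes them safely instead of splicing them into the SQL text
df = pd.read_sql_query(
    "SELECT * FROM Invoice WHERE BillingCountry = ? ORDER BY Total",
    conn,
    params=("Germany",),
)
conn.close()
print(df)
```

The same `params` argument works with the Chinook connection above; only the query and the bound values change.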
Connecting to an API can be a bit more challenging. The basic idea is to send a request (possibly including query parameters and access credentials) to an endpoint, which returns a response code plus the data you asked for. Let’s try a simple example:
import requests

# send a GET request; .json() decodes the response body into Python objects
response = requests.get("http://api.open-notify.org/astros.json")
print(response.json())
{'message': 'success', 'people': [{'name': 'Cai Xuzhe', 'craft': 'Tiangong'}, {'name': 'Chen Dong', 'craft': 'Tiangong'}, {'name': 'Liu Yang', 'craft': 'Tiangong'}, {'name': 'Sergey Prokopyev', 'craft': 'ISS'}, {'name': 'Dmitry Petelin', 'craft': 'ISS'}, {'name': 'Frank Rubio', 'craft': 'ISS'}, {'name': 'Nicole Mann', 'craft': 'ISS'}, {'name': 'Josh Cassada', 'craft': 'ISS'}, {'name': 'Koichi Wakata', 'craft': 'ISS'}, {'name': 'Anna Kikina', 'craft': 'ISS'}], 'number': 10}
res = pd.DataFrame(response.json()["people"])
res.head()
|   | name | craft |
|---|---|---|
| 0 | Cai Xuzhe | Tiangong |
| 1 | Chen Dong | Tiangong |
| 2 | Liu Yang | Tiangong |
| 3 | Sergey Prokopyev | ISS |
| 4 | Dmitry Petelin | ISS |
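Real-world API responses are often nested more deeply than this one; in that case, `pd.json_normalize` can flatten them into columns. A sketch on a hand-written payload (the structure and values below are hypothetical, not from a real endpoint):

```python
import json
import pandas as pd

# a hand-written payload shaped like a typical nested API response
payload = json.loads("""
{
  "status": "ok",
  "results": [
    {"name": "Cai Xuzhe", "mission": {"craft": "Tiangong", "days": 120}},
    {"name": "Frank Rubio", "mission": {"craft": "ISS", "days": 371}}
  ]
}
""")

# json_normalize flattens nested objects into dotted column names
df = pd.json_normalize(payload["results"])
print(df.columns.tolist())  # ['name', 'mission.craft', 'mission.days']
```

With the astronauts example above, `pd.DataFrame(...)` was enough because the records were already flat; `json_normalize` becomes useful once records contain nested objects or lists.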
Finally, let’s review some sources of valuable datasets that might help you with your projects. A few tips to get you started:
Kaggle is known for hosting machine learning and deep learning challenges, and with those challenges also come datasets.
.gov Datasets - many countries openly share their datasets with the public.
UCI Machine Learning Repository provides easy-to-use and cleaned datasets. We are also using some of their datasets during our lectures.
Dataset Subreddits might be another great source of data. People usually discuss the available datasets and how to use existing datasets for new tasks.
Another excellent community-driven source is Awesome Public Datasets on GitHub and its forks.
Resources
[Rei21] John Micah Reid. 13 ways to access data in Python. Oct 2021. URL: https://towardsdatascience.com/13-ways-to-access-data-in-python-bac5683e0063.