Acquiring Data#

Most analyses start by importing data into your environment. Data typically come from one of the following sources:

  • Local files (which you have already worked with)

  • Databases

  • APIs

✏️ The example is inspired by [Rei21].

Luckily for us, Python is very capable of accessing all of these data sources thanks to a wealth of open-source libraries. Once we have access, we can easily import the data into NumPy, pandas, or other libraries and apply what we have learned about data wrangling. The following basic examples will give you an idea.

Let’s start by importing pandas:

import pandas as pd

We have been working primarily with local files (CSV) so far, but a lot of business data is stored in databases. You usually need to know some basics of SQL (which is outside the scope of this course) to access it.
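As a quick reminder of the local-file case, reading a CSV into pandas is a one-liner. The sketch below uses `io.StringIO` to simulate a file so it runs anywhere; with a real file you would pass a path such as `pd.read_csv("data/cities.csv")` instead (the file name and contents here are made up for illustration):

```python
import io
import pandas as pd

# Simulated CSV file; a real path like "data/cities.csv" works the same way.
csv_data = io.StringIO("city,population\nPrague,1300000\nBrno,380000\n")
df = pd.read_csv(csv_data)
print(df)
```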

It is pretty easy to get data from a database into pandas. As an example, let's use SQLite, a file-based database engine whose driver ships with Python's standard library, together with the Chinook sample dataset:

# let's download data locally
!wget https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite -O ../../data/Chinook_Sqlite.sqlite
--2022-10-25 17:26:49--  https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 
302 Found
Location: https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite [following]
--2022-10-25 17:26:49--  https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 
185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 
200 OK
Length: 1067008 (1,0M) [application/octet-stream]
Saving to: ‘../../data/Chinook_Sqlite.sqlite’



2022-10-25 17:26:50 (7,87 MB/s) - ‘../../data/Chinook_Sqlite.sqlite’ saved [1067008/1067008]
import sqlite3 as sql

# open a connection to the downloaded SQLite file
conn = sql.connect('../../data/Chinook_Sqlite.sqlite')
# run a query and load the result directly into a DataFrame
df = pd.read_sql_query("SELECT * FROM Invoice LIMIT 25", conn)
df.head()
   InvoiceId  CustomerId          InvoiceDate           BillingAddress BillingCity BillingState BillingCountry BillingPostalCode  Total
0          1           2  2009-01-01 00:00:00  Theodor-Heuss-Straße 34   Stuttgart         None        Germany             70174   1.98
1          2           4  2009-01-02 00:00:00         Ullevålsveien 14        Oslo         None         Norway              0171   3.96
2          3           8  2009-01-03 00:00:00          Grétrystraat 63    Brussels         None        Belgium              1000   5.94
3          4          14  2009-01-06 00:00:00           8210 111 ST NW    Edmonton           AB         Canada           T6G 2C7   8.91
4          5          23  2009-01-11 00:00:00          69 Salem Street      Boston           MA            USA              2113  13.86
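One tip worth knowing: when a query depends on user input, pass the values through the `params` argument of `pd.read_sql_query` instead of formatting them into the SQL string, so the driver escapes them for you. The sketch below builds a tiny in-memory table (its contents are made up for illustration) so it runs without the Chinook download:

```python
import sqlite3 as sql
import pandas as pd

# In-memory database with a tiny Invoice-like table, so the snippet
# runs without downloading anything.
conn = sql.connect(":memory:")
conn.execute("CREATE TABLE Invoice (InvoiceId INTEGER, BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO Invoice VALUES (?, ?, ?)",
    [(1, "Germany", 1.98), (2, "Norway", 3.96), (3, "Germany", 5.94)],
)

# `?` placeholders are filled from `params` -- safe against SQL injection.
df = pd.read_sql_query(
    "SELECT * FROM Invoice WHERE BillingCountry = ? ORDER BY Total",
    conn,
    params=("Germany",),
)
print(df)
conn.close()
```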

Connecting to an API might be a bit more challenging. The basic idea is to send a request (which may include query parameters and access credentials) to an endpoint, which returns a response code plus the data you asked for. Let's try a simple example:

import requests
response = requests.get("http://api.open-notify.org/astros.json")
print(response.json())
{'message': 'success', 'people': [{'name': 'Cai Xuzhe', 'craft': 'Tiangong'}, {'name': 'Chen Dong', 'craft': 'Tiangong'}, {'name': 'Liu Yang', 'craft': 'Tiangong'}, {'name': 'Sergey Prokopyev', 'craft': 'ISS'}, {'name': 'Dmitry Petelin', 'craft': 'ISS'}, {'name': 'Frank Rubio', 'craft': 'ISS'}, {'name': 'Nicole Mann', 'craft': 'ISS'}, {'name': 'Josh Cassada', 'craft': 'ISS'}, {'name': 'Koichi Wakata', 'craft': 'ISS'}, {'name': 'Anna Kikina', 'craft': 'ISS'}], 'number': 10}
res = pd.DataFrame(response.json()["people"])
res.head()
               name     craft
0         Cai Xuzhe  Tiangong
1         Chen Dong  Tiangong
2          Liu Yang  Tiangong
3  Sergey Prokopyev       ISS
4    Dmitry Petelin       ISS
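The `astros.json` endpoint needs no parameters or credentials, but most real APIs do. With `requests` you pass query parameters as a `params` dict and credentials in `headers`, and the library encodes them into the URL for you. The endpoint, parameters, and token below are hypothetical; we only *prepare* the request to inspect the resulting URL, so no network access is needed:

```python
import requests

# Hypothetical endpoint, parameters, and token -- for illustration only.
req = requests.Request(
    "GET",
    "http://api.example.com/v1/data",
    params={"lat": 50.08, "lon": 14.42},
    headers={"Authorization": "Bearer <your-token>"},  # placeholder credential
)
prepared = req.prepare()
print(prepared.url)
# An actual call would be requests.get(url, params=..., headers=..., timeout=10)
```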

Finally, let’s review some sources with valuable datasets that might help you with your projects. We will list just a few tips to get you started:

  • Kaggle is known for hosting machine learning and deep learning challenges, and with those challenges also come datasets.

  • .gov Datasets - many countries openly share their datasets with the public.

  • UCI Machine Learning Repository provides easy-to-use and cleaned datasets. We are also using some of their datasets during our lectures.

  • Dataset Subreddits might be another great source of data. People usually discuss the available datasets and how to use existing datasets for new tasks.

  • Another excellent community-driven source is Awesome Public Datasets on GitHub and its forks.

Resources#

Rei21

John Micah Reid. 13 Ways to Access Data in Python. Oct 2021. URL: https://towardsdatascience.com/13-ways-to-access-data-in-python-bac5683e0063.