Acquiring Data
Most analyses start by importing data into your environment. The data typically come from one of the following sources:

Local files (which you have already worked with)
Databases
APIs
✏️ The example is inspired by [Rei21].
Luckily for us, Python is well suited to accessing all of these data sources thanks to its wealth of open-source libraries. Once we have access, we can easily import the data into NumPy, pandas, or other libraries and apply what we have learned about data wrangling. The following basic examples should give you an idea.
Let’s start by importing pandas:
import pandas as pd
We have mostly worked with local files (CSV) so far, but a lot of business data is stored in databases. You usually need to know some SQL basics (which are out of the scope of this course) to access such data.
Getting data from a database into pandas is straightforward. As an example, let’s use SQLite, a file-based database engine that Python supports out of the box, together with the Chinook dataset:
# let's download data locally
!wget https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite -O ../../data/Chinook_Sqlite.sqlite
--2022-10-25 17:26:49-- https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response...
302 Found
Location: https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite [following]
--2022-10-25 17:26:49-- https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...
185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response...
200 OK
Length: 1067008 (1,0M) [application/octet-stream]
Saving to: ‘../../data/Chinook_Sqlite.sqlite’
../../data/Chinook_ 100%[===================>] 1,02M --.-KB/s in 0,1s
2022-10-25 17:26:50 (7,87 MB/s) - ‘../../data/Chinook_Sqlite.sqlite’ saved [1067008/1067008]
import sqlite3 as sql

# open a connection to the SQLite file downloaded above
conn = sql.connect('../../data/Chinook_Sqlite.sqlite')

# run a query and load the result directly into a DataFrame
df = pd.read_sql_query("SELECT * FROM Invoice LIMIT 25", conn)
df.head()
|   | InvoiceId | CustomerId | InvoiceDate | BillingAddress | BillingCity | BillingState | BillingCountry | BillingPostalCode | Total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2009-01-01 00:00:00 | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | 1.98 |
| 1 | 2 | 4 | 2009-01-02 00:00:00 | Ullevålsveien 14 | Oslo | None | Norway | 0171 | 3.96 |
| 2 | 3 | 8 | 2009-01-03 00:00:00 | Grétrystraat 63 | Brussels | None | Belgium | 1000 | 5.94 |
| 3 | 4 | 14 | 2009-01-06 00:00:00 | 8210 111 ST NW | Edmonton | AB | Canada | T6G 2C7 | 8.91 |
| 4 | 5 | 23 | 2009-01-11 00:00:00 | 69 Salem Street | Boston | MA | USA | 2113 | 13.86 |
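Two practical notes on this pattern: pass user-supplied values through the driver’s parameter binding (never string formatting, which invites SQL injection), and close the connection when you are done. Here is a minimal self-contained sketch; it builds a tiny in-memory table with hypothetical rows instead of using the Chinook file, so it runs anywhere:

```python
import sqlite3 as sql
import pandas as pd

# build a tiny in-memory database so the example needs no file
conn = sql.connect(":memory:")
conn.execute("CREATE TABLE Invoice (InvoiceId INTEGER, BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO Invoice VALUES (?, ?, ?)",
    [(1, "Germany", 1.98), (2, "Norway", 3.96), (3, "Germany", 5.94)],
)

# pass values via `params` -- pandas forwards them to the driver,
# which escapes them safely instead of splicing them into the SQL text
df = pd.read_sql_query(
    "SELECT * FROM Invoice WHERE BillingCountry = ? ORDER BY Total",
    conn,
    params=("Germany",),
)
conn.close()
print(df)
```

The same `params` argument works with the Chinook connection above; only the query and the bound values change.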
Connecting to an API can be a bit more challenging. The basic idea is to send a request (possibly including query parameters and access credentials) to an endpoint, which returns a response code plus the data you asked for. Let’s try a simple example:
import requests

# send a GET request; .json() decodes the response body into Python objects
response = requests.get("http://api.open-notify.org/astros.json")
print(response.json())
{'message': 'success', 'people': [{'name': 'Cai Xuzhe', 'craft': 'Tiangong'}, {'name': 'Chen Dong', 'craft': 'Tiangong'}, {'name': 'Liu Yang', 'craft': 'Tiangong'}, {'name': 'Sergey Prokopyev', 'craft': 'ISS'}, {'name': 'Dmitry Petelin', 'craft': 'ISS'}, {'name': 'Frank Rubio', 'craft': 'ISS'}, {'name': 'Nicole Mann', 'craft': 'ISS'}, {'name': 'Josh Cassada', 'craft': 'ISS'}, {'name': 'Koichi Wakata', 'craft': 'ISS'}, {'name': 'Anna Kikina', 'craft': 'ISS'}], 'number': 10}
res = pd.DataFrame(response.json()["people"])
res.head()
|   | name | craft |
|---|---|---|
| 0 | Cai Xuzhe | Tiangong |
| 1 | Chen Dong | Tiangong |
| 2 | Liu Yang | Tiangong |
| 3 | Sergey Prokopyev | ISS |
| 4 | Dmitry Petelin | ISS |
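Real-world API responses are often nested more deeply than this one; in that case, `pd.json_normalize` can flatten them into columns. A sketch on a hand-written payload (the structure and values below are hypothetical, not from a real endpoint):

```python
import json
import pandas as pd

# a hand-written payload shaped like a typical nested API response
payload = json.loads("""
{
  "status": "ok",
  "results": [
    {"name": "Cai Xuzhe", "mission": {"craft": "Tiangong", "days": 120}},
    {"name": "Frank Rubio", "mission": {"craft": "ISS", "days": 371}}
  ]
}
""")

# json_normalize flattens nested objects into dotted column names
df = pd.json_normalize(payload["results"])
print(df.columns.tolist())  # ['name', 'mission.craft', 'mission.days']
```

With the astronauts example above, `pd.DataFrame(...)` was enough because the records were already flat; `json_normalize` becomes useful once records contain nested objects or lists.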
Finally, let’s review some sources of valuable datasets that might help you with your projects. A few tips to get you started:
Kaggle is known for hosting machine learning and deep learning challenges, and with those challenges also come datasets.
.gov Datasets - many countries openly share their datasets with the public.
UCI Machine Learning Repository provides easy-to-use and cleaned datasets. We are also using some of their datasets during our lectures.
Dataset Subreddits might be another great source of data. People usually discuss the available datasets and how to use existing datasets for new tasks.
Another excellent community-driven source is Awesome Public Datasets on GitHub and its forks.
Resources
[Rei21] John Micah Reid. 13 ways to access data in Python. Oct 2021. URL: https://towardsdatascience.com/13-ways-to-access-data-in-python-bac5683e0063.