Introduction to NumPy#

This chapter outlines techniques for loading, storing, and manipulating in-memory data in Python. Datasets come from many sources and formats, including collections of documents, images, sound clips, numerical measurements, or nearly anything else, so it helps to think of all data as arrays of numbers.

NumPy (short for Numerical Python) is usually the go-to library for efficient storage and manipulation of numerical arrays.

✏️ The example is inspired by [Tom] and [Van17].

Let’s import numpy with the alias np (to avoid having to type out n-u-m-p-y every time we want to use it):

import numpy as np

np.__version__
'1.23.4'

What are NumPy Arrays?#

NumPy arrays (“ndarrays”) are homogeneous data structures that can hold any of the basic Python data types (e.g., floats, integers, strings). Homogeneous means that every item in an array has the same type.

A NumPy array is somewhat like a list:

my_list = [1, 2, 3, 4, 5]
my_list
[1, 2, 3, 4, 5]
my_array = np.array([1, 2, 3, 4, 5])
my_array
array([1, 2, 3, 4, 5])

Notice that, unlike a list, an array can hold only a single type (it is homogeneous). Here the integer 1 is converted to a string so that both items share one type:

my_array = np.array([1, "hi"])
my_array
array(['1', 'hi'], dtype='<U21')
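The same conversion happens with purely numeric input. The sketch below is an illustration, not part of the original example; the variable names are made up for the purpose. It shows integers being stored as floats when mixed with a float, and how a type can be requested explicitly through the dtype argument.

```python
import numpy as np

# Mixing integers and floats: every element is stored as a float.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Mixing numbers and strings: every element is stored as a string.
mixed_text = np.array([1, "hi"])
print(mixed_text.dtype.kind)  # U (Unicode string)

# The dtype can also be requested explicitly.
as_float = np.array([1, 2, 3], dtype=float)
print(as_float)               # [1. 2. 3.]
```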

Creating arrays#

Arrays are typically created using two main methods:

  • From existing data (usually lists or tuples) using np.array(), like we saw above;

  • Using built-in functions such as np.arange(), np.linspace(), np.zeros(), etc.

We have tried the first option; let’s try the second one:

# from 1 inclusive to 5 exclusive
np.arange(1, 5)
array([1, 2, 3, 4])
# step by 2 from 0 inclusive to 11 exclusive
np.arange(0, 11, 2)
array([ 0,  2,  4,  6,  8, 10])
# 5 equally spaced points between 0 and 10
np.linspace(0, 10, 5)
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
# an array of zeros with size 2 x 3
np.zeros((2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])
# an array of the number 3.14 with size 3 x 3 x 3
np.full((3, 3, 3), 3.14)
array([[[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14],
        [3.14, 3.14, 3.14]]])
# random numbers uniformly distributed from 0 to 1 with size 5 x 2
np.random.rand(5, 2)
array([[0.044556  , 0.59609264],
       [0.84076988, 0.14259046],
       [0.23512234, 0.9688142 ],
       [0.59603948, 0.51595658],
       [0.31869661, 0.77102051]])
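Random arrays like the one above differ on every run. If repeatable output is needed, one option is NumPy's Generator interface with a fixed seed. This is a minimal sketch; the seed value 42 and the variable names are arbitrary choices made here for illustration.

```python
import numpy as np

# A seeded Generator makes the "random" numbers repeatable across runs.
rng = np.random.default_rng(seed=42)   # the seed value 42 is arbitrary
sample_a = rng.random((5, 2))          # uniform on [0, 1), shape 5 x 2

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=42)
sample_b = rng_again.random((5, 2))

print(np.array_equal(sample_a, sample_b))  # True
```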

Array Shapes#

As you have just seen, arrays can be of any dimension, shape, and size. Three main attributes describe the characteristics of an array:

  • .ndim: the number of dimensions of an array

  • .shape: the number of elements in each dimension (like calling len() on each dimension)

  • .size: the total number of elements in an array (i.e., the product of .shape)

array_2d = np.ones((3, 2))
print(f"Dimensions: {array_2d.ndim}")
print(f"Shape: {array_2d.shape}")
print(f"Size: {array_2d.size}")
Dimensions: 2
Shape: (3, 2)
Size: 6
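The relationship between the three attributes is easier to see on an array with more dimensions. The following sketch uses a hypothetical 3-D array and also shows that len() reports only the first dimension.

```python
import numpy as np

array_3d = np.zeros((2, 3, 4))

print(array_3d.ndim)    # 3
print(array_3d.shape)   # (2, 3, 4)
print(array_3d.size)    # 24

# .size is the product of the entries of .shape
print(array_3d.size == np.prod(array_3d.shape))   # True

# len() only reports the length of the first dimension
print(len(array_3d))    # 2
```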

Indexing and slicing#

Indexing and slicing arrays works much like indexing and slicing lists, except that arrays can have more dimensions:

x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[3]
3
x[2:5]
array([2, 3, 4])

With 2D arrays, give one index per dimension, separated by a comma:

x = np.random.randint(10, size=(4, 6))
x
array([[9, 6, 1, 5, 0, 3],
       [1, 6, 5, 7, 3, 4],
       [7, 6, 9, 5, 7, 4],
       [1, 3, 8, 6, 8, 5]])
x[3, 4]
8
x[3]
array([1, 3, 8, 6, 8, 5])
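Slices can be combined across dimensions as well. The sketch below uses a small deterministic array (named grid here purely for illustration) so that the results are easy to verify by eye.

```python
import numpy as np

grid = np.arange(24).reshape(4, 6)   # rows hold 0-5, 6-11, 12-17, 18-23

first_column = grid[:, 0]        # every row, column 0
last_row = grid[-1]              # negative indices count from the end
block = grid[1:3, 2:4]           # rows 1 and 2, columns 2 and 3

print(first_column)   # [ 0  6 12 18]
print(last_row)       # [18 19 20 21 22 23]
print(block)          # a 2 x 2 array with rows [8 9] and [14 15]
```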

It is also possible to index by boolean values:

x = np.random.rand(10)
x
array([0.16240937, 0.62720278, 0.37121073, 0.04824186, 0.13019158,
       0.9593788 , 0.76024558, 0.39715428, 0.66882447, 0.45200871])
x_thresh = x > 0.5
x_thresh
array([False,  True, False, False, False,  True,  True, False,  True,
       False])
x[x > 0.5] = 0.5
x
array([0.16240937, 0.5       , 0.37121073, 0.04824186, 0.13019158,
       0.5       , 0.5       , 0.39715428, 0.5       , 0.45200871])
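Boolean masks can also be combined, and np.where offers a way to apply a threshold without overwriting the original data. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([0.1, 0.6, 0.3, 0.9, 0.5])

# Combine conditions with & (and) and | (or); the parentheses are required.
in_range = (values > 0.2) & (values < 0.8)
print(values[in_range])            # [0.6 0.3 0.5]

# np.where builds a new array instead of modifying `values` in place.
capped = np.where(values > 0.5, 0.5, values)
print(capped)                      # [0.1 0.5 0.3 0.5 0.5]
print(values)                      # unchanged: [0.1 0.6 0.3 0.9 0.5]
```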

Comparing Arrays#

Determining if two arrays have the same shape and elements might be confusing. Let’s define our arrays first:

x = np.ones(5)
x
array([1., 1., 1., 1., 1.])
y = np.ones((1, 5))
y
array([[1., 1., 1., 1., 1.]])
z = np.ones((5, 1))
z
array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])
np.array_equal(x, x)
True
np.array_equal(x, y)
False
np.array_equal(x, z)
False
np.array_equal(y, z)
False

It is crucial to distinguish between different shapes as they affect operations.
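As a rough illustration of how shape affects operations, the sketch below adds the same three arrays together. NumPy combines them following the broadcasting rules covered later in this chapter, so the result shapes differ even though every array holds five ones.

```python
import numpy as np

x = np.ones(5)         # shape (5,)
y = np.ones((1, 5))    # shape (1, 5)
z = np.ones((5, 1))    # shape (5, 1)

print((x + y).shape)   # (1, 5)
print((x + z).shape)   # (5, 5)
print((y + z).shape)   # (5, 5)

# Reshaping to a common shape makes the arrays compare equal.
print(np.array_equal(x, y.reshape(5)))   # True
print(np.array_equal(x, z.ravel()))      # True
```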

Computation on Arrays#

Now, we will dive into why NumPy is so important in Python data science. Namely, it provides an easy and flexible interface to optimize computation with data arrays, such as:

  • Elementwise Operations

  • Aggregations

  • Broadcasting

Before we show you some basics, let’s have a motivational example. Computation on NumPy arrays can be speedy, or it can be slow. The key to making it fast is to use vectorized operations.

# DON'T DO THIS
array = np.array(range(5))
for i, element in enumerate(array):
    array[i] = element ** 2
array
array([ 0,  1,  4,  9, 16])
# DO THIS
array = np.array(range(5))
array **= 2
array
array([ 0,  1,  4,  9, 16])

Let’s actually compare those approaches:

# loop method
array = np.array(range(5))
time_loop = %timeit -q -o -r 3 for i, element in enumerate(array): array[i] = element ** 2

# vectorized method
time_vec = %timeit -q -o -r 3 array ** 2
print(f"Vectorized operation is {time_loop.average / time_vec.average:.2f}x faster than looping here.")
Vectorized operation is 3.52x faster than looping here.

The array here is tiny; the gap generally grows as arrays get larger, so imagine the difference on a vast dataset.
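The %timeit magic only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library timeit module. The function names and the array size below are assumptions made for this illustration, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

def square_with_loop(values):
    """Square each element one at a time in a Python loop."""
    result = np.empty_like(values)
    for i, element in enumerate(values):
        result[i] = element ** 2
    return result

def square_vectorized(values):
    """Square every element with a single NumPy operation."""
    return values ** 2

data = np.arange(100_000, dtype=np.float64)   # array size chosen arbitrarily

# Both approaches must give the same answer before timing means anything.
same_result = np.array_equal(square_with_loop(data), square_vectorized(data))
print(same_result)   # True

# Total time for 3 calls of each function, in seconds.
time_loop = timeit.timeit(lambda: square_with_loop(data), number=3)
time_vec = timeit.timeit(lambda: square_vectorized(data), number=3)
print(f"loop: {time_loop:.4f} s, vectorized: {time_vec:.4f} s")
```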

Elementwise Operations#

Elementwise operations refer to operations applied to each element of an array or between the paired elements of two arrays:

x = np.ones(4)
x
array([1., 1., 1., 1.])
y = x + 1
y
array([2., 2., 2., 2.])
x * y
array([2., 2., 2., 2.])
x == y
array([False, False, False, False])
np.array_equal(x, y)
False
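Because x and y above hold identical values in every position, the results all look alike. The sketch below repeats the idea with two hypothetical arrays whose elements differ, which makes the pairing of elements more visible.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

print(a - b)        # [0. 2. 6.]
print(a / b)        # [1. 2. 3.]
print(np.sqrt(a))   # [1. 2. 3.]
print(a > b)        # [False  True  True]

# Comparisons are elementwise; reduce them to get a single answer.
print((np.sqrt(a) == b).all())   # True
```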

Aggregations#

When faced with a large amount of data, a common first step is to compute summary statistics for it.

There are many aggregation functions, for example:

Function Name    Description
np.sum           Compute sum of elements
np.prod          Compute product of elements
np.mean          Compute mean of elements
np.std           Compute standard deviation
np.min           Find minimum value
np.max           Find maximum value
np.argmin        Find index of minimum value
np.argmax        Find index of maximum value
np.any           Evaluate whether any elements are true
np.all           Evaluate whether all elements are true
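A few of these functions in action, as a minimal sketch on a made-up 2 x 3 array. Most aggregations accept an axis argument: axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(np.sum(scores))            # 21
print(scores.sum(axis=0))        # [5 7 9]   (one value per column)
print(scores.sum(axis=1))        # [ 6 15]   (one value per row)
print(scores.mean())             # 3.5
print(np.argmax(scores))         # 5  (index into the flattened array)
print(np.any(scores > 5))        # True
print(np.all(scores > 5))        # False
```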

Let’s do some practice. Consider working out the hypotenuse of a right triangle with sides of 3 m and 4 m:

sides = np.array([3, 4])

There are several ways we could solve this problem. We could directly use Pythagoras’s Theorem:

np.sqrt(np.sum([np.power(sides[0], 2), np.power(sides[1], 2)]))
5.0

Or apply a “vectorized” operation to the whole vector at one time:

(sides ** 2).sum() ** 0.5
5.0
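NumPy also ships functions that compute this quantity directly. The sketch below shows two alternatives not used in the original example: np.linalg.norm, which returns the Euclidean length of a vector, and np.hypot, which takes the two sides separately.

```python
import numpy as np

sides = np.array([3, 4])

print(np.linalg.norm(sides))         # 5.0
print(np.hypot(sides[0], sides[1]))  # 5.0
```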

Broadcasting#

Elementwise arithmetic ordinarily needs two arrays of the same shape. Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations: where the shapes are compatible, the smaller array is conceptually stretched so that the operation can still occur elementwise.

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

  • If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.

  • If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the other shape.

  • If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
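The three rules can be traced on a small example. This is a sketch with made-up arrays; np.broadcast_shapes, available in NumPy 1.20 and later, reports the resulting shape without performing any arithmetic.

```python
import numpy as np

a = np.ones((2, 3))    # shape (2, 3)
b = np.arange(3)       # shape (3,)

# First rule: b has fewer dimensions, so its shape is padded to (1, 3).
# Second rule: the size-1 dimension is stretched to 2, giving (2, 3).
print((a + b).shape)                           # (2, 3)
print(np.broadcast_shapes(a.shape, b.shape))   # (2, 3)

# Third rule: sizes 3 and 2 disagree and neither is 1, so an error is raised.
c = np.ones(2)         # shape (2,)
broadcast_failed = False
try:
    a + c
except ValueError as error:
    broadcast_failed = True
    print("broadcast failed:", error)
```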

Let’s try to center an array:

X = np.random.random((10, 3))
X
array([[0.91301774, 0.62873433, 0.94958671],
       [0.76117682, 0.8882128 , 0.2103175 ],
       [0.04546602, 0.35322598, 0.68759167],
       [0.22562999, 0.75706807, 0.98728675],
       [0.52966888, 0.67625261, 0.68496988],
       [0.40858253, 0.13311388, 0.50403301],
       [0.45049342, 0.9931796 , 0.53244825],
       [0.23608296, 0.81897064, 0.44304408],
       [0.99882097, 0.69607734, 0.65911066],
       [0.76949086, 0.59882432, 0.1939321 ]])
# compute the mean of each column (axis=0 collapses the rows)
x_mean = X.mean(axis=0)
x_mean
array([0.53384302, 0.65436596, 0.58523206])
X_centered = X - x_mean
X_centered
array([[ 0.37917472, -0.02563163,  0.36435465],
       [ 0.2273338 ,  0.23384684, -0.37491456],
       [-0.488377  , -0.30113998,  0.10235961],
       [-0.30821302,  0.10270211,  0.40205469],
       [-0.00417414,  0.02188665,  0.09973782],
       [-0.12526049, -0.52125208, -0.08119905],
       [-0.0833496 ,  0.33881364, -0.05278381],
       [-0.29776006,  0.16460468, -0.14218798],
       [ 0.46497795,  0.04171138,  0.07387859],
       [ 0.23564784, -0.05554163, -0.39129996]])

Let’s check that each column of the centered array has a mean close to zero (the tiny nonzero values are floating-point round-off):

X_centered.mean(axis=0)
array([-6.66133815e-17, -2.22044605e-17,  1.11022302e-17])
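Centering each row instead of each column needs one extra step, because a vector of row means with shape (n,) does not line up with the rows under the broadcasting rules. One way to handle this, sketched below on a small deterministic array used as a stand-in, is to pass keepdims=True so the means keep a trailing dimension of size 1.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # small stand-in array

# axis=1 averages across each row; keepdims=True keeps the shape (4, 1)
row_mean = X.mean(axis=1, keepdims=True)
X_row_centered = X - row_mean                  # (4, 3) - (4, 1) broadcasts

print(row_mean.shape)                  # (4, 1)
print(X_row_centered.mean(axis=1))     # [0. 0. 0. 0.]
```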

Reshaping#

There are three key tools for reshaping NumPy arrays:

  • .reshape()

  • np.newaxis

  • .ravel()/.flatten()

x = np.full((4, 3), 3.14)
x
array([[3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14]])
x.reshape(6, 2)
array([[3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14]])
# using -1 will calculate the dimension for you (if possible)
x.reshape(2, -1)
array([[3.14, 3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14, 3.14]])
x[:, np.newaxis]
array([[[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]],

       [[3.14, 3.14, 3.14]]])
x.flatten()
array([3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
       3.14])
x.ravel()
array([3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
       3.14])
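Although .flatten() and .ravel() print the same result, they differ in memory behavior: .flatten() always returns a copy, while .ravel() returns a view of the original data when it can. The sketch below checks this with np.shares_memory on a small made-up array.

```python
import numpy as np

x = np.zeros((2, 3))

flat_copy = x.flatten()   # always a copy
flat_view = x.ravel()     # a view when possible (as it is for this array)

print(np.shares_memory(x, flat_copy))   # False
print(np.shares_memory(x, flat_view))   # True

# Writing through the view changes the original array.
flat_view[0] = 99.0
print(x[0, 0])            # 99.0
print(flat_copy[0])       # 0.0
```

If the flattened result will be modified independently of the original, .flatten() is the safer choice; .ravel() avoids the extra copy when read-only access is enough.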

Exercises#

Create an array of 10 ones and 10 zeroes:#

# TODO: your answer here

Create a matrix using created arrays as rows#

# TODO: your answer here

Calculate the sum of each column in the matrix#

# TODO: your answer here

Resources#

Tom

TomasBeuzen. TomasBeuzen/python-programming-for-data-science: Content from the University of British Columbia's Master of Data Science course DSCI 511. URL: https://github.com/TomasBeuzen/python-programming-for-data-science.

Van17

Jacob T. VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly, 2017.