
An Introduction to NumPy in Python

NumPy makes working with arrays easy.

If you work with Python, it pays to know some basics of Python NumPy. It is incredibly useful for working with arrays since it is very fast and efficient. It also contains many methods to make manipulating and performing numerical operations on arrays simple.

There are many data structures in Python, including lists, dictionaries, Pandas DataFrames, and of course NumPy arrays. Each has its strengths, and knowing when to use one or the other can save time and effort in writing your programs.

In this article, we’ll show you the basics of Python NumPy and explain why it’s so useful. We’ll give you some examples to get you on your feet and give you the foundation to make your data analysis projects more efficient. If you’re interested in learning more about data science in Python, consider taking this track designed for complete beginners with no experience in IT.

Why NumPy?

Arrays in NumPy have many similarities to other data structures such as lists. They can store numerical data as well as strings, they are mutable, and they can be sliced and indexed in similar ways. A list, however, cannot handle numerical operations as easily as an array.
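As a minimal sketch of these similarities, the example below indexes, slices, and modifies a list and an array side by side. The variable names are made up for illustration.

```python
import numpy as np

values = [10, 20, 30, 40, 50]   # plain Python list
arr = np.array(values)          # NumPy array with the same contents

# Indexing and slicing use the same syntax for both
first_list, first_arr = values[0], arr[0]
middle_list, middle_arr = values[1:4], arr[1:4]

# Both are mutable: an element can be reassigned in place
values[0] = 99
arr[0] = 99

print(middle_list)  # [20, 30, 40]
print(middle_arr)   # [20 30 40]
```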

To multiply all elements in an array by 2, use array*2, where array is the name of the array. This is known as vectorization. To do the same with a list requires a for loop or a list comprehension, both of which need more code. Furthermore, numerical operations on arrays are typically much faster than the equivalent loops over lists because NumPy runs them in optimized, compiled code, and arrays usually consume less memory because their elements are stored in a compact block of a single data type.
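The difference can be sketched as follows. The names numbers and array are placeholders chosen for this example. Note that applying * 2 to a list does not double its elements; it repeats the list.

```python
import numpy as np

numbers = [1, 2, 3, 4]
array = np.array(numbers)

# Vectorized: the multiplication is applied to every element at once
doubled_array = array * 2

# The list equivalent needs an explicit loop or a list comprehension
doubled_list = [x * 2 for x in numbers]

# Multiplying a list by 2 repeats it instead of doubling the values
repeated_list = numbers * 2

print(doubled_array)  # [2 4 6 8]
print(doubled_list)   # [2, 4, 6, 8]
print(repeated_list)  # [1, 2, 3, 4, 1, 2, 3, 4]
```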

Pandas is another good alternative that provides functionality for data analysis and visualization. The basic data structure in Pandas is the Series, which is similar to a 1-dimensional NumPy array. However, once again, NumPy is faster and more efficient when it comes to performing numerical computations. For more information on working with Pandas, take a look at this article. We also have some material on visualizing time series data in Pandas.

Creating Arrays

NumPy arrays have a fixed size at creation, and the elements are required to be of the same data type. These are the two main constraints to keep in mind while creating arrays. The array() function takes the following arguments:

numpy.array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0, like=None)

For the sake of brevity, we won’t go through a detailed description of all the arguments. Take a look at the documentation if you’re interested in the details. For the majority of applications, you just need to define the object and possibly the dtype arguments.

To define a 1-dimensional array and print its shape, do the following:

>>> import numpy as np
>>> ar = np.array([1, 2, 3, 4])
>>> print(ar.shape)
(4,)

For a NumPy multidimensional array, the object takes on the form of a nested sequence, where the individual sequences define the rows of the array. For example:

>>> ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> print(ar.shape)
(2, 4)
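The two constraints mentioned earlier, a single data type and a fixed size, can be illustrated with a short sketch. The variable names here are hypothetical. Mixing integers and floats produces a float array, and "adding" an element with numpy.append returns a new array rather than growing the original.

```python
import numpy as np

# One data type per array: the integers are upcast to floats
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# Fixed size: np.append builds a new, longer array and leaves
# the original untouched
ints = np.array([1, 2, 3])
longer = np.append(ints, 4)
print(ints.size, longer.size)  # 3 4
```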

NumPy supports many data types, for example, integers, floats, strings, Booleans, and even complex numbers, although all elements of a given array share one type. We can convert the elements in our array above to strings directly as follows:

>>> ar_string = ar.astype(str)
>>> ar_string
array([['1', '2', '3', '4'],
       ['5', '6', '7', '8']], dtype='<U11')

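The string width reported in the dtype (here <U11) depends on the integer type of the original array, so it may differ on your system. Alternatively, we may set dtype=str when defining the array. An array of Boolean values may be created as follows: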

>>> np.array([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=bool)
array([[ True,  True, False, False],
       [False,  True, False,  True]])

This may be useful if you want to mask out certain values in another array.
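A minimal sketch of masking, assuming a hypothetical data array with the same shape as the Boolean array above:

```python
import numpy as np

data = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
mask = np.array([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=bool)

# Boolean indexing keeps only the elements where the mask is True
# (the result is 1-dimensional)
selected = data[mask]

# np.where keeps the shape and replaces the masked-out values with 0
masked = np.where(mask, data, 0)

print(selected)  # [1 2 6 8]
print(masked)
```

Boolean indexing flattens the result, so numpy.where is the better fit when the original shape needs to be preserved.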

There are several ways to define an array with arbitrary values as placeholders for filling in the real data later. The numpy.ones() and numpy.zeros() functions create an array filled with ones and zeros, respectively.

The numpy.empty() function creates an array without initializing entries. This particular function requires the user to manually set all the values in the array and should be used with caution. However, it may be a little faster than the other two functions.

To use these functions, the size of the array needs to be specified:

>>> np.zeros((3, 2))
array([[0., 0.],
       [0., 0.],
       [0., 0.]])
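The other two functions follow the same pattern. The sketch below also passes the optional dtype argument, and it fills the array returned by numpy.empty() before reading from it, since its initial contents are arbitrary. The variable names are illustrative only.

```python
import numpy as np

ones = np.ones((2, 3))                    # floats by default
int_zeros = np.zeros((2, 3), dtype=int)   # integer zeros instead

# np.empty only allocates memory; assign every value before use
scratch = np.empty((2, 3))
scratch[:] = 7.0

print(ones.dtype)       # float64
print(scratch.shape)    # (2, 3)
```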

Arrays with a fixed sequence can be defined with two useful NumPy functions: arange and linspace. For arange, you specify the start and stop values and the step; the stop value itself is not included in the result. For example:

>>> np.arange(2, 20, 4)
array([ 2,  6, 10, 14, 18])

This is similar to the built-in function range(), which can be used for looping. See this article for more details on loops in Python. The linspace function in NumPy returns evenly spaced numbers over an interval defined by the start and stop values. Using the same arguments as the last example gives:

>>> np.linspace(2, 20, 4)
array([ 2.,  8., 14., 20.])

Here, the third argument defines the number of values to return rather than the step size as in the former example. To see an example of how to use this function to generate evenly spaced values for a time axis, see this article. A similar function, logspace, returns numbers spaced evenly on a logarithmic scale. Try it out to see what you get.
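As a rough sketch of logspace: the start and stop arguments are exponents, and the base is 10 unless the base argument says otherwise. The example also shows linspace with endpoint=False, which leaves out the stop value much as arange does.

```python
import numpy as np

# Four values from 10**0 to 10**3
powers_of_ten = np.logspace(0, 3, 4)

# The same exponents with base 2: 2**0 to 2**3
powers_of_two = np.logspace(0, 3, 4, base=2)

# linspace includes the stop value unless endpoint=False
half_open = np.linspace(0, 1, 4, endpoint=False)

print(powers_of_ten)  # values 1, 10, 100, 1000
print(powers_of_two)  # values 1, 2, 4, 8
print(half_open)      # values 0, 0.25, 0.5, 0.75
```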

Reading and Writing CSVs

Most of the time, you want to read data saved in a file into a NumPy array. NumPy comes with a few functions to help load and save arrays. These are focused on handling either binary data or data stored in text files. The two functions load() and save() provide functionality for loading and saving arrays to a binary file.
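A minimal sketch of the binary round trip is shown here. The file name is made up, and a temporary directory is used only so the example cleans up after itself. Unlike a text file, the .npy format stores the data type, so the array comes back exactly as it was saved.

```python
import os
import tempfile

import numpy as np

ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output_array.npy")  # hypothetical file name
    np.save(path, ar)          # write the array in NumPy's binary format
    ar_loaded = np.load(path)  # read it back

print(ar_loaded.dtype == ar.dtype)      # True
print(np.array_equal(ar, ar_loaded))    # True
```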

If you’re working with text files, specifically CSV in Python, the best way to read and write arrays to file is with the loadtxt() and savetxt() functions. The latter has two required arguments, fname and X, which define the filename and the array data to save, respectively. To save to CSV, you also need to specify a delimiter. To demonstrate this, let’s create a 2 x 4 array, save it to CSV, then read it back in:

>>> ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> np.savetxt('output_array.csv', ar, delimiter=',')
>>> ar_read = np.loadtxt('output_array.csv', delimiter=',')
>>> ar_read
array([[1., 2., 3., 4.],
       [5., 6., 7., 8.]])

You may also use pure Python and the built-in open() function. Here’s an article about writing to file in Python that shows you how. If you’re handling large numbers of files in Python, here’s an article with some tips on how to rename files programmatically.
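One possible pure-Python approach, sketched here with open() and the standard library's csv module, is to write the rows one by one and convert the values back when reading. The file name and variable names are assumptions made for this example.

```python
import csv
import os
import tempfile

import numpy as np

ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output_array.csv")  # hypothetical file name

    # Write each row of the array as one line of the CSV file
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(ar.tolist())

    # Read the file back; csv.reader yields strings, so convert to int
    with open(path, newline="") as f:
        rows = [[int(value) for value in row] for row in csv.reader(f)]

ar_read = np.array(rows)
print(ar_read.shape)  # (2, 4)
```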

Some NumPy Array Methods

Now that we’ve covered ways to create an array in Python, let’s take a look at what you can do with it. NumPy has many useful and highly optimized methods that allow you to do array operations and get additional information about your array.

As we mentioned in the introduction, doing basic operations on arrays such as array1 + array2 or multiplying by a scalar is straightforward. There are efficient functions for linear algebra, for example, for calculating the dot or cross product or for taking the transpose of an array.
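These operations can be sketched as follows, using two small example vectors and a 2 x 4 matrix chosen purely for illustration:

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

total = array1 + array2            # element-wise sum
scaled = array1 * 2                # multiplication by a scalar
dot = np.dot(array1, array2)       # 1*4 + 2*5 + 3*6
cross = np.cross(array1, array2)   # vector perpendicular to both inputs

matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
transposed = matrix.T              # rows and columns swapped

print(total)             # [5 7 9]
print(scaled)            # [2 4 6]
print(dot)               # 32
print(cross)             # [-3  6 -3]
print(transposed.shape)  # (4, 2)
```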

A common requirement is to summarize the contents of an array. NumPy includes functions to calculate statistics such as mean, median, standard deviation, etc. These are useful because they allow you to specify an axis to calculate the statistic over. By default, the statistic is calculated over the flattened array. For example:

>>> ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> np.mean(ar)
4.5

>>> np.mean(ar, axis=0)
array([3., 4., 5., 6.])

>>> np.mean(ar, axis=1)
array([2.5, 6.5])

Specifying axis=0 calculates the mean of each column (averaging down the rows), and axis=1 calculates the mean of each row. Now that we know how to generate a summary statistic, we can find the mean of the rows and append this information to a new column in the array:

>>> row_mean = np.mean(ar, axis=1).reshape(2, 1)
>>> new_ar = np.append(ar, row_mean, axis=1)
>>> new_ar
array([[1. , 2. , 3. , 4. , 2.5],
       [5. , 6. , 7. , 8. , 6.5]])

Here, we calculate our statistic, reshape it, and then use the append function to add it as a new column. Notice the data type of the whole array has changed since our summary statistics are floating-point numbers.
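The axis argument works the same way for the other summary statistics mentioned earlier. As a short sketch using the original 2 x 4 array:

```python
import numpy as np

ar = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

overall_median = np.median(ar)         # over the flattened array
column_median = np.median(ar, axis=0)  # one value per column
row_std = np.std(ar, axis=1)           # one value per row

print(overall_median)  # 4.5
print(column_median)   # [3. 4. 5. 6.]
print(row_std)         # both rows give roughly 1.118
```

Both rows have the same standard deviation because each contains four consecutive integers.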

Go Forth and NumPy

NumPy is a foundational tool in Python data analytics. It’s a mature library with a large number of useful functions and methods as well as speed and efficiency at its core. Now that you know the basics of Python NumPy, you can use what you’ve learned here to make your projects more efficient.

Visualizing your data is an important step in the data science process. We have two articles (Part 1 and Part 2) that give you an introduction to plotting with Matplotlib in Python.