
How to Plot a Running Average in Python Using matplotlib

Visualizing data is an essential part of data science. We show you how to plot running averages using matplotlib.

The running average, also known as the moving average or rolling mean, can help filter out the noise and create a smooth curve from time-series data. It can also help highlight different seasonal cycles in time-series data. This is a very common tool used in many fields from physics to environmental science and finance.

In this article, we explain what the running average is and how it is calculated. We also show you how to visualize the results using matplotlib in Python. We further discuss some important things to understand about moving averages to help elevate your data analysis skills.

This article is aimed at people with a bit of experience in data analysis. If you’re looking for an introduction to data science, we have a course that provides the foundational skills. For more material that builds on top of that, take a look at this data science track.

What Is a Running Average?

To generate a running average, we need to decide on a window size in which to calculate the average values. This can be any number from 2 to n-1, where n is the number of data points in the time series. We define a window, calculate an average in the window, slide the window by one data point, and repeat until we get to the end.

To demonstrate this, let’s define some data and calculate a running average in Python in a for loop:

>>> import numpy as np
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> window = 2
>>> average_data = []
>>> for ind in range(len(data) - window + 1):
...     average_data.append(np.mean(data[ind:ind+window]))

Here, we define a window size of 2 data points and use a list slice to get the subset of data we want to average. Then, we use NumPy to calculate the mean value. The for loop advances the index by one, and we repeat. Notice the loop is over len(data) - window + 1, which means our smoothed data has only 9 data points.
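The same running average can also be computed without an explicit loop. The following is a minimal sketch using np.convolve, which is a different technique from the loop above but gives the same numbers for equal weights; the name weights is introduced here only for illustration.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
window = 2

# Equal weights that sum to 1, so the convolution is a plain average.
weights = np.ones(window) / window

# mode='valid' keeps only the positions where the window fully overlaps
# the data, giving len(data) - window + 1 values, the same as the loop.
average_data = np.convolve(data, weights, mode='valid')
print(average_data)
```

For long time series, this vectorized form is usually faster than appending to a list inside a loop.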

If you want to compare the running average to the original data, you have to align them correctly. A convenient way to do this is by inserting a NaN at the start of the list using list.insert(). Try it for yourself.

Plotting a Running Average in matplotlib

As a consequence of this method for smoothing data, the features (e.g., peaks or troughs) in a graph of a moving average lag the real features in the original data. Averaging also changes the magnitude of the values: peaks come out lower and troughs shallower than in the real data. This is important to keep in mind if you want to identify when a peak in the data has happened and what its magnitude is.

To demonstrate this, we can create a sine wave and calculate a running average in Python like we have done earlier:

>>> x = np.linspace(0, 10, 50)
>>> y = np.sin(x)
>>> window = 5
>>> average_y = []
>>> for ind in range(len(y) - window + 1):
...     average_y.append(np.mean(y[ind:ind+window]))

Here’s how to add NaNs to the start of the running average to ensure the list has the same length as the original data:

>>> for ind in range(window - 1):
...     average_y.insert(0, np.nan)

Now, we can plot the results using matplotlib:

>>> import matplotlib.pyplot as plt
>>> plt.figure(figsize=(10, 5))
>>> plt.plot(x, y, 'k.-', label='Original data')
>>> plt.plot(x, average_y, 'r.-', label='Running average')
>>> plt.yticks([-1, -0.5, 0, 0.5, 1])
>>> plt.grid(linestyle=':')
>>> plt.legend()
>>> plt.show()

Running the above code produces the following plot in a new window:

[Figure: original sine wave and its running average]

The larger the window size, the greater the lags of the peaks and the troughs but the smoother the data. You need to test a few values to determine the best balance for your particular use case.
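To put rough numbers on this trade-off, the sketch below measures how far the first crest of the sine wave shifts, and how much its height drops, for a few window sizes. The helper running_average and the window sizes 3, 5, and 9 are illustrative choices, not part of the example above.

```python
import numpy as np

def running_average(values, window):
    """Trailing running average, padded with NaN to match the input length."""
    smoothed = np.convolve(values, np.ones(window) / window, mode='valid')
    return np.concatenate([np.full(window - 1, np.nan), smoothed])

x = np.linspace(0, 10, 50)
y = np.sin(x)

# Restrict the search to the first half so only the first crest is considered.
half = len(x) // 2
true_peak = np.argmax(y[:half])

lags, heights = [], []
for window in (3, 5, 9):
    avg = running_average(y, window)
    peak = np.nanargmax(avg[:half])      # ignores the NaN padding
    lags.append(x[peak] - x[true_peak])  # shift of the crest along x
    heights.append(avg[peak])            # height of the smoothed crest
    print(f'window={window}: lag in x = {lags[-1]:.2f}, peak height = {heights[-1]:.3f}')
```

The lag grows and the crest gets lower as the window widens, which is the behavior described above.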

A good exercise to get a feel for this is to take the code example above and add some noise to the sine wave. The noise can be random numbers between, for example, 0 and 1. Then, smooth the data by calculating the running average, and plot the two curves.
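One possible starting point for this exercise is sketched below. It assumes uniform noise from NumPy's random generator with a fixed seed (an arbitrary choice so the run is repeatable) and reuses the window size of 5 from the earlier example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

x = np.linspace(0, 10, 50)
noise = rng.uniform(0, 1, size=x.size)  # random numbers between 0 and 1
noisy_y = np.sin(x) + noise

window = 5
smoothed = np.convolve(noisy_y, np.ones(window) / window, mode='valid')
# Pad with NaN so the running average lines up with x.
average_y = np.concatenate([np.full(window - 1, np.nan), smoothed])

# Point-to-point jumps are a rough measure of how jagged each curve is.
print('noisy   :', np.std(np.diff(noisy_y)))
print('smoothed:', np.nanstd(np.diff(average_y)))
```

Both noisy_y and average_y can be passed to plt.plot() against x in the same way as the earlier listing. Because noise drawn between 0 and 1 has a mean of 0.5, both curves sit about 0.5 above the clean sine wave.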

What About pandas?

The pandas library has become the backbone of data analysis in Python. Its basic data structures are the Series and the DataFrame.

pandas comes with a lot of built-in functions to help make processing data easier, including functions to calculate running averages. It’s also very useful for cleaning data, which we discuss in this article.

In most cases, you have your data in a file you can read into a data frame. We have two helpful articles: how to read CSV files and how to read Excel files in Python. The focus of this article isn’t on how to load data using pandas, so we assume you’ve already loaded your data and are ready to start processing and plotting. If you want some information on working with data frames in pandas, check out this article.

For this example, we have about 7 months of daily temperature measurements from Berlin, going from January 2021 to the end of July 2021. The running average for a week can be calculated by:

>>> temperature = df['temp']
>>> t_average = temperature.rolling(window=7).mean()

This is super convenient, since it quickly and easily calculates a rolling mean (i.e., a moving average) over the window you define in rolling(). Furthermore, it automatically aligns the data properly and fills in the missing data with NaN. Now, we can use matplotlib to plot the results:

>>> plt.figure(figsize=(10, 5))
>>> plt.plot(temperature, 'k-', label='Original')
>>> plt.plot(t_average, 'r-', label='Running average')
>>> plt.ylabel('Temperature (deg C)')
>>> plt.xlabel('Date')
>>> plt.grid(linestyle=':')
>>> plt.fill_between(t_average.index, 0, t_average, color='r', alpha=0.1)
>>> plt.legend(loc='upper left')
>>> plt.show()

This opens the following figure in a new window:

[Figure: daily temperature and its 7-day running average]

You should notice here we only specified the y-values when we called plot(). This is because the index of the Series holds the dates, and matplotlib automatically uses the index as the x-values.

In this plot, you can see the trend of increasing temperature going from winter to summer. There’s also a variation on small time scales that is evident from the smoothed data produced from the 7-day running average. Adding the gridlines helps guide the eye to the relevant date and temperature values; shading underneath the running average helps emphasize its value above or below zero degrees.
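The Berlin measurements are not included here, so the sketch below builds a synthetic stand-in to try rolling() on. The values are made up (a linear trend plus random noise) and are not real temperatures; only the date range and the column name 'temp' follow the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Berlin measurements: synthetic daily values
# on a date index, NOT real data.
dates = pd.date_range('2021-01-01', '2021-07-31', freq='D')
rng = np.random.default_rng(0)
trend = np.linspace(0, 20, len(dates))  # made-up winter-to-summer trend
df = pd.DataFrame({'temp': trend + rng.normal(0, 3, len(dates))}, index=dates)

temperature = df['temp']
t_average = temperature.rolling(window=7).mean()

# The first window - 1 = 6 entries are NaN because a full week of data
# is not yet available at those dates.
print(t_average.head(8))
print(t_average.isna().sum())
```

The result keeps the same date index as the input, which is why it can be plotted against the original series without any manual alignment.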

Take Running Averages in Python to the Next Level

In this article, we showed you how to calculate a running average in Python and plot the results using matplotlib. Plotting is a crucial skill for understanding data. For a demonstration on using matplotlib to visualize sound data, take a look at this article.

If you work a lot with tabular data, presenting tables in a visually appealing way is important. We have an article on pretty-printing tables in Python.

For this article, each data point in the averaging window contributed equally to the average. However, this doesn’t necessarily need to be the case. An exponential moving average, for example, places more weight on recent data, which helps address the problem with the lag.
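As a starting point, here is a minimal sketch of one common form of the exponential moving average, where each output is alpha times the newest value plus (1 - alpha) times the previous output. The function name, the choice to seed the average with the first data point, and alpha=0.5 are assumptions made for illustration; other conventions exist.

```python
import numpy as np

def exponential_moving_average(values, alpha):
    """Recursive EMA: ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1].

    Assumes a non-empty input and 0 < alpha <= 1. A larger alpha puts more
    weight on recent data and smooths less.
    """
    values = np.asarray(values, dtype=float)
    ema = np.empty_like(values)
    ema[0] = values[0]  # seed with the first data point
    for i in range(1, len(values)):
        ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]
    return ema

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ema = exponential_moving_average(data, alpha=0.5)
print(ema)
```

Unlike the windowed average, the output has the same length as the input, so no NaN padding is needed. In pandas, Series.ewm(alpha=0.5, adjust=False).mean() should follow the same recursion.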

We’d like to encourage you to use what you’ve learned here and play around with it a little. Try implementing an exponential moving average and see how it performs in smoothing a noisy sine wave. With a little bit of practice, you’ll take your Python skills to the next level.