17th Jul 2023 8 minutes read

What Is a Moving Average? Calculate It in Python

Learn how to calculate Python moving averages and supercharge your data analysis skills!

The moving average, also known as a rolling average, running average or running mean, is a critical part of any data scientist’s toolbox. When analyzing datasets like weather patterns and stock market trends, we tend to run into outliers and noise that can obscure the more meaningful trends we’re looking for.

In this article, you’ll learn how to use Python to calculate moving averages to “smooth out” noise in your data science code. Then, if you’re interested in a more comprehensive education on the usage of Python in data science, you can follow along with our excellent Introduction to Python for Data Science course.

What Is a Moving Average?

The moving average refers to transforming a series of points by calculating the averages of short sub-intervals within the larger series. This tends to give us a smoother dataset than the raw points alone, as calculating the averages in this way lessens the effect of sharp, rapid changes in our data.

Moving averages are calculated over a specific window of data points. Let’s say we have a dataset of 7 daily outdoor temperatures throughout a single week and we apply a moving average with a window size of 3. The resulting dataset will be:

The average temperature from Monday to Wednesday (days 1-3).
The average temperature from Tuesday to Thursday (days 2-4).
The average temperature from Wednesday to Friday (days 3-5).
The average temperature from Thursday to Saturday (days 4-6).
The average temperature from Friday to Sunday (days 5-7).

The transformed dataset is now less sensitive to rapid variations in daily temperature and shows a more general trend. This means that if (for example) Wednesday was an unusually hot or cold day, we wouldn’t see as big of a “spike” in our data.
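To make the five windows listed above concrete, here is a minimal sketch that only enumerates which days fall into each window. The names `days` and `windows` are illustrative and not part of any library.

```python
# Illustrative only: list the days covered by each 3-day window in a 7-day week.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
window_size = 3

# A series of n points yields n - window_size + 1 complete windows.
windows = [days[i:i + window_size] for i in range(len(days) - window_size + 1)]

for window in windows:
    print(f"{window[0]} to {window[-1]}")
# Monday to Wednesday
# Tuesday to Thursday
# Wednesday to Friday
# Thursday to Saturday
# Friday to Sunday
```

Seven days and a window of 3 give 7 - 3 + 1 = 5 windows, which is why the list above has five entries.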

When analyzing larger datasets, this can be exceptionally useful. Phenomena like short heat waves and cold spells won’t pollute our data, which allows us to focus on more meaningful long-term trends – such as climate change throughout the years and decades.

Choosing a Window Size

As a general rule of thumb, the larger the window size, the more “smoothing” we get. Larger windows do a better job of reducing noise, but making them too large can oversimplify our dataset and hide the trends we’re actually looking for. Conversely, a window that is too small does a poor job of reducing the number of spikes in our data.

If we have a series of hourly stock prices throughout several months, for instance, we would be well-served by a window size above 24 hours: since there are 24 hours in a day, this gets rid of the often meaningless hour-by-hour variations in price. Going further, increasing our window size up to 96 hours (or 4 days) would dampen the effect of price booms and busts that only last a few days.

Choosing the right window size is crucial to getting a high-quality dataset. Doing so requires knowledge of the nature of our data (e.g. the size of the series, the number and general interval of outliers). It also requires having a clearly defined goal – or knowing what it is you’re hoping to achieve!
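The trade-off between small and large windows can be seen on a toy example. The sketch below uses made-up numbers (`noisy_series` is hypothetical data with one spike) and two illustrative helpers, `moving_average` and `spread`, to compare how much of the spike survives with a window of 2 versus a window of 5.

```python
def moving_average(values, window_size):
    """Average every consecutive run of `window_size` values."""
    return [
        sum(values[i:i + window_size]) / window_size
        for i in range(len(values) - window_size + 1)
    ]


def spread(values):
    """Distance between the largest and smallest value (a rough measure of spikiness)."""
    return max(values) - min(values)


# Hypothetical data: mostly around 10-13, with a single spike of 30.
noisy_series = [10, 12, 9, 30, 11, 10, 13, 9, 12, 11]

small_window = moving_average(noisy_series, 2)
large_window = moving_average(noisy_series, 5)

# The spread shrinks as the window grows, but so does the number of points we keep.
print(spread(noisy_series), spread(small_window), spread(large_window))
print(len(noisy_series), len(small_window), len(large_window))
```

The larger window flattens the spike more, but it also returns fewer points and blurs any short-lived change that might actually matter, which is the oversimplification risk described above.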

Calculating the Python Moving Average

Let’s learn how to actually calculate the rolling mean using Python. We’ll go back to our outdoor temperature example, but we’ll use actual numbers this time:

Day of Week	Temperature in °C
Monday	23 °C
Tuesday	25 °C
Wednesday	12 °C
Thursday	28 °C
Friday	33 °C
Saturday	31 °C
Sunday	35 °C

If we use a window size of 3, we’ll need to calculate the average of every consecutive sequence of 3 data points. Remember, calculating an average of a series is as simple as adding together all the items, then dividing the sum by the number of items in the series.

Keeping this in mind, the final data points (rounded to three decimal places) will be:

(23 + 25 + 12) / 3 = 20 °C
(25 + 12 + 28) / 3 = 21.667 °C
(12 + 28 + 33) / 3 = 24.333 °C
(28 + 33 + 31) / 3 = 30.667 °C
(33 + 31 + 35) / 3 = 33 °C
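As a quick sanity check of the arithmetic above, the five windows can be written out explicitly and averaged. This is only a verification sketch; `hand_windows` is an illustrative name.

```python
# The five windows from the hand calculation, written out explicitly.
hand_windows = [
    (23, 25, 12),
    (25, 12, 28),
    (12, 28, 33),
    (28, 33, 31),
    (33, 31, 35),
]

# Sum each window, divide by its length, and round to three decimal places.
hand_averages = [round(sum(window) / len(window), 3) for window in hand_windows]

print(hand_averages)
# [20.0, 21.667, 24.333, 30.667, 33.0]
```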

Calculating the Moving Average in Python Using a Loop

Calculating by hand is well and good, but if you’re a real-world data scientist you’re going to want to integrate this calculation into the rest of your code-based solution. As Python is the leading programming language for data scientists (find out why at Why Python Is Used for Data Science), we are going to embark on a step-by-step exploration of calculating the rolling average in Python.

We’re first going to explore the calculation of the rolling average in Python using a regular loop. We’re then going to utilize the popular pandas data science library to simplify the calculation.

In any case, we’re going to start by defining the dataset. For this example, we’re going to stick to the outdoor temperature dataset we defined earlier:

dataset = [23, 25, 12, 28, 33, 31, 35]

We’re also going to store the window size in a separate variable to make the code cleaner and easier to edit. We’re going to stick with a window size of 3:

window_size = 3

Now that we have our inputs cleanly defined, we can write the actual loop to calculate the resulting dataset:

result = []
for i in range(len(dataset) - window_size + 1):
    window = dataset[i : i + window_size]
    window_average = sum(window) / window_size
    result.append(window_average)

Let’s break this down step by step.

First, we create an empty list that will hold the resulting data points. Then, we loop through the original list n - window_size + 1 times, where n is the initial number of elements. This is because the last window_size - 1 elements cannot start a complete window – e.g. if we calculate a moving average for the given day and the two following days, then the calculation is impossible for the last two days.

If this confuses you, remember that in our example there are no more rolling average intervals after Friday. We can’t start an interval on Saturday, because the interval would have to end on a day after Sunday, which doesn’t exist in our dataset!

Now, for each of the starting indexes we loop through, we need to create a window. We do this using Python’s slice notation [start : end], which extracts a sub-list from a larger list. In this case, we start each window at i (the current index in the loop) and end it at i + window_size. The end index is exclusive, so the window contains exactly window_size elements.
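To see what the slice produces at each step, here is a small sketch that takes the first and the last window from the same dataset. The variable names `first_window` and `last_window` are illustrative.

```python
dataset = [23, 25, 12, 28, 33, 31, 35]
window_size = 3

# i = 0: elements at indexes 0, 1 and 2 (index 3 is excluded).
first_window = dataset[0:0 + window_size]

# i = 4: the last valid start, covering indexes 4, 5 and 6.
last_window = dataset[4:4 + window_size]

print(first_window)  # [23, 25, 12]
print(last_window)   # [33, 31, 35]
```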

Once we have a window, all that’s left is to calculate its average. To calculate the average of a sequence, we need to add together all its elements and then divide this total by the number of elements in the sequence. We do this using Python’s sum() function (which sums together an array) with the division (/) operator.

Finally, we append the newly calculated window average to our result array and continue the loop for the next indices.

Running this code gives us the following result (values rounded to three decimal places):

[20.0, 21.667, 24.333, 30.667, 33.0]

This is the same as the result we got when we were calculating by hand earlier!
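The loop above re-sums every window from scratch. For long series, one common alternative is to keep a running sum and update it as the window slides: add the value that enters and subtract the value that leaves. The sketch below is one possible way to write this, not the only one; the function name `moving_average_running_sum` and its error handling are assumptions made for illustration.

```python
def moving_average_running_sum(values, window_size):
    """Moving average that updates a running sum instead of re-summing each window."""
    if window_size < 1 or window_size > len(values):
        raise ValueError("window_size must be between 1 and len(values)")

    # Sum of the first complete window.
    window_sum = sum(values[:window_size])
    averages = [window_sum / window_size]

    # Slide the window one step at a time:
    # add the element entering on the right, drop the one leaving on the left.
    for i in range(window_size, len(values)):
        window_sum += values[i] - values[i - window_size]
        averages.append(window_sum / window_size)

    return averages


dataset = [23, 25, 12, 28, 33, 31, 35]
result = moving_average_running_sum(dataset, 3)
print([round(value, 3) for value in result])
# [20.0, 21.667, 24.333, 30.667, 33.0]
```

With integer inputs like these, the running sum is exact. With floating-point inputs, small rounding errors can accumulate over very long series, so the simpler loop may be preferable when precision matters more than speed.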

Calculating the Rolling Mean in Python Using pandas

One of the core appeals of the Python programming language is its extensibility. Using easily importable external libraries reduces the amount of boilerplate code we have to write for well-known calculations like moving averages. This allows us to focus on the bigger picture.

There are plenty of handy Python libraries you can use to simplify and enrich your code, but perhaps the most well-known data analysis Python library is pandas. It contains a large selection of data structures and functions that help us automate common data science tasks.

So, let’s start calculating the rolling average in Python using the pandas library!

We are once again going to start by defining an array of data points and a variable that will store the window size.

dataset = [23, 25, 12, 28, 33, 31, 35]
window_size = 3

We also need to import the pandas library as follows:

import pandas as pd

We’re going to want to transform the dataset into a form that the pandas library can understand. In this case, we’re going to use the DataFrame structure, which is essentially just a regular table. We’re going to initialize this table with a single column named Data, which will contain the original points.

df = pd.DataFrame(dataset, columns=['Data'])

Then, we’re going to add a column called Moving_Average that will store the rolling averages, which we’ll calculate by chaining the rolling() and mean() methods that pandas provides. It’s as simple as:

df['Moving_Average'] = df['Data'].rolling(window=window_size).mean()

And that’s all there is to it! If we were to print out the resulting table, we would get:

   Data  Moving_Average
0    23             NaN
1    25             NaN
2    12       20.000000
3    28       21.666667
4    33       24.333333
5    31       30.666667
6    35       33.000000

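Apart from the first two rows, this matches the values we got using the other methods. Those rows are NaN because, by default, rolling() only produces a value once a full window of 3 points is available, and it places each average on the row where its window ends.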
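If the rows without a complete window are not wanted, one possible approach is to drop them with dropna(). The sketch below repeats the setup so it runs on its own and assumes pandas is installed; the variable name `complete` is illustrative.

```python
import pandas as pd

dataset = [23, 25, 12, 28, 33, 31, 35]
window_size = 3

df = pd.DataFrame(dataset, columns=["Data"])
df["Moving_Average"] = df["Data"].rolling(window=window_size).mean()

# Keep only the rows where a full window was available, rounded for display.
complete = df["Moving_Average"].dropna().round(3).tolist()

print(complete)
# [20.0, 21.667, 24.333, 30.667, 33.0]
```

This gives the same five values as the loop version. rolling() also accepts parameters such as min_periods and center that change how incomplete windows and alignment are handled; which setting is appropriate depends on the analysis.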

For these examples, we just have a regular table printed out in the console, but there are many more intuitive ways of visualizing data output. You can learn more at How to Generate a Data Summary in Python.

Calculating Rolling Means and More in Python

Did our discussion of rolling averages in Python stir your interest? If you want to learn more about using Python to supercharge your data science skills, you can explore some of the books we recommend at Best Python Books for Data Science and follow our tips at 11 Tips for Building A Strong Data Science Portfolio With Python. If you’re looking to improve your overall coding skills, there is no better place than our Learn Programming with Python track.

Happy coding!
