8th Apr 2024 12 minutes read

Python Data Analysis Example: A Step-by-Step Guide for Beginners

Doing real data analysis exercises is a great way to learn. But data analysis is a broad topic, and knowing how to proceed can be half the battle. In this step-by-step guide, we’ll show you a Python data analysis example and demonstrate how to analyze a dataset.

A great way to get practical experience in Python and accelerate your learning is by doing data analysis challenges. This will expose you to several key Python concepts, such as working with different file types, manipulating various data types (e.g. integers and strings), looping, and data visualization. Furthermore, you’ll also learn important data analysis techniques like cleaning data, smoothing noisy data, performing statistical tests and correlation analyses, and more. Along the way, you’ll also learn many built-in functions and Python libraries which make your work easier.

Knowing what steps to take in the data analysis process requires a bit of experience. For those wanting to explore data analysis, this article will show you a step-by-step guide to data analysis using Python. We’ll download a dataset, read it in, and start some exploratory data analysis to understand what we’re working with. Then we’ll be able to choose the best analysis technique to answer some interesting questions about the data.

This article is aimed at budding data analysts who already have a little experience in programming and analysis. If you’re looking for some learning material to get up-to-speed, consider our Introduction to Python for Data Science course, which contains 141 interactive exercises. For more in-depth material, our Python for Data Science track includes 5 interactive courses.

Python for Data Analysis

The process of examining, cleansing, transforming, and modeling data to discover useful information plays a crucial role in business, finance, academia, and other fields. Whether it's understanding customer behavior, optimizing business processes, or making informed decisions, data analysis provides you with the tools to unlock valuable insights from data.

Python has emerged as a preferred tool for data analysis due to its simplicity, versatility, and many open-source libraries. With its intuitive syntax and large online community, Python enables both beginners and experts to perform complex data analysis tasks efficiently. Libraries such as pandas, NumPy, and Matplotlib make this possible by providing essential functionalities for all aspects of the data analysis process.

The pandas library simplifies the process of working with structured data (e.g. tabular data, time series). NumPy, which is used for scientific computing in Python, provides powerful array objects and functions for numerical operations. It is essential for the mathematical computations involved in data analysis. It’s particularly useful for working with big data, as it is very efficient. Matplotlib is a comprehensive library for creating visualizations in Python; it facilitates the exploration and communication of data insights.

In the following sections, we’ll leverage these libraries to analyze a real-world dataset and demonstrate the process of going from raw data to useful conclusions.

The Sunspots Dataset

For this Python data analysis example, we’ll be working with the Sunspots dataset, which can be downloaded from Kaggle. The data includes a row number, a date, and an observation of the total number of sunspots for each month from 1749 to 2021.

Sunspots are regions of the sun's photosphere that are temporarily cooler than the surrounding material due to a reduction in convective transport of energy. As such, they appear darker and can be relatively easily observed – which accounts for the impressively long time period of the dataset. Sunspots can last anywhere from a few days to a few months, and have diameters ranging from around 16 km to 160,000 km. They can also be associated with solar flares and coronal mass ejections, which makes understanding them important for life on Earth.

Some interesting questions that could be investigated are:

What is the period of sunspot activity?
When can we expect the next peak in solar activity?

Python Data Analysis Example

Step 1: Import Data

Once you have downloaded the Sunspots dataset, the next step is to import the data into Python. There are several ways to do this; the one you choose depends on the format of your data.

If you have data in a text file, you may need to read the data in line-by-line using a for loop. As an example, take a look at how we imported the atmospheric sounding dataset in the article 7 Datasets to Practice Data Analysis in Python.
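As a rough illustration of the line-by-line approach, here is a minimal sketch. The file contents and column names are made up for this example, and io.StringIO stands in for a real file opened with open():

```python
import io

# Hypothetical file contents; io.StringIO stands in for open('data.txt')
raw_text = io.StringIO(
    "# pressure_hPa temperature_C\n"
    "1000 15.2\n"
    "925 10.1\n"
    "850 5.4\n"
)

pressures = []
temperatures = []
for line in raw_text:
    line = line.strip()
    # Skip blank lines and comment/header lines
    if not line or line.startswith("#"):
        continue
    # Split on whitespace and convert each field to a float
    pressure, temperature = line.split()
    pressures.append(float(pressure))
    temperatures.append(float(temperature))

print(pressures)
print(temperatures)
```

Real text files often need extra handling (different delimiters, units, or missing fields), so treat this as a starting point only.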

Alternatively, the data could be in the JSON format. In this case, you can use Python’s json library. This is covered in the How to Read and Write JSON Files in Python course.
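For instance, a small sketch of parsing JSON might look like the following. The JSON text and its keys ("date", "sunspots") are invented for illustration; with a real file you would pass an open file object to json.load() instead:

```python
import json

# Hypothetical JSON text with made-up keys
json_text = (
    '[{"date": "1749-01-31", "sunspots": 96.7},'
    ' {"date": "1749-02-28", "sunspots": 104.3}]'
)

# json.loads() turns the text into Python lists and dictionaries
records = json.loads(json_text)
sunspot_values = [record["sunspots"] for record in records]
print(sunspot_values)
```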

A common way to store data is in either Excel (.xlsx) or comma-separated-value (.csv) files. In both of these cases, you can read the data directly into a pandas DataFrame. This is a useful way to parse data, since you can directly use many helpful pandas functions to manipulate and process the data. The How to Read and Write CSV Files in Python and How to Read and Write Excel Files in Python courses include interactive exercises to demonstrate this functionality.

Since the Sunspots dataset is in the CSV format, we can read it in using pandas. If you haven’t installed pandas yet, you can do so with a quick command:

	pip install pandas

Now, you can import the data into a DataFrame:

>>> import pandas as pd
>>> df = pd.read_csv('Sunspots.csv', index_col=0, parse_dates=['Date'])

The read_csv() function automatically parses the data. It comes with many arguments to customize how the data is imported. For example, the index_col argument defines which column to use as the row label. The parse_dates argument defines which column holds dates. Our DataFrame, called df, holds our sunspots data with the variable name Monthly Mean Total Sunspot Number and the date of observation with the variable name Date.
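If you want to see what index_col and parse_dates do without downloading the file, here is a small sketch that reads CSV text from memory. We assume the file starts with an unnamed row-number column followed by the two named columns; the three rows are the first ones shown in the df.head() output in the next step:

```python
import io
import pandas as pd

# Assumed layout: an unnamed leading column holding the row number
csv_text = (
    ",Date,Monthly Mean Total Sunspot Number\n"
    "0,1749-01-31,96.7\n"
    "1,1749-02-28,104.3\n"
    "2,1749-03-31,116.7\n"
)

# index_col=0 uses the first column as row labels;
# parse_dates converts the 'Date' column from text to datetimes
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Date"])
print(sample_df.dtypes)
```

Without parse_dates, the Date column would be read as plain text, which makes date-based plotting and filtering harder.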

Step 2: Data Cleaning and Preparation

Cleaning the data involves handling missing values, converting variables into the correct data types, and applying any filters.

If your data has missing values, there are a number of possible ways to handle them. You could simply mark them as NaN (not a number). Alternatively, you could do a forward (backward) fill, which copies the previous (next) value into the missing position. Or you could interpolate, using neighboring values to estimate a value for the missing position. The method you choose depends on your use case.
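To make the difference between these options concrete, here is a minimal sketch on a made-up Series with two gaps (this is not the sunspot data):

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries
values = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

forward_filled = values.ffill()       # copy the previous value into each gap
backward_filled = values.bfill()      # copy the next value into each gap
interpolated = values.interpolate()   # linear interpolation by default

print(forward_filled.tolist())
print(backward_filled.tolist())
print(interpolated.tolist())
```

Forward and backward fills repeat an existing observation, while interpolation produces a new value between the neighbors, so the right choice depends on how smoothly you expect the data to vary.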

You should also check to see that numerical data is stored as a float or integer; if not, you need to convert it to the correct data type. If there are outliers in your data, you may consider removing them so as not to bias your results.
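Here is one possible sketch of both ideas on made-up data. The conversion uses pandas' to_numeric() function; the outlier filter uses the common 1.5 × IQR rule, which is just one convention among many and an assumption on our part:

```python
import pandas as pd

# Numbers stored as text, including one unparseable entry
raw = pd.Series(["12.5", "7", "n/a", "9.1"])
# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)

# Made-up measurements with one obvious outlier
measurements = pd.Series([10, 12, 11, 13, 12, 95])
q1 = measurements.quantile(0.25)
q3 = measurements.quantile(0.75)
iqr = q3 - q1
# Keep values within 1.5 * IQR of the quartiles
filtered = measurements[measurements.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(filtered.tolist())
```

Whether a value is a genuine outlier or a real, interesting observation is a judgment call, so it is worth looking at the data before filtering anything out.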

Or maybe you’re working with text data and you need to remove punctuation and numbers from your text and convert everything to lowercase. All these considerations fall under the umbrella of data cleaning. For some concrete examples, see our article Python Data Cleaning: A How-to Guide for Beginners.
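As a quick sketch of that kind of text cleaning, the following uses Python's re module on a made-up sentence. Note that this simple pattern also drops apostrophes and any non-English letters, which may or may not be what you want:

```python
import re

text = "Hello, World! It's 2024..."

# Remove everything except English letters and whitespace, then lowercase
cleaned = re.sub(r"[^A-Za-z\s]", "", text).lower()
# Collapse any repeated whitespace left behind
cleaned = " ".join(cleaned.split())
print(cleaned)
```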

Let’s start by getting an overview of our dataset:

>>> df.head()
 
   Date        Monthly Mean Total Sunspot Number
0  1749-01-31                               96.7
1  1749-02-28                              104.3
2  1749-03-31                              116.7
3  1749-04-30                               92.8
4  1749-05-31                              141.7

The df.head() function prints the first 5 rows of data. You can see the row number (starting from zero), the date (in yyyy-mm-dd format), and the observation of the number of sunspots for the month. To check the datatypes of the variables, execute the following command:

>>> df.dtypes

Date                                 datetime64[ns]
Monthly Mean Total Sunspot Number           float64
dtype: object

The date has the datatype datetime64, which is used to store dates in pandas, and the number of sunspots variable is a float.

Next, here's how to check if there are any missing data points in the Monthly Mean Total Sunspot Number variable:

>>> any(df['Monthly Mean Total Sunspot Number'].isna())

False

This takes advantage of the pandas isna() method, which flags missing values. It returns a Series of booleans – True if a value is missing, False if not. Then, we use Python’s built-in function any() to check if any of the booleans are True. This returns False, which indicates there are no missing values in our data. You can find more details about this important step in The Most Helpful Python Data Cleaning Modules.
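To see what a positive result looks like, here is a small sketch on a made-up Series containing one gap. It also shows two related pandas idioms: chaining .any() directly, and using .sum() to count the missing values:

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
sample = pd.Series([96.7, np.nan, 116.7])

has_missing = sample.isna().any()    # True if at least one value is missing
missing_count = sample.isna().sum()  # True counts as 1, so this counts the gaps

print(has_missing)
print(missing_count)
```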

Step 3: Exploratory Data Analysis

The next stage is to start analyzing your data by calculating summary statistics, plotting histograms and scatter plots, or performing statistical tests. The goal is to gain a better understanding of the variables, and then use this understanding to guide the rest of the analysis. After performing exploratory data analysis, you will have a better understanding of what your data looks like and how to use it to answer questions. Our article Python Exploratory Data Analysis Cheat Sheet contains many more details, examples, and ideas about how to proceed.

A good starting point is to do a basic statistical analysis to determine the mean, median, standard deviation, etc. This can easily be achieved by using the df.describe() function:

>>> df.describe()

       Monthly Mean Total Sunspot Number
count                        3265.000000
mean                           81.778775
std                            67.889277
min                             0.000000
25%                            23.900000
50%                            67.200000
75%                           122.500000
max                           398.200000

We have a total of 3,265 observations and a mean of over 81 sunspots per month. The minimum is zero and the maximum is 398. This gives us an idea of the range of typical values. The standard deviation is about 67, which gives us an idea about how much the number of sunspots varies.

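Notice that the 50th percentile (the median) is less than the mean. This implies the distribution is right-skewed: most months have relatively low sunspot numbers, while a smaller number of very active months pull the mean upward. This is very useful information if we want to do more advanced statistics, since some tests assume a normal distribution.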
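The link between the mean, the median, and skewness can be checked numerically. The sketch below uses a small made-up sample (not the sunspot data) in which a few large values pull the mean above the median; the pandas skew() method then reports a positive value:

```python
import pandas as pd

# Made-up sample: mostly small values plus a couple of large ones
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 10, 20])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # positive for a tail towards large values

print(mean_value, median_value, skewness)
```

The same three calls can be applied to df['Monthly Mean Total Sunspot Number'] to quantify the skew suggested by df.describe().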

We can confirm this by plotting a histogram of the number of sunspots per month. Visualization is an important skill in Python data analysis. Check out our article The Top 5 Python Libraries for Data Visualization. For our purposes, we’ll use Matplotlib. This too can easily be installed with a quick pip install command. The code to plot a histogram looks like this:

import matplotlib.pyplot as plt
plt.hist(df['Monthly Mean Total Sunspot Number'], bins=20)
plt.ylabel('Counts')
plt.xlabel('Monthly Mean Sunspots')
plt.show()

Now we can see the most common value is less than 20 sunspots for the month, and numbers above 200 are quite rare. Finally, let’s plot the time series to see the full dataset:

plt.plot(df['Date'], df['Monthly Mean Total Sunspot Number'])
plt.xlabel('Date')
plt.ylabel('Number of Sunspots')
plt.show()

We can see from the above plot there is a periodic increase and decrease in the number of sunspots. It looks like the maximum occurs roughly every 9 – 12 years. A natural question arises as to exactly how long that period is.

Signal processing is a detailed topic, so we’ll skim over some of the hairy details. To keep it simple, we need to decompose the above signal into a frequency spectrum, then find the dominant frequency. From this we can then compute the period. To compute the frequency spectrum, the Fourier Transform can be used, which is implemented in NumPy:

import numpy as np

# Perform Fast Fourier Transform
fft_result = np.fft.fft(df['Monthly Mean Total Sunspot Number'])
fft_freq = np.fft.fftfreq(len(df))

Try plotting the frequency spectrum and you’ll notice many peaks. One of those hairy details of signal processing is the presence of peaks at the start and end of the array np.abs(fft_result): the first element reflects the overall mean of the signal rather than any oscillation, and for real-valued input the second half of the array mirrors the first. We can see from the time series plotted above that the period should be somewhere between 9 and 12 years, so we can safely exclude these peaks by slicing the magnitude array to filter out unwanted frequencies:

# Find dominant frequency (and period)
magnitude = np.abs(fft_result)
dominant_freq_index = np.argmax(magnitude[1:100]) + 1
dominant_freq = fft_freq[dominant_freq_index]

# Convert frequency to period
dominant_period = 1 / dominant_freq
print("Dominant period: {} years".format(dominant_period/12))

The output is as follows:

Dominant period: 10.883333333333333 years

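We used NumPy’s argmax() function to find the index of the largest magnitude in the sliced spectrum, added 1 to account for the slice starting at index 1, used this index to look up the dominant frequency, and then converted it to a period. Since the data is sampled monthly, the period is in months, so we divide by 12 and print the result in years.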
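If you’d like to convince yourself that this procedure works, one option is to try it on a signal whose period you already know. The sketch below builds a synthetic signal (not the sunspot data) with a constant offset of 50 and an oscillation that repeats every 20 samples; all values are chosen purely for illustration:

```python
import numpy as np

# Synthetic signal: constant offset plus an oscillation with a period of 20 samples
n_samples = 200
t = np.arange(n_samples)
signal = 50 + 10 * np.sin(2 * np.pi * t / 20)

magnitude = np.abs(np.fft.fft(signal))

# The largest value sits at index 0: it is the sum of the signal (mean times n_samples)
print(np.argmax(magnitude))

# For real-valued input, the second half of the array mirrors the first
print(np.allclose(magnitude[1:], magnitude[1:][::-1]))

# Skipping index 0 and searching only the first half recovers the known period
peak_index = np.argmax(magnitude[1:n_samples // 2]) + 1
period = 1 / np.fft.fftfreq(n_samples)[peak_index]
print(peak_index, period)
```

This shows why the slice in the sunspot code starts at index 1 and stops well before the end of the array: the peak at index 0 and the mirrored peak near the end say nothing new about how the signal oscillates.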

This is a great example of using the understanding gained from exploratory data analysis to inform our data processing so we get a result that makes sense.
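One caveat worth keeping in mind is that the FFT only evaluates a discrete set of frequencies, so the many decimal places in the printed period overstate its precision. As a rough check, the sketch below lists the periods for the frequency bins next to the dominant one, assuming the 3,265 monthly observations reported by df.describe(); index 25 is the bin that reproduces the value printed above:

```python
import numpy as np

n_months = 3265  # number of observations reported by df.describe()
fft_freq = np.fft.fftfreq(n_months)  # frequencies in cycles per month

# Periods (in years) for the dominant bin and its two neighbors
periods = {index: 1 / fft_freq[index] / 12 for index in (24, 25, 26)}
for index, period_years in periods.items():
    print(index, round(period_years, 2))
```

The neighboring bins correspond to periods of roughly 11.3 and 10.5 years, so "about 11 years" is a fair summary of what this analysis can resolve.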

Step 4: Drawing Conclusions from Data

We were able to learn that the average number of sunspots per month is around 81, but the distribution is highly skewed to lower numbers. Indeed, the most common number of sunspots per month is less than 20, but in a period of high solar activity (75th percentile), there could be over 120.

By plotting the time series, we could see the signal is periodic and get an idea that there is a regular maximum and minimum in the number of sunspots. By doing some signal processing, we determined that the maximum number of sunspots occurs about every 11 years. From the time series plot we can see the last maximum was around 2014, meaning the next should be around 2025.
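The arithmetic behind that estimate is simple enough to write down. In this sketch, the year of the last maximum is the approximate value read off the time series plot, so the result is only as good as that reading:

```python
last_maximum_year = 2014                      # approximate, read off the time series plot
dominant_period_years = 10.883333333333333    # FFT result from Step 3

# Add one full period to the last maximum to estimate the next one
next_maximum_year = last_maximum_year + dominant_period_years
print(round(next_maximum_year))
```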

Further Python Data Analysis Examples

Working with the Sunspots dataset presents some unique advantages – e.g. it’s not a common dataset. We discuss this in our article 11 Tips for Building a Strong Data Science Portfolio with Python. This example of Python data analysis can also teach us a lot about programming in Python. We learnt how to read data into a pandas DataFrame and summarize our data using built-in functions. We did some plotting with Matplotlib and got a taste of signal processing with NumPy. We also did a little array slicing to get results that make sense. You’ll learn many of these important topics in the Introduction to Python for Data Science course and the Python for Data Science track.

We just scratched the surface of this analysis of sunspot data in Python. There are many more interesting questions which could be answered. For example, is there a trend in the number of sunspots over the 272 years of data? How long does the maximum last? How many sunspots should there be during our predicted next maximum? These questions can all be answered with Python.

There’s always more to learn on your Python data analysis journey, and books are a great resource. Our article The Best Python Books for Data Science has some great suggestions for your next trip to a bookstore. All the suggestions there will give you the tools to delve deeper into Python and data analysis techniques. Then, it’s a matter of practicing what you learn by starting a new data science project. Here are some Python Data Science Project Ideas. Happy coding!
