
7 Datasets to Practice Data Analysis in Python

Data analysis is a skill that is becoming more essential in today's data-driven world. One effective way to practice with Python is to take on your own data analysis projects. In this article, we’ll show you 7 datasets you can start working on.

Python is a great tool for data analysis – in fact, it has become very popular, as we discuss in Python’s Role in Big Data and Analytics. To become proficient in data analysis, Python beginners need to develop both their programming skills and their analytical knowledge. And the best way to do this is by creating your own data analysis projects.

Doing projects gives you a deep understanding of Python as well as the entire data analysis process. We’ve discussed this process in our Python Exploratory Data Analysis Cheat Sheet. It’s important to learn how to effectively explore different kinds of datasets – numerical, image, text, and even audio data.

But the first step is getting your hands on data, and it isn’t always obvious how to go about this. In this article, we’ll provide you with 7 datasets that you can use to practice data analysis in Python. We’ll explain what each dataset contains, what it can be used for, and show you some code examples to get you up and running. The examples range from beginner-friendly datasets to more advanced ones used for deep learning.

For those looking for some beginner-friendly Python learning material, I recommend our Learn Programming with Python track. It bundles together 5 courses, all designed to teach you the fundamentals. For aspiring data scientists, our Introduction to Python for Data Science course contains 141 interactive exercises. If you just want to try things out, our article 10 Python Practice Exercises for Beginners with Detailed Solutions contains exercises from some of our courses.

7 Free Python Datasets

Diabetes Dataset

The Diabetes dataset from scikit-learn is a collection of 442 patient medical records from a diabetes study conducted in the US. It contains 10 feature variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The data comes from the study used in Efron et al.'s "Least Angle Regression" paper, which scikit-learn cites as its source.

Here’s how to load the dataset into a pandas DataFrame and print the first couple of rows of some of the variables:

# Import scikit-learn's datasets module and pandas
from sklearn import datasets
import pandas as pd

# Load the diabetes dataset and create a dataframe
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add the target variable to the dataframe
df['target'] = diabetes.target

# Print the first 5 rows of some variables
print(df[['age', 'sex', 'bmi', 'target']].head())

        age       sex       bmi  target
0  0.038076  0.050680  0.061696   151.0
1 -0.001882 -0.044642 -0.051474    75.0
2  0.085299  0.050680  0.044451   141.0
3 -0.089063 -0.044642 -0.011595   206.0
4  0.005383 -0.044642 -0.036385   135.0

Here you can see the age, sex, and body mass index. These features have already been preprocessed: each is mean-centered and scaled (in scikit-learn's version, each column is scaled so that its squared values sum to one), so they no longer appear in their original units. The target is a quantitative measure of disease progression one year after baseline. To get started with a correlation analysis of some of the features in the dataset, do the following:

# Compute the correlation matrix for selected variables
corr = df[['age', 'sex', 'bmi', 'target']].corr()
print(corr)
             age       sex       bmi    target
age     1.000000  0.173737  0.185085  0.187889
sex     0.173737  1.000000  0.088161  0.043062
bmi     0.185085  0.088161  1.000000  0.586450
target  0.187889  0.043062  0.586450  1.000000

This shows that BMI is positively correlated with the target: the higher the BMI, the greater the disease progression tends to be. What relationships can you find between other variables in the data?
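
To dig further into a relationship like this, a quick visualization helps. Here's a minimal sketch using Matplotlib, assuming the df built above:

import matplotlib.pyplot as plt

# Scatter plot of BMI against disease progression
plt.scatter(df['bmi'], df['target'], alpha=0.5)
plt.xlabel('BMI (standardized)')
plt.ylabel('Disease progression')
plt.show()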

Forest Cover Types

The Forest covertype dataset, also from scikit-learn, is a collection of data from the US Forest Service (USFS). It contains cartographic variables (54 attributes in total) for 30 x 30 meter cells, along with the forest cover type observed in each cell.

This rich dataset can be used for a variety of projects, such as predicting the forest cover type of a given area, analyzing the relationship between different forest cover types and environmental factors, or creating a model to predict the probability of a certain type of forest cover in a given area. It can also be used to study the effects of human activities on forest cover.

Here’s how to read the data into a DataFrame and print the first 5 rows:

from sklearn.datasets import fetch_covtype

# Get the dataset as pandas objects
covtype = fetch_covtype(as_frame=True)

# Combine the features and the target into one DataFrame
df = covtype.data.copy()
df['target'] = covtype.target

# Print the first 5 rows of the dataframe
print(df.head())

   Elevation  Aspect  Slope  ...  Soil_Type_38  Soil_Type_39  target
0     2596.0    51.0    3.0  ...           0.0           0.0       5
1     2590.0    56.0    2.0  ...           0.0           0.0       5
2     2804.0   139.0    9.0  ...           0.0           0.0       2
3     2785.0   155.0   18.0  ...           0.0           0.0       2
4     2595.0    45.0    2.0  ...           0.0           0.0       5

You can see the variables include things like elevation, slope, and soil type. The target variable is an integer and corresponds to a forest cover type. Here’s how to print the most common types:

print('Count of each target value:', df['target'].value_counts())

Count of each target value: target
2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: count, dtype: int64

The most commonly occurring type of forest in this dataset is type 2, with 283,301 occurrences. This corresponds to Lodgepole Pine. Type 4, the Cottonwood/Willow type, is the least frequently occurring type.

To get started on an analysis project, first spend some time getting to know this data. Since this DataFrame is quite large, with many different variables, check out How to Filter Rows and Select Columns in a Python DataFrame with pandas for some tips on manipulating the data.
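
As a small sketch of that kind of manipulation, here's one way to filter the df built above (the elevation threshold is just an illustrative choice):

# Keep only high-elevation cells and a few columns of interest
high_elev = df.loc[df['Elevation'] > 3000, ['Elevation', 'Slope', 'target']]

# Which cover types dominate at high elevations?
print(high_elev['target'].value_counts())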

Yahoo Finance

Python’s yfinance library is a powerful tool for downloading financial data from the Yahoo Finance website. You’ll need to install this library, which can be done with pip install yfinance. It allows you to download data in a variety of formats; the data includes variables such as stock prices, dividends, splits, and more. To download data for Microsoft and plot the close price, do the following:

import matplotlib.pyplot as plt
import yfinance as yf

# Get the data for the stock Microsoft
data = yf.download('MSFT', start="2020-01-01", end="2020-12-31")

# Plot the close price
data['Close'].plot(title='Microsoft Close Price')
plt.show()

This uses the built-in pandas.DataFrame.plot() method. Running this code produces the following visualization:

[Plot: Microsoft close price over 2020]

There are many options for analysis at this stage. A regression analysis can be used to model the relationship between different financial variables. In the article Regression Analysis in Python, we show an example of how to implement this.
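
As a first step in that direction, here's a minimal sketch that fits a linear trend to the close price with NumPy; a full regression analysis would go further than this:

import numpy as np

# Fit a straight line to the close price, using the day index as the predictor
close = data['Close'].squeeze()  # squeeze in case yfinance returns multi-level columns
days = np.arange(len(close))
slope, intercept = np.polyfit(days, close.to_numpy(), 1)
print(f'Average daily change over the period: {slope:.3f} USD')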

Atmospheric Soundings

Atmospheric sounding data is data collected from weather balloons. A comprehensive dataset is maintained on the University of Wyoming's Upper Air Sounding website. The data includes variables such as temperature, pressure, dew point, wind speed, and wind direction. It can be used for a variety of projects, such as forecasting temperature and wind speed for your hometown. Since it has decades of observations, you could also use it to study the effects of climate change on the atmosphere.

Simply select an observation site and a time from the web interface. You can highlight the tabular data and copy-paste it into a text document. Save it as ‘weather_data.txt’. Then you can read it into Python like this:

import pandas as pd

# Read the data line by line, converting each value to a float
data = []
with open('weather_data.txt', 'r') as file:
    for line in file:
        variables = line.split()
        if variables:
            data.append([float(var) for var in variables])

# Load the data into a pandas DataFrame
df = pd.DataFrame(data, columns=['PRES', 'HGHT', 'TEMP', 'DWPT', 'RELH',
                                 'MIXR', 'DRCT', 'SKNT', 'THTA', 'THTE', 'THTV'])

This is a nice example of having to read the data in line by line. Using Matplotlib, you can plot the temperature as a function of height for your data as follows:

import matplotlib.pyplot as plt
plt.plot(df['TEMP'], df['HGHT'])
plt.ylabel('Height (m)')
plt.xlabel('Temperature (deg C)')
plt.show()
[Plot: temperature vs. height for the downloaded sounding]

Note that your plot may look a little different depending on what site and date you chose to download. If you want to download a large amount of data, you’ll need to write a web scraper. See our article Web Scraping with Python Libraries for more details.
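
As a hedged starting point, here's a sketch using the requests library. The URL parameters below (region, station number, and date range) are assumptions based on the site's text-listing interface, so verify them against a query you run in your browser:

import requests

# Illustrative query for one sounding as a text listing; the parameter
# names and values here are assumptions - check them in your browser.
url = ('https://weather.uwyo.edu/cgi-bin/sounding'
       '?region=naconf&TYPE=TEXT%3ALIST'
       '&YEAR=2023&MONTH=01&FROM=0112&TO=0112&STNM=72469')

response = requests.get(url)
print(response.text[:500])  # inspect the raw HTML before parsing further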

IMDB Movie Review

The IMDB Movie Review dataset is a collection of movie reviews from the Internet Movie Database (IMDB). It includes tens of thousands of reviews, each consisting of the review text and a sentiment label. The sentiment label is binary (positive or negative) and indicates the overall tone of the review.

This dataset can be used for a variety of projects, such as sentiment analysis, which aims to build models that can predict the sentiment of a review. It can also be used to identify the topics and themes discussed in the reviews. You can download the dataset from Kaggle. To read in the CSV data and start preprocessing, do the following:

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required NLTK resources (only needed once)
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK versions use 'punkt_tab' instead

df = pd.read_csv('IMDB Dataset.csv')

# To lowercase
df['review_clean'] = df['review'].str.lower()

# Remove stopwords (a set makes the membership test fast)
stop = set(stopwords.words('english'))
df['review_clean'] = df['review_clean'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))

# Tokenize reviews
df['review_clean'] = df['review_clean'].apply(lambda x: word_tokenize(x))

Text data is often quite messy, so cleaning and standardizing it as much as possible is important. See our article The Most Helpful Python Data Cleaning Modules for more information. Here, we have changed all characters to lowercase, removed stopwords (unimportant words), and tokenized the reviews (created a list of words from sentences). The article Null in Python: A Complete Guide has some more examples of working with text data.

There is more cleaning that could be done – for example, removing punctuation. But this could be the starting point of a natural language processing project. Try seeing if there is a correlation between the most frequently occurring words and the sentiment.
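
Here's a minimal sketch of that idea, counting the most common tokens per sentiment class. It assumes the df built above, with the 'sentiment' column from the Kaggle CSV:

from collections import Counter

# Most common words in positive vs. negative reviews
for label in ['positive', 'negative']:
    reviews = df.loc[df['sentiment'] == label, 'review_clean']
    counts = Counter(word for review in reviews for word in review)
    print(label, counts.most_common(10))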

Berlin Database of Emotional Speech

The Berlin Database of Emotional Speech (BDES) is a collection of German-language audio recordings of emotional speech. It was generated by having actors read out a set of sentences in different emotional states, such as anger, happiness, sadness, and fear. The data includes audio recordings of the actors' voices as well as annotations of the emotional states. This data can be used to study the acoustic features of emotional speech.

The data is available for download here. Metadata for the type of speech is recorded in the filename. For example, the ‘F’ in the filename ‘03a01Fa.wav’ means ‘Freude’, or happiness. To plot the spectrogram of this happy recording, do the following:

import scipy.io.wavfile as wav
import matplotlib.pyplot as plt

# Read in the .wav file
rate, data = wav.read('03a01Fa.wav')

# Plot the spectrogram
plt.specgram(data, Fs=rate)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()

This produces the following plot of frequency against time. The yellow colors indicate higher signal strength.

[Spectrogram: frequency vs. time for the happy speech recording]

Try plotting the same for angry speech, and see how the frequency, speed, and intensity of the speech changes. For more details on working with audio data in Python, check out the article How to Visualize Sound in Python.
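
For a quick numerical comparison, here's a sketch that contrasts the duration and overall loudness of two recordings. The filename '03a01Wa.wav' assumes the database's 'W' code for Wut (anger); substitute a file you actually downloaded:

import numpy as np
import scipy.io.wavfile as wav

for fname in ['03a01Fa.wav', '03a01Wa.wav']:  # happy vs. (assumed) angry
    rate, data = wav.read(fname)
    samples = data.astype(float)
    rms = np.sqrt(np.mean(samples ** 2))  # root-mean-square amplitude
    print(f'{fname}: {len(samples) / rate:.2f} s, RMS amplitude {rms:.0f}')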

MNIST

The MNIST dataset is a collection of handwritten digits (ranging from 0 to 9) that is commonly used for training various image processing systems. It was created by the National Institute of Standards and Technology (NIST) and is widely used in machine learning and computer vision. The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 pixel grayscale image associated with a label from 0 to 9.

To load and start working with this data, you’ll need to install Keras, a powerful Python library for deep learning. The easiest way to do this is with a quick pip install keras command from the terminal (recent Keras versions also need a backend such as TensorFlow installed). You can import the MNIST data and plot some of the digit images like this:

from keras.datasets import mnist
import matplotlib.pyplot as plt

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Plot the first 25 digits in a 5 x 5 grid
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(x_train[i], cmap='gray')
    plt.title(y_train[i])
    plt.axis('off')
plt.show()
[Figure: 5x5 grid of handwritten MNIST digits with their labels]

You can see the images of the handwritten digits with their labels above them. This dataset can be used to train a supervised image recognition model: the pixel values are the input data, and the labels are the ground truth the model uses to adjust its internal weights. You can see how this is implemented in the Keras code examples section.
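
As a minimal sketch of such a model (a small dense network rather than the convolutional architectures typically used for images), you could start with something like this:

import keras
from keras import layers

# Scale pixel values from 0-255 to the 0-1 range
x_train_norm = x_train / 255.0
x_test_norm = x_test / 255.0

# A small fully connected classifier
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train_norm, y_train, epochs=5,
          validation_data=(x_test_norm, y_test))

Even this simple model should reach well over 90% test accuracy after a few epochs.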

Improve Your Analysis Skills with Python Datasets

Getting started is often the hardest part of any challenge. In this article, we shared 7 datasets that you can use to start your next analysis project. The code examples we provided should serve as a starting point and allow you to delve deep into the data. From analyzing financial data to predicting the weather, Python can be used to explore and understand data in a variety of ways.

These datasets were chosen to give you exposure to working with a variety of different data types – numbers, text, and even images and audio. Our article An Introduction to NumPy in Python has more examples of working with numerical data.

With the right resources and practice, you can become an expert in data analysis and use Python datasets to make sense of the world around you. So, take the time to learn Python and start exploring the world of data!