25th Dec 2023
11 minutes read
7 Datasets to Practice Data Analysis in Python
Luke Hande
python data analysis online practice

Data analysis is a skill that is becoming more essential in today's data-driven world. One effective way to practice with Python is to take on your own data analysis projects. In this article, we’ll show you 7 datasets you can start working on.

Python is a great tool for data analysis – in fact, it has become very popular, as we discuss in Python’s Role in Big Data and Analytics. To become proficient in data analysis, Python beginners need to develop both their programming and their analysis knowledge, and the best way to do this is by creating your own data analysis projects.

Doing projects gives you a deep understanding of Python as well as the entire data analysis process. We’ve discussed this process in our Python Exploratory Data Analysis Cheat Sheet. It’s important to learn how to effectively explore different kinds of datasets – numerical, image, text, and even audio data. But the first step is getting your hands on data, and it isn’t always obvious how to go about this. For those looking to collect data for their Python projects, a web scraping solution can be a powerful tool to extract and analyze data efficiently, even from complex websites.

In this article, we’ll provide you with 7 datasets that you can use to practice data analysis in Python. We’ll explain what the data is, what it can be used for, and show you some code examples to get you on your feet. The examples range from beginner-friendly datasets to more advanced ones used for deep learning.

For those looking for some beginner-friendly Python learning material, I recommend our Learn Programming with Python track. It bundles together 5 courses, all designed to teach you the fundamentals. For the aspiring data scientists, our Introduction to Python for Data Science course contains 141 interactive exercises.
If you just want to try things out, our article 10 Python Practice Exercises for Beginners with Detailed Solutions contains exercises from some of our courses.

7 Free Python Datasets

Diabetes dataset

The Diabetes dataset from scikit-learn is a collection of 442 patient medical records from a diabetes study conducted in the US. It contains 10 variables, including age, sex, body mass index, average blood pressure, and six blood serum measurements. The data comes from the study analyzed in Efron et al. (2004), "Least Angle Regression".

Here’s how to load the dataset into a pandas DataFrame and print the first five rows of some of the variables:

# Import the dataset and pandas
from sklearn import datasets
import pandas as pd

# Load the diabetes dataset and create a dataframe
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add the target variable to the dataframe
df['target'] = diabetes.target

# Print the first 5 rows of some variables
print(df[['age', 'sex', 'bmi', 'target']].head())

        age       sex       bmi  target
0  0.038076  0.050680  0.061696   151.0
1 -0.001882 -0.044642 -0.051474    75.0
2  0.085299  0.050680  0.044451   141.0
3 -0.089063 -0.044642 -0.011595   206.0
4  0.005383 -0.044642 -0.036385   135.0

Here you can see the age, sex and body mass index. These variables have already been mean-centered and scaled so that the sum of squares of each column equals one, which is why the values are small and do not look like raw ages or BMI values. The target is a quantitative measure of disease progression one year after baseline.

To get started with a correlation analysis of some of the features in the dataset, do the following:

corr = df[['age', 'sex', 'bmi', 'target']].corr()
print(corr)

             age       sex       bmi    target
age     1.000000  0.173737  0.185085  0.187889
sex     0.173737  1.000000  0.088161  0.043062
bmi     0.185085  0.088161  1.000000  0.586450
target  0.187889  0.043062  0.586450  1.000000

This shows that BMI is positively correlated with the target (r ≈ 0.59): patients with a higher BMI tended to show greater disease progression. Since every patient in this dataset already has diabetes, this says nothing about the chance of developing the disease, and a correlation on its own does not establish a cause.
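To make the numbers in a correlation matrix less abstract, here is a minimal sketch of the Pearson coefficient, which is what pandas computes by default in corr(). The pearson function and the toy values are made up for illustration; they are not part of the dataset or of any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how the two variables deviate from their means together
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: the spread of each variable on its own
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy values, NOT taken from the diabetes data
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
progression = [80.0, 110.0, 95.0, 160.0, 150.0]

r = pearson(bmi, progression)
print(round(r, 3))
```

A value close to 1 means the two variables rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.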
What relationships can you find between other variables in the data?

Forest Cover Types

The Forest covertype dataset, also from scikit-learn, is a collection of data from the US Forest Service (USFS). It includes cartographic variables that describe 30 x 30 meter cells of forest, with a total of 54 attributes, along with the forest cover type of each cell.

This rich dataset can be used for a variety of projects, such as predicting the forest cover type of a given area, analyzing the relationship between different forest cover types and environmental factors, or creating a model to predict the probability of a certain type of forest cover in a given area. It can also be used to study the effects of human activities on forest cover.

Here’s how to read the data into a DataFrame and print the first 5 rows:

from sklearn.datasets import fetch_covtype

# Get the dataset; fetch_covtype returns a Bunch object, not a DataFrame
covtype = fetch_covtype(as_frame=True)

# Build a DataFrame from the features and add the cover type as 'target'
df = covtype.data.copy()
df['target'] = covtype.target

# Print the first 5 rows of the dataframe
print(df.head())

   Elevation  Aspect  Slope  ...  Soil_Type_38  Soil_Type_39  target
0     2596.0    51.0    3.0  ...           0.0           0.0       5
1     2590.0    56.0    2.0  ...           0.0           0.0       5
2     2804.0   139.0    9.0  ...           0.0           0.0       2
3     2785.0   155.0   18.0  ...           0.0           0.0       2
4     2595.0    45.0    2.0  ...           0.0           0.0       5

You can see the variables include things like elevation, slope, and soil type. The target variable is an integer and corresponds to a forest cover type. Here’s how to print the most common types:

print('Count of each target value:', df['target'].value_counts())

Count of each target value: target
2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: count, dtype: int64

The most commonly occurring type of forest in this dataset is type 2, with 283,301 occurrences. This corresponds to Lodgepole Pine. Type 4, the Cottonwood/Willow type, is the least frequently occurring type.

To get started on an analysis project, first learn more about this data.
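As a first exploration step, the sketch below filters rows, selects columns, and summarizes by cover type. To keep it self-contained it uses only the five rows printed above as a stand-in for the full DataFrame. The cover_names mapping is an assumption based on the dataset's documentation; only types 2 and 4 are named in this article.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows printed above
df = pd.DataFrame({
    'Elevation': [2596.0, 2590.0, 2804.0, 2785.0, 2595.0],
    'Slope': [3.0, 2.0, 9.0, 18.0, 2.0],
    'target': [5, 5, 2, 2, 5],
})

# Assumed mapping from integer code to cover type name
cover_names = {
    1: 'Spruce/Fir', 2: 'Lodgepole Pine', 3: 'Ponderosa Pine',
    4: 'Cottonwood/Willow', 5: 'Aspen', 6: 'Douglas-fir', 7: 'Krummholz',
}
df['cover_name'] = df['target'].map(cover_names)

# Filter rows with a boolean mask and keep two columns
lodgepole = df.loc[df['target'] == 2, ['Elevation', 'Slope']]

# Average elevation for each cover type
mean_elevation = df.groupby('cover_name')['Elevation'].mean()

print(lodgepole)
print(mean_elevation)
```

The same calls work unchanged on the full DataFrame, where comparing elevation across all seven cover types is a natural first question.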
Since this DataFrame is quite large with many different variables, check out How to Filter Rows and Select Columns in a Python DataFrame with pandas for some tips on manipulating the data.

Yahoo Finance

The yfinance library is a powerful tool for downloading financial data from the Yahoo Finance website. It is a third-party library, so you’ll need to install it first, which can be done with pip. It allows you to download data in a variety of formats; the data includes variables such as stock prices, dividends, splits, and more.

To download data for Microsoft and plot the close price, do the following:

import matplotlib.pyplot as plt
import yfinance as yf

# Get the data for the stock Microsoft
data = yf.download('MSFT', start="2020-01-01", end="2020-12-31")

# Plot the close price
data['Close'].plot(title='Microsoft Close Price')
plt.show()

This uses the built-in pandas.DataFrame.plot() method. Running this code produces a line chart of Microsoft's daily close price over 2020.

There are many options open for the analysis at this stage. A regression analysis can be used to model the relationship between different financial variables. In the article Regression Analysis in Python, we show an example of how to implement this.

Atmospheric Soundings

Atmospheric sounding data is data collected from weather balloons. A comprehensive dataset is maintained on the University of Wyoming's Upper Air Sounding website. The data includes variables such as temperature, pressure, dew point, wind speed, and wind direction.

This data can be used for a variety of projects, such as forecasting temperature and wind speed for your home town. Since it has decades of observations, you could use it to study the effects of climate change on the atmosphere.

Simply select an observation site and choose a time from the web interface. You can highlight the tabular data and copy-paste it into a text document. Save it as ‘weather_data.txt’.
Then you can read it into Python like this:

import pandas as pd

columns = ['PRES', 'HGHT', 'TEMP', 'DWPT', 'RELH', 'MIXR',
           'DRCT', 'SKNT', 'THTA', 'THTE', 'THTV']

# Read data line by line, keeping only complete numeric rows
data = []
with open('weather_data.txt', 'r') as file:
    for line in file:
        variables = line.split()  # split on any run of whitespace
        try:
            row = [float(var) for var in variables]
        except ValueError:
            continue  # skip header and separator lines
        if len(row) == len(columns):
            data.append(row)

# Data into pandas DataFrame
df = pd.DataFrame(data, columns=columns)

This is a nice example of having to read the data in line by line. Any header lines you copied are skipped, as are levels where some measurements are missing. Using Matplotlib, you can plot the temperature as a function of height for your data as follows:

import matplotlib.pyplot as plt

plt.plot(df['TEMP'], df['HGHT'])
plt.ylabel('Height (m)')
plt.xlabel('Temperature (deg C)')
plt.show()

Note that your plot will look different depending on which site and date you chose to download.

If you want to download a large amount of data, you’ll need to write a web scraper. See our article Web Scraping with Python Libraries for more details.

IMDB movie review

The IMDB Movie Review dataset is a collection of movie reviews from the Internet Movie Database (IMDB). It includes tens of thousands of movie reviews, each consisting of the review text and a sentiment label. The sentiment label is binary, either positive or negative, and indicates the sentiment of the review.

This dataset can be used for a variety of projects, such as sentiment analysis – which aims to build models that can predict the sentiment of a review. It can also be used to identify the topics and themes of a movie.

You can download the dataset from Kaggle.
To read in the CSV data and start preprocessing, do the following:

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (only needed once);
# newer NLTK versions may ask for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')
nltk.download('punkt')

df = pd.read_csv('IMDB Dataset.csv')

# To lowercase
df['review_clean'] = df['review'].str.lower()

# Remove stopwords
stop = set(stopwords.words('english'))
df['review_clean'] = df['review_clean'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

# Tokenize reviews
df['review_clean'] = df['review_clean'].apply(lambda x: word_tokenize(x))

Text data is often quite messy, so cleaning and standardizing it as much as possible is important. See our article The Most Helpful Python Data Cleaning Modules for more information. Here, we have changed all characters to lowercase, removed stopwords (very common words that carry little meaning), and tokenized the reviews (split each review into a list of words). The article Null in Python: A Complete Guide has some more examples of working with text data.

There is more cleaning that could be done – for example, removing punctuation. But this could be the starting point of a natural language processing project. Try checking whether the most frequently occurring words are associated with the sentiment.

Berlin Database of Emotional Speech

The Berlin Database of Emotional Speech (BDES) is a collection of German-language audio recordings of emotional speech. It was generated by having actors read out a set of sentences in different emotional states, such as anger, happiness, sadness, and fear. The data includes audio recordings of the actors' voices as well as annotations of the emotional states. This data can be used to study the acoustic features of emotional speech.

The data is available to download for free. Metadata for the type of speech is recorded in the filename. For example, the ‘F’ in the filename ‘03a01Fa.wav’ means ‘Freude’ or Happiness.
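Because the metadata lives in the filename, a small helper can turn each name into a labeled record. The sketch below is a hypothetical parser, not part of the database or of any library. It assumes the naming scheme described in the database's documentation (two characters for the speaker, three for the sentence, one emotion letter, one version letter); only the meaning of ‘F’ is confirmed above, so the other codes should be checked against that documentation.

```python
from pathlib import Path

# Emotion letters are German initials. Only 'F' is confirmed in this
# article; the other entries are an assumption to verify before use.
EMOTION_CODES = {
    'W': 'anger',      # Wut
    'L': 'boredom',    # Langeweile
    'E': 'disgust',    # Ekel
    'A': 'fear',       # Angst
    'F': 'happiness',  # Freude
    'T': 'sadness',    # Trauer
    'N': 'neutral',
}

def parse_filename(filename):
    """Split a name such as '03a01Fa.wav' into its assumed parts."""
    stem = Path(filename).stem  # drops the '.wav' extension
    return {
        'speaker': stem[0:2],
        'text': stem[2:5],
        'emotion': EMOTION_CODES.get(stem[5], 'unknown'),
        'version': stem[6:],
    }

info = parse_filename('03a01Fa.wav')
print(info)
```

Applying this to every file in the download folder gives a table of labels that can be used to group recordings by emotion.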
To plot the spectrogram of a happy German speaker, do the following:

import scipy.io.wavfile as wav
import matplotlib.pyplot as plt

# Read in the .wav file
rate, data = wav.read('03a01Fa.wav')

# Plot the spectrogram
plt.specgram(data, Fs=rate)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()

This produces a plot of frequency against time, in which the yellow colors indicate higher signal strength. Try plotting the same for angry speech, and see how the frequency, speed, and intensity of the speech change.

For more details on working with audio data in Python, check out the article How to Visualize Sound in Python.

MNIST

The MNIST dataset is a collection of handwritten digits (ranging from 0 to 9) that is commonly used for training various image processing systems. It was assembled from handwriting samples collected by the National Institute of Standards and Technology (NIST) and is widely used in machine learning and computer vision.

The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 pixel grayscale image associated with a label from 0 to 9. To load and start working with this data, you’ll need to install Keras, which is a powerful Python library for deep learning. The easiest way to do this is with a quick pip install keras command from the terminal. Keras also needs a backend such as TensorFlow to be installed.

You can import the MNIST data and plot some of the digit images like this:

from keras.datasets import mnist
import matplotlib.pyplot as plt

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Plot the first 25 digits in a 5x5 grid
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(x_train[i], cmap='gray')
    plt.title(str(y_train[i]))
    plt.axis('off')
plt.show()

You can see the images of the handwritten digits with their labels above them. This dataset can be used to train a supervised image recognition model. The pixel values are the input data, and the labels are the truth that the model uses to adjust the internal weights.
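Before pixel values and labels are fed to a model, they are usually rescaled and reshaped. The sketch below shows three common preparation steps on random arrays with MNIST-like shapes, so it runs without downloading anything. The variable names are illustrative, and the exact steps depend on the model you choose, so treat this as one reasonable setup and not as the required one.

```python
import numpy as np

# Random stand-in data with the same shapes and value ranges as MNIST
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
y = rng.integers(0, 10, size=100)

# Scale pixel values from the range 0-255 to the range 0-1
x_scaled = x.astype('float32') / 255.0

# Flatten each 28x28 image into a vector of 784 values
x_flat = x_scaled.reshape(len(x_scaled), -1)

# One-hot encode the labels: digit 3 becomes [0, 0, 0, 1, 0, ...]
y_onehot = np.eye(10, dtype='float32')[y]

print(x_flat.shape, y_onehot.shape)
```

With the real data, x_train and y_train from mnist.load_data() would take the place of x and y. Flattening suits a simple dense network, while a convolutional model would keep the 28x28 layout.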
You can see how such a model is implemented in the Keras code examples section.

Improve Your Analysis Skills with Python Datasets

Getting started is often the hardest part of any challenge. In this article, we shared 7 datasets that you can use to start your next analysis project. The code examples we provided should serve as a starting point and allow you to delve deep into the data. From analyzing financial data to predicting the weather, Python can be used to explore and understand data in a variety of ways.

These datasets were chosen to give you exposure to working with a variety of different data types – numbers, text, and even images and audio. Our article An Introduction to NumPy in Python has more examples of working with numerical data.

With the right resources and practice, you can become an expert in data analysis and use Python datasets to make sense of the world around you. So, take the time to learn Python and start exploring the world of data!