
Python Data Analysis Example: Ames Housing Dataset

Are you curious about analyzing data with Python? This article walks you through a step-by-step Python data analysis example.

Have you ever wondered how companies make informed decisions based on vast amounts of data? Have you ever found yourself staring at a dataset, wondering where to start your data analysis? This article will guide you through the process of conducting data analyses in Python and transforming data into actionable knowledge.

If you are reading this article because you are interested in using Python for data analysis, check out our Introduction to Python for Data Science. This 12-hour interactive course will teach you the fundamentals of working with data in Python. You only need an Internet connection and a willingness to learn. By the end of the course, you’ll understand the essentials of data operations in Python.

Why Use Python for Data Analysis?

Data analysis is the process of inspecting, cleaning, and transforming data. Its goal is to discover useful information, draw conclusions, and support decision-making. It involves using statistical, logical, and computational techniques to interpret data, identify patterns, and uncover hidden insights.

With its extensive array of purpose-built libraries – such as pandas, NumPy, Matplotlib, and Seaborn – Python has become popular for data analysis. These libraries are collections of code that make it simpler, faster, and easier to process, analyze, and visualize data. Understanding these libraries is essential for data analysis in Python.

Let's deepen our knowledge of data analysis in Python with a real-world dataset.

The Ames Housing Dataset

For this article, we'll use the Ames Housing Dataset from Kaggle. It is a very popular dataset you can use to learn Python for data analysis. We'll cover the key steps of analyzing data and include a practical project to apply what you learn. If you want to develop your data science skills – whether to advance your current career path or switch careers entirely – it is always a good idea to build a data science portfolio.

This dataset contains real-estate sale data for the city of Ames, Iowa. It includes features such as the number of bedrooms, the lot size, and the overall condition of the property.

Each row in the dataset corresponds to one property sale in the city of Ames. The dataset can be used to build a model to forecast the price of a house in Ames.

Here's a brief description of the columns we’ll use in this article:

  • SalePrice – The property's sale price in dollars. This is the target variable that you're trying to predict.
  • MSSubClass – The building class.
  • MSZoning – The general zoning classification.
  • LotFrontage – The linear feet of street connected to the property.
  • LotArea – The lot size (in square feet).
  • Neighborhood – The property’s physical location within Ames city limits.
  • 1stFlrSF – First floor square feet.
  • 2ndFlrSF – Second floor square feet.

For a more detailed description of the dataset's structure, refer to the data description on Kaggle.

Python Data Analysis Example Walkthrough

Our goal in this article is to give you a basic overview of how to analyze a dataset with Python.

One question we can work on with the Ames dataset is this:

  • What features contribute the most to the sale price?

Step 1: Import Data

To start the analysis, we must import the data into Python. A dataset may come in various formats, e.g. CSV, JSON, or Excel.

CSV stands for comma-separated values. It is a text file that stores tabular data, where one line represents one data record. The values for each record are typically separated (as the name suggests) by commas. This format is popular because it’s human-readable, easy to process, and easy to share. Most data applications can export and import data in the CSV format.
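To make this concrete, here's a minimal sketch that writes two illustrative records to a CSV file using Python's built-in csv module; the file name example.csv and the sample values are made up for this example:

import csv

# Two illustrative records: a header row plus two data rows
rows = [["Id", "SalePrice"], [1, 208500], [2, 181500]]
with open("example.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# example.csv now contains plain, human-readable text:
# Id,SalePrice
# 1,208500
# 2,181500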

JSON stands for JavaScript Object Notation. It is a format typically used to transmit data in web applications; you'll typically receive JSON data from an application through an API. It uses key-value pairs to represent structured data. It's a text format, so it can be read by humans, but its nested structure makes it harder to scan than a CSV.
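For comparison, here's a minimal sketch of the same two records as JSON, parsed with Python's built-in json module; the values are again made up:

import json

# The same two records, represented as a list of key-value objects
text = '[{"Id": 1, "SalePrice": 208500}, {"Id": 2, "SalePrice": 181500}]'
records = json.loads(text)
print(records[0]["SalePrice"])  # prints 208500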

Finally, there's the Excel format, which covers files with the .xls and .xlsx extensions. These are Microsoft Excel spreadsheets, often used for storing data and performing data analysis in business and financial settings.

If you want to work with any of these formats in Python, check out our courses on reading and writing JSON files, CSV files, and Excel files in Python.

In our case, the Ames dataset comes in a CSV file. The data is split into a training dataset and a test dataset. This is because the dataset is meant for machine learning: an algorithm is trained on the train dataset, and the model's performance is then assessed on unseen data (the test dataset). Since we just want to learn how to perform a data analysis, we'll work with the train dataset only.

Let's import the train dataset with the pandas Python library. If you haven't installed it yet, you can do so with the following command:

	pip install pandas

Next, we need to import the library, load the data, and view its first few lines. Here’s the code:

import pandas as pd

# Load dataset
df = pd.read_csv("train.csv") 
df.head()

First, we import the pandas library as pd. Every time we want to use a function from the pandas library, we’ll need to add pd before that function.

Next, we call the read_csv function from pandas to load the CSV file called train.csv and store the data as a pandas DataFrame called df. Finally, we call head() to peek at the first five rows of the dataset.

You should have an output similar to the one below. However, please note that for the sake of readability, I’ve reproduced only a few columns.

Id  MSSubClass  MSZoning  LotFrontage  LotArea  SalePrice
 1          60        RL         65.0     8450     208500
 2          20        RL         80.0     9600     181500
 3          70        RL         68.0    11250     223500
 4          60        RL         60.0     9550     140000
 5          60        RL         84.0    14260     250000

5 rows x 81 columns

 

Now that the data is loaded, we can move on to the next step: preparing our dataset for analysis. This process is called data cleaning.

Step 2: Data Cleaning and Preparation

Python data cleaning involves handling missing values, filtering data, and converting data into the appropriate types for analysis. This is done with the help of various Python libraries.

In the following example, we’ll remove the Id column. It’s for identification purposes only and we cannot use it to derive any meaningful insight. Here is the code:

# Drop Id column
df = df.drop(['Id'], axis=1)

Passing axis=1 to the drop() function tells it to remove a column (here, Id) rather than a row. The result of the operation is then assigned back to the df variable, so the Id column is now removed.
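As a side note, recent pandas versions also accept a columns argument, which reads a bit more explicitly; either form produces the same result:

# Equivalent, more explicit spelling
df = df.drop(columns=['Id'])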

Here is our updated table:

MSSubClass  MSZoning  LotFrontage  LotArea  SalePrice
        60        RL         65.0     8450     208500
        20        RL         80.0     9600     181500
        70        RL         68.0    11250     223500
        60        RL         60.0     9550     140000
        60        RL         84.0    14260     250000

Next, let's put all column names into lowercase to reduce the risk of typos:

# lowercase column names
df.columns = map(str.lower, df.columns)

We use the map() function to apply the str.lower function to the dataset’s column names. str.lower is a built-in Python function that converts a string to all lowercase characters. Finally, we assign the result of the operation to df.columns. Now, each column name is replaced with its lowercase version.
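Pandas also exposes vectorized string methods on the column index through the .str accessor, so the same renaming can be written without map(); either version works:

# Equivalent renaming using the pandas .str accessor
df.columns = df.columns.str.lower()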

This gives us the following output:

mssubclass  mszoning  lotfrontage  lotarea  saleprice
        60        RL         65.0     8450     208500
        20        RL         80.0     9600     181500
        70        RL         68.0    11250     223500
        60        RL         60.0     9550     140000
        60        RL         84.0    14260     250000

All right, let’s check if we have any missing values. We can use another pandas function for this:

print(df.isnull().sum().to_string())

The isnull() function will check each value in the data frame and return either True (if the value is missing) or False (if the value is present). The sum() function will output the number of missing values per column. We convert the output to a string using to_string() and display everything using a print() statement to avoid truncating the output. Here are the results:

mssubclass          0
mszoning            0
lotfrontage       259
lotarea             0
neighborhood        0
1stflrsf            0
2ndflrsf            0
saleprice           0

This shows that most of the columns don’t have any missing values. The exception is lotfrontage, which has 259 missing values.
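On a dataset with 81 columns, scanning the full printout is tedious. Here's a small sketch that filters the output down to only the columns with at least one missing value:

# Show only the columns that actually have missing values
missing = df.isnull().sum()
print(missing[missing > 0])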

The way you fix missing values depends on your use case. Sometimes, the value is missing because there’s no record – that’s the case in this dataset. We can use different ways to replace the missing values; the one we choose will depend on what we want to do with the data.

Let’s handle the lotfrontage column. Since the value is missing because the feature does not exist for these properties, we can set this value to 0.

df['lotfrontage'] = df['lotfrontage'].fillna(0)

In the code above, we select the lotfrontage column and use the fillna() function to replace all the missing values with 0. This may or may not be the best solution; we could also replace the missing values with the mean, for example. In a machine learning application, replacing missing values with the mean or another relevant value could work better, as it avoids introducing bias and maintains the predictive power of the feature.
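For reference, mean imputation is a one-line change; this sketch would be used instead of the fillna(0) line above:

# Alternative: replace missing lot frontage values with the column mean
df['lotfrontage'] = df['lotfrontage'].fillna(df['lotfrontage'].mean())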

We can proceed similarly with other columns that have missing values. For more on this, read our article on data cleaning.

Once the dataset is cleaned, it is time to study it. This step is called "exploratory data analysis."

Step 3: Exploratory Data Analysis

Exploratory data analysis is a crucial phase that involves understanding the dataset's characteristics. During this step, we apply basic statistical analysis to the data, typically computing the mean, the median, and the standard deviation of each column to understand how the values are distributed.

The mean represents the average value of the dataset, computed by summing all values and dividing by the total number of observations. The median is the middle value of the dataset; unlike the mean, it is not very sensitive to extreme values.

Another value of interest is the standard deviation, the square root of the variance, which measures the data spread in the same units as the data. The variance itself measures how far the data points are spread from their mean value. These statistics are useful for identifying patterns, understanding trends, and detecting outliers.
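In pandas, each of these statistics is a single method call. For example, on the sale price column:

# Central tendency and spread of the sale price
print(df['saleprice'].mean())    # average value
print(df['saleprice'].median())  # middle value, robust to outliers
print(df['saleprice'].std())     # standard deviation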

Because a picture can speak a thousand words, data visualization techniques like bar charts, histograms, and scatter plots are used to unveil patterns and trends. Matplotlib and Seaborn (which is built on top of Matplotlib) are two popular Python data visualization libraries.

In this dataset, we want to see whether there's a correlation between the price of a house and its features, such as the number of bedrooms or whether or not it has a fireplace. Let’s do an exploratory analysis and see what we find.

Step 3a: Basic Statistical Analysis

We can display several statistical values with the describe() function from pandas. It looks like this:

df.describe()

This code generates statistics for each column of the df DataFrame. It outputs the count (the number of non-null values in each column), mean, standard deviation (std), minimum value (min), the 25th percentile (25%), the median (50%), the 75th percentile (75%), and the maximum value (max). This gives us an overview of the distribution of the values in each column.

Here is the output that you would get. For readability purposes, I will not reproduce all the columns:

       mssubclass  lotfrontage    lotarea  saleprice
count     1460.00      1460.00    1460.00    1460.00
mean        56.897       57.62   10516.82  181921.19
std         42.30        34.66    9981.26   79442.50
min         20.00         0.00    1300.00   34900.00
25%         20.00        42.00    7553.50  129975.00
50%         50.00        63.00    9478.50  163000.00
75%         70.00        79.00   11601.50  214000.00
max        190.00       313.00  215245.00  755000.00
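If you only care about one variable, describe() also works on a single column:

# Summary statistics for the target variable only
print(df['saleprice'].describe())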

Step 3b: Data Visualization

Plotting the features against the house price makes it easy to understand the price drivers of a house.

Let’s visualize the distribution of the sale price using the matplotlib and seaborn libraries. If you haven’t installed any of the visualization libraries used below, you can do so using pip:

pip install matplotlib
pip install seaborn

Then we run the following:

import matplotlib.pyplot as plt
import seaborn as sns

# Shape of sale price distribution
sns.histplot(df['saleprice'], kde=True)

In the code above, the histplot() function from seaborn is run on the saleprice column. With kde=True, it combines a histogram with a kernel density estimate, providing a smooth representation of the distribution. (The older distplot() function you may see in other tutorials has been deprecated and removed in recent seaborn versions.) Below is the visualization:

(Figure: distribution of the sale price, shown as a histogram with a density curve)

Another interesting visualization is the scatterplot between the target, the sale price, and various features; this helps us derive correlations between variables.

In the following example, we plot the surface of the first floor in square feet against the sale price:

sns.lmplot(x='1stflrsf', y='saleprice', data=df, line_kws={'color': 'red'})

The lmplot() function combines a scatterplot with a linear regression line. It helps us understand the relationship between 1stflrsf, which represents the first-floor surface in square feet, and the sale price.

Code-wise, we call lmplot() from the seaborn library as sns.lmplot() and add the arguments for the x and y axes, as well as the data argument, which is the df variable holding the dataset. Finally, the line_kws argument sets the parameters of the regression line.

We get the following:

(Figure: scatterplot of first-floor surface against sale price, with a red regression line)

This visualization shows that the price increases with the surface of the first floor.
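To put a number on that relationship, we can compute the Pearson correlation coefficient between the two columns; values close to 1 indicate a strong positive linear relationship:

# Pearson correlation between first-floor surface and sale price
print(df['1stflrsf'].corr(df['saleprice']))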

To conclude the exploratory analysis, let’s have one last example using categorical data. A good way of representing categorical data is through a box plot. Here, we’ll plot neighborhood data against the sale price to see if we can identify a correlation or a pattern.

Here is the code:

sns.boxplot(x='neighborhood', y='saleprice', data=df)
plt.xticks(rotation=45, ha='right')

We call the boxplot() function from seaborn. Then we input the neighborhood column as the x-axis and the saleprice column as the y-axis from the df dataset.

For readability purposes, we call the xticks() function from matplotlib.pyplot to adjust the labels on the x-axis. We rotate them by 45 degrees and use the ha argument to align them to the right so the labels align correctly with the ticks.

(Figure: box plot of sale price by neighborhood)

This code creates a box-and-whisker plot of the neighborhood data against the sale price. A box plot consists of boxes that represent the interquartile range of the data (25th to 75th percentile), with a line inside the box representing the median. The "whiskers" extend to the minimum and maximum values within a certain distance from the quartiles. Outliers beyond the whiskers are usually represented as individual points. It is a very useful plot for visualizing how a numeric variable is distributed across categories.
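The same comparison can be made numerically. Here's a sketch that ranks neighborhoods by their median sale price:

# Median sale price per neighborhood, highest first
medians = df.groupby('neighborhood')['saleprice'].median()
print(medians.sort_values(ascending=False))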

Step 4: Drawing Conclusions from Data

The scatterplot showed that house prices increase with the first-floor surface: there is a positive correlation between the two variables. The box plot showed that the price also varies significantly depending on where the house is located.

From the plots above, we can conclude that the house price is driven by the first-floor surface and the house's location, as we can see a pattern between these variables and the sale price.
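One way to extend this conclusion to every numeric feature at once is to rank the features by their correlation with the sale price; this is a sketch, assuming all numeric columns are of interest:

# Correlation of each numeric column with the sale price, strongest first
corr = df.select_dtypes(include='number').corr()['saleprice']
print(corr.sort_values(ascending=False))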

There are more examples we could cover, but these give you an idea of a data analyst’s or data scientist’s daily work: interpreting the results of the data analysis and understanding the limitations and potential biases. This is where domain knowledge and critical thinking come into play to derive meaningful conclusions from the data.

Subsequent steps would be to analyze the data in-depth and understand the drivers of the housing price. We cannot go through the whole dataset, as it would make this article too long. However, the snippets above give you a good starting point on what to look for when analyzing a dataset. At the end of the day, it is about questioning the data and trying to understand it in depth, with the goal of drawing meaningful conclusions to help in decision-making.

What’s Next for Python Data Analysis?

This article has provided a step-by-step guide to data analysis in Python using a real-world dataset. By mastering these techniques, you can unlock the power of data to make informed decisions.

If you're interested in learning to use Python for data analysis, check out our dedicated Python for Data Science learning track. It is a set of five interactive courses that will teach you how to use Python to get started with data analysis.

Once you’re done with the learning track, I encourage you to read books on data science, explore other courses on LearnPython.com, play with other datasets, and build an entire data science project.