
How to Generate a Data Summary in Python

Learn different methods for summarizing data in Python.

Data is power. The more data we have, the better and more robust products we can create. However, working with large amounts of data has its challenges; we need software tools and packages to gain insights, such as tools for creating a data summary in Python.

A substantial number of data-based solutions and products use tabular data, that is, data stored in a table format with labeled rows and columns. Each row represents an observation (i.e., a data point) and columns represent features or attributes about that observation.

As the numbers of rows and columns increase, it becomes more difficult to inspect data manually. Since we almost always work with large datasets, using a software tool to summarize data is a fundamental requirement.

Summaries of data come in handy for a variety of tasks:

  • Learning the underlying structure of a dataset.
  • Understanding the distribution of features (i.e., columns).
  • Exploratory data analysis.

As the leading programming language in the data science ecosystem, Python has libraries for creating data summaries. The most popular and commonly used library for this purpose is pandas. LearnPython has an Introduction to Python for Data Science course that covers the pandas library in great detail.

pandas is a data analysis and manipulation library for Python. In this article, we go over several examples to demonstrate how to use pandas for creating and displaying data summaries.

Getting Started With pandas

Let’s start with importing pandas.

import pandas as pd

Consider a sales dataset in CSV format that contains the sales and stock quantities of some products and their product groups. We create a pandas DataFrame for the data in this file and display the first 5 rows as below:

df = pd.read_csv("sales.csv")
df.head()

Output:

  product_group  product_code  sales_qty  stock_qty
0             A          1000        337        791
1             C          1001        502        757
2             A          1002        402        827
3             A          1003        411       1227
4             C          1004        186        361

A data summary in pandas starts with checking the size of the data. The shape attribute returns a tuple with the counts of rows and columns of a DataFrame.

>>> df.shape
(300, 4)

It contains 300 rows and 4 columns. This is a clean dataset that is ready to be analyzed. However, most real-life datasets require cleaning. Here is an article that explains the most helpful Python data cleaning modules.

We continue summarizing the data by focusing on each column separately. pandas has two main data structures: DataFrame and Series. A DataFrame is a two-dimensional data structure, whereas a Series is one-dimensional. Each column in a DataFrame may be considered a Series.
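This distinction is easy to see in a minimal, self-contained sketch (using a small hypothetical DataFrame rather than the sales data):

```python
import pandas as pd

# A tiny hypothetical DataFrame, just to illustrate the two data structures
df = pd.DataFrame({
    "product_group": ["A", "C", "A"],
    "sales_qty": [337, 502, 402],
})

print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(df["sales_qty"]))  # <class 'pandas.core.series.Series'>
```

Selecting a single column with bracket notation returns a Series, which is why the column-level functions in the following sections operate on one-dimensional data.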

Since the characteristics of categorical and numeric data are very different, it is better to cover them separately.

Categorical Columns

If a column contains categorical data, as the product_group column in our DataFrame does, we can check the number of distinct values in it. We do so with the unique() or nunique() functions.

>>> df["product_group"].unique()
array(['A', 'C', 'B', 'G', 'D', 'F', 'E'], dtype=object)
>>> df["product_group"].nunique()
7

The nunique() function returns the count of distinct values, whereas the unique() function displays the distinct values. Another commonly used summary function on categorical columns is value_counts(). It shows the distinct values in a column along with the counts of their occurrences. Thus, we get an overview of the distribution of the data.

>>> df["product_group"].value_counts()
A    102
B     75
C     63
D     37
G      9
F      8
E      6
Name: product_group, dtype: int64

Group A has the most products, followed by Group B with 75 products. The output of the value_counts() function is sorted in descending order by the count of occurrences.
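The value_counts() function can also report relative frequencies instead of raw counts by passing normalize=True. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["A", "A", "B", "A", "C"])  # made-up categorical values

# normalize=True returns each value's share of the total instead of its count
print(s.value_counts(normalize=True))
```

This is convenient when percentages are more informative than absolute counts, such as when comparing groups of very different sizes.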

Numeric Columns

When working with numeric columns, we need different methods to summarize data. For instance, it does not make sense to check the number of distinct values for the sales quantity column. Instead, we calculate statistical measures such as mean, median, minimum, and maximum.

Let’s first calculate the average value of the sales quantity column.

>>> df["sales_qty"].mean()
473.557

We simply select the column of interest and apply the mean() function. We can perform this operation on multiple columns as well.

>>> df[["sales_qty","stock_qty"]].mean()
sales_qty     473.557
stock_qty    1160.837
dtype: float64

When selecting multiple columns from a DataFrame, make sure to specify them as a list. Otherwise, pandas generates a key error.
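The difference is easy to reproduce with a small hypothetical DataFrame: a list of column names selects multiple columns, while passing the names without a list raises a KeyError.

```python
import pandas as pd

df = pd.DataFrame({"sales_qty": [1, 2], "stock_qty": [3, 4]})  # hypothetical values

# A list of column names selects multiple columns
print(df[["sales_qty", "stock_qty"]].mean())

# Without the inner list, pandas treats this as a single (tuple) key
try:
    df["sales_qty", "stock_qty"]
except KeyError as e:
    print("KeyError:", e)
```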

Just as easily as we can calculate a single statistic on multiple columns in a single operation, we can calculate multiple statistics at once. One option is to use the apply() function as below:

>>> df[["sales_qty","stock_qty"]].apply(["mean","median"])

Output:

          sales_qty    stock_qty
mean     473.556667  1160.836667
median   446.000000  1174.000000

The functions are written in a list and then passed to apply(). The median is the value in the middle when the values are sorted. Comparing the mean and median values gives us an idea about the skewness of the distribution.
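A quick illustration of that skewness check, using made-up values: in right-skewed data, a few large values pull the mean well above the median.

```python
import pandas as pd

# Made-up, right-skewed values: one large outlier
s = pd.Series([10, 12, 11, 13, 95])

print(s.mean())    # 28.2 -- pulled up by the outlier
print(s.median())  # 12.0 -- unaffected by it
```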

We have lots of options to create a data summary in pandas. For instance, we can use a dictionary to calculate separate statistics for different columns. Here is an example:

df[["sales_qty","stock_qty"]].apply(
    {
        "sales_qty":["mean","median","max"],
        "stock_qty":["mean","median","min"]
    }
)

Output:

          sales_qty    stock_qty
mean     473.556667  1160.836667
median   446.000000  1174.000000
max      999.000000          NaN
min             NaN   302.000000

The keys of the dictionary indicate the column names and the values show the statistics to be calculated for that column.

We can do the same operations with the agg() function instead of apply(). The syntax is the same, so don’t be surprised if you come across tutorials that use the agg() function instead.
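As a quick check (with hypothetical values), both functions produce the same result for the list syntax:

```python
import pandas as pd

df = pd.DataFrame({"sales_qty": [1, 3], "stock_qty": [2, 4]})  # hypothetical values

via_apply = df[["sales_qty", "stock_qty"]].apply(["mean", "median"])
via_agg = df[["sales_qty", "stock_qty"]].agg(["mean", "median"])

print(via_apply.equals(via_agg))  # True
```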

pandas is a highly useful and practical library in many aspects. For instance, we can calculate various statistics on all numeric columns with just one function: describe():

>>> df.describe()

Output:

         sales_qty    stock_qty
count   300.000000   300.000000
mean    473.556667  1160.836667
std     295.877223   480.614653
min       4.000000   302.000000
25%     203.000000   750.500000
50%     446.000000  1174.000000
75%     721.750000  1590.500000
max     999.000000  1988.000000

The statistics in this DataFrame give us a broad overview of the distribution of values. The count is the count of values (i.e., rows). The “25%,” “50%,” and “75%” indicate the first, second, and third quartiles, respectively. The second quartile (i.e., 50%) is also known as the median. Finally, “std” is the standard deviation of the column.
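Each of these statistics is also available as an individual method, which is handy when only one number is needed. A sketch with made-up values:

```python
import pandas as pd

s = pd.Series([4, 203, 446, 721, 999])  # made-up values

# Quartiles via quantile(); 0.5 is equivalent to the median
print(s.quantile(0.25))  # 203.0
print(s.quantile(0.50))  # 446.0
print(s.std())           # sample standard deviation (ddof=1), as in describe()
```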

A data summary in Python can be created for a specific part of the DataFrame. We just need to filter the relevant part before applying the functions.

For instance, we describe the data for just Product Group A as below:

df[df["product_group"]=="A"].describe()

We first select the rows whose product group value is A and then use the describe() function. The output is in the same format as in the previous example, but the values are calculated only for Product Group A.

We can apply filters on numeric columns as well. For instance, the following line of code calculates the average sales quantity of products with a stock greater than 500.

df[df["stock_qty"]>500]["sales_qty"].mean()

Output:

476.951

pandas allows for creating more complex filters quite efficiently. Here is an article that explains in great detail how to filter based on rows and columns with pandas.
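For example, conditions can be combined with & (and) and | (or), wrapping each condition in parentheses. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical rows, just to illustrate combined filters
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "B"],
    "sales_qty": [100, 600, 300, 700],
    "stock_qty": [400, 900, 600, 800],
})

# Products in group A with a stock greater than 500
mask = (df["product_group"] == "A") & (df["stock_qty"] > 500)
print(df[mask]["sales_qty"].mean())  # 600.0
```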

Summarizing Groups of Data

We can create a data summary separately for different groups in the data. It is quite similar to what we have done in the previous example. The only addition is grouping the data.

We group the rows by the distinct values in a column with the groupby() function. The following code groups the rows by product group.

df.groupby("product_group")

Once the groups are formed, we can calculate any statistic and describe or summarize the data. Let’s calculate the average sales quantity for each product group.

df.groupby("product_group")["sales_qty"].mean()

Output:

product_group
A    492.676471
B    490.253333
C    449.285714
D    462.864865
E    378.666667
F    508.875000
G    363.444444
Name: sales_qty, dtype: float64

We can also perform multiple aggregations in a single operation. In addition to the average sales quantities, let’s also count the number of products in each group. We use the agg() function, which allows for assigning names for aggregated columns as well.

df.groupby("product_group").agg(
    avg_sales_qty = ("sales_qty", "mean"),
    number_of_products = ("product_code","count")
)

Output:

               avg_sales_qty  number_of_products
product_group
A                 492.676471                 102
B                 490.253333                  75
C                 449.285714                  63
D                 462.864865                  37
E                 378.666667                   6
F                 508.875000                   8
G                 363.444444                   9

Data Distribution With a Matplotlib Histogram

Data visualization is another highly efficient technique for summarizing data. Matplotlib is a popular library in Python for exploring and summarizing data visually.

There are many different types of data visualizations. A histogram is used to check the data distribution of numeric columns. It divides the entire value range into discrete bins and counts the number of values in each bin. As a result, we get an overview of the distribution of the data.

Let’s create a histogram of the sales quantity column.

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.hist(df["sales_qty"], bins=10)

In the first line, we import the pyplot interface of Matplotlib. The second line creates an empty figure object with the specified size. The third line plots the histogram of the sales quantity column on the figure object. The bins parameter determines the number of bins.

Here is the plot generated by this code:

[Histogram of the sales_qty column]

The values on the x-axis show the bin edges. The values on the y-axis show the number of values in each bin. For example, there are more than 40 products whose sales quantity is between 100 and 200.
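The bin edges and counts that Matplotlib draws can also be computed directly with NumPy's histogram function, which is a quick way to inspect a distribution without a plot. A sketch with made-up values:

```python
import numpy as np

values = [50, 120, 150, 180, 420, 480, 950]  # made-up values

# 3 equal-width bins between the minimum and maximum
counts, edges = np.histogram(values, bins=3)
print(counts)  # [4 2 1] -- number of values in each bin
print(edges)   # [ 50. 350. 650. 950.] -- bin edges
```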

Data Summary in Python

It is of crucial importance to understand the data at hand before proceeding to create data-based products. You can start with a data summary in Python. In this article, we have reviewed several examples with the pandas and Matplotlib libraries to summarize data.

Python has a rich selection of libraries that expedite and simplify tasks in data science. The Python for Data Science track is a great start for your data science journey.