Python Data Cleaning: A How-to Guide for Beginners

Xavier Rigoulet | 23rd October 2023

Have you ever wondered why your data analysis sometimes yields unexpected results or errors? Do you know how to ensure the data for your project is accurate and reliable? Are you curious about the tools and techniques data professionals use to clean messy datasets? Have you ever struggled with data containing misspelled words, mixed-case text, or missing values? Would you like to discover how Python, a versatile programming language, helps you automate data cleaning?

If these questions spark your curiosity or resonate with your data-related experiences, you've come to the right place! In this article, we dive deep into the world of data cleaning in Python. We explore what data cleaning is, why it is crucial, and how you can harness the power of Python. We also introduce two of the most helpful Python data-cleaning libraries, pandas and NumPy, and use them to transform a messy dataset into valuable insights.

Let's embark on a journey to discover the art and science of data cleaning and see why it is an indispensable skill for data analysts, data scientists, and programmers alike.

Data is the lifeblood of modern decision-making and analysis. Whether you are a data analyst, a data scientist, or any kind of programmer dealing with data, one thing is clear: the quality of your data greatly influences the quality of your insights and results. This is where data cleaning in Python comes into play.

It is such an important topic that there are two tracks on LearnPython.com dedicated to processing data: Data Processing in Python and Python for Data Science. The courses in these tracks walk you through processing various data formats in Python, such as CSV, JSON, and Excel files, and they offer dozens of hours of hands-on coding exercises. All you need to get started is a laptop and an internet connection. And yes, you can even use Python to manipulate data stored in Excel tables; there are a few useful Python packages for working with Excel.

What Is Python Data Cleaning?

Data cleaning, or data cleansing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It ensures the data you work with is accurate, reliable, and ready for analysis, visualization, or further processing. It is a critical step in the data preparation pipeline.

Who Needs Data Cleaning in Python?

Anyone who analyzes, visualizes, or processes data benefits from data cleaning. This includes:

- Data analysts, who rely on clean data to generate accurate reports and insights.
- Data scientists, whose machine learning models perform only as well as the data they are trained on.
- Other programmers, who need clean data for all sorts of purposes, from building web applications to making data-driven decisions.

Why Is Cleaning Data in Python Important?

Data cleaning is essential for several reasons:

- Data Accuracy: Clean data ensures the accuracy of your analysis. Duplicate or erroneous values may skew statistics and lead to incorrect conclusions.
- Data Consistency: Cleaning data keeps the dataset consistent, making it easier to work with and preventing unexpected errors in your code.
- Data Quality: Fixing misspellings, standardizing date formats, and addressing other common issues enhances data quality (see the short sketch right after this list).
- Efficiency: Clean data streamlines the data processing pipeline, saving time and resources.
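A couple of those quality fixes take only a line or two of pandas each. Here is a minimal sketch on a tiny made-up DataFrame (the column names and values below are invented purely for illustration): it normalizes letter case, parses date strings into proper datetime values, and drops the row that becomes a duplicate once the case is fixed.

import pandas as pd

# A tiny made-up example with mixed-case text, dates stored as strings,
# and a row that duplicates another one except for letter case.
raw = pd.DataFrame({
    "team": ["Steelers", "STEELERS", "Titans"],
    "game_date": ["2009-09-10", "2009-09-10", "2009-09-13"],
})

raw["team"] = raw["team"].str.lower()                 # normalize letter case
raw["game_date"] = pd.to_datetime(raw["game_date"])  # parse date strings into datetimes
raw = raw.drop_duplicates()                           # remove the now-identical row

print(raw)

After lowercasing, the first two rows become identical, so drop_duplicates() keeps only one of them.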
Cleaning Data in Python

Python provides a robust environment for data cleaning, thanks to libraries like pandas and NumPy. While you can clean data manually using tools like Excel, Python allows you to automate the process, making it ideal for larger datasets and repetitive tasks.

Let's look at an example. We will use the NFL dataset from Kaggle, an open-access dataset containing a lot of American football data. It may, for example, be used to predict the probability of winning, which makes it interesting for anyone involved in algorithmic betting.

First, we use pandas to load the data and display it as a table, and then we check the shape of the dataset.

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017.csv")
df.head()

   Date        GameID      Drive  qtr  down  ...  yacWPA     Season
0  2009-09-10  2009091000  1      1    NaN   ...  NaN        2009
1  2009-09-10  2009091000  1      1    1.0   ...  0.036899   2009
2  2009-09-10  2009091000  1      1    2.0   ...  NaN        2009
3  2009-09-10  2009091000  1      1    3.0   ...  -0.156239  2009
4  2009-09-10  2009091000  1      1    4.0   ...  NaN        2009

5 rows × 102 columns

# Display the shape of the dataset
df.shape

Output:

(407688, 102)

We have a huge dataset of 407,688 rows and 102 columns. For this article, we focus our Python data cleaning on the first 10 columns. Let's use df.iloc to keep every row but select only those columns.

df = df.iloc[:, :10]
df.head()

Below, we see the first 10 columns of our dataset.

   Date        GameID      Drive  qtr  down  time   TimeUnder  TimeSecs  PlayTimeDiff  SideofField
0  2009-09-10  2009091000  1      1    NaN   15:00  15         3600.0    0.0           TEN
1  2009-09-10  2009091000  1      1    1.0   14:53  15         3593.0    7.0           PIT
2  2009-09-10  2009091000  1      1    2.0   14:16  15         3556.0    37.0          PIT
3  2009-09-10  2009091000  1      1    3.0   13:35  14         3515.0    41.0          PIT
4  2009-09-10  2009091000  1      1    4.0   13:27  14         3507.0    8.0           PIT

We also see that the column names follow inconsistent naming conventions, with a mix of lower and upper case. This may cause problems such as duplicated columns: GameID and gameID, for example, could end up as two separate columns holding the same content. To avoid this pitfall, let's use pandas to lowercase all the column names.

df.columns = df.columns.str.lower()

Here is our new list of column names:

Index(['date', 'gameid', 'drive', 'qtr', 'down', 'time', 'timeunder',
       'timesecs', 'playtimediff', 'sideoffield'],
      dtype='object')

Next, let's see if there are any missing values. With pandas, dealing with null values is very straightforward!

# Missing values per column
missing_count = df.isnull().sum()
missing_count

Output:

date                0
gameid              0
drive               0
qtr                 0
down            61154
time              224
timeunder           0
timesecs          224
playtimediff      444
sideoffield       528
dtype: int64

How about in total?

# Total number of missing values
total_cells = np.prod(df.shape)
total_missing = df.isnull().sum().sum()

# Percentage of missing data
(total_missing / total_cells) * 100

A lot is going on in this code, so let's explain it step by step. The shape attribute returns a tuple containing the number of rows and columns in the dataset, and np.prod multiplies these two values to give the total number of cells. Next, we call isnull().sum() to count the missing values in each column, then sum() again to get the total number of missing values in the dataset. Finally, we output the percentage of missing values for the entire dataset.

# Output
1.5348501795490668

Missing values make up only a small percentage of these 10 columns. As a good data scientist, your role here is to understand why these values are missing; it may simply be that they do not exist. There are many ways to deal with missing data, and we will get to handling missing values with pandas shortly.
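Before choosing a strategy, it also helps to look at the share of missing values in each column rather than only the overall percentage. Here is a quick sketch that reuses the df we prepared above; isnull().mean() returns the fraction of missing values per column.

# Percentage of missing values per column, largest first
missing_percent = df.isnull().mean() * 100
print(missing_percent.sort_values(ascending=False))

Based on the counts shown earlier, down is missing roughly 15% of its values (61,154 out of 407,688 rows), while time, timesecs, playtimediff, and sideoffield are each missing well under 1%. Columns that differ this much often call for different treatments.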
Let's start by trying to remove the columns with one or more missing values:

# Remove all columns with at least one missing value
new_df = df.dropna(axis=1)
new_df.head()

Here, the axis parameter needs to be set to 1 to apply the function to columns. With axis=0, dropna() would instead remove the rows with at least one missing value, which is not what we want here. Note that axis=0 is the default.

Let's look at what happens:

   date        gameid      drive  qtr  timeunder
0  2009-09-10  2009091000  1      1    15
1  2009-09-10  2009091000  1      1    15
2  2009-09-10  2009091000  1      1    15
3  2009-09-10  2009091000  1      1    14
4  2009-09-10  2009091000  1      1    14

The downside is the loss of data that comes with it. How much data do we lose in the process?

# Data loss
print("Number of columns in the original dataset: %d \n" % df.shape[1])
print("Number of columns with NaN values removed: %d" % new_df.shape[1])

# Output
Number of columns in the original dataset: 10 

Number of columns with NaN values removed: 5

As we can see, we just lost half of our columns.

Another way to address missing values is to have pandas fill them in. For example, we may replace each NaN with a 0.

# Replace missing values with 0
df = df.fillna(0)
df.head()

   date        gameid      drive  qtr  down  time   timeunder  timesecs  playtimediff  sideoffield
0  2009-09-10  2009091000  1      1    0.0   15:00  15         3600.0    0.0           TEN
1  2009-09-10  2009091000  1      1    1.0   14:53  15         3593.0    7.0           PIT
2  2009-09-10  2009091000  1      1    2.0   14:16  15         3556.0    37.0          PIT
3  2009-09-10  2009091000  1      1    3.0   13:35  14         3515.0    41.0          PIT
4  2009-09-10  2009091000  1      1    4.0   13:27  14         3507.0    8.0           PIT

However, be thoughtful when dealing with missing data. Filling missing values with 0 does not make much sense for an age variable, for example. Here, though not perfect, filling the missing values with 0 may be acceptable in the context of your analysis. Missing values are also commonly replaced with the median or the mean, and if a variable is important enough, you may even spend a little more time and use a machine learning model to estimate the best replacement value. In any case, think about what makes the most sense in each scenario. In the dataset above, many of the missing values exist because the data was not recorded for one reason or another, so filling the NaN values with 0 may or may not be a good option.

Let's do one last example to illustrate this, with a small dataset containing the last names, first names, and ages of some fictitious NFL players.

First, let's create our dataset. We build a Python dictionary in which the keys are the column names, and we use NumPy's np.nan to generate the NaN values.

data = {
    'lastname': ['Douglas', 'Smith', 'Doe', 'Peterson', 'Johnson', 'Jones', 'Anderson'],
    'firstname': ['John', 'Daniel', 'Michael', 'David', 'Robert', 'Anthony', 'Thomas'],
    'age': [23, 25, np.nan, 22, np.nan, np.nan, 28]
}

Next, we use pandas to create a DataFrame from the Python dictionary:

df_players = pd.DataFrame(data)
df_players

   age  firstname  lastname
0  23   John       Douglas
1  25   Daniel     Smith
2  NaN  Michael    Doe
3  22   David      Peterson
4  NaN  Robert     Johnson
5  NaN  Anthony    Jones
6  28   Thomas     Anderson

You can see that the age is missing for some players. As mentioned earlier, we do not want to fill in an age with 0; it is more desirable to use something that represents a middle value. So, we choose to replace the missing values with the median. That does not mean we could not use the mean (no pun intended!); it may very well be an option. Feel free to try both and see what makes better sense.
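If you want to compare the two candidates before committing to one, a line each is enough. A quick sketch using the df_players DataFrame we just created:

# Compare candidate replacement values for the age column
print(df_players['age'].mean())    # 24.5
print(df_players['age'].median())  # 24.0

Both calls skip the NaN values by default, and the two results are very close here; we continue with the median below.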
Let's replace the NaN values with the median. The inplace option is set to True so that the modified values are written back to the original dataset.

df_players['age'].fillna(df_players['age'].median(), inplace=True)

Here is the updated dataset:

   age  firstname  lastname
0  23   John       Douglas
1  25   Daniel     Smith
2  24   Michael    Doe
3  22   David      Peterson
4  24   Robert     Johnson
5  24   Anthony    Jones
6  28   Thomas     Anderson

All right! Now you know how to use the median to fill in missing values. Similarly, you can use the mean or the mode if either of these is a better fit.

Of course, there is much more to data cleaning, and addressing all the issues in a dataset takes time. There may be inconsistent data formats, incorrect values, or mixed letter cases in strings. However, cleaning is a necessary step to get as much insight as possible out of your data. Once you are done with it, you can move on to visualizing the data with powerful Python visualization libraries such as Matplotlib.

Practice Cleaning Data in Python!

Data cleaning in Python is a critical step in any data-related project. Ensuring data accuracy, consistency, and quality sets the foundation for meaningful analysis and insights. Powerful libraries like pandas and NumPy let you automate the process and handle even large datasets, making data cleaning efficient and effective.

Clean data leads to more reliable results, so take advantage of this essential step in your data workflow. And because practice helps you achieve your goals, get your hands dirty and try data cleaning in Python on a real dataset.

Last but not least, if you have not done so yet, take a look at our courses on Python for data science and on working with Excel files in Python. And remember to visit LearnPython.com to keep learning about Python!

Tags: pandas, data analysis, data visualization