30th Sep 2021 9 minutes read How to Filter Rows and Select Columns in a Python Data Frame With Pandas Yiğit Aras Tunalı python python basics For a data scientist, pandas is a must-know library for modifying data. It is essential and expected in many other jobs that deal with data using Python. Let’s get you up to speed with all the powerful tools pandas offers! Python is one of the most widely used programming languages today. The ability to work with data is highly sought after, and jobs as data scientists, data analysts, and machine learning engineers are very popular. There are a lot of skills data scientists need to have under their belt. If you are a beginner or are unsure where to start, Introduction to Python for Data Science is the perfect course for you. You can quickly build up the necessary skills to start chasing your dream job! Or if you already know Python and are looking to improve and build on your knowledge, you can follow our Data Science track. Before you build machine learning models or otherwise use the collected data, you need to do some preliminary exploration and data cleaning. Since tabular data is the most common type of data structure, it makes a lot of sense to use pandas to accomplish these tasks. The time series is another important type of structure, obtained by recording observations of some phenomenon repeatedly over time. This type of data is widely used in, for example, finance and weather forecasting. To gain insights from time series data, it is important to know how to properly visualize them. We won’t cover time series here, but this article by my colleague Usman can get you up and running in visualizing such data. Most Widely Used Pandas Operations A full-on tour of pandas would be too daunting of a task to accomplish with just one article. Instead, we will go over the most common functionalities of pandas and some tasks you face when dealing with tabular data. As I mentioned, the very first thing to do when faced with a new data set is some exploration and cleaning. The cleaning aspect consists of eliminating unwanted parts of the data and dealing with missing data entries. For all the examples in this article, we use a data set of students. Let’s start with the exploration – we begin by peeking into the data set. Indexing Rows With Pandas Let’s say we have the data in a file called “Report_Card.csv.” We can use the following code snippet to read the data and then show a few entries from the top or the bottom of the data. import pandas as pd Report_Card = pd.read_csv("Report_Card.csv") Report_Card.head(3) This snippet returns the first 3 elements from the top of the data frame. It is a good idea to use head() or tail() to take a peek at large data sets and plan our exploration. The latter, as you might've guessed, is used for printing elements from the bottom of the data frame. The output of this snippet is as follows: NameClassLectureGradeHomework 0Cindy LarsDHistory963.0 1Siya VuDMathematics722.0 2Billy KeyAGerman891.0 Notice we have numbers such as 0 and 1 as indices. If we already know which rows we want, we can simply use the iloc property of a data frame to specify the rows by their indices. This property lets us access a group of rows and columns by their integer positions. In other words, we can work with indices as we do with anything else in Python. Let’s say we want the row belonging to Siya Vu. We can simply use: Report_Card.iloc[1] To get the same result as head(3), we can use range-based indexing like this: iloc[0:3]. If we want to get specific information about specific students and we already know their index numbers, we can use iloc with arguments for both columns and rows: Report_Card.iloc[[7,4,12],0:3] Keep in mind that you can use an array of indices or simply ranges. The above code snippet returns the 7th, 4th, and 12th indexed rows and the columns 0 to 2, inclusive. If we omit the second argument to iloc above, it returns all the columns. Indexing Columns With Pandas Let’s say we would like to see the average of the grades at our school for ranking purposes. We can extract the Grades column from the data frame. Using Report_Card["Grades"] returns the entire column. We can then apply the function mean() to the column and get the value 72.3789. If we were the principal of this imaginary school, maybe we would compare the average grades by class to find out where we could improve. For this, we need both the Grades and Class columns; we can get them by indexing. To get the class averages, we can use the following code snippet: class_info = Report_Card[["Grades","Class"]] class_info.groupby("Class").mean() The first line of code gives us a new data frame with only two columns. It is important to point out that we provide a list of column names as an argument since we want more than one of them. In the second line, we use the groupby() function with “Class” as the argument. This groups all the rows containing the same class value. Then, using the mean() function on the groups gives us the following table: ClassGrade A70.06 B76.20 C71.70 D73.13 It seems the lowest average grade is coming from Class A. We can now investigate this further to help struggling students. Another usage of column indexing for getting the parts we need is with ranges. Here’s an example: Report_Card.loc[:,'Name':'Lectures'] Report_Card.iloc[:,0:3] The colon in both cases stands for "all." You should be careful with the syntax. With loc, we use the column names, and both ends of the range are inclusive. In contrast, with iloc, we use numerical indices, and the right end of the range is not inclusive. Filtering Rows Based on Conditions Let’s start by selecting the students from Class A. This can be done like this: class_A = Report_Card.loc[(Report_Card["Class"] == "A")] We use the loc property, which lets us access a group of rows and/or columns by labels or a Boolean array. This time, however, we use the latter and write a simple conditional statement. The code (Report_Card["Class"] == "A") returns a pandas.Series object of False and True values, which in turn is used to index our main data frame. Now that we have the students in Class A, we need a plan to improve their performance. Let’s take a closer look at the students whose grades are lower than average, say by 15 points or more. We can either work with the class_A data frame we created or combine two conditionals and create another data frame. For the sake of this example, let's go with the combination approach: class_A_lower = Report_Card.loc[(Report_Card["Class"] == "A") & (Report_Card["Grades"] < 55)] The & operator is the “logical and" here, which means we want the rows where the class is A and the grade is less than 55. It is important to note that both conditions for which we use the & operator should be in parentheses. Otherwise, Python misinterprets the whole expression, and an error is thrown. If instead, we need one condition or the other, we use the | operator, known as the “logical or.” We can now focus our efforts on helping these students to improve the grade average of Class A. Creating New Columns and Applying Functions Doing homework can be boring, but it is a great way to review and reinforce the topics covered. Let’s continue from the previous section and assign extra homework to the students who are struggling in Class A. Let’s add a new column to the Grade_Report data frame that indicates how many additional homework assignments the student has to complete. Since we already know how many of the original assignments each student has completed, we can give those who have slacked on their original homework a bit more extra! First, we define a very simple homework function. def extra_hw(homework): if homework >= 2: return 2 elif homework == 0: return 6 else: return 4 As you can see, if the student has completed two or more assignments, we only give them 2 extra. Students who have not completed any are punished with 6 extra assignments, and all other students get 4 extra assignments. Now, we apply this function to each of our rows and create a new column stating how many new assignments each student needs to complete. Using the class_A_lower data frame we created earlier, our update of the data looks like this: class_A_lower["Extra"] = class_A_lower.apply(lambda row: extra_hw(row["Homework"]),axis=1) This line of code looks a bit daunting, but it is pretty simple. We make use of the pandas apply() with a lambda function as an argument. The lambda function uses our extra_hw() function on the Homework column to create the new Extra value for each row. The additional argument axis=1 means we are applying this lambda function in a row-wise manner. The apply() function is a very efficient way of modifying the rows of our data frame; it outperforms iterating over the whole set by a large margin. Let’s go over another example. The school assigns an email address to each student according to their name and their email provider. We can use the data frame we already have to create this extra column by combining two existing columns. The following code snippet accomplishes this goal: Report_Card["Email"] = (Report_Card["Name"].str.lower()).str.replace(" ", ".") + "@" + Report_Card["Domain"].str.lower() When applying string functions to a pandas.Series object, we first need to use str to access its string value. This creates a column with the structure “name.middle.last_class@school.edu” for each student. Should we run into a situation in which we only have the email addresses of the students, we could easily revert them into the two original columns by splitting the email column as follows: Report_Card[["Name2","Class2"]] = Report_Card["Email"].str.split("@", n=1, expand=True) (Report_Card["Name2"].str.replace("."," ")).str.capitalize() Now You Know the Basics of Pandas! Congratulations! You have taken your first step towards mastering the pandas module in Python. There are many more impressive functions we can take a look at, but that would make this article way too long! Most of the more advanced functionalities of pandas build upon what we have discussed in our toy example and do not use too many different ideas than these basic ones. Since you have learned how to clean and analyze data using pandas, the next logical step is to learn how to present your findings. Since you often present your findings to people with no programming background, a more visual approach is necessary. If brushing up your visualization skills is what you are after, our articles that go over Matplotlib are just the thing for you. Part One and Part Two are enough to give you the head start you need to create astonishing charts and visuals. Another step you can take to improve your skills is to learn how to deal with different types of data. One of the most important types of data is the datetime variants. Most beginners in programming don't get the chance to work with these data types, and that can reduce the chances of scoring a great job. But fear not; we got you covered! You can quickly catch up with the necessary skills with our article on the date and time objects. Tags: python python basics