12th Nov 2020

How to Read CSV Files in Python

Have you encountered CSV files? In this article, I’ll show you what CSV files are and how easy it is to work with them in Python.

If you are working as a back-end developer or data scientist, chances are that you’ve already dealt with CSV files. It is one of the most used formats for working with and transferring data. Many Python libraries can handle CSVs, but in this article, we’ll focus on Python’s csv module.

What Are CSV Files?

A CSV file, also known as a comma-separated values file, is a text file that contains data records. Each line represents a different record and includes one or more fields. These fields represent different data values.

Let’s look at some CSV examples. Below we have a snippet of a CSV file containing student data:

firstname,lastname,class
Benjamin,Berman,2020
Sophie,Case,2018

The first line is the header, which holds the column names. Each line will have the same number of fields as the first line has column names. We’re using commas as delimiters (i.e. to separate fields in a line).

Let’s look at a second example:

firstname|lastname|class
Benjamin|Berman|2020
Sophie|Case|2018

This snippet has the same structure as the first one. The difference is the delimiter: we’re using a vertical bar. As long as we know the general structure of the CSV file, we can deal with it.

Why Are CSV Files So Common?

In essence, CSV files are plain-text files, meaning they are as simple as it gets. This simplicity makes it easy to create, modify, and transfer them – regardless of the platform. Thus, tabular data (i.e. data structured as rows, where each row describes one item) can be moved between programs or systems that otherwise might be incompatible.

Another benefit of this simplicity is that it’s very easy to import this data into spreadsheets and databases. For spreadsheets, just opening the CSV file often automatically imports the data into the spreadsheet program.

One of the most common uses of CSV files is when part of a database’s data needs to be extracted for use by a non-technical coworker. Most modern database systems allow users to export their data into CSV files. Instead of making non-technical people struggle through the database system, we can easily give them a CSV file with the data they need. We could also easily extract a CSV file from a spreadsheet and insert that into our database. This makes interfacing between non-technical personnel and databases a lot easier.

At times, we might work on actual CSV files – e.g. when one team scrapes data and delivers it to the team that is supposed to work with it. The most common way to deliver the data would be in a CSV file. Or perhaps we need to get some data from a legacy system that we can’t interface with. The easiest solution is to acquire this data in CSV format, since textual data is easier to move from system to system.

Reading CSV files is so common that questions about it frequently appear in Python technical interviews. You can learn more about the questions you might face in a Python-focused data science job interview in this article. Even if you’re not interested in a data science role, check it out; you might run across some of these questions in other Python jobs.

Using Python’s csv Module

There are many Python modules that can read a CSV file, but there might be cases where we aren’t able to use those libraries, e.g. due to platform or development environment limitations. For that reason, we’ll focus on Python’s built-in csv module. Below we have a CSV file, Report.csv, containing two students’ grades:

Name,Class,Lecture,Grade
Benjamin,A,Mathematics,90
Benjamin,A,Chemistry,54
Benjamin,A,Physics,77
Sophie,B,Mathematics,90
Sophie,B,Chemistry,90
Sophie,B,Physics,90

This file includes six records. Each record contains a name, a class, a lecture, and a grade, and the fields are separated by commas. To work with this file, we’ll use the csv.reader() function, which accepts any iterable that yields one line of text at a time. In this case, we will be providing it with a file object. Here is the code to print all rows of the Report.csv file:

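import csv

# newline="" lets the csv module handle line endings itself
with open("Report.csv", "r", newline="") as handler:
    reader = csv.reader(handler, delimiter=",")
    for row in reader:
        print(row)  # each row is a list of strings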

Let’s analyze this code line by line. First, we import the csv module, which is part of Python’s standard library. Then we open the CSV file and get a file object that we call handler. Since this file object is an iterable that returns a line of text each time its __next__ method is called, we can pass it to csv.reader() and get back a reader object, which we call reader. We can then iterate over reader; each element is a list of strings holding the fields of one line of the original CSV file.
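Because csv.reader() accepts any iterable that yields strings, we don’t strictly need a file on disk to experiment. The sketch below is only an illustration: it feeds the reader a plain list of lines (the vertical-bar student data from earlier, reused as sample input), and the variable names are my own.

```python
import csv

# Sample lines taken from the vertical-bar example earlier in the article.
lines = [
    "firstname|lastname|class",
    "Benjamin|Berman|2020",
    "Sophie|Case|2018",
]

# csv.reader() accepts any iterable of strings, not only file objects.
reader = csv.reader(lines, delimiter="|")

header = next(reader)  # the first row holds the column names
rows = list(reader)    # remaining rows, each a list of strings

print(header)
for row in rows:
    print(row)
```

Note that every field comes back as a string, so a value such as 2020 has to be converted with int() before it can be used as a number.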

Keep in mind that a CSV file can include field names as its first line. If we know that this is the case, we can use the csv.DictReader class instead. Rather than returning a list for each row, it returns a dictionary for each line. The keys of each dictionary are the names from the first line of the CSV file.
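As a minimal sketch of the dictionary-based approach, the snippet below reads the first few rows of the grades data with csv.DictReader. To keep it self-contained, it uses an in-memory io.StringIO object as a stand-in for Report.csv; that substitution and the variable names are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for Report.csv (first rows of the grades file above).
data = io.StringIO(
    "Name,Class,Lecture,Grade\n"
    "Benjamin,A,Mathematics,90\n"
    "Benjamin,A,Chemistry,54\n"
)

reader = csv.DictReader(data)  # field names are taken from the first line
records = list(reader)

for record in records:
    # Values are strings, so convert the grade before doing arithmetic.
    print(record["Name"], record["Lecture"], int(record["Grade"]))
```

Accessing fields by name rather than by position tends to make the code easier to read and less fragile if the column order changes.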

CSV Dialects and How to Deal With Them

Even though CSV stands for “comma-separated values”, there is no single standard that all of these files follow. Thus, the csv module allows us to specify the CSV dialect. The csv.list_dialects() function lists the names of all registered dialects, which initially are the module’s built-in ones. For me, these are excel, excel-tab, and unix.

The excel dialect is the default setting for CSV files exported directly from Microsoft Excel; its delimiter is a comma. A variant of this is excel-tab, where the delimiter is a tab. More info on these dialects can be seen on the Python GitHub page.
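To see which dialects are available on your own installation, something along these lines should work. It uses csv.list_dialects() together with csv.get_dialect() to compare the delimiters of the two Excel dialects; the variable names are illustrative.

```python
import csv

# Names of all dialects currently registered with the csv module.
dialect_names = csv.list_dialects()
print(dialect_names)

# Look up two of the built-in dialects and compare their delimiters.
excel = csv.get_dialect("excel")
excel_tab = csv.get_dialect("excel-tab")
print(repr(excel.delimiter))
print(repr(excel_tab.delimiter))
```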

If your company or team is using a custom-styled CSV, you can create your own CSV dialect and register it with the csv module using the register_dialect() function. See the Python GitHub page for more details. An example would look as follows:

csv.register_dialect(
    "myDialect",
    delimiter="|",
    skipinitialspace=True,
    quoting=csv.QUOTE_ALL,
)

You could then use the new myDialect to read a CSV file:

import csv

with open("Report.csv", "r", newline="") as handler:
    reader = csv.reader(handler, dialect="myDialect")

This works much like our previous example, but instead of supplying the delimiter as an argument, we simply pass the name of our new dialect. Note that this only parses correctly if Report.csv actually uses vertical bars as delimiters.

In the register_dialect() call, we create a dialect called “myDialect”. This dialect uses the vertical bar ( | ) as the delimiter. It also indicates that we want to skip any whitespace immediately after a delimiter. The quoting=csv.QUOTE_ALL setting tells writer objects to put every value inside quotes; when reading, quoted values are simply unquoted as usual. There are a few more parameters that can be set; see the links above for details.
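Here is a small end-to-end sketch of the custom dialect in action. The sample text is hypothetical (made up to show a space after each delimiter and quoted values), and io.StringIO objects stand in for real files so the snippet can run on its own.

```python
import csv
import io

csv.register_dialect(
    "myDialect",
    delimiter="|",
    skipinitialspace=True,
    quoting=csv.QUOTE_ALL,
)

# Hypothetical sample text: quoted values with a space after each delimiter.
data = io.StringIO(
    '"firstname"| "lastname"| "class"\n'
    '"Benjamin"| "Berman"| "2020"\n'
)
rows = list(csv.reader(data, dialect="myDialect"))
print(rows)

# When writing, QUOTE_ALL wraps every value in quotes.
out = io.StringIO()
csv.writer(out, dialect="myDialect").writerow(["Sophie", "Case", 2018])
written = out.getvalue()
print(written)
```

The reader drops the space after each vertical bar because of skipinitialspace=True, while the effect of QUOTE_ALL is only visible in the text produced by the writer.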

What If We Don’t Know the CSV Dialect?

Sometimes we won’t know what dialect a CSV file uses. For times like this, we can use the csv.Sniffer class. Its methods work on a sample of the file’s text rather than on a reader object. I’ve found the two methods below very useful:

sample = handler.read(1024)  # a chunk of text from the open file
header_exists = csv.Sniffer().has_header(sample)
sniffed_dialect = csv.Sniffer().sniff(sample)

The first method returns a Boolean value indicating whether the sample appears to have a header. The second returns the dialect deduced by csv.Sniffer(), or raises csv.Error if it cannot work one out. Both rely on heuristics, so they are helpful when we don’t know the structure of a CSV file, but their results are worth double-checking.
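Putting the pieces together, a sniff-then-read workflow might look like the sketch below. It assumes a vertical-bar version of the grades data and uses io.StringIO in place of a real file; the 1024-character sample size is an arbitrary choice, not a recommendation from the csv documentation.

```python
import csv
import io

# In-memory stand-in for a file whose dialect we do not know in advance.
handler = io.StringIO(
    "Name|Class|Lecture|Grade\n"
    "Benjamin|A|Mathematics|90\n"
    "Benjamin|A|Chemistry|54\n"
    "Sophie|B|Physics|90\n"
)

sample = handler.read(1024)  # Sniffer works on a text sample
handler.seek(0)              # rewind so the reader starts at the first line

sniffer = csv.Sniffer()
sniffed_dialect = sniffer.sniff(sample)     # raises csv.Error if detection fails
header_exists = sniffer.has_header(sample)  # heuristic guess, can be wrong

reader = csv.reader(handler, dialect=sniffed_dialect)
rows = list(reader)

print(repr(sniffed_dialect.delimiter), header_exists)
for row in rows:
    print(row)
```

Rewinding with seek(0) matters here: without it, the reader would start where the sample ended and skip the beginning of the file.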

Now That You Know About CSV Files and Python ...

… you need to practice! The CSV file format is one of the oldest and most common data transfer methods out there. We simply cannot hope to avoid it when working as a data scientist or machine learning engineer. Even back-end developers deal with CSV files, either when receiving data or when writing it back to the system for some other component to use.

As the csv module is part of Python’s standard library, it’ll probably be your go-to tool for dealing with CSV files. For some hands-on practice in working with CSVs in Python, take a look at our interactive course How to Read and Write CSV Files in Python.
