Back to articles list Articles
10 minutes read

How to Read XML Files into Python

In this article, you’ll learn what an XML file is, what they are used for, and how to read XML into Python using a few different libraries.

The ability to extract information from various file formats is a crucial data analysis skill. This is no different with an XML file: XML is a common file format in data processing, particularly when you’re dealing with data received from an API. If you're a novice data analyst venturing into the Python ecosystem, mastering the art of reading XML into Python can significantly enhance your skill set.

In this beginner-friendly guide, we will walk through the process of reading XML files into Python. We will start by explaining what XML files are and how they structure data. Afterwards, we will learn how to read XML data into Python using a few simple methods.

If you are interested in learning more about other data formats and how to process them in Python, we recommend our dedicated track on Data Processing with Python. With five interactive courses and over 35 hours of content, this track empowers you to work with some of the most common data formats you’ll encounter as a data analyst – like CSV, JSON, and Excel files.

But for now, let's find out how to work with XML in Python!

What’s in an XML File?

XML (short for Extensible Markup Language) is a plain text file format designed to store and transport data. Its main feature is that data is stored in an arbitrary hierarchy. This is a direct contrast to other plain text formats that store data, such as a CSV table (which you can read about in this article on CSV files in Python).

As the name suggests, XML is a markup language, so you can expect data to be stored in tags like this: <tag>. In XML, tags are often referred to as elements. HTML (HyperText Markup Language) is another popular markup file format, but unlike XML it is primarily used for displaying data on web pages. XML focuses on describing the structure and content of data in a hierarchical, customizable format.

XML File Structure

XML files consist of hierarchical structures organized into elements and contents. These elements encapsulate the data and provide a way to represent relationships between different pieces of information.

This is not unlike Python dictionaries, which store elements as key-value pairs. In this analogy, the dictionary key is the element in the XML file, and the dictionary value is the XML content.

Consider the following XML file that store information about books:


  
    The Great Gatsby
    F. Scott Fitzgerald
    Classic
    1925
  
  
    To Kill a Mockingbird
    Harper Lee
    Novel
    1960
  

In this XML file, everything is contained inside the <library> element: the element is “opened” in the very first line, and is then “closed” at the very end by writing </library>. This means that <library> is the root element of this XML file.

XML File Hierarchy

The <library> root element contains multiple <book> elements. In turn, each <book> element contains the <title>, <author>, <genre>, and <year> elements, storing different information of each book as content in the XML.

Elements in the XML naturally compose a parent-child hierarchy. In this example, we could say that the <library> element is the parent of <book>, or that the <title> and <year> elements are the children of <book>. This hierarchical structure of libraries containing books, which themselves contain information of the book’s title, author, genre, and year, is the type of information that XML files can effectively store and relay.

Additionally, note how the hierarchy is created by simply opening and closing the elements at certain parts of the XML file. This simple mechanism allows us to store arbitrary hierarchical structures into the XML, even if they include hundreds of elements. In fact, formats such as XLSX use XML under the hood in order to manage all the complexity behind an Excel spreadsheet. (This article shows how to read Excel files in Python directly).

XML files can be further customized by passing attributes and values inside an element’s tag. For example, we could add two arbitrary attributes (attr1 and attr2) and their respective values to the <book> element by writing <book attr1="value1", attr2="value2">.

In this article, we will keep things simple by working with elements without any attributes or values.

Where Are XML Files Used?

As we saw, XML provides a flexible and platform-independent way to represent structured data. Its human-readable format and ability to define custom elements make it suitable for various applications where data organization and interoperability are essential. Because of this, XML files are mainly used in two scenarios: configuration files and data interchange.

As a configuration file, XML is often used to define how a certain program should operate. This could be a web server, a program used by your own operating system, or a simple Python script that you wrote.

For example, an XML configuration file for a database could look like this:


  
    admin
    secretpassword
    mydatabase
  
  
    8080
    300
    100
  
  
    DEBUG
  

The XML file above stores information related to database credentials, server settings, and logging functionality.

On the other hand, XML can also serve to interchange data between systems. If your Python script imports and exports data in XML, it becomes much easier to combine it with other systems that also use XML. It doesn’t matter what kind of database or storage your data lives in – as long as you know how to read from it and write back into it through XML. Because of this, XML is a popular format for data consumption in APIs.

As an example, the XML sample below could represent the result from a query in your company’s employee database:


  
    John
    Marketing
    Manager
  
  
    Mary
    Finance
    Analyst
  

Now that we've grasped the basics of XML, let's delve into how we can efficiently read XML files into Python for further analysis and processing.

How to Read XML Files in Python

We have a couple different options for libraries when working with XML in Python. In this section, we'll explore how to read XML files in Python using both built-in and third-party libraries. We’ll compare their approaches and functionalities.

In every example, we will use the books.xml file, whose content is laid out below:


  
    Python Programming
    John Smith
    29.99
  
  
    Data Analysis with Python
    Mary Kate
    39.99
  

Reading an XML File with Python's Built-In xml Module

Python's xml module provides functionalities to parse and manipulate XML data. Since this module is a part of Python’s standard library, there is no need to install anything besides Python itself.

The ElementTree class within this module offers a convenient way to navigate XML trees and extract relevant information. Take a look at the following example:

import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

for book in root.findall('book'):
	title = book.find('title').text
	author = book.find('author').text
	price = book.find('price').text
	print(f"Title: {title}, Author: {author}, Price: {price}")

# output:
# Title: Python Programming, Author: John Smith, Price: 29.99
# Title: Data Analysis with Python, Author: Mary Kate, Price: 39.99

In the code above, we use ElementTree to parse the books.xml file. The ElementTree.getroot() method extracts the root element (i.e. the <bookstore> tag) into the root variable. Afterwards, we use the root.findall() method to find all <book> elements beneath the root element in the XML hierarchy.

We then iterate over each retrieved <book> element and get the corresponding content of its child elements <title>, <author>, and <price> with the book.find() method. Finally, we print the extracted information for each book in a formatted string. This displays the title, author, and price of each book.

Using BeautifulSoup to Read an XML File in Python

We can also use BeautifulSoup to read an XML file into Python. BeautifulSoup is a third-party Python library used to parse data stored as markup language. It is more commonly used to parse information in HTML files (particularly those obtained from web scraping), but you can use the library to parse content in XML files as well.

If you do not have BeautifulSoup in your Python environment, you need to install it using pip. You’ll also need to install lxml in order to make BeautifulSoup compatible with XML files. Execute the following command in your terminal to install both libraries at once:

pip install beautifulsoup4 lxml

The following examples does the exact same task as the previous one, but it replaces the xml module with BeautifulSoup:

from bs4 import BeautifulSoup

with open('books.xml') as f:
    soup = BeautifulSoup(f, 'xml')

for book in soup.find_all('book'):
    title = book.title.text
    author = book.author.text
    price = book.price.text
    print(f"Title: {title}, Author: {author}, Price: {price}")

# output:
# Title: Python Programming, Author: John Smith, Price: 29.99
# Title: Data Analysis with Python, Author: Mary Kate, Price: 39.99

As we can see, the basic structure of the code remains the same. The only differences are:

  • We had to explicitly open the XML file with Python’s open()
  • BeautifulSoup allows us to access a child element as if it were an attribute, i.e. we can write title.text instead of book.find("title").text.
  • There are minor syntax differences, e.g. the method for retrieving all <book> elements is written find_all() instead of findall().

Other than these small details, both approaches do exactly the same thing. Use whichever works best for you!

Reading an XML File into a Python Dictionary with xmltodict

As previously mentioned, XML files are structured similarly to Python dictionaries. While the syntax is different, the hierarchical structure and arbitrary tag names of an XML file is reminiscent of a Python dictionary.

In fact, we could combine Python dictionaries and lists to store the exact same data in books.xml:

xml_dict = {
  "bookstore": {
    "book": [
      {
        "title": "Python Programming",
        "author": "John Smith",
        "price": "29.99"
      },
      {
        "title": "Data Analysis with Python",
        "author": "Jane Doe",
        "price": "39.99"
      }
    ]
  }
}

This similarity begs the question: Could we parse the contents of an XML file directly into a Python dictionary? Luckily, the third-party library xmltodict does exactly that. To use it, we need to install it with pip:

pip install xmltodict

Once we have it installed, all we need to do is to read the contents of the XML file as a Python string. We pass it to xmltodict.parse() so that it can perform its magic:

import xmltodict

with open('books.xml') as f:
   xml_string = f.read()

xml_dict = xmltodict.parse(xml_string)
print(xml_dict)

# output:
# {
#   'bookstore': {
#     'book': [
#       {
#         'title': 'Python Programming'
#         'author': 'John Smith',
#         'price': '29.99'
#       },
#       {
#         'title': 'Data Analysis with Python',
#         'author': 'Mary Kate',
#         'price': '39.99'
#       }
#     ]
#   }
# }

Thanks to the xmltodict module, we managed to read the entire data from the XML into a Python dictionary in just a few lines.

If you just need to extract a few bits of information – or if you need to perform a custom data processing pipeline when reading the XML file – you might be better off sticking to the xml or BeautifulSoup modules. But if all you need is a quick way to read the data in the XML file, the xmltodict module does the job just fine.

What’s Next with Reading XML Files into Python?

In this article, we went over how to read XML files into Python. We started by understanding what exactly an XML file is and where it is commonly used. We then explored how to read the XML file into Python using three different libraries: xml, BeautifulSoup, and xmltodict.

Being able to read XML files into Python opens up a world of possibilities for data analysts. Whether it's extracting or editing data from configuration files, parsing responses from a web service or API, or analyzing structured data, understanding how to handle XML in Python is a valuable skill for anyone working with data in Python.

To further your Python data analysis skills, we recommend you check out our courses on reading and writing CSV files, reading and writing JSON files, and reading and writing Excel files. Much like XML, these file formats are among some of the most used by any skilled data analyst. And remember that you get access to all of these courses and more in our Data Processing with Python Track!