29th Jun 2022 8 minutes read

Web Scraping With Python Libraries

Here are some useful Python libraries to get you started in web scraping.

Looking for Python website scrapers? In this article, we will get you started with some helpful libraries for Python web scraping. You'll find the tools and the inspiration to kickstart your next web scraping project.

Web scraping is the process of extracting information from the source code of a web page. This may be text, numerical data, or even images. It is the first step for many interesting projects! However, there is no fixed technology or methodology for Python web scraping. The best approach is very use-case dependent.

This article is aimed at people with a little more experience in Python and data analysis. If you're new to Python and need some learning material, take a look at this track to give you a background in data analysis.

Let's get started!

Requests

The first step in the process is to get data from the web page we want to scrape. The requests library is used for making HTTP requests to a URL.

As an example, let's say we're interested in getting an article from the learnpython.com blog. Importing the library and getting the page requires just a few lines of code:

>>> import requests
>>> url = 'https://learnpython.com/blog/python-match-case-statement/'
>>> r = requests.get(url)

The object r is the response from the host server and contains the results of the get() request. To see if the request was successful, check the status with r.status_code; a value of 200 means success. Hopefully, we don't see the dreaded 404! You also need to be aware of the equally vexing 403 error in web scraping. Luckily, this is something you have more control over, as it normally relates to anti-scraping systems rather than to the missing page behind a 404 error.

It is possible to customize the get() request with some optional arguments to modify the response from the server. For more information on this library, including how to send a customized request, take a look at the documentation and user guide.
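As a rough sketch of what a customized request might look like, the example below sets a User-Agent header and a timeout, and turns error statuses such as 403 or 404 into exceptions. The header value and the fetch() helper are assumptions made for illustration; they are not part of the article's own code. The request is only prepared here, not sent, so the snippet runs without network access.

```python
import requests

# Hypothetical header value, chosen only for illustration.
HEADERS = {"User-Agent": "my-scraper/0.1"}


def fetch(url, timeout=10):
    """Return the response, or raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turns 403, 404, ... into an exception
    return response


# Build the request without sending it, to inspect what would go out.
prepared = requests.Request(
    "GET", "https://learnpython.com/blog/", headers=HEADERS
).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling fetch(url) would send the request for real. Whether a custom User-Agent is enough to avoid a 403 depends entirely on the site.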

To get the contents of the web page, we simply need to do the following:

>>> page_text = r.text

This returns the contents of the whole page as a string. From here, we may try to manually extract the required information, but that is messy and error-prone. Thankfully, there is an easier way.

Beautiful Soup

Beautiful Soup is a user-friendly library with functionality for parsing HTML and XML documents automatically into a tree structure. This library only parses the data, which is why we need another library to get the data, as we saw in the previous section.

The library also provides functions for navigating, searching, and modifying the parsed data. Trying different parsing strategies is very easy, and we do not need to worry about document encodings.

We can use this library to parse the HTML-formatted string from the data we have retrieved and extract the information we want. Let's import the library and start making some soup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page_text, 'html.parser')

We now have a BeautifulSoup object, which represents the string as a nested data structure. How to proceed from here depends on what information we want to scrape from the page. That may be the text, the code snippets, the headings, or anything else.
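To make the nested structure more concrete, here is a minimal sketch that parses a small hand-written HTML string instead of a downloaded page. The markup and the variable names (sample_html, sample_soup, and so on) are invented for this example and do not come from the learnpython.com page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a downloaded page.
sample_html = """
<html><head><title>Sample post</title></head>
<body>
<h1>Sample post</h1>
<p>Intro with a <a href="/blog/other-post/">link</a>.</p>
<h2>Section one</h2>
<h2>Section two</h2>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Navigate by tag name, or search for several tag names at once.
page_title = sample_soup.title.string
headings = [h.get_text(strip=True) for h in sample_soup.find_all(["h1", "h2"])]
links = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(page_title)
print(headings)
print(links)
```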

To get a sense of how the information is represented, open the URL in your favorite browser and take a look at the source code behind the web page.

Let's say we want to scrape the Python code snippets from the HTML source code. Notice they always appear between <pre class="brush: python; title: ; notranslate"> and </pre>. We can use this to extract the Python code from the soup as follows:

>>> string = soup.find(class_="brush: python; title: ; notranslate").text

Here, we use the find() method, which extracts only the first match. If you want to find all matches, use find_all() to return a list-like data structure that can be indexed like a normal list.
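The difference between the two methods can be sketched on a small hand-written document. The HTML below is made up for illustration; only the class string is taken from the article.

```python
from bs4 import BeautifulSoup

CODE_CLASS = "brush: python; title: ; notranslate"

# Hand-written stand-in for a page with two code snippets.
html = f"""
<html><body>
<pre class="{CODE_CLASS}">print('a')</pre>
<p>Some text in between.</p>
<pre class="{CODE_CLASS}">print('b')</pre>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
first = soup.find("pre", class_=CODE_CLASS)

# find_all() returns every match in document order.
all_blocks = soup.find_all("pre", class_=CODE_CLASS)
snippets = [tag.get_text() for tag in all_blocks]

print(first.get_text())
print(snippets[1])
```

Because find() returns None when nothing matches, it is worth checking the result before calling .text on it in a real scraper.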

Now, we have the code snippet as a string including newline characters and spaces to indent the code. To run this code, we have to clean it up a little to remove unwanted characters and save it to a .py file. For example, we can use string.replace('>', '') to remove the > characters, although this also removes any > that is part of the code itself, such as a comparison operator.
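One possible way to do this cleanup is to remove the interpreter prompts only at the start of each line, so that a > inside the code survives. This is a sketch under the assumption that the snippet uses standard >>> and ... prompts. The strip_prompts() helper, the sample snippet, and the file name are all hypothetical.

```python
import tempfile
from pathlib import Path


def strip_prompts(snippet):
    """Remove leading '>>> ' and '... ' prompts, leaving the rest of each line intact."""
    cleaned_lines = []
    for line in snippet.splitlines():
        if line.startswith((">>> ", "... ")):
            line = line[4:]
        elif line in (">>>", "..."):
            line = ""
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines).strip() + "\n"


# Hypothetical scraped snippet; note the '>' inside the code itself.
scraped = ">>> x = 5\n>>> if x > 3:\n...     print('big')\n"
code = strip_prompts(scraped)

# Save the cleaned code to a .py file in the system's temporary directory.
script_path = Path(tempfile.gettempdir()) / "scraped_snippet.py"
script_path.write_text(code, encoding="utf-8")
print(code)
```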

Check out this article, which has an example that may be useful at this stage. Writing a program to download and run other programs has a nice recursive feel to it. However, be wary of downloading any potentially malicious code.

Selenium

Selenium was developed primarily as a framework for browser automation and testing. However, the library has found another use as a toolbox for web scraping with Python, making it quite versatile. For example, it's useful if we need to interact with a website by filling out a form or clicking on a button. Selenium may also be used to scrape content that many sites load dynamically with JavaScript.

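Let's use Selenium to open a browser, navigate to a web page, enter text into a field, and retrieve some information. However, before we do all that, we need to download an extra executable file to drive the browser. In this example, we'll work with the Chrome browser, but there are other options. You can find the drivers for your version of Chrome here. Download the correct driver and save it in a folder of your choice; the code below refers to the path of that folder as directory. Recent Selenium releases (4.6 and later) include Selenium Manager, which can locate or download a matching driver automatically, so this manual step may not be needed.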

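To open the browser with Selenium in Python, do the following. In Selenium 4, the path to the driver is passed through a Service object:

>>> from selenium import webdriver
>>> from selenium.webdriver.chrome.service import Service
>>> # directory is the path of the folder containing the downloaded driver
>>> service = Service(directory + 'chromedriver.exe')
>>> driver = webdriver.Chrome(service=service)
>>> driver.get('https://learnpython.com/')
>>> driver.maximize_window()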

This opens a browser window, navigates to https://learnpython.com, and maximizes the window. The next step is to find and click on the "Courses" button. Selenium 4 locates elements with find_element() and a By strategy; the older find_element_by_* helpers were removed in version 4.3:

>>> from selenium.webdriver.common.by import By
>>> courses_button = driver.find_element(By.LINK_TEXT, 'Courses')
>>> courses_button.click()
>>> driver.refresh()

The browser navigates to the Courses page. Let's find the search box and enter a search term:

>>> search_field = driver.find_element(By.CLASS_NAME, 'TextFilterComponent__search-bar')
>>> search_field.clear()
>>> search_field.send_keys('excel')

The results automatically update. Next, we want to find the first result and print out the course name:

>>> result = driver.find_element(By.CLASS_NAME, 'CourseBlock')
>>> innerhtml = result.get_attribute('innerHTML')
>>> more_soup = BeautifulSoup(innerhtml, 'html.parser')
>>> title = more_soup.find(class_='CourseBlock__name').text
>>> print(title)

We use BeautifulSoup to parse the HTML from the first search result and then return the name of the course as a string. If we want to run this code in one block, it may be necessary to let the program sleep for a few seconds to let the page load properly. Try this workflow with a different search term, for example, "strings" or "data science".
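Rather than sleeping for a fixed number of seconds, one option is to poll until the page is ready. Selenium provides WebDriverWait for this purpose. The sketch below only illustrates the underlying idea in plain Python, so it runs without a browser. The names wait_until and page_ready are hypothetical, and page_ready stands in for a real check against the browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that becomes ready on the third check.
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=5, poll_interval=0.01)
print(len(attempts))
```

With a live driver, the condition could be something like lambda: driver.find_elements(By.CLASS_NAME, 'CourseBlock'), which returns an empty list (treated as false) until a result appears.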

To do all this for your own project, you need to inspect the source code of the web page to find the relevant names or IDs of the elements with which you want to interact. This is always use-case dependent and involves a little bit of investigative work.

Scrapy

Unlike the previous libraries, scrapy takes care of both fetching and parsing the data. It sends requests asynchronously, which makes it fast and efficient. This is a big advantage when scraping large amounts of data from the web.

However, it is not the most user-friendly library ever written, and it takes some time to get your head around it. A complete project is also too large to show here.

The workflow for using scrapy involves creating a dedicated project in a separate directory, where several files and directories are automatically created. You may want to check out the course on LearnPython.com that teaches you how to work with files and directories efficiently.

One of the directories created is the "spiders/" directory in which you put your spiders. Spiders are classes that inherit from the scrapy.Spider class. They define what requests to make, how to follow any links on the web page, and how to parse the content. Once you have defined your spider to crawl a web page and extract content, you can run your script from the terminal. Check out this article to learn more about using Python and the command-line interface.

Another powerful feature of scrapy is the automated login. For some sites, we can access the data only after a successful login, but we can automate this with scrapy.FormRequest.

Read through the scrapy documentation page for more information. There, you will find the installation guide and an example of this library in action.

Where to From Here in Web Scraping?

We have seen the basics of web scraping with Python and discussed some popular libraries. Web scraping has a huge number of applications. You may want to extract text from Wikipedia to use for natural language processing. You may want to get the weather forecast for your hometown automatically. You may even write a program to compare the prices of flights or hotels before your next holiday.

There are many advantages of using Python for data science projects. It is generally a good idea to start with a small project and slowly build up your skills. If you develop more complex projects with multiple libraries, keep track of them with a requirements.txt file. Before you know it, you will have mastered another skill on your Python journey!
