
How to Get a Substring of a String in Python

Learn how to get a substring of a string in Python.

Learning anything new can be a challenge. The more you work with Python, the more you notice how often strings pop up. String manipulation in Python is an important skill. In this article, we give you an introduction to generating a substring of a string in Python.

Python is a great language to learn especially if you’re a beginner, as we discuss in this article. We even have a course on working with strings in Python. It contains interactive exercises designed to start from the basic level and teach you all you need to know about this important data type. Once you’re comfortable working with strings, you can work on some interesting data science problems. Take a look at the Python for Data Science course, which gives you an introduction to this diverse topic.

Slicing and Splitting Strings

The first way to get a substring of a string in Python is by slicing and splitting. Let’s start by defining a string, then jump into a few examples:

>>> string = 'This is a sentence. Here is 1 number.'

You can break this string up into substrings, each of which has the str data type. Even if your string is a number, it is still of this data type. You can test this with the built-in type() function. Numbers may be of other types as well, including the decimal data type, which we discuss here.
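As a quick check of the point above, here is a minimal sketch that passes a few values to type(). The variable names are ours, chosen only for illustration.

```python
# A string that happens to contain digits is still a str.
number_as_text = '42'
number_as_int = 42

print(type(number_as_text))   # <class 'str'>
print(type(number_as_int))    # <class 'int'>

# A piece cut out of a string is a string too, even if it looks like a number.
string = 'This is a sentence. Here is 1 number.'
piece = string[28:29]
print(piece, type(piece))     # 1 <class 'str'>
```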

Much like arrays and lists in Python, strings can be sliced by specifying the start and the end indexes, inside square brackets and separated by a colon. This returns a substring of the original string.

Remember indexing in Python starts from 0. To get the first 7 characters from the string, simply do the following:

>>> print(string[:7])
This is

Notice here we didn’t explicitly specify the start index. Therefore, it takes a default value of 0.
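The short sketch below illustrates the default start index and the fact that the end index is exclusive. It reuses the example string from this article; the name first_seven is ours.

```python
string = 'This is a sentence. Here is 1 number.'

# Omitting the start index is the same as writing 0 explicitly.
first_seven = string[:7]
print(first_seven == string[0:7])   # True

# The end index is exclusive: the characters at indexes 0 through 6 are returned.
print(repr(first_seven))            # 'This is'
print(len(first_seven))             # 7
```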

By the way, if you want more information about the print() function, check out this article. There’s probably more to it than you realize.

We can also index relative to the end of the string by specifying a negative start value:

>>> print(string[-7:])
number.

Since we didn’t specify an end value, it takes the default value of len(string). If you know the start and the end indexes of a particular word, you can extract it from the string like this:

>>> print(string[10:18])
sentence

However, this is not optimal for extracting individual words from a string since it requires knowing the indexes in advance.
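If you know the word but not its position, one possible workaround is to let Python find the indexes for you. The sketch below uses the built-in str.find() method, which is not covered elsewhere in this article; the variable names are illustrative.

```python
string = 'This is a sentence. Here is 1 number.'
word = 'sentence'

# str.find() returns the index of the first occurrence, or -1 if the word is absent.
start = string.find(word)

if start != -1:
    end = start + len(word)
    extracted = string[start:end]
    print(start, end)    # 10 18
    print(extracted)     # sentence
```

This only saves you from counting characters by hand. You still need to know the word you are looking for.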

Another option to get a substring of the string is to break it into words, which can be done with the str.split() method. This takes two optional arguments: a string defining the separator to split at (by default, any run of whitespace) and the maximum number of splits (by default -1, which means no limit). As an example, if you want to split at a space, you can do the following, which returns a list of strings:

>>> string.split(' ')
['This', 'is', 'a', 'sentence.', 'Here', 'is', '1', 'number.']

But notice the full stop (point character) is included at the end of the words “sentence” and “number”. We’ll come back to this later in the article when we look at regular expressions.
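To make the two optional arguments described above more concrete, here is a small sketch that calls split() with no arguments, with a maximum number of splits, and with a different separator. The variable names are ours.

```python
string = 'This is a sentence. Here is 1 number.'

# With no arguments, split() breaks on any run of whitespace.
words = string.split()
print(len(words))    # 8

# The second argument caps the number of splits;
# the rest of the string is kept as the last item.
head = string.split(' ', 2)
print(head)          # ['This', 'is', 'a sentence. Here is 1 number.']

# Splitting at '. ' separates the two sentences instead of the words.
sentences = string.split('. ')
print(sentences)     # ['This is a sentence', 'Here is 1 number.']
```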

There are plenty of built-in string methods in Python. They allow you to modify a string, test its properties, or search in it. A useful method to generate a more complex substring of a string in Python is the string.join() method. It takes an iterable of strings and joins them. Here’s an example:

>>> print(' and '.join(['one', 'two', 'three']))
one and two and three

With a clever indexing trick, this can be used to print a substring containing every second word from the original:

>>> print(' '.join(string.split(' ')[::2]))
This a Here 1
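The indexing trick is the step value, the third part of the slice. The sketch below breaks the one-liner into steps so you can see the intermediate list; the variable names are illustrative.

```python
string = 'This is a sentence. Here is 1 number.'
words = string.split(' ')

# [::2] means: default start, default end, step of 2 (indexes 0, 2, 4, 6).
every_second = words[::2]
print(every_second)             # ['This', 'a', 'Here', '1']

# Starting from index 1 instead picks the words that were skipped.
the_others = words[1::2]
print(the_others)               # ['is', 'sentence.', 'is', 'number.']

print(' '.join(every_second))   # This a Here 1
```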

Since the join() method accepts any iterable of strings, including a list, you can use a list comprehension to create a substring from all words with a length equal to 4, for example. For those of you looking for a more challenging exercise, try this for yourself. We’ll also show you a different method to do this later in the article. If you want to know how to write strings to a file in Python, check out this article.
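If you get stuck on the exercise, here is one possible solution, sketched with the example string from above. Note that it counts punctuation as part of a word, so 'number.' has a length of 7, not 6.

```python
string = 'This is a sentence. Here is 1 number.'

# Keep only the words that are exactly 4 characters long.
four_letter_words = [word for word in string.split(' ') if len(word) == 4]

substring = ' '.join(four_letter_words)
print(substring)   # This Here
```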

The parse Module

There’s a little-known Python module called parse with great functionality for generating a substring in Python. This module doesn’t come standard with Python and needs to be installed separately. The simplest way is to run pip install parse from your terminal.

Here’s how to get a substring using the parse function, which accepts two arguments:

>>> import parse
>>> substring = parse.parse('This is {}. Here is 1 {}.', 'This is a sentence. Here is 1 number.')
>>> substring.fixed
('a sentence', 'number')

Accessing the fixed attribute of the result returns a tuple with the substrings extracted from the second argument at the positions of the curly braces {} in the first argument. For those of you familiar with string formatting, this may look suspiciously familiar. Indeed, the parse module is the opposite of format(). Check this out, which does the opposite of the above code snippet:

>>> print('This is {}. Here is 1 {}.'.format('a sentence', 'number'))
This is a sentence. Here is 1 number.

While we’re talking about the parse module, it’s worth discussing the search function, since searching is a common use case when working with strings. The first argument of search defines what you’re looking for by specifying the search term with curly braces. The second defines where to look.

Here’s an example:

>>> result = parse.search('is a {}.', 'This is a sentence. Here is 1 number')
>>> result.fixed
('sentence',)

Once again, the fixed attribute holds a tuple with the results. If you want the start and the end indexes of the result, use the spans attribute. Using the parse module to search in a string is nice because it’s pretty robust to how you define what you’re searching for (i.e., the first argument).

Regular Expressions

The last Python module we want to discuss is re, which is short for “regular expression” (often abbreviated to “regex”). Regular expressions can be a little intimidating, since they involve defining highly specialized and sometimes complicated patterns to search in strings.

You can use regex to extract substrings in Python. The topic is too deep to cover here comprehensively, so we’ll just mention some useful functions and give you a feel for how to define the search patterns. For more information on this module and its functionality, see the documentation.

The findall() function takes two required arguments: pattern and string. Let’s start by extracting all words from the string we used above:

>>> import re
>>> re.findall(r'[a-z]+', 'This is a sentence. Here is 1 number.', flags=re.IGNORECASE)
['This', 'is', 'a', 'sentence', 'Here', 'is', 'number']

The [a-z] pattern matches any single lowercase letter, the + means one or more of them in a row (so words of any length are matched), and the re.IGNORECASE flag makes the match case-insensitive. Compare this to the result we got above by using split(), and you’ll notice that the full stops and the number 1 are not included.
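To see what the flag contributes, the following sketch runs the same pattern without it, then shows an alternative that lists the uppercase range explicitly. The variable names are ours.

```python
import re

text = 'This is a sentence. Here is 1 number.'

# Without the flag, the uppercase T and H are not matched by [a-z].
case_sensitive = re.findall(r'[a-z]+', text)
print(case_sensitive)
# ['his', 'is', 'a', 'sentence', 'ere', 'is', 'number']

# Adding A-Z to the character class is an alternative to the flag.
explicit_class = re.findall(r'[a-zA-Z]+', text)
print(explicit_class)
# ['This', 'is', 'a', 'sentence', 'Here', 'is', 'number']
```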

Now, let’s extract all numbers from the string:

>>> re.findall(r'\b\d+\b', 'This is a sentence. Here is 1 number.')
['1']

\b matches a word boundary, here placed at the start and the end of the pattern so that only standalone numbers are matched; \d matches any digit from 0 to 9; and again the + indicates the numbers may be of any length. The same boundary idea works for words. For example, we can find all words with a length of 4 characters with the following:

>>> re.findall(r'\b\w{4}\b', 'This is a sentence. Here is 1 number.')
['This', 'Here']

\w matches any word character (a letter, a digit, or an underscore), and {4} requires exactly 4 of them in a row; the \b on either side ensures that only whole words of that length match. To generate a substring, you just need to use join() as we did above. This is an alternative approach to the list comprehension we mentioned earlier, which may also be used to generate a substring with all words of length 4.
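Putting the two pieces together might look like the sketch below, which feeds the list returned by findall() straight into join(). The variable names are illustrative.

```python
import re

text = 'This is a sentence. Here is 1 number.'

# Find every whole word made of exactly 4 word characters.
four_char_words = re.findall(r'\b\w{4}\b', text)

# Join the matches with a space to build the substring.
substring = ' '.join(four_char_words)
print(substring)   # This Here
```

Unlike the split-based approach, the regex version ignores trailing punctuation when it measures a word.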

There are other functions in this module worth taking a look at. match() may be used to determine if the pattern matches at the beginning of the string, and search() scans through the string to look for any location where the pattern occurs.
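The difference between the two is easiest to see side by side. This is a minimal sketch using the example string from this article; the variable names are ours.

```python
import re

text = 'This is a sentence. Here is 1 number.'

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'This', text)
not_at_start = re.match(r'Here', text)   # None, because 'Here' is not at the start

# search() scans the whole string and returns the first match it finds.
found = re.search(r'Here', text)

print(at_start.group())                  # This
print(not_at_start)                      # None
print(found.group(), found.span())       # Here (20, 24)
```

Both functions return a match object on success and None otherwise, so it is worth checking the result before calling group() on it.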

Closing Thoughts on Generating Substrings in Python

In this article, we have discussed extracting and printing substrings of strings in Python. Use this as a foundation to explore other topics such as scraping data from a website. Can you define a regex pattern to extract an email address from a string? Or remove punctuation from this paragraph? If you can, you’re on your way to becoming a data wrangler!

If you also work a lot with tabular data, we have an article that shows you how to pretty-print tables in Python. Slowly adding all these skills to your toolbox will turn you into an expert programmer.