How to Generate Test Data in Python

Here's all you need to know about the faker library for generating test data in Python.

This article introduces you to a useful library to generate test data in Python. If you’re building an application designed to process data, you need an appropriate test dataset to make sure all the bugs have been ironed out.

Getting your hands on data is the first step of any data analysis project. The data may be provided directly to you by a customer. If you’re lucky, you may find some relevant publicly available data. Or you may have to go out and collect it yourself. Web scraping in Python is a great way of collecting data. Another option is to produce your own data, which we cover here.

If you’re searching for some learning material to get a background in data science, check out our course "Introduction to Python for Data Science" which is perfect for beginners. It includes many interactive exercises to give you practical experience in working with data.

Fake it to Make it

faker is a Python library designed to generate fake data, which may be used to train a machine-learning algorithm or test an application. This library may be used to generate personal data, company data, fake text sentences, Python data structures such as lists and dictionaries, and more. Installation is quick and easy from the command line with pip (pip install faker).

The documentation for faker has some useful information and examples. Here, we start by generating some test personal data to represent customers. We import the Faker class from the faker library and instantiate three new objects:

>>> from faker import Faker
>>> f_en = Faker('en_US')
>>> f_de = Faker('de_DE')
>>> f_jp = Faker('ja_JP')

As we have done here, Faker() can take a locale as an optional argument. The default is 'en_US' if no argument is provided. You may also provide a list with multiple locales as the argument. From here, we can generate test personal data using the many available methods:

>>> print(f_en.name())
Cassandra Burch

>>> print(f_en.address())
680 Julie Glens Apt. 347
Lake Christina, AL 91444

>>> print(f_en.email())
derrickharris@example.org

Every time you execute these commands, you receive different, randomly generated data. You can seed the random number generator using an integer if you want to generate the same test data multiple times. Also, notice the data isn’t necessarily consistent. The name and the email address in the above example refer to different people.

An advantage of this library is its ability to generate realistic test data for different countries. Let’s look at the results from some methods of the other objects we have instantiated:

>>> print(f_de.name())
Dr. Ingrid Schäfer
 
>>> print(f_jp.address())
栃木県青梅市台東6丁目25番2号

Here, we see the German name includes the title of Doctor and contains the letter ä from the German alphabet. The Japanese address represents an address in the Tochigi Prefecture and may consist of hiragana, katakana, and/or kanji characters. This ability to generate non-Latin characters is powerful for testing applications and programs that need to process text data from different countries.

There are many more methods for generating other types of data. Try out a few to get a feel for the types of data you can generate. For example, you can produce job titles, dates of birth, and languages. There are test data for companies and for finance applications.

You can even mix and match to create highly customized results. Here’s an example of combining different types of data to generate a company name:

>>> print(f_en.company() + ' ' + f_en.company_suffix() + ', ' + f_en.city_prefix() + f_en.city_suffix() +' Branch')
Henry-Proctor Inc, Westmouth Branch

Python Data Types and Data Structures

When writing a function, we often need to test how it handles different data types. For example, if you write a function to process data in a list, you may also need to test how it responds to data in a tuple. The faker library provides functionality to generate test data of different Python data types and structures. By the way, here is a course on Python data structures in practice if you want to check one out.

Let’s start by taking a look at different ways of generating some test data:

>>> f = Faker()
>>> print(f.pybool())
True

>>> print(f.pyint())
9638

>>> print(f.pystr())
svScHHdLPfjBhjyTdQSf

There’s even a method, pydecimal(), to generate the decimal.Decimal data type. These methods have optional arguments for placing constraints on the test data generated.

Let’s generate a float under some constraints:

>>> print(f.pyfloat(left_digits=3, right_digits=5, positive=True, min_value=500, max_value=1000))
679.72304

If you work with date-and-time data, including time series data, faker has you covered. To get a test datetime object, do the following:

>>> date_time = f.date_time()
>>> print(date_time.strftime('%Y-%m-%d %H:%M:%S'))
1971-05-03 03:14:00

We discuss working with date and time data in this article. There is even a method to generate a test time-series dataset, which can be incredibly useful for data analysis projects. Try executing f.time_series(); it returns a generator object. You can convert this into a list using the built-in list() function; this results in a list of tuples where the first element of each tuple is a datetime object and the second is a float. Check out this article for more information on generators and this course on built-in algorithms in Python, if you want some extra learning material.

We can generate a test file name including the path as follows:

>>> print(f.file_path(category='text', depth=5))
/rise/push/wish/expect/hundred/maintain.csv

There are several categories to choose from, each of which changes the file extension. Python data structures, such as lists, can be generated as follows:

>>> print(f.pylist())
[714.68591737874, Decimal('901.82065835268977257514616953'), 4389, 'http://choi.biz/wp-content/main.html', 4457, 'KXmDevzyUWAXGMospgjR']

Notice there is a mix of data types in the list. As we have seen in the example for generating a float, you can specify some of the properties of the list with optional arguments. There are similar methods for tuples, dictionaries, and sets. Try a few of these out to see what you get.

Text Data

If you’re interested in testing programs that work with text data, faker has functions to generate individual words and full sentences. An advantage of this library is that it can generate text in many languages. However, the words and sentences are randomly generated and as such have no semantic meaning.

Here are a few examples of some of these functions in action using the objects we have instantiated in the first example:

>>> print(f_en.word())
walk

>>> print(f_de.word())
steigen

>>> print(f_en.text())
Give student lose law. Interview responsibility event relationship election meeting him. Full person instead the stuff newspaper.

>>> print(f_jp.text(max_nb_chars=20))
フレームノート織るヘア柔らかい。

There are a few more faker methods worth mentioning if you want to generate test text data in Python. The sentence() and sentences() methods generate a single sentence and a list of sentences, respectively. Similarly, the paragraph() and paragraphs() methods generate a single paragraph and a list of paragraphs. The difference is that each paragraph consists of several sentences, each ending with a period. All of these methods have an optional argument for specifying the length of the result.

Generating a Test Dataset

So far, we have shown mostly examples of generating individual pieces of data, be it personal data, numeric data, or text. We have also discussed how to generate common Python data structures such as lists, tuples, and dictionaries.

However, you need more than that for many applications. So now, we show you how to generate a test dataset with multiple records.

To generate a full test personal profile, do the following:

>>> f = Faker()
>>> profile = f.profile()

This profile contains a randomly generated name, job, address, and birthdate, among other information. All the data is stored in a Python dictionary. You can customize and supplement with other information by adding extra data to the dictionary as follows:

>>> profile['credit card'] = f.credit_card_number()

You may use a loop to create several profiles and append these to a list to generate a full dataset. A pandas DataFrame is a convenient way to store this data, which you can create easily from this list of dictionaries.

The comma-separated values (CSV) format is a common way of storing data. With the faker library, you can easily generate test CSV data using the csv() function. This function accepts several arguments for customizing the amount and type of data. Here’s an example of how to generate a header, then 5 records with a name, job, and email address:

>>> csv_data = f.csv(header=('Name', 'Profession', 'email'), data_columns=('{{name}}', '{{job}}', '{{email}}'), num_rows=5)
>>> print(csv_data)
"Name","Profession","email"
"James Sutton","Pathologist","micheal432@example.org"
"Jason Miller","Diagnostic radiographer","rachel617@example.com"
"Kimberly Edwards","TEFL teacher","jasonmoore7@example.net"
"Joshua Walton","Secretary, company","meagan166@example.com"
"Dylan White","Intelligence analyst","tiffany73@example.net"

Related to both of these is the json() function. With this, you generate a test dataset in the JavaScript Object Notation (JSON) format, a convenient way to store data in a nested structure. This also can be customized with optional arguments.

Leverage faker as a Python Test Data Generator

We have introduced you to the faker library to generate test data in Python. It’s very flexible and customizable, letting you generate test data for many applications.

We have a separate article on the top 15 Python libraries for data science, and a Python test data generator like the faker library is another great tool to add to your arsenal. Whether it is for training a machine-learning algorithm or testing a program, faker has many easy-to-use and highly customizable functions to get the job done when you need to generate data.