28th Jan 2021 9 minutes read

Top 15 Python Libraries for Data Science

We look at basic and advanced Python libraries for data science. Learn about getting, processing, modeling, and visualizing data in Python.

The Python ecosystem offers a wide range of tools for data scientists. For newbies, it might be challenging to distinguish between fundamental data science tools and the ‘nice-to-haves’. In this article, I’ll guide you through the most popular Python libraries for data science.

Python Libraries for Getting Data

Data science starts with data. To do data analysis or modeling with Python, you need to first import your data. Data can be stored in different formats, but luckily the Python community has developed many packages for getting input data. Let’s see which Python libraries are the most popular for importing and preparing data.

csv

CSV (Comma Separated Values) is a common format for storing tabular data and for importing and exporting it. To handle CSV files, Python has a built-in csv module. For example, to read data from a CSV file, you can use the csv.reader() function, which returns a reader object that iterates through the rows of the file. To export data to CSV format, use the csv.writer() function.
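As a minimal sketch of the reading and writing described above, the snippet below writes a few rows with csv.writer() and reads them back with csv.reader(). The sample rows are made up for illustration, and an in-memory buffer stands in for a real file on disk.

```python
import csv
import io

# Hypothetical sample data; an in-memory text buffer stands in for a real file.
rows = [["name", "score"], ["Ada", "91"], ["Linus", "78"]]

# Write the rows in CSV format.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)

# Rewind and read the rows back; csv.reader yields each row as a list of strings.
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)    # first row holds the column names
records = list(reader)   # remaining rows

print(header)   # ['name', 'score']
print(records)  # [['Ada', '91'], ['Linus', '78']]
```

With a real file, you would typically replace the buffer with open("data.csv", newline=""), where data.csv is a placeholder file name.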

LearnPython.com has a dedicated course called How to Read and Write CSV Files in Python, where you can practice working with the csv module.

json

JSON, or JavaScript Object Notation, is a standard format for storing and exchanging text data. Even though it was inspired by a subset of the JavaScript programming language, JSON is language-agnostic – you don’t need to know JavaScript to work with JSON files.

To encode and decode JSON data, Python has a built-in module called json. After importing the json module, you’ll be able to read JSON documents with the json.load() function or write your data to JSON files with the json.dump() function.
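Here is a small sketch of that round trip. The record dictionary is an invented example, and an in-memory buffer is used in place of a JSON file so the snippet runs anywhere.

```python
import io
import json

# Hypothetical record to store; any combination of dicts, lists, strings,
# numbers, booleans, and None can be encoded.
record = {"library": "json", "built_in": True, "tags": ["encode", "decode"]}

# json.dump() writes to a file-like object.
buffer = io.StringIO()
json.dump(record, buffer, indent=2)

# json.load() reads from a file-like object and returns Python objects.
buffer.seek(0)
loaded = json.load(buffer)

# json.dumps() / json.loads() do the same with plain strings.
round_trip = json.loads(json.dumps(record))

print(loaded == record)      # True
print(round_trip == record)  # True
```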

In the course How to Read and Write JSON Files in Python, you’ll get 35 interactive exercises to practice handling JSON data in Python.

openpyxl

If your data is primarily stored in Excel, you’ll find the openpyxl library very helpful. It was designed to read and write Excel 2010 files, and it supports the xlsx, xlsm, xltx, and xltm formats. In contrast to the above packages, openpyxl is not built into Python; you’ll need to install it before you use it.

This library allows you to read Excel spreadsheets, import specific data from a particular sheet, append data to the existing spreadsheet, and create new spreadsheets with formulas, images, and charts.
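The following sketch shows one way these steps might look, assuming openpyxl is installed. The sheet name, column names, and values are hypothetical, and the workbook is saved to an in-memory stream rather than a file on disk.

```python
import io

from openpyxl import Workbook, load_workbook

# Create a new workbook and fill its first sheet (names and values are made up).
wb = Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["item", "price", "quantity"])
ws.append(["pen", 1.5, 10])
ws.append(["notebook", 4.25, 3])

# Formulas are stored as text; Excel evaluates them when the file is opened.
ws["D1"] = "total"
ws["D2"] = "=B2*C2"

# Save to an in-memory stream (a file name such as "sales.xlsx" also works).
stream = io.BytesIO()
wb.save(stream)
stream.seek(0)

# Load the workbook again and read specific cells from a particular sheet.
wb2 = load_workbook(stream)
sheet = wb2["Sales"]
data = list(sheet.iter_rows(min_row=2, max_col=3, values_only=True))
formula = sheet["D2"].value

print(data)
print(formula)
```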

Check out the interactive course How to Read and Write Excel Files in Python to practice interacting with Excel Workbooks using Python.

Scrapy

If the data you want to use is on the web, Python has several packages that’ll get it in a fast and simple way. Scrapy is a popular open-source library for crawling web sites and extracting structured data.

With Scrapy you can, for example, scrape Twitter for tweets from a particular account or with specified hashtags. The result may include lots of information beyond the tweet itself; you may get a table with usernames, tweet times and texts, the number of likes, retweets, and replies, etc. Other than web scraping, Scrapy can also be used to extract data using APIs.

Its speed and flexibility make Scrapy a great tool for extracting structured data that can be further processed and used in various data science projects.

Beautiful Soup

Beautiful Soup is another popular library for getting data from the web. It was created to extract useful information from HTML and XML files, including those with invalid syntax and structure. The unusual name of this Python library refers to the fact that such poorly-marked-up pages are often called ‘tag soup’.

When you run an HTML document through Beautiful Soup, you get a BeautifulSoup object that represents the document as a nested data structure. Then you can easily navigate that data structure to get what you need, e.g. the page’s text, link URLs, specific headings, etc.
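As an illustration, the sketch below parses a small, deliberately sloppy HTML string (the p tag is never closed) and pulls out the title, a heading, and the link URLs. It assumes the beautifulsoup4 package is installed; the HTML and the example.com URLs are invented for the demo.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; note that the <p> tag is never closed.
html = """
<html><head><title>Demo page</title></head>
<body>
<h1>Python libraries</h1>
<p>Read the <a href="https://example.com/docs">docs</a> or the
<a href="https://example.com/blog">blog</a>.
</body></html>
"""

# "html.parser" is the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                           # text of the <title> tag
heading = soup.h1.get_text()                        # text of the first <h1>
links = [a["href"] for a in soup.find_all("a")]     # every link URL

print(title)
print(heading)
print(links)
```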

The flexibility of the Beautiful Soup library is remarkable. Check it out if you need to work with web data.

Python Libraries for Processing and Modeling Data

After getting your data, you’ll need to clean and prepare it for analysis and modeling. Let’s review Python libraries that assist data scientists in preparing data and building and training machine learning models.

pandas

For those working with tabular data in Python, pandas is the first choice for data analysis and manipulation. One of its key features is the data frame, a dedicated data structure for two-dimensional data. Data frame objects have rows and columns just like tables in Excel.

The pandas library has a huge set of tools for data cleaning, manipulation, analysis, and visualization. With pandas, you can:

Add, delete, and update data frame columns.
Handle missing values.
Index, rename, sort, and merge data frames.
Plot data distribution, etc.
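The operations listed above can be sketched in a few lines. This is only an illustration, assuming pandas and NumPy are installed; the city and temperature data are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with one missing temperature.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "temp_c": [4.0, np.nan, 31.0],
})

# Handle missing values: fill the gap with the column mean (17.5 here).
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Add a new column computed from an existing one.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Rename a column and sort the rows, warmest first.
df = df.rename(columns={"city": "location"}).sort_values("temp_c", ascending=False)

# Merge with a second data frame on the shared "location" column.
countries = pd.DataFrame({
    "location": ["Oslo", "Lima", "Pune"],
    "country": ["Norway", "Peru", "India"],
})
merged = df.merge(countries, on="location")

# Delete a column that is no longer needed.
merged = merged.drop(columns=["temp_f"])

print(merged)
```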

If you want to start working with tabular data in Python, check out our Introduction to Python for Data Science course. It includes 141 interactive exercises that let you practice simple data analysis and data manipulation with the pandas library.

NumPy

NumPy is a fundamental Python library for data science. It is designed to perform numerical operations with n-dimensional arrays, which store values of the same data type. NumPy’s vectorized array operations run in compiled code, which makes them much faster than equivalent Python loops.

With NumPy, you can do basic and advanced array operations (e.g. add, multiply, slice, reshape, index), generate random numbers, and perform linear algebra routines, Fourier transforms, and more.
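A few of these operations are sketched below, assuming NumPy is installed. The array values and the random seed are arbitrary choices for the demo.

```python
import numpy as np

# Build a 2x3 array: [[0, 1, 2], [3, 4, 5]].
a = np.arange(6).reshape(2, 3)

# Vectorized arithmetic applies to every element without an explicit loop.
b = a * 2 + 1

# Slicing: take the second column of a.
col = a[:, 1]

# Random numbers from a seeded generator (the seed is an arbitrary choice).
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=3)

# Linear algebra: solve m @ x = rhs for x.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
rhs = np.array([2.0, 8.0])
x = np.linalg.solve(m, rhs)

# Fourier transform of a constant signal.
spectrum = np.fft.fft(np.ones(4))

print(b)
print(col)
print(x)
```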

SciPy

SciPy is a fundamental library for scientific computing. It’s built upon NumPy and leverages many of that library’s benefits for working with arrays.

With SciPy, you can perform scientific programming tasks such as calculus, ordinary differential equations, numerical integration, interpolation, optimization, linear algebra, and statistical computations.
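Two of these tasks, numerical integration and optimization, are sketched below. The snippet assumes SciPy and NumPy are installed; the integrand and the function being minimized are simple examples picked for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Optimization: find the x that minimizes (x - 3)^2; the true minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(area)
print(result.x)
```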

scikit-learn

A fundamental Python library for machine learning, scikit-learn focuses on modeling data after it has been cleaned and prepared (using libraries like NumPy and pandas). This is a very efficient tool for predictive data analysis. Furthermore, it is beginner-friendly, making machine learning with Python accessible to everybody.

With just a few lines of code, scikit-learn allows you to build and train machine learning models for regression, classification, clustering, dimensionality reduction, and more. It supports algorithms such as support vector machines (SVM), random forests, k-means, gradient boosting, and many others.
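To give a sense of what "a few lines of code" means here, the sketch below trains a random forest classifier on the iris dataset that ships with scikit-learn. The dataset, the split size, and the hyperparameters are choices made for this illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build and train the model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out data.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Other estimators follow the same fit-and-score pattern, so swapping in a different model usually means changing only the import and the constructor line.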

PyTorch

PyTorch is an open-source deep learning framework built by Facebook’s AI Research lab. It was created to implement advanced neural networks and cutting-edge research ideas in industry and academia.

Like scikit-learn, PyTorch focuses on data modeling. However, it is intended for advanced users who work primarily with deep neural networks. PyTorch is a great tool to use when you need a production-ready machine learning model that is fast, efficient, scalable, and able to run in a distributed environment.

TensorFlow

TensorFlow is another open-source library for developing and training machine learning models. Built by the Google Brain team, TensorFlow is a major competitor to PyTorch in the development of deep learning applications.

TensorFlow and PyTorch used to have some major differences, but they have now adopted many good features from each other. They are both excellent frameworks for building deep learning models. When you hear about breakthrough neural network architectures for object detection, facial recognition, language generation, or chatbots, they are very likely coded using either PyTorch or TensorFlow.

Python Libraries for Visualizing Data

In addition to data analysis and modeling, Python is also a great tool for visualizing data. Here are some of the most popular Python libraries that can help you create meaningful, informative, interactive, and appealing data visualizations.

matplotlib

This is the de facto standard library for generating data visualizations in Python, although it is not part of Python’s standard library and must be installed separately. It supports building basic two-dimensional graphs like line plots, histograms, scatter plots, bar charts, and pie charts, as well as more complex animated and interactive visualizations.

The matplotlib library is also flexible with regard to formatting and styling plots; you can choose how to display labels, grids, legends, etc. However, one major disadvantage of matplotlib is that it requires data scientists to write lots of code to create complex and visually appealing plots.
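A minimal sketch of a styled line plot follows, assuming matplotlib is installed. The data points, labels, and title are invented, and the figure is written to an in-memory buffer with the non-interactive Agg backend so the snippet also runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical data: the squares of a few integers.
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")

# Formatting and styling: labels, title, grid, and legend.
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.grid(True)
ax.legend()

# Render the figure as PNG into memory (a file name also works here).
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Even this basic plot takes about a dozen lines, which illustrates the verbosity mentioned above.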

For those willing to learn data visualization with matplotlib, I recommend starting with our two-part tutorial, which covers line plots and histograms as well as bar plots, scatter plots, stack plots, and pie charts. If you’re working with time series data, check out this guide to visualizing it with Python.

Finally, matplotlib is also covered in our Introduction to Python for Data Science course, where you can practice building line plots, histograms, and other plot types.

seaborn

Although it was built upon matplotlib, the seaborn library has a high-level interface that enables users to draw attractive and informative statistical graphs in just a few lines of code – or only one line of code! Its concise syntax and advanced features make it my favorite visualization tool.

Thanks to an expansive collection of visualizations and a set of built-in themes, you can create professional plots even if you are very new to coding data visualizations. Leverage seaborn’s extensive features to create heatmaps, violin plots, joint plots, multi-plot grids, and more.

Figure: Example of a scatterplot matrix.

Bokeh

Bokeh is a great tool for creating interactive visualizations inside browsers. Like seaborn, it allows you to build complex plots using simple commands. However, its main focus is on interactivity.

With Bokeh, you can link plots, display relevant data while hovering over specific data points, embed different widgets, etc. Its extensive interactive abilities make Bokeh a perfect tool for building dashboards, network graphs, and other complex visualizations.

Plotly

Plotly is another browser-based visualization library. It offers many useful out-of-the-box graphics, including:

Basic plots (e.g. scatterplots, line plots, bar charts, pie charts, bubble charts).
Statistical plots (e.g. error bars, box plots, histograms).
Scientific plots (e.g. contour plots, heatmaps).
Financial charts (e.g. time series and candlestick charts).
Maps (e.g. adding lines, filled areas, bubbles, and heatmaps to geographic maps).
3D plots (e.g. scatterplots, surface plots).

Consider using Plotly if you want to build interactive and publication-quality graphs.

Figure: Example of a mapbox density heatmap with Plotly.

Learn More About Python’s Data Science Libraries

Now that you’ve been introduced to the Python libraries available for data science, don’t be a stranger to them! To master your data science skills, you’ll need lots of practice. I recommend starting with interactive courses, where an explanation of basic concepts is combined with coding challenges.

Our Introduction to Python for Data Science course is perfect for beginners who want to learn how to perform simple data analysis using Python. It teaches you how to work with tabular data and create basic plots with a few lines of code.

For data enthusiasts who want to expand their knowledge, LearnPython.com has developed the Python for Data Science mini-track. It consists of five courses that cover importing and exporting data in different formats, working with strings in Python, and the basics of data analysis and visualization. This track is a great option for a gentle introduction to the world of data science.

Thanks for reading, and happy learning!
