5 Tasty Python Web Scraping Libraries

Web scraping is a common and effective way of collecting data for projects and for work. In this guide, we’ll be touring the essential stack of Python web scraping libraries.

Python Web Scraping Libraries

Why only 5 libraries?

There are dozens of packages for web scraping out there... but you only need a handful to be able to scrape almost any site. This is an opinionated guide. We've decided to feature the 5 Python libraries for web scraping that we love most. Together, they cover all the important bases, and they are well-documented.

Do I need to learn every library below?

No, but everyone will need Requests, because it's how you communicate with websites. The rest depend on your use case. Here's a rule of thumb:

  • You should learn at least one of BeautifulSoup or lxml. Pick depending on which is more intuitive for you (more on this below).
  • Learn Selenium if you need to scrape sites with data tucked away by JavaScript.
  • Learn Scrapy if you need to build a real spider or web crawler, instead of just scraping a few pages here and there.

Why are they tasty?

Because they are yummy! So without further ado...

The Farm: Requests

The Requests library is vital to add to your data science toolkit. It's a simple yet powerful HTTP library, which means you can use it to access web pages.

We call it The Farm because you'll be using it to get the raw ingredients (i.e. raw HTML) for your dishes (i.e. usable data).

Its simplicity is definitely its greatest strength. It's so easy to use that you could jump right in without reading the documentation.

For example, if you want to pull down the contents of a page, it's as easy as:
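A minimal sketch of such a fetch, assuming a helper named fetch_html (our own name, not part of Requests) and a 10-second timeout chosen for illustration:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    fetch_html is a hypothetical helper name; the timeout value is an
    assumption, not a Requests default.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling fetch_html("https://example.com") should hand back the page's raw HTML as a string, ready for one of the parsers below.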

But that's not all that Requests can do. It can access APIs, post to forms, and much more.
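As a rough sketch of those two uses, the helpers below send query parameters to an API and post fields to a form. The function names, the "q" parameter, and the example URLs in the note are assumptions made up for illustration; real sites define their own.

```python
import requests


def search(session: requests.Session, url, query):
    """GET a URL with a query-string parameter and return the decoded JSON."""
    # "q" is a hypothetical parameter name; check the API you are calling.
    response = session.get(url, params={"q": query}, timeout=10)
    response.raise_for_status()
    return response.json()


def submit_form(session: requests.Session, url, fields):
    """POST form fields the way an HTML form would and return the body text."""
    response = session.post(url, data=fields, timeout=10)
    response.raise_for_status()
    return response.text
```

Both helpers take a requests.Session, so cookies set by a login form would carry over to later calls, for example search(session, "https://api.example.com/search", "python").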

Plus, it's got character... It's the only library that calls itself Non-GMO, organic, and grass-fed. You gotta love that.

Resources

  • Requests Quickstart Guide - Official documentation. Covers practical topics like passing parameters, handling responses, and configuring headers.

The Stew: Beautiful Soup 4

After you have your ingredients, now what? Now you make them into a stew... a beautiful stew.

Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.

Out of the box, Beautiful Soup can fall back on html.parser, the parser in Python's standard library. It's flexible and forgiving, but a little slow. The good news is that you can swap in a faster parser, such as lxml, if you need the speed.
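Swapping parsers is a matter of changing the second argument to BeautifulSoup. A small sketch, assuming a helper named make_soup (our own name) that prefers lxml and falls back to the standard library parser when lxml is not installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up markup for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"


def make_soup(markup):
    """Parse with lxml when it is available, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
print(soup.title.string)
```

The rest of your code does not need to change when the parser does, which is what makes the swap cheap to try.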

One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
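To see this in action, you can hand Beautiful Soup raw bytes instead of a decoded string. The markup below is made up for illustration, and the snippet is only a sketch of the idea:

```python
from bs4 import BeautifulSoup

# Raw bytes, as a server would send them (sample markup, not a real page).
raw = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><h1>Sacré bleu!</h1></body></html>"
).encode("utf-8")

soup = BeautifulSoup(raw, "html.parser")

heading = soup.h1.string
print(heading)
# original_encoding reports which encoding Beautiful Soup settled on.
print(soup.original_encoding)
```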

In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, if you wanted to find all the links in the web page we pulled down earlier, it's only a few lines:
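Here is a short sketch of that link search. To keep it self-contained, it parses a made-up HTML string rather than a live page; in practice you would pass in the HTML you fetched with Requests.

```python
from bs4 import BeautifulSoup

# Stand-in for HTML fetched earlier with Requests (sample markup is made up).
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <a href="/b">Second</a>
  <a>No href here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```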

This charming simplicity has made it one of the most beloved Python web scraping libraries!

The Salad: lxml

lxml is a high-performance, production-quality HTML and XML parsing library. We call it The Salad because you can rely on it to be good for you, no matter which diet you're following.

Among all the Python web scraping libraries, we've enjoyed using lxml the most. It's straightforward, fast, and feature-rich.

It's quite easy to pick up if you already know XPath or CSS selectors, and its raw speed and power have helped it become widely adopted in the industry.
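To give a feel for the XPath style, here is a minimal sketch that pulls headings and link targets out of a made-up page. The class name "title" and the markup are assumptions for illustration only.

```python
from lxml import html

# Sample markup, invented for this example.
page = """
<html><body>
  <h2 class="title">First post</h2>
  <a href="/first">Read more</a>
  <h2 class="title">Second post</h2>
  <a href="/second">Read more</a>
</body></html>
"""

tree = html.fromstring(page)

# XPath expressions return plain lists, so no extra unwrapping is needed.
titles = tree.xpath("//h2[@class='title']/text()")
hrefs = tree.xpath("//a/@href")

print(titles)
print(hrefs)
```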

Beautiful Soup vs lxml

Historically, the rule of thumb was:

  • If you need speed, go for lxml.
  • If you need to handle messy documents, choose Beautiful Soup.

Yet this distinction no longer holds. Beautiful Soup can now use lxml as its parser, and lxml can use Beautiful Soup as a parser (through lxml.html.soupparser). It's also pretty easy to learn the other once you've learned one.

So to start, we recommend trying both and picking the one that feels more intuitive for you. We prefer lxml, but many swear by Beautiful Soup.

The Restaurant: Selenium

Sometimes, you do need to go to a restaurant to eat certain dishes. The farm is great, but you can't find everything there.

Likewise, sometimes the Requests library is not enough to scrape a website. Some sites out there use JavaScript to serve content. For example, they might wait until you scroll down on the page or click a button before loading certain content.

Other sites may require you to click through forms before seeing their content. Or select options from a dropdown. Or perform a tribal rain dance...

For these sites, you'll need something more powerful. You'll need Selenium (which can handle everything except tribal rain dancing).

Selenium is a tool that automates browsers through an interface known as WebDriver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?

It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.

The Chef: Scrapy

OK, we covered a lot just now. You've got Requests and Selenium for fetching HTML/XML from web pages. Then, you can use Beautiful Soup or lxml to parse it into useful data.

But what if you need more than that? What if you need a complete spider that can crawl through entire websites in a systematic way?

Introducing: Scrapy! Scrapy is technically not even a library... it's a complete web scraping framework. That means you can use it to manage requests, preserve user sessions, follow redirects, and handle output pipelines.

It also means you can swap out individual modules with other Python web scraping libraries. For instance, if you need to insert Selenium for scraping dynamic web pages, you can do that.

[Figure: Scrapy architecture, image borrowed from official documentation]

So if you need to reuse your crawler, scale it, manage complex data pipelines, or cook up some other sophisticated spider, then Scrapy was made for you.
