Web scraping is a common and effective way of collecting data for projects and for work. In this guide, we’ll be touring the essential stack of Python web scraping libraries.
Why only 5 libraries?
There are dozens of packages for web scraping out there… but you only need a handful to be able to scrape almost any site. This is an opinionated guide. We’ve decided to feature the 5 Python libraries for web scraping that we love most. Together, they cover all the important bases, and they are well-documented.
Do I need to learn every library below?
No, but everyone will need Requests, because it’s how you communicate with websites. The rest depend on your use case. Here’s a rule of thumb:
- You should learn at least one of BeautifulSoup or lxml. Pick depending on which is more intuitive for you (more on this below).
- Learn Scrapy if you need to build a real spider or web-crawler, instead of just scraping a few pages here and there.
Why are they tasty?
Because they are yummy! So without further ado…
- The Farm: Requests
- The Stew: Beautiful Soup 4
- The Salad: lxml
- The Restaurant: Selenium
- The Chef: Scrapy
The Farm: Requests
The Requests library is vital to add to your data science toolkit. It’s a simple yet powerful HTTP library, which means you can use it to access web pages.
We call it The Farm because you’ll be using it to get the raw ingredients (i.e. raw HTML) for your dishes (i.e. usable data).
Its simplicity is definitely its greatest strength. It’s so easy to use that you could jump right in without reading documentation.
For example, if you want to pull down the contents of a page, it’s as easy as:
import requests

page = requests.get('http://examplesite.com')
contents = page.content
But that’s not all that Requests can do. It can access APIs, post to forms, and much more.
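For instance, Requests encodes query parameters for you. Here’s a small sketch (the URL and parameters are placeholders); preparing the request lets you inspect the final URL without sending anything over the network:

```python
import requests

# Build the request without sending it, so we can see the encoded URL.
req = requests.Request('GET', 'http://examplesite.com/search',
                       params={'q': 'web scraping', 'page': 2})
prepared = req.prepare()
print(prepared.url)  # http://examplesite.com/search?q=web+scraping&page=2

# Posting form data is just as simple (this one does hit the network):
# response = requests.post('http://examplesite.com/login',
#                          data={'user': 'alice', 'password': 'secret'})
```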
Plus, it’s got character… It’s the only library that calls itself Non-GMO, organic, and grass-fed. You gotta love that.
- Requests Quickstart Guide – Official documentation. Covers practical topics like passing parameters, handling responses, and configuring headers.
The Stew: Beautiful Soup 4
After you have your ingredients, now what? Now you make them into a stew… a beautiful stew.
Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.
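Swapping parsers is just a different second argument. A minimal sketch, assuming the lxml package is installed:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

# Default: Python's built-in parser (no extra dependencies)
soup_builtin = BeautifulSoup(html, "html.parser")

# Faster: lxml's parser, if the lxml package is installed
soup_lxml = BeautifulSoup(html, "lxml")

print(soup_lxml.p.text)  # Hello
```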
One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, if you wanted to find all the links in the web page we pulled down earlier, it’s only a few lines:
from bs4 import BeautifulSoup
soup = BeautifulSoup(contents, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]
This charming simplicity has made it one of the most beloved Python web scraping libraries!
- Beautiful Soup Documentation – Includes convenient quickstart guide.
- Really Short Example – Short example of using Beautiful Soup and Requests together.
The Salad: lxml
lxml is a high-performance, production-quality HTML and XML parsing library. We call it The Salad because you can rely on it to be good for you, no matter which diet you’re following.
Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
It’s also quite easy to pick up if you have experience with either XPath or CSS selectors. Its raw speed and power have helped it become widely adopted in industry.
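For a taste, here’s a minimal sketch that pulls every link’s href with a single XPath expression (the HTML snippet stands in for a real page; in practice you’d pass the content you fetched with Requests):

```python
from lxml import html

# Parse a small HTML snippet into an element tree.
doc = html.fromstring(
    '<html><body>'
    '<a href="/about">About</a>'
    '<a href="/contact">Contact</a>'
    '</body></html>'
)

# One XPath expression extracts all the hrefs at once.
hrefs = doc.xpath('//a/@href')
print(hrefs)  # ['/about', '/contact']
```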
Beautiful Soup vs lxml
Historically, the rule of thumb was:
- If you need speed, go for lxml.
- If you need to handle messy documents, choose Beautiful Soup.
Yet, this distinction no longer holds. Beautiful Soup now supports using the lxml parser, and vice versa. It’s also pretty easy to learn the other once you’ve learned one.
So to start, we recommend trying both and picking the one that feels more intuitive for you. We prefer lxml, but many swear by Beautiful Soup.
- lxml Documentation – Official documentation.
- HTML Scraping with lxml and Requests – Short and sweet tutorial on pulling a webpage with Requests and then using XPath selectors to mine the desired data. This is more beginner-friendly than the official documentation.
The Restaurant: Selenium
Sometimes, you do need to go to a restaurant to eat certain dishes. The farm is great, but you can’t find everything there.
Some sites require you to click through forms before seeing their content. Or to select options from a dropdown. Or to perform a tribal rain dance…
For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
Selenium is a tool that automates browsers, also known as a web-driver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
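As a sketch, here’s one way to wrap the basic fetch-with-a-real-browser flow in a helper (the function name is ours; running it requires the selenium package and a Chrome driver installed, so the import lives inside the function and the actual call is left commented out):

```python
def fetch_rendered_html(url):
    """Open a real browser, load the page, and return the rendered HTML.

    Requires the selenium package and a Chrome driver on your PATH.
    """
    from selenium import webdriver  # imported here so the sketch loads without a browser

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()

# Usage (opens a real Chrome window):
# contents = fetch_rendered_html('http://examplesite.com')
```

From there, you can hand the rendered HTML to Beautiful Soup or lxml exactly as before.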
- Selenium with Python – Documentation for Selenium’s Python bindings.
- Webscraping with Selenium – Excellent, thorough 3-part tutorial for scraping websites with Selenium.
- Scraping Hotel Prices – Code snippet for scraping hotel prices using Selenium and lxml.
The Chef: Scrapy
Ok, we covered a lot just now. You’ve got Requests and Selenium for fetching HTML/XML from web pages. Then, you can use Beautiful Soup or lxml to parse it into useful data.
But what if you need more than that? What if you need a complete spider that can crawl through entire websites in a systematic way?
Introducing: Scrapy! Scrapy is technically not even a library… it’s a complete web scraping framework. That means you can use it to manage requests, preserve user sessions, follow redirects, and handle output pipelines.
It also means you can swap out individual modules with other Python web scraping libraries. For instance, if you need to insert Selenium for scraping dynamic web pages, you can do that (see example).
Scrapy architecture (diagram from the official documentation: https://doc.scrapy.org/en/latest/topics/architecture.html)
So if you need to reuse your crawler, scale it, manage complex data pipelines, or cook up some other sophisticated spider, then Scrapy was made for you.
- Scrapy Documentation – Official site with links to many other resources.
- Extracting data from websites with Scrapy – Detailed tutorial for scraping an e-commerce site using Scrapy.
- Scrapinghub – Cloud-based crawling service by the creators of Scrapy. The first cloud unit is free.