Natural language processing (NLP) is an exciting field in data science and artificial intelligence that deals with teaching computers how to extract meaning from text. In this guide, we’ll be touring the essential stack of Python NLP libraries.
These packages handle a wide range of tasks such as part-of-speech (POS) tagging, sentiment analysis, document classification, topic modeling, and much more.
Why only 5 libraries?
We write every guide with the practitioner in mind. There are dozens of packages for NLP out there… but you’ll cover all the important bases once you master a handful of them. This is an opinionated guide that features the 5 Python NLP libraries we’ve found to be the most useful.
Do I need to learn every library below?
No, it all depends on your use case. Here’s a summary:
- We recommend NLTK only as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it’s not meant for production.
- TextBlob is built on top of NLTK, and it’s more easily accessible. This is our favorite library for fast prototyping or building applications that don’t require highly optimized performance. Beginners should start here.
- Stanford’s CoreNLP is a Java library with Python wrappers. It’s in many existing production systems due to its speed.
- SpaCy is a new NLP library that’s designed to be fast, streamlined, and production-ready. It’s not as widely adopted, but if you’re building a new application, you should give it a try.
- Gensim is most commonly used for topic modeling and similarity detection. It’s not a general-purpose NLP library, but for the tasks it does handle, it does them well.
Why are they heroic?
Because they are valiant! So without further ado…
- The Conqueror: NLTK
- The Prince: TextBlob
- The Mercenary: Stanford CoreNLP
- The Usurper: spaCy
- The Admiral: gensim
The Conqueror: NLTK
You can’t talk about NLP in Python without mentioning NLTK. It’s the most famous Python NLP library, and it’s led to incredible breakthroughs in the field. NLTK is responsible for conquering many text analysis problems, and for that we pay homage.
NLTK is also popular for education and research. On its own website, NLTK claims to be “an amazing library to play with natural language.”
In our experience, the key word there is “play.” NLTK has over 50 corpora and lexicons, 9 stemmers, and dozens of algorithms to choose from. It’s an academic researcher’s theme-park.
Yet, this is also one of NLTK’s major downsides. It’s heavy and slippery, and it has a steep learning curve. The second major weakness is that it’s slow and not production-ready.
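Still, a few lines are enough to get a taste of that grab-bag style. Here’s a minimal sketch, assuming nltk is installed and the punkt and averaged_perceptron_tagger data packages have been downloaded (exact package names vary a little between NLTK versions); it tokenizes a sentence, POS-tags it, and runs two of those stemmers:

```python
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer

# One-time downloads of the tokenizer and tagger models
# (package names may differ slightly depending on your NLTK version)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "NLTK conquered many text analysis problems."
tokens = nltk.word_tokenize(sentence)   # ['NLTK', 'conquered', 'many', ...]
tagged = nltk.pos_tag(tokens)           # [('NLTK', 'NNP'), ('conquered', 'VBD'), ...]
print(tagged)

# Two of NLTK's many stemmers applied to the same word
print(PorterStemmer().stem("conquered"))             # 'conquer'
print(SnowballStemmer("english").stem("conquered"))  # 'conquer'
```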
The next 3 libraries will address these weaknesses.
Resources
- NLTK Book – Complete course on Natural Language Processing in Python with NLTK.
- Dive into NLTK – Detailed 8-part tutorial on using NLTK for text processing.
The Prince: TextBlob
TextBlob sits on the mighty shoulders of NLTK and another package called Pattern. In fact, we left Pattern off this list because we recommend TextBlob instead.
TextBlob makes text processing simple by providing an intuitive interface to NLTK. It’s a welcome addition to an already solid lineup of Python NLP libraries because it has a gentle learning curve while boasting a surprising amount of functionality.
For example, let’s say you wanted to find a text’s sentiment score. You can do that out of the box:
```python
from textblob import TextBlob

opinion = TextBlob("EliteDataScience.com is dope.")
opinion.sentiment  # Sentiment(polarity=..., subjectivity=...)
```
By default, the sentiment analyzer is the PatternAnalyzer from the Pattern library. But what if you wanted to use a Naive Bayes analyzer? You can easily swap to a pre-trained implementation from the NLTK library.
```python
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

opinion = TextBlob("EliteDataScience.com is dope!", analyzer=NaiveBayesAnalyzer())
opinion.sentiment  # Sentiment(classification=..., p_pos=..., p_neg=...)
```
TextBlob is a simple, fun library that makes text analysis a joy. We use TextBlob for initial prototyping in almost every NLP project.
Resources
- TextBlob Documentation – Official documentation and quickstart guide.
- Natural Language Processing Basics with TextBlob – Excellent, short NLP crash course using TextBlob.
The Mercenary: Stanford CoreNLP
Stanford CoreNLP is a suite of production-ready natural language analysis tools. It includes part-of-speech (POS) tagging, named entity recognition, pattern learning, parsing, and much more.
“The Mercenary” is actually written in Java, not Python. You can get around this with Python wrappers made by the community.
Many organizations use CoreNLP for production implementations. It’s fast, accurate, and able to support several major languages.
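If you’d rather not commit to a particular wrapper, you can also talk to the CoreNLP HTTP server directly. The sketch below assumes you’ve downloaded CoreNLP, started StanfordCoreNLPServer on its default port 9000, and have the requests package installed; any HTTP client would work just as well:

```python
import json
import requests  # assumed to be installed; any HTTP client would do

text = "Stanford CoreNLP is fast and accurate."

# Ask the running server for tokens, sentence splits, POS tags, and named entities
props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000",
    params={"properties": json.dumps(props)},
    data=text.encode("utf-8"),
)
annotations = response.json()

for sentence in annotations["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"], token["ner"])
```

Because the server speaks plain JSON, swapping to a different Python wrapper later doesn’t lock you into anything.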
Resources
- CoreNLP Documentation – Official documentation and resource compilation.
- List of Python wrappers for CoreNLP – Kept up-to-date by Stanford NLP.
The Usurper: spaCy
SpaCy is the new kid on the block, and it’s making quite a splash. It’s marketed as an “industrial-strength” Python NLP library that’s geared toward performance.
SpaCy is minimal and opinionated, and it doesn’t flood you with options like NLTK does. Its philosophy is to only present one algorithm (the best one) for each purpose. You don’t have to make choices, and you can focus on being productive.
Because it’s built on Cython, it’s also lightning-fast. Folks have called spaCy “state-of-the-art,” and it’s hard to disagree. Its main weakness is that it currently only supports English.
SpaCy is newer, so its support community is not as large as those of some other libraries. Yet, its approach to NLP is so compelling that it could possibly dethrone NLTK.
If you’re building a new application or revamping an old one (and you only need English support), then we strongly recommend trying spaCy.
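Here’s a minimal sketch of that streamlined interface, assuming spaCy is installed along with an English model (the model name en_core_web_sm below depends on your spaCy version; older releases used spacy.load('en')):

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup.")

# One pipeline call gives you tokens, POS tags, lemmas, and named entities
for token in doc:
    print(token.text, token.pos_, token.lemma_)

for ent in doc.ents:
    print(ent.text, ent.label_)
```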
Resources
- spaCy Documentation – Official documentation and quickstart guide.
- Intro to NLP with SpaCy – Short tutorial showcasing spaCy’s functionality.
The Admiral: gensim
Last but not least, we have gensim. Gensim is not for every challenge, but what it does, it does well. You don’t send your admiral to a land battle, and you don’t use gensim for general NLP.
Gensim is a well-optimized library for topic modeling and document similarity analysis. Among the Python NLP libraries listed here, it’s the most specialized.
Even so, it’s a valuable tool to add to your repertoire. Its topic modeling algorithms, such as its Latent Dirichlet Allocation (LDA) implementation, are best-in-class. In addition, it’s robust, efficient, and scalable.
Plus, the sub-field of semantic analysis (or topic modeling) is one of the most exciting areas of modern natural language processing.
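To give a flavor, here’s a minimal LDA sketch on a toy, made-up corpus; the token lists, topic count, and pass count below are illustrative only, and a real pipeline would add proper preprocessing:

```python
from gensim import corpora, models

# A toy corpus of pre-tokenized "documents" (made up purely for illustration)
documents = [
    ["topic", "modeling", "finds", "latent", "themes", "in", "text"],
    ["gensim", "implements", "lda", "topic", "modeling", "efficiently"],
    ["document", "similarity", "works", "in", "a", "vector", "space"],
]

# Map each token to an integer id, then build bag-of-words vectors
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train a tiny LDA model; two topics and ten passes are arbitrary choices here
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```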
Resources
- gensim Documentation – Official documentation and tutorials. The tutorials page is very helpful.