Recently, some of our readers have been asking us about the best programming language for data science. Immediately, R and Python both come to mind… but which of these two giants to choose?
We felt that this was a good time to address this question because we recently watched an excellent presentation on recent advances of both languages by Eduardo Ariño de la Rubia, the Chief Data Scientist at Domino Data Lab.
The main reason we liked the video is that it shows how far both Python and R have progressed. Both languages have become well rounded for data science.
Some people point to traditional weaknesses of each language (e.g. data visualization in Python or data wrangling in R), but thanks to recent packages like Altair for Python and dplyr for R, those weaknesses have been alleviated.
This post is a summary of the modern advances discussed in the video. We recommend watching the full video at their blog, but you can use this page to find links to each library mentioned.
We have 2 main goals for this post:
- For experienced data scientists, we hope to introduce you to a library or two that solves an annoying or painful problem you’re currently facing in your chosen language.
- For beginner data scientists, we want to introduce you to all the great work that’s going into both languages so you can feel at ease with the one you chose.
Finally, at the end of this post, we’ll provide our recommendations for the best language to start with depending on your background and your goals.
First, here is the summary from the presentation:
The Case for Python
Key quote: “I have this hope that there is a better way. Higher-level tools that actually let you see the structure of the software more clearly will be of tremendous value.” – Guido van Rossum
Guido van Rossum was the creator of the Python programming language.
Why Python is Great for Data Science
- Python was first released in 1991 (development began in 1989). It has been around for a long time, and it has object-oriented programming baked in.
- IPython / Jupyter’s notebook IDE is excellent.
- There’s a large ecosystem. For example, Scikit-Learn’s page receives 150,000 – 160,000 unique visitors per month.
- There’s Anaconda from Continuum Analytics, making package management very easy.
- The Pandas library makes it simple to work with data frames and time series data.
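As a rough illustration of the Pandas point above, here is a minimal sketch of working with a data frame that has a time series index. The `sales` column and the dates are made up for this example; they are not from the presentation.

```python
import pandas as pd

# Hypothetical daily data: six days of made-up sales figures.
dates = pd.date_range("2017-01-01", periods=6, freq="D")
df = pd.DataFrame({"sales": [10, 12, 9, 15, 11, 14]}, index=dates)

# Resample the daily series into 3-day buckets and sum each bucket.
three_day_totals = df["sales"].resample("3D").sum()

# A 2-day rolling mean smooths the series; the first value is NaN
# because there is no earlier day to average with.
rolling_mean = df["sales"].rolling(window=2).mean()

print(three_day_totals)
print(rolling_mean)
```

Because the index holds real timestamps, operations like resampling and rolling windows need no manual date arithmetic.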
Advances in Modern Python for Data Science
1. Collecting Data
Feather (Fast reading and writing of data to disk)
- Fast, lightweight, easy-to-use binary file format for data frames
- Makes pushing data frames in and out of memory as simple as possible
- Language agnostic (works across Python and R)
- High read and write performance (600 MB/s vs 70 MB/s of CSVs)
- Great for passing data from one language to another in your pipeline
Ibis (Pythonic way of accessing datasets)
- Bridges the gap between local Python environments and remote storages like Hadoop or SQL
- Integrates with the rest of the Python ecosystem
ParaText (Fastest way to get fixed records and delimited data off of disk and into RAM)
- C++ library for reading text files in parallel on multi-core machines
- Integrates with Pandas: paratext.load_csv_to_pandas("data.csv")
- Enables CSV reading of up to 2.5GB a second
- A bit difficult to install
bcolz (Helps you deal with data that’s larger than your RAM)
- Compressed columnar storage
- You have the ability to define a Pandas-like data structure, compress it, and store it in memory
- Helps get around the performance bottleneck of querying from slower memory
2. Data Visualization
Altair (Like a Matplotlib 2.0 that’s much more user friendly)
- You can spend more time understanding your data and its meaning.
- Altair’s API is simple, friendly and consistent.
- Create beautiful and effective visualizations with a minimal amount of code.
- Takes a tidy DataFrame as the data source.
- Data is mapped to visual properties using the group-by operation of Pandas and SQL.
- Primarily for creating static plots.
Bokeh (Reusable components for the web)
- Interactive visualization library that targets modern web browsers for presentation.
- Able to embed interactive visualizations.
- D3.js for Python, except better.
- Already has a big gallery that you can borrow (steal) from.
Geoplotlib (Interactive maps)
- Extremely clean and simple way to create maps.
- Can take a simple list of names, latitudes, and longitudes as input.
3. Cleaning & Transforming Data
Blaze (NumPy for big data)
- Translates a NumPy / Pandas-like syntax to data computing systems.
- The same Python code can query data across a variety of data storage systems.
- Good way to future-proof your data transformations and manipulations.
xarray (Handles n-dimensional data)
- N-dimensional arrays of core pandas data structures (e.g. if the data has a time component as well).
- Multi-dimensional Pandas dataframes.
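As a sketch of what labeled n-dimensional data looks like in xarray, consider the example below. The temperature values, dimension names, and coordinates are all made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up readings with three labeled dimensions: time, lat, lon.
values = np.arange(24, dtype=float).reshape(2, 3, 4)
temps = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2017-01-01", periods=2),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 5.0, 10.0, 15.0],
    },
    name="temperature",
)

# Reduce over a dimension by name rather than by axis number.
mean_over_time = temps.mean(dim="time")

# Select by coordinate label, much like .loc in pandas.
point = temps.sel(lat=20.0, lon=5.0)

print(mean_over_time.shape)
print(point.values)
```

Referring to dimensions by name is the main convenience here: the code stays readable even as the number of axes grows.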
Dask (Parallel computing)
- Dynamic task scheduling system.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.
4. Modeling
Keras (Simple deep learning)
- Higher level interface for Theano and Tensorflow
- We wrote a complete Keras tutorial for beginners
PyMC3 (Probabilistic programming)
- Incorporates cutting-edge research from academic labs
- Powerful Bayesian statistical modeling
Do you want to see tutorials for any of these libraries? Leave a comment below to let us know which ones!
The Case for R
Key quote: “There should be an interface to the very best numerical algorithms available.” – John Chambers
John Chambers actually created S, the precursor to R, but the spirit of R is the same.
Why R is Great for Data Science
- R was created in 1992, after Python, and was therefore able to learn from Python’s lessons.
- Rcpp makes it very easy to extend R with C++.
- RStudio is a mature and excellent IDE.
- (Our note) CRAN is a candyland filled with machine learning algorithms and statistical tools.
- (Our note) The Caret package makes it easy to use different algorithms from 1 single interface, much like what Scikit-Learn has done for Python
Advances in Modern R for Data Science
1. Collecting Data
Feather (Fast reading and writing of data to disk)
- Same as for Python
Haven (Interacts with SAS, Stata, SPSS data)
- Reads SAS files and brings them into a data frame
Readr (Reimplements read.csv into something better)
- read.csv sucks because it converts strings into factors, it’s slow, etc.
- Creates a contract for what the data features should be, making it more robust to use in production
- Much faster than read.csv
JsonLite (Handles JSON data)
- Intelligently turns JSON into matrices or dataframes
2. Data Visualization
ggplot2 (ggplot2 was recently massively upgraded)
- Recently had a very significant upgrade (to the point where old code will break)
- You can do faceting and zoom into facets
htmlwidgets (Reusable components)
- Brings the best of JavaScript visualization to R
- Has a fantastic gallery you can borrow (steal) from
Leaflet (Interactive maps for the web)
- Nice JavaScript maps that you can embed in web applications
tilegramsR (Proportional maps)
- Create maps that are proportional to the population
- Makes it possible to create more interesting maps than those that only highlight major cities due to population density
3. Cleaning & Transforming Data
Dplyr (Swiss army chainsaw)
- The way R should’ve been in the first place
- Has a bunch of amazing joins
- Makes data wrangling much more humane
Broom (Tidy your models)
- Fixes model outputs (gets around the weird incantations needed to see model coefficients)
- tidy, augment, glance
tidytext (Text as tidy data)
- Text mining using dplyr, ggplot2, and other tidy tools
- Makes natural language processing in R much easier
4. Modeling
MXNet (Simple deep learning)
- Intuitive interface for building deep neural networks in R
- Not quite as nice as Keras
- Now has an interface in R
Do you want to see tutorials for any of these libraries? Leave a comment below to let us know which ones!
Our Recommendation
As you can see, both languages are actively being developed and have an impressive suite of tools already. It sounds cliché to say this, but there’s really no one-size-fits-all answer.
If you’re just starting out, one simple way to choose would be based on your comfort zone. For example, if you come from a C.S./developer background, you’ll probably feel more comfortable with Python. On the other hand, if you come from a statistics/analyst background, R will likely be more intuitive.
At EliteDataScience, we do love R, but we more often prefer to use Python. Python is a general-purpose programming language, making it possible to do pretty much anything you want to do.
Python also has the wonderful Keras package, as mentioned above, making it a breeze to get started with deep learning.
If you’d like to learn Python for Data Science, we recommend checking out our free guide: