At the start of any machine learning project, you face an important choice: Which language or software should I use?
Well, you have many options to choose from.
Python, R, SAS, MATLAB… the list goes on. But first, you’ll actually need to make another choice: Should I go with open source or commercial software?
Open source code is “freely available and may be redistributed and modified.” The community supports the code. On the other hand, commercial software is developed and maintained by a single company.
In this guide, we’ll compare open source and commercial options for machine learning, and then explore hybrid options.
Advantages of Open Source
Python and R are programming languages with rich open source ecosystems.
For example, Python has scikit-learn, a powerful general-purpose framework that can run classification, regression, clustering, and other tasks out of the box. Python also has specialized packages for deep learning and NLP, such as TensorFlow, Theano, and Keras.
R also has mature packages for machine learning. They include task-specific packages such as randomForest (random forests for classification and regression) as well as caret, a general-purpose framework that can interface with many other packages.
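To make "out of the box" concrete, here is a minimal sketch, assuming scikit-learn is installed. It trains a classifier and a clustering model on the library's bundled iris dataset. The dataset, the two models (LogisticRegression and KMeans), and the run_demo helper name are our own picks for illustration, not the only way to do it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def run_demo(seed=0):
    """Train a classifier and a clustering model on the bundled iris data."""
    X, y = load_iris(return_X_y=True)

    # Classification: hold out a quarter of the rows to estimate accuracy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)

    # Clustering: same fit-style interface, but no labels are passed in.
    km = KMeans(n_clusters=3, n_init=10, random_state=seed)
    cluster_labels = km.fit_predict(X)

    return accuracy, cluster_labels


if __name__ == "__main__":
    acc, clusters = run_demo()
    print(f"held-out accuracy: {acc:.2f}")
    print(f"clusters found: {len(set(clusters.tolist()))}")
```

Both models share the same fit-style interface, so swapping in a different algorithm usually means changing only a line or two.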
To learn more about Python or R, check out our guide on R vs. Python for Data Science.
Open source software has several key advantages:
More eyes = faster bug fixes
A quick look at GitHub for any of these packages reveals thousands of commits by hundreds of contributors.
Because it’s free to use and customize, open source software tends to attract huge communities. That means more people to improve it… and fix it when it breaks.
Answers to common problems
Large communities have great support for common problems. StackOverflow will be one of your most valuable resources for debugging your applications.
One way to search StackOverflow is by tag, such as “scikit-learn”.
Browsing a popular tag like this one surfaces answers to hundreds of common questions.
A technique that’s often more helpful when debugging is a simple Google search for “stackoverflow { paste your error message }”, which will help you pinpoint exact solutions.
Between StackOverflow and GitHub discussions, you should be able to overcome most roadblocks you encounter.
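If you find yourself doing this search often, it can be scripted. The sketch below uses only the Python standard library. The build_search_url name is hypothetical, and both the Google URL format and the idea of keeping just the last line of a traceback are our assumptions about what tends to work, not a rule.

```python
from urllib.parse import urlencode


def build_search_url(error_message, site_hint="stackoverflow"):
    """Return a Google search URL for `site_hint` followed by the error text."""
    # Keep only the last non-empty line: in a Python traceback that is
    # usually the line naming the exception and its message.
    lines = [line.strip() for line in error_message.splitlines() if line.strip()]
    last_line = lines[-1] if lines else ""
    query = f"{site_hint} {last_line}".strip()
    # urlencode escapes spaces and punctuation so the query is URL-safe.
    return "https://www.google.com/search?" + urlencode({"q": query})
```

You could pass the result to webbrowser.open() from the standard library to open the search straight from your terminal.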
Broader adoption = easier hiring
Another great reason to go open source is recruiting and hiring. Because open source tools are so widely used, it’s often easier to recruit team members who already have experience with them.
Advantages of Commercial
MATLAB and SAS are commercial software supported by MathWorks and the SAS Institute, respectively.
MATLAB offers a toolbox for classic ML algorithms like logistic regression and SVMs, but it also supports advanced tasks like deep learning and cloud computing.
Similarly, SAS offers the Enterprise Miner tool for machine learning and data mining. Enterprise Miner is also scalable and able to be deployed to the cloud.
Commercial software also has its own advantages:
Specialized support
When StackOverflow isn’t enough, teams can get support directly from the software provider. The more specialized the project, the more important this support will be, especially with a deadline or low margin for error.
In addition, commercial providers will offer support setting up and integrating their software into your existing technology stack.
As a result, commercial packages are popular for mission-critical, enterprise applications.
Maintenance and version control
Commercial solutions typically offer maintenance and backwards compatibility (bridging between versions of the code).
While open source software can be forked and customized, the onus is then on you to have proper version control. This can be tedious for teams that are already using a slew of modules and tools.
In addition, open source projects are constantly evolving from user suggestions and patches, which shifts more of the maintenance overhead onto you.
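One small piece of that maintenance burden is knowing whether the package versions on a machine still match the ones your project was built against. Here is a minimal sketch using only the Python standard library (Python 3.8 or later). The function names and the example pins are hypothetical. In practice many teams rely on a requirements file or a lock file instead.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Compare pinned versions against what is installed.

    `pins` maps package name -> expected version string. Returns a dict of
    package name -> (expected, actual) for every mismatch.
    """
    drift = {}
    for package, expected in pins.items():
        actual = lookup(package)
        if actual != expected:
            drift[package] = (expected, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical pins, for illustration only.
    pins = {"scikit-learn": "1.4.2", "numpy": "1.26.4"}
    for name, (expected, actual) in find_drift(pins).items():
        print(f"{name}: pinned {expected}, installed {actual}")
```

The lookup parameter is there so the comparison logic can be tested without depending on what happens to be installed.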
Standardized modules and functionality
MATLAB and SAS offer standardized modules that are clearly defined and easy to implement.
You won’t need to find, install, and glue together a Frankenstein of different packages just to complete your project.
Instead, you can expect up-to-date algorithms, data preprocessing methods, and model deployment options right out of the box.
For enterprise-level applications with many moving pieces and a large codebase, having a commercial partner can take some of the burden off your shoulders.
Can You Have Both?
So, should you go with open source or commercial software for your project?
First of all, it depends on the size of your project and your team’s needs. We’ll get to our recommendation in a moment, but first, let’s discuss a third option.
Open source software with commercial support combines the customizability and community of open source with the dedicated support of a commercial partner.
These hybrid options are appropriate for teams that want the flexibility of open source packages but also need a support safety net for mission-critical applications.
RStudio and ActiveState are great examples for R and Python, respectively.
RStudio is the company behind the RStudio IDE, but they also offer support, services, and tools for securing and scaling R projects. For example, their commercial license allows you to run multiple versions of R side-by-side and run multiple analyses in parallel.
ActiveState is the company behind ActivePython, a commercial-grade Python distribution. ActiveState offers dedicated support, security and licensing vetting, and version control.
ActivePython also has pre-compiled packages for machine learning, such as Theano, TensorFlow, and Keras. It can be a pain to install these packages on your own (e.g. updating your C++ compiler, checking all dependencies, etc.).
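Whichever distribution you use, a quick sanity check is to confirm that the packages you need can actually be found by your interpreter. This is a minimal sketch using the standard library. The missing_modules name is hypothetical, and the check only covers top-level module names. It tells you whether a module can be located, not whether it will import without errors.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level modules from `module_names` that cannot be found."""
    # find_spec returns None when no installed module matches the name.
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["theano", "tensorflow", "keras"]
    missing = missing_modules(wanted)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All requested modules were found.")
```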
A free community version of ActivePython is also available for download.
Our Recommendation
Obviously, the software you choose will depend on your team’s needs. From a pure functionality standpoint, you can find most common ML tasks in any of the options listed in this guide.
For smaller projects, especially one-person or simple projects, open source software is perfect. For example, it’s not the end of the world if a side project goes offline for a couple of hours.
For mission-critical applications – analytics teams, product teams, startups, etc. – you might sleep easier with an additional safety net of dedicated support, versioning, and legal compliance.
In general, it makes sense for most people to start with open source software thanks to its large, active communities. For ML, we recommend Python because it’s also a general-purpose programming language that can be more easily integrated with data pipelines and end-user applications.
Then, as your project starts to scale to more data or users, you have the option to acquire a commercial partner. You’ll be able to offload some of your support and maintenance needs. This hybrid approach will allow you to tap into a large open source community while also having access to dedicated one-to-one support.