How to Learn Python for Data Science, The Self-Starter Way

Do you want to learn Python for data science, but don’t want to take a slow, expensive course? Most courses are just rehashed versions of the excellent free content out there. Here are resources for self-starters to acquire this valuable skill at their own pace!

How to Learn Python for Data Science

At its heart, data science is about problem solving, exploration, and extracting valuable information from data. To do so effectively, you'll need to be able to wrangle datasets, implement statistical models, write programs, and much more.

Therefore, developing sharp programming skills is critical to your success. It's like learning how to ride a bike in a crowded city. Not only will you reach your destinations faster, but you'll also have the freedom to visit areas you could never reach on foot.

Plus, your chosen programming tool will become your trusty sidekick in this journey. For most aspiring data scientists, we strongly recommend starting with Python. Once you're fluent in Python, you should learn R.

Python is one of the most widespread languages in the world, and it has a passionate community of users:

Python popularity in 2016, TIOBE Index

Within the data science community, Python is even more popular. Here's why...

Why Learn Python for Data Science?

Some people judge the quality of a programming language by the simplicity of its "hello, world!" program. Python does pretty well by this standard:
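A minimal sketch of such a program, which should run unchanged in both Python 2.7 and Python 3:

```python
# The entire program is a single statement.
print("hello, world!")
```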

For comparison, producing the same output in Java typically takes a class declaration, a main method, and a call to System.out.println.

Great, case closed! See you back here after you've mastered Python, sound good?

...

Okay, okay... but in all seriousness... simplicity is definitely one of Python's biggest strengths. Thanks to its precise and efficient syntax, Python can often accomplish the same tasks with much less code than other languages. This makes implementing solutions refreshingly fast.

In addition, Python's vibrant data science community means you'll be able to find plenty of tutorials, code snippets, and fixes to common bugs, along with people to commiserate with. Stack Overflow will be one of your best friends.

Finally, Python has an all-star lineup of libraries (a.k.a. packages) for numeric and scientific computing, all of which will make your life much easier. More on this later.

The Self-Starter Way

We believe in a hyper-practical, action-centric approach to learning Python for data science as quickly as possible, but you must be a self-starter to succeed with this strategy.

The reason is that we're going to completely cut out "classroom" study. You'll learn just enough of the fundamentals to jump into real-world problems, and then gradually build mastery over time by "just doing shit." (not the formal term)

You'll also have a ton of fun using this method because it's the fastest way to gain the essential programming skills required to start doing data science.

However, you must first build a rock-solid foundation of core programming concepts. This is the one place where you cannot take any shortcuts because you'll need to know how to translate solutions in your head into instructions for a computer. Effective programming is not about memorizing syntax, but rather mastering a new way of thinking.
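As a small illustration of that translation step, here is a sketch that turns a plain-English plan ("keep the positive numbers, then average them") into Python. The task and the function name average_of_positives are made up for this example.

```python
def average_of_positives(numbers):
    """Return the average of the positive values, or None if there are none."""
    # Plan, step 1: keep only the values greater than zero.
    positives = [n for n in numbers if n > 0]
    # Plan, step 2: with nothing to average, say so instead of dividing by zero.
    if not positives:
        return None
    # Plan, step 3: add them up and divide by how many there are.
    return sum(positives) / len(positives)


print(average_of_positives([4.0, -2.0, 10.0, 0.0, 1.0]))
```

Notice that each comment is one step of the plan you would have worked out in your head. The syntax is the easy part; deciding what the steps are (and what to do when the list has no positive numbers) is the actual skill.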

We recommend learning Python for data science through the following 3 reliable steps:

  • Step 1: Core Programming Concepts - Learn how to solve problems using code.
  • Step 2: Drills and Challenges - Practice to master the core skills.
  • Step 3: Essential Data Science Libraries - Equip the tools needed for data science.

After completing these 3 steps, you'll be ready to dive into projects and analyses while continuing to learn as you go.

Aside: Installing Python through Anaconda

There are many ways to install Python on your computer, but we recommend installing it through the Anaconda bundle, which includes many of the libraries you'll need for data science. Here's a quick tutorial on installing Python using Anaconda.

Python 2.7 or 3? Use Python 3, plain and simple. Python 2.7 reached its end of life in January 2020, and the core data science libraries, including NumPy and pandas, no longer support it.
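Once the installation finishes, a short script like the one below can confirm which interpreter you're running and whether the core libraries can be imported. Treat it as a sketch: the helper name is_installed and the LIBRARIES list are our own choices for illustration, not part of Anaconda.

```python
import importlib
import sys

# Import names of the core libraries discussed in this guide.
LIBRARIES = ["numpy", "pandas", "scipy", "matplotlib", "IPython"]


def is_installed(name):
    """Return True if the named package can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Report the version of the interpreter running this script.
print("Python %d.%d.%d" % tuple(sys.version_info[:3]))

# Report the status of each library.
for name in LIBRARIES:
    print("%-12s %s" % (name, "installed" if is_installed(name) else "missing"))
```

If any library shows up as missing, it is worth checking that you are running the Python that Anaconda installed rather than another copy on your system.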

Step 1: Core Programming Concepts

The amount of time you spend at this step depends on how much previous programming experience you have and whether you can work on this full-time or part-time, but it typically ranges from 1 week to 6 weeks.

If you are completely new to programming, be prepared to spend at least 1 month on this step. You'll want the time to absorb these rich concepts. They form the base needed to learn Python for data science quickly.

Among all the courses, tutorials, and guides out there, we've found the following two resources to be the best for self-starters. They are both self-paced, hands-on, and comprehensive (and free).

You're new to programming?

How to Think Like a Computer Scientist is a fantastic interactive online book that takes a whirlwind tour through key programming concepts (with Python). If you're new to programming, we suggest starting here, as it's like a condensed "Computer Science 101" course.

You've programmed before?

Learn Python the Hard Way is an excellent online book for people with some previous exposure to programming concepts. The "hard way" simply refers to learning through instructive exercises. Through 52 short exercises, you'll start with setting up Python and incrementally work your way up to writing multi-file programs.

Step 2: Drills and Challenges

If you want to learn Python for data science well, then don't skip this step.

After you grasp the core programming concepts, spend a week or two solidifying them by completing drills and challenges.

If you try to jump into a real project right away, you'll be overwhelmed by the number of moving parts. It's easy for our brains to trick us into believing we know something after reading about it in a book, but it takes concentrated practice to really learn the skills.

Think about it this way. Professional basketball players cannot just play games all the time if they want to improve. They must also spend hours every day practicing specific shots from different parts of the court.

When you take your newfound programming skills and hone them through short, targeted drills and challenges, you'll improve much faster than you would by jumping into projects immediately.

Here's what we recommend:

Get into fighting shape...

Code Fights is a platform with many short coding challenges that can be completed in 5-minute chunks (although it's so fun that you might find yourself playing through it for hours at a time). You'll gain points along the way and unlock new levels, making it a nice way to track your progression as well.

Solve a mystery...

The Python Challenge is one of the coolest puzzles on the web, so don't be put off by its 1990s graphics. You can complete all 33 levels with the help of Python scripts. One user called it "an addictive way to learn the ins and outs of Python..." We agree!

Consider alternative solutions...

PracticePython.org is a collection of short practice problems in Python. It's updated almost every week with a new problem. What's really nice is that the author includes multiple user-submitted solutions for each problem so you can see alternative ways of solving them.
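To see why comparing solutions is so useful, here is a sketch of one typical drill (checking whether a word is a palindrome) solved two different ways. The problem choice and both function names are ours, picked only for illustration.

```python
def is_palindrome_loop(text):
    """Compare characters from both ends, moving toward the middle."""
    left, right = 0, len(text) - 1
    while left < right:
        if text[left] != text[right]:
            return False
        left += 1
        right -= 1
    return True


def is_palindrome_slice(text):
    """Compare the string with its reverse, built by slicing."""
    return text == text[::-1]


for word in ["racecar", "python"]:
    print("%s: %s %s" % (word, is_palindrome_loop(word), is_palindrome_slice(word)))
```

Both functions give the same answers. The first spells out the logic step by step, while the second leans on Python's slicing syntax. Reading other people's solutions is how you pick up idioms like the second one.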

Step 3: Essential Data Science Libraries

Now you're almost ready to dive into real data science projects!

First, we built a strong foundation of core concepts. Then, we practiced pure Python through drills and challenges. Now, we're going to focus on the "for data science" part of "how to learn Python for data science."

As we mentioned earlier, Python has an all-star lineup of libraries that are essential for data science. To begin, we recommend acquiring a working knowledge of NumPy, pandas, SciPy, and matplotlib, while using them in the IPython notebook environment. This is the core stack of tools you'll need for data analysis.

Other important libraries, such as scikit-learn (machine learning) or beautifulsoup4 (web scraping), can be picked up when you need to learn their specific use cases later.

The Big 5 Essential Libraries

  • NumPy - NumPy is the grand-daddy of all data science libraries. It allows easy and efficient numeric computation, and many machine learning libraries are built on top of it.
  • Pandas - Pandas is a high-performance library for data structures and exploratory analysis.
  • Matplotlib - Flexible plotting and visualization library.
  • IPython - Interactive shell for Python that makes it much easier to explore data and debug errors. Makes it much more enjoyable to learn Python for data science.
  • SciPy - Extends NumPy with more functionality, such as calculating integrals, linear algebra, and statistics.
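To give a first taste of how two of these libraries work together, here is a minimal sketch, assuming NumPy and pandas are installed (both ship with Anaconda). The cities and temperatures are made-up sample data, not real measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three daily temperatures (Celsius) for two made-up cities.
cities = ["Avon", "Avon", "Avon", "Brill", "Brill", "Brill"]
celsius = np.array([18.0, 21.0, 24.0, 9.0, 12.0, 15.0])

# NumPy: arithmetic applies to the whole array at once, with no explicit loop.
fahrenheit = celsius * 9 / 5 + 32

# pandas: put the columns in a labeled table (a DataFrame).
temps = pd.DataFrame({"city": cities, "temp_c": celsius, "temp_f": fahrenheit})

# pandas: group the rows by city and average each temperature column.
summary = temps.groupby("city")[["temp_c", "temp_f"]].mean()
print(summary)
```

The pattern here (compute with NumPy arrays, organize and summarize with pandas) is one you'll repeat constantly in real analyses.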

