In this guide, we’ll cover how to learn Python for data science, including our favorite curriculum for self-study.
You see, data science is about problem solving, exploration, and extracting valuable information from data.
To do so effectively, you’ll need to wrangle datasets, train machine learning models, visualize results, and much more. Enter Python.
There has never been a better time to learn Python. In fact, Berkeley named it the second most in-demand programming language. Let’s discuss why…
Why Learn Python for Data Science?
Python is one of the most widespread languages in the world, and it has a passionate community of users:
[Figure: Python Popularity, TIOBE Index (https://www.tiobe.com/tiobe-index/)]
It has an even more loyal following within the data science profession.
Some people judge the quality of a programming language by the simplicity of its “hello, world!” program. Python does pretty well by this standard:
    print("hello, world!")
For comparison, here’s the same output in Java:
    public class Main {
        public static void main(String[] args) {
            System.out.println("hello, world!");
        }
    }
Great, case closed. See you back here after you’ve mastered Python?
Well, in all seriousness, simplicity is one of Python’s greatest strengths. Thanks to its precise and efficient syntax, Python can accomplish the same tasks with less code than other languages. This makes implementing solutions refreshingly fast.
In addition, Python’s vibrant data science community means you’ll be able to find plenty of tutorials, code snippets, and fixes to common bugs, along with people to commiserate with. Stack Overflow will be one of your best friends.
Finally, Python has an all-star lineup of libraries (a.k.a. packages) for data analysis and machine learning, which drastically reduce the time it takes to produce results. More on these later.
How to Learn Python Efficiently
Before we go into what you’ll need to learn, let’s discuss what you won’t need.
You won’t need a C.S. degree.
Most data scientists will never deal with topics such as memory leaks, cryptography, or “Big O” notation. You’ll be fine as long as you can write clean, logical code in a scripting language such as Python or R.
You won’t need a complete course on Python.
Python and data science are not synonymous.
[Figure: All Uses for Python]
You won’t need to memorize all the syntax.
Instead, focus on grasping the intuition, such as when a function is appropriate or how conditional statements work. You’ll gradually remember the syntax after Googling, reading documentation, and good ol’ fashioned practice.
We recommend a top-down approach.
The goal is to get results first and then solidify concepts over time. In fact, we prefer to cut out “classroom” study in favor of real-world practice.
- You’ll start by learning core programming concepts.
- Next, you’ll gain working knowledge of essential data science libraries.
- Finally, you’ll practice and refine your skills through actual projects.
This approach will allow you to build mastery over time while having more fun.
Aside: Installing Python through Anaconda
There are many ways to install Python on your computer, but we recommend the Anaconda bundle, which comes with the libraries you’ll need for data science. Check out our Python Quickstart Guide for more information.
Step 1: Core Programming Concepts
Effective programming is not about memorizing syntax, but rather mastering a new way of thinking.
Therefore, take your time in building a solid foundation of core programming concepts. These will help you translate solutions in your head into instructions for a computer.
If you are new to programming…
If you are completely new to programming, we recommend the excellent Automate the Boring Stuff with Python book, which has been released for free online under a Creative Commons license.
The book promises “practical programming for total beginners,” and it keeps each lesson down-to-earth. Read up to Chapter 6 – Manipulating Strings and complete the practice questions along the way.
[Figure: Automate the Boring Stuff by Al Sweigart (https://automatetheboringstuff.com/)]
If you have experience in another language…
If you only need to brush up on Python syntax, then we recommend the following video, aptly named “Learn Python in One Video”:
[Video: https://www.youtube.com/watch?v=N4mEzFDjqtA]
Again, the goal of this step is not to learn everything about Python and programming. Instead, focus on the intuition.
You should be able to answer questions such as:
- What’s the difference between an integer, float, and string?
- How can I use Python as a calculator?
- What is a for loop? When would I write one?
- What is the basic structure of a function?
- How can I use conditional statements (if… else…) to add logic?
- How do import statements work?
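To make these questions concrete, here is a minimal sketch that touches each one in a few lines. The variable names and the `describe` function are made up for illustration; they are not taken from any of the resources above.

```python
import math  # an import statement makes another module's functions available

# Integers, floats, and strings are three basic data types.
count = 3            # int: a whole number
price = 4.5          # float: a number with a decimal point
label = "apples"     # str: text

# Python works as a calculator with the usual arithmetic operators.
total = count * price            # 13.5
rounded_up = math.ceil(total)    # 14, using a function from the imported module


def describe(quantity, item):
    """Basic function structure: a name, parameters, a body, and a return value."""
    # Conditional statements add logic by choosing between branches.
    if quantity == 0:
        return f"no {item}"
    elif quantity == 1:
        return f"1 {item.rstrip('s')}"
    else:
        return f"{quantity} {item}"


# A for loop repeats the same steps for each element of a sequence.
descriptions = []
for quantity in [0, 1, count]:
    descriptions.append(describe(quantity, label))

print(total, rounded_up)
print(descriptions)
```

If you can read this snippet and predict what it prints before running it, you are in good shape for Step 2.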
Additional resources
If you’d like more practice with the core programming concepts, check out the following resources.
- Edabit is a platform with many short coding challenges that can be completed in 5-minute chunks. The bite-sized nature of it is perfect for getting into the habit of coding every day. You can also filter the challenges from Very Easy to Expert, so there’s a smooth progression curve.
- The Python Challenge is one of the coolest puzzles on the web, so don’t be put off by its 1990s graphics. You can complete all 33 levels with the help of Python scripts. One user called it “an addictive way to learn the ins and outs of Python…” We agree!
- PracticePython.org is a collection of short practice problems in Python. It’s updated often with new problems. What’s really nice is that the author includes multiple user-submitted solutions for each problem so you can see alternative ways of solving them.
- How to Think Like a Computer Scientist is a fantastic interactive online book that takes a whirlwind tour through key programming concepts (with Python). If you’re completely new to programming, this might be a good option. It’s like a condensed “C.S. 101” course.
Step 2: Essential Data Science Libraries
Next, we’re going to focus on the “for data science” part of “how to learn Python for data science.”
As we mentioned earlier, Python has an all-star lineup of libraries for data science. Libraries are simply bundles of pre-existing functions and objects that you can import into your script to save time.
These are the action steps we recommend for efficiently picking up a new library:
- Open up a new Jupyter Notebook (see below).
- Read the library’s documentation for 30 minutes for a high-level introduction to its modules.
- Import the library into your Jupyter Notebook.
- Follow its step-by-step quickstart tutorial to see the library in action.
- Review its documentation for another 30 minutes to learn what else it’s capable of.
We don’t recommend diving much deeper into a library right now because you’ll likely forget most of what you’ve learned by the time you jump into projects. Instead, aim to discover what each library is capable of.
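As a rough illustration of this "discover what it can do" habit, the sketch below pokes around a module from inside a notebook cell. It uses the standard library's `statistics` module purely as a stand-in so that it runs without installing anything; the same pattern should apply to any library you import.

```python
import statistics  # stand-in for whichever library you are exploring

# List the public names the module exposes (everything not starting with "_").
public_names = [name for name in dir(statistics) if not name.startswith("_")]
print(len(public_names), "public names, for example:", public_names[:5])

# Read the first line of a function's docstring for a quick summary.
first_line = statistics.median.__doc__.strip().splitlines()[0]
print("median:", first_line)

# Try the function on a tiny input to see it in action.
print(statistics.median([3, 1, 2]))
```

In a notebook, the built-in `help()` function shows the full docstring when a one-line summary is not enough.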
If you installed Python through the Anaconda bundle as we recommended above, Jupyter Notebook is already installed. It’s a lightweight, browser-based coding environment that’s a favorite among data scientists, and we recommend it for your projects. You can open a new notebook through Anaconda Navigator, which also comes with Anaconda. Check out this short video for instructions.
These are the essential libraries you’ll need:
NumPy
NumPy allows easy and efficient numeric computation, and many other data science libraries are built on top of it.
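Here is a small taste of what that looks like, as a sketch only. The height values are made up, and the snippet assumes NumPy is installed (it is included with Anaconda).

```python
import numpy as np

# Build an array from a plain Python list.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])

# Arithmetic applies to every element at once, with no explicit loop.
heights_m = heights_cm / 100

# Summary statistics are available as array methods.
mean_height = heights_m.mean()

# Boolean masks select the elements that satisfy a condition.
tall = heights_cm[heights_cm > 165]

print(heights_m)
print(mean_height)
print(tall)
```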
Pandas
Pandas is a high-performance library for data structures and exploratory analysis. It’s built on top of NumPy.
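As a quick sketch of the kind of exploratory work Pandas is used for, the snippet below builds a tiny table and summarizes it. The city and sales figures are invented for illustration.

```python
import pandas as pd

# A DataFrame is a table with labeled columns (made-up values for illustration).
df = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Boston", "Boston"],
        "year": [2020, 2021, 2020, 2021],
        "sales": [100, 120, 90, 110],
    }
)

# Quick exploratory checks: first rows and summary statistics.
print(df.head())
print(df["sales"].describe())

# Filter rows with a boolean condition.
recent = df[df["year"] == 2021]

# Group and aggregate: total sales per city.
sales_by_city = df.groupby("city")["sales"].sum()
print(sales_by_city)
```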
Matplotlib
Matplotlib is a flexible plotting and visualization library. It’s powerful but somewhat cumbersome. You have the option of skipping Matplotlib for now and using Seaborn to get started (see our Seaborn recommendation below).
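For a feel of the basic workflow, here is a minimal line chart, offered as a sketch. The data points are made up, and the `Agg` backend and in-memory buffer are only there so the script runs without a display; in a Jupyter Notebook the chart would normally render inline.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Made-up data points for illustration.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Create a figure with one set of axes, then draw and label the chart.
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first Matplotlib line chart")
ax.legend()

# Render the figure as a PNG into an in-memory buffer instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```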
Scikit-Learn
Scikit-Learn is the premier general-purpose machine learning library in Python. It has many popular algorithms and modules for pre-processing, cross-validation, and much more.
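To show how pre-processing, cross-validation, and a model fit together, here is one possible sketch. The choice of the bundled iris dataset, logistic regression, and the split sizes are our assumptions for illustration, not a recommended recipe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with Scikit-Learn.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the final score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain pre-processing and a model so both are fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean cross-validation accuracy:", cv_scores.mean())

# Fit on the full training set and evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in a pipeline keeps the scaling step from seeing the validation folds, which is one common way to avoid leaking information during cross-validation.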
Bonus: Seaborn
Seaborn makes it much easier to plot common data visualizations. It’s built on top of Matplotlib and offers a more pleasant high-level wrapper.
Step 3: End-to-End Projects
By now, you’ll have a basic understanding of programming and a working knowledge of essential libraries. This actually covers most of the Python you’ll need to get started with data science.
At this point, some students will feel a bit overwhelmed. That’s OK, and it’s perfectly normal.
If you had taken the slow and traditional bottom-up approach, you might feel less overwhelmed, but it could easily have taken you 10 times as long to get here.
Now the key is to dive in immediately and start gluing everything together. Again, our goal up to here has been to just learn enough to get started.
Next, it’s time to solidify your knowledge through plenty of practice and projects.
You have several options.
Kaggle Competitions
The first option is to participate on Kaggle, a site that hosts data science competitions.
The main advantage of Kaggle is that every project is self-contained. You’re given the dataset, a goal, and tutorials to get you started.
The major disadvantage of competitions is that they’re usually not representative of real-world data science. The “Getting Started” competitions are way too basic while the standard competitions (i.e. those with prize pools) are usually too tough for beginners.
If you’re interested in this path, check out our Beginner’s Guide to Kaggle.
DIY Projects
The other option is to structure your own projects and pick datasets that interest you.
The main advantage of this approach is that the projects are more representative of real-world data science. You’ll likely need to define your own goals, collect data, clean your dataset, engineer features, and so on.
The disadvantage of DIY projects is that you’ll need to already be familiar with a proper data science workflow. Without one, you could miss important steps or get stuck without knowing how to proceed. If you go with this path, check out our article with several DIY project ideas.
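To give a sense of what the cleaning and feature engineering steps of a DIY project can look like, here is a small sketch using Pandas. The housing columns, the values, and the reference year are all hypothetical; a real project would load data from a file or an API and choose features based on its own goal.

```python
import pandas as pd

# Hypothetical raw data; in a real project this would come from a file or an API.
raw = pd.DataFrame(
    {
        "price": [250000, 310000, None, 310000],
        "sqft": [1200, 1550, 900, 1550],
        "year_built": [1995, 2008, 1980, 2008],
    }
)

# Clean: remove exact duplicate rows, then rows missing the target value.
clean = raw.drop_duplicates().dropna(subset=["price"]).reset_index(drop=True)

# Engineer features: derive new columns that may carry more signal.
clean["price_per_sqft"] = clean["price"] / clean["sqft"]
clean["age"] = 2024 - clean["year_built"]  # reference year is an arbitrary assumption

print(clean)
```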