Welcome to the Data Science Primer by EliteDataScience! This mini-course will provide a gentle intro to data science and applied machine learning. If you're a developer, analyst, manager, or aspiring data scientist looking to get into the field, then you're in the right place.
Let's get started!
Chapter 1: Bird's Eye View
First, let’s start with the “80/20” of data science. Generally speaking, we can break down applied machine learning into the following chunks:
This intro to data science will cover exploratory analysis, data cleaning, feature engineering, algorithm selection, and model training. As you can see, those chunks make up 80% of the pie. They also set the foundation for more advanced techniques.
In this first chapter, you’ll see how these moving pieces fit together. To make the most of this primer, we suggest the following two tips:
Tip #1 - Don’t sweat the details (for now).
We’ve seen students master this subject 2X faster by first understanding how all the pieces fit together… and then diving deeper. Our trainings all follow this “top-down” approach.
Tip #2 - Don’t worry about coding (yet).
Again, it’s easy to get lost in the weeds at the beginning… so our goal is to see the forest instead of the trees. We won't get into the code during this intro to data science, but once you have the conceptual foundation, the code will come easier.
Chapter 2: Exploratory Analysis
There’s a big challenge in data science called “Tactical Hell.” This is actually a term from startups, and it’s when you have too many tactics to choose from:
Should you develop your product more? Invest in marketing? Hire an accountant? Etc.
In many ways, training an ML model is like growing a startup. You also have too many tactics to choose from:
Should you clean your data more? Engineer features? Test new algorithms? Etc.
There’s a lot of trial and error, so how do you avoid chasing dead ends? The answer is “Exploratory Analysis.” (Which is just fancy-talk for “getting to know” your data.)
Doing this upfront helps you save time and avoid wild goose chases… As a data scientist, you are a commander with limited resources (i.e. time). Exploratory analysis is like sending scouts to learn where to deploy your forces!
Chapter 3: Data Cleaning
Proper data cleaning is the “secret” sauce behind machine learning… Well, it’s not really a “secret”… It’s just a bit boring, so no one really talks about it. But the truth is:
Better data beats fancier algorithms…
(Even if you forget everything else from this primer, please remember this point.)
Garbage in = Garbage out... plain and simple! If you have a clean dataset, even simple algorithms can learn impressive insights from it!
Now, as you might imagine, different problems will require different methods… For now though, let’s at least ensure we know how to fix the most common issues. This chapter will give you a reliable starting point, regardless of your dataset.
Chapter 4: Feature Engineering
No intro to data science would be complete without emphasizing the importance of feature engineering. In a nutshell, “feature engineering” is creating new model input features from your existing ones.
That doesn’t sound like much… Yet Andrew Ng, former head of Baidu AI and Google Brain, said:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Wow! No pressure, right?
So why is it so difficult and time-consuming?
To start, feature engineering is very open-ended. There are literally infinite options for new features to create. Plus, you’ll need domain knowledge to add informative features instead of more noise.
This is a skill that you’ll develop with time and practice, but heuristics will give you a head start. Heuristics help you know where to start looking, spark ideas, and get unstuck.
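For readers who want a concrete picture of "creating new model input features from your existing ones," here is a minimal Python sketch. The housing records, the column names, and the `add_features` helper are all made up for illustration; they are assumptions, not part of this primer's material.

```python
# Hypothetical housing records; the column names are illustrative assumptions.
homes = [
    {"year_built": 1990, "year_sold": 2015, "sqft": 1500, "bedrooms": 3},
    {"year_built": 2005, "year_sold": 2010, "sqft": 2400, "bedrooms": 4},
]

def add_features(row):
    """Return a copy of the row with new features derived from existing ones."""
    new_row = dict(row)
    # Age of the property at the time of sale.
    new_row["age_at_sale"] = row["year_sold"] - row["year_built"]
    # Average living area per bedroom.
    new_row["sqft_per_bedroom"] = row["sqft"] / row["bedrooms"]
    return new_row

engineered = [add_features(row) for row in homes]
```

Neither new column adds outside information. Each one restates existing columns in a form a model may find easier to use, and domain knowledge is what suggests which restatements are likely to be informative.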
Chapter 5: Algorithm Selection
Next, we'll introduce five very effective ML algorithms for regression. They each have classification counterparts as well.
Just five?
Yes. Instead of giving you a long list of algorithms...
...our goal is to explain a few essential concepts (e.g. regularization, ensembling, automatic feature selection) that will teach you why some algorithms tend to perform better than others.
In applied machine learning, individual algorithms should be swapped in and out depending on which performs best for the problem and the dataset. Therefore, we will focus on intuition and practical benefits over math and theory.
We have two main goals:
1. To explain powerful mechanisms in modern ML.
2. To introduce several algorithms that use those mechanisms.
So if you're ready, then we’re ready. Let’s go!
Chapter 6: Model Training
At last… it’s time to build our models! We'll bring together everything we've covered so far in this intro to data science.
It might seem like it took a while to get here, but data scientists actually do spend most of their time on the earlier steps:
1. Exploring the data.
2. Cleaning the data.
3. Engineering new features.
Again, that’s because better data beats fancier algorithms.
Now you'll learn how to maximize model performance while safeguarding against overfitting. Plus, you'll learn how to automatically find the best hyperparameters for each algorithm.
We'll get an overview of splitting your dataset, deciding on hyperparameters, setting up cross-validation, fitting and tuning models, and finally… selecting a winner!
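The workflow above can be sketched in a few lines of plain Python. This is only a rough illustration under stated assumptions: the data is synthetic, the model is a one-weight linear fit with an L2 penalty (a simple stand-in for regularized regression), and the names `fit`, `mse`, and `cross_val_score` are hypothetical helpers written for this example, not functions from any library.

```python
import random

def fit(xs, ys, alpha):
    """Fit y ~ w * x with an L2 penalty of strength alpha (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_score(xs, ys, alpha, k=5):
    """Average validation MSE across k folds of the training data."""
    scores = []
    for fold in range(k):
        # Every k-th point (offset by fold) is held out for validation.
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit([x for x, _ in train], [y for _, y in train], alpha)
        scores.append(mse(w, [x for x, _ in valid], [y for _, y in valid]))
    return sum(scores) / k

# Synthetic data: y = 2x plus noise (made up for illustration).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]

# 1. Split the dataset: hold out a test set that is never used for tuning.
x_train, y_train = xs[:80], ys[:80]
x_test, y_test = xs[80:], ys[80:]

# 2. Decide on candidate hyperparameter values.
alphas = [0.0, 0.1, 1.0, 10.0, 100.0]

# 3. Cross-validate each candidate on the training set only.
cv_scores = {alpha: cross_val_score(x_train, y_train, alpha) for alpha in alphas}

# 4. Select the winner, refit on all training data, and check the test set.
best_alpha = min(cv_scores, key=cv_scores.get)
best_w = fit(x_train, y_train, best_alpha)
test_error = mse(best_w, x_test, y_test)
```

The key design choice is that the test set is touched only once, at the very end. All tuning decisions are made with cross-validation on the training data, which is what safeguards the final error estimate against overfitting.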