Photo by Pietro Jeng on Unsplash

Summary of the article “A few useful things to know about machine learning” by Pedro Domingos

Filipe Good
5 min read · Mar 6, 2021


This post aims to summarize the most important aspects of Pedro Domingos’s article: “A few useful things to know about machine learning”.

The goal of the original article is to “communicate the folk knowledge” about machine learning that usually does not appear in scientific papers. Pedro has a lot of experience in ML, so these tips, drawn from that experience, are quite relevant! You can find the article here: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf.

Also, if you like the work my compatriot does (I’m also Portuguese :)), I recommend his book about machine learning, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. The book explains the different types of machine learning algorithms and is an excellent starting point for machine learning.

I divided this post into 9 main tips:

  1. Learning = Representation + Evaluation + Optimization
  2. It’s generalization that counts — generalize beyond the training data
  3. Overfitting has many faces
  4. Intuition fails in high dimension — curse of dimensionality
  5. Theoretical guarantees are not what they seem
  6. Feature Engineering is the key
  7. More data beats a cleverer algorithm
  8. Learn many models, not just one
  9. Simplicity does not imply accuracy

1. Learning = Representation + Evaluation + Optimization

It’s often hard to choose the best ML algorithm, and it’s tempting to focus only on that choice while leaving other vital tasks behind. However, learning is made up of these 3 important aspects ⇒ Representation, Evaluation and Optimization. So, in order to succeed, we need to carefully select how the model is represented (its hypothesis space), how we evaluate candidate models, and how we search for the best model/parameters.

As Pedro writes, “some choices in a machine learning project may be even more important than the choice of the learner”.
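
To make these three components concrete, here is a minimal sketch (assuming scikit-learn and its built-in breast-cancer toy dataset) where the model family is the representation, the accuracy metric is the evaluation, and a grid search is the optimization:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Representation: the family of models we search over (here, linear).
model = LogisticRegression(max_iter=5000)

# Evaluation: the score used to compare candidate models (accuracy).
# Optimization: the search over candidate parameters (a grid search).
search = GridSearchCV(model, param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```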

2. It’s generalization that counts

The main goal is to generalize beyond the training data. This means that the learner must perform well on new, unseen data.

Using the same set of data to train and test the model brings the illusion of success. In this case, Pedro recommends that we “set some of the data aside from the beginning, and only use it to test your chosen classifier”. He also notes that we can use cross-validation if we don’t have a lot of data.
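
A minimal sketch of both ideas, assuming scikit-learn and its built-in iris toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Set some data aside from the beginning; touch it only at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# If data is scarce, estimate generalization with cross-validation
# on the training portion instead of carving off a second split.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
print("CV accuracy:", cv_scores.mean())

clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```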

3. Overfitting has many faces

Everyone knows about overfitting: great results on the training data but poor results on new, unseen data. Even so, Pedro warns that overfitting can come in “many forms that are not immediately obvious”.

To better understand overfitting, Pedro recommends decomposing generalization error into bias and variance. As he explains, “Bias is a learner’s tendency to consistently learn the same wrong thing and variance is the tendency to learn random things irrespective of the real signal”.

Knowing where a model sits on this spectrum helps when choosing an algorithm. For example, a linear learner has high bias, while decision trees have low bias but high variance.
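
A rough way to see this in practice, assuming scikit-learn: compare training and cross-validation scores of a high-bias linear model and a high-variance decision tree on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

for name, model in [("linear", LogisticRegression(max_iter=5000)),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    # A large gap between train and test scores suggests variance;
    # low scores on both suggest bias.
    print(name,
          "train:", scores["train_score"].mean().round(3),
          "test:", scores["test_score"].mean().round(3))
```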

Pedro also suggests using cross-validation and regularization to combat overfitting. He ends the section by warning that “It’s easy to avoid overfitting by falling into the opposite error of underfitting”.
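
As one possible illustration of regularization (assuming scikit-learn), here is ridge regression with the penalty strength chosen by cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=50, noise=10.0,
                       random_state=0)

# Heavier penalties shrink coefficients and reduce variance,
# at the cost of some bias; CV picks the trade-off.
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge.fit(X, y)
print("Chosen alpha:", ridge.alpha_)
```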

4. Intuition fails in high dimensions

Pedro starts this section with: “After overfitting, the biggest problem in ML is the curse of dimensionality”. Basically, ML algorithms can start to work really poorly in higher dimensions — when we use more and more features.

Learning becomes exponentially harder as the dimensionality grows, because the ratio between the number of examples and the number of features becomes extremely low. So, as a suggestion, choose your features wisely and only use the most relevant ones. Don’t feed noise to your learner.
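
A minimal sketch of keeping only the most relevant features, assuming scikit-learn: univariate feature selection on a synthetic dataset with far more features than examples.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 examples but 500 features: a very low example-to-feature ratio.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# Keep only the 10 features most associated with the class.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (100, 500) -> (100, 10)
```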

5. Theoretical guarantees are not what they seem

This one is simple: just because an algorithm has theoretical guarantees of good results does not mean it will work well in practice. As Pedro says: “just because a learner has a theoretical justification and works in practice doesn’t mean the former is the reason for the latter.”

6. Feature Engineering is the key

In this section, Pedro reflects on the importance of feature engineering. He explains that the most important factor in the success of a machine learning project is the features that are used.

The explanation for this is easy: “If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it.”

He also says that the ML process is not “a one-shot process”. It is an iterative process where engineers test new hypotheses with new and different features in order to achieve the best outcome.
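
As a toy illustration (assuming pandas; the column names here are hypothetical), deriving features that may correlate with the target better than the raw measurements do:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_time": pd.to_datetime(["2021-03-01 09:00", "2021-03-06 23:30"]),
    "height_m": [1.70, 1.85],
    "weight_kg": [68.0, 95.0],
})

# Derived features often capture the signal far better
# than the raw measurements they come from.
df["signup_hour"] = df["signup_time"].dt.hour
df["is_weekend"] = df["signup_time"].dt.dayofweek >= 5
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)
```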

7. More data beats a cleverer algorithm

When your model is not accurate enough, you have two main choices: “design a better learning algorithm, or gather more data (more examples, and possibly more raw features)”. Pedro recommends the second path. First, because it’s often quicker to gather more data than to build a new algorithm. Second, because algorithms learn from data, and the more data we have, the more examples the algorithm can learn from.
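
One way to check whether more data would actually help, assuming scikit-learn: a learning curve shows how accuracy changes as the training set grows.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Train on increasing fractions of the data; if the curve is still
# rising at the right edge, more examples are likely to pay off.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0], cv=5)

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n} examples -> CV accuracy {s:.3f}")
```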

8. Learn many models, not just one

In this section, Pedro talks about the benefits of model ensembles. He notes that, in the past, one would “test a lot of models and select the best model”. A better approach is to combine many variations of models: an ensemble pools the strengths of multiple algorithms, so the strengths of one can compensate for the weaknesses of another.
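
A minimal sketch of an ensemble, assuming scikit-learn: a simple voting classifier that combines three different learners.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each learner makes different errors; majority voting lets the
# strengths of one compensate for the weaknesses of another.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
print("Ensemble CV accuracy:",
      cross_val_score(ensemble, X, y, cv=5).mean().round(3))
```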

9. Simplicity does not imply accuracy

This one is subtler than it looks. Occam’s razor is often taken to mean that “given two classifiers with the same training error, the simpler of the two will likely have the lowest test error”, but Pedro argues this claim is actually false. Simpler models are still worth preferring, but because simplicity is a virtue in its own right, not because it implies accuracy. What matters for overfitting is how many hypotheses the learner actually tries: “a learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space.”

Feel free to share in the comments other useful tips you’ve learned from experience :)
