
ML Failure Part 1: Underfitting and Overfitting

News: Lawsuits and Regulation Coming for ML Firms

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.

If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

My birthday was this week, huzzah! If you want to help me celebrate you could:

  • Forward this newsletter to someone who might enjoy it.

  • Check out our course catalog and take one of our classes.

  • Give me the gift of feedback: we’ve only done 3 editions of the revised newsletter and I have no idea how it’s going. Reply to this email!

Today’s Lesson: How ML Fails Part 1

All the code from today’s lesson can be viewed on GitHub and Google Colab.

Over the next few editions we’re highlighting the strengths and weaknesses of machine learning systems. ML is certainly having a heyday, but nearly 80% of ML endeavors ultimately fail.

ML news has a substantial survivorship bias: we only hear about the models that actually get deployed. But the path to success is littered with failures. Today we’re looking at some of the most common ways ML projects fail.

The Happy Path to Success

Machine learning is most effective for problems that share these three features:

  • There is a large volume of high-quality data about the problem.

  • The problem has relevant statistical patterns that could plausibly be explained by mathematics.

  • The problem is not highly subjective.

Big data: Machine learning models succeed by finding and replicating patterns in their training data. In order to find complex patterns you must have complex data. When the data is complex you need a lot of it to have a “representative sample.”

Statistical patterns: In ML the pattern finding process is almost universally done with mathematical models. If a phenomenon cannot be explained with math then these models won’t work. There must be discernible statistical patterns for the machine to imitate, exploit, or discover.

Subjectivity: The more subjective a problem is, the more likely human biases are incorporated into the data used to train the model. ML can handle subjectivity in some circumstances (for example, social media feeds, advertising, and movie/TV recommendation systems), but some of the most harmful failures of ML have come in highly subjective domains such as bail setting, predictive policing, and hiring.

Games have long been at the forefront of ML research because they have these properties. Near-infinite amounts of data can be generated by encoding the rules and simulating play. There are often clear statistical patterns related to winning and losing. And, at least in the games ML researchers choose, there’s generally nothing subjective about the rules of play.

In fact, the phrase “machine learning” was coined by Arthur Samuel in the 1950s during the development of a checkers-playing program. That program recorded games and used the wins and losses to decide which moves were good or bad. Samuel’s system used both ML and search algorithms, a combination that state-of-the-art systems like AlphaGo still employ today.

Other areas where ML has been highly successful — like financial fraud detection and spam filtering — also have lots of good data, relevant statistical patterns, and relatively low levels of subjectivity.

To a certain extent these guidelines can be broken. But with less data, fewer statistical trends, and more subjective problems you invite failure.

When Things Go Wrong

In this edition we’re looking at two broad categories of failure called “underfitting” and “overfitting.” Next week we’ll look at some other more nuanced failure types and some subtle causes of overfitting and underfitting.

Underfitting

Underfitting is the most straightforward way that things can go wrong in ML. It happens when your model simply doesn’t find the patterns in the training data needed to map inputs to labels. This often happens because the model you’re using is too simplistic to capture the patterns in the data. Last week we gave such an example: using linear regression to fit parabolic data.

Linear Regression will always underfit parabolic trends.
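If you want to poke at this yourself, here’s a minimal sketch of the idea in scikit-learn (the toy data and model choices are mine, not the notebook code linked above): a plain line can’t bend to follow a parabola, but handing the same model a squared feature fixes the underfit.

```python
# Sketch: a straight line underfits a parabolic trend; a quadratic fit does not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=5, size=200)  # a parabola plus noise

# A straight line has only a slope and an intercept: it cannot bend.
line = LinearRegression().fit(x, y)
print("linear R^2:", round(r2_score(y, line.predict(x)), 3))  # near 0: underfit

# Giving the same model an x^2 feature lets it capture the curve.
x_quad = PolynomialFeatures(degree=2).fit_transform(x)
curve = LinearRegression().fit(x_quad, y)
print("quadratic R^2:", round(r2_score(y, curve.predict(x_quad)), 3))  # near 1
```

The near-zero R² for the straight line is the signature of underfitting: the model is too simple for the pattern it’s being asked to learn.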

Underfitting can also stem from lack of data. When there’s not enough data it might be impossible to discern meaningful patterns. Imagine randomly picking 5 dots from a parabolic distribution… Can you still tell it’s a parabola? Neither can an ML model.

With a small number of data points it’s impossible to discern meaningful patterns.

This problem is exacerbated in many machine learning contexts because of something called “the curse of dimensionality.” Essentially, this “curse” means the more features your input data has, the more data you need to find patterns across those features. Consider what the same 40 data points look like in 1D, 2D, and 3D:

As you add features the data becomes increasingly sparse.

In one dimension the data is tightly packed, in two dimensions it’s a bit more spaced out, and in three dimensions the chart is mostly unoccupied space. Said another way: in one dimension most values of x were represented; in two dimensions many (x,y) combinations were not represented; in three dimensions a tiny fraction of (x,y,z) combinations were represented.

When the data is not dense across all the dimensions, the dataset is not representative of all the relevant combinations of values in those dimensions. As a result, whatever patterns ML models find are not likely to be representative either. Ironically, small datasets also put models at greater risk of overfitting (more on that in a moment).
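To make the curse of dimensionality a little more concrete, here’s a rough sketch (my own illustration, with an assumed grid of 10 bins per axis) that counts how many grid cells the same 40 random points manage to occupy as dimensions are added:

```python
# Sketch: the same 40 points occupy a shrinking fraction of space as dimensions grow.
import numpy as np

rng = np.random.default_rng(42)
points = rng.uniform(0, 1, size=(40, 3))  # 40 points, 3 features each
bins_per_axis = 10  # assumed grid resolution for this illustration

for dims in (1, 2, 3):
    # Snap each point to a grid cell, then count how many distinct cells are hit.
    cells = np.floor(points[:, :dims] * bins_per_axis).astype(int)
    occupied = len({tuple(cell) for cell in cells})
    total = bins_per_axis ** dims
    print(f"{dims}D: {occupied}/{total} cells occupied ({occupied / total:.1%})")
```

In one dimension the points cover nearly every bin; in three dimensions they touch only a few percent of the cells. That emptiness is exactly the sparsity shown in the charts above.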

Finally, underfitting can also happen when a model has good data but doesn’t train enough. Many models train iteratively. If you only train for a few rounds, you might miss critical patterns. Consider this series of charts showing a neural network’s training progress across 400 rounds (or “epochs”).

At first the model can only capture the general trend in the data, meaning it’s underfit. By the end it’s much more precise.

Many ML models get better with more training.
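Here’s a hedged sketch of that effect (my own toy data and scikit-learn’s MLPRegressor, not the lesson’s exact experiment): the same small network given 5, 50, or 400 training epochs, with its training error printed at each budget.

```python
# Sketch: too few training epochs leaves a model underfit; more epochs lower training error.
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")  # silence "did not converge" warnings on the short runs

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=300)  # a wavy trend plus noise

for epochs in (5, 50, 400):
    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=epochs, random_state=0)
    model.fit(x, y)
    mse = mean_squared_error(y, model.predict(x))
    print(f"{epochs:>3} epochs -> training MSE {mse:.3f}")
# Early on the model only roughly tracks the wave (underfitting);
# with more epochs the training error keeps falling.
```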

Overfitting

Overfitting happens when a model learns too much about the specific training data instead of learning about the phenomenon in general.

A helpful analogy is a student cheating on an exam. Say the student had access to the answers before the test and memorizes them. If the exam is an exact match then this student will pass, but if the exam has been changed some — even if the questions are very similar — this student will fail because they only know the answers, not how to derive those answers.

Like the cheating student, some of the more powerful ML models can effectively memorize the training data and how each data point maps to the output label. When this happens, the model stops finding patterns that are generally predictive and starts using patterns to identify individual training data points. We call this overfitting.

Here’s an example: I generated a small dataset that follows a linear trend with a little bit of noise. Then, I trained a neural network for about fifty thousand epochs on this small dataset. Here’s the result:

Strong overfitting, especially when x < 20. More training would have resulted in further overfitting on the right hand side of the distribution.

Even though the true underlying trend in the data is linear, my model has learned some wacky function that tries to perfectly capture the training data’s distribution. The problem is that this isn’t accurate. Let’s zoom in on the area with the most overfitting (0 <= x <= 20) and generate more data from the exact same distribution.

The learned peaks and valleys are spurious, just artifacts of overfitting to noisy data.

Clearly this model isn’t really learning the trend in the underlying distribution. Instead, it has learned the noise values from the training data.
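If you’d like to recreate something like this experiment, here’s a simplified sketch (smaller numbers, and scikit-learn’s MLPRegressor standing in for the notebook’s model; see the linked code for the real thing):

```python
# Sketch: a flexible network trained for a long time on a tiny noisy dataset memorizes noise.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

def sample_line(n):
    """Draw n points from the same noisy linear distribution: y = 2x + noise."""
    x = rng.uniform(0, 10, size=(n, 1))
    return x, 2 * x.ravel() + rng.normal(scale=1.0, size=n)

x_train, y_train = sample_line(25)  # deliberately tiny training set
model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=20_000,
                     n_iter_no_change=20_000,  # don't stop early; we want it to memorize
                     random_state=0).fit(x_train, y_train)

x_new, y_new = sample_line(500)  # fresh data from the exact same distribution
print("training MSE:", round(mean_squared_error(y_train, model.predict(x_train)), 3))
print("new-data MSE:", round(mean_squared_error(y_new, model.predict(x_new)), 3))
# The training error should come out well below the error on fresh data:
# the network has learned the training points' noise, not the y = 2x trend.
```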

This example demonstrates the importance of two major ML principles: the need for big data and something called validation.

Validation is a collection of methods (we’ll discuss some in future editions) that help us detect when a model is failing. The most common methods involve curating at least two datasets from the same underlying pool of data. Then we train models on one of those sets, and use the other to test whether our model “generalizes” to the data that was held out of training.

This is exactly what we did in the chart above, and as you can see it demonstrated a major flaw in our model.
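In code, the most common form of validation is a simple held-out split. Here’s a minimal sketch assuming scikit-learn’s train_test_split (illustrative numbers, not from the lesson):

```python
# Sketch: hold out part of the data, train on the rest, compare the two errors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=(200, 1))
y = 2 * x.ravel() + rng.normal(scale=1.0, size=200)

# Curate two datasets from the same pool: one to train on, one held out.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2_000,
                     random_state=0).fit(x_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(x_train))
val_mse = mean_squared_error(y_val, model.predict(x_val))
print(f"train MSE: {train_mse:.2f}   validation MSE: {val_mse:.2f}")
# A validation error much worse than the training error is the classic sign
# that a model is memorizing its training data instead of generalizing.
```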

Bigger datasets make it harder to overfit because for any given combination of inputs there will be multiple (ideally many) data points represented, each with its own output. In our little experiment a chart with “big data” would look something like this:

Dense data distributions make it hard for a model to become overfit.

It’s much more difficult to draw a weird squiggly line that hits every data point, because there are multiple y values for most x values. Given enough training, some complex models will still start to overfit; you can see that here on the far left:

Very complex models can still overfit when trained for a very long time, but the effect is substantially diminished compared to smaller datasets.
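Here’s a rough sketch of that data-size effect (again with my own assumed setup): the same flexible network trained on 25 points and then on 2,500 points from an identical distribution, comparing the gap between training error and error on fresh data.

```python
# Sketch: the same over-parameterized network overfits far less with more training data.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def sample_line(n):
    """Draw n points from the same noisy linear distribution: y = 2x + noise."""
    x = rng.uniform(0, 10, size=(n, 1))
    return x, 2 * x.ravel() + rng.normal(scale=1.0, size=n)

x_test, y_test = sample_line(1_000)  # fresh data for measuring generalization
for n_train in (25, 2_500):
    x_tr, y_tr = sample_line(n_train)
    model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=5_000,
                         random_state=0).fit(x_tr, y_tr)
    gap = (mean_squared_error(y_test, model.predict(x_test))
           - mean_squared_error(y_tr, model.predict(x_tr)))
    print(f"n = {n_train:>5}: test MSE minus train MSE = {gap:.2f}")
# The gap should shrink dramatically with the larger dataset: with many y values
# for every region of x, memorizing noise stops paying off.
```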

Next week we’ll talk about some more nuanced causes of failure in ML, including how an amateur Go player beat AlphaGo and the cause of a fatal self-driving car crash.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Are the following examples of underfitting, overfitting, or neither?

  • Getty Images’ allegation that ML-based image generators produce exact or near-exact replicas of copyright-protected images.

  • State-of-the-art discriminators failing to reliably tell the difference between AI-written and human-written content.

  • The type of identity-based discrimination New York City’s Local Law 144 is trying to prohibit AI systems from exhibiting.

Themes in the News

An ML-generated image of Lady Justice… for irony’s sake.

ML lawsuits and regulation are still the order of the day

A slew of lawsuits and investigations continue to target ML companies over their models and their data collection procedures.

A few examples:

  • The FTC is investigating OpenAI on consumer protection grounds.

  • Getty is suing Stability AI for unauthorized use and reproduction of copyright-protected images.

  • SAG-AFTRA is highlighting generative models as part of its ongoing strike and negotiations.

  • Microsoft and GitHub’s Copilot tool is facing a class action lawsuit over its use of open-source (but not public domain) code.

  • New EU legislation targeting big tech firms is about to go into effect.

  • The US Senate published its investigation into how Meta, Google, and others got access to private tax data.

The list goes on and on.

While just about everyone agrees that some regulation is needed, there is substantial disagreement about exactly how to proceed. Fast.ai published a great example of this disagreement on July 10th, claiming:

Proposals for stringent AI model licensing and surveillance will likely be ineffective or counterproductive, concentrating power in unsustainable ways, and potentially rolling back the societal gains of the Enlightenment. The balance between defending society and empowering society to defend itself is delicate. We should advocate for openness, humility and broad consultation to develop better responses aligned with our principles and values — responses that can evolve as we learn more about this technology with the potential to transform society for good or ill.

Teb’s Tidbits

Answers To The News Quiz

Are the following examples of underfitting, overfitting, or neither?

  • In its recent lawsuit, Getty Images alleges that ML-based image generators produce exact or near-exact replicas of copyright-protected images.

    • Overfitting. The model learned to perfectly reproduce the training data, which is a classic example of overfitting.

  • State-of-the-art discriminators cannot reliably tell the difference between AI-written and human-written content.

    • Underfitting. Models, for a variety of reasons, have a hard time finding patterns that prove something is AI-generated versus human-generated.

  • The type of identity-based discrimination New York City’s Local Law 144 is trying to prohibit AI systems from exhibiting.

    • Neither. Generally this type of discrimination comes from models that are well fit to their training datasets. The problem is that those datasets (e.g., historical hiring and promotion data) contain artifacts of identity-based discrimination. These models do exactly what they were trained to do: recreate historical hiring practices, which were racist, sexist, etc.

    • More on this next week!

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.

Help us keep providing these free services by scheduling one of our world-class trainings or requesting a custom class for your team.
