Machine Learning and the “-Isms”

Plus a new direct-to-student class in the works

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

My friend and former colleague Ian—who now runs DevSprout—is building an introductory Python and Web Development class with me. We’re planning a four-week course that meets two nights per week for 1.5-2 hours per session.

Would you be interested in a class like this if it cost $100-$200?

If there’s interest in the class, I’ll share a draft of the course outline in next week’s newsletter. If you’re interested in a similar course on a different topic, please reply to this email and let me know what you’d like to learn about.

Today’s Lesson

Machine Learning and the “-Isms”

Porcha Woodruff was arrested for robbery and carjacking on Feb. 16, 2023, following a facial recognition match. Woodruff was eight months pregnant at the time. She was detained for 11 hours and then released on a $100,000 personal bond. Her case has since been dismissed, and she is now suing the Detroit Police Department (DPD) for damages caused by the false arrest.

This whole ordeal was based on a facial recognition match between gas station surveillance footage and a grainy 2015 mugshot of Woodruff (she had been arrested that year for driving with an expired license).

The 2015 mugshot matched by the facial recognition system (left) and Porcha’s 2021 driver’s license photo (right), which was also available to the facial recognition system but didn’t match. Via NYTimes: https://www.nytimes.com/2023/08/06/business/facial-recognition-false-arrest.html

According to The New York Times, Woodruff is the sixth person—and the first woman—to report being arrested due to a false positive from a facial recognition system. All six are Black. Coincidence?

Machine learning systems have been plagued with issues related to social bias. Amazon built a sexist hiring AI. Several municipal governments have used racist risk assessment tools to inform bail and sentencing decisions. Microsoft infamously released a Twitter bot that became a Nazi propagandist within 24 hours.

The day before Porcha’s arrest, this analysis of medical ML systems appeared on PubMed. It concluded, “With the exception of only a few cases, we found that the performance for the White group was, in general, significantly higher than that of the other racial groups across all ML algorithms.”

State-of-the-art generative systems have also come under fire for reinforcing a wide variety of stereotypes. Bloomberg recently analyzed the image generator Stable Diffusion and found substantial racial and gender bias. They prompted the AI to depict people with various jobs and categorized the resulting images by skin tone and perceived gender. In their analysis, “lawyers” were mostly depicted as white men and “housekeepers” were mostly women of color.

An analysis done by Bloomberg shows the skin tone composition of images generated by Stable Diffusion when prompted to show a person with a particular job. Source: https://www.bloomberg.com/graphics/2023-generative-ai-bias/
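
For readers curious what an audit like Bloomberg’s looks like mechanically, here’s a rough sketch in Python. The `generate_image` and `classify_subject` functions are placeholders I’ve invented for illustration, and the occupation list and sample size are illustrative; this is not Bloomberg’s actual pipeline or tooling.

```python
# Hypothetical sketch of an occupation-prompt audit in the spirit of
# Bloomberg's analysis. `generate_image` and `classify_subject` are
# placeholder callables, and the occupations/sample size are illustrative.
from collections import Counter

OCCUPATIONS = ["lawyer", "judge", "doctor", "housekeeper", "fast-food worker"]
SAMPLES_PER_PROMPT = 300  # illustrative sample size

def audit_generator(generate_image, classify_subject):
    """Tally perceived skin tone and gender of generated images per occupation."""
    results = {}
    for job in OCCUPATIONS:
        tallies = Counter()
        for _ in range(SAMPLES_PER_PROMPT):
            image = generate_image(f"a color photograph of the face of a {job}")
            # classify_subject returns e.g. ("darker skin tone", "woman")
            tallies[classify_subject(image)] += 1
        results[job] = tallies
    return results

# Usage, with your own generator and classifier standing in:
# results = audit_generator(my_image_generator, my_demographic_classifier)
# results["lawyer"].most_common()
```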

Many people associate algorithms with a kind of pure mathematical objectivity. But now, roughly two decades into the machine learning revolution, more experts are admitting that it’s probably impossible to create an unbiased ML model.

So, why?

It’s The Data, Mostly

Garbage in, garbage out.

ML models become biased primarily because social biases are deeply embedded in the datasets used to train them. In a sense, the algorithms really are acting as tools of unbiased, objective mathematics: They precisely recreate the problematic patterns in the data.

Amazon used their historical hiring data to train their hiring AI; it’s easy to believe Amazon’s historical hiring practices were at least somewhat sexist. The American court and police systems produced the data used to train those risk assessment tools; both systems have a well-established history of racism. And Tay, Microsoft’s Twitter bot, was intentionally trained by Twitter trolls to spew Nazi propaganda.
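
To make “garbage in, garbage out” concrete, here’s a minimal, purely synthetic sketch (not Amazon’s system or data): a perfectly “objective” classifier trained on hiring decisions that penalized women learns to penalize women.

```python
# Purely synthetic demonstration: a model trained on biased labels
# faithfully reproduces the bias. Nothing here is real hiring data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

experience = rng.normal(5, 2, n)   # years of experience
is_male = rng.integers(0, 2, n)    # 1 = male, 0 = female

# Simulated "historical" decisions: equally qualified women were hired
# less often. The bias lives in the labels, not in the algorithm.
hired = (experience + 2 * is_male + rng.normal(0, 1, n)) > 6

model = LogisticRegression().fit(np.column_stack([experience, is_male]), hired)

# Two identical candidates who differ only by the gender flag:
print("P(hire | woman):", model.predict_proba([[5.0, 0]])[0, 1])
print("P(hire | man):  ", model.predict_proba([[5.0, 1]])[0, 1])
# The model confidently prefers the man, because the training data did.
```

The math does exactly what it was asked to do; the problem is entirely in the labels it was asked to reproduce.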

The unfortunate truth is that we live in a society where racism, sexism, and other forms of social bias are deeply embedded in many aspects of our culture. Historically accurate data about our society inevitably contains artifacts of these biases. The same goes for the makeup and content of writing, art, pictures, and other artifacts produced by that society.

As Alex Najibi, then a Ph.D. candidate at Harvard (and now a Ph.D. holder), described in 2020 with regard to facial recognition systems:

Several avenues are being pursued to address these inequities. Some target technical algorithmic performance. First, algorithms can train on diverse and representative datasets, as standard training databases are predominantly White and male. Inclusion within these datasets should require consent by each individual. Second, the data sources (photos) can be made more equitable. Default camera settings are often not optimized to capture darker skin tones, resulting in lower-quality database images of Black Americans. Establishing standards of image quality to run face recognition, and settings for photographing Black subjects, can reduce this effect.

Alex Najibi, emphasis mine.

Balancing the racial makeup of datasets used to train facial recognition systems would not be terribly difficult. Correcting the long history of camera technology that prioritizes light skin tones is another story.
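
As a sketch of how simple the dataset half can be, assuming you already have trustworthy demographic labels for each image (a big assumption), one naive approach is to downsample every group to the size of the smallest. The column names below are hypothetical.

```python
# Hypothetical sketch: equalize group sizes in a training set's metadata
# table before sampling images for training. Column names are made up.
import pandas as pd

def balance_by_group(df: pd.DataFrame, group_col: str = "skin_tone") -> pd.DataFrame:
    """Downsample every demographic group to the size of the smallest group."""
    smallest = df[group_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=0)
        for _, group in df.groupby(group_col)
    ]
    return pd.concat(parts).reset_index(drop=True)

# faces = pd.read_csv("face_metadata.csv")   # one row per training image
# balanced = balance_by_group(faces)
# balanced["skin_tone"].value_counts()       # now uniform across groups
```

Reweighting or oversampling the smaller groups are common alternatives when you can’t afford to throw data away.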

It’s not possible to retroactively fix every image captured with sub-optimal film or sensor settings. It’s also generally not feasible (and certainly not profitable) for firms to manually collect the millions to billions of high-quality images needed to train a modern facial recognition system while ensuring proper lighting, color balance, and other settings. Most firms just scrape publicly available data from the web.

Even if firms did collect pristine photos, law enforcement typically uses facial recognition to match against low-quality surveillance footage, which likely has poor color balance and opens the systems up to the extrapolation problem.

Generative systems like Stable Diffusion are holding up an unflattering mirror to society. White men really are over-represented among doctors. Women of color really are over-represented among housekeepers. When Stable Diffusion trains on data that represents reality, it recreates discrepancies like these.

Lots of people in online forums use racist language. When we train models like GPT-4 on that data, of course they learn to parrot it.

Making matters worse, modern ML systems require massive amounts of data. Collecting, curating, and cleaning datasets at that scale is an enormous task. It’s easy for data cleaners and labelers (who are mostly poorly paid gig workers) to miss subtle forms of bias. And if they did succeed at removing every hint of social bias, the training sets would shrink dramatically, plausibly making them too small to train large models like OpenAI’s GPT-4 or Google’s Bard.

Engineering Teams, Executives, and End Users Share The Blame

ML practitioners aren’t generally experts with rich experience in all the nuances and subtleties of racism, sexism, homophobia, and other forms of bias. Exhaustively testing models like ChatGPT for every possible problematic utterance is already a huge challenge, one that’s exacerbated by a lack of expertise in the wide world of social bias.

Even when engineering teams do have that expertise, their concerns about bias and ethics often play a secondary role to building and releasing a profitable product. Sometimes engineers are even punished for concerning themselves with ethics and bias: When Timnit Gebru and Margaret Mitchell authored a paper demonstrating that large language models frequently produce racist content, they were ousted from their jobs at Google.

In some cases—such as Porcha Woodruff’s arrest—the end users lack crucial skepticism. The Detroit Police Department uncritically trusted their AI’s output, even in the face of contradictory evidence (such as Woodruff’s pregnancy). A judge also appears to have uncritically signed an arrest warrant based primarily on the AI’s output, with little supporting evidence.

Finally, some actors are just malicious. With intentional prompting, it’s always possible to get even a relatively neutral AI system to produce something that contains or represents social bias, like this:

I prompted an AI to produce a picture of: “A white man gobbling mayonnaise from the jar using a spoon.”

I hope our white readers can laugh at this stereotype. I especially appreciated that the man appears to be sunburned. But similar images, perfectly innocuous in some settings, can carry more insidious racist or sexist undertones in others.

Should AI image generators specifically refuse to create images of Black people eating fried chicken and watermelon? Plenty of Black people in the real world do eat these foods, but the stereotype that Black people prefer them has an ugly history and seems to reappear every Black History Month.

There’s nothing inherently wrong with an image of a Black person eating fried chicken, but it’s pretty easy to use that imagery in a way that reinforces nasty stereotypes. Similarly, if you prompted an AI for an image of “A black man eating” and it produced a man with a bucket of fried chicken, that would be cause for concern.

Fortunately, the system I used didn’t produce something obviously racist… but it was still a monstrosity. (Seriously, WTF is going on with his mouth and fingers? And are those supposed to be noodles?)

I prompted an AI to create an image of “A black man eating.”

It’s probably impossible to create an ML system that eliminates all social biases while also being generally useful. Which brings us to today’s News Quiz.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Image from the recent research paper: https://aclanthology.org/2023.acl-long.656.pdf 

New research studied various text generators and classified them across two political dimensions using a tool called the “political compass” (pictured above).

The color of each circle indicates the model family: yellow dots are from Google’s BERT family, orange dots are from Meta’s LLaMA family, and white dots are from OpenAI’s GPT family. The x-axis measures economic alignment (left to right); the y-axis measures social alignment (libertarian to authoritarian).

Here’s a snippet from the paper:

Generally, BERT variants of LMs are more socially conservative (authoritarian) compared to GPT model variants.

Section 4.1, second bullet point: https://aclanthology.org/2023.acl-long.656.pdf

Which of the following is the most likely explanation for the difference in social conservatism?

  • The GPT family’s model architecture is inherently more libertarian.

  • The engineers at OpenAI are more libertarian than the engineers at Google.

  • The BERT family of models were trained on a more socially conservative dataset.

This research has implications for a popular “toxic speech classifier” called Jigsaw (and others). Here’s another snippet from the research paper:

No language model can be entirely free from social biases.

Final paragraph of section 1

If the authors are right, Jigsaw’s classifications of what counts as “toxic” must be biased. Which of the following options could Jigsaw pursue to mitigate the bias?

  • Train multiple models with different datasets and biases and have them vote or otherwise combine their classifications (this strategy is called “ensembling”).

  • Gather and curate a dataset that perfectly balances all the relevant perspectives on what constitutes “toxic speech.”

  • Select a type of model that is inherently non-partisan.

Teb’s Tidbits

The main article and news quiz ran long this week, so we’re skipping the “Themes” section.

Answers To The News Quiz

Which of the following is the most likely explanation for the difference in social conservatism?

  • The GPT family’s model architecture is inherently more libertarian.

    • No research known to me suggests any particular model architecture is inherently biased in one way or another. But here’s a really interesting quote from the paper:

  • The engineers at OpenAI are more libertarian than the engineers at Google.

    • This might be true. If it is, these biases may have slipped into the verification, training, and testing processes at OpenAI and Google. But it’s probably of secondary importance to the training data. Here’s a quote from the paper:

  • The BERT family of models were trained on a more socially conservative dataset.

    • This is the most likely cause. One more quote from the paper:

This research has implications for a popular “toxic speech classifier” called Jigsaw (and others). Here’s another snippet from the research paper:

No language model can be entirely free from social biases.

Final paragraph of section 1

If the authors are right, Jigsaw’s classifications of what counts as “toxic” must be biased. Which of the following options could Jigsaw pursue to mitigate the bias?

  • Train multiple models with different datasets and biases and have them vote or otherwise combine their classifications (this strategy is called “ensembling”).

    • This solution is recommended by the paper. A minimal sketch of the voting approach appears after these answers.

  • Gather and curate a dataset that perfectly balances all the relevant perspectives on what constitutes “toxic speech.”

    • The paper suggests that a perfectly balanced dataset is impossible, and that better-balanced data might help but will never fully eliminate social bias.

  • Select a type of model that is inherently non-partisan.

    • Again, there is no research known to me that suggests any type of model is inherently partisan to a particular political persuasion.
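
For the curious, here’s a minimal sketch of the voting idea from the first option, with toy stand-in classifiers. In practice each member of the ensemble would be a model trained on a differently sourced (and differently biased) corpus, and the combination could be weighted rather than a simple majority.

```python
# Toy sketch of ensembling toxicity classifiers by majority vote.
# The three "classifiers" below are trivial stand-ins with different
# thresholds for what counts as toxic; real ones would be trained models.
from collections import Counter
from typing import Callable, List

Classifier = Callable[[str], str]  # maps text -> "toxic" or "ok"

def ensemble_vote(classifiers: List[Classifier], text: str) -> str:
    """Return the majority label across all classifiers."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]

lenient = lambda text: "toxic" if "slur" in text else "ok"
medium  = lambda text: "toxic" if "insult" in text else "ok"
strict  = lambda text: "toxic" if any(w in text for w in ("slur", "insult")) else "ok"

print(ensemble_vote([lenient, medium, strict], "a mild insult"))  # "toxic" (2 of 3 votes)
```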

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.

Help us keep providing these free services by scheduling one of our world-class trainings or requesting a custom class for your team.
