
How Does ChatGPT "Understand" Words?

It's numbers all the way down...

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

I had classes from 9:00 a.m. to 5:00 p.m. every day this week. Additionally, I have a side gig as a high school debate coach, and school started this week in my district. So, today’s edition is significantly shorter than usual.

Today’s Lesson

How Do AIs "Understand" Words?

Last week we wrote about “The Bitter Lesson,” in which Richard Sutton documents the superiority of “general purpose methods” that scale with computation. A practical consequence of this lesson is that modern Natural Language Processing (NLP) is dominated by statistical models called neural networks. Specifically, a type of neural network called a transformer is the model du jour.

These models are fundamentally numerical in nature. Each neural network is literally just a math function.

In fairness to the models, they are immensely complex math functions. OpenAI’s Large Language Model (LLM) GPT-3 has 175 billion parameters and GPT-4 is rumored to have ~100 trillion. A “parameter” in this context is a number that the model repeatedly changes during the training process. To the extent that a neural network “knows” anything, that knowledge is encoded in these numbers. The parameters are spread across a complex web of mathematical sub-components. But, at the end of the day, these models are still just fancy math functions.

A representation of the Transformer architecture from the paper that first introduced them: https://arxiv.org/pdf/1706.03762.pdf
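To make the “just a math function” point concrete, here is a minimal sketch in Python. It is nothing like a real LLM — just a toy one-layer network whose “parameters” are fifteen ordinary numbers:

```python
import numpy as np

# A toy one-layer "neural network": a math function whose parameters are
# ordinary numbers that training would repeatedly adjust.
rng = np.random.default_rng(seed=0)
W = rng.normal(size=(4, 3))  # 12 parameters (weights)
b = rng.normal(size=3)       # 3 parameters (biases)

def tiny_network(x):
    # Multiply the input by the weights, add the biases, squash the result.
    return np.tanh(x @ W + b)

print(tiny_network(np.array([1.0, 0.5, -0.2, 0.3])))
```

GPT-3 works on the same basic principle, just with 175 billion of these adjustable numbers arranged into the transformer architecture pictured above.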

Words — and natural language in general — are absolutely not numerical or mathematical in nature. Natural language evolved … naturally. Most natural languages have rules, but there are almost always exceptions. Humans, the inventors of natural language, do not internally contextualize words in mathematical terms. But this mismatch hasn’t prevented remarkable progress in the form of ChatGPT, Bard, LLaMA, and other LLMs. So, how do these models reconcile this fundamental mismatch?

The answer is something called an embedding.

Embeddings

Embeddings are a general-purpose technique for representing some kind of data — words, movies, music — as a vector (i.e., as a list of numbers). Embeddings can be manually created or “learned” as part of the training process. Either way, each number in the vector represents some aspect of the thing being embedded. For example, here are three small embeddings I created manually for encoding TV shows or movies:

A sample of embeddings representing two TV shows and a movie.
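In code, hand-made embeddings like these are just short lists of numbers keyed by title. The dimensions and values below are hypothetical stand-ins I chose for illustration, not the exact ones pictured above:

```python
# Hypothetical hand-made embeddings: each dimension is an aspect a human chose,
# e.g. [comedy, drama, sci-fi, typical runtime in hours].
show_embeddings = {
    "TV Show A":      [0.9, 0.2, 0.0, 0.5],
    "TV Show B":      [0.1, 0.8, 0.9, 1.0],
    "Feature Film C": [0.3, 0.7, 0.1, 2.0],
}
```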

When Pandora first launched, they hired several Ph.D. music theorists to manually create high-quality embeddings for the songs in their catalog. These embeddings included fields like acoustic sonority, minor-key tonality, vocal-centric aesthetic, and other features of songs that the music theorists thought were strongly connected to people’s taste in music. These embeddings were a core part of Pandora’s recommendation engine.

When we embed words, each number corresponds to some aspect of the word. Sometimes these aspects are grammatical in nature, such as plurality or being a proper noun. Sometimes they are semantic in nature, such as having gender implications, association with various emotions, or association with abstract concepts like royalty, nature, or courage.

For example, the word “queens” is plural, is not a proper noun, implies female gender, has a strong relationship with royalty, and has a weak relationship with nature.
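As a toy illustration — these dimensions and values are mine, not GPT-3’s — that description might translate into a vector like this:

```python
# Hand-labeled vector for "queens" using the aspects described above:
# [is_plural, is_proper_noun, implies_female, royalty, nature]
queens = [1.0, 0.0, 1.0, 0.9, 0.1]
```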

Depending on the size of the embedding, more or less meaning can be encoded. GPT-3 uses an embedding length of 12,288 numbers per word. Technically, GPT-3 uses “tokens” rather than words: tokenization breaks some words into parts and also gives punctuation marks their own tokens.
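If you want to see tokenization in action, OpenAI publishes an open-source tokenizer library called tiktoken. The sketch below uses the “r50k_base” encoding purely as an illustration; the exact encoding varies by model:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("r50k_base")  # an illustrative GPT-3-era encoding
tokens = enc.encode("Queens, New York!")

print(tokens)                             # integer token ids
print([enc.decode([t]) for t in tokens])  # the text piece behind each id
# Punctuation marks can get their own tokens, and long or unusual words
# are often split into multiple tokens.
```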

Additionally, for a variety of reasons, GPT-3 (and most LLMs) use embeddings that are learned as part of the training process, rather than manually crafted embeddings like the ones Pandora pioneered. This makes interpreting GPT-3’s embeddings quite difficult, and full of guesswork — but it also allows different neural networks to build embeddings that help with the specific task at hand, which often improves performance.
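For the curious, here is a minimal sketch of what a learned embedding table looks like in PyTorch. The sizes are made up and far smaller than GPT-3’s, and this is not GPT-3’s actual code — just the general mechanism:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 50,000-token vocabulary, 16 numbers per token
# (GPT-3's real embedding length is 12,288).
vocab_size, embedding_dim = 50_000, 16
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([4, 512, 90])  # ids a tokenizer might produce
vectors = embedding(token_ids)          # shape: (3, 16)

# The table's entries are ordinary trainable parameters; gradient descent
# nudges them during training, which is how the model "learns" what each
# token means for the task at hand.
print(vectors.shape, embedding.weight.requires_grad)
```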

Shakespeare’s Juliet philosophically asked, “What’s in a name?”

Well, at least to ChatGPT the answer is clear: A rose represented by the same 12,288 floating point numbers surely smells just as sweet.

Themes in the News

More on Moore’s Law

A few pieces of news related to last week’s lesson.

Driven largely by ML (although cryptocurrency is also a major contributor), energy use for computation has skyrocketed in recent years. Our chips’ ever-increasing capabilities aren’t free: right now, the computing industry uses about as much energy as all of Britain.

Speaking of Britain, Prime Minister Rishi Sunak wants to buy £100 million worth of NVIDIA chips to further the UK’s position in the global AI race. NVIDIA has become an industry leader in AI not just by building top-end hardware, but also by providing excellent software support that helps engineers get the most out of NVIDIA’s chips.

Some folks are focusing on ways to reduce or mitigate the energy cost of computation; for example, this company wants you to heat your water with a powerful server that doubles as part of a distributed compute cluster.

Teb’s Tidbits

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.

Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.
