What Does it Mean to Block GPTBot?
News: Moderation, Censorship, and Section 230
The Weekly Lab Report
I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.
If you’re new to the Lab Report you can subscribe here.
If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.
From The Lab
We’ve expanded our open enrollment offerings with DevSprout! Initially we planned to offer an introductory Python class, but we’ve decided to offer an introductory SQL class as well. The outlines for these classes can be found here:
The Intro to SQL class has a tentative start date of September 18. We will meet twice a week, Mondays and Thursdays, for 4 weeks. Each session will meet over Zoom for two hours, from 5:30pm - 7:30pm US Pacific Time. The cost is $100 for the first 10 people who sign up, and $200 after that. We cap classes at 25 attendees to ensure a high-quality virtual classroom experience. Browse the curriculum on Github.
Respond to this email to reserve your spot in class. An official enrollment portal is forthcoming.
The Python class still needs a little curriculum development so our start date is TBA. Today’s edition of the newsletter is, in part, a preview of something you’ll learn in our Intro to Python course: web scraping.
P.S. I’m traveling next week, so there won’t be a Lab Report.
Today’s Lesson
All the code from today’s lesson can be viewed on Github
Crawling in The Web, Looking For The Data
Yes, that’s a Hoobastank reference. You’re welcome.
Today’s topic is brought to you by several news organizations who’ve started blocking OpenAI’s web crawler from their websites, and by me because it’s one of the topics in my Introduction to Python class with DevSprout.
Web crawling and scraping have been around since 1993, when the first web crawler was built in an attempt to measure the size of the then-nascent world wide web. Then, search engines built their own crawlers in order to identify, analyze, and rank websites. Flash forward to today, where crawlers and other bots comprise a huge portion of web traffic.
“Crawling” generally refers to automated systems that make requests to web pages, identify links on those pages, then follow those links and repeat the process recursively. Crawlers often extract additional information from each page they visit.
Web crawling periodically becomes a hot-button issue. Once upon a time LinkedIn fought, and ultimately lost, a court battle with hiQ Labs over hiQ's scraping of LinkedIn's publicly available user data. Clearview AI, a company that sells facial recognition software, settled a lawsuit brought by the ACLU over its massive photo database, which was largely harvested by web crawlers. Now, OpenAI has announced GPTBot, which crawls the web to collect training data, apparently to the chagrin of news organizations and other copyright holders who are scrambling to block the bot.
Despite the occasional controversy, web crawling is common, generally legal, and often quite simple. For example, here are a few lines of Python that print all the links on Abe Lincoln’s Wikipedia page:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://en.wikipedia.org/wiki/Abraham_Lincoln")
soup = BeautifulSoup(page.content, 'html.parser')
all_links = soup.find_all('a')
for link in all_links:
    print(link.get_text(), link.attrs.get('href'))
It’s crude, but these 7 lines of code are the basis of crawling. They gather all the links on the page by finding all the ‘a’ tags (short for anchor, HTML’s standard tag for hyperlinks). To really “crawl” you put those links in a queue and repeat the process until the queue is empty.
Sophisticated crawlers also need to handle special cases, such as links they’ve already visited and websites that use JavaScript on the front end to populate content after the initial page load.
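Here’s a rough sketch of that queue-based approach, using the same libraries as above. The crawl function, its max_pages cap, and the use of urljoin to resolve relative links are my own additions for illustration; this is not production-grade crawling code:
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

def crawl(start_url, max_pages=10):
    # Breadth-first crawl: pull a URL off the queue, collect its links,
    # and repeat until the queue is empty or we hit max_pages.
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # skip pages we've already crawled
        visited.add(url)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        for link in soup.find_all('a'):
            href = link.attrs.get('href')
            # Keep http(s) and site-relative links; skip fragments, mailto:, etc.
            if href and (href.startswith('http') or href.startswith('/')):
                queue.append(urljoin(url, href))  # resolve relative links
    return visited

crawl("https://en.wikipedia.org/wiki/Abraham_Lincoln")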
Copying content from pages adds a bit more complexity, but not too much. Here’s a crude approximation of what Clearview AI did. Mine harvests images from Wikipedia — rather than LinkedIn — and saves the images to files:
from bs4 import BeautifulSoup
import requests
import shutil
base_url = "https://en.wikipedia.org"
page = requests.get("https://en.wikipedia.org/wiki/Abraham_Lincoln")
soup = BeautifulSoup(page.content, 'html.parser')
all_img_tags = soup.find_all('img')
img_count = 0
for img in all_img_tags:
    img_url = img.attrs.get('src')
    # Image URLs on Wikipedia come in two forms: protocol-relative (//...) and site-relative (/...).
    if img_url.startswith('//'):
        absolute_url = f'https:{img_url}'
    else:
        absolute_url = base_url + img_url
    # Stream the image bytes straight into a numbered file on disk.
    response = requests.get(absolute_url, stream=True)
    file_type = response.headers['content-type'].split('/')[-1]  # kinda gross, but works.
    with open(f'09-01-2023/img_out/{img_count}.{file_type}', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    img_count += 1
Among other things, GPTBot scrapes the text of news articles from many different news websites. Each website operator makes its own decisions about how to format and present that text, which means GPTBot’s operators need specialized code to extract article text on a per-website basis. They also probably want to ignore advertisements, image captions, and other ancillary text that might appear on the page.
I made two scrapers to demonstrate that this is also, generally, quite easy. Here’s one that gets the body of CNN news articles:
from bs4 import BeautifulSoup
import requests
# A randomly selected CNN article from the day I wrote this script.
page = requests.get("https://www.cnn.com/2023/08/30/business/san-francisco-union-square-retail-closures/index.html")
soup = BeautifulSoup(page.content, 'html.parser')
individual_p_tags = soup.select('.article__content p') # CNN's content sits in p tags under a div with this class
texts = [tag.text.strip() for tag in individual_p_tags]
a_text = '\n'.join(texts)
with open('09-01-2023/news_text_out/cnn_article.txt', 'w') as file:
    file.write(a_text)
And one for the BBC:
from bs4 import BeautifulSoup
import requests
# A randomly selected BBC article from the day I wrote this script.
page = requests.get("https://www.bbc.com/sport/football/66662060")
soup = BeautifulSoup(page.content, 'html.parser')
# BBC wraps the main body in a div with this class, but uses p's for the text
individual_p_tags = soup.select('.story-body p')
texts = [tag.text for tag in individual_p_tags]
a_text = '\n'.join(texts)
with open('09-01-2023/news_text_out/bbc_article.txt', 'w') as file:
    file.write(a_text)
My point is that harvesting data from websites is remarkably easy, which makes it appealing to the increasingly data-hungry ML industry.
How Do Companies Stop This?
The bottom line is that if human users can access data in a web browser without some kind of authentication, then a bot can too. Web crawlers can be built so that, from the server’s perspective, they are basically indistinguishable from a human using a browser. Still, there are a few options companies commonly use to reduce or manage bot traffic.
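Part of the reason is that nothing forces a bot to identify itself. The requests library, for example, announces itself with a User-Agent header along the lines of "python-requests/2.31.0" by default, but a crawler can trivially send a browser-style header instead. The exact header string below is just an illustrative example of what a desktop browser might send:
import requests

# Impersonate a typical desktop browser (the exact string is illustrative).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/116.0 Safari/537.36'
}
page = requests.get("https://en.wikipedia.org/wiki/Abraham_Lincoln", headers=headers)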
Robots.txt
The first option is a robots.txt file. These files are more of a request than a defense. Websites use these files to indicate to web crawlers and other bots that certain pages shouldn’t be indexed, viewed, or otherwise accessed by bots. When OpenAI announced GPTBot, they also added documentation for how to modify your site’s robots.txt to prevent GPTBot from accessing certain pages or directories.
When you read that “news organizations are scrambling to block GPTBot,” what that means is that websites are updating their robots.txt files, which is a totally standard and very easy thing for a web developer to do.
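For example, OpenAI’s documentation describes an entry along these lines, which asks GPTBot (identified by its user agent string) to stay away from the entire site:
User-agent: GPTBot
Disallow: /
Swapping the / for a specific path restricts the rule to that part of the site.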
Unfortunately, robots.txt only works for bots that choose to respect the specified rules.
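For bots that do want to comply, checking the rules is easy. Python’s standard library ships a robots.txt parser, and a well-behaved crawler might consult it before every request, roughly like this:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# False means the site's robots.txt asks this user agent not to fetch this URL.
print(rp.can_fetch("GPTBot", "https://en.wikipedia.org/wiki/Abraham_Lincoln"))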
Honeypots
Some websites host fake content at a particular URL, then add a line in their robots.txt explicitly banning access to that URL. If anything requests that URL anyway, the site bans that IP address from accessing the site at all. Here’s a write-up describing an implementation of this tactic.
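As a sketch of the idea, here’s roughly what that could look like in a small Flask app. The trap route and the in-memory ban list are hypothetical; a real site would list the trap URL under Disallow in its robots.txt and enforce bans at the firewall or load balancer rather than in application code:
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # hypothetical in-memory ban list; a real site would persist this

@app.before_request
def reject_banned_ips():
    # Refuse every request from an IP that previously hit the trap.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/secret-admin-panel")  # hypothetical trap URL, listed under Disallow in robots.txt
def honeypot():
    # Only a bot ignoring robots.txt should ever land here.
    banned_ips.add(request.remote_addr)
    abort(403)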
With the ubiquity of VPNs, proxy servers, and other ways for malicious actors to get a new IP address, this can become a game of whack-a-mole if your adversary is tenacious. Additional user-fingerprinting tactics can help somewhat, but they aren’t surefire.
Authentication and reCAPTCHA
I’m sure most of you have completed a reCAPTCHA, whether by clicking a checkbox, finding the crosswalks in an image, or performing some similar task. reCAPTCHA is a technology built by Google specifically to keep bots away from particular content by hiding that content behind the task.
This process introduces a minor headache for human users, but a major hurdle for bots. Some bots can likely perform some of the reCAPTCHA tasks some of the time, but it makes bot operators’ jobs much harder.
Classic authentication is even better, for two reasons. First, signing up for credentials is often a multi-step process involving an email, a text message, or some other second factor, which further complicates automation. Second, if an authenticated user starts behaving like a bot, banning that account is more effective than a simple IP ban, because the operator has to repeat the whole signup flow.
While there is no 100% surefire way to prevent people from scraping your online content, a robots.txt, honeypots, authentication, and constant vigilance can substantially reduce successful scraping.
The News Quiz
Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.
[Image: a spider crawling on a newspaper.]
Web crawling and scraping can be messy. The fact that OpenAI is doing it suggests (at least to me) they can’t get enough high quality training data from more traditional sources (such as “The Pile” and other open NLP datasets). Classify the following issues that can arise specifically from scraping the text of news articles as high risk, medium risk, or low risk from the perspective of a company training a large language model (LLM):
Accidentally capturing pieces of text that aren’t part of the article, such as advertisements, image captions, pull quotes, embedded links to other articles, etc.
Pulling in native advertising content.
Pulling in articles that were themselves written by an LLM.
Consuming articles with factual errors before they’ve been corrected.
Incorporating copyright protected content into your training data.
Increased cost incurred from the actual process of scraping the data.
Themes in the News
Censorship, Moderation, and Section 230
Lots of recent buzz around these perennial topics.
A judge dismissed a lawsuit brought by the Republican National Committee (RNC) against Google. The lawsuit alleged that Google’s spam filter was biased against Republican candidates’ and officials’ emails. U.S. District Court Judge Daniel Calabretta concluded that the RNC had not “sufficiently pled that Google acted in bad faith.” The RNC plans to refile the lawsuit with an amended complaint.
Meanwhile, the Biden administration urged the Supreme Court to take up cases related to laws in Texas and Florida that substantially limit social media companies’ ability to perform moderation. The Florida law, for example, imposes fines on social media platforms if they “refuse to transmit” a politician’s post, regardless of whether that post violates the company’s content policy.
X (formerly Twitter), Meta, and YouTube have all indicated they will decrease their moderation efforts regarding misinformation. In part this is probably because the platforms have realized it’s really hard to do well. Many of them fumbled the Hunter Biden laptop story, suppressing the NY Post’s original article, which turned out to be genuine. The “Twitter files” should also cast doubt on platforms’ ability and willingness to be neutral and trustworthy.
At the same time, these platforms have an ongoing and increasingly adversarial relationship with news organizations, including various efforts to deprioritize news content on their sites. Many people have come to rely on social media as a news aggregator. Between reduced moderation and deprioritization of legitimate news content, there’s a major void that is being filled with trolls, trashy AI generated content, and other kinds of misinformation.
A report from Rest of World this week highlighted a fairly predictable outcome of backing off moderation, as X has done under Elon Musk’s leadership: scams become more prevalent. In this case, sextortion scams targeting prominent Chinese figures on the platform.
So here we are, stuck between platforms that can’t be trusted to moderate particularly well and the free-for-all of the internet that can’t be trusted at all.
Is it good news that OpenAI thinks ChatGPT will soon be able to moderate on social sites?
Teb’s Tidbits
Ben Evans wrote a thoughtful piece on the state of Generative AI and copyright.
I like the piece, but he gets something important wrong when he claims generative models never store the training data. Researchers have been able to prompt LLMs to produce exact replicas of lengthy passages from books, and Getty Images is currently suing Stability AI over a model that produced exact replicas of copyrighted images. This implies that these systems do store training data, albeit in an obfuscated and compressed way within their internal parameters. That fact matters in the copyright discussion, because consumption by an ML system is not always “transformative” in the legal sense.
An argument that the UK’s recently unveiled plans to change their surveillance rules would violate international law.
Meta released Code Llama, an LLM designed specifically to write computer programs.
A fantastic and grounded risk assessment of self driving cars in IEEE Spectrum.
And a pleasant surprise: Apple has come out in favor of California’s Right to Repair Act.
Answers To The News Quiz
Accidentally capturing pieces of text that aren’t part of the article, such as advertisements, image captions, pull quotes, embedded links to other articles, etc.
High Risk: if the training data ends up with lots of incongruent articles with random advertising tidbits interspersed with the real text, this could definitely degrade model performance. Cleaning up the scraped text to remove these errors would be worth the hard work.
Pulling in native advertising content.
Low Risk: Native advertising is still a perfectly legitimate use of language even if it’s not particularly high-brow.
Pulling in articles that were themselves written by an LLM.
High Risk: It’s well established that training LLMs on text created by other LLMs degrades model performance. Making matters worse, current methods for discerning whether a particular article is AI-generated are not reliable at all.
Consuming articles with factual errors before they’ve been corrected.
Low Risk: LLMs ‘hallucinate’ all the time, even when they’ve been trained exclusively on factually accurate data. LLMs learn to recreate patterns in writing, mostly related to word order, rather than specific facts from pieces of writing.
Incorporating copyright protected content into your training data.
Medium to High Risk: Depending on the outcome of some pending lawsuits this might end up being a very big risk or something that can be easily addressed with a license agreement, royalty scheme, or something similar.
Increased cost incurred from the actual process of scraping the data.
Low Risk: scraping is generally cheap and easy to perform.
Remember…
The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.