
A new web crawler launched by Meta last month is quietly scraping the internet for AI training data

By Kali Hays
August 20, 2024, 6:59 PM ET
Meta CEO Mark Zuckerberg is betting big on AI. Photo: Jason Henry, Bloomberg/Getty Images

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although this crawler, dubbed Facebook External Hit, “has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

To attempt to block a web scraper, a website must publish robots.txt, a plain-text file placed at the root of the site that signals to a scraper bot which parts of the site it should ignore. Typically, however, the specific name of a scraper bot needs to be listed in the file in order for the rule to apply to it. That’s difficult to accomplish if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt; it is not enforceable or legally binding in any way.
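
As a rough illustration of how a rule that names a bot works, the sketch below uses Python's standard-library robots.txt parser to check which crawlers a sample file turns away. The user-agent token `meta-externalagent` is an assumption about how Meta External Agent identifies itself and does not come from the reporting above; the `is_allowed` helper and the example.com URL are hypothetical. The check only shows what a well-behaved crawler would conclude; nothing here stops a bot that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. "GPTBot" is named in the article; the
# "meta-externalagent" token is an assumed name for Meta's crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/news/story")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A crawler that is not listed by name, such as the hypothetical `SomeOtherBot`, is still allowed under this file, which is why an undisclosed bot name makes targeted blocking hard.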

Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, the initial version of the model used large datasets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests Meta’s vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need new, high-quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta employee or someone with insight or a tip to share? Contact Kali Hays securely through Signal at +1-949-280-0267 or at kali.hays@fortune.com.

