OpenAI and Microsoft are teaming up with Harvard’s libraries to train AI models on 600-year-old books

By Matt O'Brien and The Associated Press
June 12, 2025, 3:10 PM ET
Banners on the Harry Elkins Widener Memorial Library at the Harvard University campus in Cambridge, Massachusetts, US, on Tuesday, May 27, 2025. Sophie Park/Bloomberg via Getty Images

Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.

Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston’s public library.

Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.

“It is a prudent decision to start with public domain data because that’s less controversial right now than content that’s still under copyright,” said Burton Davis, a deputy general counsel at Microsoft.

Davis said libraries also hold “significant amounts of interesting cultural, historical and language data” that’s missing from the past few decades of online commentary that AI chatbots have mostly learned from. Fears of running out of data have also led AI developers to turn to “synthetic” data, made by the chatbots themselves and of a lower quality.

Supported by “unrestricted gifts” from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historic collections AI-ready in a way that also benefits the communities they serve.

“We’re trying to move some of the power from this current AI moment back to these institutions,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the stewards of data and the stewards of information.”

Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter’s handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.

It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.

“A lot of the data that’s been used in AI training has not come from original sources,” said the data initiative’s executive director, Greg Leppert, who is also chief technologist at Harvard’s Berkman Klein Center for Internet & Society. This book collection goes “all the way back to the physical copy that was scanned by the institutions that actually collected those items,” he said.

Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn’t think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.

Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that is hard for humans to fathom but still just a small fraction of what is being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
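To put those figures side by side, here is a rough back-of-the-envelope sketch using only the numbers reported in this article. The variable names are illustrative. Because the page count and Meta's token count are both given as "more than," the results are approximate upper bounds rather than exact values.

```python
# Figures as reported in the article (all are estimates or lower bounds).
HARVARD_TOKENS = 242_000_000_000      # estimated tokens in Institutional Books 1.0
HARVARD_PAGES = 394_000_000           # "more than 394 million" scanned pages
META_TOKENS = 30_000_000_000_000      # "more than 30 trillion" tokens, per Meta

# Harvard's collection as a share of Meta's reported training data.
share = HARVARD_TOKENS / META_TOKENS

# Average number of tokens per scanned page in the Harvard collection.
tokens_per_page = HARVARD_TOKENS / HARVARD_PAGES

print(f"Share of Meta's reported total: about {share * 100:.2f}%")
print(f"Average tokens per scanned page: about {tokens_per_page:.0f}")
```

On these numbers, the Harvard collection amounts to less than 1% of the tokens Meta says it used, and each scanned page averages a few hundred tokens.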

Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from “shadow libraries” of pirated works.

Now, with some reservations, the real libraries are standing up.

OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University’s 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.

When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.

“OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,” Chapel said.

Digitization is expensive. It’s been painstaking work, for instance, for Boston’s library to scan and curate dozens of New England’s French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.

Harvard’s collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.

Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute finally ended in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.

Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.

The new effort was applauded Thursday by the same authors’ group that sued Google over its book project and more recently has brought AI companies to court.

“Many of these titles exist only in the stacks of major libraries and the creation and use of this dataset will provide expanded access to these volumes and the knowledge within,” said Mary Rasenberger, CEO of the Authors Guild, in a Thursday statement. “Importantly, the creation of a legal, large training dataset, will democratize the creation of new AI models.”

How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.

The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.

A book collection steeped in 19th century thought could also be “immensely critical” for the tech industry’s efforts to build AI agents that can plan and reason as well as humans, Leppert said.

“At a university, you have a lot of pedagogy around what it means to reason,” Leppert said. “You have a lot of scientific information about how to run processes and how to run analyses.”

At the same time, there’s also plenty of outdated data, from debunked scientific and medical theories to racist and colonial narratives.

“When you’re dealing with such a large data set, there are some tricky issues around harmful content and language,” said Kristi Mukk, a coordinator at Harvard’s Library Innovation Lab. She said the initiative is trying to give users of the data guidance about mitigating its risks, to “help them make their own informed decisions and use AI responsibly.”

————

The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP’s text archives.

© 2026 Fortune Media IP Limited. All Rights Reserved.