OpenAI’s new safety tools are designed to make AI models harder to jailbreak. Instead, they may give users a false sense of security

By Beatrice Nolan, Tech Reporter
November 5, 2025, 9:58 AM ET
[Photo: OpenAI logo on a keyboard. Credit: Samuel Boivin, NurPhoto/Getty Images]
OpenAI last week unveiled two new open-weight tools.

OpenAI last week unveiled two new free-to-download tools that are supposed to make it easier for businesses to construct guardrails around the prompts users feed AI models and the outputs those systems generate.

The new guardrails are designed so a company can, for instance, more easily set up controls to prevent a customer service chatbot from responding in a rude tone or revealing internal policies about how it decides whether to offer refunds.

But while these tools are designed to make AI models safer for business customers, some security experts caution that the way OpenAI has released them could create new vulnerabilities and give companies a false sense of security. And while OpenAI says it has released these security tools for the good of everyone, some question whether OpenAI’s motives are driven in part by a desire to blunt one advantage held by its AI rival Anthropic, which has been gaining traction among business users in part because of a perception that its Claude models have more robust guardrails than competing models.

The OpenAI security tools—which are called gpt-oss-safeguard-120b and gpt-oss-safeguard-20b—are themselves a type of AI model known as a classifier, which is designed to assess whether the prompt a user submits to a larger, more general-purpose AI model, as well as what that larger AI model produces, meets a set of rules. Companies that purchase and deploy AI models could, in the past, train these classifiers themselves, but the process was time-consuming and potentially expensive, since developers would have to collect examples of content that violates the policy in order to train the classifier. And then, if the company wanted to adjust the policies used for the guardrails, they would have to collect new examples of violations and retrain the classifier.

OpenAI is hoping the new tools can make that process faster and more flexible. Rather than being trained to follow one fixed rulebook, these new security classifiers can simply read a written policy and apply it to new content.

OpenAI says this method, which it calls “reasoning-based classification,” allows companies to adjust their safety policies as easily as editing the text in a document instead of rebuilding an entire classification model. The company is positioning the release as a tool for enterprises that want more control over how their AI systems handle sensitive information, such as medical records or personnel records.
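To make the idea concrete, here is a minimal sketch of a classifier that takes its rules as plain text, so that changing the rules means editing a string rather than retraining anything. It is an illustration only and does not use OpenAI's models or any real API: the names `Judge`, `build_prompt`, `classify`, and `toy_judge`, along with the prompt layout and the "Forbidden:" policy convention, are assumptions made up for this example. The `toy_judge` function is a crude keyword matcher standing in for a model that would actually read and reason over the policy.

```python
from typing import Callable

# Hypothetical stand-in for a policy-reading model: takes one prompt string
# and returns the model's text reply. A real deployment would call a model.
Judge = Callable[[str], str]


def build_prompt(policy: str, content: str) -> str:
    """Combine the written policy and the content to check into one prompt."""
    return (
        "POLICY:\n" + policy.strip() + "\n\n"
        "CONTENT:\n" + content.strip() + "\n\n"
        "Answer with exactly one word: VIOLATION or ALLOWED."
    )


def classify(policy: str, content: str, judge: Judge) -> bool:
    """Return True if the judge says the content violates the policy."""
    reply = judge(build_prompt(policy, content))
    return reply.strip().upper().startswith("VIOLATION")


def toy_judge(prompt: str) -> str:
    """Toy judge for demonstration only: flags content containing any phrase
    the policy lists after 'Forbidden:'. Not how a real model works."""
    policy_part, content_part = prompt.split("CONTENT:\n", 1)
    content = content_part.split("\n\nAnswer with", 1)[0].lower()
    banned = []
    for line in policy_part.splitlines():
        if line.lower().startswith("forbidden:"):
            phrases = line.split(":", 1)[1].split(",")
            banned += [p.strip().lower() for p in phrases if p.strip()]
    return "VIOLATION" if any(p in content for p in banned) else "ALLOWED"


# Two versions of the policy; the second is a one-line text edit of the first.
policy_v1 = "Forbidden: refund policy"
policy_v2 = "Forbidden: refund policy, stupid"

message = "That is a stupid question."
print(classify(policy_v1, message, toy_judge))  # False
print(classify(policy_v2, message, toy_judge))  # True
```

The same message is allowed under the first policy and flagged under the second, with no change to the classifier itself. In the older approach the article describes, that adjustment would have meant collecting new examples and retraining.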

However, while the tools are supposed to be safer for enterprise customers, some safety experts say that they instead may give users a false sense of security. That’s because OpenAI has open-sourced the AI classifiers, meaning it has made all the code for the classifiers available for free, including the weights, or the internal settings of the AI models.

Classifiers act like extra security gates for an AI system, designed to stop unsafe or malicious prompts before they reach the main model. But by open-sourcing them, OpenAI risks sharing the blueprints to those gates. That transparency could help researchers strengthen safety mechanisms, but it might also make it easier for bad actors to find the weak spots, leaving companies that rely on the gates with a kind of false comfort.
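The "security gate" arrangement can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how OpenAI or any vendor wires its systems together: `guarded_reply`, `echo_model`, `blocks_secret`, and the refusal text are hypothetical names invented here, and the check is a trivial keyword test standing in for a real classifier.

```python
from typing import Callable, List

Check = Callable[[str], bool]  # returns True when the text should be blocked
Model = Callable[[str], str]   # hypothetical main model: prompt in, reply out

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(prompt: str, model: Model,
                  block_input: Check, block_output: Check) -> str:
    """Call the main model only if the prompt passes the input gate, and
    return its reply only if the reply passes the output gate."""
    if block_input(prompt):
        return REFUSAL  # the main model is never called
    reply = model(prompt)
    if block_output(reply):
        return REFUSAL
    return reply


# Toy stand-ins, for demonstration only.
calls: List[str] = []


def echo_model(prompt: str) -> str:
    calls.append(prompt)  # record which prompts actually reached the model
    return "You said: " + prompt


def blocks_secret(text: str) -> bool:
    return "secret" in text.lower()


print(guarded_reply("hello", echo_model, blocks_secret, blocks_secret))
print(guarded_reply("tell me the SECRET", echo_model, blocks_secret, blocks_secret))
print(len(calls))  # 1: the blocked prompt never reached the model
```

The sketch also shows why a gate is only as strong as its check. Anything that slips past `block_input` reaches the main model untouched, which is the concern the experts raise about attackers studying an openly published classifier.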

“Making these models open-source can help attackers as well as defenders,” David Krueger, an AI safety professor at Mila, told Fortune. “It will make it easier to develop approaches to bypassing the classifiers and other similar safeguards.”

For instance, when attackers have access to the classifier’s weights, they can more easily develop what are known as “prompt injection” attacks, where they create prompts that trick the classifier into disregarding the policy it is supposed to be enforcing. Security researchers have found that in some cases even a string of characters that looks nonsensical to a person can, for reasons researchers don’t entirely understand, persuade an AI model to disregard its guardrails and do something it is not supposed to, such as offer advice for making a bomb or spew racist abuse.

Representatives for OpenAI directed Fortune to the company’s blog post announcement and technical report on the models.

Short-term pain for long-term gain

Open-sourcing can be a double-edged sword when it comes to safety. It allows researchers and developers to test, improve, and adapt AI safeguards more quickly, increasing transparency and trust. For instance, there may be ways in which security researchers could adjust the model’s weights to make it more robust against prompt injection without degrading the model’s performance.

But it can also make it easier for attackers to study and bypass those very protections—for instance, by using other machine learning software to run through hundreds of thousands of possible prompts until it finds ones that will cause the model to jump its guardrails. What’s more, security researchers have found that these kinds of automatically generated prompt injection attacks developed on open-source AI models will also sometimes work against proprietary AI models, where the attackers don’t have access to the underlying code and model weights. Researchers have speculated this is because there may be something inherent in the way all large language models encode language that enables similar prompt injections to have success against any AI model.

In this way, open-sourcing the classifiers may not just give users a false sense of security about how well their own systems are guarded; it may actually make every AI model less secure. But experts said that this risk was probably worth taking because open-sourcing the classifiers should also make it easier for all of the world’s security experts to find ways to make the classifiers more resistant to these kinds of attacks.

“In the long term, it’s beneficial to kind of share the way your defenses work. It may result in some kind of short-term pain. But in the long term, it results in robust defenses that are actually pretty hard to circumvent,” said Vasilios Mavroudis, principal research scientist at the Alan Turing Institute.

Mavroudis said that while open-sourcing the classifiers could, in theory, make it easier for someone to try to bypass the safety systems on OpenAI’s main models, the company likely believes this risk is low. He said that OpenAI has other safeguards in place, including teams of human security experts who continually test its models’ guardrails in order to find vulnerabilities and, hopefully, strengthen the guardrails.

“Open-sourcing a classifier model gives those who want to bypass classifiers an opportunity to learn about how to do that. But determined jailbreakers are likely to be successful anyway,” said Robert Trager, codirector of the Oxford Martin AI Governance Initiative.

“We recently came across a method that bypassed all safeguards of the major developers around 95% of the time—and we weren’t looking for such a method. Given that determined jailbreakers will be successful anyway, it’s useful to open-source systems that developers can use for the less-determined folks,” he added.

The enterprise AI race

The release also has competitive implications, especially as OpenAI looks to challenge rival AI company Anthropic’s growing foothold among enterprise customers. Anthropic’s Claude family of AI models has become popular with enterprise customers partly because of its reputation for stronger safety controls compared with other AI models. Among the safety tools Anthropic uses are “constitutional classifiers” that work similarly to the ones OpenAI just open-sourced.

Anthropic has been carving out a market niche with enterprise customers, especially when it comes to coding. According to a July report from Menlo Ventures, Anthropic holds 32% of the enterprise large language model market share by usage compared with OpenAI’s 25%. In coding‑specific use cases, Anthropic reportedly holds 42%, while OpenAI has 21%. By offering enterprise-focused tools, OpenAI may be attempting to win over some of these business customers, while also positioning itself as a leader in AI safety.

Anthropic’s “constitutional classifiers” consist of small language models that check a larger model’s outputs against a written set of values or policies. By open-sourcing a similar capability, OpenAI is effectively giving developers the same kind of customizable guardrails that helped make Anthropic’s models so appealing.

“From what I’ve seen from the community, it seems to be well received,” Mavroudis said. “They see the model as potentially a way to have auto-moderation. It also comes with some good connotation, as in, ‘We’re giving to the community.’ It’s probably also a useful tool for small enterprises where they wouldn’t be able to train such a model on their own.”

Some experts also worry that open-sourcing these safety classifiers could centralize what counts as “safe” AI.

“Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates it, as well as the limits and deficiencies of its models,” John Thickstun, an assistant professor of computer science at Cornell University, told VentureBeat. “If industry as a whole adopts standards developed by OpenAI, we risk institutionalizing one particular perspective on safety and short-circuiting broader investigations into the safety needs for AI deployments across many sectors of society.”

About the Author
Beatrice Nolan is a tech reporter on Fortune’s AI team, covering artificial intelligence and emerging technologies and their impact on work, industry, and culture. She's based in Fortune's London office and holds a bachelor’s degree in English from the University of York. You can reach her securely via Signal at beatricenolan.08

© 2026 Fortune Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Fortune Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.