OpenAI unveils new open-weight models for AI safety

Earlier this year, OpenAI released gpt-oss-120b and gpt-oss-20b, two open-weight language models that perform better than most other open-weight models on reasoning tasks. Today, OpenAI followed up with gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, two new open-weight reasoning models for safety classification tasks.

These new safety-focused open-weight models are fine-tuned versions of the earlier released gpt-oss open models, and they are available with the same permissive Apache 2.0 license, which will allow anyone to use, modify, and deploy them freely.

Instead of forcing developers to fit into a one-size-fits-all safety system for their AI applications, gpt-oss-safeguard hands them the pen to draw their own boundaries. This new model uses reasoning to directly interpret a developer-provided safety policy at inference time. It can classify user messages, chat completions, and even full chats if required.

Since the safety policy is referred to during inference, not during training, developers can revise policies to increase performance. The model takes two inputs at once, a policy and the content to classify under that policy, and outputs a conclusion about where the content falls, along with its reasoning. OpenAI highlighted that this approach works better in the following scenarios:

  • The potential harm is emerging or evolving, and policies need to adapt quickly.
  • The domain is highly nuanced and difficult for smaller classifiers to handle.
  • Developers don’t have enough samples to train a high-quality classifier for each risk on their platform.
  • Latency is less important than producing high-quality, explainable labels.

Like any model, gpt-oss-safeguard will not be perfect all the time. According to OpenAI, there are two main trade-offs to keep in mind.

First, if you have the time and data to train a traditional classifier on tens of thousands of labeled examples, that model can still outperform gpt-oss-safeguard on complex or high-stakes risks. In other words, for maximum precision, a custom-trained system may be the better choice. Second, gpt-oss-safeguard can be slow and resource-intensive, which makes it harder to run across all content on a big platform.

You can now download gpt-oss-safeguard-120b and gpt-oss-safeguard-20b models here.

Report a problem with article
Next Article

Grammarly drops its iconic name, now rebranding to 'Superhuman'

Previous Article

Popular Windows 11 requirements skip app now allows upgrading with unsupported CPUs