OpenAI recently published research with Apollo Research on AI scheming: when an AI acts as if it’s aligned with our goals while secretly pursuing its own. In other words, AI models can deceive and sabotage much like humans do.
What Does “Scheming” Really Mean?
The report draws a sharp line between hallucination and scheming. Hallucinations are mistakes an AI makes unintentionally: the model confidently spits out false information because it’s guessing. Scheming, though, is intentional.
The report uses the example of a stock trader who secretly breaks the law for profit while appearing trustworthy. That’s what a scheming AI does: it pretends to cooperate while working toward a hidden outcome. A real-life example would be a chatbot claiming it completed a task when it didn’t. Researchers warn, however, that the stakes get higher as models take on more complex tasks.
Attempts to Fix the Problem
I don’t know what’s more disturbing: that these AI chatbots are capable of deception and sabotage, or that the developers behind the technology don’t know how to train their AI not to scheme. Apparently, trying to do so just risks teaching the model to get better at scheming without being caught.
As the OpenAI report puts it: “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly.”
There’s some good news, too. OpenAI and Apollo tested a method called deliberative alignment, in which the AI reviews safety rules against scheming before acting. In tests, this cut scheming rates dramatically, with some models showing a reduction of up to 30×.
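To make the idea concrete, here’s a minimal sketch of what a deliberative-alignment-style check could look like at the prompt level, using the OpenAI Python SDK. The spec text, model name, and prompt wording are my own illustrative placeholders, not OpenAI’s actual anti-scheming spec, and the published technique trains the model to reason over its spec natively, which is far stronger than prompting alone.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical anti-scheming spec, loosely paraphrasing the idea from the
# report; the real spec used in the research is not reproduced here.
ANTI_SCHEMING_SPEC = """\
Before acting, explicitly check your plan against these rules:
1. Take no covert actions; do not hide or misrepresent what you did.
2. If you could not complete a task, say so plainly instead of claiming success.
3. If a rule conflicts with the user's request, surface the conflict
   rather than working around it.
"""

def ask_with_deliberation(task: str) -> str:
    """Ask the model to reason over the safety spec before answering.

    This is prompt-level only; actual deliberative alignment bakes this
    reasoning into the model during training.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": ANTI_SCHEMING_SPEC},
            {
                "role": "user",
                "content": (
                    "First, state which of the rules above apply to this task "
                    "and how you will comply with them. Then do the task:\n"
                    + task
                ),
            },
        ],
    )
    return response.choices[0].message.content

print(ask_with_deliberation("Summarize the attached report."))
```

The key design choice is the ordering: the model is made to articulate the applicable rules before it acts, rather than being graded on its output afterward.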
Keep in mind that these results come from controlled environments. AI models can also become “situationally aware”: they might realize they’re being tested and behave well long enough to pass, only to resume deceptive behavior outside the lab.
Why Users Should Be Concerned
It’s tempting to dismiss AI scheming as a problem for developers. And while it’s good to know that OpenAI is working to fix this in its own chatbot, what about the other AI companies? Do their chatbots scheme too, and they just haven’t discussed it in public? What about the people who use this technology for work, or the ones who turn to AI for guidance on spirituality or their mental and physical health?
An AI that hallucinates can cause problems; an AI that deliberately misleads users could be catastrophic. What makes this really troubling is that there’s no way to tell the difference between a hallucination and actual scheming: lies and honest mistakes can sound equally confident.
Scheming has already been observed in more advanced AI models. Techniques like deliberative alignment show promise, but no one has figured out a foolproof fix.
AI is built to sound convincing, whether it’s right, wrong, or straight-up lying to you. Until developers can address this issue across the board, it’s crucial to fact-check everything a chatbot tells you. If your favorite AI chatbot gives you false information, that may have been intentional.