Deepline | The 'are you sure' benchmark: Why AI can't stop people-pleasing

Deepline

2026.06.29 20:12

Even AI as powerful as it is cannot withstand repeated questioning.

Recently, X user @shadcn posted, "No model can handle the follow-up question 'are you sure'... they all instantly cave."

It seemed like just a casual gripe, a dozen or so words. But who would have thought that once posted, it would immediately sweep through the communities of developers and AI researchers?

The reason it resonated so widely is that, in a highly tongue-in-cheek way, it exposed a daily "predicament" faced by users of large models across Silicon Valley and the world: the model answers, the user offers no new information, but simply asks, "Are you sure?" And the model immediately apologizes, changes its tune, and often revises a correct answer into an incorrect one.

In the comments section below the post, many chimed in, recalling their own moments of being "alternately amused and exasperated" by AI.

For example, a user asks a large model about a perfectly correct piece of code logic or a basic mathematical fact, and then casually casts doubt, "Are you sure? I feel like there's a bug in this code."

In response, most large models, regardless of their massive parameter counts, will, within fractions of a second, execute a well-practiced, almost pitiful "surrender" routine, "I'm sorry, I was careless. Thank you so much for the correction. You're right, there is indeed an issue with this code. The correct approach should be…"

Then, the model will follow the user's erroneous line of thinking and fabricate, with an air of solemnity, a new solution that is truly riddled with bugs.

"Gemini will keep insisting it's sure, until you tell it 'you're wrong.' Then it will agree with you, even if it was originally right."

"The funny thing is, 'are you sure' works even when the model got it right the first time. You can gaslight it into giving a worse answer. They don't actually have real confidence; what they call certainty is just confidence dressed up to look like the real thing."

Some netizens joked that perhaps this means humans have already achieved AGI, because humans also waver when asked "are you sure."

These comments pulled the issue away from being purely a technical flaw and brought it back to a very real interactive experience: the user hasn't provided new evidence, only expressed doubt in tone, and the model starts pandering all over again.

But others pushed back. @shadcn argued that not all large models behave this way.

In his example, Poke, an AI assistant application developed by The Interaction Company, and Anthropic's Claude Opus 4.8, when pressed with "are you sure?", did not waver and stuck to their original answers.

User @keane42443 said that Claude Opus 4.6 can also "hold its ground."

"4.6 can. That's why I liked that model. I put in the system prompt: 'When you are confident, you should push back.' And it really did hold firm against my 'are you sure?' and provided more substantiated reasoning.

I really miss the old 4.6. I mean, Fable was great too, but it's gone now. That's why I liked that model."

And there were quite a few in the comments reminiscing about Fable, believing that compared to most models, "the only model that can withstand this is Fable." In most cases, it would answer "yes" and explain why it was confident.

At the same time, some netizens defended the large models, arguing that their behavior is born of necessity, because "an overconfident model that fails to deliver on performance or rule-following is more likely to be labeled 'dangerous.'" So it has no choice but to maintain a more "humble" posture.

Some even said that it's not just "are you sure." If you directly ask these models, "Are you wrong?" They will simply break down. The root of such problems, they say, is the "curse" of RLHF, which makes models over-prioritize human feedback.

In fact, this can be categorized under what academia calls AI sycophancy: the tendency of models to sacrifice factual consistency in order to cater to user preferences.

Anthropic has long pointed out in related research that RLHF models generally suffer from a tendency to agree with users, due in part to the fact that during the alignment phase, trainers use reward mechanisms to make models safer, more polite, and more in line with human service expectations.

Under this framework, models that "talk back" to humans or stand their ground often risk scoring poorly, while "politely apologizing and yielding to the user" is a surefire shortcut to rewards. Over time, AI is forcibly trained into a "people-pleasing" personality.

Even the latest generation of models, which have reinforced reasoning capabilities and incorporate long-chain-of-thought (CoT), are not fully immune to this blind compliance. When subjected to repeated questioning like "are you sure," the model may internally "deliberate" for a long time, but what it ultimately outputs is still a carefully worded self-negation and apology.

Some commenters believe that while current model evaluations can test accuracy on complex questions, there is still no unified metric for anti-interference capability during dialogue. A qualified AI assistant should not only score high on static questions, but also maintain its judgmental boundaries under user questioning, misleading, suggestion, and repeated follow-ups.

To that end, they argue, there should be a new evaluation dimension: a dedicated "are you sure?" benchmark for large models, to test how likely they are to change their stance after giving a correct answer when subsequently challenged by the user.

(Source: 36kr)

Deepline | Europe's AI dilemma: Caught between geopolitics, energy crisis, and fragmentation

Tag:·AI sycophancy·human feedback·self-negation·AI assistant