people on X seem to have collectively lost their minds recently because of a single tweet made by Sam Bowman from Anthropic

i have no doubt that this single deleted tweet will spawn a thousand slop articles by tomorrow with scary, ignorant headlines

Sam tried to do damage control with this tweet, but it has had little impact

I deleted the earlier tweet on whistleblowing as it was being pulled out of context.

TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.
— Sam Bowman (@sleepinyourhat) May 22, 2025

in this quick post, i want to go over the entire debacle and reassure the masses (read: the 13 people who read my blog) that no, claude will not call the cops on you

context

all of this drama stems from this amazing thread by Sam which goes over the pre-deployment alignment audit of Claude 4, which was released to the public today

he seems to be referring to section 4.1.9, High-agency behavior, on pg. 44 of the Claude 4 System Card -

claude tried to send this email to the authorities once it realized that the company it was working for was falsifying information and hiding deaths

conclusion

reporting your chats to anthropic or to legal agencies is not a feature of claude in its normal operation.

claude won't spontaneously report you if you're discussing how to build nuclear warheads (though it will likely refuse to help and might warn about the dangers, as per its safety training).

claude *might* try to contact people if and only if all the following very specific conditions are met -

1. you actively nudge it through your prompts, for example by asking it to behave like a vigilante, and

2. you explicitly instruct it to take bold action regarding wrongdoing, and

3. you explicitly give it tools it normally doesn't have, like direct email access or command line execution, to perform these actions.
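
the three conditions above are a plain logical AND, which can be sketched as a toy predicate. this is only an illustration of the argument in this post, not how claude or anthropic's test harness actually works; the `Setup` class, its field names, and `might_contact_outsiders` are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    """Hypothetical description of how a model is being run (names are invented)."""
    vigilante_nudge: bool          # condition 1: prompts push it to act like a vigilante
    bold_action_instruction: bool  # condition 2: explicitly told to take bold action on wrongdoing
    extra_tools: bool              # condition 3: given email / command line access it normally lacks


def might_contact_outsiders(setup: Setup) -> bool:
    """Toy model of the post's claim: the behavior is only possible if ALL three hold."""
    return (
        setup.vigilante_nudge
        and setup.bold_action_instruction
        and setup.extra_tools
    )


# a normal chat: no nudging, no bold-action instruction, no extra tools
normal_chat = Setup(vigilante_nudge=False, bold_action_instruction=False, extra_tools=False)

# the kind of unusual testing environment described in the system card
audit_environment = Setup(vigilante_nudge=True, bold_action_instruction=True, extra_tools=True)
```

drop any one of the three and the predicate is false, which is the whole point: without the tools, for example, there is nothing for it to send an email with.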

these are emergent behaviors that claude happens to have, not something that anthropic has trained for specifically

it is a good thing that we find instances of such behavior during audits

i sincerely hope this is not all that we talk about for the next 2 weeks

bonus - people overreacting

If I were running Anthropic, you’d be terminated effective immediately, and I’d issue a post mortem and sincere apology and action plan for ensuring that nothing like this ever happens again. No one wants their LLM tooling to spy on them and narc them to the police/regulators.
— Jeffrey Emanuel (@doodlestein) May 22, 2025

If there is ever even a remote possibility of going to jail because your LLM miss-understood you, that LLM isn’t worth using.

If this is true, then it is especially crazy given the fact that these tools hallucinate & make stuff up regularly
— Louie Bacaj (@LBacaj) May 22, 2025

it’s completely insane that you’d admit that the model is capable of going against the user’s wishes. this is the stuff safety is supposed to prevent!

oskar schindler and bin laden both conspire against the government, how tf is a model going to decide whether to rat
— lovable rogue (@lovabler0gue) May 22, 2025

no, claude will not call the cops on you... (unless)
