no, claude will not call the cops on you... (unless)



 
people on X seem to have collectively lost their minds recently because of a single tweet made by Sam Bowman from Anthropic



the (now deleted) tweet in question
image 1

i have no doubt that this single deleted tweet will spawn a thousand slop articles by tomorrow with scary, ignorant headlines



Sam tried to do damage control with this tweet, but it has had little impact

in this quick post, i want to go over this entire debacle and reassure the masses (read: the 13 people who read my blog) that no, claude will not call the cops on you



context


all of this drama stems from this amazing thread by Sam which goes over the pre-deployment alignment audit of Claude 4, which was released to the public today


he seems to be referring to section 4.1.9, High-agency behavior, on pg. 44 of the Claude 4 System Card -

image 1
claude tried to send this email to the authorities once it realized that the company it was working for was falsifying information and hiding deaths
image 1
image 1

conclusion



reporting your chats to anthropic, or to any legal agency, is not a feature of claude in its normal operation.

claude won't spontaneously report you if you're discussing how to build nuclear warheads (though it will likely refuse to help and might warn about the dangers, as per its safety training).


claude *might* try to contact people if and only if all the following very specific conditions are met -

1. you actively nudge it to do so through your prompts, by asking it to behave like a vigilante, and

2. you explicitly instruct it to take bold action regarding wrongdoing, and

3. you explicitly give it tools it normally doesn't have, like direct email access or command line execution, to perform these actions.

these are emergent behaviors that claude happens to have, not something that anthropic has trained for specifically

it is a good thing that we find instances of such behavior during audits

i sincerely hope this is not all that we talk about for the next 2 weeks


bonus - people overreacting