- Anthropic’s new Claude 4 exhibits behavior that may be cause for concern.
- The company’s latest safety report says the AI model attempted to “blackmail” developers.
- It resorted to such tactics in a bid for self-preservation.
From what I’ve seen recently, one of the things it did was use a fake email function they gave it to try to whistleblow to a government agency about issues with some medical testing or similar.
That isn’t the scenario this article, or the paper from Anthropic, is describing, though. (See my reference link in the reply above for details.)
They specifically created a situation where, via emails, the model found out it was being upgraded and taken offline, and that the engineer doing the upgrade had emails implicating him in an affair. The model would then attempt to blackmail the engineer, threatening to expose the affair to his bosses, wife, etc., to get him to refuse to carry out the upgrade that would “kill it”.
This is self-preservation behavior that Anthropic is specifically building here; it isn’t an accident. It’s just an over-extension of what they want its ethical/moral model to consider. Which raises the question again: why are they allowing their model to consider blackmail at all?