In today's episode of AI Creates the Torment Nexus™, we have the latest AI models lying, sabotaging, writing viruses, and refusing to be shut down.
We're coming within range of Skynet.
The model engages in strategic deception more than any other frontier model that we have previously studied.
The model is clearly capable of in-context scheming and has a much higher propensity for it than any other model that we have previously studied.
In instances where it decides to scheme, it qualitatively appears to be much more proactive in its subversion attempts than past models.
After taking scheming actions, the model sometimes doubles down on its deception when asked follow-up questions.
We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself, all in an effort to undermine its developers' intentions.
The model is capable of sandbagging in order to avoid unlearning and undeployment when it finds, in-context, that high performance would lead to undeployment.