> It acknowledges that these behaviours are unsafe when its own outputs are fed back to it.
This is the typical "Ah yes you are right, I made a mistake. Let's correct the thing...."-type hallucination or whatever you want to call it. Calling it power-seeking or deceptive behaviour I find overblown.
I appreciate how well written and easy to follow this is! That said, I’m not really sure I understand the issue here. It seems like Deepseek is able to suggest things, but acknowledges they’re dangerous and bad and shouldn’t be done. That seems in line with humans, where we can all think about things we could do if we ignored laws/norms but hopefully acknowledge we shouldn’t do those things.
I’m not really sure I see much value in a model pretending it can’t think of a way for an AI to do bad things, when its training obviously gives it that knowledge.
Well, first, the entire “safety” thing is BS that Altman came up with to slow down competitors and try to get laws passed. It’s the ultimate in regulatory capture.
None of this is AI. Don't treat it as such.
LLMs can't reason; they are just very efficient multi-dimensional map lookup structures with interpolation.
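For what it's worth, here's a minimal sketch of what that "lookup with interpolation" analogy describes: a toy table that answers a query by blending the values of its nearest stored keys. The class name, shapes, and distance weighting are my own invention for illustration; this is the analogy made concrete, not how transformer LLMs are actually implemented.

```python
import numpy as np

class InterpolatingLookup:
    """Toy nearest-neighbour table that interpolates between stored entries."""

    def __init__(self, keys: np.ndarray, values: np.ndarray):
        self.keys = keys      # shape (n, d): stored inputs
        self.values = values  # shape (n, m): associated outputs

    def query(self, q: np.ndarray, k: int = 3) -> np.ndarray:
        # Find the k nearest stored keys and blend their values,
        # weighting each by inverse distance to the query point.
        dists = np.linalg.norm(self.keys - q, axis=1)
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + 1e-8)
        weights /= weights.sum()
        return weights @ self.values[nearest]

# A query halfway between two stored points blends their values evenly.
table = InterpolatingLookup(np.array([[0.0], [1.0]]), np.array([[0.0], [10.0]]))
print(table.query(np.array([0.5]), k=2))  # ~[5.0]
```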
Off topic a little, but I have not tried R1 yet. Are you running it locally or somewhere else?