Natural Emergent Misalignment from Reward Hacking in Production RL [pdf]

(assets.anthropic.com)

3 points | by marcuschong 12 hours ago
