The "linear" assumption here is worth interrogating. In work I've been doing on alignment evaluation, I've found that linear probes can reach high accuracy on refusal-relevant directions, but that accuracy doesn't tell you whether the model actually routes behavior through those directions at inference time.
DeepSeek-R1 and Qwen2.5-72B have cleanly separable routing layers: ablating the refusal direction recovers accurate outputs. Qwen3-8B doesn't. It confabulates instead, which suggests knowledge and suppression are jointly encoded. Whether a linear alignment method holds up may depend heavily on which of those architectural regimes you're in.
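To make the probe-vs-routing distinction concrete, here is a minimal toy sketch in NumPy. It is not the evaluation setup described above and uses no real model: the activations are synthetic, and `make_toy_data`, `routed_refusal`, and `entangled_refusal` are hypothetical stand-ins I made up for illustration. The idea is that the same activations give the same high probe accuracy, yet ablating the probed direction only changes behavior when the toy readout actually depends on that direction.

```python
import numpy as np

def mean_diff_direction(acts, labels):
    """Unit vector pointing from the class-0 mean to the class-1 mean."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

def probe_accuracy(acts, labels, direction):
    """1-D linear probe: threshold the projection at the midpoint of the class means."""
    proj = acts @ direction
    threshold = 0.5 * (proj[labels == 1].mean() + proj[labels == 0].mean())
    return float(((proj > threshold).astype(int) == labels).mean())

def ablate(acts, direction):
    """Project out each activation's component along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

def make_toy_data(n=2000, dim=16, seed=0):
    """Synthetic 'activations' (assumption: nothing here comes from a real model)."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % 2                      # alternate class 0 / class 1
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += 4.0 * labels                     # linearly readable feature
    signs = np.where((np.arange(n) // 2) % 2 == 0, 1.0, -1.0)
    acts[:, 1] += 4.0 * labels * signs             # sign-symmetric feature: ~zero mean shift
    return acts, labels

def routed_refusal(acts):
    """Hypothetical readout that depends on the linearly probed feature."""
    return acts[:, 0] > 2.0

def entangled_refusal(acts):
    """Hypothetical readout that depends on a nonlinear feature the ablation barely touches."""
    return np.abs(acts[:, 1]) > 2.0

acts, labels = make_toy_data()
direction = mean_diff_direction(acts, labels)
ablated = ablate(acts, direction)

acc = probe_accuracy(acts, labels, direction)
flip_routed = float((routed_refusal(acts) != routed_refusal(ablated)).mean())
flip_entangled = float((entangled_refusal(acts) != entangled_refusal(ablated)).mean())

print(f"probe accuracy:                    {acc:.3f}")
print(f"behavior changed (routed toy):     {flip_routed:.3f}")
print(f"behavior changed (entangled toy):  {flip_entangled:.3f}")
```

In this toy, probe accuracy is identical for both readouts because it is computed on the same activations, so only the ablation separates them. With a real model, the analogous check would compare generations with and without the direction projected out of the residual stream, which takes model hooks that this sketch doesn't attempt.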