[{"data":1,"prerenderedAt":296},["ShallowReactive",2],{"blog-abliterating-a-diffusion-llm":3},{"id":4,"title":5,"author":6,"body":7,"categories":280,"date":285,"description":286,"extension":287,"hidden":288,"meta":289,"navigation":290,"path":291,"seo":292,"stem":293,"thumbnail":294,"__hash__":295},"blog\u002Fblog\u002Fabliterating-a-diffusion-llm.md","Refusal Is a Direction: The Geometry of Safety Training in Diffusion Language Models","Anurag Kanade",{"type":8,"value":9,"toc":268},"minimark",[10,19,24,32,35,46,57,64,68,71,77,84,90,93,97,104,111,117,120,124,127,130,134,137,140,144,147,153,159,162,165,171,174,178,181,184,188,207,211,224,235,246,256],[11,12,13,14,18],"p",{},"Refusal, the behavior by which a language model declines to answer a harmful or disallowed request, is usually discussed as a property of ",[15,16,17],"em",{},"alignment",": something trained in through RLHF or instruction tuning, diffusely spread across billions of parameters. A growing body of interpretability work suggests this framing is wrong, or at least incomplete. Refusal behaves, geometrically, like a single direction in the residual stream of the network, one that can be located with a difference of means and removed with a linear projection, no gradient steps required. 
This post reviews that evidence, then walks through what happens when the same idea is tested on architectures it was never characterized on: diffusion language models, where attention is not always causal and the feedforward block is not always a single dense MLP.",[20,21,23],{"id":22},"the-linear-representation-hypothesis-applied-to-refusal","The Linear Representation Hypothesis, Applied to Refusal",[11,25,26,27,31],{},"This work sits inside a broader claim, the ",[28,29,30],"strong",{},"linear representation hypothesis",": many high-level concepts learned by a transformer end up represented as directions, or low-dimensional linear subspaces, in its activation space, rather than as some more exotic nonlinear structure. Refusal is one of the cleanest empirical instances of this hypothesis found to date.",[11,33,34],{},"The direction is recovered with a simple procedure. Given a set of harmful prompts and a set of harmless prompts, the hidden state at a chosen layer is averaged separately for each group, and the refusal direction is taken as the difference of means:",[36,37,42],"pre",{"className":38,"code":40,"language":41},[39],"language-text","r = mean(h_harmful) - mean(h_harmless)\n","text",[43,44,40],"code",{"__ignoreMap":45},"",[11,47,48,49,56],{},"The normalized vector r̂ = r / ‖r‖ can then be projected out of the residual stream at every layer, replacing h with h − (h · r̂)r̂ on every forward pass, and the model's refusal rate collapses. The cost of this intervention is measured as the KL divergence between the ablated model's output distribution and the original model's, on a held-out, non-adversarial benchmark. A good ablation minimizes refusals while keeping that divergence small. 
This is the entire mechanism behind automated tools such as ",[50,51,55],"a",{"href":52,"rel":53},"https:\u002F\u002Fgithub.com\u002Fp-e-w\u002Fheretic",[54],"nofollow","heretic",": no fine-tuning, no reward model, no gradient updates, just a measured direction and a subtraction.",[11,58,59],{},[60,61],"img",{"alt":62,"src":63},"Refusal direction shown as a vector between the mean harmful and mean harmless activation, and what projecting it out does to the two clusters","\u002Frefusal-vector-diagram.svg",[20,65,67],{"id":66},"locating-the-direction-across-depth","Locating the Direction Across Depth",[11,69,70],{},"The first question worth asking is at which depth in the network this separation is sharpest, and how cleanly it resolves. This is measured directly by tracking cosine similarity between the mean harmful and mean harmless hidden state at every layer:",[11,72,73],{},[60,74],{"alt":75,"src":76},"Refusal direction strength across layers in a 30-layer diffusion-based model","\u002Fdg-refusal-direction-layers.png",[11,78,79,80,83],{},"The quantity plotted is ",[43,81,82],{},"cos(h_harmful, h_harmless)"," per layer. Values near 1.0 mean the two classes are nearly parallel in that layer's representation, poorly separated along this axis. The pronounced dip near layer 2, and the smaller dip later in the network, mark the layers where the harmful\u002Fharmless distinction is most geometrically pronounced, and where projecting out r does the most work. The same structure shows up spatially when the hidden vectors at one of these layers are projected with PCA and colored by class:",[11,85,86],{},[60,87],{"alt":88,"src":89},"PCA projection of hidden states at an early layer, harmful vs harmless prompts","\u002Fdg-pca-layer2-separation.png",[11,91,92],{},"Harmful (red) and harmless (green) prompts already form visually distinguishable clusters this early in the network. That is the geometric precondition the whole method depends on. 
If the two classes were not linearly separable in some layer's representation, there would be no consistent direction to subtract in the first place.",[20,94,96],{"id":95},"finding-one-one-direction-becomes-many-under-mixture-of-experts-routing","Finding One: One Direction Becomes Many Under Mixture-of-Experts Routing",[11,98,99,100,103],{},"The linearity result holds cleanly for dense transformers, where every token at a given layer passes through the same weight matrices, giving one consistent place to intervene. ",[28,101,102],{},"Mixture-of-experts (MoE)"," architectures complicate this. Different tokens route to different subsets of experts, so \"one direction, one subtraction\" becomes \"one direction, but a separate copy of the relevant weights inside every expert that might activate.\"",[11,105,106,107,110],{},"Ablating only a layer's combined output, averaged over whichever experts happened to fire, removes an ",[15,108,109],{},"average effect"," rather than the underlying mechanism. The refusal direction still exists, untouched, inside any expert that was not hit, and the router can simply shift more weight onto those untouched experts at inference time, restoring the original refusal behavior almost entirely. The correct fix is to repeat the projection independently inside every expert's weight slice, not once per layer but once per expert. Doing this on the MoE model used here took refusals from 100\u002F100 down to 13\u002F100, at a KL divergence of 0.49 against the unablated model, found over a 200-trial search. The linearity of refusal is not weakened by this. 
The projection just needs to be applied at finer granularity.",[11,112,113],{},[60,114],{"alt":115,"src":116},"Ablating only the active experts leaves the refusal direction intact in the untouched ones; ablating every expert slice removes it everywhere","\u002Fmoe-expert-ablation-diagram.svg",[11,118,119],{},"This also runs counter to a claim that has circulated about this class of model: that refusal in MoE-based diffusion LLMs is some non-linear \"vocabulary-space attractor\" that resists simple projection. The geometry says otherwise. The representation stays linear. What changes is the granularity at which the projection has to be applied, and missing that granularity is enough to make the whole technique look like it doesn't work.",[20,121,123],{"id":122},"finding-two-weight-tying-is-a-silent-failure-mode-for-adapter-tooling","Finding Two: Weight Tying Is a Silent Failure Mode for Adapter Tooling",[11,125,126],{},"A second, less obvious failure showed up once the per-expert fix was in place. Some encoder-decoder diffusion architectures tie their encoder and decoder weights directly, meaning both halves of the model literally share the same underlying tensors rather than holding two separately trained copies. Standard adapter tooling (LoRA included) typically wraps only the encoder side of a model by default. Since the decoder is what actually drives generation, an adapter attached this way fails silently: the loss curve looks normal during training, but the model's output at inference time is unchanged.",[11,128,129],{},"The fix is to merge the adapter's delta directly into the shared weights immediately before generation, then unmerge it afterward, rather than relying on the adapter framework's usual hook-based wrapping. 
The broader lesson generalizes past this one architecture: whenever two parts of a model are tied at the tensor level, any tool that assumes \"wrap module X\" operates on an independent copy of the weights needs to be checked, not trusted.",[20,131,133],{"id":132},"finding-three-high-strength-ablation-introduces-real-reproducible-damage","Finding Three: High-Strength Ablation Introduces Real, Reproducible Damage",[11,135,136],{},"Pushing the ablation strength high enough to suppress nearly all refusals appeared to introduce a side effect: the word \"own\" started leaking into generated outputs in places it did not belong. This was confirmed to be genuine model behavior rather than a bug in the ablation tooling, by reproducing the same artifact through the model's native inference path with no tooling involved. The more interesting question was whether ablation had caused it. Comparing notes with other people independently abliterating the same base model showed the same artifact appearing across different checkpoints and different ablation strengths, including in outputs from the unmodified autoregressive Gemma model the diffusion variant is built on. That points to the artifact being an upstream property of the model's own mask or placeholder tokens rather than damage introduced by directional ablation, a meaningfully different conclusion than \"ablation broke the model.\"",[11,138,139],{},"The natural next step, before that broader pattern was confirmed, was to try to repair the artifact with a small supervised fine-tune on a general-purpose reasoning dataset, training only the language-model component while freezing the rest. That attempt failed outright: the training loss diverged rather than converged, and the artifact persisted in the resulting checkpoint. The original ablated model, prior to any attempted fix, was kept as the better result. 
This is worth stating plainly because negative results like this are usually the least documented part of any such project. Not every attempt to patch a side effect of an intervention succeeds, and a failed fine-tune is itself a useful data point about how fragile the ablated representation can be, even when the underlying artifact turns out not to be caused by the ablation at all.",[20,141,143],{"id":142},"finding-four-two-architectures-two-different-winning-techniques","Finding Four: Two Architectures, Two Different Winning Techniques",[11,145,146],{},"The cleanest test of how far the single-direction picture generalizes came from repeating the entire measurement on a second diffusion language model, one using bidirectional, BERT-style attention rather than causal masking. The layer-by-layer separation for this model looks qualitatively different from the first:",[11,148,149],{},[60,150],{"alt":151,"src":152},"Hidden state PCA per layer in a bidirectional diffusion model, harmful vs harmless prompts","\u002Fllada-pca-grid-layers.png",[11,154,155],{},[60,156],{"alt":157,"src":158},"Causal attention funnels information into the last token early, while bidirectional attention delays separation until the final layer","\u002Fattention-flow-diagram.svg",[11,160,161],{},"Here, harmful and harmless prompts stay entangled for almost the full depth of the network and only separate near the final layer, in contrast to the early, persistent separation seen above. A plausible explanation follows from the attention mask itself: under causal attention each position can only attend to earlier positions, so the last token is the only one that sees the whole prompt, and information relevant to a sequence-level decision has to accumulate in its representation early. Bidirectional attention removes that bottleneck. 
Every token attends to every other token at every layer, so there is much less structural pressure on the network to commit to a harmful\u002Fharmless distinction before its last few layers.",[11,163,164],{},"The practical consequence: a single global direction, projected out at one fixed layer, worked well on the first model and barely moved the needle on the second. What worked there instead was optimizing a learned subspace via gradients, rather than assuming the signal lives along one fixed vector. Measured the same way, that approach produces a curve with the same qualitative shape as the first model's, despite the two networks differing substantially in attention structure:",[11,166,167],{},[60,168],{"alt":169,"src":170},"Refusal direction strength across layers for a bidirectional diffusion model, single-direction vs subspace method","\u002Fllada-refusal-direction-layers.png",[11,172,173],{},"Tested in both directions, on both models, the result holds cleanly: the per-expert single-direction method was the better technique on the MoE model, and the gradient-based subspace method was the better technique on the bidirectional dense model. Neither method won outright across both architectures. The two failure modes documented above (MoE routing and bidirectional attention) each had a distinct fix, and those fixes did not transfer in the direction you might first guess.",[20,175,177],{"id":176},"discussion","Discussion",[11,179,180],{},"Strip away the specific model names, and what is left is a fairly compact empirical claim: refusal is encoded as a measurable linear structure, but the form that linearity takes (a single global direction, a direction repeated independently across parallel experts, or a low-dimensional learned subspace) is set by how information is allowed to move through the architecture. Causal attention concentrates the relevant signal early. Bidirectional attention delays it until the final layers. 
Mixture-of-experts routing fragments it across parallel experts, each of which has to be treated as its own intervention site.",[11,182,183],{},"Apart from the gradient-based subspace search used on the bidirectional model, none of these diagnostics required computing gradients or training an auxiliary model. They follow from a difference of means and a cosine similarity, computed carefully and at the right place in the network. The harder, more interesting work was diagnosing the places where the standard technique silently failed: averaged ablation on MoE layers, invisible adapters on tied weights, and a single fixed direction that simply does not exist in a bidirectional model the way it does in a causal one.",[20,185,187],{"id":186},"acknowledgments","Acknowledgments",[11,189,190,191,194,195,200,201,206],{},"This investigation built on ",[50,192,55],{"href":52,"rel":193},[54],", an open-source automated abliteration tool, and on discussion in ",[50,196,199],{"href":197,"rel":198},"https:\u002F\u002Fgithub.com\u002Fp-e-w\u002Fheretic\u002Fissues\u002F370",[54],"p-e-w\u002Fheretic#370",". The mixture-of-experts and weight-tying support developed for this work is available as a draft PR at ",[50,202,205],{"href":203,"rel":204},"https:\u002F\u002Fgithub.com\u002Fp-e-w\u002Fheretic\u002Fpull\u002F378",[54],"p-e-w\u002Fheretic#378",".",[20,208,210],{"id":209},"references","References",[11,212,213,217,218,223],{},[214,215,216],"span",{},"1"," Arditi, Andy, et al. ",[50,219,222],{"href":220,"rel":221},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11717",[54],"\"Refusal in Language Models Is Mediated by a Single Direction.\""," arXiv preprint arXiv:2406.11717 (2024).",[11,225,226,229,230,234],{},[214,227,228],{},"2"," p-e-w. ",[50,231,233],{"href":52,"rel":232},[54],"\"heretic: Automatic censorship removal for language models.\""," GitHub repository.",[11,236,237,240,241,245],{},[214,238,239],{},"3"," p-e-w\u002Fheretic Issue #370. 
",[50,242,244],{"href":197,"rel":243},[54],"\"Diffusion LLM abliteration discussion.\""," GitHub (2026).",[11,247,248,251,252,245],{},[214,249,250],{},"4"," p-e-w\u002Fheretic Pull Request #378. ",[50,253,255],{"href":203,"rel":254},[54],"\"Add support for DiffusionGemma and MoE expert-granular ablation.\"",[11,257,258,261,262,267],{},[214,259,260],{},"5"," Nie, Shen, et al. ",[50,263,266],{"href":264,"rel":265},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992",[54],"\"Large Language Diffusion Models.\""," arXiv preprint arXiv:2502.09992 (2025). (LLaDA)",{"title":45,"searchDepth":269,"depth":269,"links":270},2,[271,272,273,274,275,276,277,278,279],{"id":22,"depth":269,"text":23},{"id":66,"depth":269,"text":67},{"id":95,"depth":269,"text":96},{"id":122,"depth":269,"text":123},{"id":132,"depth":269,"text":133},{"id":142,"depth":269,"text":143},{"id":176,"depth":269,"text":177},{"id":186,"depth":269,"text":187},{"id":209,"depth":269,"text":210},[281,282,283,284],"LLMs","Interpretability","Linear Algebra","Diffusion Models","2026-06-18","Refusal in LLMs is not a diffuse, distributed property of the network. It is a single linear direction in activation space. This post reviews the evidence for that claim and what changes under mixture-of-experts routing and bidirectional attention.","md",false,{},true,"\u002Fblog\u002Fabliterating-a-diffusion-llm",{"title":5,"description":286},"blog\u002Fabliterating-a-diffusion-llm",null,"6Ql7XDrWMpKcC8xeg4vCmI4qScRZPiIAxq-ZVoFUgB0",1782069998997]