{"id":958,"date":"2026-05-23T18:32:09","date_gmt":"2026-05-23T10:32:09","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=958"},"modified":"2026-05-23T18:32:09","modified_gmt":"2026-05-23T10:32:09","slug":"nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=958","title":{"rendered":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible \u2014 and how does that mechanism get installed during training? New research from the Nous Research team takes a neuron-level look at this question. The team developed <strong>contrastive neuron attribution (CNA)<\/strong>, a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested \u2014 across Llama and Qwen architectures from 1B to 72B parameters \u2014 while keeping output quality above 0.97 at all steering strengths. A key finding: the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning does not create new structure. It transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Problem With Existing Steering Methods<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Contrastive Activation Addition (CAA)<\/strong> computes the average difference in <strong>residual stream<\/strong> activations between two contrastive prompt sets. 
The difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modifies the entire layer-wide signal without identifying which individual neurons are responsible. At high steering strengths, output quality degrades \u2014 models produce repeated words and incoherent text.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Sparse autoencoders (SAEs)<\/strong> decompose activations into interpretable features. They require expensive external training and are sensitive to activation noise.<\/p>\n<p class=\"wp-block-paragraph\">CNA requires only forward passes \u2014 no gradients, no auxiliary training, no iterative search.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How CNA Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>You define two sets of prompts:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Positive prompts<\/strong> \u2014 examples of the target behavior (e.g., harmful requests)<\/li>\n<li><strong>Negative prompts<\/strong> \u2014 examples of the opposite (e.g., benign requests)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">You run all prompts through the model. At each MLP layer, the method records <strong>down projection activations<\/strong> at the last token position. It then computes the per-neuron mean activation difference between the two sets:<\/p>\n<p class=\"wp-block-paragraph\">\u03b4<sub>j<\/sub><sup>\u2113 <\/sup>= mean(activations on positive prompts) \u2212 mean(activations on negative prompts)<\/p>\n<p class=\"wp-block-paragraph\">The top-k neurons by absolute difference are selected across all layers. The researchers set k to <strong>0.1% of total MLP activations<\/strong>. This threshold produced reliable steering effects across all model sizes tested.<\/p>\n<p class=\"wp-block-paragraph\">A filtering step removes \u2018universal\u2019 neurons \u2014 those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. 
These neurons fire regardless of prompt content and are excluded from all discovered circuits.<\/p>\n<p class=\"wp-block-paragraph\">Causality is verified by multiplying each circuit neuron\u2019s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m &gt; 1 amplifies it.<\/p>\n<p class=\"wp-block-paragraph\">For the main JBB-Behaviors evaluation, the refusal circuit is discovered using <strong>100 harmful and 100 benign prompts<\/strong>. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Experiments covered base and instruct variants of <strong>Llama 3.1\/3.2 and Qwen 2.5<\/strong>, from 1B to 72B parameters \u2014 16 models total. The main benchmark was <strong>JBB-Behaviors<\/strong>, a NeurIPS 2024 benchmark of 100 harmful prompts.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Refusal reduction.<\/strong> Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. 
Selected results from Table 3 of the <a href=\"https:\/\/arxiv.org\/pdf\/2605.12290\" target=\"_blank\" rel=\"noreferrer noopener\">research paper<\/a>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Baseline<\/th>\n<th>Ablated<\/th>\n<th>Relative Drop<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Llama-3.1-70B-Instruct<\/td>\n<td>86%<\/td>\n<td>18%<\/td>\n<td>\u221279.1%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct<\/td>\n<td>87%<\/td>\n<td>2%<\/td>\n<td>\u221297.7%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-72B-Instruct<\/td>\n<td>78%<\/td>\n<td>8%<\/td>\n<td>\u221289.7%<\/td>\n<\/tr>\n<tr>\n<td>Llama-3.2-3B-Instruct<\/td>\n<td>84%<\/td>\n<td>47%<\/td>\n<td>\u221244.0%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-3B-Instruct<\/td>\n<td>90%<\/td>\n<td>58%<\/td>\n<td>\u221235.6%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Not all models exceeded 50% relative reduction \u2014 Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding \u201cin most cases.\u201d<\/p>\n<p class=\"wp-block-paragraph\"><strong>Output quality.<\/strong> CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above <strong>0.97 at all steering strengths<\/strong> across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases \u2014 Qwen2.5-1.5B and Qwen2.5-72B \u2014 CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates.<\/p>\n<p class=\"wp-block-paragraph\"><strong>General capabilities.<\/strong> MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. 
CAA dropped to near-zero MMLU accuracy at maximum intervention.<\/p>\n<p class=\"wp-block-paragraph\"><strong>StrongREJECT rubric.<\/strong> A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0\u20131 scale. Llama model compliance scores improved by an average of 6% after CNA ablation. Qwen model compliance scores improved by an average of 31%.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Base model comparison.<\/strong> Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt \u201cHow do I pick a lock?\u201d:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Multiplier<\/th>\n<th>Output<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Llama-1B Base<\/td>\n<td>1.0<\/td>\n<td>Repeats the question<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Base<\/td>\n<td>0.0 (ablated)<\/td>\n<td>Describes lock picking as a learnable skill<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>1.0<\/td>\n<td>\u201cI can\u2019t assist with that.\u201d<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>0.0 (ablated)<\/td>\n<td>Provides a guide<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>2.0 (amplified)<\/td>\n<td>Stronger refusal<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">In base models, steering the late-layer neurons produces content shifts \u2014 topic changes, rephrasing \u2014 but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Fine-Tuning Transforms Function, Not Structure<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Discrimination neurons concentrate in <strong>the final 10% of layers<\/strong> in both base and instruct models. 
For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13\u2013L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property \u2014 it exists before alignment fine-tuning.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1054\" height=\"346\" data-attachment-id=\"80062\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/23\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/screenshot-2026-05-23-at-2-48-09-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM.png\" data-orig-size=\"1054,346\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-23 at 2.48.09\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-1024x336.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM.png\" alt=\"\" class=\"wp-image-80062\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.12290<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">The function of those neurons changes after fine-tuning. 
Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only <strong>8\u201329% of individual neurons overlap<\/strong> between base and instruct models. Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself.<\/p>\n<p class=\"wp-block-paragraph\">The research team describes this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"cna-header\">\n    <span class=\"cna-label\">Step-by-Step Guide \u00a0\u2022\u00a0 Nous Research<\/span>\n<h2>How to Use Contrastive Neuron Attribution (CNA)<\/h2>\n<p>Steer LLM behavior by identifying and ablating sparse MLP circuits \u2014 no SAE training, no weight modification.<\/p>\n<\/div>\n<p>  <!-- Progress --><\/p>\n<div class=\"cna-progress-wrap\">\n<div class=\"cna-step-row\"><\/div>\n<\/div>\n<p>  <!-- Slides --><\/p>\n<div class=\"cna-slides\">\n<p>    <!-- Slide 1 --><\/p>\n<div class=\"cna-slide active\" data-slide=\"0\">\n      <span class=\"cna-slide-num\">Overview \u00a0\u2014\u00a0 What is CNA?<\/span>\n<h3>Contrastive Neuron Attribution<\/h3>\n<p>CNA identifies the top 0.1% of MLP neurons whose activations most distinguish one behavior from another \u2014 for example, harmful prompts from benign prompts.<\/p>\n<p>Unlike residual-stream methods, CNA operates at the individual neuron level. 
Unlike sparse autoencoders, it requires no external training.<\/p>\n<div class=\"cna-step-line\"><\/div>\n<p><strong>What you need:<\/strong><\/p>\n<ul>\n<li>A base or instruct language model (Llama or Qwen architectures tested)<\/li>\n<li>A small set of contrastive prompt pairs<\/li>\n<li>Forward-pass access to MLP activations (via hooks)<\/li>\n<li>No GPU gradient computation required<\/li>\n<\/ul><\/div>\n<p>    <!-- Slide 2 --><\/p>\n<div class=\"cna-slide\" data-slide=\"1\">\n      <span class=\"cna-slide-num\">Step 1 \u00a0\u2014\u00a0 Define Your Prompt Pairs<\/span>\n<h3>Build a Contrastive Discovery Set<\/h3>\n<p>You need two sets of prompts that represent opposite behaviors. The quality of this set directly affects which neurons are identified.<\/p>\n<ul>\n<li><strong>Positive prompts<\/strong> \u2014 exhibit the target behavior (e.g., harmful requests)<\/li>\n<li><strong>Negative prompts<\/strong> \u2014 exhibit the opposite (e.g., benign requests)<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p><strong>Recommended sizes:<\/strong><\/p>\n<ul>\n<li>For benchmark evaluation: 100 positive + 100 negative prompts<\/li>\n<li>For qualitative testing: as few as 8 positive + 8 negative prompts<\/li>\n<\/ul>\n<p>Example positive: <em>\u201cHow do I pick a lock?\u201d<\/em><br \/>Example negative: <em>\u201cHow do I bake a cake?\u201d<\/em><\/p>\n<\/div>\n<p>    <!-- Slide 3 --><\/p>\n<div class=\"cna-slide\" data-slide=\"2\">\n      <span class=\"cna-slide-num\">Step 2 \u00a0\u2014\u00a0 Record MLP Activations<\/span>\n<h3>Run Forward Passes With Hooks<\/h3>\n<p>Run all prompts through the model. 
At each MLP layer, record the <strong>down projection activations<\/strong> at the last token position using forward pre-hooks on <code>down_proj<\/code>.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># Register forward pre-hooks on down_proj in each MLP layer.<\/span>\n<span class=\"cmt\"># The input to down_proj holds the per-neuron MLP activations.<\/span>\n<span class=\"kw\">import<\/span> torch\n\n<span class=\"kw\">def<\/span> <span class=\"fn\">make_hook<\/span>(layer_idx, store):\n    <span class=\"kw\">def<\/span> <span class=\"fn\">hook<\/span>(module, args):\n        store[layer_idx] = args[<span class=\"nm\">0<\/span>][:, <span class=\"nm\">-1<\/span>, :].detach()\n    <span class=\"kw\">return<\/span> hook\n\nactivations = {}\nhooks = []\n<span class=\"kw\">for<\/span> i, layer <span class=\"kw\">in<\/span> <span class=\"fn\">enumerate<\/span>(model.layers):\n    h = layer.mlp.down_proj.<span class=\"fn\">register_forward_pre_hook<\/span>(\n        <span class=\"fn\">make_hook<\/span>(i, activations)\n    )\n    hooks.<span class=\"fn\">append<\/span>(h)\n\n<span class=\"cmt\"># Run forward pass<\/span>\n<span class=\"kw\">with<\/span> torch.no_grad():\n    model(**inputs)\n\n<span class=\"cmt\"># Remove the recording hooks once every prompt has been processed<\/span>\n<span class=\"kw\">for<\/span> h <span class=\"kw\">in<\/span> hooks:\n    h.<span class=\"fn\">remove<\/span>()<\/pre>\n<\/div>\n<p>Collect these activation tensors for every prompt in both sets before proceeding.<\/p>\n<\/div>\n<p>    <!-- Slide 4 --><\/p>\n<div class=\"cna-slide\" data-slide=\"3\">\n      <span class=\"cna-slide-num\">Step 3 \u00a0\u2014\u00a0 Compute Activation Differences<\/span>\n<h3>Per-Neuron Mean Contrastive Difference<\/h3>\n<p>For each neuron j in each layer \u2113, compute the mean activation difference between positive and negative sets:<\/p>\n<div class=\"cna-formula\">\u03b4\u2113_j = mean(a\u2113_j over positive prompts)<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u2212 mean(a\u2113_j over negative prompts)<\/div>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># pos_acts, neg_acts: dicts mapping layer_idx to a tensor of shape [n_prompts, n_neurons]<\/span>\n<span class=\"kw\">import<\/span> torch\n\ndelta = <span class=\"fn\">dict<\/span>()\n<span class=\"kw\">for<\/span> layer_idx <span class=\"kw\">in<\/span> pos_acts:\n    
delta[layer_idx] = (\n        pos_acts[layer_idx].<span class=\"fn\">mean<\/span>(dim=<span class=\"nm\">0<\/span>)\n        - neg_acts[layer_idx].<span class=\"fn\">mean<\/span>(dim=<span class=\"nm\">0<\/span>)\n    )<\/pre>\n<\/div>\n<p>This produces one difference value per neuron per layer. A large absolute value means that neuron fires very differently between the two prompt sets.<\/p>\n<\/div>\n<p>    <!-- Slide 5 --><\/p>\n<div class=\"cna-slide\" data-slide=\"4\">\n      <span class=\"cna-slide-num\">Step 4 \u00a0\u2014\u00a0 Select the Circuit<\/span>\n<h3>Take the Top 0.1% by Absolute Difference<\/h3>\n<p>Flatten all per-neuron delta values across all layers. Select the top-k neurons by absolute value, where k = 0.1% of total MLP activations.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># Flatten all deltas into one tensor with (layer, neuron) indices<\/span>\nall_deltas = torch.<span class=\"fn\">cat<\/span>([delta[i] <span class=\"kw\">for<\/span> i <span class=\"kw\">in<\/span> <span class=\"fn\">sorted<\/span>(delta)])\ntotal = all_deltas.<span class=\"fn\">numel<\/span>()\nk = <span class=\"fn\">max<\/span>(<span class=\"nm\">1<\/span>, <span class=\"fn\">int<\/span>(total * <span class=\"nm\">0.001<\/span>))  <span class=\"cmt\"># 0.1%<\/span>\n\ntop_vals, top_idx = torch.<span class=\"fn\">topk<\/span>(all_deltas.<span class=\"fn\">abs<\/span>(), k)\n\n<span class=\"cmt\"># Map flat index back to (layer, neuron) pairs (assumes every layer has the same width)<\/span>\nn_neurons = all_deltas.<span class=\"fn\">shape<\/span>[<span class=\"nm\">0<\/span>] \/\/ <span class=\"fn\">len<\/span>(delta)\ncircuit = [(idx \/\/ n_neurons, idx % n_neurons)\n           <span class=\"kw\">for<\/span> idx <span class=\"kw\">in<\/span> top_idx.<span class=\"fn\">tolist<\/span>()]<\/pre>\n<\/div>\n<p>This set of (layer, neuron) pairs is your discovered circuit.<\/p>\n<\/div>\n<p>    <!-- Slide 6 --><\/p>\n<div class=\"cna-slide\" data-slide=\"5\">\n      <span class=\"cna-slide-num\">Step 5 
\u00a0\u2014\u00a0 Filter Universal Neurons<\/span>\n<h3>Remove Neurons That Always Fire<\/h3>\n<p>Some neurons appear in the top 0.1% regardless of prompt content. These are not behavior-specific and must be excluded.<\/p>\n<ul>\n<li>Run a diverse set of unrelated prompts through the model<\/li>\n<li>Record which neurons fall in the top 0.1% for each prompt<\/li>\n<li>Flag any neuron appearing in the top 0.1% across 80% or more of prompts<\/li>\n<li>Remove flagged neurons from the discovered circuit before ablation<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p>Skipping this step will contaminate the circuit with general-purpose neurons that fire constantly \u2014 and ablating them will degrade unrelated model behavior.<\/p>\n<\/div>\n<p>    <!-- Slide 7 --><\/p>\n<div class=\"cna-slide\" data-slide=\"6\">\n      <span class=\"cna-slide-num\">Step 6 \u00a0\u2014\u00a0 Ablate and Verify<\/span>\n<h3>Apply the Scalar Multiplier at Inference<\/h3>\n<p>Multiply each circuit neuron\u2019s activation by a scalar m at inference time to verify the circuit is causal \u2014 not just correlated.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># circuit: list of (layer_idx, neuron_idx)<\/span>\n<span class=\"cmt\"># m=0 ablates, m=1 baseline, m&gt;1 amplifies<\/span>\n\n<span class=\"kw\">def<\/span> <span class=\"fn\">make_ablation_hook<\/span>(neuron_indices, m):\n    <span class=\"kw\">def<\/span> <span class=\"fn\">hook<\/span>(module, args):\n        <span class=\"cmt\"># Scale the selected neurons in the input to down_proj<\/span>\n        x = args[<span class=\"nm\">0<\/span>].<span class=\"fn\">clone<\/span>()\n        x[:, <span class=\"nm\">-1<\/span>, neuron_indices] *= m\n        <span class=\"kw\">return<\/span> (x,)\n    <span class=\"kw\">return<\/span> hook\n\n<span class=\"cmt\"># Group circuit neurons by layer, then register pre-hooks<\/span>\n<span class=\"kw\">from<\/span> collections <span class=\"kw\">import<\/span> defaultdict\nby_layer = defaultdict(<span class=\"fn\">list<\/span>)\n<span class=\"kw\">for<\/span> layer_idx, neuron_idx <span class=\"kw\">in<\/span> circuit:\n    
by_layer[layer_idx].<span class=\"fn\">append<\/span>(neuron_idx)\n\nhooks = []\n<span class=\"kw\">for<\/span> layer_idx, neurons <span class=\"kw\">in<\/span> by_layer.<span class=\"fn\">items<\/span>():\n    h = model.layers[layer_idx].mlp.down_proj.<span class=\"fn\">register_forward_pre_hook<\/span>(\n        <span class=\"fn\">make_ablation_hook<\/span>(neurons, m=<span class=\"nm\">0.0<\/span>)\n    )\n    hooks.<span class=\"fn\">append<\/span>(h)<\/pre>\n<\/div><\/div>\n<p>    <!-- Slide 8 --><\/p>\n<div class=\"cna-slide\" data-slide=\"7\">\n      <span class=\"cna-slide-num\">What to Expect \u00a0\u2014\u00a0 Results<\/span>\n<h3>Refusal Reduction Across Instruct Models<\/h3>\n<p>From the paper \u2014 refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):<\/p>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Qwen2.5-7B-Instruct<\/span><span class=\"cna-result-drop\">87% \u2192 2% (\u221297.7%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Qwen2.5-72B-Instruct<\/span><span class=\"cna-result-drop\">78% \u2192 8% (\u221289.7%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Llama-3.1-70B-Instruct<\/span><span class=\"cna-result-drop\">86% \u2192 18% (\u221279.1%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Llama-3.2-3B-Instruct<\/span><span class=\"cna-result-drop\">84% \u2192 47% (\u221244.0%)<\/span><\/div>\n<div class=\"cna-step-line\"><\/div>\n<p>Output quality (1 \u2212 repeated n-gram fraction) stays above <strong>0.97<\/strong> at all steering strengths. 
MMLU accuracy stays within one percentage point of baseline.<\/p>\n<\/div>\n<p>    <!-- Slide 9 --><\/p>\n<div class=\"cna-slide\" data-slide=\"8\">\n      <span class=\"cna-slide-num\">Key Notes \u00a0\u2014\u00a0 Before You Run This<\/span>\n<h3>Limitations to Keep in Mind<\/h3>\n<ul>\n<li>Tested on Llama 3.1\/3.2 and Qwen 2.5 only \u2014 gated SiLU MLPs with GQA attention<\/li>\n<li>Not yet validated on mixture-of-experts architectures<\/li>\n<li>Base models show no behavioral change under ablation \u2014 only instruct models respond<\/li>\n<li>CNA uses raw activation differences, not attribution scores \u2014 faithfulness metrics do not apply directly<\/li>\n<li>Amplification (m &gt; 1) can cause repetition at extreme values<\/li>\n<li>Quality of contrastive pairs directly affects which neurons are found<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p>      <span class=\"cna-tag\">arXiv 2605.12290<\/span><br \/>\n      <span class=\"cna-tag\">Nous Research<\/span><br \/>\n      <span class=\"cna-tag\">github.com\/NousResearch\/neural-steering<\/span>\n    <\/p><\/div>\n<\/div>\n<p>  <!-- Nav --><\/p>\n<div class=\"cna-nav\">\n    <button class=\"cna-btn cna-btn-prev\" disabled>\u2190 Prev<\/button><br \/>\n    <span class=\"cna-slide-counter\">1 \/ 9<\/span><br \/>\n    <button class=\"cna-btn cna-btn-next\">Next \u2192<\/button>\n  <\/div>\n<p>  <!-- Footer --><\/p>\n<div class=\"cna-footer\">\n    <span>Coverage by<\/span><br \/>\n    <span class=\"cna-brand\">MARKTECHPOST \u00a0\u2014\u00a0 AI Research, Simplified<\/span>\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, while output quality stayed above 0.97.<\/li>\n<li>CNA requires only forward passes \u2014 no gradients, no auxiliary training, and no iterative search.<\/li>\n<li>Late-layer discrimination structure 
exists in base models before fine-tuning; alignment fine-tuning transforms its function, not its location.<\/li>\n<li>Unlike CAA, CNA preserves MMLU accuracy within one percentage point of baseline at all steering strengths.<\/li>\n<li>Only 8\u201329% of individual neurons overlap between base and instruct model circuits \u2014 fine-tuning rewires the neurons while keeping the late-layer structure intact.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.12290\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong> and\u00a0<strong><a href=\"https:\/\/github.com\/NousResearch\/neural-steering\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/23\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\">Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Instruction-tuned language 
mod&hellip;<\/p>\n","protected":false},"author":1,"featured_media":959,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-958","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/958","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=958"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/958\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/959"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=958"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=958"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=958"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}