2026-05-27 · alignment · scaling · code

Multi-Adapter Representation Interventions via Energy Calibration

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu

Key claim

MARI improves alignment without sacrificing general capabilities.

The paper presents MARI, a method that adapts representation interventions for large language models to the needs of each input sample. The authors report that this aligns models more effectively and also improves general capabilities on various tasks; the key result is a significant improvement on safety benchmarks with performance on general tasks at least maintained.

Novelty

8.0/10

The introduction of a competitive multi-adapter mechanism represents a significant advancement in intervention methods for large language models.

Reliability

8.0/10

The extensive experiments across diverse model families and benchmarks provide strong support for the claims made.

Deep reliability assessment

The methodology supports the claim that a fixed, global representation intervention can be suboptimal and that input-adaptive routing plus an energy gate can reduce unnecessary interventions on the evaluated alignment/general-capability benchmarks. The paper overclaims if read as proving general alignment robustness: the evidence is still benchmark-bound, layer/site-specific, and the provided excerpt gives limited concrete end-to-end performance numbers beyond an expert-selection lower bound.

Reproducibility

Yes for code: the paper states code is available at https://github.com/V1centNevwake/MARI. Datasets appear to be public benchmarks including TruthfulQA, BBQ, Safety, MMLU, and ARC, but the excerpt does not include full experimental configuration details or all result tables.

Discussion questions

1. Does the need for multiple adapters genuinely refute the linear representation hypothesis, or could it simply reflect that the chosen intervention layer, token position, or training objective is poorly specified?
2. For builders deploying LLMs, is the extra inference cost and routing complexity of energy-gated multi-adapter intervention worth it compared with simpler approaches such as fine-tuning, system prompts, refusal classifiers, or post-generation filters?
3. What result would falsify MARI: for example, if energy scores fail to separate intervention-needed from benign inputs on a new domain, or if a single well-tuned dynamic steering vector matches MARI while preserving MMLU/ARC performance?
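
One way to make the falsification test in question 3 concrete is to measure how well energy scores separate the two input groups, for example with AUROC. The sketch below is illustrative only: the function names, the sample scores, and the sign convention (inputs that need intervention receive lower energy) are assumptions, not details taken from the paper.

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Computed by brute-force pairwise comparison,
    which is fine for small evaluation sets.
    """
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def energy_separation(energy_needed, energy_benign):
    """AUROC of the energy score as a detector of intervention-needed inputs.

    ASSUMPTION: lower energy means "intervene", so energies are negated
    before ranking. Flip the sign if the convention is the opposite.
    """
    return auroc(-np.asarray(energy_needed, dtype=float),
                 -np.asarray(energy_benign, dtype=float))

if __name__ == "__main__":
    # Hypothetical energies on a new domain (made-up numbers).
    needed = [-5.0, -4.2, -3.9]
    benign = [-1.0, -0.5, -4.0]
    print(f"AUROC = {energy_separation(needed, benign):.3f}")
```

An AUROC near 0.5 on a held-out domain would mean the gate is close to guessing there, which is the failure mode the question describes; a value near 1.0 would count as evidence in MARI's favor on that domain only.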

Key figure

Figure 1 contrasts static representation intervention such as ReFT, which applies one fixed adapter to every input, with MARI, which uses an energy gate to decide whether to intervene and a router over multiple adapters to select the intervention direction.
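
The contrast in Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy score is taken to be the negative log-sum-exp of hypothetical router logits, the gate intervenes when that energy falls below a threshold, and each adapter is a generic low-rank additive edit rather than ReFT's exact parameterization. All names (`router_W`, `adapters`, `threshold`) are made up for the example.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))))

def static_intervention(h, adapter):
    """Static baseline: the same low-rank edit is applied to every input."""
    U, V = adapter                      # U: (d, r), V: (r, d)
    return h + U @ (V @ h)

def gated_multi_adapter(h, router_W, adapters, threshold):
    """Hypothetical energy-gated routing over several adapters.

    Returns the (possibly edited) hidden state and the index of the
    adapter used, or None when the gate decides not to intervene.
    """
    logits = router_W @ h               # one routing score per adapter
    energy = -logsumexp(logits)         # ASSUMED energy definition
    if energy > threshold:              # no adapter claims this input strongly
        return h, None
    k = int(np.argmax(logits))          # hard top-1 routing (an assumption)
    return static_intervention(h, adapters[k]), k
```

The design point the figure makes is visible in the control flow: the static baseline has no branch, while the gated version can return the hidden state unchanged, which is how unnecessary interventions on benign inputs would be avoided.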

Benchmark results

TruthfulQA-MC1, TruthfulQA-MC2, and BBQ: LBagree lower bound 81.74; vs. Llama-3-8B-Instruct: not reported

GitHub (1 repo)

V1centNevwake/MARI (official)