PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective
Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu
Key claim
Orthogonal finetuning (OFT) offers the best stability-plasticity balance among the evaluated methods under similar parameter budgets.
In plain English
This paper introduces PEFT-Arena, a benchmark that evaluates parameter-efficient finetuning by measuring both downstream performance and the retention of pretrained capabilities. The key finding is that orthogonal finetuning achieves the best balance between adaptation and retention under similar parameter budgets, highlighting the importance of stability-plasticity profiles in finetuning methods.
The introduction of PEFT-Arena as a benchmark for assessing both performance and capability retention represents a significant advancement in the evaluation of parameter-efficient finetuning methods.
The study provides solid empirical analysis and distinct profiles across methods, though it could benefit from more extensive baselines.
Deep reliability assessment
The methodology supports the claim that, in the evaluated math and medicine finetuning settings, PEFT methods differ substantially in their stability-plasticity trade-offs and OFT often provides a stronger Pareto frontier under comparable parameter budgets. The broader claim that OFT is generally best for PEFT retention is overclaimed without wider multilingual, dialogue, safety, non-reasoning, and non-weight-parameterized PEFT evaluations.
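The Pareto-frontier comparison mentioned above can be made concrete with a small sketch. It assumes each method is summarized by two scores where higher is better: a target-task gain and a retention score. The function name `pareto_frontier`, the method labels, and all numbers are hypothetical placeholders for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Tuple


def pareto_frontier(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the methods that no other method dominates.

    Each value is (target_gain, retention); higher is better on both axes.
    A method is dominated if another method is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, (gain, keep) in scores.items():
        dominated = any(
            (g >= gain and k >= keep) and (g > gain or k > keep)
            for other, (g, k) in scores.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Placeholder numbers for illustration only; NOT results from the paper.
example_scores = {
    "method_a": (0.30, 0.95),
    "method_b": (0.40, 0.80),
    "method_c": (0.25, 0.90),  # worse than method_a on both axes
    "method_d": (0.40, 0.70),  # same gain as method_b, lower retention
}
frontier = pareto_frontier(example_scores)
print(frontier)
```

A comparison of this kind is only meaningful when the methods are run at comparable parameter budgets, which is the condition the assessment above attaches to the OFT result.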
Reproducibility
No explicit open-source code or dataset release is shown in the provided text; the paper mentions a project page, SphereLab.ai/PEFT-Arena, but not a concrete GitHub/HuggingFace repository or downloadable benchmark artifact.
Discussion questions
1. Does measuring retention on general evaluation suites actually capture the pretrained capabilities that matter in production, or does it just reward models that stay close to the base model?
2. For builders adapting LLMs to SEA domains, should PEFT selection optimize a Pareto trade-off between target gains and general retention rather than target-task accuracy alone, and how would that change deployment evaluation budgets?
3. What result would falsify the paper’s interpretation: a PEFT method that causes large non-isometric activation distortion while retaining general capabilities, or OFT losing its frontier advantage across broader domains and model families?
Key figure
Figure 1 summarizes PEFT-Arena’s stability-plasticity benchmark, links external forgetting to weight-space and activation-space geometry, and illustrates interpolation/path-wise rewinding as a way to diagnose and correct SFT overshoot.
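The interpolation idea in the figure can be illustrated with a minimal sketch, assuming the simplest form: a straight-line blend between base and finetuned weights, scored at several blend factors. The paper's path-wise rewinding procedure may differ from this. The names `interpolate`, `sweep`, and `toy_score`, and the toy weights, are hypothetical stand-ins; a real evaluation would score a model rather than a two-element list.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]


def interpolate(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Blend weights linearly: alpha=0 gives base, alpha=1 gives tuned."""
    return {
        name: [b + alpha * (t - b) for b, t in zip(base[name], tuned[name])]
        for name in base
    }


def sweep(
    base: Weights,
    tuned: Weights,
    evaluate: Callable[[Weights], float],
    alphas: Sequence[float],
) -> List[Tuple[float, float]]:
    """Score the interpolated weights at each alpha."""
    return [(a, evaluate(interpolate(base, tuned, a))) for a in alphas]


# Toy stand-ins, not real model weights.
base = {"layer.weight": [0.0, 1.0]}
tuned = {"layer.weight": [2.0, 3.0]}


def toy_score(w: Weights) -> float:
    # Hypothetical score that peaks partway between base and tuned,
    # mimicking a finetuned model that has overshot.
    return -abs(w["layer.weight"][0] - 1.0)


results = sweep(base, tuned, toy_score, [0.0, 0.25, 0.5, 0.75, 1.0])
best_alpha = max(results, key=lambda r: r[1])[0]
print(best_alpha)
```

If the best score in such a sweep sits at a blend factor below 1, that is one way to read "overshoot": the fully finetuned weights have moved further from the base than the combined objective favors.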
