2026-05-27 | agents, reasoning, code

Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade, Yilun Du

Key claim

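BES improves search with language models, outperforming Tree-GRPO on MuSiQue and matching or exceeding open-source evolutionary frameworks on the reported optimization tasks.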

The paper presents Bidirectional Evolutionary Search (BES), a framework that strengthens search with language models by combining forward candidate evolution with backward goal decomposition. In the reported results, BES outperforms existing frameworks on challenging tasks, with gains in average objective values and, in most cases, in best objective values.

Novelty

8.0/10

BES introduces a novel search framework that combines forward candidate evolution with backward goal decomposition, significantly extending existing search methods.

Reliability

7.5/10

The paper provides experimental results demonstrating consistent gains over existing methods, supported by theoretical motivation and code availability.

Deep reliability assessment

The experiments support that BES can improve sample generation versus GRPO/Tree-GRPO on the reported MuSiQue setup and improve average objective values versus several open-source evolutionary frameworks under matched inference settings. The broader claim that BES generally escapes model-distribution limits or is a robust path to self-improving agents is less established, since evidence is concentrated on a few benchmarks, uses strong verifier/decomposition assumptions, and does not clearly separate gains from extra compute, prompting, or implementation details.

Reproducibility

Yes. The paper states that code and trained models are available at https://github.com/Embodied-Minds-Lab/BES; datasets/benchmarks mentioned include MuSiQue, logical reasoning tasks, Circle Packing Square/Rect, and Heilbronn Convex, with detailed configurations said to be in appendices.

Discussion questions

1. Does BES truly explore outside the model's effective distribution, or do LLM-prompted evolutionary operators still mostly recombine high-probability model priors in a more compute-intensive way?
2. For builders, when is the extra complexity of backward subgoal generation and evolutionary candidate recombination worth it compared with simpler best-of-N, verifier-guided reranking, or domain-specific search?
3. What result would falsify BES: matched-compute experiments where tree search plus dense verifier feedback closes the gap, or tasks where recombination produces plausible but invalid trajectories more often than useful candidates?

Key figure

Figure 1 contrasts ordinary tree search, which expands candidates forward within a narrow reachable solution shell, with BES, which combines forward evolutionary recombination and backward goal decomposition into verifiable subgoals.
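
The two directions described above can be illustrated with a deliberately small toy. This is a sketch under stated assumptions, not the paper's algorithm: the task (recovering a hidden string), the per-position subgoals, and every function name here (`decompose`, `score`, `recombine`, `mutate`, `toy_bidirectional_search`) are hypothetical, and plain random operators stand in for the LLM-prompted operators a real system would use. The backward step splits the final goal into independently checkable subgoals; the forward step evolves a population by recombination and mutation, ranked by how many subgoals each candidate satisfies.

```python
import random
from typing import Callable, List, Tuple

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
Subgoal = Callable[[str], bool]

def decompose(target: str) -> List[Subgoal]:
    """Backward step (toy): one verifiable subgoal per character position."""
    return [lambda c, i=i, ch=ch: len(c) > i and c[i] == ch
            for i, ch in enumerate(target)]

def score(candidate: str, subgoals: List[Subgoal]) -> int:
    """Verifier: number of subgoals the candidate satisfies."""
    return sum(goal(candidate) for goal in subgoals)

def recombine(a: str, b: str, rng: random.Random) -> str:
    """Forward step (toy): single-point crossover of two parents (length >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(c: str, rng: random.Random) -> str:
    """Forward step (toy): resample one random position."""
    i = rng.randrange(len(c))
    return c[:i] + rng.choice(ALPHABET) + c[i + 1:]

def toy_bidirectional_search(target: str, pop_size: int = 30,
                             generations: int = 300,
                             seed: int = 0) -> Tuple[str, List[int]]:
    rng = random.Random(seed)
    subgoals = decompose(target)
    rank = lambda c: score(c, subgoals)
    population = ["".join(rng.choice(ALPHABET) for _ in target)
                  for _ in range(pop_size)]
    history: List[int] = []  # best subgoal count seen at each generation
    for _ in range(generations):
        population.sort(key=rank, reverse=True)
        history.append(rank(population[0]))
        if history[-1] == len(subgoals):
            break  # every subgoal verified
        parents = population[: pop_size // 2]  # elitism: top half survives
        children = [mutate(recombine(*rng.sample(parents, 2), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    population.sort(key=rank, reverse=True)
    return population[0], history

if __name__ == "__main__":
    best, history = toy_bidirectional_search("search")
    print(best, history[-1], len(history))
```

Because the top half of the population is always retained, the best subgoal count never decreases between generations. The dense per-subgoal signal is what makes recombination useful in this toy; with only a pass/fail check on the full string, the same operators would have almost nothing to rank by.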

Benchmark results

MuSiQue validation, accuracy: 7 vs Llama-3.2-3B-Instruct + Tree-GRPO (+3.1 percentage points)

MuSiQue validation, accuracy: 10.4 vs Llama-3.1-8B-Instruct + Tree-GRPO (+3.0 percentage points)

Circle Packing (Square), average objective value: 2.623 vs GEPA with GPT-5 (+0.010)

Circle Packing (Square), best objective value: 2.632 vs GEPA with GPT-5 (+0.004)

Circle Packing (Rect.), average objective value: 2.349 vs ShinkaEvolve with GPT-5 (+0.014)

Circle Packing (Rect.), best objective value: 2.36 vs ShinkaEvolve with GPT-5 (+0.002)

Heilbronn (Convex), average objective value: 0.026 vs OpenEvolve/GEPA with GPT-5 (+0.001)

Heilbronn (Convex), best objective value: 0.027 vs OpenEvolve/GEPA with GPT-5 (+0.000)
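
The listing gives each BES value together with its margin over the named baseline, but not the baseline value itself. The short script below derives the implied baseline for each row, assuming each margin is simply the BES value minus the baseline value. The outputs are derived figures, not numbers reported in the source, and they are only as precise as the rounded values shown above. The names `ROWS` and `implied_baseline` are illustrative.

```python
# (benchmark, metric, baseline, BES value, margin over baseline), as listed above.
ROWS = [
    ("MuSiQue validation", "accuracy", "Llama-3.2-3B-Instruct + Tree-GRPO", 7.0, 3.1),
    ("MuSiQue validation", "accuracy", "Llama-3.1-8B-Instruct + Tree-GRPO", 10.4, 3.0),
    ("Circle Packing (Square)", "average", "GEPA with GPT-5", 2.623, 0.010),
    ("Circle Packing (Square)", "best", "GEPA with GPT-5", 2.632, 0.004),
    ("Circle Packing (Rect.)", "average", "ShinkaEvolve with GPT-5", 2.349, 0.014),
    ("Circle Packing (Rect.)", "best", "ShinkaEvolve with GPT-5", 2.36, 0.002),
    ("Heilbronn (Convex)", "average", "OpenEvolve/GEPA with GPT-5", 0.026, 0.001),
    ("Heilbronn (Convex)", "best", "OpenEvolve/GEPA with GPT-5", 0.027, 0.000),
]

def implied_baseline(value: float, margin: float, ndigits: int = 3) -> float:
    """Assumption: margin = BES value - baseline value, so baseline = value - margin."""
    return round(value - margin, ndigits)

if __name__ == "__main__":
    for benchmark, metric, baseline, value, margin in ROWS:
        base = implied_baseline(value, margin)
        print(f"{benchmark} [{metric}] {baseline}: implied baseline {base}")
```

Read this way, the margins on the optimization tasks sit in the second or third decimal place, and the Heilbronn best-value row is a tie at the displayed precision.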

GitHub: 1 repo

Embodied-Minds-Lab/BES (Official)