2026-05-28 · agents · reasoning

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Nhat-Minh Nguyen

Key claim

Supervision design determines AI output trustworthiness.

This case study follows one physicist supervising an AI coding agent (Claude Code) as it built CLAX-PT, a ~2,100-line JAX implementation validated against CLASS-PT. Its key finding is that the agent's outputs were trustworthy only because of the supervision practices around it, such as oracle tests, changelogs, parameter sweeps, and expert physics review. The paper concludes that supervision design mattered more than model capability.

Novelty

7.0/10

The paper presents a case study of a physicist supervising an AI coding agent on real scientific software, shifting attention from what the model can do to how its work is checked.

Reliability

8.0/10

The findings are supported by detailed documentation of 15 supervision events and by validation of the agent's code against the CLASS-PT reference.

Deep reliability assessment

The methodology supports a rich, instrumented N=1 case study showing specific failure modes of an AI coding agent under expert supervision in scientific software development. It overclaims if read as a general result about AI agents, scaling, or all scientific domains, because the evidence comes from one physicist, one project, one agent stack, and one validation regime.

Reproducibility

No public repository or dataset URL is mentioned in the provided text. The paper reports that CLAX-PT is a ~2,100-line JAX implementation validated against CLASS-PT and that 15 supervision events were documented, but the excerpt does not provide open-source code, oracle test suites, logs, or datasets.

Discussion questions

1. Is the core failure really a limitation of the AI agent's scientific reasoning, or a consequence of the human providing an incomplete problem decomposition and validation harness?
2. For builders using coding agents in high-stakes technical domains, what supervision artifacts should be mandatory: diverse parameter tests, changelogs, anti-fudge-factor rules, architectural review checkpoints, or something else?
3. What evidence would falsify the paper's conclusion: an agent independently proposing the anisotropic BAO damping redesign, rejecting its own unphysical scalar patch, or succeeding across many scientific software projects without expert intervention?

Key figure

No Figure 1 or architectural diagram is visible in the provided excerpt; the key described workflow is a physicist-supervised Claude Code development loop using oracle tests, changelogs, parameter sweeps, and expert physics review.
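
The oracle-test and parameter-sweep parts of that loop can be illustrated with a minimal sketch. Everything here is hypothetical: `sweep_against_oracle`, the toy `oracle` and `candidate` functions, and the parameter grid are stand-ins chosen for illustration, not the paper's actual harness or CLASS-PT's interface. The point is only that a candidate which matches the reference at one fiducial point can still fail elsewhere in parameter space.

```python
import itertools

def sweep_against_oracle(candidate, oracle, grid, ks, tol):
    """Compare candidate to oracle at every grid point; return those exceeding tol."""
    failures = []
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        ref = [oracle(k, **params) for k in ks]
        scale = max(abs(r) for r in ref)
        # Worst absolute deviation over k, normalized by the reference amplitude.
        worst = max(abs(candidate(k, **params) - r) for k, r in zip(ks, ref)) / scale
        if worst > tol:
            failures.append((params, worst))
    return failures

# Toy stand-ins (hypothetical, not real physics).
def oracle(k, h, omega_m):
    return omega_m * k / h

def candidate(k, h, omega_m):
    # Hard-codes h = 0.7, so it agrees with the oracle only at that value.
    return omega_m * k / 0.7

grid = {"h": [0.6, 0.7, 0.8], "omega_m": [0.25, 0.31]}
failures = sweep_against_oracle(candidate, oracle, grid, ks=[0.05, 0.1, 0.3], tol=0.01)
for params, worst in failures:
    print(params, f"{100 * worst:.1f}%")
```

A test at the single point h = 0.7 would pass this candidate; the sweep flags the four grid points with h = 0.6 or h = 0.8.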

Benchmark results

Planck 2018 fiducial cosmology, z = 0.38, k < 0.3 h/Mpc | max relative error (%) vs CLASS-PT: 0.31 | vs CLASS-PT reference code | not applicable; error measured against reference

Planck 2018 fiducial cosmology, z = 0.38, k < 0.3 h/Mpc | max relative error (%) vs CLASS-PT: 0.59 | vs CLASS-PT reference code | not applicable; error measured against reference

Planck 2018 fiducial cosmology, z = 0.38, k < 0.3 h/Mpc | max relative error (%) vs CLASS-PT: 0.89 | vs CLASS-PT reference code | not applicable; error measured against reference

Planck 2018 fiducial cosmology, z = 0.38, k < 0.3 h/Mpc | max normalized error (%) vs CLASS-PT using |Δ|/max(|ref|) due to zero crossings: 1.43 | vs CLASS-PT reference code | not applicable; error measured against reference
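
The last result uses |Δ|/max(|ref|) instead of a pointwise relative error because the reference curve crosses zero, where |Δ|/|ref| is undefined. A minimal sketch of the two metrics in plain Python follows; the function names and sample values are illustrative assumptions, not code or data from the paper.

```python
def max_relative_error(test, ref):
    """Largest pointwise |test - ref| / |ref|; fails if ref contains a zero."""
    return max(abs(t - r) / abs(r) for t, r in zip(test, ref))

def max_normalized_error(test, ref):
    """Largest |test - ref| divided by max(|ref|); finite at zero crossings."""
    scale = max(abs(r) for r in ref)
    return max(abs(t - r) for t, r in zip(test, ref)) / scale

# Hypothetical sample values, not data from the paper.
ref_positive = [2.0, 4.0]
test_positive = [2.02, 4.0]
relative = max_relative_error(test_positive, ref_positive)

ref_crossing = [2.0, 1.0, 0.0, -1.0]   # reference passes through zero
test_crossing = [2.01, 1.0, 0.01, -1.0]
normalized = max_normalized_error(test_crossing, ref_crossing)

print(f"max relative error:   {100 * relative:.2f}%")
print(f"max normalized error: {100 * normalized:.2f}%")
```

Calling `max_relative_error` on the zero-crossing reference would raise `ZeroDivisionError`, which is the situation the normalized metric avoids. The two numbers are not directly comparable: the normalized error weights every point by the same global amplitude, so it understates deviations where the reference is small.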