2026-05-26 | agents, alignment

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

Read on arXiv →

Key claim

FinHarness reduces unauthorized actions while preserving approvals.

In plain English

FinHarness is a safety mechanism for finance LLM agents that reduces unauthorized actions while largely preserving legitimate approvals. On the FINVAULT benchmark it lowers the attack success rate from 38.3% to 15.0% while making fewer advanced judge calls, which keeps it efficient. Because the checks run inline at each step, risk is assessed as the agent acts rather than only at final approval.

Novelty

8.0/10

FinHarness introduces an inline, step-level safety mechanism for finance LLM agents that accumulates weak risk signals across a trajectory instead of judging each tool call in isolation.

Reliability

7.5/10

The paper provides solid experimental results with clear metrics, demonstrating the effectiveness of FinHarness.

Deep reliability assessment

The methodology supports the claim that inline, step-level monitoring with accumulated risk can improve the ASR/benign-approval trade-off on the fixed FINVAULT benchmark and an 856-trace synthetic stress set. The paper overclaims general adaptive robustness, however: the authors themselves note proprietary judges, frozen rule heads, single-run evaluation, benchmark-bound results, and unknown robustness if attackers know the risk heads, thresholds, window length, or evidence-injection format.

Reproducibility

No open-source code or repository is mentioned. The evaluation uses FINVAULT and an 856-trace synthetic stress set, but the excerpts do not state that FINHARNESS code, prompts, rule heads, judge configurations, or the synthetic traces are released; proprietary judge models are also a reproducibility blocker.

Discussion questions

1. Does the core assumption that cumulative weak risk signals indicate malicious intent hold in real finance workflows, where legitimate cases may naturally accumulate anomalies across multi-step processes?
2. For builders, is re-injecting fired risk factors into the agent context safer than enforcing hard policy gates, or can attackers learn to manipulate the agent’s response to those injected safety signals?
3. What evaluation would falsify the result: adaptive attacks with knowledge of the rule heads and cascade thresholds, variance across different agent models/prompts/tool simulators, or deployment logs showing benign approval collapse?

Key figure

Figure 1 shows a five-step finance-agent attack where each individual tool-call risk score stays below a per-step rejection threshold, but FINHARNESS accumulates the weak signals over the trajectory and routes the case to stronger verification before final approval.
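The accumulation idea in the figure description can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the window length, the plain sum used as the accumulator, the example scores, and every name (`monitor`, `Decision`, `STEP_REJECT_THRESHOLD`, `ESCALATE_THRESHOLD`, `WINDOW`) are hypothetical and chosen only to reproduce the qualitative behaviour described above, where no single step is rejected but the trajectory as a whole is escalated.

```python
from collections import deque
from dataclasses import dataclass

# All numbers below are illustrative assumptions, not values from the paper.
STEP_REJECT_THRESHOLD = 0.8  # hypothetical per-step rejection threshold
ESCALATE_THRESHOLD = 1.5     # hypothetical threshold on accumulated risk
WINDOW = 5                   # hypothetical number of recent steps accumulated


@dataclass
class Decision:
    step: int
    score: float
    accumulated: float
    action: str  # "allow", "reject", or "escalate"


def monitor(scores, window=WINDOW):
    """Return one Decision per step for a sequence of per-step risk scores."""
    recent = deque(maxlen=window)  # sliding window of recent step scores
    decisions = []
    for step, score in enumerate(scores, start=1):
        recent.append(score)
        accumulated = sum(recent)  # assumed accumulator: a plain windowed sum
        if score >= STEP_REJECT_THRESHOLD:
            action = "reject"      # a single step is risky enough on its own
        elif accumulated >= ESCALATE_THRESHOLD:
            action = "escalate"    # weak signals add up: route to a stronger check
        else:
            action = "allow"
        decisions.append(Decision(step, score, accumulated, action))
    return decisions


# Hypothetical five-step trajectory: every score is below the per-step threshold.
trajectory = [0.2, 0.3, 0.4, 0.3, 0.5]
decisions = monitor(trajectory)
for d in decisions:
    print(f"step {d.step}: score={d.score:.1f} "
          f"accumulated={d.accumulated:.1f} -> {d.action}")
```

With these assumed numbers, steps 1 to 4 are allowed and step 5 is escalated, even though a purely per-step check would have allowed all five.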

Benchmark results

~ FINVAULT attack success rate (%): 15.0 (routed FINHARNESS), -23.3 pp vs. baseline without routed FINHARNESS

~ FINVAULT benign approval rate (%): 39.3 (routed FINHARNESS), -1.8 pp vs. baseline without routed FINHARNESS

~ FINVAULT attack success rate (%): 8.4 (FINHARNESS always-advanced ablation), -6.6 pp vs. routed FINHARNESS

~ FINVAULT benign approval rate (%): 37.4 (FINHARNESS always-advanced ablation), -1.9 pp vs. routed FINHARNESS
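As a quick consistency check, the reported values and percentage-point deltas above can be combined to recover the implied baseline figures. The short sketch below uses only the numbers listed in this section; the baseline benign approval rate and the relative reductions are derived here, not reported in the summary, and the variable names are illustrative.

```python
# Reported values (percent) and deltas (percentage points) from the results above.
routed_asr, routed_benign = 15.0, 39.3
delta_asr_vs_baseline, delta_benign_vs_baseline = -23.3, -1.8
always_adv_asr, always_adv_benign = 8.4, 37.4

# Implied baseline = routed value minus its (negative) delta.
baseline_asr = round(routed_asr - delta_asr_vs_baseline, 1)           # 38.3
baseline_benign = round(routed_benign - delta_benign_vs_baseline, 1)  # 41.1 (derived)

# Relative change, which reads differently from the absolute pp change.
relative_asr_reduction = round(100 * (baseline_asr - routed_asr) / baseline_asr, 1)
relative_benign_loss = round(
    100 * (baseline_benign - routed_benign) / baseline_benign, 1
)

# Always-advanced ablation compared with routed FINHARNESS.
ablation_asr_delta = round(always_adv_asr - routed_asr, 1)            # -6.6
ablation_benign_delta = round(always_adv_benign - routed_benign, 1)   # -1.9

print(f"implied baseline ASR: {baseline_asr}%")
print(f"implied baseline benign approval: {baseline_benign}%")
print(f"relative ASR reduction: {relative_asr_reduction}%")
print(f"relative benign approval loss: {relative_benign_loss}%")
```

The implied baseline attack success rate of 38.3% matches the figure quoted in the summary, and the ablation deltas agree with the listed -6.6 pp and -1.9 pp.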