Retrying vs Resampling in AI Control
James Lucassen, Adam Kaufman
Key claim
Auditing based on the maximum suspicion score across resamples improves safety in AI control.
This paper explores the concepts of retrying and resampling in AI coding tools, highlighting how retrying can reduce suspicion scores but may also allow for sneakier attacks. A key finding is that auditing based on maximum suspicion scores during resampling significantly improves safety without sacrificing usefulness.
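To make the audit rule concrete, here is a minimal sketch of max-suspicion auditing over resamples. It is an illustration under stated assumptions, not the authors' implementation: the names `resample_and_decide`, `sample_action`, `suspicion`, and `audit_threshold` are hypothetical, and the choice to execute the least suspicious candidate when no audit fires is an assumption not taken from the paper.

```python
from typing import Callable, List, Tuple


def resample_and_decide(
    sample_action: Callable[[], str],
    suspicion: Callable[[str], float],
    n_samples: int,
    audit_threshold: float,
) -> Tuple[str, str, List[float]]:
    """Draw n_samples candidates and apply a max-suspicion audit rule.

    Returns (decision, chosen_action, scores), where decision is
    "audit" or "execute". All names here are hypothetical.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    # Each candidate is drawn independently; the policy is assumed to get
    # no feedback about earlier samples or their scores.
    candidates = [sample_action() for _ in range(n_samples)]
    scores = [suspicion(c) for c in candidates]

    # Audit rule: escalate if the MOST suspicious sample crosses the threshold.
    worst = max(scores)
    if worst >= audit_threshold:
        return "audit", candidates[scores.index(worst)], scores

    # Assumption (not from the paper): otherwise run the least suspicious one.
    return "execute", candidates[scores.index(min(scores))], scores
```

The point of the sketch is the asymmetry: the audit decision looks at the worst sample, while the executed action can be the best one, so a single suspicious draw is enough to trigger review.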
The paper introduces a new perspective on retrying and resampling in AI coding scaffolds, revealing contradictions with prior findings.
The study provides solid experimental results and clear comparisons, supporting its claims effectively.
Deep reliability assessment
The methodology supports the claim that resampling can improve safety without leaking exploitable information, but the claims about how retrying protocols fare against strategic attackers go beyond what the evidence supports.
Reproducibility
No open source code or dataset is mentioned in the paper.
Discussion questions
- How might the assumption that retrying protocols leak exploitable information be challenged?
- What are the practical implications for AI developers in terms of choosing between retrying and resampling protocols?
- What experimental results or scenarios would falsify the claim that resampling improves safety without increasing costs?
Key figure
Figure 1 illustrates the safety achieved at a 0.3% auditing budget by different protocols, highlighting the vulnerabilities of retrying protocols to exploitation.
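As a rough illustration of what a 0.3% auditing budget means in practice, the sketch below picks a suspicion threshold so that at most that fraction of scores from honest runs would be audited. This calibration procedure is an assumption about how such a budget is commonly applied, not a description of the paper's method, and `audit_threshold_for_budget` and `honest_scores` are hypothetical names.

```python
import math
from typing import Sequence


def audit_threshold_for_budget(honest_scores: Sequence[float], budget: float) -> float:
    """Return a threshold t such that auditing every score >= t flags
    roughly `budget` of the honest scores (hypothetical helper).

    With tied scores at the threshold, the flagged fraction can exceed
    the budget; a real system would need a tie-breaking rule.
    """
    if not honest_scores:
        raise ValueError("honest_scores must be non-empty")
    if not 0.0 < budget < 1.0:
        raise ValueError("budget must be strictly between 0 and 1")

    ranked = sorted(honest_scores, reverse=True)
    # Number of honest runs we can afford to audit; the small epsilon guards
    # against floating-point error in budget * len(ranked).
    k = math.floor(budget * len(ranked) + 1e-9)
    if k == 0:
        # Too few samples to audit anything within the budget.
        return math.inf
    return ranked[k - 1]


# Example: 1000 honest scores 0..999 and a 0.3% budget flag the top 3 scores.
threshold = audit_threshold_for_budget([float(i) for i in range(1000)], 0.003)
```

Under a max-suspicion rule, the scores passed in would be the per-task maximum over resamples, so that the budget is spent on the same statistic the audit rule uses.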