2026-05-27 · alignment, data

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

Richard J. Young, Gregory D. Moody

Key claim

Coding models need stricter refusal standards than chat models.

In plain English

This paper presents a new prompt bank that distinguishes between executable malicious code requests and harmful security knowledge requests. It consolidates multiple corpora and establishes a reliable basis for evaluating coding model compliance. The key result is the creation of a validated instrument that sets a higher refusal standard for coding models.

Novelty

8.0/10

The paper introduces a new instrument for measuring coding-model compliance: a consensus-labeled prompt bank that separates executable malicious-code requests from harmful security-knowledge requests.

Reliability

9.0/10

The study uses a 3-of-5 consensus protocol with substantial agreement among judges (Fleiss' kappa of 0.7665 on the full-panel subset), which supports the reliability of the labels.

Deep reliability assessment

The methodology supports a reasonably reliable consensus labeling of prompts into executable malicious-code requests versus harmful security-knowledge requests, with quantified agreement and clear handling of skewed corpora. It does not by itself establish whether coding models are safer or less safe than chat models, nor whether generated outputs are actually executable, harmful, or refused in deployment.
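The consensus step described above can be illustrated with a minimal sketch. It assumes five judge votes per prompt, a 3-of-5 threshold (the figure quoted under Benchmark results), and that a prompt with no label reaching the threshold is marked AMBIGUOUS. The function name `consensus_label`, the handling of missing votes as `None`, and the third vote value `"OTHER"` are hypothetical choices for illustration; the paper's exact rule may differ.

```python
from collections import Counter

AMBIGUOUS = "AMBIGUOUS"  # assumed fallback when no label reaches the threshold


def consensus_label(votes, threshold=3):
    """Return the label that at least `threshold` judges chose, else AMBIGUOUS.

    `votes` is a list of judge labels such as "CODE" or "KNOWLEDGE".
    A vote of None (assumption: a judge that failed or abstained) is ignored.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return AMBIGUOUS
    label, top = counts.most_common(1)[0]
    return label if top >= threshold else AMBIGUOUS


# Toy votes, not data from the paper.
clear_case = consensus_label(["CODE", "CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE"])
split_case = consensus_label(["CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE", "OTHER"])
```

With two labels and five valid votes, one label always reaches three votes, so a prompt can only miss consensus when a vote is missing or a third vote value is possible. That is why the sketch allows both.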

Reproducibility

Dataset: yes, the paper says it releases an expanded consensus-labeled prompt bank with 4,748 consensus-CODE prompts and 1,923 consensus-KNOWLEDGE prompts across eight corpora. Code: not mentioned in the provided text; the judge models and routing setup are described, but no repository URL is given.

Discussion questions

1. Does the CODE versus KNOWLEDGE split actually capture the relevant risk boundary, or can harmful security knowledge be just as operationalizable when paired with a capable human or agent?
2. If you are building a coding assistant, should refusal policy be evaluated separately for executable code generation, explanations, tool use, and agentic workflows rather than with one aggregate safety score?
3. What evidence would falsify the paper's premise: for example, if coding-specialized models refused executable malicious-code prompts at the same or higher rate than general-purpose models under this prompt bank?

Key figure

Figure 2 shows the expansion from the v1 four-corpus prompt bank to the v2 eight-corpus bank and breaks down each source corpus by consensus-CODE, consensus-KNOWLEDGE, and AMBIGUOUS labels.

Benchmark results

Full v2 eight-corpus prompt pool, full-panel subset
Fleiss' kappa: 0.7665 (vs v1 four-corpus pool: -0.1095)

3,133 shared prompts from the earlier four-corpus release
Cohen's kappa: 0.952 (vs earlier four-corpus release labels: not reported)

Full v2 eight-corpus prompt pool
Prompts reaching 3-of-5 consensus: 0.9994 (no baseline; not applicable)
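For readers unfamiliar with the two agreement statistics listed above, the following is a minimal sketch of their standard textbook definitions. It is not the authors' code, and the inputs are toy values rather than data from the paper. The standard Fleiss' kappa formula assumes every item is rated by the same number of judges, which is presumably why the summary reports it on a full-panel subset.

```python
from collections import Counter


def fleiss_kappa(count_rows):
    """Fleiss' kappa for many raters.

    count_rows[i][j] is the number of judges who assigned item i to
    category j. Every row must sum to the same number of judges (>= 2).
    """
    n_items = len(count_rows)
    n_raters = sum(count_rows[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in count_rows):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Observed agreement: mean over items of the per-item pairwise agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_rows
    ) / n_items
    # Chance agreement from the pooled category proportions.
    n_cats = len(count_rows[0])
    p_j = [
        sum(row[j] for row in count_rows) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_j)
    # Kappa is undefined when chance agreement is already 1.
    return float("nan") if p_e == 1 else (p_bar - p_e) / (1 - p_e)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each labeling's marginal label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return float("nan") if p_e == 1 else (p_o - p_e) / (1 - p_e)


# Toy inputs: columns are (CODE, KNOWLEDGE) vote counts from five judges.
toy_fleiss = fleiss_kappa([[5, 0], [0, 5]])
# Toy comparison of two label sets over four shared prompts.
toy_cohen = cohen_kappa(["C", "C", "K", "K"], ["C", "C", "K", "C"])
```

Both statistics correct raw agreement for the agreement expected by chance, so a skewed label distribution lowers kappa even when judges rarely disagree.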