SWE-bench Lite Governance & Auditability Benchmark

Does TrustableClaw record and verify AI failures, not just successes?

To find out, we ran a 20-task SWE-bench Lite pilot where GPT-5.4 mini failed every single task. Every failure was fully recorded, cryptographically receipted, and tamper-verified without a single auditability gap. That is the point.

Most benchmarks only ask: did the AI get it right?

TrustableClaw asks a different question: when the AI gets it wrong, can you prove exactly what happened?

Governance matters most at the moment of failure. When a patch does not apply, when tests do not pass, when an agent produces unusable output, that is when audit trails, tamper-evident receipts, and verifiable records become the difference between accountability and guesswork.

What was tested

We ran 20 SWE-bench Lite tasks through TrustableClaw's full proof pipeline on a local macOS environment using GPT-5.4 mini. For every task, resolved or not, TrustableClaw generated six chained cryptographic receipts covering task selection, agent start, patch generation, test execution, result recording, and verification completion. That comes to 120 receipts in total (20 tasks × 6 receipts). All 120 verified, with zero auditability failures.
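To make the idea of chained receipts concrete, here is a minimal sketch of a hash chain over the six stages listed above. It is an illustration only, not TrustableClaw's actual receipt format: the field names, the SHA-256 and canonical-JSON choices, the all-zero genesis value, and the helper functions build_chain and verify_chain are all assumptions made for this example.

```python
import hashlib
import json

# Stage names mirror the six receipt types listed above; the identifiers are hypothetical.
STAGES = [
    "task_selection",
    "agent_start",
    "patch_generation",
    "test_execution",
    "result_recording",
    "verification_completion",
]

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first receipt


def receipt_hash(body):
    """SHA-256 over a deterministic JSON encoding of the receipt body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_chain(task_id, payloads):
    """Create one receipt per stage, each committing to the previous receipt's hash."""
    chain = []
    prev = GENESIS
    for stage, payload in zip(STAGES, payloads):
        body = {"task_id": task_id, "stage": stage, "payload": payload, "prev_hash": prev}
        receipt = dict(body, hash=receipt_hash(body))
        chain.append(receipt)
        prev = receipt["hash"]
    return chain


def verify_chain(chain):
    """Return True only if the chain is complete, every hash recomputes, and every link matches."""
    if len(chain) != len(STAGES):
        return False
    prev = GENESIS
    for receipt in chain:
        body = {k: v for k, v in receipt.items() if k != "hash"}
        if receipt["prev_hash"] != prev or receipt_hash(body) != receipt["hash"]:
            return False
        prev = receipt["hash"]
    return True


# A failed task still produces a complete, verifiable chain (payload values are made up).
payloads = [
    {"benchmark": "SWE-bench Lite"},
    {"model": "GPT-5.4 mini"},
    {"patch_sha256": hashlib.sha256(b"example diff").hexdigest()},
    {"patch_applied": False},
    {"resolved": False},
    {"receipts_checked": 5},
]
chain = build_chain("example-task-1", payloads)
print(verify_chain(chain))

# Editing a recorded outcome after the fact breaks verification.
tampered = [dict(r) for r in chain]
tampered[4]["payload"] = {"resolved": True}
print(verify_chain(tampered))
```

Because each receipt commits to the hash of the one before it, changing any recorded field, or dropping a receipt, invalidates the chain from that point on. That property holds whether the recorded outcome is a success or a failure.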

Results

Metric                            Result
Tasks attempted                   20 / 20
Tasks resolved by GPT-5.4 mini    0 / 20
Receipts generated                120 / 120
Receipts verified                 120 / 120
Tamper scenarios detected         15 / 15
Auditability failures             0

What the 0/20 means

GPT-5.4 mini failed to resolve any of the 20 tasks. Of the 20 patches it produced, 17 did not apply cleanly to the target repositories. The other 3 applied and their tests ran, but the tasks remained unresolved. These are failures of the GPT-5.4 mini model, not of TrustableClaw. TrustableClaw recorded, hashed, and verified every one of them.

An audit trail that only works when the AI succeeds is not an audit trail. This pilot demonstrates that TrustableClaw's governance layer holds regardless of outcome. Most AI governance tools are never stress-tested against failure. This one was, deliberately.
