SWE-bench Lite Governance & Auditability Benchmark

Does TrustableClaw record and verify AI failures, not just successes?

To find out, we ran a 20-task SWE-bench Lite pilot where GPT-5.4 mini failed every single task. Every failure was fully recorded, cryptographically receipted, and tamper-verified without a single auditability gap. That is the point.

Most benchmarks only ask: did the AI get it right?

TrustableClaw asks a different question: when the AI gets it wrong, can you prove exactly what happened?

Governance matters most at the moment of failure. When a patch does not apply, when tests do not pass, when an agent produces unusable output, that is when audit trails, tamper-evident receipts, and verifiable records become the difference between accountability and guesswork.

What was tested

We ran 20 SWE-bench Lite tasks through TrustableClaw's full proof pipeline on a local macOS environment using GPT-5.4 mini. For every task, resolved or not, TrustableClaw generated six chained cryptographic receipts covering task selection, agent start, patch generation, test execution, result recording, and verification completion. Six receipts for each of 20 tasks makes 120 receipts in total. All 120 verified, with zero auditability failures.
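
To make "chained receipts" concrete, here is a minimal sketch of one way a six-stage receipt chain could be built and checked. It is not TrustableClaw's actual receipt format: the field names (`task_id`, `stage`, `payload`, `prev_hash`, `hash`), the `GENESIS` placeholder, and the use of plain SHA-256 hash chaining over canonical JSON (with no signatures) are all assumptions made for illustration. Only the six stage names come from the description above.

```python
import hashlib
import json

# The six stages named in the text; everything else here is an illustrative assumption.
STAGES = [
    "task_selection",
    "agent_start",
    "patch_generation",
    "test_execution",
    "result_recording",
    "verification_completion",
]

GENESIS = "0" * 64  # hypothetical placeholder meaning "no previous receipt"


def receipt_hash(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the receipt body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_chain(task_id: str, payloads: dict) -> list:
    """Create one receipt per stage, each committing to the previous receipt's hash."""
    chain, prev = [], GENESIS
    for stage in STAGES:
        body = {
            "task_id": task_id,
            "stage": stage,
            "payload": payloads.get(stage),
            "prev_hash": prev,
        }
        receipt = dict(body, hash=receipt_hash(body))
        chain.append(receipt)
        prev = receipt["hash"]
    return chain


def verify_chain(chain: list) -> bool:
    """Recompute every hash and check every link; any edit or deletion fails."""
    prev = GENESIS
    for receipt in chain:
        body = {k: v for k, v in receipt.items() if k != "hash"}
        if receipt["prev_hash"] != prev or receipt_hash(body) != receipt["hash"]:
            return False
        prev = receipt["hash"]
    return True


# A failed task is chained exactly like a successful one.
chain = build_chain("example-task-1", {"result_recording": {"resolved": False}})
print(len(chain), verify_chain(chain))  # 6 True

# Tampering: flip the recorded outcome after the fact.
chain[4]["payload"] = {"resolved": True}
print(verify_chain(chain))  # False
```

Because each receipt commits to the hash of the one before it, rewriting a recorded failure as a success invalidates that receipt and every later link. A production design would typically also sign the receipts, which this sketch leaves out.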

Results

Metric                            Result
Tasks attempted                   20 / 20
Tasks resolved by GPT-5.4 mini    0 / 20
Receipts generated                120 / 120
Receipts verified                 120 / 120
Tamper scenarios detected         15 / 15
Auditability failures             0

What the 0/20 means

GPT-5.4 mini failed to resolve any of the 20 tasks. Of its 20 patches, 17 did not apply cleanly to the target repositories. The other 3 applied and the tests ran, but the tasks remained unresolved. These are GPT-5.4 mini model failures, not TrustableClaw failures. TrustableClaw recorded, hashed, and verified every one of them.
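
The failure breakdown can be treated as first-class recorded outcomes rather than as missing data. The sketch below shows one hypothetical way to do that: the `Outcome` enum and the `classify` helper are illustrative names, not part of TrustableClaw, and only the counts (17 and 3 out of 20) are taken from the pilot.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome labels; each one is a recordable result, not an error state."""
    PATCH_DID_NOT_APPLY = "patch_did_not_apply"
    TESTS_RAN_UNRESOLVED = "tests_ran_unresolved"
    RESOLVED = "resolved"


def classify(patch_applied: bool, resolved: bool) -> Outcome:
    """Map a run's two observable facts onto a single recorded outcome."""
    if not patch_applied:
        return Outcome.PATCH_DID_NOT_APPLY
    return Outcome.RESOLVED if resolved else Outcome.TESTS_RAN_UNRESOLVED


# Counts from the pilot: 17 patches did not apply, 3 applied but stayed unresolved.
runs = [(False, False)] * 17 + [(True, False)] * 3
tally = Counter(classify(applied, resolved) for applied, resolved in runs)

print({outcome.value: tally[outcome] for outcome in Outcome})
# {'patch_did_not_apply': 17, 'tests_ran_unresolved': 3, 'resolved': 0}
```

Giving every failure mode an explicit label means an unresolved task yields a complete record to hash and verify, just as a resolved one would.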

An audit trail that only works when the AI succeeds is not an audit trail. This pilot demonstrates that TrustableClaw's governance layer holds regardless of outcome. Most AI governance tools are never stress-tested against failure. This one was, deliberately.

View Benchmark Evidence on GitHub →