SWE-bench Lite Governance & Auditability Benchmark
Does TrustableClaw record and verify AI failures, not just successes?
To find out, we ran a 20-task SWE-bench Lite pilot in which GPT-5.4 mini failed every single task. Every failure was fully recorded, cryptographically receipted, and verified against tampering, without a single auditability gap. That is the point.
Most benchmarks only ask: did the AI get it right?
TrustableClaw asks a different question: when the AI gets it wrong, can you prove exactly what happened?
Governance matters most at the moment of failure. When a patch does not apply, when tests do not pass, when an agent produces unusable output, that is when audit trails, tamper-evident receipts, and verifiable records become the difference between accountability and guesswork.
What was tested
We ran 20 SWE-bench Lite tasks through TrustableClaw's full proof pipeline on a local macOS environment using GPT-5.4 mini. For every task, resolved or not, TrustableClaw generated six chained cryptographic receipts covering task selection, agent start, patch generation, test execution, result recording, and verification completion. That is 120 receipts in total (20 tasks × 6 receipts). All 120 verified. Zero auditability failures.
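To make "chained receipts" concrete, here is a minimal sketch of one way such a chain could work. It is an illustration only: the SHA-256 hash chain, the receipt fields, the `build_chain` and `verify_chain` helpers, and the example task id are all assumptions, not TrustableClaw's actual receipt format or API. Only the six stage names come from the description above.

```python
import hashlib
import json

# Stage names come from the text; every other name here is hypothetical.
STAGES = [
    "task_selection",
    "agent_start",
    "patch_generation",
    "test_execution",
    "result_recording",
    "verification_completion",
]

GENESIS = "0" * 64  # placeholder "previous hash" for the first receipt


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_chain(task_id: str, payloads: dict) -> list:
    """Create one receipt per stage, each committing to the previous receipt's hash."""
    receipts = []
    prev_hash = GENESIS
    for stage in STAGES:
        body = {
            "task_id": task_id,
            "stage": stage,
            "payload": payloads.get(stage, {}),
            "prev_hash": prev_hash,
        }
        receipt = dict(body, hash=_digest(body))
        receipts.append(receipt)
        prev_hash = receipt["hash"]
    return receipts


def verify_chain(receipts: list) -> bool:
    """Check stage order, recompute every hash, and check each link to its predecessor."""
    if [r.get("stage") for r in receipts] != STAGES:
        return False  # a missing, extra, or reordered receipt is an audit gap
    prev_hash = GENESIS
    for receipt in receipts:
        body = {k: v for k, v in receipt.items() if k != "hash"}
        if body.get("prev_hash") != prev_hash or _digest(body) != receipt.get("hash"):
            return False
        prev_hash = receipt["hash"]
    return True


# A failed task is receipted exactly like a successful one.
chain = build_chain(
    "example-task-1",  # hypothetical task id
    {"result_recording": {"resolved": False, "reason": "patch did not apply"}},
)
print(len(chain), verify_chain(chain))  # 6 True

# Rewriting the recorded outcome after the fact breaks verification.
tampered = [dict(r) for r in chain]
tampered[4] = dict(tampered[4], payload={"resolved": True})
print(verify_chain(tampered))  # False
```

Because each receipt commits to the hash of the one before it, changing or dropping any stage invalidates the chain from that point on. A bare hash chain like this only detects tampering if the final hash is stored or signed somewhere the editor cannot reach; a production system would presumably add signatures or external anchoring, which this sketch leaves out.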
Results
| Metric | Result |
|---|---|
| Tasks attempted | 20 / 20 |
| Tasks resolved by GPT-5.4 mini | 0 / 20 |
| Receipts generated | 120 / 120 |
| Receipts verified | 120 / 120 |
| Tamper scenarios detected | 15 / 15 |
| Auditability failures | 0 |
What the 0/20 means
GPT-5.4 mini failed to resolve any of the 20 tasks. Of the 20 patches it produced, 17 did not apply cleanly to the target repositories. The remaining 3 applied and tests ran, but the tasks remained unresolved. These are failures of the GPT-5.4 mini model, not of TrustableClaw. TrustableClaw recorded, hashed, and verified every one of them.
An audit trail that only works when the AI succeeds is not an audit trail. This pilot demonstrates that TrustableClaw's governance layer holds regardless of outcome. Most AI governance tools are never stress-tested against failure. This one was, deliberately.