What was tested?
Every HumanEval task was run through a governance and auditability flow that created receipts, verified receipts, and ran tamper checks.
TrustableClaw reports 100% auditability coverage across OpenAI's full HumanEval benchmark: for all 164 standardized test cases, it generated a receipt, verified that receipt, and detected the tamper attempt, with no receipt-verification failures.
| Category | Result |
|---|---|
| HumanEval tasks attempted | 164 / 164 |
| Receipts created | 164 / 164 |
| Receipts verified | 164 / 164 |
| Tamper tests detected | 164 / 164 |
| Coding tasks passed by the gpt-5.4-mini model | 146 / 164 |
| Coding-solution failures by the gpt-5.4-mini model | 18 / 164 |
| Auditability coverage | 100% |
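The percentages implied by the table can be recomputed from its counts. The short Python sketch below does that arithmetic. The variable names are illustrative, and the definition of auditability coverage used here (the share of attempted tasks that have a created receipt, a verified receipt, and a detected tamper test) is an assumption drawn from this summary, not a published formula.

```python
# Counts copied from the results table above.
attempted = 164
receipts_created = 164
receipts_verified = 164
tamper_tests_detected = 164
coding_tasks_passed = 146


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Model performance: how often the generated code passed the unit tests.
coding_pass_rate = percent(coding_tasks_passed, attempted)
coding_failures = attempted - coding_tasks_passed

# Auditability (ASSUMED definition): a task counts as covered only if all
# three audit steps succeeded. Per-task data is not in this summary, so the
# smallest of the three counts stands in for the number of covered tasks.
fully_audited = min(receipts_created, receipts_verified, tamper_tests_detected)
auditability_coverage = percent(fully_audited, attempted)

print(f"Coding pass rate:      {coding_pass_rate}%")
print(f"Coding failures:       {coding_failures}")
print(f"Auditability coverage: {auditability_coverage}%")
```

The two figures are computed from different rows of the table, so a change in one does not move the other.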
Most AI benchmarks only ask one question:
Did the model get the answer right?
TrustableClaw adds another question:
Can we prove what the AI did?
For every HumanEval task, TrustableClaw created a receipt, verified that receipt, and detected the tampering introduced by the tamper test.
This separates the gpt-5.4-mini model's coding performance from auditability.
The gpt-5.4-mini model passed 146 of 164 coding tasks.
The 18 failed tasks were coding-solution failures. That means the generated code did not pass the HumanEval unit tests for those tasks.
They were not TrustableClaw auditability failures.
TrustableClaw still recorded, verified, and tamper-checked every task.
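To make the record, verify, and tamper-check steps concrete, here is a minimal Python sketch of one way such a flow could work. It is not TrustableClaw's implementation, which this summary does not describe. The receipt layout, the function names `create_receipt`, `verify_receipt`, and `tamper_detected`, the HMAC-SHA256 signing scheme, and the demo key are all hypothetical choices made for illustration.

```python
import hashlib
import hmac
import json

# HYPOTHETICAL demo key; a real system would manage keys securely.
SECRET_KEY = b"demo-key-for-illustration-only"


def _sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical_bytes(payload: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte representation.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _sign(payload: dict) -> str:
    return hmac.new(SECRET_KEY, _canonical_bytes(payload), hashlib.sha256).hexdigest()


def create_receipt(task_id: str, prompt: str, completion: str) -> dict:
    """Record what was asked and what the model returned (assumed layout)."""
    payload = {
        "task_id": task_id,
        "prompt_sha256": _sha256_hex(prompt),
        "completion_sha256": _sha256_hex(completion),
    }
    return {"payload": payload, "signature": _sign(payload)}


def verify_receipt(receipt: dict) -> bool:
    """Return True if the signature matches the recorded payload."""
    return hmac.compare_digest(_sign(receipt["payload"]), receipt["signature"])


def tamper_detected(receipt: dict) -> bool:
    """Alter a copy of the receipt and confirm that verification now fails."""
    altered_payload = dict(receipt["payload"], completion_sha256="0" * 64)
    altered = {"payload": altered_payload, "signature": receipt["signature"]}
    return not verify_receipt(altered)


receipt = create_receipt("HumanEval/0", "example prompt", "example completion")
print(verify_receipt(receipt))   # True: the untouched receipt verifies
print(tamper_detected(receipt))  # True: the altered copy fails verification
```

In this sketch none of the three steps looks at whether the completion passes the unit tests, which is why a coding-solution failure can still produce a valid, verifiable receipt.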
The benchmark evidence should be reviewed alongside the raw files and methodology, not from this summary page alone.