HumanEval Governance & Auditability Benchmark

TrustableClaw reports 100% auditability coverage across OpenAI's full HumanEval benchmark. For all 164 standardized tasks it generated a receipt, verified that receipt, and detected the tamper attempt, with no receipt-verification failures.

Main Result

TrustableClaw achieved 100% auditability coverage across the full HumanEval benchmark.

164 / 164 receipts created
164 / 164 receipts verified
164 / 164 tamper tests detected

Category	Result
HumanEval tasks attempted	164 / 164
Receipts created	164 / 164
Receipts verified	164 / 164
Tamper tests detected	164 / 164
Coding tasks passed by the gpt-5.4-mini model	146 / 164
Coding-solution failures by the gpt-5.4-mini model	18 / 164
Auditability coverage	100%

What This Benchmark Shows

Most AI benchmarks only ask one question:

Did the model get the answer right?

TrustableClaw adds another question:

Can we prove what the AI did?

For every HumanEval task, TrustableClaw created a receipt, verified the receipt, and detected tampering during the tamper-test checks.

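This separates two things that are usually reported together: the gpt-5.4-mini model's coding performance (did the generated code pass the tests?) and auditability (can the run be proven and checked afterward?).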
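
The create, verify, and tamper-check cycle described above can be illustrated with a minimal sketch. It is not TrustableClaw's implementation: the receipt layout, the field names, and the functions `create_receipt`, `verify_receipt`, and `tamper_test` are assumptions made for illustration, and a plain SHA-256 digest stands in for whatever mechanism TrustableClaw actually uses.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes for the same
    # record every time, so the digest is reproducible.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def create_receipt(task_id: str, prompt: str, completion: str) -> dict:
    # Hypothetical receipt layout: the recorded fields plus a digest over them.
    body = {"task_id": task_id, "prompt": prompt, "completion": completion}
    digest = hashlib.sha256(canonical_bytes(body)).hexdigest()
    return {"body": body, "sha256": digest}


def verify_receipt(receipt: dict) -> bool:
    # Recompute the digest from the body and compare it with the stored one.
    expected = hashlib.sha256(canonical_bytes(receipt["body"])).hexdigest()
    return hmac.compare_digest(expected, receipt["sha256"])


def tamper_test(receipt: dict) -> bool:
    # Alter one recorded field in a deep copy, then confirm that
    # verification fails. True means the tampering was detected.
    tampered = json.loads(json.dumps(receipt))
    tampered["body"]["completion"] += "\n# injected"
    return not verify_receipt(tampered)


receipt = create_receipt("HumanEval/0", "def f():", "    return 1")
print(verify_receipt(receipt), tamper_test(receipt))
```

A bare digest only catches edits made by someone who does not also recompute the hash. A production design would normally add a signature, a keyed MAC, or a hash chain so that the digest itself cannot be silently replaced.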

Coding Failures vs. Auditability Failures

The gpt-5.4-mini model passed 146 of 164 coding tasks, a pass rate of about 89.0%.

The 18 failed tasks were coding-solution failures. That means the generated code did not pass the HumanEval unit tests for those tasks.

They were not TrustableClaw auditability failures.

TrustableClaw still recorded, verified, and tamper-checked every task.
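
The distinction can be made concrete by scoring the two metrics from the same per-task records. The sketch below is an illustration only: the `TaskRecord` fields and the `summarize` function are hypothetical, and the records are placeholders built to reproduce the reported counts, not the actual benchmark output (which 18 tasks failed is not taken from the source).

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    tests_passed: bool       # model outcome: did the code pass the unit tests?
    receipt_created: bool    # audit outcome
    receipt_verified: bool   # audit outcome
    tamper_detected: bool    # audit outcome


def summarize(records: list) -> dict:
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    passed = sum(r.tests_passed for r in records)
    # A task counts as audited only if all three audit checks succeeded.
    audited = sum(
        r.receipt_created and r.receipt_verified and r.tamper_detected
        for r in records
    )
    return {
        "tasks": total,
        "coding_pass_rate": passed / total,
        "auditability_coverage": audited / total,
    }


# Placeholder data: 18 failing and 146 passing tasks, all fully audited.
records = [
    TaskRecord(f"HumanEval/{i}", i >= 18, True, True, True) for i in range(164)
]
summary = summarize(records)
print(f"coding pass rate:      {summary['coding_pass_rate']:.1%}")       # 89.0%
print(f"auditability coverage: {summary['auditability_coverage']:.0%}")  # 100%
```

Because the two rates are computed from independent fields, a failed unit test lowers the coding pass rate without affecting auditability coverage.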

Evidence Package

Review the benchmark evidence against the raw files and methodology, not from this summary page alone. The evidence package comprises:

Full raw JSON/JSONL results for all 164 tasks
Receipt manifest with receipt IDs and hashes
Verification script or command used to check receipts
Tamper-test script and tamper-test output
Methodology explaining pass/fail scoring and auditability scoring
Environment details, including run date, OS, model (gpt-5.4-mini), software versions, and TrustableClaw commit
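
As a rough idea of what an independent check of the receipt manifest could look like, the sketch below recomputes file hashes and compares them with a manifest. The manifest format is an assumption (one JSON object per line with hypothetical `receipt_id`, `path`, and `sha256` fields); the real evidence package may use a different layout and its own verification script.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    # Hash the file in chunks so large result files do not need to fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def check_manifest(manifest_path: Path) -> list:
    """Return the receipt IDs whose file is missing or whose hash differs."""
    failures = []
    base = manifest_path.parent
    with manifest_path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            entry = json.loads(line)
            target = base / entry["path"]  # assumed relative to the manifest
            if not target.is_file() or sha256_file(target) != entry["sha256"]:
                failures.append(entry["receipt_id"])
    return failures
```

Calling `check_manifest(Path("manifest.jsonl"))` on a manifest of this assumed shape returns an empty list when every listed file is present and unmodified.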

View Benchmark Evidence on GitHub
