
HumanEval Governance & Auditability Benchmark

TrustableClaw reports 100% auditability coverage across OpenAI's full HumanEval benchmark: for each of the 164 tasks it created a receipt, verified that receipt, and detected the tamper attempt, with no receipt-verification failures.

Main Result

TrustableClaw achieved 100% auditability coverage across the full HumanEval benchmark.

  • 164 / 164 receipts created
  • 164 / 164 receipts verified
  • 164 / 164 tamper tests detected

Category                                               Result
HumanEval tasks attempted                              164 / 164
Receipts created                                       164 / 164
Receipts verified                                      164 / 164
Tamper tests detected                                  164 / 164
Coding tasks passed by the gpt-5.4-mini model          146 / 164
Coding-solution failures by the gpt-5.4-mini model     18 / 164
Auditability coverage                                  100%

What This Benchmark Shows

Most AI benchmarks only ask one question:

Did the model get the answer right?

TrustableClaw adds another question:

Can we prove what the AI did?

For every HumanEval task, TrustableClaw created a receipt, verified the receipt, and detected tampering during the tamper-test checks.
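
The create, verify, and tamper-check cycle described above can be sketched roughly as follows. This page does not describe TrustableClaw's actual receipt format, so everything here is an assumption for illustration: the function names, the receipt fields, and the choice of a SHA-256 digest over canonical JSON are all hypothetical.

```python
import hashlib
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize a payload deterministically so equal payloads hash equally."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def create_receipt(task_id: str, prompt: str, completion: str) -> dict:
    """Record what the model was asked and what it produced, plus a digest.

    Hypothetical receipt layout; not TrustableClaw's real schema.
    """
    payload = {"task_id": task_id, "prompt": prompt, "completion": completion}
    digest = hashlib.sha256(canonical_bytes(payload)).hexdigest()
    return {"payload": payload, "sha256": digest}


def verify_receipt(receipt: dict) -> bool:
    """Recompute the digest and compare it with the recorded one."""
    expected = hashlib.sha256(canonical_bytes(receipt["payload"])).hexdigest()
    return expected == receipt["sha256"]


def tamper_is_detected(receipt: dict) -> bool:
    """Alter a copy of the receipt and confirm that verification now fails."""
    tampered = json.loads(json.dumps(receipt))  # deep copy via JSON round trip
    tampered["payload"]["completion"] += "\n# injected after the fact"
    return not verify_receipt(tampered)


receipt = create_receipt("HumanEval/0", "example prompt", "    return False")
print("verified:", verify_receipt(receipt))
print("tamper detected:", tamper_is_detected(receipt))
```

A bare digest like this only provides tamper evidence if the recorded hash is stored somewhere an attacker cannot also rewrite, for example in a separate manifest or under a signature.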

This separates the gpt-5.4-mini model's coding performance from TrustableClaw's auditability.

Coding Failures vs. Auditability Failures

The gpt-5.4-mini model passed 146 of 164 coding tasks.

The 18 failed tasks were coding-solution failures. That means the generated code did not pass the HumanEval unit tests for those tasks.

They were not TrustableClaw auditability failures.

TrustableClaw still recorded, verified, and tamper-checked every task.
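
To make the distinction concrete, the two metrics can be computed independently from per-task records. The sketch below is illustrative only: the record fields are hypothetical, the records are synthetic placeholders that merely reproduce the reported totals, and it assumes a task counts as covered only when its receipt was created, verified, and tamper-checked. The page does not state TrustableClaw's exact scoring rule.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task outcome; field names are assumptions."""
    passed_unit_tests: bool
    receipt_created: bool
    receipt_verified: bool
    tamper_detected: bool


def coding_pass_rate(records: list) -> float:
    """Fraction of tasks whose generated code passed the unit tests."""
    return sum(r.passed_unit_tests for r in records) / len(records)


def auditability_coverage(records: list) -> float:
    """Fraction of tasks with a created, verified, tamper-checked receipt."""
    covered = sum(
        r.receipt_created and r.receipt_verified and r.tamper_detected
        for r in records
    )
    return covered / len(records)


# Synthetic records matching the reported totals: 146 passes, 18 failures,
# and all 164 receipts created, verified, and tamper-checked.
records = [TaskRecord(i < 146, True, True, True) for i in range(164)]

print(f"coding pass rate: {coding_pass_rate(records):.1%}")
print(f"auditability coverage: {auditability_coverage(records):.1%}")
```

With the reported counts, the coding pass rate is 146 / 164, about 89.0%, while auditability coverage stays at 100% because a failed coding task still has a complete receipt trail.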

Evidence Package

The benchmark evidence should be reviewed together with the raw files and methodology, not from this summary page alone.

  • Full raw JSON/JSONL results for all 164 tasks
  • Receipt manifest with receipt IDs and hashes
  • Verification script or command used to check receipts
  • Tamper-test script and tamper-test output
  • Methodology explaining pass/fail scoring and auditability scoring
  • Environment details, including run date, OS, gpt-5.4-mini model, versions, and TrustableClaw commit

View Benchmark Evidence on GitHub
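
A reviewer working through the evidence package might cross-check the raw results against the receipt manifest along these lines. This is a minimal sketch under stated assumptions, not the project's verification script: the JSONL layout, the `receipt_id` field, and a manifest that maps receipt IDs to the SHA-256 of each raw result line are all hypothetical.

```python
import hashlib
import json


def line_digest(line: str) -> str:
    """SHA-256 of one raw JSONL line, ignoring surrounding whitespace."""
    return hashlib.sha256(line.strip().encode("utf-8")).hexdigest()


def check_manifest(results_jsonl: str, manifest: dict) -> list:
    """Return receipt IDs whose hash is missing from or disagrees with the manifest."""
    seen = set()
    problems = []
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        receipt_id = record["receipt_id"]  # assumed field name
        seen.add(receipt_id)
        if manifest.get(receipt_id) != line_digest(line):
            problems.append(receipt_id)
    # Manifest entries with no matching result line are also reported.
    problems.extend(sorted(set(manifest) - seen))
    return problems


# Tiny in-memory example standing in for the real files.
lines = [
    json.dumps({"receipt_id": "r-0001", "task_id": "HumanEval/0", "passed": True}),
    json.dumps({"receipt_id": "r-0002", "task_id": "HumanEval/1", "passed": False}),
]
manifest = {json.loads(l)["receipt_id"]: line_digest(l) for l in lines}

clean = "\n".join(lines)
edited = clean.replace('"passed": false', '"passed": true')  # simulated tampering

print(check_manifest(clean, manifest))
print(check_manifest(edited, manifest))
```

Hashing the raw line rather than the parsed record means any byte-level change to a result is flagged, at the cost of being sensitive to harmless re-serialization.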