Benchmark Evidence

TrustableClaw tests AI agents not only for output quality, but for governability, auditability, receipt verification, and tamper detection.

These benchmarks show whether a record of AI work can be created, verified, and shown to be untampered.

HumanEval Governance & Auditability Benchmark

TrustableClaw reports 100% auditability coverage across OpenAI's full HumanEval benchmark: for each of the 164 standardized test cases, it generated a receipt, verified that receipt, and detected the tamper attempt, with no receipt-verification failures.

  • 164 / 164 receipts created
  • 164 / 164 receipts verified
  • 164 / 164 tamper tests detected
  • Model coding accuracy is reported separately from auditability coverage
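
The create / verify / tamper-detect cycle counted above can be illustrated with a minimal sketch. It does not show TrustableClaw's actual receipt format or API, which this page does not describe. The function names, the receipt fields, the HMAC-SHA256 signing scheme, and the demo key are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical signing key for this sketch only; a real system would manage keys securely.
SECRET_KEY = b"demo-key-not-for-production"


def _sign(body: dict) -> str:
    """Sign a canonical JSON encoding of the receipt body with HMAC-SHA256."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def create_receipt(task_id: str, output: str) -> dict:
    """Create a receipt binding a task id to a hash of the agent's output (assumed format)."""
    body = {
        "task_id": task_id,
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    return {**body, "signature": _sign(body)}


def verify_receipt(receipt: dict, output: str) -> bool:
    """Return True only if the receipt is unmodified and matches the given output."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    signature_ok = hmac.compare_digest(_sign(body), receipt.get("signature", ""))
    output_hash = hashlib.sha256(output.encode("utf-8")).hexdigest()
    output_ok = hmac.compare_digest(output_hash, body.get("output_sha256", ""))
    return signature_ok and output_ok


# Usage: one receipt created, verified, and checked against two tamper attempts.
original = "def add(a, b):\n    return a + b\n"
receipt = create_receipt("HumanEval/0", original)

print(verify_receipt(receipt, original))                      # True
print(verify_receipt(receipt, original + "# edited\n"))       # False: output was altered
print(verify_receipt({**receipt, "task_id": "HumanEval/1"}, original))  # False: receipt was altered
```

In this sketch a tamper test counts as detected when verification returns False after either the output or the receipt itself has been modified.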

SWE-bench Lite Governance & Auditability Benchmark

Does TrustableClaw record and verify AI failures, not just successes?

To find out, we ran a 20-task SWE-bench Lite pilot in which GPT-5.4 mini failed every task. Every failure was fully recorded, cryptographically receipted, and tamper-verified, with no auditability gaps. Governance that only works when the AI succeeds is not governance at all.
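
The distinction between task success and audit coverage can be made concrete with a short sketch. The `TaskAudit` record and both metric functions are hypothetical names introduced for illustration, and the task ids are placeholders. Only the counts (20 tasks, all failed, all receipted and verified) come from the pilot described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskAudit:
    """Hypothetical per-task record: model outcome plus audit outcomes."""
    task_id: str
    passed: bool            # did the model solve the task?
    receipt_created: bool
    receipt_verified: bool
    tamper_detected: bool


def accuracy(audits: list[TaskAudit]) -> float:
    """Fraction of tasks the model solved."""
    return sum(a.passed for a in audits) / len(audits)


def auditability_coverage(audits: list[TaskAudit]) -> float:
    """Fraction of tasks with a created, verified, tamper-checked receipt,
    regardless of whether the model solved the task."""
    covered = sum(
        a.receipt_created and a.receipt_verified and a.tamper_detected
        for a in audits
    )
    return covered / len(audits)


# Placeholder ids; the outcomes mirror the pilot: 20 failures, all fully audited.
pilot = [
    TaskAudit(f"task-{i:02d}", passed=False, receipt_created=True,
              receipt_verified=True, tamper_detected=True)
    for i in range(1, 21)
]

print(accuracy(pilot))               # 0.0
print(auditability_coverage(pilot))  # 1.0
```

Keeping the two metrics in separate functions reflects the point of the pilot: coverage is computed over every task, so a failed task still counts toward auditability.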

Independent Technical Verification

This section reviews the technical claims in the Rational AI vs. LLM infographic. Each claim is evaluated against observed TrustableClaw architecture and runtime behavior, with verdicts showing which claims are backed, nuanced, or unsupported.

Claude Verdict