Benchmark Evidence

TrustableClaw tests AI agents not only for output quality but also for governability, auditability, receipt verification, and tamper detection.

These benchmarks show whether AI work can be recorded, verified, and proven.

HumanEval Governance & Auditability Benchmark

TrustableClaw reports 100% auditability coverage across OpenAI's full HumanEval benchmark: for each of the 164 standardized test cases, it generated a receipt, verified that receipt, and detected the corresponding tamper test, with no receipt-verification failures.

164 / 164 receipts created
164 / 164 receipts verified
164 / 164 tamper tests detected
Model coding accuracy is reported separately from auditability coverage
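
To make the three counts above concrete, here is a minimal sketch of what creating a receipt, verifying it, and detecting tampering can look like. It is an illustration only: the page does not describe TrustableClaw's actual receipt format or signing scheme, so the HMAC-SHA256 approach, the SECRET_KEY value, and the function names create_receipt and verify_receipt are all assumptions.

```python
import hashlib
import hmac
import json

# Hypothetical signing key, for illustration only. A real system would use
# managed keys; nothing here reflects TrustableClaw's actual implementation.
SECRET_KEY = b"demo-key-not-for-production"


def _sign(payload: dict) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of payload."""
    # Sorted keys give a stable byte string for the same logical payload.
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def create_receipt(task_id: str, output: str) -> dict:
    """Record a task output together with a tag that binds it to the task."""
    payload = {"task_id": task_id, "output": output}
    return {**payload, "tag": _sign(payload)}


def verify_receipt(receipt: dict) -> bool:
    """Recompute the tag from the recorded fields and compare in constant time."""
    payload = {"task_id": receipt["task_id"], "output": receipt["output"]}
    return hmac.compare_digest(_sign(payload), receipt["tag"])


# One receipt, plus a tamper test that alters the recorded output afterwards.
receipt = create_receipt("HumanEval/0", "def candidate(): return 1")
tampered = {**receipt, "output": "def candidate(): return 2"}

print(verify_receipt(receipt))   # untouched receipt verifies
print(verify_receipt(tampered))  # altered receipt fails verification
```

In this sketch, a "tamper test detected" result corresponds to verify_receipt returning False for the altered record while the untouched receipt still verifies.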

View HumanEval Results

SWE-bench Lite Governance & Auditability Benchmark

Does TrustableClaw record and verify AI failures, not just successes?

To find out, we ran a 20-task SWE-bench Lite pilot in which GPT-5.4 mini failed every task. Every failure was fully recorded, cryptographically receipted, and tamper-verified, with no auditability gaps. Governance that only works when the AI succeeds is not governance at all.
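
The distinction between model accuracy and auditability coverage can be sketched as two metrics computed independently over the same task records. The TaskRecord fields, the function names, and the task identifiers below are hypothetical; only the 20-task count and the all-failed, all-audited outcome come from the pilot described above.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    passed: bool            # did the model solve the task?
    receipt_created: bool
    receipt_verified: bool
    tamper_detected: bool


def accuracy(records: list[TaskRecord]) -> float:
    """Fraction of tasks the model solved."""
    return sum(r.passed for r in records) / len(records)


def auditability_coverage(records: list[TaskRecord]) -> float:
    """Fraction of tasks with a created, verified, tamper-checked receipt."""
    audited = sum(
        r.receipt_created and r.receipt_verified and r.tamper_detected
        for r in records
    )
    return audited / len(records)


# 20 pilot tasks: every task failed, yet every task was fully audited.
pilot = [
    TaskRecord(f"task-{i:02d}", passed=False, receipt_created=True,
               receipt_verified=True, tamper_detected=True)
    for i in range(1, 21)
]

print(f"accuracy: {accuracy(pilot):.0%}")
print(f"auditability coverage: {auditability_coverage(pilot):.0%}")
```

Because neither function reads the other's inputs, a model can score 0% on accuracy while the audit trail still reaches full coverage, which is the situation the pilot reports.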

View SWE-bench Results

Independent Technical Verification

This section reviews the technical claims in the Rational AI vs. LLM infographic. Each claim is evaluated against observed TrustableClaw architecture and runtime behavior, with verdicts showing which claims are backed, nuanced, or unsupported.
