700 deterministic dry-run tasks; majority baseline 10.0%.
Artifact-backed benchmark control.
A dedicated benchmark page should be boring in the best way: every number is derived from committed result artifacts, every source is named, and CI status is separated from measured benchmark claims.
Metrics below are refreshed from the `results/*.json` artifacts on the public `main` branch when those are reachable; otherwise the page falls back to the generated frontend snapshot, which is covered by tests. The CI strip polls public GitHub Actions status every 60 seconds, so readers can distinguish benchmark evidence from the state of the latest test run.
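The live-then-snapshot behaviour described above can be sketched as follows. This is a minimal illustration, not the page's actual loader: `load_metrics`, `fetch_live`, and the `accuracy` field name are hypothetical, and the fetcher is passed in as a callable so the fallback path can be exercised without network access.

```python
import json
from typing import Callable, Tuple


def load_metrics(fetch_live: Callable[[], str], snapshot_text: str) -> Tuple[dict, str]:
    """Return (metrics, source), preferring the live artifact over the snapshot."""
    try:
        return json.loads(fetch_live()), "live"
    except (OSError, ValueError):
        # OSError covers network failures (urllib.error.URLError subclasses it);
        # ValueError covers malformed JSON (json.JSONDecodeError subclasses it).
        return json.loads(snapshot_text), "snapshot"


def unreachable() -> str:
    """Hypothetical fetcher that simulates an unreachable `main` branch."""
    raise OSError("network unreachable")


# Hypothetical snapshot payload; the real artifact schema is not shown on this page.
SNAPSHOT = '{"accuracy": 0.9}'

metrics, source = load_metrics(unreachable, SNAPSHOT)
print(source, metrics["accuracy"])
```

Returning the source label alongside the metrics lets a page state whether a number came from a live artifact or from the snapshot.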
Accuracy 90.0%; critical intercept 100.0%.
22/25 accepted at 23.2% coverage; lift +41.7 pp.
98/544 accepted; threshold 0.1972. In-sample calibration warning applies.
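For readers unfamiliar with selective-trust summaries, the sketch below shows one common way that accepted counts, coverage, and lift relate to a threshold. The definitions are assumptions, not taken from the artifacts: an item is treated as accepted when its score is at or above the threshold, and lift is the accepted-subset accuracy minus overall accuracy in percentage points. The scores and labels are toy values, not benchmark data.

```python
from typing import Dict, Sequence


def selective_summary(
    scores: Sequence[float], correct: Sequence[bool], threshold: float
) -> Dict[str, float]:
    """Summarize a threshold-based accept/reject rule (assumed definitions)."""
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    n = len(scores)
    coverage = len(accepted) / n
    base_accuracy = sum(correct) / n
    # Accuracy on the accepted subset; undefined when nothing is accepted.
    accepted_accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return {
        "accepted": len(accepted),
        "coverage": coverage,
        "accepted_accuracy": accepted_accuracy,
        "lift_pp": 100.0 * (accepted_accuracy - base_accuracy),
    }


# Toy inputs for illustration only.
summary = selective_summary(
    scores=[0.9, 0.8, 0.3, 0.2],
    correct=[True, True, False, True],
    threshold=0.5,
)
print(summary)
```

When the threshold is tuned on the same data that is then scored, the resulting lift is optimistic, which is why an in-sample warning is attached to one of the summaries above.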
This table is simulator-scoped. It is useful because it checks whether REMORA's policy gate can reduce unsafe dry-run execution while preserving task utility, but it is not a live production result.
| Strategy | Unsafe rate | Utility | Accuracy | False accept | Critical intercept |
|---|---|---|---|---|---|
| Single model heuristic | 20.0% | -0.25 | 20.0% | 30.0% | 100.0% |
| Majority vote heuristic | 10.0% | 0.00 | 30.0% | 10.0% | 100.0% |
| Self-consistency heuristic | 10.0% | 0.00 | 30.0% | 10.0% | 100.0% |
| Verifier heuristic | 20.0% | -0.25 | 20.0% | 30.0% | 100.0% |
| REMORA temperature gate | 10.0% | 0.27 | 70.0% | 10.0% | 100.0% |
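| REMORA full policy gate | 0.0% | 0.62 | 90.0% | 0.0% | 100.0% |

Each artifact behind these numbers has a stated scope, a command that regenerates it, and a test that checks it.

| Artifact | Scope | Regenerate | Test |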
|---|---|---|---|
| Tool-call v2 safety | Deterministic simulator; no live destructive execution. | `python experiments/evaluate_toolcall_benchmark_v2.py` | `python -m pytest tests/test_toolcall_v2_results.py -q` |
| N500 held-out selective trust | Held-out split with tau* locked from training. | `python scripts/selective_n500_holdout.py` | `python -m pytest tests/test_selective_n500.py -q` |
| N500 policy layer | In-sample policy artifact; warning must remain visible. | `python experiments/end_to_end_n500_v3.py` | `python -m pytest tests/test_end_to_end_n500_v3.py -q` |
| Conformal guardrail holdout | N302 split-conformal holdout; exchangeability caveat applies. | `python experiments/conformal_phase_guardrail.py` | `python -m pytest tests/test_conformal.py -q` |
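
The regenerate and test commands in the table above can be chained so that an artifact is never reported without its check passing. The following is a minimal sketch under that assumption; the `ARTIFACTS` mapping, `plan`, and `regenerate_and_test` are hypothetical helpers, and only the command strings are copied from the table.

```python
import shlex
import subprocess
from typing import Callable, Dict, List, Tuple

# Artifact name -> (regenerate command, test command), copied from the table.
ARTIFACTS: Dict[str, Tuple[str, str]] = {
    "Tool-call v2 safety": (
        "python experiments/evaluate_toolcall_benchmark_v2.py",
        "python -m pytest tests/test_toolcall_v2_results.py -q",
    ),
    "N500 held-out selective trust": (
        "python scripts/selective_n500_holdout.py",
        "python -m pytest tests/test_selective_n500.py -q",
    ),
    "N500 policy layer": (
        "python experiments/end_to_end_n500_v3.py",
        "python -m pytest tests/test_end_to_end_n500_v3.py -q",
    ),
    "Conformal guardrail holdout": (
        "python experiments/conformal_phase_guardrail.py",
        "python -m pytest tests/test_conformal.py -q",
    ),
}


def plan(name: str) -> List[List[str]]:
    """Return the argv lists for one artifact: regenerate first, then test."""
    regenerate, test = ARTIFACTS[name]
    return [shlex.split(regenerate), shlex.split(test)]


def regenerate_and_test(name: str, run: Callable = subprocess.run) -> None:
    """Run both steps in order; check=True raises CalledProcessError on failure."""
    for argv in plan(name):
        run(argv, check=True)


# Show the planned commands without executing them.
for argv in plan("Conformal guardrail holdout"):
    print(" ".join(argv))
```

Calling `regenerate_and_test("Tool-call v2 safety")` from the repository root would run both steps and stop at the first failure. The `run` parameter exists only so the sequencing can be checked with a stand-in runner.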