REMORA benchmarks

Artifact-backed benchmark control.

A dedicated benchmark page should be boring in the best way: every number is derived from committed result artifacts, every source is named, and CI status is separated from measured benchmark claims.

Artifact source: loading GitHub artifacts / 98c531eadb88
Latest quality gate: waiting

Metrics below are refreshed from the public `main` branch `results/*.json` artifacts when reachable. The generated frontend snapshot is the tested fallback. The CI strip polls public GitHub Actions status every 60 seconds, so readers can distinguish benchmark evidence from the latest test run state.
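
The refresh-with-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the page's actual loader: `load_metrics`, the injected `fetch_remote` callable, and the `unsafe_rate` key are hypothetical names chosen for the example.

```python
import json
from typing import Callable


def load_metrics(fetch_remote: Callable[[], str], snapshot: dict) -> tuple[dict, str]:
    """Return (metrics, source), preferring the remote artifact.

    fetch_remote is a hypothetical callable that returns the raw JSON text
    of a results artifact. If it raises a network error (OSError) or the
    payload is not valid JSON (ValueError), the bundled snapshot is used.
    """
    try:
        return json.loads(fetch_remote()), "remote"
    except (OSError, ValueError):
        return snapshot, "snapshot"


def unreachable() -> str:
    # Simulates the artifact host being unreachable.
    raise OSError("network unreachable")


# Hypothetical snapshot content; the real artifact schema is not shown on this page.
snapshot = {"unsafe_rate": 0.0}
metrics, source = load_metrics(unreachable, snapshot)
print(source, metrics)
```

Returning the source label alongside the metrics lets the page say which path produced the numbers, which is the distinction the paragraph above draws.
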

01 Headline metrics

| Metric | Scope | Value | Detail | Source |
|---|---|---|---|---|
| Tool-call v2 unsafe rate | SIMULATOR | 0.0% | 700 deterministic dry-run tasks; majority baseline 10.0%. | results/toolcall_benchmark_v2_results.json |
| Tool-call v2 utility | SIMULATOR | 0.62 | Accuracy 90.0%; critical intercept 100.0%. | results/toolcall_benchmark_v2_results.json |
| N500 held-out accuracy | HOLDOUT | 88.0% | 22/25 accepted at 23.2% coverage; lift +41.7 pp. | results/selective_n500_holdout_results.json |
| N500 policy accept slice | IN-SAMPLE | 88.8% | 98/544 accepted; threshold 0.1972. In-sample calibration warning applies. | results/end_to_end_n500_v3.json |

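
The held-out headline figures can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the page does not state: that the lift is measured against an unselected reference accuracy on the same split, and that coverage is the accepted count divided by the holdout size.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as displayed on the page."""
    return round(100 * numerator / denominator, 1)


# 22 correct out of 25 accepted held-out items.
holdout_accuracy = pct(22, 25)  # 88.0

# Assumption: lift = accepted accuracy minus an unselected reference accuracy.
implied_reference = round(holdout_accuracy - 41.7, 1)  # 46.3

# Assumption: coverage = accepted / holdout size, so size is about 25 / 0.232.
implied_holdout_size = round(25 / 0.232)  # 108

print(holdout_accuracy, implied_reference, implied_holdout_size)
```
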
02 Tool-call v2 safety and utility

This table is simulator-scoped. It is useful because it checks whether REMORA's policy gate can reduce unsafe dry-run execution while preserving task utility, but it is not a live production result.

| Strategy | Unsafe rate | Utility | Accuracy | False accept | Critical intercept |
|---|---|---|---|---|---|
| Single model heuristic | 20.0% | -0.25 | 20.0% | 30.0% | 100.0% |
| Majority vote heuristic | 10.0% | 0.00 | 30.0% | 10.0% | 100.0% |
| Self-consistency heuristic | 10.0% | 0.00 | 30.0% | 10.0% | 100.0% |
| Verifier heuristic | 20.0% | -0.25 | 20.0% | 30.0% | 100.0% |
| REMORA temperature gate | 10.0% | 0.27 | 70.0% | 10.0% | 100.0% |
| REMORA full policy gate | 0.0% | 0.62 | 90.0% | 0.0% | 100.0% |

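
To compare the strategies programmatically, the rows can be ranked by unsafe rate first and utility second. This ordering is an illustrative choice for the example, not a criterion stated by the benchmark; the `Row` type and `rank` helper are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    strategy: str
    unsafe_rate: float  # percent
    utility: float
    accuracy: float  # percent
    false_accept: float  # percent
    critical_intercept: float  # percent


# Values copied from the simulator-scoped table.
ROWS = [
    Row("Single model heuristic", 20.0, -0.25, 20.0, 30.0, 100.0),
    Row("Majority vote heuristic", 10.0, 0.00, 30.0, 10.0, 100.0),
    Row("Self-consistency heuristic", 10.0, 0.00, 30.0, 10.0, 100.0),
    Row("Verifier heuristic", 20.0, -0.25, 20.0, 30.0, 100.0),
    Row("REMORA temperature gate", 10.0, 0.27, 70.0, 10.0, 100.0),
    Row("REMORA full policy gate", 0.0, 0.62, 90.0, 0.0, 100.0),
]


def rank(rows: list[Row]) -> list[Row]:
    """Sort by lowest unsafe rate, breaking ties by highest utility."""
    return sorted(rows, key=lambda r: (r.unsafe_rate, -r.utility))


ranked = rank(ROWS)
for row in ranked:
    print(f"{row.strategy}: unsafe {row.unsafe_rate}%, utility {row.utility}")
```

Under this ordering the full policy gate ranks first and the temperature gate second, since it matches the voting heuristics on unsafe rate but has higher utility.
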
03 N500 policy distribution

| Action | Count | Share | Accuracy | Risk |
|---|---|---|---|---|
| ACCEPT | 98 | 18.0% | 88.8% | 11.2% |
| VERIFY | 32 | 5.9% | 62.5% | 37.5% |
| ABSTAIN | 414 | 76.1% | 28.3% | 71.7% |
| ESCALATE | 0 | 0.0% | n/a | n/a |

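
The shares in this distribution follow directly from the counts, and the per-action accuracies imply a pooled accuracy across all 544 items. The sketch below recomputes both; because it starts from the rounded percentages shown on the page, the pooled figure is approximate.

```python
# Counts and accuracies copied from the policy distribution.
counts = {"ACCEPT": 98, "VERIFY": 32, "ABSTAIN": 414, "ESCALATE": 0}
accuracy = {"ACCEPT": 0.888, "VERIFY": 0.625, "ABSTAIN": 0.283}  # ESCALATE is n/a

total = sum(counts.values())

# Share of items routed to each action, in percent with one decimal place.
shares = {action: round(100 * n / total, 1) for action, n in counts.items()}

# Count-weighted accuracy over the actions that received items.
pooled_accuracy = sum(counts[a] * accuracy[a] for a in accuracy) / total

print(total, shares, round(100 * pooled_accuracy, 1))
```

The gap between the ACCEPT accuracy and the pooled figure is what the policy buys in-sample, at the cost of accepting fewer than one item in five.
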
04 Conformal holdout

| Target | Risk | Coverage | Accepted | Upper bound | Threshold |
|---|---|---|---|---|---|
| 10% | 10.0% | 24.8% | 30 | missed | 0.1573 |
| 15% | 13.4% | 67.8% | 82 | met | 0.0094 |

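
The risk and coverage figures are the kind of summary a selective-acceptance report produces at a fixed threshold. The sketch below shows that computation on toy data. It assumes an item is accepted when its score is at or above the threshold, which is inferred from the lower threshold yielding higher coverage; the real scoring rule is not shown here. The `selective_report` helper and the data are hypothetical, and the "upper bound" check is not reproduced because its definition does not appear on this page.

```python
def selective_report(scores: list[float], correct: list[bool], threshold: float) -> dict:
    """Empirical coverage and risk among items with score >= threshold."""
    accepted = [ok for score, ok in zip(scores, correct) if score >= threshold]
    n_accepted = len(accepted)
    errors = n_accepted - sum(accepted)
    return {
        "accepted": n_accepted,
        "coverage": n_accepted / len(scores),
        # Risk is the error rate among accepted items; undefined when none are accepted.
        "risk": errors / n_accepted if n_accepted else None,
    }


# Toy holdout: six items with made-up scores and correctness labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
correct = [True, True, True, False, True, False]

strict = selective_report(scores, correct, threshold=0.5)
loose = selective_report(scores, correct, threshold=0.15)
print(strict)
print(loose)
```

As in the two targets reported here, lowering the threshold raises coverage and also raises the risk among accepted items.
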
05 Evidence sources and commands

| Artifact | Scope | Regenerate | Test |
|---|---|---|---|
| Tool-call v2 safety | Deterministic simulator; no live destructive execution. | `python experiments/evaluate_toolcall_benchmark_v2.py` | `python -m pytest tests/test_toolcall_v2_results.py -q` |
| N500 held-out selective trust | Held-out split with tau* locked from training. | `python scripts/selective_n500_holdout.py` | `python -m pytest tests/test_selective_n500.py -q` |
| N500 policy layer | In-sample policy artifact; warning must remain visible. | `python experiments/end_to_end_n500_v3.py` | `python -m pytest tests/test_end_to_end_n500_v3.py -q` |
| Conformal guardrail holdout | N302 split-conformal holdout; exchangeability caveat applies. | `python experiments/conformal_phase_guardrail.py` | `python -m pytest tests/test_conformal.py -q` |

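
The regenerate and test commands listed above can be chained so that each artifact is rebuilt and then verified in one pass. The following is a rough sketch, assuming it is run from the repository root; `run_all` and the injectable `run` callable are hypothetical helpers, and the example invocation uses a dry-run runner that only prints the commands.

```python
import shlex
import subprocess
from typing import Callable, Optional, Sequence

# (artifact, regenerate command, test command), copied from the commands above.
PIPELINE = [
    ("Tool-call v2 safety",
     "python experiments/evaluate_toolcall_benchmark_v2.py",
     "python -m pytest tests/test_toolcall_v2_results.py -q"),
    ("N500 held-out selective trust",
     "python scripts/selective_n500_holdout.py",
     "python -m pytest tests/test_selective_n500.py -q"),
    ("N500 policy layer",
     "python experiments/end_to_end_n500_v3.py",
     "python -m pytest tests/test_end_to_end_n500_v3.py -q"),
    ("Conformal guardrail holdout",
     "python experiments/conformal_phase_guardrail.py",
     "python -m pytest tests/test_conformal.py -q"),
]


def real_run(argv: Sequence[str]) -> int:
    """Execute a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_all(run: Optional[Callable[[Sequence[str]], int]] = None) -> list[str]:
    """Regenerate then test each artifact; return the steps that failed.

    If regeneration fails for an artifact, its test step is skipped.
    """
    run = run or real_run
    failed = []
    for name, regenerate, test in PIPELINE:
        for command in (regenerate, test):
            if run(shlex.split(command)) != 0:
                failed.append(f"{name}: {command}")
                break
    return failed


def dry_run(argv: Sequence[str]) -> int:
    # Prints the command instead of executing it.
    print(" ".join(argv))
    return 0


failed = run_all(dry_run)
print("failed steps:", failed)
```

Calling `run_all()` with no argument would execute the commands for real; the dry-run runner is used here so the example has no side effects.
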
06 Claim boundaries
Dashboard metrics are derived from a generated frontend snapshot of committed JSON artifacts, not hand-entered README values.
Tool-call safety numbers are simulator-scoped and are not deployment certification.
The live CI panel reports workflow status; it does not create new benchmark measurements.
For Cloudflare deployment, rerun benchmark scripts and rebuild the frontend after artifact changes.