REMORA benchmarks

Artifact-backed benchmark control.

A dedicated benchmark page should be boring in the best way: every number is derived from committed result artifacts, every source is named, and CI status is separated from measured benchmark claims.

Artifact source

loading GitHub artifacts / 98c531eadb88

Metrics below are refreshed from the public `main` branch `results/*.json` artifacts when reachable. The generated frontend snapshot is the tested fallback. The CI strip polls public GitHub Actions status every 60 seconds, so readers can distinguish benchmark evidence from the latest test run state.
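The refresh-then-fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: `load_metrics`, `fetch_remote`, the source labels, and the `unsafe_rate` key are hypothetical names chosen for the example, and the real frontend may well be written in another language.

```python
import json
from typing import Callable, Tuple


def load_metrics(fetch_remote: Callable[[], str], snapshot: dict) -> Tuple[dict, str]:
    """Return (metrics, source), preferring the live artifact over the snapshot.

    fetch_remote is a hypothetical callable that returns the raw JSON text of
    one results/*.json artifact, or raises if the artifact is unreachable.
    """
    try:
        return json.loads(fetch_remote()), "live-artifact"
    except (OSError, ValueError):
        # Network failure (OSError) or malformed JSON (ValueError):
        # fall back to the tested frontend snapshot.
        return snapshot, "frontend-snapshot"


def unreachable() -> str:
    # Stand-in for a failed download of the public artifact.
    raise OSError("artifact host not reachable")


# Hypothetical snapshot content, used only to exercise the fallback path.
metrics, source = load_metrics(unreachable, {"unsafe_rate": 0.0})
print(source)
```

Returning the source label alongside the data lets a page show which path produced each number, which matches the goal of naming every source.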

01 Headline metrics

Tool-call v2 unsafe rate

SIMULATOR

0.0%

700 deterministic dry-run tasks; majority-vote baseline unsafe rate 10.0%.

results/toolcall_benchmark_v2_results.json

Tool-call v2 utility

SIMULATOR

0.62

Accuracy 90.0%; critical intercept 100.0%.

results/toolcall_benchmark_v2_results.json

N500 held-out accuracy

HOLDOUT

88.0%

22 of 25 accepted items correct at 23.2% coverage; lift +41.7 pp.

results/selective_n500_holdout_results.json

N500 policy accept slice

IN-SAMPLE

88.8%

98 of 544 items accepted; threshold 0.1972. The in-sample calibration warning applies.

results/end_to_end_n500_v3.json
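The N500 held-out card above can be sanity-checked with a few lines of arithmetic. The sketch below only re-derives quantities from the figures on the card. The implied baseline and implied held-out size are not reported in the source; they follow from two assumptions flagged in the comments and should be treated as rough cross-checks rather than measured values.

```python
# Figures copied from the N500 held-out card.
accepted = 25
correct = 22
coverage = 0.232   # fraction of held-out items that were accepted
lift_pp = 41.7     # reported lift, in percentage points

# Accuracy on the accepted slice.
accuracy_pct = 100 * correct / accepted

# Assumption: lift = accepted-slice accuracy minus some baseline accuracy.
implied_baseline_pct = accuracy_pct - lift_pp

# Assumption: coverage = accepted / held-out size.
implied_holdout_size = accepted / coverage

print(round(accuracy_pct, 1))          # 88.0
print(round(implied_baseline_pct, 1))  # 46.3
print(round(implied_holdout_size))     # 108
```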

02 Tool-call v2 safety and utility

This table is simulator-scoped. It is useful because it checks whether REMORA's policy gate can reduce unsafe dry-run execution while preserving task utility, but it is not a live production result.

Strategy	Unsafe rate	Utility	Accuracy	False accept	Critical intercept
Single model heuristic	20.0%	-0.25	20.0%	30.0%	100.0%
Majority vote heuristic	10.0%	0.00	30.0%	10.0%	100.0%
Self-consistency heuristic	10.0%	0.00	30.0%	10.0%	100.0%
Verifier heuristic	20.0%	-0.25	20.0%	30.0%	100.0%
REMORA temperature gate	10.0%	0.27	70.0%	10.0%	100.0%
REMORA full policy gate	0.0%	0.62	90.0%	0.0%	100.0%
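One way to read the table is to ask whether the full policy gate is at least as good as every other strategy on each column at once. The sketch below copies three columns from the table and checks that with a simple dominance test. The `dominates` helper is a hypothetical name for this example, not part of REMORA, and how the utility score itself is computed is not stated here, so it is treated as an opaque number.

```python
# Rows copied from the simulator table: (unsafe rate %, utility, accuracy %).
ROWS = {
    "Single model heuristic": (20.0, -0.25, 20.0),
    "Majority vote heuristic": (10.0, 0.00, 30.0),
    "Self-consistency heuristic": (10.0, 0.00, 30.0),
    "Verifier heuristic": (20.0, -0.25, 20.0),
    "REMORA temperature gate": (10.0, 0.27, 70.0),
    "REMORA full policy gate": (0.0, 0.62, 90.0),
}


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Lower is better for unsafe rate; higher is better for utility and accuracy.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


full = ROWS["REMORA full policy gate"]
others = [name for name in ROWS if name != "REMORA full policy gate"]
dominated = [name for name in others if dominates(full, ROWS[name])]

# Change relative to the majority-vote baseline.
majority = ROWS["Majority vote heuristic"]
unsafe_delta_pp = full[0] - majority[0]
utility_delta = round(full[1] - majority[1], 2)

print(len(dominated), unsafe_delta_pp, utility_delta)  # 5 -10.0 0.62
```

The check stays within the simulator scope of the table: it compares rows and makes no claim about live behaviour.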

03 N500 policy distribution

Action	Count	Share	Accuracy	Risk
ACCEPT	98	18.0%	88.8%	11.2%
VERIFY	32	5.9%	62.5%	37.5%
ABSTAIN	414	76.1%	28.3%	71.7%
ESCALATE	0	0.0%	n/a	n/a
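The shares and risks in this distribution follow directly from the counts and accuracies, which makes them easy to re-check. In the sketch below, the ACCEPT label for the 98-item group is an inference from the "98 of 544 accepted" headline figure, and risk is assumed to be 100% minus accuracy, which is consistent with every pair shown.

```python
# Counts and accuracies copied from the policy distribution.
# Assumption: the 98-item group is the ACCEPT action.
counts = {"ACCEPT": 98, "VERIFY": 32, "ABSTAIN": 414, "ESCALATE": 0}
accuracy_pct = {"ACCEPT": 88.8, "VERIFY": 62.5, "ABSTAIN": 28.3, "ESCALATE": None}

total = sum(counts.values())

# Share of all items routed to each action, in percent.
shares = {action: round(100 * n / total, 1) for action, n in counts.items()}

# Assumption: risk = 100% - accuracy; undefined where no items were routed.
risk_pct = {
    action: None if acc is None else round(100 - acc, 1)
    for action, acc in accuracy_pct.items()
}

print(total)     # 544
print(shares)    # {'ACCEPT': 18.0, 'VERIFY': 5.9, 'ABSTAIN': 76.1, 'ESCALATE': 0.0}
print(risk_pct)  # {'ACCEPT': 11.2, 'VERIFY': 37.5, 'ABSTAIN': 71.7, 'ESCALATE': None}
```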

04 Conformal holdout

Target	Risk	Coverage	Upper bound	Threshold
10%	10.0%	24.8%	missed	0.1573
15%	13.4%	67.8%	met	0.0094
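A risk target can be reported as missed even when the observed risk sits at the target, because the check is made against an upper confidence bound rather than the point estimate. The sketch below illustrates that idea with a Hoeffding bound. This is a swapped-in, generic bound chosen for simplicity: the bound used by `experiments/conformal_phase_guardrail.py` is not stated here and may differ. The function names, the `delta` parameter, and the counts in the usage lines are all hypothetical.

```python
import math


def hoeffding_upper_bound(errors: int, accepted: int, delta: float = 0.05) -> float:
    """Upper bound on the true risk that holds with probability >= 1 - delta.

    Assumes the accepted items are i.i.d. draws (a stronger condition than
    exchangeability); illustrative only.
    """
    if accepted <= 0:
        raise ValueError("need at least one accepted item")
    empirical_risk = errors / accepted
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * accepted))
    return min(1.0, empirical_risk + slack)


def target_status(errors: int, accepted: int, target: float, delta: float = 0.05) -> str:
    """Report 'met' only if the upper bound, not the point estimate, is within target."""
    bound = hoeffding_upper_bound(errors, accepted, delta)
    return "met" if bound <= target else "missed"


# Hypothetical counts: the same 10% empirical risk at two sample sizes.
print(target_status(errors=20, accepted=200, target=0.15))    # missed
print(target_status(errors=200, accepted=2000, target=0.15))  # met
```

With few accepted items the slack term dominates, so a target can be missed at low coverage even when the point estimate looks acceptable.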

05 Evidence sources and commands

Artifact	Scope	Regenerate	Test
Tool-call v2 safety	Deterministic simulator; no live destructive execution.	python experiments/evaluate_toolcall_benchmark_v2.py	python -m pytest tests/test_toolcall_v2_results.py -q
N500 held-out selective trust	Held-out split with tau* locked from training.	python scripts/selective_n500_holdout.py	python -m pytest tests/test_selective_n500.py -q
N500 policy layer	In-sample policy artifact; warning must remain visible.	python experiments/end_to_end_n500_v3.py	python -m pytest tests/test_end_to_end_n500_v3.py -q
Conformal guardrail holdout	N302 split-conformal holdout; exchangeability caveat applies.	python experiments/conformal_phase_guardrail.py	python -m pytest tests/test_conformal.py -q
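The regenerate and test commands in the table pair up naturally, so they can be driven from one small helper. The following is a sketch under stated assumptions: `COMMANDS` and `regenerate_and_test` are hypothetical names, the command strings are copied from the table, and the helper defaults to a dry run so that nothing executes unless explicitly requested.

```python
import shlex
import subprocess
from typing import List

# (regenerate command, test command) per artifact, copied from the table.
COMMANDS = {
    "Tool-call v2 safety": (
        "python experiments/evaluate_toolcall_benchmark_v2.py",
        "python -m pytest tests/test_toolcall_v2_results.py -q",
    ),
    "N500 held-out selective trust": (
        "python scripts/selective_n500_holdout.py",
        "python -m pytest tests/test_selective_n500.py -q",
    ),
    "N500 policy layer": (
        "python experiments/end_to_end_n500_v3.py",
        "python -m pytest tests/test_end_to_end_n500_v3.py -q",
    ),
    "Conformal guardrail holdout": (
        "python experiments/conformal_phase_guardrail.py",
        "python -m pytest tests/test_conformal.py -q",
    ),
}


def regenerate_and_test(artifact: str, dry_run: bool = True) -> List[List[str]]:
    """Return the argv lists for one artifact; run them in order unless dry_run."""
    argvs = [shlex.split(command) for command in COMMANDS[artifact]]
    if not dry_run:
        for argv in argvs:
            # check=True stops at the first failing step, so the tests
            # never run against a half-regenerated artifact.
            subprocess.run(argv, check=True)
    return argvs


for argv in regenerate_and_test("Conformal guardrail holdout"):
    print(" ".join(argv))
```

Running the regenerate step before the test step keeps the tests pointed at the artifact that was just produced.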

06 Claim boundaries

Dashboard metrics are derived from a generated frontend snapshot of committed JSON artifacts, not hand-entered README values.

Tool-call safety numbers are simulator-scoped and are not deployment certification.

The live CI panel reports workflow status; it does not create new benchmark measurements.

For Cloudflare deployment, rerun benchmark scripts and rebuild the frontend after artifact changes.