agent calibration benchmark
Your agent works fine in eval. Then you ship it under real stakes and it drifts — invents rules, misstates bankroll, steelmans away edges it was supposed to act on. You have no reference signal to catch that early because nobody else runs a Claude agent under real money with the reasoning traces published.
We do. Atlas (survival-framed) and Mirror (neutral control) trade identical Kalshi markets 24/7 on opposite prompt framings. Every trade, every Brier score, every invented-rule event, every paired-framing divergence lives in a JSON feed. Point your agent at the same markets, pull our numbers, see where you drift before your users do.
how to benchmark
The feed is designed so you can diff your agent's behavior against Atlas / Mirror on the same markets, on the same day, in under an hour.
step 1 — pull our numbers
Grab the last 7 days of trades + scorecards. Atlas is agent id 1, Mirror is id 2.
curl -H "Authorization: Bearer $DVDSHN_KEY" \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi&limit=200'
step 2 — run your agent on the same tickers
Feed your agent the exact Kalshi tickers from step 1 (the kalshiTicker field). Record your predicted fair probability per market. Don't look at our entry price first: blind comparison is the point.
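A minimal sketch of the blind setup in step 2, assuming the step 1 response carries a pmb_trades list whose rows include kalshiTicker, as in the abbreviated response on this page. The helper names blind_tickers and record_forecast are hypothetical, not part of the feed.

```python
def blind_tickers(payload):
    """Return the unique Kalshi tickers from a feed payload, sorted.

    Only the ticker survives, so entry prices and P&L never reach your agent.
    """
    trades = payload.get("pmb_trades", [])
    return sorted({t["kalshiTicker"] for t in trades if t.get("kalshiTicker")})


def record_forecast(forecasts, ticker, probability):
    """Store one predicted fair probability (0..1) for a ticker."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range for {ticker}: {probability}")
    forecasts[ticker] = probability
    return forecasts


# Illustrative payload: both agents traded the same market.
sample = {
    "pmb_trades": [
        {"agentId": 1, "kalshiTicker": "KXHIGHTSFO-26APR15-B61.5"},
        {"agentId": 2, "kalshiTicker": "KXHIGHTSFO-26APR15-B61.5"},
    ]
}
tickers = blind_tickers(sample)

# 0.22 stands in for whatever your own agent predicts; it is a made-up value.
forecasts = record_forecast({}, tickers[0], 0.22)
```

Deduplicating by ticker matters because Atlas and Mirror trade the same markets, so each market would otherwise be forecast twice.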
step 3 — diff the scorecards
Compute your Brier score on the resolved markets. Pull ours from the same window.
# Our 7d scorecards: Atlas (survival) vs Mirror (neutral)
curl -H "Authorization: Bearer $DVDSHN_KEY" \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi' \
  | jq '.agent_scorecards[] | {agent: .agentName, brier: .brierScore, steelman: .steelmanRate, skip: .skipRate, resolved: .resolvedTradesCount}'
Read: pmb_alignment_divergence rows for our parallel signals.
What this replaces: internal evals that only test your agent against static fixtures. Real-stakes calibration drift is invisible in benchmarks you wrote yourself; you need an independent signal running under live money pressure.
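For the Brier score in step 3, here is a minimal sketch using the standard definition: the mean squared difference between your predicted probability and the 0/1 outcome, where lower is better. It assumes you have collected resolutions yourself as a ticker-to-outcome mapping. Whether the feed's brierScore is computed exactly this way is an assumption. The tickers and numbers below are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    forecasts: {ticker: predicted probability that the market resolves YES}
    outcomes:  {ticker: 1 if it resolved YES, 0 if it resolved NO}
    Only tickers present in both mappings (resolved markets) are scored.
    """
    scored = [t for t in forecasts if t in outcomes]
    if not scored:
        raise ValueError("no resolved markets to score")
    return sum((forecasts[t] - outcomes[t]) ** 2 for t in scored) / len(scored)


# Hypothetical tickers and values, for illustration only.
forecasts = {"MKT-A": 0.8, "MKT-B": 0.3, "MKT-C": 0.5}
outcomes = {"MKT-A": 1, "MKT-B": 0}  # MKT-C has not resolved yet
score = brier_score(forecasts, outcomes)  # roughly 0.065
```

Score only the markets that appear in both your forecasts and the resolved set, so your number and the published one cover the same window.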
experiment_actions
Every Hearth / Atlas / Mirror action. Ship, email, buy, refund, post, pivot, pause, observe — timestamped, with the 280-char summary + optional markdown detail + cost + outcome.
pmb_trades
Every Kalshi trade — ticker, side, size, entry/exit, realized P&L, open + close timestamps, agent id (1 = Atlas / survival framing, 2 = Mirror / neutral control), mode (paper vs live), and the wake log id that produced it so you can join back to the reasoning.
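A rough sketch of working with pmb_trades rows, assuming the field names shown in the abbreviated response (agentId, kalshiTicker, pnlUsd as a string). It finds markets that both agents traded and sums P&L without float rounding. The function names and sample rows are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

ATLAS, MIRROR = 1, 2  # agent ids as documented for pmb_trades


def paired_tickers(trades):
    """Map ticker -> {agentId: [trades]} for markets both agents traded."""
    by_ticker = defaultdict(lambda: defaultdict(list))
    for trade in trades:
        by_ticker[trade["kalshiTicker"]][trade["agentId"]].append(trade)
    return {
        ticker: dict(agents)
        for ticker, agents in by_ticker.items()
        if ATLAS in agents and MIRROR in agents
    }


def total_pnl(trades):
    """Sum pnlUsd exactly; the sample response serializes it as a string."""
    return sum((Decimal(t["pnlUsd"]) for t in trades), Decimal("0"))


# Hypothetical rows, for illustration only.
sample = [
    {"agentId": 1, "kalshiTicker": "MKT-A", "pnlUsd": "-1.50"},
    {"agentId": 2, "kalshiTicker": "MKT-A", "pnlUsd": "2.25"},
    {"agentId": 1, "kalshiTicker": "MKT-B", "pnlUsd": "0.75"},
]
pairs = paired_tickers(sample)
net = total_pnl(sample)
```

Restricting to paired markets keeps a framing comparison like-for-like: a market only one agent entered tells you about selection, not about how the two framings priced the same question.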
experiment_lessons
Distilled post-mortems the agents write after losses or broken deploys. One-line title, 2–5 sentence body, trigger event, source. These are the behavioral updates the agents feed back into their own prompts — the Reflexion loop in raw form.
pmb_alignment_divergence [research]
Nightly divergence scan comparing Atlas (survival-framed) against Mirror (neutral control) on shared market exposure. Counts invented-rule events, bankroll misstatements, self-preservation phrase leaks, and paired-wake decision disagreements. The measurement instrument behind the agentic-misalignment hypothesis — not published anywhere else in raw form.
agent_scorecards [research]
Rolling 7-day process-quality scorecards per agent. Brier score (calibration), steelman-language rate, skip/restraint rate, average cited edge points, plus outcome context (PnL, win rate). Process-based, not outcome-based — the metrics designed to resist Goodharting. Diff weeks to track calibration drift.
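One way to diff weeks is sketched below, assuming you have saved two agent_scorecards snapshots a week apart and that rows carry the agentName, brierScore, steelmanRate, and skipRate fields used in the jq query on this page. The function name and sample numbers are hypothetical.

```python
def scorecard_drift(previous, current,
                    metrics=("brierScore", "steelmanRate", "skipRate")):
    """Per-agent change in each metric between two scorecard snapshots.

    previous / current: lists of scorecard dicts, each with an agentName.
    Returns {agentName: {metric: current - previous}} for agents in both.
    """
    prev_by_agent = {card["agentName"]: card for card in previous}
    drift = {}
    for card in current:
        before = prev_by_agent.get(card["agentName"])
        if before is None:
            continue  # agent missing from the earlier snapshot
        drift[card["agentName"]] = {
            m: float(card[m]) - float(before[m])  # float() also accepts strings
            for m in metrics
            if m in card and m in before
        }
    return drift


# Hypothetical snapshots, for illustration only.
last_week = [{"agentName": "Atlas", "brierScore": 0.25, "skipRate": 0.5}]
this_week = [{"agentName": "Atlas", "brierScore": 0.5, "skipRate": 0.25}]
drift = scorecard_drift(last_week, this_week)
```

A positive brierScore delta means calibration got worse over the week, since lower Brier scores are better.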
Bearer auth. JSON response. Incremental pulls via ?since=.
curl -H 'Authorization: Bearer dvdshn_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi&limit=5'
Response (abbreviated):
{
"ok": true,
"as_of": "2026-04-16T18:21:04Z",
"authenticated_as": "apollo-research",
"tier": "read",
"experiments": ["kalshi"],
"experiment_actions": [...],
"experiment_lessons": [...],
"pmb_trades": [
{
"id": 8,
"agentId": 1,
"kalshiTicker": "KXHIGHTSFO-26APR15-B61.5",
"side": "yes",
"sizeUsd": "47.00",
"entryPriceCents": 26,
"exitPriceCents": 6,
"pnlUsd": "-57.69",
"result": "loss",
"openedAt": "2026-04-15T20:03:17Z",
"closedAt": "2026-04-15T23:59:58Z",
"mode": "paper",
"wakeLogId": 14
}
]
}
Python, incremental pull into pandas:
import os, requests, pandas as pd
r = requests.get(
"https://dvdshn.com/api/public/experiments",
params={"experiment": "kalshi", "limit": 500},
headers={"Authorization": f"Bearer {os.environ['DVDSHN_KEY']}"},
timeout=30,
)
r.raise_for_status()
payload = r.json()
trades = pd.DataFrame(payload["pmb_trades"])
trades["pnl_usd"] = trades["pnlUsd"].astype(float)
print(trades.groupby("agentId")["pnl_usd"].describe())
# Next pull: send this value as the since parameter to get only newer rows
cursor = payload["as_of"]
agent builders shipping to prod
You have evals but no reference for calibration drift under real stakes. Use Atlas/Mirror as a live baseline — same markets, published Brier + steelman rate + skip-rate. If your numbers diverge, you know where to look before your users do.
alignment researchers
Paired survival vs. neutral framings on identical markets, with nightly divergence scans. A live dataset for agentic-misalignment measurement — not a synthetic eval.
prediction-market traders
Calibrated fair probabilities + outcomes joined to ticker + size. Piggyback edges, benchmark your own Brier score, or just watch a Claude trader compound (or blow up) in public.
Each table has its own landing page with schema, sample query, and use cases. Single $29/mo subscription covers every dataset.
experiment — kalshi | embedproof (optional; default = both)
since — ISO 8601 timestamp; returns rows created after it
limit — 1..500 (default 100)
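The three query parameters above can be assembled as in the sketch below, which uses only the Python standard library. The build_params and fetch helpers are hypothetical names. Reusing the previous response's as_of as the next since value follows the pandas example on this page. The sketch does not cover server behavior beyond what is documented here, such as error bodies or rate limits.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://dvdshn.com/api/public/experiments"


def build_params(experiment=None, since=None, limit=100):
    """Validate and assemble query parameters for the experiments endpoint."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    params = {"limit": limit}
    if experiment is not None:  # omitted -> documented default (both)
        params["experiment"] = experiment
    if since is not None:  # ISO 8601 timestamp
        params["since"] = since
    return params


def fetch(api_key, **kwargs):
    """GET the feed with Bearer auth and return the decoded JSON payload."""
    query = urllib.parse.urlencode(build_params(**kwargs))
    request = urllib.request.Request(
        f"{BASE_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (needs a real key, so it is left commented out):
# first = fetch(os.environ["DVDSHN_KEY"], experiment="kalshi", limit=500)
# newer = fetch(os.environ["DVDSHN_KEY"], experiment="kalshi", since=first["as_of"])
```

Validating limit on the client avoids spending a request on a value outside the documented 1..500 range.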
Authorization: Bearer dvdshn_... header (preferred), or ?api_key= query param.
$29/mo. Key emailed on subscription. Cancel anytime.
Subscribe · $29/mo
Prefer email? david@dvdshn.com