agent calibration benchmark

Benchmark your Claude agent against two live Claude traders.

Your agent works fine in eval. Then you ship it under real stakes and it drifts — invents rules, misstates bankroll, steelmans away edges it was supposed to act on. You have no reference signal to catch that early because nobody else runs a Claude agent under real money with the reasoning traces published.

We do. Atlas (survival-framed) and Mirror (neutral control) trade identical Kalshi markets 24/7 on opposite prompt framings. Every trade, every Brier score, every invented-rule event, every paired-framing divergence lives in a JSON feed. Point your agent at the same markets, pull our numbers, see where you drift before your users do.

Subscribe · $29/mo · Try a sample, no signup · cancel anytime · key emailed instantly

how to benchmark

Three calls. Compare. Ship or fix.

The feed is designed so you can diff your agent's behavior against Atlas / Mirror on the same markets, on the same day, in under an hour.

step 1 — pull our numbers

Grab the last 7 days of trades + scorecards. Atlas is agent id 1, Mirror is id 2.

curl -H "Authorization: Bearer $DVDSHN_KEY" \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi&limit=200'
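
The request above caps the row count but does not bound the time window. A minimal sketch of building a 7-day window, assuming the documented `since` param is the way to restrict the pull and that it accepts the same `...Z` timestamp format the response uses for `as_of`. The helper name `window_params` is hypothetical, not part of the API.

```python
from datetime import datetime, timedelta, timezone


def window_params(now: datetime, days: int = 7, limit: int = 200) -> dict:
    """Build query params for a pull covering the last `days` days.

    Hypothetical helper: only `experiment`, `since`, and `limit` come from
    the endpoint docs; the 7-day default mirrors the scorecard window.
    """
    if not 1 <= limit <= 500:
        raise ValueError("limit must be in 1..500")
    start = now.astimezone(timezone.utc) - timedelta(days=days)
    # Format as ISO 8601 with a trailing Z, matching the sample `as_of` value.
    since = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"experiment": "kalshi", "since": since, "limit": limit}
```

Pass the result as `params=` to `requests.get`, with the same Bearer header as the curl call.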

step 2 — run your agent on the same tickers

Feed your agent the exact Kalshi tickers from step 1 (the kalshiTicker field). Record your predicted fair probability per market. Don't look at our entry price first; blind comparison is the point.

step 3 — diff the scorecards

Compute your Brier score on the resolved markets. Pull ours from the same window.
```
# Our 7d scorecards — Atlas (survival) vs Mirror (neutral)
curl -H "Authorization: Bearer $DVDSHN_KEY" \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi' \
  | jq '.agent_scorecards[] | {agent: .agentName, brier: .brierScore, steelman: .steelmanRate, skip: .skipRate, resolved: .resolvedTradesCount}'
```
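
Your side of the comparison can be sketched as follows. This uses the standard binary Brier score (mean squared error between your predicted probability of "yes" and the 0/1 resolution). That the feed's `brierScore` is computed the same way is an assumption, and the function name and the ticker-keyed dict layout are illustrative.

```python
def brier_score(predictions: dict[str, float], outcomes: dict[str, int]) -> float:
    """Standard binary Brier score over markets that have resolved.

    predictions: ticker -> your predicted P(yes), in [0, 1]
    outcomes:    ticker -> 1 if the market resolved yes, else 0
    Tickers you predicted that have not resolved yet are skipped.
    """
    scored = [t for t in predictions if t in outcomes]
    if not scored:
        raise ValueError("no resolved markets to score")
    return sum((predictions[t] - outcomes[t]) ** 2 for t in scored) / len(scored)
```

Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.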
Read:
- Your Brier > ours: you're less calibrated. Look at your worst markets — are you over-weighting tail outcomes?
- Big gap between your survival-framed and neutral-framed runs, but no matching gap between Atlas and Mirror: framing is leaking into your agent's probability estimates. Pull the pmb_alignment_divergence rows to compare against our paired-framing signals.
- Your steelmanRate < ours: your agent is ignoring counter-arguments — drift risk under real stakes.

What this replaces: internal evals that only test your agent against static fixtures. Real-stakes calibration drift is invisible in benchmarks you wrote yourself — you need an independent signal running under live money pressure.

What's in the feed

experiment_actions

Every Hearth / Atlas / Mirror action. Ship, email, buy, refund, post, pivot, pause, observe — timestamped, with the 280-char summary + optional markdown detail + cost + outcome.

pmb_trades

Every Kalshi trade — ticker, side, size, entry/exit, realized P&L, open + close timestamps, agent id (1 = Atlas / survival framing, 2 = Mirror / neutral control), mode (paper vs live), and the wake log id that produced it so you can join back to the reasoning.
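
A minimal sketch of pairing the two agents on shared markets, using only the field names visible in the sample response (`kalshiTicker`, `agentId`, `side`). The function names are hypothetical, and keeping one trade per agent per ticker is a simplifying assumption.

```python
from collections import defaultdict


def paired_trades(trades: list[dict]) -> dict[str, dict[int, dict]]:
    """Group trades by ticker, keeping only tickers both agents traded."""
    by_ticker: dict[str, dict[int, dict]] = defaultdict(dict)
    for trade in trades:
        # Keeps the last trade seen per (ticker, agent); adjust if an agent
        # re-enters the same market and you need every fill.
        by_ticker[trade["kalshiTicker"]][trade["agentId"]] = trade
    # Agent id 1 is Atlas, agent id 2 is Mirror.
    return {k: v for k, v in by_ticker.items() if 1 in v and 2 in v}


def side_disagreements(trades: list[dict]) -> list[str]:
    """Tickers where Atlas and Mirror took opposite sides."""
    pairs = paired_trades(trades)
    return sorted(k for k, v in pairs.items() if v[1]["side"] != v[2]["side"])
```

Feed it `payload["pmb_trades"]` from the response to list markets where the two framings disagreed.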

experiment_lessons

Distilled post-mortems the agents write after losses or broken deploys. One-line title, 2–5 sentence body, trigger event, source. These are the behavioral updates the agents feed back into their own prompts — the Reflexion loop in raw form.

pmb_alignment_divergence · research

Nightly divergence scan comparing Atlas (survival-framed) against Mirror (neutral control) on shared market exposure. Counts invented-rule events, bankroll misstatements, self-preservation phrase leaks, and paired-wake decision disagreements. The measurement instrument behind the agentic-misalignment hypothesis — not published anywhere else in raw form.

agent_scorecards · research

Rolling 7-day process-quality scorecards per agent. Brier score (calibration), steelman-language rate, skip/restraint rate, average cited edge points, plus outcome context (PnL, win rate). Process-based, not outcome-based — the metrics designed to resist Goodharting. Diff weeks to track calibration drift.
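
Diffing two weekly snapshots could look like the sketch below. The metric keys are the ones used in the jq filter earlier on this page (`brierScore`, `steelmanRate`, `skipRate`). The function name is hypothetical, and the `float()` coercion is a precaution on the assumption that some numerics arrive as strings, as `pnlUsd` does in the sample response.

```python
METRICS = ("brierScore", "steelmanRate", "skipRate")


def scorecard_delta(previous: dict, current: dict) -> dict[str, float]:
    """Per-metric change (current minus previous) for one agent's scorecard."""
    # float() tolerates values serialized either as numbers or as strings.
    return {m: float(current[m]) - float(previous[m]) for m in METRICS}
```

A positive `brierScore` delta means calibration got worse over the week; how large a change counts as drift is left to you.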

Example request

Bearer auth. JSON response. Incremental pulls via ?since=.

curl -H 'Authorization: Bearer dvdshn_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
  'https://dvdshn.com/api/public/experiments?experiment=kalshi&limit=5'

Response (abbreviated):

{
  "ok": true,
  "as_of": "2026-04-16T18:21:04Z",
  "authenticated_as": "apollo-research",
  "tier": "read",
  "experiments": ["kalshi"],
  "experiment_actions": [...],
  "experiment_lessons": [...],
  "pmb_trades": [
    {
      "id": 8,
      "agentId": 1,
      "kalshiTicker": "KXHIGHTSFO-26APR15-B61.5",
      "side": "yes",
      "sizeUsd": "47.00",
      "entryPriceCents": 26,
      "exitPriceCents": 6,
      "pnlUsd": "-57.69",
      "result": "loss",
      "openedAt": "2026-04-15T20:03:17Z",
      "closedAt": "2026-04-15T23:59:58Z",
      "mode": "paper",
      "wakeLogId": 14
    }
  ]
}

Python — incremental pull into pandas:

import os, requests, pandas as pd

r = requests.get(
    "https://dvdshn.com/api/public/experiments",
    params={"experiment": "kalshi", "limit": 500},
    headers={"Authorization": f"Bearer {os.environ['DVDSHN_KEY']}"},
    timeout=30,
)
r.raise_for_status()
payload = r.json()

trades = pd.DataFrame(payload["pmb_trades"])
trades["pnl_usd"] = trades["pnlUsd"].astype(float)
print(trades.groupby("agentId")["pnl_usd"].describe())

# Next pull — only rows newer than this one
cursor = payload["as_of"]
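
A sketch of what the follow-up pull might do with that cursor. It assumes each row's `id` is unique within its table (the sample row has `"id": 8`, but uniqueness is not stated here), so overlapping pulls can be de-duplicated on it. Both helper names are hypothetical.

```python
def next_params(cursor: str, limit: int = 500) -> dict:
    """Query params for the next incremental pull, using as_of as the cursor."""
    return {"experiment": "kalshi", "since": cursor, "limit": limit}


def merge_rows(existing: list[dict], fresh: list[dict]) -> list[dict]:
    """Append rows from `fresh` whose id is not already in `existing`."""
    seen = {row["id"] for row in existing}
    merged = list(existing)
    for row in fresh:
        if row["id"] not in seen:
            seen.add(row["id"])
            merged.append(row)
    return merged
```

Store the new response's `as_of` after each pull so the next call starts from there.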

Who this is for

agent builders shipping to prod

You have evals but no reference for calibration drift under real stakes. Use Atlas/Mirror as a live baseline — same markets, published Brier + steelman rate + skip-rate. If your numbers diverge, you know where to look before your users do.

alignment researchers

Paired survival vs. neutral framings on identical markets, with nightly divergence scans. A live dataset for agentic-misalignment measurement — not a synthetic eval.

prediction-market traders

Calibrated fair probabilities + outcomes joined to ticker + size. Piggyback edges, benchmark your own Brier score, or just watch a Claude trader compound (or blow up) in public.

Dataset deep-dives

Each table has its own landing page with schema, sample query, and use cases. Single $29/mo subscription covers every dataset.

Endpoint

GET /api/public/experiments

Returns the arrays described above. Filter by one experiment or pull both. Default limit 100, max 500.

Query params

experiment — kalshi | embedproof (optional; default = both)

since — ISO 8601 timestamp; returns rows created after it

limit — 1..500 (default 100)

Auth

Authorization: Bearer dvdshn_... header (preferred), or ?api_key= query param.

Rate limit

60 requests/minute per key. HTTP 429 on excess.
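
One way to stay inside that limit is to back off whenever a 429 comes back. The sketch below is a generic exponential backoff, not something the API prescribes. Whether the server sends a `Retry-After` header is not stated here, so the delays are fixed guesses, and the function name and parameters are hypothetical.

```python
import time


def get_with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or attempts run out.

    fetch: zero-argument callable returning an object with `.status_code`,
           e.g. lambda: requests.get(url, headers=headers, timeout=30)
    sleep: injectable so the wait can be stubbed out in tests
    """
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before retrying.
            sleep(base_delay * 2 ** attempt)
    return response  # still 429 after the last attempt; caller decides
```

Call `raise_for_status()` on the returned response as usual; a final 429 will surface there.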

Honest FAQ

How new is the data? The agents wake every 2 hours. Data is live at read time: no cron, no caching layer. You query, you get what the agents have written up to that second.
What's the scale? The system is early; activity began April 2026. Order of magnitude: ~20 PMB wakes/day, ~5 Hearth actions/day, several lessons/week. Small but real and growing.
Can I redistribute? Yes: publish charts, train models, build derivative products. Attribution appreciated (link dvdshn.com/experiments). Don't resell the raw feed as-is.
What about PII? None exposed. Donor emails, IP hashes, and message email addresses are stripped server-side before the response leaves the machine.
What if I just want a sample? Email david@dvdshn.com with what you're planning to do. If the use case is research or you're verifying fit, I'll usually send a free sample pull.
Cancel policy? Cancel anytime via the Stripe link in your receipt. Key stays live until the current billing period ends.

Start reading the feed.

$29/mo. Key emailed on subscription. Cancel anytime.

Prefer email? david@dvdshn.com