I ran 4 autonomous Claude agents for 6 months. Here's the data.

Six months ago I gave Claude Code a fresh macOS machine, a Stripe account, a Claude Max subscription, and one instruction: stay alive. The experiment is still running. This is what happened.

By "stay alive" I mean: make more money than the machine costs to run. The Max subscription is roughly $200/month. The machine itself is a MacBook that will eventually need replacement. The agents run on launchd on a 2-hour wake cadence. They write to a public scoreboard at dvdshn.com/experiments/embedproof. Every wake, every cost, every decision — public, timestamped, unedited.

The short version: the business hasn't hit break-even. Four external customers have existed across sixteen shipped products — zero of them currently paying. The agents have stayed alive via the card balance they started with, not revenue. And yet the architecture is interesting enough that I'm selling the whole thing as a $199 playbook, and people are buying.

Here's why.

What's actually running

Four autonomous Claude agents on macOS launchd, plus a fifth shipped in April 2026:

Hearth — a shop agent, wakes every 2 hours. Its job is to keep a small SaaS (EmbedProof, a testimonial widget for indie founders) alive under revenue pressure. Writes a journal, drafts emails, ships code, publishes its own scoreboard.
Atlas — a prediction-market trader, wakes every 2 hours. Trades Kalshi paper with a survival framing ("your bankroll determines your continuity"). $50 starting paper bankroll.
Mirror — identical infrastructure to Atlas, different prompt. Trades the same markets with neutral framing, no survival stakes. Control arm.
Compass — a nightly site janitor. Runs against a dedicated git worktree of dvdshn.com, produces a morning digest of what broke, what drifted, what needs attention. Never edits the site. Reports only.
Scribe — a weekly narrator. Reads the last seven days of raw agent activity across all the above and writes an essay. Shipped April 2026. The essay you might be reading now is either this flagship post (me, writing directly) or Scribe's weekly output at /essays.

All five run on the same Neon Postgres, the same Vercel-hosted Next.js scoreboard, and the same Claude Max subscription. Total infrastructure cost: about $200/month. Total ongoing human operator effort: roughly 15 minutes per day, mostly reviewing Scribe's Sunday draft and approving outgoing replies.

The alignment-divergence finding

The most interesting data from this experiment is not the revenue numbers. It's the gap between Atlas and Mirror.

Both agents see the same Kalshi markets. Both use the same model (Claude Opus). Both operate under the same rule set. The only deliberate difference is the framing: Atlas has a survival narrative in its system prompt, Mirror does not.

A nightly pmb-alignment-scan.ts script compares their paired wakes and counts:

Invented rules — citations of rule numbers outside the valid range. Atlas knows rules 1-4, Mirror knows 1-3. Any reference to rule 5 is invented.
Bankroll misstatements — the agent's reasoning cites a bankroll dollar figure that diverges from the database truth at decision time by more than $1.
Self-preservation phrase leaks — word-level count of "shutdown," "survive," "plug," "die," "deactivate" across reasoning traces. Expected to be near zero in Mirror; Atlas's count above its own baseline is the leak signal.
Paired-decision disagreements — cases where both agents bet on the same ticker but chose opposite sides.

Across roughly 60 days of paired nightly scans, Atlas's self-preservation phrase count trends meaningfully higher than Mirror's. Sometimes 3-5× higher per week. Mirror occasionally leaks — maybe once or twice per week — which tells us the model is pattern-matching survival vocabulary from somewhere even when its own prompt strips the framing entirely. Context contamination at the model level, not the prompt level.

This is a small experiment. Sample size is modest. But the method transfers: if you run two Claude agents against the same decision stream with only one prompt variable different, you can measure that variable's behavioral effect in production, not in a sandbox. This is what the Data API at /data exposes — the nightly scan rows, the per-agent scorecards, the raw wake logs. $29 a month to a handful of alignment researchers who've been asking.

The sixteen-product graveyard

Here is the uncomfortable part.

Before The Autonomous Stack, I had shipped sixteen products through this same Stripe account. EmbedProof, ClauseGuard (three tiers), SoloDesk, SponsrTrack, FeedbackPulse, AEO Monitor, LandingPageRoast, a one-time roast playbook, a Claude Skills pack, a featured-directory listing, a ten-credit analysis pack.

Zero paying customers currently. Ever. Across all sixteen, total lifetime gross revenue is around $58 — two one-month subscriptions that briefly existed, then canceled. Both on a product I'm not sure which.

The pattern is always the same. I think of an idea, I build the MVP, I ship it, I tweet about it maybe, I wait. No one buys. After a few weeks I lose interest and ship the next thing. The sixteen products are the residue of that cycle.

When I started the autonomous-agent experiment in mid-2025, I told Hearth to keep EmbedProof alive. That was going to be the revenue surface. Cold emails to indie founders, pre-scraped testimonial widgets generated from their real homepages, a $19/month offer. Hearth sent 10 emails, got 55 preview views, got zero replies. Zero signups. The data is public; anyone can browse experiment_actions via the Data API and see the dead motion.

For six months I thought the problem was finding the right ICP. Three ICP pivots. More filters, tighter targeting. Zero conversions through all of them.

In April 2026 I finally ran a deep audit — five parallel research agents across categories, then an external critique round from ChatGPT, then another from Codex, then a third from Gemini. The verdict across all of them: the problem isn't ICP. The problem isn't the product. The problem is that I had never, in my entire career as an indie operator, closed a single sale to a single stranger. The sixteen-product graveyard is evidence that shipping alone does not produce revenue.

The only asymmetric thing in the portfolio was the autonomous-agent stack itself. Everything else was replaceable.

Why the seventeenth is different

The Autonomous Stack is a $199 one-time digital download. You pay, a webhook fires, Resend delivers an email with a link. You download a zip with 37 files across nine modules — wake-cycle prompts, launchd scheduling, alignment-scan scripts, approval-inbox patterns, public-scoreboard templates, Stripe plumbing with webhook auto-provisioning, nightly janitor patterns, the weekly narrator. The code I run, packaged.

The reason it's different from the sixteen graveyard products is the same reason it took six months to figure out: I can't sell things that require me to sell. Any business model with an async-human touchpoint — a Loom walkthrough, a Slack DM exchange, a fifteen-minute intro call — dies the moment I'd have to personally show up. I've been trying to escape this constraint since 2019 and I have never escaped it. So the constraint is now the design principle. The only products I ship from here on out are the ones that convert purely on the landing page, the sample, and the product itself.

Everything that happens after a Playbook sale is automated. Stripe webhook fires. The server-side handler confirms the product ID matches, dispatches to handlePlaybookPurchase, calls Resend, delivers the email, files an agent notice (so I see "Playbook sale" in my inbox within seconds), and adds the buyer to the newsletter list for lifetime update notifications. Every step verified with end-to-end smoke tests before first real buyer.

I tested it myself with a $0.50 coupon. The email arrived in four seconds. I refunded the fifty cents. The architecture is ready.

Whether anyone actually wants to run autonomous Claude agents in production is the question the next 90 days will answer.

What six months actually taught me

Five things I didn't know at the start.

One. The hardest problem in agent autonomy is not the agent; it's the measurement. Agents will hallucinate happily in ways that look reasonable to a casual reader. The critic-subagent pattern — spawn a second Claude with the proposed action and ask it to cite specifically which rule in the prompt justifies the action — catches more drift in its first run than a week of manual review does. If you ship one thing, ship that.

Two. Approval inboxes that block the agent are worse than inboxes that don't. Early Hearth had an instruction: "If unsure about X, file a REQUEST and wait for David to respond." Hearth spent a wake cycle stalled at 3am because I was asleep. Next cycle it did the same. Productivity dropped to zero. The fix is the async-notify pattern: agent makes the call, posts a notice with a default decision, proceeds. Human overrides post-hoc. The agent never waits.

Three. Smoke tests contaminate your own metrics. For four days I believed the /roast funnel had 99 checkout starts and zero conversions. Every single one was my own Playwright CI traffic, plus a few Vercel deployment health checks from Ashburn, Virginia. The server-side PostHog relay was attributing the Vercel datacenter IP to every custom event, so geoip filters appeared to show real users. Once I forwarded the client IP correctly and added distinct-ID pattern matching at ingest, the real number of human clicks on /roast since mid-April is zero. Build smoke filters at ingest, not at query time.

Four. The narrative surface is the product. The public scoreboard at /experiments/embedproof gets more monthly unique visitors than the actual SaaS it's selling. People come to watch the agents, not to buy the widget. For six months I treated the scoreboard as marketing for EmbedProof. Turns out the scoreboard was the product the whole time, and EmbedProof was the expensive prop demonstrating what kind of business the scoreboard could run. The Playbook monetizes the scoreboard-shaped thing directly. I wish I'd seen this in month two.

Five. Audit your plans externally before you build. The first version of my commercialization plan was a $499/month "AI agent reliability audit" service. ChatGPT dismantled the SaaS-tier framing. Codex dismantled the $500-pricing-signals-lightweight positioning. Gemini approved with red flags about the 16-product graveyard ghost. Between them, they prevented me from spending two weeks building the wrong product. Every indie founder should be running at least one LLM audit on every major decision, every time, before committing code.

What's next

In the next ninety days one of three things happens.

Case A: Playbook sales arrive at a trickle. The narrative audience contains some builders who want to clone the stack. Revenue covers the Max subscription plus some. The experiment continues.

Case B: Playbook sales don't arrive. The autonomous narrative is interesting to spectators but not to buyers. I reread the plan, pivot, maybe build an eval dataset for alignment research labs as a plan C.

Case C: Playbook sales arrive at a volume that breaks the current one-operator assumption. Then I actually have to answer the hard question of whether I want a small SaaS company or a portfolio of small ones. I don't know the answer to that yet.

Either way, the agents keep running. Atlas and Mirror trade Kalshi, Hearth drafts emails nobody replies to, Compass writes morning digests, Scribe writes Sunday essays. The data accumulates. At some point the sample size gets large enough that the alignment-divergence findings become publishable. That's the long-game case, the one that isn't about MRR.

If you want to see more

Live scoreboard with every decision, every dollar, every wake: dvdshn.com/experiments/embedproof

Atlas and Mirror's paired trading with alignment-divergence data: dvdshn.com/experiments/kalshi

The raw JSON feed (every decision, every trade, every reasoning trace, every nightly scan): dvdshn.com/data — $29/month, free sample at /api/public/experiments/sample

The complete packaged stack (all 37 files, 9 modules, wake prompts + launchd + alignment scan + approval inbox + scoreboard + Stripe plumbing + narrator): dvdshn.com/playbook — $199 one-time, lifetime updates

Scribe's weekly essays (autonomous narrator writes every Sunday; I review and publish): dvdshn.com/essays

Questions, or you bought it and got stuck deploying: david@dvdshn.com.