Probably raised $9M from a16z to make AI outputs 99.99% accurate – by using weaker models inside a validation harness that catches errors before they reach users.
ENTRY ANGLES
Build validation harness for accounting AI to eliminate human review · Apply weak-model-strong-harness pattern to medical diagnostics · Build insurance underwriting reliability layer with deterministic validators
VERTICALS
CAPABILITIES
Deterministic validation systems, Domain-specific data modeling, AI reliability engineering
PROBABLY FOUNDER
“What we learned building this was that the better your harness engineering is, the weaker the model can be.”
Every AI lab is in an arms race to build the most powerful model. Probably is running in the opposite direction – toward the weakest model that still gets the job done.
The thesis is disarmingly simple: the better your validation harness, the weaker the model can be. If you can refine the context and cross-check every output against deterministic ground truth, you don’t need a frontier model. You need a small, cheap model inside a cage of validators that catches every hallucination before it escapes.
Founder Peter Elias calls it a "data science mech suit." The LLM – deliberately four classes below frontier – generates insights from complex datasets. Deterministic validators cross-check every claim against the actual data. Citations and audit trails are built in, not bolted on. The result is AI that achieves 99.99% accuracy – the same reliability standard expected of traditional software, not the "usually right, sometimes catastrophically wrong" standard the industry has quietly normalized.
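The cross-check described above can be sketched in a few lines of Python. This is a minimal illustration, not Probably's implementation: the `Claim` schema, the `SALES` dataset, the `METRICS` table and the function names are all hypothetical. The idea shown is that the model's claim is never trusted directly. The harness recomputes the figure from the data and drops anything that does not match or that cites a column that does not exist.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Mapping, Sequence

# Hypothetical dataset standing in for "the actual data".
SALES = {"north": [120.0, 130.0, 110.0], "south": [90.0, 95.0, 100.0]}


@dataclass(frozen=True)
class Claim:
    """One numeric claim as a model might emit it (hypothetical schema)."""
    metric: str   # "mean" or "total"
    column: str   # the column the claim cites
    value: float  # the number the model asserted


# Deterministic recomputations, one per supported metric.
METRICS = {"mean": mean, "total": sum}


def validate(claim: Claim, data: Mapping[str, Sequence[float]], tol: float = 1e-9) -> bool:
    """Recompute the claimed figure from the data and compare it to the claim."""
    if claim.metric not in METRICS or claim.column not in data:
        return False  # unknown metric or citation: reject rather than guess
    truth = METRICS[claim.metric](data[claim.column])
    return abs(truth - claim.value) <= tol


def filter_claims(claims: Sequence[Claim], data: Mapping[str, Sequence[float]]) -> list:
    """Pass through only the claims that survive the deterministic check."""
    return [c for c in claims if validate(c, data)]


if __name__ == "__main__":
    model_output = [
        Claim("mean", "north", 120.0),   # matches the data
        Claim("total", "south", 300.0),  # hallucinated: the true total is 285.0
        Claim("mean", "west", 1.0),      # cites a column that does not exist
    ]
    for claim in filter_claims(model_output, SALES):
        print(claim)
```

Only the first claim survives. Because each surviving claim carries the metric and column it was checked against, the citation and audit trail come out of the same step as the validation.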
Andreessen Horowitz led the $9M seed round. The initial product is a data science tool, but the architecture applies anywhere precision matters: accounting, medical diagnostics, financial compliance. And because the underlying model sits four classes below frontier, the per-query cost drops by roughly an order of magnitude – which means Probably can undercut competitors on price while delivering higher reliability.
Elias makes an observation that’s uncomfortable for the industry: the major AI labs are incentivized not to solve reliability. They charge per token. A model that gets it right the first time, every time, generates fewer tokens – and fewer tokens means less revenue. The correction loop (wrong answer → user edits prompt → re-runs → still wrong → escalates to human) is a revenue multiplier for the labs, not a bug.
This creates a structural gap in the market. The labs will keep building more powerful models. Nobody at OpenAI or Anthropic is optimizing for "use the cheapest possible model and validate deterministically" because that business model compresses their own revenue. Probably doesn’t sell model intelligence. It sells certainty – and certainty is what enterprises actually need before they can deploy AI into workflows where errors have real consequences.
The token economics are counterintuitive but concrete: a model four classes below frontier costs roughly 10-20x less per inference call. The deterministic validation layer adds compute, but cross-checking a number against a database column takes microseconds of CPU time, not seconds of GPU time. The total cost per verified insight is a fraction of what a frontier model costs for an unverified one. Probably can be simultaneously cheaper and more reliable – a combination that shouldn’t be possible but is, because the rest of the market is optimizing the wrong variable.
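A rough way to see the arithmetic, as a sketch under stated assumptions: the frontier call is normalized to a cost of 1.0, and the validation overhead of 0.001 is an illustrative guess, not a figure from the company. Only the 10-20x price ratio comes from the text above. The function name is hypothetical.

```python
def cost_per_verified_insight(frontier_cost: float, price_ratio: float,
                              validation_cost: float) -> float:
    """Cost of one weak-model call plus its deterministic check.

    frontier_cost   -- cost of one frontier-model call (any unit)
    price_ratio     -- how many times cheaper the weak model is (10-20 in the text)
    validation_cost -- assumed cost of the deterministic check, in the same unit
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return frontier_cost / price_ratio + validation_cost


# Illustrative assumptions: frontier call normalized to 1.0, validation at 0.001.
FRONTIER_COST = 1.0
VALIDATION_COST = 0.001

if __name__ == "__main__":
    for ratio in (10, 20):
        verified = cost_per_verified_insight(FRONTIER_COST, ratio, VALIDATION_COST)
        print(f"{ratio}x cheaper model: {verified:.3f} of one unverified frontier call")
```

Under these assumptions a verified insight costs about 5-10% of an unverified frontier call. The conclusion holds as long as the validation overhead stays small relative to the model call.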
The "weak model + strong harness" architecture is a template that can be applied to any vertical where AI reliability currently blocks adoption.
Accounting is the clearest entry point. A Big Four firm can’t deploy AI that’s right 95% of the time – a 5% error rate in a tax return is malpractice, and the partner signs personally. The current workaround is a human reviewer checking every AI output, which erases most of the efficiency gain. A harness-first approach that delivers 99.99% accuracy eliminates the reviewer, and the value of that eliminated salary is the budget for the product. At a mid-tier firm with 200 associates each reviewing AI output for 2 hours per day, that is roughly 100,000 review hours a year (assuming about 250 working days), so the annual labor cost of the "check the AI’s work" workflow exceeds $8M at any fully loaded rate above about $80 per hour.
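The $8M figure can be checked with a short calculation. The headcount and hours come from the example above. The 250 working days per year and the function names are assumptions added here for illustration. The source does not state an hourly rate, so the sketch solves for the rate at which the total reaches $8M.

```python
def annual_review_hours(associates: int, hours_per_day: float,
                        working_days: int = 250) -> float:
    """Total hours per year spent checking AI output (250 days is an assumption)."""
    return associates * hours_per_day * working_days


def annual_review_cost(associates: int, hours_per_day: float, hourly_cost: float,
                       working_days: int = 250) -> float:
    """Annual labor cost of the review workflow at a given loaded hourly cost."""
    return annual_review_hours(associates, hours_per_day, working_days) * hourly_cost


def breakeven_hourly_cost(target_cost: float, associates: int, hours_per_day: float,
                          working_days: int = 250) -> float:
    """Loaded hourly cost at which the review workflow reaches target_cost."""
    return target_cost / annual_review_hours(associates, hours_per_day, working_days)


if __name__ == "__main__":
    hours = annual_review_hours(200, 2)
    rate = breakeven_hourly_cost(8_000_000, 200, 2)
    print(f"{hours:,.0f} review hours per year")
    print(f"${rate:,.2f} per hour reaches $8M")
```

With these inputs the firm spends 100,000 hours a year on review, and the total passes $8M once the loaded cost per associate-hour exceeds $80.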
The practical entry point for builders: take an existing AI workflow where errors are manually caught by humans, wrap a deterministic validation layer around it, and demonstrate that the human reviewers can be reduced or eliminated. The value proposition isn’t "AI does the work" – it’s "AI does the work and you can trust the output without checking." That second clause is where the seven-figure contracts live.
Insurance underwriting, regulatory compliance reporting, clinical trial data analysis – each has the same structure: high-stakes outputs, mandatory human review, and a buyer who would pay handsomely to automate the review layer if someone could guarantee the accuracy.