Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough



“When you get a demo and something works 90% of the time, that’s just the first 9.” — Andrej Karpathy

The “March of Nines” frames a common production reality: you can reach the first 90% of reliability with a strong demo, and each additional 9 typically requires a comparable amount of engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.

The compounding math behind the March of Nines

“Every single 9 is the same amount of work.” — Andrej Karpathy

Agentic workflows compound failure. A typical enterprise flow might include intent parsing, context retrieval, planning, multiple tool calls, validation, formatting, and audit logging. If a workflow has n steps and each step succeeds with probability p, end-to-end success is roughly p^n.

In a 10-step workflow, end-to-end success compounds the failure of every step. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.

Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | What this means in practice
90.00% | 34.87% | 65.13% | ~6.5 interruptions/day | Prototype territory; most workflows get interrupted.
99.00% | 90.44% | 9.56% | ~1 every 1.0 days | Fine for a demo, but interruptions are still frequent in real use.
99.90% | 99.00% | 1.00% | ~1 every 10.0 days | Still feels unreliable because misses remain common.
99.99% | 99.90% | 0.10% | ~1 every 3.3 months | This is where it starts to feel like dependable enterprise-grade software.
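
The arithmetic is easy to reproduce. The short Python sketch below recomputes the table above from the p^n formula; the 10-step, 10-workflows-per-day figures are the same illustrative assumptions the table uses.

# Recompute the table: end-to-end success of an n-step workflow is p ** n.
def workflow_stats(p, steps=10, workflows_per_day=10):
    success = p ** steps                       # probability every step succeeds
    failure = 1 - success                      # probability at least one step fails
    interruptions_per_day = failure * workflows_per_day
    return success, failure, interruptions_per_day

for p in (0.90, 0.99, 0.999, 0.9999):
    s, f, i = workflow_stats(p)
    print(f"p={p:.2%}  success={s:.2%}  failure={f:.2%}  ~{i:.2f} interruptions/day")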

Define reliability as measurable SLOs

“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — Andrej Karpathy

Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. Start with a small set of SLIs that describe both model behavior and the surrounding system:

  • Workflow completion rate (success or explicit escalation).

  • Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.

  • Schema-valid output rate for every structured response (JSON/arguments).

  • Policy compliance rate (PII, secrets, and security constraints).

  • p95 end-to-end latency and cost per workflow.

  • Fallback rate (safer model, cached data, or human review).

Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled; a rough sketch of tiered targets follows below.
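
As an illustration of what that can look like in code, the sketch below defines tiered targets and a simple error-budget calculation. The tier names, thresholds, and per-period framing are assumptions to adapt, not recommended values.

# Illustrative SLO targets per workflow tier; real values belong in config, not code.
SLO_TARGETS = {
    "low":    {"completion_rate": 0.99,   "p95_latency_s": 30},
    "medium": {"completion_rate": 0.999,  "p95_latency_s": 20},
    "high":   {"completion_rate": 0.9999, "p95_latency_s": 10},
}

def error_budget_remaining(tier, total_runs, failed_runs):
    """Fraction of the period's error budget still unspent for a tier."""
    allowed_failure_rate = 1 - SLO_TARGETS[tier]["completion_rate"]
    budget = allowed_failure_rate * total_runs     # failures the SLO permits this period
    return max(0.0, (budget - failed_runs) / budget) if budget else 0.0

# Example: a high-impact workflow ran 50,000 times with 3 failures; ~40% of budget remains.
print(round(error_budget_remaining("high", total_runs=50_000, failed_runs=3), 2))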

9 levers that reliably add nines

1) Constrain autonomy with an explicit workflow graph

Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.

  • Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate (see the sketch below).

  • Persist state with idempotency keys so retries are safe and debuggable.
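
A minimal sketch of such a node definition, with illustrative field names and an invented three-node graph; this is not any particular framework’s API.

from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class WorkflowNode:
    """One bounded step in an explicit workflow graph (illustrative, not a framework API)."""
    name: str
    allowed_tools: list[str]
    max_attempts: int
    success_predicate: Callable[[Any], bool]
    next_on_success: Optional[str] = None
    next_on_failure: str = "escalate_to_human"   # deterministic terminal handling

# Invented three-node graph: parse intent, fetch the order, draft a reply.
GRAPH = {
    "parse_intent": WorkflowNode("parse_intent", [], 2, lambda o: "intent" in o, "fetch_order"),
    "fetch_order": WorkflowNode("fetch_order", ["orders_api"], 3, lambda o: bool(o.get("order_id")), "draft_reply"),
    "draft_reply": WorkflowNode("draft_reply", ["templates"], 2, lambda o: bool(o.get("text"))),
}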

2) Enforce contracts at every boundary

Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.

  • Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes, as in the sketch below.

  • Use enums and canonical IDs, and normalize time (ISO-8601 plus timezone) and units (SI).
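
For example, a server-side gate built on the jsonschema library might look roughly like this; the refund contract and its fields are invented for illustration.

import jsonschema

# Invented contract for a refund tool call; validated server-side before execution.
REFUND_ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ord_[a-z0-9]+$"},   # canonical IDs only
        "amount": {"type": "number", "exclusiveMinimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},  # enums, not free text
        "requested_at": {"type": "string", "format": "date-time"},      # ISO-8601 (advisory unless a FormatChecker is used)
    },
    "required": ["order_id", "amount", "currency", "requested_at"],
    "additionalProperties": False,
}

def validate_tool_args(args: dict) -> dict:
    # Raises jsonschema.ValidationError on interface drift, before any side effects run.
    jsonschema.validate(instance=args, schema=REFUND_ARGS_SCHEMA)
    return args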

3) Layer validators: syntax, semantics, business rules

Schema validation catches formatting issues. Semantic and business-rule checks stop plausible answers that break systems; a combined sketch follows the bullets below.

  • Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.

  • Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.
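
A hypothetical sketch of the layering, with made-up checks for a refund action; real validators would pull bounds and approval limits from configuration.

class ValidationError(Exception):
    pass

def validate_refund(out: dict, order_db: dict, auto_approve_limit: float = 500.0) -> dict:
    # Layer 1 (syntax): required fields and types; full schema check elided for brevity.
    if not isinstance(out.get("amount"), (int, float)):
        raise ValidationError("amount missing or not numeric")
    # Layer 2 (semantics): referential integrity and numeric bounds.
    order = order_db.get(out.get("order_id"))
    if order is None:
        raise ValidationError(f"unknown order_id {out.get('order_id')!r}")
    if not 0 < out["amount"] <= order["total"]:
        raise ValidationError("refund must be positive and not exceed the order total")
    # Layer 3 (business rules): risky writes above a limit need human approval.
    if out["amount"] > auto_approve_limit:
        raise ValidationError("amount above auto-approval limit; route to approval")
    return out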

4) Route by risk using uncertainty signals

High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.

  • Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.

  • Gate risky steps behind stronger models, extra verification, or human approval, as in the routing sketch below.
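
A sketch of what risk-based routing can reduce to in code; the confidence thresholds and path names are assumptions to tune per workflow.

def route(action_impact: str, confidence: float) -> str:
    """Return a handling path; the thresholds are assumptions to tune per workflow."""
    if action_impact == "high":
        if confidence < 0.9:
            return "human_approval"           # gate risky, uncertain actions
        return "verify_with_second_model"     # high impact always gets extra verification
    if confidence < 0.7:
        return "stronger_model"               # rerun low-confidence steps on a stronger model
    return "proceed"

assert route("high", 0.95) == "verify_with_second_model"
assert route("low", 0.60) == "stronger_model"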

5) Engineer tool calls like distributed systems

Connectors and dependencies often dominate failure rates in agentic systems.

  • Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits (a minimal circuit-breaker sketch follows below).

  • Version tool schemas and validate tool responses to prevent silent breakage when APIs change.
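
As an illustration, a minimal circuit breaker and a full-jitter backoff helper might look like the sketch below; the jittered_backoff shape matches the helper assumed in the step wrapper later in this piece.

import random
import time

class CircuitBreaker:
    """Open after N consecutive failures; let traffic through again after a cooldown."""
    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures, self.reset_after_s = max_failures, reset_after_s
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.reset_after_s:
            raise RuntimeError("circuit open: connector temporarily disabled")
        try:
            result = fn(*args, **kwargs)
            self.failures, self.opened_at = 0, None   # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise

def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 10.0) -> float:
    # Full jitter: sleep a random amount up to an exponentially growing cap.
    return random.uniform(0, min(cap, base * 2 ** attempt))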

6) Make retrieval predictable and observable

Retrieval quality determines how grounded your application can be. Treat it like a versioned data product with coverage metrics.

  • Track empty-retrieval rate, document freshness, and hit rate on labeled queries (two of these are sketched below).

  • Ship index changes behind canaries so regressions surface before they reach production.

  • Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.
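
A small sketch of two of those SLIs computed over a labeled query set; the record fields are illustrative.

# Each labeled result looks like {"doc_ids": [...], "relevant_ids": [...]} (fields invented).
def retrieval_metrics(results: list[dict]) -> dict:
    empty = sum(1 for r in results if not r["doc_ids"])
    hits = sum(1 for r in results if set(r["doc_ids"]) & set(r["relevant_ids"]))
    return {
        "empty_retrieval_rate": empty / len(results),
        "hit_rate": hits / len(results),
    }

sample = [
    {"doc_ids": ["d1", "d7"], "relevant_ids": ["d7"]},   # hit
    {"doc_ids": [], "relevant_ids": ["d3"]},             # empty retrieval
]
print(retrieval_metrics(sample))   # {'empty_retrieval_rate': 0.5, 'hit_rate': 0.5}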

7) Build a production evaluation pipeline

The later nines depend on finding rare failures quickly and preventing regressions.
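
Concretely, every incident can become a case in a golden set that gates deploys. The sketch below assumes a hypothetical run_workflow function and simple string assertions; the case itself is invented.

# Hypothetical golden set: every past incident becomes a regression case.
GOLDEN_SET = [
    {"case_id": "INC-104", "input": "refund order ord_91 for $40",
     "must_contain": ["ord_91"], "must_not_contain": ["ord_19"]},   # a past ID-swap incident
]

def regressions(run_workflow) -> list[str]:
    """Return case IDs that fail; gate deploys on this list being empty."""
    failed = []
    for case in GOLDEN_SET:
        out = run_workflow(case["input"])
        ok = all(s in out for s in case["must_contain"]) and \
             all(s not in out for s in case["must_not_contain"])
        if not ok:
            failed.append(case["case_id"])
    return failed

print(regressions(lambda text: "Refund created for ord_91"))   # []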

8) Invest in observability and operational response

Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.

  • Emit traces/spans per step, store redacted prompts and tool I/O behind strong access controls, and classify every failure into a taxonomy (sketched below).

  • Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation.
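
A sketch of a small failure taxonomy and the shape of a per-step trace event; the category names are assumptions, and real events would also carry IDs, timestamps, and redaction metadata.

from enum import Enum

class FailureCategory(Enum):
    # Illustrative taxonomy; grow it from real incidents rather than up front.
    TIMEOUT = "timeout"
    SCHEMA_INVALID = "schema_invalid"
    SEMANTIC_INVALID = "semantic_invalid"
    POLICY_VIOLATION = "policy_violation"
    UPSTREAM_ERROR = "upstream_error"

def failure_event(step: str, category: FailureCategory, redacted_io: dict) -> dict:
    """Shape of a span event; redacted prompts and tool I/O sit behind strict access controls."""
    return {"step": step, "failure_category": category.value, "io": redacted_io}

print(failure_event("fetch_order", FailureCategory.TIMEOUT,
                    {"tool": "orders_api", "args": "[REDACTED]"}))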

9) Ship an autonomy slider with deterministic fallbacks

Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. Treat autonomy as a knob, not a switch, and make the safe path the default.

  • Default to read-only or reversible actions; require explicit confirmation (or approval workflows) for writes and irreversible operations (one way to model this knob is sketched after this list).

  • Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.

  • Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower the temperature, and tighten timeouts during incidents.

  • Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key.
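
One way to model the autonomy knob, sketched with assumed level names and a simplified write policy.

from enum import IntEnum

class Autonomy(IntEnum):
    # Illustrative levels; higher values allow more unsupervised action.
    READ_ONLY = 0        # retrieval-only answers, no side effects
    SUGGEST = 1          # drafts actions, a human confirms every write
    ACT_REVERSIBLE = 2   # executes reversible writes; irreversible ones still need approval
    ACT_FULL = 3         # full autonomy, reserved for proven low-risk workflows

def write_policy(reversible: bool, level: Autonomy) -> str:
    if level == Autonomy.READ_ONLY:
        return "blocked"
    if level == Autonomy.SUGGEST or (not reversible and level < Autonomy.ACT_FULL):
        return "needs_human_approval"
    return "allowed"

# During an incident, a per-tenant safe mode simply clamps the level back down.
incident_level = min(Autonomy.ACT_REVERSIBLE, Autonomy.SUGGEST)
assert write_policy(reversible=True, level=incident_level) == "needs_human_approval"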

Implementation sketch: a bounded step wrapper

A small wrapper around every model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.

# (start_span, deadline, metric, sleep, jittered_backoff, UpstreamError, ValidationError,
#  and EscalateToHuman are assumed helpers from the host codebase.)
def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all retries under one span
    span = start_span(name)
    safer = False
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can't stall the workflow
            with deadline(timeout_s):
                out = attempt_fn(mode="safer") if safer else attempt_fn()
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry in "safer" mode (lower temp / stricter prompt)
            span.log({"attempt": attempt, "err": str(e)})
            safer = True
    # fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")

Why enterprises insist on the later nines

Reliability gaps translate into business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.

Closing checklist

  • Pick a top workflow, define its completion SLO, and instrument terminal status codes.

  • Add contracts and validators around every model output and tool input/output.

  • Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).

  • Route high-impact actions through higher-assurance paths (verification or approval).

  • Turn every incident into a regression test in your golden set.

The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.

Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.

Welcome to the VentureBeat community!

Our guest posting program is where technical experts share insights and offer impartial, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of business.

Read more from our guest post program, and check out our guidelines if you're interested in contributing an article of your own!



