
Clinical AI fails at the last mile because models can pass benchmarks but hospitals can't install them.

By Pablo Díaz

Co-founder, Superposition Labs, Inc.

Published

01

What does “last mile” mean in clinical AI?

In telecom, the last mile is the final stretch of cable between a neighborhood junction box and an individual home. The backbone infrastructure - fiber optics spanning continents, undersea cables, switching stations - accounts for the majority of investment but only a fraction of the difficulty. The hard part has always been the last 1,500 feet. In logistics, it is the same: the last leg of a delivery route accounts for roughly 53% of total shipping cost, not because the package is heavy, but because the conditions at every doorstep are different. The last mile is where standardization meets the particular.

Clinical AI has its own last mile, and it is wider than anyone in Silicon Valley seems willing to admit. The model is the backbone. Google, OpenAI, Anthropic, and a dozen well-capitalized labs are spending billions to push diagnostic accuracy, clinical reasoning, and multimodal understanding forward. They are succeeding. But a model that can reason about a chest X-ray in a research paper cannot reason about a chest X-ray inside the radiology workflow at Intermountain Health on a Tuesday morning. The distance between those two things is the deployment gap.

Most healthcare AI startups die here. They build a model, demonstrate performance on a benchmark, raise a Series A on the strength of the demo, and then spend three years trying to get a single health system to integrate it into production. The integration never completes, or it completes but clinicians do not trust the output, or clinicians trust it but legal will not sign off, or legal signs off but the reimbursement pathway does not exist. The deployed system is the product. And the deployed system requires an entire layer of infrastructure that does not exist yet.

We call this layer the harness. It is the connective tissue between a foundation model that can practice medicine and a hospital that will let it. Everything we build at Superposition Labs is a piece of the harness. This essay is about why the harness has to exist and why, as of April 2026, almost none of it does.

02

How good is the AI, really?

Good enough that the capability question is settled. The deployment question is all that remains.

Google's Med-Gemini scores 91.1% on MedQA, the United States Medical Licensing Examination-style benchmark that has become the standard yardstick for clinical AI. For reference, the passing threshold for human examinees is roughly 60%. Three years ago, the best medical LLM scored in the low 60s. Accuracy has grown roughly logarithmically with compute, so each gain costs more than the last, yet year over year the scores have climbed almost linearly, and the curve has not flattened yet. Med-Gemini also achieves state-of-the-art on 10 of 14 medical benchmarks simultaneously, including clinical reasoning, medical image interpretation, and genomics.

More consequential than any benchmark: Google's AMIE system reached 81.7% diagnostic accuracy in a head-to-head study against primary care physicians, who scored 53.3%. This was not a toy evaluation. It was a randomized, double-blind study with real clinical scenarios, published in Nature. The AI was not assisting physicians - it was replacing the diagnostic function entirely, and it outperformed them by 28.4 percentage points.

These are not prototypes. They are diagnostic systems that outperform the median physician on the tasks that consume the majority of primary care time. The WHO projects an 11-million health worker shortfall by 2030. The capability to close that gap exists today, trapped inside research papers and corporate demos. The capability gap has closed. The deployment gap has not.

03

Why can't hospitals just install it?

Because there is no “it” to install. A foundation model is not a product any more than a diesel engine is a truck. Between the engine and a functional delivery fleet sits a stack of engineering, regulation, insurance, and operational process that no one has built for clinical AI. The obstacles are concrete and compounding.

Start with the electronic health record. Epic and Oracle Health (formerly Cerner) control roughly 80% of the US hospital EHR market. Integrating a new clinical decision-support tool into either system takes 12 to 18 months per deployment, requires dedicated interface engineers on both sides, and costs mid-six to low-seven figures in implementation services. There is no standard API for clinical AI inference. Every integration is bespoke. A startup that signs its first health system customer on January 1st might - if everything goes well - see its model serve a single recommendation in a clinician's workflow sometime between the following January and the following summer.

Then there is compliance. HIPAA's technical safeguard requirements were written for humans accessing records, not for autonomous systems generating clinical decisions at machine speed. The minimum necessary standard - the rule that limits access to the minimum PHI needed for a given purpose - has no established interpretation for an AI agent that ingests an entire patient chart to reason about a differential diagnosis. The HHS Office for Civil Rights (OCR), which enforces HIPAA, has issued no guidance on the question. No audit trail standard exists for AI-generated clinical recommendations. Every hospital legal team is improvising, and most are improvising toward “no.”

Hospital procurement makes it worse. The average purchasing cycle for a new clinical IT system is 18 to 24 months from initial vendor contact to signed contract. That timeline assumes the product category already exists, that a budget line item covers it, and that the C-suite has a mental model for what they are buying. None of those conditions hold for autonomous clinical AI. IT budgets are already consumed by cybersecurity mandates following the Change Healthcare breach, which affected over 100 million patient records and prompted emergency spending across the industry.
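The two ranges quoted above compound. As a rough sketch, assuming procurement and EHR integration run strictly back to back (an assumption for illustration; real projects may overlap or stall between phases), the arithmetic looks like this. The `Phase` type and `total_range` helper are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    min_months: int
    max_months: int


# Ranges quoted in the essay. Treating them as sequential is an assumption.
PHASES = [
    Phase("procurement (first contact to signed contract)", 18, 24),
    Phase("EHR integration (per deployment)", 12, 18),
]


def total_range(phases):
    """Return (best, worst) case in months if the phases run one after another."""
    best = sum(p.min_months for p in phases)
    worst = sum(p.max_months for p in phases)
    return best, worst


best, worst = total_range(PHASES)
print(f"first vendor contact to first live recommendation: {best}-{worst} months")
print(f"in years: {best / 12:.1f} to {worst / 12:.1f}")
```

Under that assumption the total is 30 to 42 months, or about two and a half to three and a half years for a single health system.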

The result: the most capable medical AI in history sits behind a demo login, and the clinicians who need it most have never seen it run.

04

Who carries the liability when AI is wrong?

Nobody knows. That is the problem.

Current US medical malpractice doctrine rests on professional negligence and respondeat superior: the physician is liable for clinical decisions that fall below the standard of care, and the institution is vicariously liable for the physician. This works when a human makes the decision. It breaks when the decision is generated by a model trained on 10 million patient encounters and deployed by a software company that has never treated a patient. If an autonomous AI system misdiagnoses a pulmonary embolism as anxiety - a mistake human physicians make routinely - the patient sues. But who is the defendant? The hospital that deployed the system? The software vendor that built the integration? The foundation model provider whose weights generated the output? The clinician who was nominally “supervising” but had 40 other patients?

No statutory framework exists for AI-as-practitioner in any US jurisdiction. The FDA's Software as a Medical Device pathway covers diagnostic aids and clinical decision support, but it was not designed for autonomous agents that generate treatment plans without physician review. Malpractice insurers have no actuarial table for autonomous clinical AI because there is no claims history to build one from. Underwriters are pricing blind, which means they are either refusing coverage entirely or quoting premiums that make deployment uneconomical.

This liability vacuum is itself a deployment blocker, independent of every technical obstacle. A hospital CEO who wants to deploy clinical AI faces a question with no legal answer: if this system harms a patient, what happens to us? Until someone constructs a liability architecture - a clear allocation of responsibility among model provider, integration vendor, institution, and clinician - rational administrators will choose inaction. The risk of deploying is unbounded; the risk of not deploying is merely the status quo.
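To make "a clear allocation of responsibility" concrete, here is a minimal sketch of what such an allocation could look like as data. Everything in it is hypothetical: the `LiabilityAllocation` type, the shares, the caps, and the damages figure are invented for illustration and reflect no actual contract, statute, or Superposition Labs product.

```python
from dataclasses import dataclass
from fractions import Fraction

# The four parties named in the essay.
PARTIES = ("model_provider", "integration_vendor", "institution", "clinician")


@dataclass(frozen=True)
class LiabilityAllocation:
    shares: dict  # party -> Fraction of damages (hypothetical values)
    caps: dict    # party -> maximum dollars owed; a missing key means uncapped

    def __post_init__(self):
        if set(self.shares) != set(PARTIES):
            raise ValueError("every party needs exactly one share")
        if sum(self.shares.values()) != 1:
            raise ValueError("shares must sum to exactly 1")

    def apportion(self, damages: int):
        """Split damages by share, apply caps, and report what nobody covers."""
        owed = {}
        for party in PARTIES:
            amount = self.shares[party] * damages
            cap = self.caps.get(party)
            owed[party] = amount if cap is None else min(amount, cap)
        unallocated = damages - sum(owed.values())
        return owed, unallocated


# Illustrative numbers only.
allocation = LiabilityAllocation(
    shares={
        "model_provider": Fraction(1, 4),
        "integration_vendor": Fraction(1, 4),
        "institution": Fraction(2, 5),
        "clinician": Fraction(1, 10),
    },
    caps={"model_provider": 1_000_000, "clinician": 250_000},
)

owed, unallocated = allocation.apportion(6_000_000)
for party, amount in owed.items():
    print(f"{party}: ${int(amount):,}")
print(f"unallocated after caps: ${int(unallocated):,}")
```

With these made-up numbers, the caps leave $850,000 that no party has agreed to cover. A workable framework has to say where that remainder lands, which is the "unbounded" part of the risk an administrator currently sees.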

The irony is sharp. By one widely cited estimate, the status quo kills roughly 250,000 Americans per year through medical error, which would make it the third leading cause of death. An AI system that cut that toll by even 20% would save about 50,000 lives annually. But the legal system is structured to punish novel harm more severely than familiar harm, so the rational institutional choice is to keep making the same mistakes with human doctors rather than risk making new mistakes with AI.

05

What has to exist that doesn't exist yet?

The deployment gap is not a single missing piece. It is five missing pieces, each load-bearing, each dependent on the others. We think of them collectively as the harness - the infrastructure that holds autonomous clinical AI in place while it operates on real patients in real hospitals.

Clinical integrations. Standardized, bidirectional pipes between foundation models and EHR systems. Not one-off integrations that take 18 months each, but a connective layer that lets any qualified model plug into any standards-compliant health system in weeks. This requires FHIR-native interfaces, real-time event streaming from clinical workflows, and a permissions model that maps to existing hospital credentialing. Nothing like this exists today. Every deployment is hand-wired.

Liability architecture. A contractual and statutory framework that allocates responsibility when autonomous AI causes harm. Who pays? Under what conditions? With what caps? The model could be how the automotive industry handled autonomous vehicle liability - not perfectly, but well enough to let deployment begin. We have written about this in detail.

Regulatory scaffolding. Utah moved first, approving an AI system to prescribe roughly 190 chronic medications under defined conditions. China has deployed autonomous clinical AI across 260+ hospitals in 93.5% of its provinces. ARPA-H is funding the first US agentic-AI clinical pilot. The regulatory surface is fracturing - state by state, country by country - and someone has to build systems that can operate across all of it. Our regulatory analysis covers the full landscape.

Data standards. HL7 FHIR is the closest thing to a universal clinical data standard, and it was designed for human-to-system communication. It has no resource type for autonomous AI-generated clinical output. No standard envelope for an AI differential diagnosis. No provenance model for a treatment recommendation generated by a model whose training data spans 40 countries. The standards bodies are aware of the gap; closing it will take years of committee work unless the market forces a de facto standard first.

Trust infrastructure. Hospital boards do not read arXiv papers. They need deployment evidence: outcomes data from comparable institutions, malpractice tail-risk analysis, staff satisfaction metrics from pilot sites, and a vendor with enough operational history to survive a reference check. Benchmarks prove capability. Trust infrastructure proves safety in practice. The distinction is the entire deployment gap in miniature.
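As one illustration of the data-standards piece, here is a minimal sketch of what a "standard envelope for an AI differential diagnosis" might carry. It is not a FHIR resource and not a proposal from any standards body; the `DifferentialEnvelope` type, its field names, the model identifiers, and the sample chart text are all hypothetical and made up for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Candidate:
    condition: str
    probability: float  # model-reported estimate, not a calibrated guarantee


@dataclass
class DifferentialEnvelope:
    patient_ref: str            # opaque reference, e.g. "Patient/<id>"
    model_id: str               # which model produced the output
    model_version: str          # exact version, for later audit
    inputs_digest: str          # hash of the chart fragments the model saw
    candidates: List[Candidate]
    reviewed_by: Optional[str] = None  # None means no human has reviewed this
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        if not self.candidates:
            raise ValueError("at least one candidate is required")
        for c in self.candidates:
            if not 0.0 <= c.probability <= 1.0:
                raise ValueError(f"probability out of range for {c.condition!r}")
        # A differential need not be exhaustive, so the probabilities are
        # not required to sum to 1. Rank most likely first.
        self.candidates.sort(key=lambda c: c.probability, reverse=True)


def digest_inputs(chart_fragments: List[str]) -> str:
    """SHA-256 over the inputs, so an auditor can check what the model saw."""
    h = hashlib.sha256()
    for fragment in chart_fragments:
        h.update(fragment.encode("utf-8"))
        h.update(b"\x00")  # separator so fragment boundaries affect the hash
    return h.hexdigest()


# Made-up chart text and model identifiers, for illustration only.
chart = ["HPI: acute dyspnea, pleuritic chest pain", "Vitals: HR 118, SpO2 91%"]
envelope = DifferentialEnvelope(
    patient_ref="Patient/example-123",
    model_id="hypothetical-clinical-model",
    model_version="0.0.1",
    inputs_digest=digest_inputs(chart),
    candidates=[Candidate("anxiety", 0.15), Candidate("pulmonary embolism", 0.62)],
)
payload = json.dumps(asdict(envelope), indent=2)
print(payload)
```

The design choice worth noting is that provenance (model identity, version, input digest, review status) travels with the clinical content rather than living in a separate log, so the record a hospital stores is also the record an auditor or a court would examine.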

06

Who is going to build it?

Google, OpenAI, and Anthropic are in the business of building foundation models and selling API access. Their incentive is to make models more capable, not to spend three years navigating Epic integration certifications and hospital procurement cycles. Google Health has tried the vertical approach twice and retreated twice. The labs will build the engine. They will not build the road.

Health systems are operationally excellent at delivering care and structurally incapable of building software. The average hospital IT department is understaffed, underfunded, and consumed by compliance mandates. The few systems that have tried to build in-house AI capabilities - Mayo, Cleveland Clinic, Mass General Brigham - have produced research publications, not production deployments. Academic medical centers can validate AI. They cannot ship it.

Epic and Oracle Health are incumbents with incumbent incentives. Their business model depends on being the system of record, not on enabling autonomous agents that might eventually replace the workflows their software monetizes. Epic has built a generative AI layer - they call it cognitive computing - but it is designed to augment their existing product, not to serve as an open platform for third-party clinical AI. Incumbents optimize for retention, not disruption.

The harness has to be purpose-built. It requires a company native to both stacks - AI and clinical - willing to do the unglamorous work: building integrations, writing liability frameworks, engaging regulators, earning trust from hospital administrators one deployment at a time. The deployment layer is the whole product.

That is Superposition. We build the infrastructure that lets models reach patients. The harness. The road. The last mile. Every piece of what we ship - every clinical integration, every compliance engine, every regulatory mapping - is a piece of the deployment layer that has to exist before autonomous medical AI can move from research papers to hospital floors.

The model is ready. The hospitals are not. The distance between those two facts is the industry we are building.