
Accelerating Radiology AI Triage with Shared GPU Resources
A healthcare case study on improving imaging turnaround time while keeping GPU costs predictable.
"Urgent cases waited 2–3 minutes for AI—and we had no idea where the GPU spend went"
A hospital group processes over 1.2 million imaging studies annually. The AI triage system flags urgent CT and X-ray cases to reduce clinician workload and speed turnaround. But triage latency was unstable, cold starts hit urgent cases hardest, and quarterly GPU spending swung wildly—making budgeting and clinical planning difficult.
Three Core Pain Points: Unstable Throughput, Cold Starts, and Budget Volatility
Pain Point 1: Unstable Throughput During Morning Peaks
- Morning rush reality: Baseline triage P95 latency was 2.5–3.2 minutes, and during morning peaks it often exceeded 3.5 minutes. Urgent cases suffered most.
- Root cause: GPUs were spread across sites with no pooling or priority; morning peaks overwhelmed local capacity while other sites had idle GPUs.
- Quantified impact: Urgent case turnaround 45–55 minutes end-to-end; clinicians complained that "AI triage feels slower than manual when it matters most."
Pain Point 2: Model Cold Starts Delaying Urgent Cases
- Cold-start penalty: 2–3 minute delays for urgent cases when models were cold—exactly when speed mattered most.
- No warm-cache strategy: Each site ran models independently; no preloading or memory tiering for high-priority studies.
- Compliance constraint: Data had to stay within jurisdiction—so any solution had to improve utilization without moving imaging data across regions.
Pain Point 3: GPU Spending Volatility in Quarterly Budgeting
- Quarterly variance ±25%: Finance couldn't predict GPU spend; surprises led to caps and delayed expansions.
- No chargeback by department: Radiology, ER, and outpatient couldn't see who drove spend, so optimization was guesswork.
Baseline metrics (before TensorFusion):
| Metric | Baseline |
|---|---|
| Triage P95 latency | 2.5–3.2 min |
| GPU utilization | 24–30% |
| Urgent case turnaround | 45–55 min |
| GPU cost variance | ±25% / quarter |
How TensorFusion Solves These Pain Points
TensorFusion provides GPU pooling with strict data locality, warm-cache model shards, priority preemption for emergency scans, and chargeback by department—so throughput is stable, urgent cases are fast, and budgets are predictable while staying compliance-safe.
Why Pain 1 (Unstable Throughput) Is Solved
- GPU pooling across hospitals with strict data locality—compute can be shared where policy allows, data stays in jurisdiction. Morning peaks are served by pooled capacity, not single-site headroom.
- Priority preemption for emergency scans ensures urgent studies get GPU headroom first; routine studies absorb remaining capacity.
- Kubernetes-native scheduling ties scaling to queue pressure and SLO thresholds, so capacity aligns to actual demand.
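The preemption behavior described above can be sketched as a toy scheduler. This is a minimal illustration, not TensorFusion's actual API: the class and method names (`PriorityPool`, `submit`) are hypothetical, and a real Kubernetes-native implementation would use priority classes and eviction rather than an in-process heap.

```python
import heapq
from dataclasses import dataclass
from itertools import count

URGENT, ROUTINE = 0, 1  # lower value = higher priority


@dataclass
class Job:
    study_id: str
    priority: int


class PriorityPool:
    """Toy GPU pool: urgent studies preempt routine ones when the pool is full."""

    def __init__(self, num_gpus):
        self.free = num_gpus
        self.running = []       # heap ordered so the least-urgent job pops first
        self._seq = count()     # tie-breaker for equal priorities

    def submit(self, job):
        if self.free > 0:
            self.free -= 1
            heapq.heappush(self.running, (-job.priority, next(self._seq), job))
            return f"run {job.study_id}"
        # Pool full: preempt the least-urgent running job if ours outranks it.
        neg_prio, _, _ = self.running[0]
        if job.priority < -neg_prio:
            _, _, evicted = heapq.heappop(self.running)
            heapq.heappush(self.running, (-job.priority, next(self._seq), job))
            return f"preempt {evicted.study_id}, run {job.study_id}"
        return f"queue {job.study_id}"


pool = PriorityPool(num_gpus=1)
print(pool.submit(Job("ct-001", ROUTINE)))  # run ct-001
print(pool.submit(Job("ct-002", URGENT)))   # preempt ct-001, run ct-002
```

The point of the sketch: routine studies only ever absorb leftover capacity, so an urgent scan never waits behind a routine one.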
Why Pain 2 (Cold Starts) Is Solved
- Warm-cache model shards for high-volume modalities—triage models stay warm ahead of shift start times or by department schedule, eliminating 2–3 minute cold-start delays for urgent cases.
- Memory tiering keeps critical models in hot/warm tiers; cold tier reclaims idle capacity without hurting latency-sensitive triage.
- GPU virtualization and slicing let one physical GPU serve multiple light inference streams, so more studies get "warm" capacity without overbuying.
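The warm-cache and memory-tiering idea can be illustrated with a small LRU-style model cache. This is a conceptual sketch under assumed semantics (class name `TieredModelCache` and the hot/warm/cold labels are hypothetical), not TensorFusion's real cache implementation.

```python
from collections import OrderedDict


class TieredModelCache:
    """Toy hot/warm model cache: hot models serve instantly, warm models
    promote cheaply, and anything else pays a full cold start."""

    def __init__(self, hot_slots, warm_slots):
        self.hot = OrderedDict()    # models resident in GPU memory
        self.warm = OrderedDict()   # models staged in host memory
        self.hot_slots, self.warm_slots = hot_slots, warm_slots

    def preload(self, model):
        """Pin a high-priority model into the hot tier (e.g. before a shift)."""
        self._promote(model)

    def request(self, model):
        if model in self.hot:
            self.hot.move_to_end(model)
            return "hot"            # no load latency
        if model in self.warm:
            self._promote(model)
            return "warm"           # fast promotion, no full reload
        self._promote(model)
        return "cold"               # the 2-3 minute load in the baseline system

    def _promote(self, model):
        self.warm.pop(model, None)
        self.hot[model] = True
        self.hot.move_to_end(model)
        if len(self.hot) > self.hot_slots:
            demoted, _ = self.hot.popitem(last=False)   # LRU hot -> warm
            self.warm[demoted] = True
            if len(self.warm) > self.warm_slots:
                self.warm.popitem(last=False)           # evict to cold


cache = TieredModelCache(hot_slots=1, warm_slots=2)
cache.preload("ct-triage")
print(cache.request("ct-triage"))    # hot
print(cache.request("xray-triage"))  # cold; demotes ct-triage to warm
print(cache.request("ct-triage"))    # warm
```

Preloading by schedule means the high-priority triage model is already in the hot tier when the morning rush begins, so urgent studies never hit the cold path.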
Why Pain 3 (Budget Volatility) Is Solved
- Chargeback by department (radiology, ER, outpatient) gives finance and department heads clear attribution—spend visibility drives right-sizing and planning.
- Predictable utilization and pooling reduce idle spend; cost variance in this deployment dropped from ±25% to ±8% per quarter.
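The chargeback mechanism above amounts to proportional cost attribution. A minimal sketch, assuming usage is metered in GPU-seconds per department (the function name `chargeback` and the record format are hypothetical, not TensorFusion's billing API):

```python
from collections import defaultdict


def chargeback(usage_records, total_gpu_cost):
    """Attribute a shared GPU bill to departments by GPU-seconds consumed.

    usage_records: iterable of (department, gpu_seconds) pairs.
    Returns {department: cost share}.
    """
    per_dept = defaultdict(float)
    for dept, gpu_seconds in usage_records:
        per_dept[dept] += gpu_seconds
    total = sum(per_dept.values()) or 1.0  # avoid division by zero
    return {dept: round(total_gpu_cost * secs / total, 2)
            for dept, secs in per_dept.items()}


records = [("radiology", 6000), ("er", 3000), ("outpatient", 1000)]
print(chargeback(records, total_gpu_cost=10_000))
# {'radiology': 6000.0, 'er': 3000.0, 'outpatient': 1000.0}
```

With attribution in place, each department sees its own share of spend, which is what turns optimization from guesswork into targeted right-sizing.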
Results: Before vs After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Triage P95 latency | 3.0 min | 45 sec | ~75% reduction |
| GPU utilization | 27% | 66% | ~2.4× |
| Urgent case turnaround | 50 min | 22 min | ~56% faster |
| GPU cost variance | ±25% | ±8% | ~68% lower variance |

| Before TensorFusion | After TensorFusion |
|---|---|
| Urgent cases waited 2–3 min for cold models | Warm-cache + priority; triage P95 45 sec |
| Morning peaks caused 3+ min triage latency | Pooling + priority preemption; stable <1 min |
| Quarterly GPU spend swung ±25%; no attribution | Chargeback by department; variance ±8% |
"We cut urgent triage time in half and gained budget predictability. That mattered more than raw speed." — Radiology Operations Lead
Why TensorFusion Fits Healthcare
Healthcare workloads are time-critical and compliance-heavy. TensorFusion preserves data locality (data stays in jurisdiction; only compute is pooled where policy allows) while maximizing compute efficiency through GPU pooling, warm cache, and priority scheduling. True GPU virtualization (memory isolation, oversubscription) and Kubernetes-native integration make it possible to improve throughput, eliminate cold-start delays for urgent cases, and keep quarterly spend predictable—without moving data or compromising auditability.