2 posts tagged with "Inference"

High-volume inference on a three-vendor sovereign cluster

May 29, 2026 · 7 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Praveen Jayachandran

Senior Technical Staff Member, IBM

Jaikrishnan Hari

Research Partnerships & BD Executive, IBM

Varun Raste

Solution Architect, IBM

Prasad Mukhedkar

Associate Principal AI Architect, Red Hat

Vinod Pathangay

Chief Architect, Field CTO Organization, Red Hat

Jayanth Babu Reddy

Principal Architect, NxtGen Cloud Technologies

Abhisyant Anasapurapu

VP, NxtGen Cloud Technologies

Most production inference clusters today are single-vendor, not because that is the optimal design, but because it is the simplest way to configure a cluster.

That is starting to change. Procurement cycles bring new accelerator generations alongside older ones, supply constraints push teams across vendors, and the cost gap between accelerators makes a one-size-fits-all fleet increasingly expensive to defend. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This heterogeneity is an opportunity to unlock real value: lower-cost accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths, stranded capacity gets reclaimed, and the organization is no longer held hostage to one supplier's roadmap or pricing. The case is stronger still for sovereign and on-premises deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference are pushing AI workloads off centralized hyperscaler stacks.

But making it work in practice is hard. Divergent driver stacks, firmware versions, container images, hardware-specific attention kernels, and the absence of standardized performance comparisons across accelerators all combine to make a coherent serving layer over a heterogeneous fleet a non-trivial systems problem.

Predicted-Latency Based Scheduling for LLMs

March 13, 2026 · 28 min read

Kaushik Mitra

Software Engineer, Google

Benjamin Braun

Software Engineer, Google

Abdullah Gharaibeh

Senior Staff Software Engineer, Google

Clayton Coleman

Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
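
To make the idea concrete, here is a minimal sketch of predicted-latency routing. It is not the system the post describes: the excerpt above does not specify the model or its inputs, so this stand-in assumes one small linear model per server, updated by stochastic gradient descent on each observed latency. The class `OnlineLatencyModel`, the helpers `features` and `pick_server`, and the two input features (prompt length and queue depth) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineLatencyModel:
    """Hypothetical per-server model: latency ~ w . x + b, trained online by SGD."""
    n_features: int
    lr: float = 0.01
    weights: list = field(default_factory=list)
    bias: float = 0.0

    def __post_init__(self):
        # Start from an all-zero model if no weights were supplied.
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def predict(self, x):
        """Predicted latency for feature vector x."""
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, observed_latency):
        """One SGD step on squared error, using a latency observed from live traffic."""
        err = self.predict(x) - observed_latency
        self.weights = [w - self.lr * err * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.lr * err


def features(prompt_tokens, queue_depth):
    # Assumed features, scaled so both are of order 1; a real system
    # would likely use richer signals than these two.
    return [prompt_tokens / 1000.0, queue_depth / 10.0]


def pick_server(models, prompt_tokens, queue_depths):
    """Return the name of the server with the lowest predicted latency."""
    return min(
        models,
        key=lambda name: models[name].predict(
            features(prompt_tokens, queue_depths[name])
        ),
    )
```

In this sketch the router calls `pick_server` before dispatch and calls `update` on the chosen server's model once the request completes, so the predictions track live traffic without any hand-tuned weights. A linear model is only a placeholder here; the same routing loop works with any predictor that exposes `predict` and `update`.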