2 posts tagged with "KV Cache"

High-volume inference on a three-vendor sovereign cluster

May 29, 2026 · 7 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Praveen Jayachandran

Senior Technical Staff Member, IBM

Jaikrishnan Hari

Research Partnerships & BD Executive, IBM

Varun Raste

Solution Architect, IBM

Prasad Mukhedkar

Associate Principal AI Architect, Red Hat

Vinod Pathangay

Chief Architect, Field CTO Organization, Red Hat

Jayanth Babu Reddy

Principal Architect, NxtGen Cloud Technologies

Abhisyant Anasapurapu

VP, NxtGen Cloud Technologies

Most production inference clusters today are single-vendor — not because it's an optimal design, but because it's the simplest way to configure a cluster.

That is starting to change. Procurement cycles bring new accelerator generations alongside older ones, supply constraints push teams across vendors, and the cost gap between accelerators makes a one-size-fits-all fleet increasingly expensive to defend. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This is an opportunity to unlock real value: lower-cost accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths, stranded capacity gets reclaimed, and the organization is no longer held hostage to one supplier's roadmap or pricing. The case is stronger still for sovereign and on-premise deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference are pushing AI workloads off centralized hyperscaler stacks.

But making it work in practice is hard. Divergent driver stacks, firmware versions, container images, and hardware-specific attention kernels, together with the absence of standardized performance comparisons across accelerators, make building a coherent serving layer over a heterogeneous fleet a non-trivial systems problem.

Native KV Cache Offloading to Any Filesystem with llm-d

February 10, 2026 · 11 min read

Kfir Toledo

Research Staff Member, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Effi Ofer

Research Staff Member, IBM

Or Ozeri

Research Staff Member, IBM

Guy Margalit

Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. In a distributed environment, however, a request cannot hit cache entries held on another node, because the KV cache is local to each vLLM instance. This local cache is also limited in size, which further limits KV data reuse. This blog presents a new way to offload the KV cache to storage that tackles both challenges: KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage, building on vLLM's native Offloading Connector. While the llm-d FS backend can speed up the serving of single requests (improving time to first token, TTFT), its main goal is to preserve stable throughput and low latency at scale as concurrency and context lengths grow. It does so by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.

While a number of existing solutions offload the KV cache to storage (e.g. LMCache or Dynamo KVBM), the new connector is simple, runs with llm-d and vLLM as its only dependencies, and exhibits improved performance over state-of-the-art shared storage connectors.
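To make the idea of sharing KV blocks through a filesystem concrete, here is a minimal Python sketch. It is not the llm-d FS backend or vLLM's Offloading Connector API. The names `block_keys`, `FileKVStore`, and `BLOCK_SIZE`, the chained-hash key scheme, and the on-disk layout are all assumptions made for illustration. It shows only the general pattern: blocks are keyed by a hash of the token prefix they cover, written atomically to a shared directory, and looked up by any replica that mounts the same path.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence

BLOCK_SIZE = 4  # tokens per KV block; hypothetical and tiny, for illustration only


def block_keys(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block, so equal prefixes yield equal keys."""
    keys: List[str] = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()  # each key depends on every earlier block
    return keys


class FileKVStore:
    """Hypothetical block store backed by a directory on a shared filesystem."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Fan out by the first two hex characters to keep directories small.
        return self.root / key[:2] / f"{key}.kv"

    def put(self, key: str, payload: bytes) -> None:
        path = self._path(key)
        if path.exists():
            return  # another replica already stored this block
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
            os.replace(tmp, path)  # atomic rename: readers never see partial files
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

    def get(self, key: str) -> Optional[bytes]:
        try:
            return self._path(key).read_bytes()
        except FileNotFoundError:
            return None

    def cached_prefix_blocks(self, keys: Sequence[str]) -> int:
        """Count how many leading blocks of a request are already stored."""
        hits = 0
        for key in keys:
            if not self._path(key).exists():
                break
            hits += 1
        return hits
```

In this sketch, two `FileKVStore` objects pointed at the same directory stand in for two replicas: a block written by one is visible to the other, and `cached_prefix_blocks` tells a replica how much of a new prompt's prefix it can load instead of recomputing. A real connector would also have to handle tensor serialization, eviction, and asynchronous I/O, none of which is modeled here.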