2 posts tagged with "SIG-Benchmarking"

High-volume inference on a three-vendor sovereign cluster

May 29, 2026 · 7 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Praveen Jayachandran

Senior Technical Staff Member, IBM

Jaikrishnan Hari

Research Partnerships & BD Executive, IBM

Varun Raste

Solution Architect, IBM

Prasad Mukhedkar

Associate Principal AI Architect, Red Hat

Vinod Pathangay

Chief Architect, Field CTO Organization, Red Hat

Jayanth Babu Reddy

Principal Architect, NxtGen Cloud Technologies

Abhisyant Anasapurapu

VP, NxtGen Cloud Technologies

Most production inference clusters today are single-vendor — not because it's an optimal design, but because it's the simplest way to configure a cluster.

That is starting to change. Procurement cycles bring new generations alongside older ones, supply constraints push teams across vendors, and the cost gap between accelerators makes a one-size-fits-all fleet increasingly expensive to defend. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This is an opportunity to unlock real value: lower-cost accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths, stranded capacity gets reclaimed, and the organization is no longer held hostage to one supplier's roadmap or pricing. The case is stronger still for sovereign and on-premise deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference are pushing AI workloads off centralized hyperscaler stacks.

But making it work in practice is hard. Divergent driver stacks, firmware versions, container images, hardware-specific attention kernels, and the absence of standardized performance comparisons across accelerators all combine to make a coherent serving layer over a heterogeneous fleet a non-trivial systems problem.

llm-d Community Update - June 2025

June 25, 2025 · 4 min read

Pete Cheslock

AI Community Architect, Red Hat

Hey everyone! We've been making great progress with the llm-d project, and I wanted to share some important updates and opportunities to get involved.

Help Shape the Future of the llm-d Project

To guide the future development of the llm-d project, we need to understand the real-world challenges, configurations, and performance needs of our community. We've created a short survey to gather insight into how you serve Large Language Models, from the hardware you use to the features you need most.

This anonymous, vendor-agnostic survey will take approximately 5 minutes to complete. Your input will directly influence the project's roadmap and priorities. The aggregated results will be shared with the llm-d-contributors mailing list to benefit the entire community.

Your Input Will Define Our Roadmap

We've created an llm-d Community Roadmap Survey to gather information about your LLM workloads. We are looking to learn more about:

Your Serving Environment: This includes the hardware you use now and anticipate using in a year (like NVIDIA GPUs, AMD GPUs, or CPUs), and whether you run on-premise, in the cloud, or on edge devices.
Your Model Strategy: Whether you serve a few large models or many smaller ones, which model families (like Llama or Mistral) are most common, and how you use techniques like LoRA adapters.
Your Performance Requirements: Your real-world SLOs for latency and throughput, and the biggest LLM serving challenges you face, from cost optimization to operational ease of use.
Your Future Needs: What single new feature you would prioritize for an LLM Model-as-a-Service to help guide our innovation.

Take the 5-Minute Survey

Your participation is invaluable. Please take a few minutes to complete the survey. We encourage you to share it with other users, or to represent their needs in your own response, so that our direction reflects the community's diverse requirements.
