vLLM Workshop

Accelerators for Agentic AI with vLLM

Date: May 5 · Time: 11:00 - 11:40 · Location: Founders Cafe
Agentic AI is reshaping inference architecture. Reasoning models, long-context workflows, and multi-step agent systems put new pressure on serving stacks, making accelerator strategy more consequential than ever. The challenge is no longer simply to serve a model fast; it is to match the right serving architecture and the right accelerator profile to the right workload.

That is why broad accelerator support matters in vLLM. vLLM is built for high-throughput, memory-efficient inference and serving, with an architecture designed to deploy open models across diverse hardware environments rather than through a single path. AI accelerators span familiar categories such as GPUs, NPUs, and ASIC-based designs, while CPU support remains relevant for specific deployment profiles: smaller models, edge environments, and cost-sensitive workloads.

The session explores what that shift means in practice for teams building inference platforms. It looks at why agentic workloads raise the stakes of accelerator strategy, why broad hardware enablement matters in the vLLM ecosystem, and how platform teams can weigh infrastructure choices through the lens of workload shape, prompt and context behavior, concurrency, decode latency, memory pressure, scaling patterns, and operational fit. vLLM is a strong foundation for that discussion: the project emphasizes broad hardware support, state-of-the-art performance, production readiness, and an extensible architecture, and its serving stack includes high-throughput serving, distributed inference, and both online and offline inference modes.
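
As a concrete illustration of those two modes, here is a minimal sketch of vLLM's offline batch API (`LLM.generate`); the model name and prompts are placeholders chosen for the example, and the online equivalent would be the OpenAI-compatible server started with `vllm serve`.

```python
# Minimal sketch of vLLM's offline inference mode.
# Assumptions: vLLM is installed and the placeholder model fits on the
# available accelerator; swap in any model your hardware supports.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the tradeoffs of long-context agent workloads:",
    "List three factors in choosing an inference accelerator:",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# The same engine powers both modes; for online serving you would instead
# run `vllm serve <model>` and call its OpenAI-compatible HTTP endpoint.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model name
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```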

The session offers a practical framework that ties accelerator decisions back to workload requirements and deployment goals, giving attendees a clearer way to reason about modern inference architecture while showing how vLLM keeps the serving layer open and adaptable as workloads evolve.

This talk is aimed at a technical audience building or operating inference platforms and looking for a practical way to align accelerator strategy with workload and deployment requirements.