Building, Testing, and Contributing to vLLM: A Developer's Guide
Large Language Models (LLMs) have revolutionized the AI landscape, and vLLM has emerged as a leading inference engine that dramatically accelerates LLM serving through innovations like PagedAttention. But how do you actually build, test, and contribute to this rapidly evolving project?
In this talk, we'll take you through vLLM's architecture and explore the practical aspects of working with this complex Python/C++ codebase. We'll start with an overview of vLLM's core optimizations, including PagedAttention, then dive into the build process for different targets as well as third-party hardware plugins such as Google TPU, AWS Neuron, Intel Gaudi, and more.
You'll learn about testing strategies such as performance benchmarking with GuideLLM and model evaluation with lm-evaluation-harness. We'll also cover best practices for contributing to the vLLM community, and how Red Hat AI Inference Server (RHAIIS) provides a trustworthy, validated platform for running LLM workloads across diverse hardware environments.