Agentic AI on Edge

KTransformers: Full-Precision Inference for 600B+ MoE Models on Consumer Hardware

Date: May 6
Time: 10:20 - 10:45
Location: Central Room

KTransformers is an open-source CPU-GPU heterogeneous inference framework that
runs frontier MoE models such as DeepSeek-V3 and Qwen3.5-397B at FP8 precision
on consumer GPUs. It offloads expert computations to the CPU and coordinates
them with the GPU in a way that can be captured in CUDA Graphs, reaching a
decode speed of 35+ tokens/sec. This makes 600B+ parameter models accessible
without datacenter infrastructure.

Speakers

Ervin Xie, Ph.D. Candidate, Tsinghua University