KTransformers: Full-Precision Inference for 600B+ MoE Models on Consumer Hardware
Date: May 6 | Time: 10:20 - 10:45 | Location: Central Room
KTransformers is an open-source CPU-GPU heterogeneous inference framework that
runs frontier MoE models like DeepSeek-V3 and Qwen3.5-397B at FP8 precision
on consumer GPUs. It offloads expert computation to the CPU and coordinates
CPU and GPU work in a way that CUDA Graphs can capture, reaching decode speeds
of 35+ tokens/sec and making 600B+ models accessible without datacenter
infrastructure.
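
A rough back-of-envelope sketch of why offloading matters at this scale. It uses DeepSeek-V3's published parameter counts (671B total, about 37B activated per token); the 24 GB VRAM figure is an assumption standing in for a high-end consumer GPU, and the calculation ignores KV cache, activations, and runtime overhead. It does not describe how KTransformers actually partitions the model.

```python
# Back-of-envelope weight footprint at FP8 (1 byte per parameter).
# Illustrative only: ignores KV cache, activations, and framework overhead.

BYTES_PER_PARAM_FP8 = 1


def weight_gb(params_billion, bytes_per_param=BYTES_PER_PARAM_FP8):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


total_params_b = 671   # DeepSeek-V3 total parameters (published figure)
active_params_b = 37   # parameters activated per token (published figure)
gpu_vram_gb = 24       # ASSUMPTION: VRAM of a high-end consumer GPU

total_gb = weight_gb(total_params_b)
active_gb = weight_gb(active_params_b)
fits_on_gpu = total_gb <= gpu_vram_gb

print(f"All weights at FP8:    ~{total_gb:.0f} GB")
print(f"Activated per token:   ~{active_gb:.0f} GB")
print(f"Fits in {gpu_vram_gb} GB of VRAM: {fits_on_gpu}")
```

Under these assumptions the full weight set is far larger than consumer VRAM, while only a small fraction of it is touched for any given token. That gap is what makes it plausible to keep the large, sparsely used expert weights in host memory.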