Own Your Data Workshop

Synthetic Data for the Commons: Building Open SOTA LLMs with Synthetic Environments

Date: May 5 · Time: 11:00 - 11:20 · Location: Open Stage
Training state-of-the-art language models typically demands vast proprietary datasets and closed pipelines. At Pleias, we take a different path: building open, high-performing LLMs using synthetic data environments designed for the commons. This talk presents our approach to constructing synthetic data pipelines that generate diverse, high-quality training corpora without relying on proprietary sources. We cover the technical architecture behind our synthetic environments, the training strategies that enable competitive performance on standard benchmarks, and why we believe open synthetic data is a critical piece of the puzzle for democratizing access to frontier AI capabilities. We share lessons learned, benchmark results, and a roadmap for community-driven improvements.