Own Your Data Science and AI Workshop

Synthetic Data for the Commons: Building Open SOTA LLMs with Synthetic Environments

Date: May 5
Time: 11:00 - 11:20
Location: Open Stage

Training state-of-the-art language models typically demands vast proprietary datasets and closed pipelines. At Pleias, we take a different path: building open, high-performing LLMs using synthetic data environments designed for the commons. This talk presents our approach to constructing synthetic data pipelines that generate diverse, high-quality training corpora without relying on proprietary sources. We cover the technical architecture behind our synthetic environments, the training strategies that enable competitive performance on standard benchmarks, and why we believe open synthetic data is critical to democratizing access to frontier AI capabilities. We also share lessons learned, benchmark results, and a roadmap for community-driven improvements.

Speakers

Anastasia Stasenko, Pleias

Pierre-Carl Langlais, Pleias