GOSIM Paris 2026 Has Concluded
Thank you to all attendees, speakers, and sponsors for an incredible event!
Own Your Data Science and AI Workshop

Evaluating Frontier Agents on Economically Valuable Tasks

Date: May 5 | Time: 11:40 - 12:00 | Location: Open Stage
AI agents are clearly becoming more capable, but what does it take to make them perform well on the real, day-to-day tasks that make up people's work? In this presentation I'll discuss how to model complex environments in order to evaluate, and improve, agent reliability and performance. I'll focus on the viability of production deployments for real-world tasks, on the technical side of building and running evals for these cases with Harbor (https://github.com/laude-institute/harbor), and on how we use these techniques at ellamind to build agents that are reliable and provably safe.