Own Your Data Science and AI Workshop

Evaluating Frontier Agents on Economically Valuable Tasks

Date: May 5 | Time: 15:20 - 15:40 | Location: Open Stage
AI agents are clearly becoming more capable, but what does it take to make them perform well on the real, day-to-day tasks that make up people's work? In this presentation I'll discuss how to model complex environments in order to evaluate, and improve, agent reliability and performance. I'll focus on the viability of production deployments for real-world tasks, on the technical side of building and running evals for these cases using Harbor (https://github.com/laude-institute/harbor), and on how we are using these techniques at ellamind to build agents that are reliable and provably safe.