Own Your Data Science and AI Workshop

Evaluating Frontier Agents on Economically Valuable Tasks

Date: May 5
Time: 11:40 - 12:00
Location: Open Stage

AI agents are becoming more and more capable, but what does it take to make them perform well on the real, day-to-day tasks that make up people's work? In this presentation I'll discuss how to model complex environments to evaluate, and improve, agent reliability and performance. I'll focus on the viability of production deployments for real-world tasks, on the technical side of building and running evals for these cases using Harbor (https://github.com/laude-institute/harbor), and on how we are using these techniques at ellamind to build agents that are reliable and provably safe.

Speakers

Björn Plüster, ellamind

Benedikt Droste, ellamind