Own Your Data Workshop

Evaluating Frontier Agents on Economically Valuable Tasks

Date: May 5 · Time: 15:20 - 15:40 · Location: Open Stage
AI agents are clearly becoming more capable, but what does it take to make them perform well on the real, day-to-day tasks that make up people's work? In this presentation I'll discuss how to model complex environments in order to evaluate, and improve, agent reliability and performance. I'll focus on the viability of production deployments for real-world tasks, on the technical side of building and running evals in these cases using Harbor (https://github.com/laude-institute/harbor), and on how we are using these techniques at ellamind to build agents that are reliable and provably safe.