Bridging AI Evaluation and Real-World Deployment: Place-Based Evaluation Testbeds for High-Trust Public Domains
Date: 5 May · Time: 15:15 – 15:35 · Location: Central Room
As AI systems evolve from standalone models toward tool-using, workflow-oriented, and increasingly agentic deployments, evaluation must move beyond benchmark comparison and static task scores. This presentation proposes place-based evaluation testbeds, designed through a governance-first approach, to bridge AI evaluation and real-world deployment in high-trust public domains such as culture, education, and water-linked socio-ecological contexts.
The core argument is that benchmark performance alone does not establish whether a system is ready for institution-grade deployment. In real-world settings, system quality depends not only on answer accuracy but also on traceability, robustness, auditability, documentation readiness, and safe behavior under contextual constraints. To address this gap, the talk introduces a testbed architecture built on a governed knowledge layer rather than unconstrained generation.
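The readiness criteria above can be sketched as a multi-dimensional scorecard. This is a minimal illustration, not part of the presented architecture: the dimension names and the 0.8 threshold are hypothetical, chosen only to show that no single dimension (such as accuracy) can compensate for the others.

```python
# Hypothetical deployment-readiness scorecard: every dimension is reported
# separately rather than collapsed into one benchmark-style number.
DIMENSIONS = (
    "accuracy",
    "traceability",
    "robustness",
    "auditability",
    "documentation",
    "contextual_safety",
)

def readiness_report(scores: dict) -> dict:
    """Return a per-dimension report; flag 'ready' only if all dimensions pass."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    # The 0.8 threshold is illustrative; a high accuracy score cannot
    # offset a failing auditability or safety score.
    ready = all(scores[d] >= 0.8 for d in DIMENSIONS)
    return {"ready": ready, "scores": dict(scores)}
```

For example, a system scoring 0.95 on accuracy but 0.4 on auditability would be reported as not deployment-ready, which is the point of evaluating beyond benchmark comparison.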
The proposed pipeline is: field encounter → co-curation → governed knowledge layer → AI companion → supervised feedback loop. This creates a controlled retrieval environment in which outputs can be assessed against curated sources, provenance logic, access conditions, and correction pathways. It therefore supports evaluation not only of response quality, but also of deployment-relevant properties under realistic public-facing conditions.
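The governed-knowledge-layer stage of this pipeline can be sketched in code. Everything below is an illustrative assumption, not the project's implementation: the class names, the string-match retrieval, and the "public"/"restricted" access labels are placeholders showing how curated sources, provenance tags, access conditions, and a correction log could constrain what an AI companion is allowed to answer from.

```python
from dataclasses import dataclass, field

@dataclass
class CuratedSource:
    source_id: str
    text: str
    provenance: str   # who curated the entry and when
    access: str       # access condition, e.g. "public" or "restricted"

@dataclass
class GovernedKnowledgeLayer:
    sources: list
    corrections: list = field(default_factory=list)

    def retrieve(self, query: str, requester_access: str = "public") -> list:
        """Controlled retrieval: only access-matched, curated sources are used."""
        hits = [
            s for s in self.sources
            if s.access == requester_access and query.lower() in s.text.lower()
        ]
        # Every returned answer carries its provenance, keeping outputs auditable.
        return [
            {"text": s.text, "source_id": s.source_id, "provenance": s.provenance}
            for s in hits
        ]

    def report_correction(self, source_id: str, note: str) -> None:
        """Supervised feedback loop: corrections are logged for review,
        not silently applied to the knowledge layer."""
        self.corrections.append({"source_id": source_id, "note": note})

layer = GovernedKnowledgeLayer(sources=[
    CuratedSource("tihany-01", "The Tihany Peninsula lies on Lake Balaton.",
                  provenance="curator A, 2025-04", access="public"),
    CuratedSource("tihany-02", "Internal site-management notes.",
                  provenance="curator B, 2025-05", access="restricted"),
])
answers = layer.retrieve("Tihany")
# A public requester retrieves only the public, provenance-tagged source;
# the restricted entry is excluded by its access condition.
```

The design choice this sketch illustrates is that answer scope, provenance, access, and correction pathways are properties of the knowledge layer itself, so they can be evaluated directly rather than inferred from model behavior.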
As an initial reference environment, the presentation uses Tihany (Hungary), a heritage- and water-linked landscape context, to illustrate how governed knowledge layers can support more controllable and accountable public-interest AI deployment. The water dimension also connects the testbed to Source2Sea: Bartók 3.0 Connectivity (2026–2031), a UN Ocean Decade Action exploring linked cultural and water systems.
The broader claim is that place-based evaluation testbeds can complement open evaluation initiatives by providing real-world environments in which model capability, system orchestration, and societal deployment readiness can be assessed together.