LN: Mialon et al. (2023) — GAIA: A Benchmark for General AI Assistants

Bibliographic Reference

Citation: Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. (2023). GAIA: A benchmark for general AI assistants. arXiv:2311.12983. ICLR 2024. https://arxiv.org/abs/2311.12983


Pass 1 — Bird’s Eye View (5 Cs)

Category: Benchmark proposal
Context: Complementary to AgentBench; focuses on real-world, multi-step tasks requiring reasoning and tool use
Correctness: Human expert validation; 466 questions requiring multi-step reasoning
Contributions: (1) GAIA: 466 real-world tasks in 3 difficulty levels; (2) GPT-4 equipped with plugins achieves 15% overall vs. 92% for human respondents; (3) Highlights the gap between benchmark-trained and truly capable agents
Clarity: Good.

Relevance: ⭐⭐⭐

Useful for contextualising PUMA’s benchmark contribution. GAIA evaluates general AI agents; PUMA evaluates domain-specific PM agents.


PUMA Connection

GAIA’s 3-level difficulty taxonomy (L1/L2/L3) could inspire a similar difficulty taxonomy for PUMA’s PM benchmark: Level 1 = simple triage; Level 2 = estimation with context; Level 3 = full sprint planning. Reference for benchmark design (Ch.2, state-of-the-art landscape).
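
The proposed taxonomy could be sketched as follows. This is a minimal illustration, not part of GAIA or of any existing PUMA code: the names `Level`, `TaskResult`, and `accuracy_by_level` are hypothetical, and per-level accuracy is assumed as the reporting metric because GAIA reports results per difficulty level.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical PUMA difficulty levels, mirroring GAIA's L1/L2/L3."""
    TRIAGE = 1            # Level 1: simple triage
    ESTIMATION = 2        # Level 2: estimation with context
    SPRINT_PLANNING = 3   # Level 3: full sprint planning


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (field names are assumptions)."""
    task_id: str
    level: Level
    correct: bool


def accuracy_by_level(results):
    """Return {level: fraction of correct tasks} for levels that have results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.level] += 1
        hits[r.level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

Reporting accuracy per level rather than as one aggregate keeps a strong Level 1 score from masking weak Level 3 performance, which is the kind of gap GAIA's own taxonomy is designed to expose.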

MOCs