LN: Mialon et al. (2023) — GAIA: A Benchmark for General AI Assistants

Bibliographic Reference

Citation: Mialon, G., Fourrier, C., Swift, C., Yang, J., LeCun, Y., & Wolf, T. (2023). GAIA: A benchmark for general AI assistants. arXiv:2311.12983. ICLR 2024. https://arxiv.org/abs/2311.12983

Pass 1 — Bird’s Eye View (5 Cs)

C	Assessment
Category	Benchmark proposal
Context	Complementary to AgentBench; focuses on real-world, multi-step tasks requiring reasoning + tool use
Correctness	Human expert validation. 466 questions requiring multi-step reasoning.
Contributions	(1) GAIA: 466 real-world tasks in 3 difficulty levels; (2) GPT-4 achieves 15% on hardest level vs. humans’ 92%; (3) Highlights gap between benchmark-trained and truly capable agents
Clarity	Good.

Relevance: ⭐⭐⭐

Useful for contextualising PUMA’s benchmark contribution. GAIA evaluates general AI agents; PUMA evaluates domain-specific PM agents.

PUMA Connection

GAIA’s 3-level difficulty taxonomy (L1/L2/L3) could inspire a similar difficulty taxonomy for PUMA’s PM benchmark: Level 1 = simple triage; Level 2 = estimation with context; Level 3 = full sprint planning. Reference for benchmark design (Ch.2, state-of-the-art landscape).

MOCs

MOC-LLM-Benchmarks-PM-AI

PUMA Vault

Explorador

GAIA: A Benchmark for General AI Assistants

LN: Mialon et al. (2023) — GAIA: A Benchmark for General AI Assistants

Pass 1 — Bird’s Eye View (5 Cs)

PUMA Connection

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces