LN: Liu et al. (2023) — AgentBench: Evaluating LLMs as Agents

Bibliographic Reference

Citation: Liu, X., Yu, H., Zhang, H., et al. (2023). AgentBench: Evaluating LLMs as agents. arXiv:2308.03688. https://arxiv.org/abs/2308.03688
Venue: ICLR 2024.


Pass 1 — Bird’s Eye View (5 Cs)

C | Assessment
Category | Benchmark paper + empirical evaluation
Context | First comprehensive benchmark for LLMs acting as agents; responds to the proliferation of agent systems with no common evaluation framework
Correctness | 8 environments, 25+ LLMs tested. Results are systematic and reproducible. Open-sourced on GitHub.
Contributions | (1) 8 distinct environments: OS, DB, KG, Digital Card Game, Lateral Thinking Puzzles, ALFWorld, WebShop, Mind2Web; (2) Standardized protocol for agent evaluation; (3) GPT-4 dramatically outperforms open-source models on real-world tasks; (4) Gap between chat and agent performance identified
Clarity | Excellent. Environment descriptions, scoring, and replication guide are clear.

Relevance: ⭐⭐⭐⭐⭐

PUMA uses Ollama-based open-source LLMs (Llama, Mistral). AgentBench’s finding that open-source models underperform GPT-4 on agentic tasks is a key threat to validity for PUMA.


Pass 2 — Content

The 8 Environments

Environment | Type | Task
OS | Code/Action | Bash scripting, file management
DB | Code/Action | SQL, database manipulation
KG | Reasoning | Knowledge graph QA
DCG (Digital Card Game) | Game | Turn-based strategy card game
LTP (Lateral Thinking Puzzles) | Game | Solving riddles through yes/no questions
ALFWorld | Embodied | Text-based household tasks
WebShop | Web | E-commerce purchasing
Mind2Web | Web | Website navigation

Key Findings

Critical finding for PUMA

Open-source LLMs (Llama, Vicuna) reach roughly 5–30% of GPT-4's score on most AgentBench tasks. This suggests that PUMA’s local LLMs (Llama 3, Mistral) will likely perform significantly worse than cloud models, a gap that must be addressed explicitly in the discussion chapter.

Model | Overall Score
GPT-4 | ~100 (normalized)
GPT-3.5 | ~52
Llama-2-70B | ~18
Vicuna-33B | ~14

Root Cause of Gap

  • Instruction following: open-source models frequently deviate from the expected output format
  • Long-context reasoning: degraded performance on tasks requiring multi-step reasoning over 4k+ tokens
  • Action grounding: models hallucinate actions not in the action space
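The first and third failure modes above can be told apart mechanically. The following is a minimal sketch, assuming a reply format with a line of the form "Action: <name>" and a made-up action space; neither the format nor the action names are taken from AgentBench itself.

```python
import re

# Hypothetical action space for illustration; not AgentBench's actual actions.
ACTION_SPACE = frozenset({"search", "click", "answer"})

# Assumed reply format: one line reading "Action: <name>".
ACTION_PATTERN = re.compile(r"^Action:\s*(\w+)\s*$", re.MULTILINE)


def classify_output(raw: str, action_space: frozenset = ACTION_SPACE) -> str:
    """Return 'ok', 'format_error', or 'invalid_action' for one model reply."""
    match = ACTION_PATTERN.search(raw)
    if match is None:
        # The reply deviates from the expected output format.
        return "format_error"
    if match.group(1).lower() not in action_space:
        # The format is fine, but the action is not in the action space.
        return "invalid_action"
    return "ok"
```

Counting the two error labels separately would show whether a model mostly fails at instruction following or at action grounding.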

Pass 3 — Virtual Reconstruction

Q1 (What does AgentBench tell PUMA?): PUMA Stages 1–2 are “tool-augmented classification” tasks, not full agentic tasks. The OS/DB environments are harder than PM triage. However, the format-following failures are directly relevant: PUMA must use structured output prompts and validate JSON responses.
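JSON response validation could look roughly like the sketch below. The field names ("type", "priority") and their allowed labels are assumptions made for illustration; PUMA's actual schema may differ.

```python
import json
from typing import Optional

# Hypothetical schema: field name -> allowed labels. Not PUMA's real schema.
ALLOWED_VALUES = {
    "type": {"bug", "feature", "task"},
    "priority": {"low", "medium", "high"},
}


def parse_triage(raw: str) -> Optional[dict]:
    """Return the parsed triage dict, or None if the reply is not valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. prose wrapped around the object
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    for field, allowed in ALLOWED_VALUES.items():
        value = data.get(field)
        # Require a string first, so unhashable values cannot break the lookup.
        if not isinstance(value, str) or value not in allowed:
            return None
    return data
```

Returning None instead of raising keeps the caller simple: a rejected reply can be retried or counted as a parsing failure.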

Q2 (How does this affect H1/H2?): AgentBench establishes that open-source LLMs have systematic format-following weaknesses. PUMA should implement a successful parsing rate metric alongside F1/MAE to track how often the LLM returns parseable output.
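One plausible definition of the metric is the number of parseable outputs divided by the total number of outputs. The sketch below assumes that "parseable" means "decodes to a JSON object"; the function names are hypothetical, and a stricter schema check can be passed in through is_valid.

```python
import json
from typing import Callable, Iterable


def is_json_object(raw: str) -> bool:
    """Default validity check (an assumption): the reply decodes to a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def successful_parsing_rate(
    outputs: Iterable[str],
    is_valid: Callable[[str], bool] = is_json_object,
) -> float:
    """Fraction of outputs accepted by is_valid; 0.0 for an empty batch."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(1 for raw in outputs if is_valid(raw)) / len(outputs)
```

Reporting this rate next to F1/MAE makes clear how many predictions were scored at all, since unparseable replies otherwise silently drop out or count as wrong.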

Q3 (Benchmarking PUMA): PUMA could contribute a PM-specific agent evaluation environment to the AgentBench ecosystem — issue triage as a structured action space (Type × Priority × Component classification).
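The Type × Priority × Component action space can be written down as a Cartesian product of three label sets. The labels below are placeholders chosen for illustration, not PUMA's actual taxonomy.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


# All three label sets are hypothetical examples.
class IssueType(Enum):
    BUG = "bug"
    FEATURE = "feature"
    TASK = "task"


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Component(Enum):
    FRONTEND = "frontend"
    BACKEND = "backend"
    DOCS = "docs"


@dataclass(frozen=True)
class TriageAction:
    issue_type: IssueType
    priority: Priority
    component: Component


# Every valid action is one combination of the three labels.
ACTION_SPACE = [
    TriageAction(t, p, c) for t, p, c in product(IssueType, Priority, Component)
]
```

With three labels per dimension this yields 27 actions; a closed, enumerable space like this makes it straightforward to detect actions that fall outside it.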


PUMA Integration

  • Threat to validity: AgentBench’s open-source capability gap must be discussed in Ch.5 Discussion
  • Metrics: Add “Successful Parsing Rate” metric to EX-Stages-Overview
  • Baseline choice: Justifies using GPT-4o as a cloud comparison baseline in PUMA Stage 1
  • Spec: SP-Triage-Agent-v1 — output format validation required

MOCs