LN: Jimenez et al. (2023) — SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Bibliographic Reference

Citation: Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can language models resolve real-world GitHub issues? arXiv:2310.06770. ICLR 2024. https://arxiv.org/abs/2310.06770


Pass 1 — Bird’s Eye View (5 Cs)

C               Assessment
Category        Benchmark proposal
Context         Evaluates LLMs on resolving real GitHub issues (code fixing)
Correctness     2,294 issues from 12 Python repositories. Patches are validated against each repository's test suite. Rigorous.
Contributions   (1) SWE-bench: 2,294 real GitHub issues requiring code patches; (2) The best model at publication (Claude 2) resolved only 1.96% of issues; (3) Much harder than typical coding benchmarks
Clarity         Excellent.
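
Sketch of the "test suite validation" idea, to make the resolved-rate metric concrete. As I understand the paper's setup, an issue counts as resolved only if, after applying the model's patch, the tests that the reference fix makes pass now pass, and the tests that already passed still pass. The class and function names below (TaskResult, is_resolved, resolved_rate) and the sample data are hypothetical illustrations, not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one issue after applying a model patch (hypothetical structure)."""
    issue_id: str
    fail_to_pass: dict  # test name -> passed? (tests the reference fix turns green)
    pass_to_pass: dict  # test name -> passed? (tests that must not regress)


def is_resolved(result: TaskResult) -> bool:
    # Assumption: an issue with no target tests cannot count as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list) -> float:
    """Percentage of issues whose patch passes every required test."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up sample: only the first patch fixes the issue without regressions.
results = [
    TaskResult("issue-a", {"test_fix": True}, {"test_old": True}),
    TaskResult("issue-b", {"test_fix": False}, {"test_old": True}),
    TaskResult("issue-c", {"test_fix": True}, {"test_old": False}),
]
rate = resolved_rate(results)
```

For scale: 1.96% of 2,294 issues is roughly 45 resolved issues.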

Relevance: ⭐⭐⭐⭐

SWE-bench is to code generation what PUMA is to PM: a benchmark of LLM agents on real-world software engineering tasks with verified ground truth. Key reference for benchmark design methodology.


PUMA Connection

SWE-bench demonstrates that benchmarks built from real-world tasks, rather than toy problems, expose true LLM capability gaps. PUMA follows the same design principle: real Jira SR issues and real TAWOS user stories, not synthetic data. Reference for benchmark design justification (Ch.2 + Ch.3).

Connects to: LN-Arora-2024-MASAI

MOCs