
SWE-bench

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

AI-focused · Public · Neither
Specific Type
AI benchmarking
Dataset Type
Cross-sectional
Institution
Princeton University; Stanford University; UW
Institution Type
Academia
Level of Focus
Task capability
Most Granular Level
Individual GitHub issue level
Perspective
Neither
Time Coverage
2023-present
Frequency
Static benchmark with periodic updates
Sample Size
2,294 GitHub issues (full benchmark); 500 issues in the SWE-bench Verified subset
Geographic Detail
Global
Occupational Classification
Not specified
Industrial Classification
Software repositories
Other Classification
GitHub issue classification
Key Variables
Code generation accuracy; functional correctness; repository navigation; debugging capability
AI/Tech Tracking
Real-world software engineering task automation
Access Details
Available on GitHub and Hugging Face
Notes
Real GitHub issues drawn from 12 Python repositories; measures end-to-end, repository-level coding capability beyond isolated function completion
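The functional-correctness criterion noted above can be illustrated with a minimal sketch: in SWE-bench, a model's patch resolves an issue only if every test in the instance's FAIL_TO_PASS set now passes and no test in PASS_TO_PASS regresses. The field names below follow the published dataset schema; the repository and test names are hypothetical examples.

```python
# Sketch of SWE-bench's resolution check: all previously failing tests
# (FAIL_TO_PASS) must pass after the patch, and all previously passing
# tests (PASS_TO_PASS) must keep passing.

def is_resolved(fail_to_pass, pass_to_pass, passing_after_patch):
    """Return True if the patch fixes the issue without regressions."""
    required = set(fail_to_pass) | set(pass_to_pass)
    return required <= set(passing_after_patch)

# Hypothetical task instance (field names match the dataset schema):
instance = {
    "repo": "astropy/astropy",                    # one of the 12 source repos
    "FAIL_TO_PASS": ["test_issue_fix"],           # tests the gold patch fixes
    "PASS_TO_PASS": ["test_existing_behavior"],   # tests that must not break
}

# Hypothetical post-patch test results:
print(is_resolved(instance["FAIL_TO_PASS"], instance["PASS_TO_PASS"],
                  {"test_issue_fix", "test_existing_behavior"}))  # True
print(is_resolved(instance["FAIL_TO_PASS"], instance["PASS_TO_PASS"],
                  {"test_issue_fix"}))  # False: a passing test regressed
```

This all-or-nothing criterion is why SWE-bench scores are reported as the percentage of issues fully resolved, rather than partial credit per test.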

Key Papers

Jimenez et al. (2024) ICLR