
SWE-bench

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

AI-focused · Public · Neither
Specific Type
AI benchmarking
Dataset Type
Cross-sectional
Institution
Princeton University; Stanford University; UW
Institution Type
Academia
Level of Focus
Task capability
Most Granular Level
Individual GitHub issue level
Perspective
Neither
Time Coverage
2023-present
Frequency
Static benchmark with periodic updates
Sample Size
2,294 GitHub issues (full benchmark); 500 issues in the SWE-bench Verified subset
Geographic Detail
Global
Occupational Classification
Not specified
Industrial Classification
Software repositories
Other Classification
GitHub issue classification
Key Variables
Code generation accuracy; functional correctness; repository navigation; debugging capability
AI/Tech Tracking
Real-world software engineering task automation
Access Details
Available on GitHub and Hugging Face
Notes
Real GitHub issues drawn from 12 Python repositories; measures end-to-end, repository-level coding capability beyond isolated function completion
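The functional-correctness criterion noted above can be illustrated with a minimal sketch: in SWE-bench, a model's patch resolves an issue only if every test in the instance's FAIL_TO_PASS set now passes and no test in PASS_TO_PASS regresses. The field names below follow the published dataset schema; the repository and test names are hypothetical examples.

```python
# Sketch of SWE-bench's resolution check: all previously failing tests
# (FAIL_TO_PASS) must pass after the patch, and all previously passing
# tests (PASS_TO_PASS) must keep passing.

def is_resolved(fail_to_pass, pass_to_pass, passing_after_patch):
    """Return True if the patch fixes the issue without regressions."""
    required = set(fail_to_pass) | set(pass_to_pass)
    return required <= set(passing_after_patch)

# Hypothetical task instance (field names match the dataset schema):
instance = {
    "repo": "astropy/astropy",                    # one of the 12 source repos
    "FAIL_TO_PASS": ["test_issue_fix"],           # tests the gold patch fixes
    "PASS_TO_PASS": ["test_existing_behavior"],   # tests that must not break
}

# Hypothetical post-patch test results:
print(is_resolved(instance["FAIL_TO_PASS"], instance["PASS_TO_PASS"],
                  {"test_issue_fix", "test_existing_behavior"}))  # True
print(is_resolved(instance["FAIL_TO_PASS"], instance["PASS_TO_PASS"],
                  {"test_issue_fix"}))  # False: a passing test regressed
```

This all-or-nothing criterion is why SWE-bench scores are reported as the percentage of issues fully resolved, rather than partial credit per test.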

Key Papers

Jimenez et al. (2024) ICLR