SWE-bench
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
AI-focused · Public · Neither
- Key Variables
- Code generation accuracy; functional correctness; repository navigation; debugging capability
- AI/Tech Tracking
- Real-world software engineering task automation
- Access Details
- Available on GitHub and Hugging Face
- Notes
- Real GitHub issues from 12 Python repositories; measures end-to-end coding capability beyond isolated functions
- Specific Type
- AI benchmarking
- Dataset Type
- Cross-sectional
- Institution
- Princeton University; University of Chicago
- Institution Type
- Academia
- Level of Focus
- Task capability
- Most Granular Level
- Individual GitHub issue level
- Perspective
- Neither
- Time Coverage
- 2023-present
- Frequency
- Static benchmark with periodic updates
- Sample Size
- 2,294 GitHub issues; 500-issue verified subset (SWE-bench Verified)
- Geographic Detail
- Global
- Occupational Classification
- Not specified
- Industrial Classification
- Software repositories
- Other Classification
- GitHub issue classification
Key Papers
Jimenez et al. (2024) ICLR
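The "end-to-end coding capability" noted above is scored per issue by running the repository's test suite after applying a model-generated patch. A minimal sketch of that resolution criterion, assuming the published SWE-bench instance schema (`instance_id`, `repo`, `problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`); the sample instance values here are invented for illustration, not a real dataset entry:

```python
# Sketch of SWE-bench's per-issue scoring rule: an issue counts as
# "resolved" only if every FAIL_TO_PASS test now passes (the fix works)
# AND every PASS_TO_PASS test still passes (no regressions).

# Illustrative instance (field names follow the dataset; values are invented).
instance = {
    "instance_id": "example__repo-1234",        # hypothetical ID
    "repo": "example/repo",                     # hypothetical repository
    "problem_statement": "Crash when parsing empty input ...",  # issue text
    "FAIL_TO_PASS": ["test_fix_a", "test_fix_b"],
    "PASS_TO_PASS": ["test_existing"],
}

def is_resolved(test_results: dict) -> bool:
    """test_results maps test name -> True (passed) / False (failed)."""
    fixed = all(test_results.get(t, False) for t in instance["FAIL_TO_PASS"])
    unbroken = all(test_results.get(t, False) for t in instance["PASS_TO_PASS"])
    return fixed and unbroken

# A patch that fixes the issue without breaking anything resolves the instance:
print(is_resolved({"test_fix_a": True, "test_fix_b": True, "test_existing": True}))   # True
# A patch that fixes the issue but breaks an existing test does not:
print(is_resolved({"test_fix_a": True, "test_fix_b": True, "test_existing": False}))  # False
```

This is why the benchmark measures more than isolated function synthesis: a candidate patch must localize the fix inside a full repository and leave the rest of the test suite intact.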