HumanEval
HumanEval: Evaluating Large Language Models Trained on Code
AI-focused · Public · Neither

Key Variables: Code generation accuracy; functional correctness; programming capability
AI/Tech Tracking: Python programming capability; basic algorithmic reasoning
Access Details: Available on GitHub and Papers with Code
Notes: Function-level code generation; uses pass@k metric for evaluation; saturated by current models
- Specific Type: AI benchmarking
- Dataset Type: Cross-sectional
- Institution: OpenAI
- Institution Type: AI Lab
- Level of Focus: Task capability
- Most Granular Level: Individual programming problem level
- Perspective: Neither
- Time Coverage: 2021-present
- Frequency: Static benchmark with extensions
- Sample Size: 164 programming problems
- Geographic Detail: Global
- Occupational Classification: Not specified
- Industrial Classification: Not specified
- Other Classification: Programming task classification
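
The Notes mention that HumanEval is scored with the pass@k metric. As a minimal sketch, the unbiased pass@k estimator from Chen et al. (2021) can be written as follows; the function name and signature here are illustrative, not taken from the official evaluation harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: samples that pass all unit tests
    k: evaluation budget

    Returns the probability that at least one of k randomly
    drawn samples (out of the n generated) passes the tests:
    pass@k = 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples: any k-sample draw
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1 passing sample out of 2 gives pass@1 = 0.5.
print(pass_at_k(2, 1, 1))
```

Averaging this estimate over all 164 problems gives the benchmark score; the estimator is preferred over naively sampling k completions because it has lower variance for the same generation budget.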
Key Papers
Daniotti et al. (2026); Chen et al. (2021), OpenAI