
HumanEval

HumanEval: Evaluating Large Language Models Trained on Code

AI-focused · Public · Neither
Specific Type
AI benchmarking
Dataset Type
Cross-sectional
Institution
OpenAI
Institution Type
AI Lab
Level of Focus
Task capability
Most Granular Level
Individual programming problem level
Perspective
Neither
Time Coverage
2021-present
Frequency
Static benchmark with extensions
Sample Size
164 programming problems
Geographic Detail
Global
Occupational Classification
Not specified
Industrial Classification
Not specified
Other Classification
Programming task classification
Key Variables
Code generation accuracy; functional correctness; programming capability
AI/Tech Tracking
Python programming capability; basic algorithmic reasoning
Access Details
Available on GitHub and Papers with Code
Notes
Function-level code generation; evaluated for functional correctness via the pass@k metric; largely saturated by current frontier models
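The pass@k metric noted above estimates the probability that at least one of k sampled completions passes the unit tests. The HumanEval paper gives an unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them pass; a minimal sketch of the numerically stable product form:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: budget of samples considered
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # Stable product form avoids overflow in the binomial coefficients.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 1 passing sample out of 2, pass@1 = 0.5
print(pass_at_k(2, 1, 1))
```

Per-problem scores are then averaged over all 164 problems to produce the benchmark result.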