
GDPval

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

AI-focused; Public; Neither
Specific Type: AI benchmarking
Dataset Type: Cross-sectional
Institution: OpenAI
Institution Type: AI Lab
Level of Focus: Task capability; Occupation
Most Granular Level: Individual professional task level
Perspective: Neither
Time Coverage: 2025
Frequency: One-time release
Sample Size: 1,320 tasks across 44 occupations in 9 GDP sectors
Geographic Detail: US-focused (GDP sectors)
Occupational Classification: BLS Work Activities mapped to 44 occupations
Industrial Classification: 9 largest US GDP sectors
Key Variables: Task completion quality relative to human experts; cost comparison; time savings; deliverable accuracy across professional occupations
AI/Tech Tracking: Directly measures AI performance on economically valuable professional tasks; tracks frontier model progress on real work products including documents, slides, diagrams, and spreadsheets
Access Details: 220-task gold subset publicly available; automated grader available at evals.openai.com
Notes: Tasks designed by industry professionals averaging 14 years of experience; represents a shift toward benchmarks that measure economic productivity rather than abstract capabilities; frontier models approaching expert-level quality on many tasks
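
Illustrative sketch: one way the "task completion quality relative to human experts" variable could be summarized per occupation is a win-or-tie rate. This assumes each grade is stored as a pairwise preference between a model deliverable and the human expert's deliverable. The record layout, the occupation labels, the sample rows, and the function name `win_or_tie_rate` are all hypothetical and do not come from the dataset or its grader.

```python
from collections import defaultdict

# Hypothetical grading records: (occupation, verdict). The verdict is the
# grader's preference between the model deliverable and the human expert's.
# These rows are made-up illustrations, not GDPval data.
GRADES = [
    ("Accountant", "model"),
    ("Accountant", "human"),
    ("Accountant", "tie"),
    ("Registered nurse", "human"),
    ("Registered nurse", "model"),
]


def win_or_tie_rate(grades):
    """Return, per occupation, the share of grades where the model
    deliverable was preferred or judged equal to the expert's."""
    totals = defaultdict(int)
    favourable = defaultdict(int)
    for occupation, verdict in grades:
        if verdict not in {"model", "human", "tie"}:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        totals[occupation] += 1
        if verdict in {"model", "tie"}:
            favourable[occupation] += 1
    return {occ: favourable[occ] / totals[occ] for occ in totals}


if __name__ == "__main__":
    for occupation, rate in sorted(win_or_tie_rate(GRADES).items()):
        print(f"{occupation}: {rate:.2f}")
```

Aggregating by occupation rather than over all tasks keeps the summary aligned with the entry's stated level of focus; a sector-level figure would follow the same pattern with a different grouping key.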

Key Papers

Patwardhan et al. (2025)