GDPval
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
AI-focused · Public · Neither
- Key Variables
- Task completion quality relative to human experts; cost comparison; time savings; deliverable accuracy across professional occupations
- AI/Tech Tracking
- Directly measures AI performance on economically valuable professional tasks; tracks frontier model progress on real work products including documents, slides, diagrams, and spreadsheets
- Access Details
- 220-task gold subset publicly available; automated grader available at evals.openai.com
- Notes
- Tasks designed by industry professionals averaging 14 years of experience; represents a shift toward benchmarks that measure economic productivity rather than abstract capabilities; frontier models are approaching expert-level quality on many tasks
- Specific Type
- AI benchmarking
- Dataset Type
- Cross-sectional
- Institution
- OpenAI
- Institution Type
- AI Lab
- Level of Focus
- Task capability; Occupation
- Most Granular Level
- Individual professional task level
- Perspective
- Neither
- Time Coverage
- 2025
- Frequency
- One-time release
- Sample Size
- 1,320 tasks across 44 occupations in 9 GDP sectors
- Geographic Detail
- US-focused (GDP sectors)
- Occupational Classification
- BLS Work Activities mapped to 44 occupations
- Industrial Classification
- 9 largest US GDP sectors
Key Papers
Patwardhan et al. (2025)