APEX
APEX: AI Productivity Index
AI-focused | Public | Neither
- Key Variables
- AI task completion quality; expert-level comparison; productivity measurement across professional domains
- AI/Tech Tracking
- Evaluates frontier AI models on extended professional tasks that require sustained reasoning and domain expertise; graded by domain experts
- Access Details
- Leaderboard and results publicly available
- Notes
- Focuses on tasks that take an expert significant time (1-8 hours), which distinguishes it from benchmarks that test quick-answer capabilities; reflects the broader trend toward economically grounded AI evaluation
- Specific Type
- AI benchmarking
- Dataset Type
- Cross-sectional
- Institution
- Mercor
- Institution Type
- Private Data Provider
- Level of Focus
- Task capability; Occupation
- Most Granular Level
- Individual professional task level
- Perspective
- Neither
- Time Coverage
- 2025-present
- Frequency
- Static benchmark with periodic updates
- Sample Size
- 400 test cases (v1-extended) across 4 professional domains, created by 76 domain experts
- Geographic Detail
- Global
- Occupational Classification
- Professional domains (investment banking, management consulting, law, primary medical care)
Key Papers
Vidgen et al. (2025)