GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Patwardhan, Dias, Proehl, Kim, Wang, Watkins, Posada Fishman, Aljubeh, Thacker, Fauconnet, Kim, Chao, Miserendino, Chabot, Li, Sharman, Barr, Glaese, Tworek
2025
arXiv pre-print
AI capability / benchmarking
Computer Science / AI
LLM / Generative AI
Abstract
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
Summary
Patwardhan et al. develop GDPval, a benchmark evaluating frontier AI model capabilities on 1,320 real-world economically valuable tasks across 44 occupations in the top 9 US GDP-contributing sectors, using blind pairwise comparisons by industry experts to assess how model performance compares to human professionals.
Main Finding
Frontier LLM performance on real-world economically valuable tasks is improving roughly linearly over time, with the best models (Claude Opus 4.1 at a 47.6% win-or-tie rate) approaching parity with industry experts; when paired with human oversight, models show potential to save time and money compared to unaided experts, though performance varies substantially by task type, occupation, and deliverable format.
Primary Datasets
Custom benchmark (GDPval) of real-world economically valuable tasks constructed from the representative work of industry professionals
Secondary Datasets
O*NET task descriptions
Key Methods
Task-level AI performance evaluation using human expert pairwise comparisons; economic value weighting based on O*NET task coverage; comparison across frontier LLMs (GPT-4o, o4-mini, o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, Grok 4); experimental automated grader development
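For concreteness, a minimal sketch of the headline metric: each model deliverable is graded blind against the expert's deliverable, and the win-or-tie rate is the share of pairwise grades in which the model wins or ties. The function and example data below are hypothetical illustrations of that definition, not the paper's code.

```python
from collections import Counter

def win_or_tie_rate(comparisons):
    """Share of blind pairwise grades in which the model wins or ties.

    `comparisons` holds one outcome per (sample, grader) pair, each
    "win", "tie", or "loss" for the model versus the human expert.
    Hypothetical helper, not the paper's implementation.
    """
    counts = Counter(comparisons)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total

# Hypothetical example: one task, 3 samples x 3 graders = 9 comparisons
grades = ["win", "tie", "loss", "win", "loss", "tie", "loss", "win", "loss"]
print(f"win-or-tie rate: {win_or_tie_rate(grades):.1%}")  # 55.6%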
Sample Period
2024-2025
Geographic Coverage
US
Sample Size
1,320 tasks in full set; 220 tasks in open-sourced gold subset; 44 occupations across 9 sectors; multiple model samples (typically 3) per task; average 5 human reviews per task during construction; 9 pairwise comparisons per prompt per model (3 samples × 3 graders) for evaluation
Level of Analysis
Task, Occupation
Occupation Classification
O*NET SOC codes (4-digit and 6-digit)
Industry Classification
NAICS (2-digit sector codes)
Replication Package
Partial
Notes
arXiv:2510.04374. Evaluates AI on tasks with real economic value rather than on academic benchmarks.
[Claude classification]: This is a benchmark evaluation paper rather than an economic study per se. The paper constructs GDPval, a dataset of 1,320 real-world tasks across 44 occupations in 9 sectors contributing to US GDP. Tasks were created by industry experts with an average of 14 years of experience. The primary evaluation uses blind pairwise comparisons by human experts; the paper also develops an experimental automated grader that agrees with human graders 66% of the time, against a human inter-rater agreement of 71%. It tests the effects of reasoning effort, prompt tuning, and scaffolding on performance, and open-sources a 220-task gold subset. The paper measures AI *capability* on economic tasks rather than actual economic impacts. aiTechFocus codes what the paper studies (LLMs performing tasks), not the full range of AI discussed in task content.
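The agreement figures in the note above (66% automated grader vs. 71% human inter-rater) can be read as exact-match agreement rates. A minimal sketch, assuming agreement is the mean fraction of items on which two raters give the same label; the paper's precise metric may differ, and the function and data below are hypothetical.

```python
from itertools import combinations

def mean_pairwise_agreement(labels_by_rater):
    """Mean exact-match agreement across all rater pairs.

    `labels_by_rater` maps rater name -> list of labels, one per graded
    item (lists aligned and of equal length). Hypothetical metric.
    """
    pairs = list(combinations(labels_by_rater, 2))
    total = 0.0
    for a, b in pairs:
        xs, ys = labels_by_rater[a], labels_by_rater[b]
        total += sum(x == y for x, y in zip(xs, ys)) / len(xs)
    return total / len(pairs)

# Hypothetical example: three graders over four items
grades = {
    "grader_1": ["win", "tie", "loss", "win"],
    "grader_2": ["win", "loss", "loss", "win"],
    "grader_3": ["tie", "tie", "loss", "win"],
}
print(f"agreement: {mean_pairwise_agreement(grades):.0%}")  # 67%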