The AI Productivity Index (APEX)
Vidgen, Thrush, Hale, Madnani, Awal, Majumder, Luger, Baines, Klyman, Saifullah, Kirk
2025, arXiv preprint
AI capability / benchmarking; Computer Science / AI
LLM / Generative AI; Healthcare; Legal; Finance; Human-AI collaboration
Abstract: We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT-5 (Thinking = High) remains the top-performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
Summary: Vidgen et al. introduce APEX-v1-extended, an expanded benchmark with 400 held-out evaluation cases across four professional jobs (investment banking, management consulting, law, medicine), using expert-annotated task rubrics and LM-as-judge grading to assess frontier AI models' capabilities on economically valuable tasks.
Main Finding: GPT-5 (Thinking = High) achieves the highest performance at 67.0% on realistic professional tasks, with substantial variation across jobs (lowest 63.0% for investment banking, highest 77.9% for law), demonstrating that frontier models still have significant limitations on economically valuable work.
Primary Datasets
O*NET task descriptions; custom expert annotation of AI task performance
Secondary Datasets
BLS OEWS
Key Methods
Expert annotation of task rubrics; LM-as-judge evaluation; benchmark construction with n = 400 held-out cases across 4 professional occupations; repeated model inference (8 runs per task) with mean scoring
Sample Period
2024-2025
Geographic Coverage
US
Sample Size
400 held-out evaluation cases (100 per job) plus 100 open-source development cases (25 per job); 10 frontier models evaluated; mean 14.81 criteria per case; mean 3.70 source documents per case
Level of Analysis
Task, Occupation
Occupation Classification
O*NET-SOC
Industry Classification
None
Replication Package
Partial
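The scoring approach described in Key Methods (rubric criteria graded per run, 8 runs per task, mean scoring) can be sketched as follows. This is a minimal illustration, not the paper's harness: it assumes each criterion is graded as a binary pass/fail by the judge LM, and the function names and example data are hypothetical.

```python
# Sketch of rubric-based mean scoring: each case has a rubric of binary
# criteria; the model is run several times (8 in the paper) and the case
# score is the mean fraction of criteria passed across runs.
from statistics import mean

def case_score(runs: list[list[bool]]) -> float:
    """runs: one list of per-criterion pass/fail judgments per model run."""
    return mean(sum(r) / len(r) for r in runs)

def benchmark_score(cases: list[list[list[bool]]]) -> float:
    """Average the per-case scores over all held-out cases."""
    return mean(case_score(c) for c in cases)

# Illustrative example: one case, two runs, a 4-criterion rubric.
runs = [[True, True, False, True], [True, False, False, True]]
print(round(case_score(runs), 3))  # 0.625
```

A leaderboard figure like 67.0% would then correspond to `benchmark_score` computed over all 400 held-out cases for one model.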
Notes: arXiv:2509.25721. Creates a granular, updatable index of AI productivity potential across occupational tasks.
[Claude classification]: arXiv:2509.25721v6. This is a benchmark/evaluation paper, not a study of AI's labor market effects. Tests AI model capabilities on professional tasks using expert-created rubrics and LM-as-judge methodology. Open sources n=100 development cases. Uses Gemini 2.5 Flash as single judge LM. Statistical significance tested via Friedman test and paired t-tests with Bonferroni correction
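The pairwise significance testing mentioned above (paired t-tests with Bonferroni correction across model pairs) can be sketched with a hand-rolled paired t-statistic. The scores below are made-up illustrative per-case data, not results from the paper, and the model names are placeholders.

```python
# Sketch of pairwise model comparison: paired t-statistic on per-case
# score differences, with a Bonferroni-adjusted significance threshold.
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Illustrative per-case scores for three hypothetical models (rows align by case).
scores = {
    "model_a": [0.70, 0.65, 0.80, 0.60, 0.75],
    "model_b": [0.55, 0.50, 0.70, 0.45, 0.60],
    "model_c": [0.52, 0.48, 0.66, 0.50, 0.58],
}

def paired_t(x: list[float], y: list[float]) -> float:
    """t-statistic for a paired t-test on per-case score differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: split the 0.05 alpha across 3 comparisons
for a, b in pairs:
    print(f"{a} vs {b}: t={paired_t(scores[a], scores[b]):.2f}, adjusted alpha={alpha:.4f}")
```

The paper also reports a Friedman test as an omnibus check before the pairwise comparisons; `scipy.stats.friedmanchisquare` provides that step directly.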