The AI Productivity Index (APEX)
Vidgen, Thrush, Hale, Madnani, Awal, Majumder, Luger, Baines, Klyman, Saifullah, Kirk
2025, arXiv preprint
AI capability / benchmarking; Computer Science / AI
LLM / Generative AI; Healthcare; Legal; Finance; Human-AI collaboration
Abstract: We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT-5 (Thinking = High) remains the top-performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
Summary: Vidgen et al. introduce APEX-v1-extended, an expanded benchmark with 400 held-out evaluation cases across four professional jobs (investment banking, management consulting, law, medicine), using expert-annotated task rubrics and LM-as-judge grading to assess frontier AI models' capabilities on economically valuable tasks.
Main Finding: GPT-5 (Thinking = High) achieves the highest performance at 67.0% on realistic professional tasks, with substantial variation across jobs (lowest 63.0% for investment banking, highest 77.9% for law), demonstrating that frontier models still have significant limitations on economically valuable work.
Primary Datasets
O*NET task descriptions; custom expert annotation of AI task performance
Secondary Datasets
BLS OEWS
Key Methods
Expert annotation of task rubrics; LM-as-judge evaluation; benchmark construction with n = 400 held-out cases across 4 professional occupations; repeated model inference (8 runs per task) with mean scoring
Sample Period
2024-2025
Geographic Coverage
US
Sample Size
400 held-out evaluation cases (100 per job) plus 100 open-source development cases (25 per job); 10 frontier models evaluated; mean 14.81 criteria per case; mean 3.70 source documents per case
Level of Analysis
Task, Occupation
Occupation Classification
O*NET-SOC
Industry Classification
None
Replication Package
Partial
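The scoring approach described in Key Methods (rubric criteria graded per run, 8 runs per task, mean scoring) can be sketched as follows. This is a minimal illustration, not the paper's harness: it assumes each criterion is graded as a binary pass/fail by the judge LM, and the function names and example data are hypothetical.

```python
# Sketch of rubric-based mean scoring: each case has a rubric of binary
# criteria; the model is run several times (8 in the paper) and the case
# score is the mean fraction of criteria passed across runs.
from statistics import mean

def case_score(runs: list[list[bool]]) -> float:
    """runs: one list of per-criterion pass/fail judgments per model run."""
    return mean(sum(r) / len(r) for r in runs)

def benchmark_score(cases: list[list[list[bool]]]) -> float:
    """Average the per-case scores over all held-out cases."""
    return mean(case_score(c) for c in cases)

# Illustrative example: one case, two runs, a 4-criterion rubric.
runs = [[True, True, False, True], [True, False, False, True]]
print(round(case_score(runs), 3))  # 0.625
```

A leaderboard figure like 67.0% would then correspond to `benchmark_score` computed over all 400 held-out cases for one model.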
Notes: arXiv:2509.25721. Creates a granular, updatable index of AI productivity potential across occupational tasks.
[Claude classification]: arXiv:2509.25721v6. This is a benchmark/evaluation paper, not a study of AI's labor market effects. Tests AI model capabilities on professional tasks using expert-created rubrics and LM-as-judge methodology. Open sources n=100 development cases. Uses Gemini 2.5 Flash as single judge LM. Statistical significance tested via Friedman test and paired t-tests with Bonferroni correction
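The pairwise significance testing mentioned above (paired t-tests with Bonferroni correction across model pairs) can be sketched with a hand-rolled paired t-statistic. The scores below are made-up illustrative per-case data, not results from the paper, and the model names are placeholders.

```python
# Sketch of pairwise model comparison: paired t-statistic on per-case
# score differences, with a Bonferroni-adjusted significance threshold.
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Illustrative per-case scores for three hypothetical models (rows align by case).
scores = {
    "model_a": [0.70, 0.65, 0.80, 0.60, 0.75],
    "model_b": [0.55, 0.50, 0.70, 0.45, 0.60],
    "model_c": [0.52, 0.48, 0.66, 0.50, 0.58],
}

def paired_t(x: list[float], y: list[float]) -> float:
    """t-statistic for a paired t-test on per-case score differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: split the 0.05 alpha across 3 comparisons
for a, b in pairs:
    print(f"{a} vs {b}: t={paired_t(scores[a], scores[b]):.2f}, adjusted alpha={alpha:.4f}")
```

The paper also reports a Friedman test as an omnibus check before the pairwise comparisons; `scipy.stats.friedmanchisquare` provides that step directly.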