This site is a work in progress and has not been widely shared. Some annotations were human-generated and some AI-generated; all are being verified. Content may contain errors, and feedback is welcome.

Which Economic Tasks Are Performed with AI? Evidence from Millions of Claude Conversations

Handa, Tamkin, McCain, Huang

2025 · arXiv pre-print · 5 citations
Adoption / usage · Interdisciplinary
LLM / Generative AI · Software / coding · Writing / content · Customer service · Education · Science / research · Human-AI collaboration · Augmentation vs. substitution · General automation
Abstract

Despite widespread speculation about artificial intelligence's impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor's O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI's evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.

Summary

Handa, Tamkin, McCain et al. use privacy-preserving analysis of over 4 million Claude.ai conversations to measure which economic tasks across the O*NET database are seeing AI usage, finding concentration in software development and writing tasks with mixed automation and augmentation patterns.

Main Finding

AI usage concentrates in software development (37.2% of queries) and writing tasks. 36% of occupations show usage in at least 25% of their tasks, but only 4% show usage in 75% or more, and usage peaks in upper-quartile wage occupations. 57% of interactions show augmentative patterns, while 43% show automation-focused usage.
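The occupation-level thresholds above can be illustrated with a toy aggregation, assuming hypothetical per-task usage flags in place of the paper's actual mapping of ~1M conversations to O*NET tasks (occupation names, task IDs, and the `usage_breadth` helper are illustrative):

```python
from collections import defaultdict

# Hypothetical (occupation, task, has_usage) records standing in for the
# paper's conversation-to-O*NET-task mapping.
records = [
    ("Software Developer", "t1", True),
    ("Software Developer", "t2", True),
    ("Software Developer", "t3", True),
    ("Software Developer", "t4", False),
    ("Surgeon", "t1", False),
    ("Surgeon", "t2", False),
    ("Surgeon", "t3", True),
    ("Surgeon", "t4", False),
]

def usage_breadth(records):
    """Fraction of each occupation's tasks that show any AI usage."""
    tasks = defaultdict(list)
    for occ, _task, used in records:
        tasks[occ].append(used)
    return {occ: sum(flags) / len(flags) for occ, flags in tasks.items()}

breadth = usage_breadth(records)

# Share of occupations using AI for at least a quarter of their tasks,
# mirroring the paper's 36%-of-occupations statistic.
share_25 = sum(b >= 0.25 for b in breadth.values()) / len(breadth)
print(breadth, share_25)
```

Both toy occupations clear the 25% bar here; in the paper, the same statistic computed over all O*NET occupations yields 36%.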

Primary Datasets

Claude.ai conversation data (over 4 million conversations), November 2024 - January 2025

Secondary Datasets

U.S. Department of Labor O*NET Database (tasks and occupations); BLS median wage data (for wage-quartile analysis)

Key Methods
Privacy-preserving LLM-based classification of millions of Claude.ai conversations mapped to O*NET occupational tasks through hierarchical tree-based search; descriptive analysis of usage patterns across occupations, skills, wages, and automation vs. augmentation modes
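The hierarchical tree-based search can be sketched in miniature, assuming synthetic embeddings and a plain NumPy k-means; the paper's actual pipeline uses sentence embeddings of O*NET task descriptions and Claude-based classification, so the `kmeans` helper, dimensions, and cluster count here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: embeddings for O*NET task descriptions and for
# one conversation summary (dimensions are arbitrary).
task_embeddings = rng.normal(size=(1000, 32))
query = rng.normal(size=32)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns centroids and per-point cluster labels."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(
            ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
        )
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Build a two-level tree: cluster the tasks, then search top-down --
# first pick the nearest non-empty cluster, then the nearest task in it.
centroids, labels = kmeans(task_embeddings, k=10)
nonempty = np.unique(labels)
best_cluster = nonempty[
    np.argmin(((centroids[nonempty] - query) ** 2).sum(-1))
]
members = np.flatnonzero(labels == best_cluster)
best_task = members[
    np.argmin(((task_embeddings[members] - query) ** 2).sum(-1))
]
print(int(best_task))
```

The two-level search trades a small loss in accuracy for a large reduction in comparisons: rather than scoring every task against the query, only the cluster centroids and one cluster's members are scored, which is what makes a tree-based pass over O*NET's thousands of task statements tractable at conversation scale.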
Sample Period
November 2024 - January 2025
Geographic Coverage
United States
Sample Size
~4 million Claude.ai conversations (1M for main task analysis Dec 16-23, 2024; 500K for skills analysis Jan 10-17, 2025; 1M for automation/augmentation Dec 16-23, 2024; 1M for model comparison Dec 15, 2024-Jan 4, 2025; 2.8M for cluster validation Nov 28-Dec 18, 2024)
Level of Analysis
Task, Occupation, Individual
Occupation Classification
O*NET-SOC (O*NET task and occupation taxonomy)
Industry Classification
None (task-level O*NET analysis; no industry classification used)
Replication Package
Partial
Notes
arXiv:2503.04761. [Claude classification]: Uses both Eloundou et al. (2024) GPT-4 beta exposure measures and Handa et al. (2025) Claude-based measures, including the automation vs. augmentation distinction. Finds employment declines concentrated in automation-oriented AI applications but not augmentative ones. Results are robust to excluding computer occupations, teleworkable occupations, and information-sector firms. Compensation effects are minimal, suggesting wage stickiness. Sample includes 3.5-5 million workers monthly from ADP payroll data.
[Claude classification]: This paper uses Clio (Tamkin et al., 2024), a privacy-preserving framework that uses Claude to analyze aggregated conversation patterns. The study is purely descriptive and makes no causal claims. Classification uses a hierarchical tree-based search through O*NET tasks (k-means clustering with sentence embeddings). Human validation shows 86% accuracy at the base O*NET level, 91.3% at the middle level, and 95.3% at the top level. Key limitations: single platform (Claude.ai), U.S.-centric O*NET framework, inability to observe how outputs are actually used in workflows, and potential overestimation from novice users. Sample: 1M conversations for the main analysis (Dec 2024), plus 500K for the skills analysis (Jan 2025). The paper builds on the task-based framework of Autor et al. (2003) and complements exposure predictions from Webb (2019) and Eloundou et al. (2023) with actual usage data.