This site is a work in progress and has not been widely shared. Some annotations were human-generated and some AI-generated; all are being verified. Content may contain errors, and feedback is welcome.

Which Economic Tasks Are Performed with AI? Evidence from Millions of Claude Conversations

Handa, Tamkin, McCain, Huang

2025 · arXiv pre-print · 5 citations
Adoption / usage · Interdisciplinary
LLM / Generative AI · Software / coding · Writing / content · Customer service · Education · Science / research · Human-AI collaboration · Augmentation vs. substitution · General automation
Abstract

Despite widespread speculation about artificial intelligence's impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor's O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI's evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.

Summary

Handa, Tamkin, McCain et al. use privacy-preserving analysis of over 4 million Claude.ai conversations to measure which economic tasks across the O*NET database are seeing AI usage, finding concentration in software development and writing tasks with mixed automation and augmentation patterns.

Main Finding

AI usage concentrates in software development (37.2% of queries) and writing tasks. 36% of occupations show usage in at least 25% of their tasks, but only 4% show usage in 75% or more, and usage peaks in upper-quartile wage occupations. 57% of interactions show augmentative patterns, while 43% show automation-focused usage.
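The occupation-level thresholds above can be illustrated with a toy aggregation, assuming hypothetical per-task usage flags in place of the paper's actual mapping of ~1M conversations to O*NET tasks (occupation names, task IDs, and the `usage_breadth` helper are illustrative):

```python
from collections import defaultdict

# Hypothetical (occupation, task, has_usage) records standing in for the
# paper's conversation-to-O*NET-task mapping.
records = [
    ("Software Developer", "t1", True),
    ("Software Developer", "t2", True),
    ("Software Developer", "t3", True),
    ("Software Developer", "t4", False),
    ("Surgeon", "t1", False),
    ("Surgeon", "t2", False),
    ("Surgeon", "t3", True),
    ("Surgeon", "t4", False),
]

def usage_breadth(records):
    """Fraction of each occupation's tasks that show any AI usage."""
    tasks = defaultdict(list)
    for occ, _task, used in records:
        tasks[occ].append(used)
    return {occ: sum(flags) / len(flags) for occ, flags in tasks.items()}

breadth = usage_breadth(records)

# Share of occupations using AI for at least a quarter of their tasks,
# mirroring the paper's 36%-of-occupations statistic.
share_25 = sum(b >= 0.25 for b in breadth.values()) / len(breadth)
print(breadth, share_25)
```

Both toy occupations clear the 25% bar here; in the paper, the same statistic computed over all O*NET occupations yields 36%.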

Primary Datasets

Claude.ai conversation data (over 4 million conversations), November 2024 - January 2025

Secondary Datasets

U.S. Department of Labor O*NET Database (tasks and occupations); BLS median wage data (for wage-quartile analysis)

Key Methods
Privacy-preserving LLM-based classification of millions of Claude.ai conversations mapped to O*NET occupational tasks through hierarchical tree-based search; descriptive analysis of usage patterns across occupations, skills, wages, and automation vs. augmentation modes
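The hierarchical tree-based search can be sketched in miniature, assuming synthetic embeddings and a plain NumPy k-means; the paper's actual pipeline uses sentence embeddings of O*NET task descriptions and Claude-based classification, so the `kmeans` helper, dimensions, and cluster count here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: embeddings for O*NET task descriptions and for
# one conversation summary (dimensions are arbitrary).
task_embeddings = rng.normal(size=(1000, 32))
query = rng.normal(size=32)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns centroids and per-point cluster labels."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(
            ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
        )
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Build a two-level tree: cluster the tasks, then search top-down --
# first pick the nearest non-empty cluster, then the nearest task in it.
centroids, labels = kmeans(task_embeddings, k=10)
nonempty = np.unique(labels)
best_cluster = nonempty[
    np.argmin(((centroids[nonempty] - query) ** 2).sum(-1))
]
members = np.flatnonzero(labels == best_cluster)
best_task = members[
    np.argmin(((task_embeddings[members] - query) ** 2).sum(-1))
]
print(int(best_task))
```

The two-level search trades a small loss in accuracy for a large reduction in comparisons: rather than scoring every task against the query, only the cluster centroids and one cluster's members are scored, which is what makes a tree-based pass over O*NET's thousands of task statements tractable at conversation scale.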
Sample Period
November 2024 - January 2025
Geographic Coverage
United States
Sample Size
~4 million Claude.ai conversations (1M for main task analysis Dec 16-23, 2024; 500K for skills analysis Jan 10-17, 2025; 1M for automation/augmentation Dec 16-23, 2024; 1M for model comparison Dec 15, 2024-Jan 4, 2025; 2.8M for cluster validation Nov 28-Dec 18, 2024)
Level of Analysis
Task, Occupation, Individual
Occupation Classification
O*NET-SOC (O*NET task and occupation taxonomy)
Industry Classification
None (task-level O*NET analysis; no industry classification used)
Replication Package
Partial
Notes
arXiv:2503.04761. [Claude classification]: Uses both Eloundou et al. (2024) GPT-4 beta exposure measures and Handa et al. (2025) Claude-based measures, including the automation vs. augmentation distinction. Finds employment declines concentrated in automation-oriented AI applications but not augmentative ones. Results are robust to excluding computer occupations, teleworkable occupations, and information-sector firms. Compensation effects are minimal, suggesting wage stickiness. Sample includes 3.5-5 million workers monthly from ADP payroll data.
[Claude classification]: This paper uses Clio (Tamkin et al., 2024), a privacy-preserving framework that uses Claude to analyze aggregated conversation patterns. The study is purely descriptive and makes no causal claims. Classification uses a hierarchical tree-based search through O*NET tasks (k-means clustering with sentence embeddings). Human validation shows 86% accuracy at the base O*NET level, 91.3% at the middle level, and 95.3% at the top level. Key limitations: single platform (Claude.ai), U.S.-centric O*NET framework, inability to observe how outputs are actually used in workflows, and potential overestimation from novice users. Sample: 1M conversations for the main analysis (Dec 2024), plus 500K for the skills analysis (Jan 2025). The paper builds on the task-based framework of Autor et al. (2003) and complements exposure predictions from Webb (2019) and Eloundou et al. (2023) with actual usage data.