
Generative AI at Work

Brynjolfsson, Li, Raymond

2023, NBER Working Paper, 736 citations
Experimental evidence, Causal
LLM / Generative AI, Customer service, Junior / entry-level, Human-AI collaboration, Augmentation vs. substitution, Training / upskilling
Abstract

We study the staggered introduction of a generative AI-based conversational assistant using data from 5,179 customer support agents. Access to the tool increases productivity, as measured by issues resolved per hour, by 14 percent on average, with the greatest impact on novice and low-skilled workers, and minimal impact on experienced and highly skilled workers. We provide suggestive evidence that the AI model disseminates the potentially tacit knowledge of more able workers and helps newer workers move down the experience curve. In addition, we show that AI assistance improves customer sentiment, reduces requests for managerial intervention, and improves employee retention.

Summary

Brynjolfsson, Li, and Raymond use a staggered rollout field experiment with data from 5,179 customer support agents to study the impact of a GPT-based conversational assistant on worker productivity, learning, and experience of work in a Fortune 500 software firm.

Main Finding

Access to a generative AI assistant increases customer support agent productivity by 14% on average (measured as resolutions per hour). The effects are highly uneven: a 34% improvement for novice and low-skilled workers, but minimal impact on experienced and highly skilled workers. The tool also improves customer sentiment and reduces worker attrition.

Primary Datasets

Proprietary customer support data

Secondary Datasets

Internal performance metrics

Key Methods
Staggered rollout field experiment with difference-in-differences analysis; event studies using robust DiD estimators (Sun-Abraham, Borusyak et al., Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille); text analysis using LLM embeddings for sentiment and semantic similarity
Sample Period
2020-2021
Geographic Coverage
US
Sample Size
3 million chats by 5,179 agents; 1.2 million chats by 1,636 agents post-AI deployment
Level of Analysis
Individual
Occupation Classification
Internal job codes
Industry Classification
N/A
Replication Package
Partial
Notes
5,179 customer support agents [Claude classification]: This is the first study of generative AI impact in a real-world workplace at scale. The AI system is built on GPT and fine-tuned for customer service. A small initial RCT (50 agents, 7 weeks) was followed by staggered rollout. Authors use LLM embeddings (all-MiniLM-L6-v2) for textual similarity analysis and SiEBERT for sentiment analysis. Paper examines software outages to test for durable learning effects. Authors emphasize they cannot observe aggregate employment/wage effects, skill composition changes, or worker compensation.
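
The difference-in-differences logic listed under Key Methods can be illustrated with a minimal sketch. This is the textbook two-group, two-period version, not the staggered-rollout estimators (Sun-Abraham, Callaway-Sant'Anna, and so on) the authors actually use, and every number below is hypothetical rather than taken from the paper. The record layout and the function names `group_mean` and `did_estimate` are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical resolutions-per-hour observations (NOT data from the paper).
# Each record: (agent_id, period, got_ai_access, resolutions_per_hour)
records = [
    ("a1", "pre", True, 2.0), ("a1", "post", True, 2.5),
    ("a2", "pre", True, 1.8), ("a2", "post", True, 2.3),
    ("b1", "pre", False, 2.1), ("b1", "post", False, 2.2),
    ("b2", "pre", False, 1.9), ("b2", "post", False, 2.0),
]


def group_mean(rows, treated, period):
    """Average outcome for one group (treated or not) in one period."""
    return mean(r[3] for r in rows if r[2] == treated and r[1] == period)


def did_estimate(rows):
    """Two-by-two difference-in-differences.

    Change over time in the treated group minus the change over time in
    the comparison group. Under a parallel-trends assumption, this
    difference is attributed to the treatment.
    """
    treated_change = group_mean(rows, True, "post") - group_mean(rows, True, "pre")
    control_change = group_mean(rows, False, "post") - group_mean(rows, False, "pre")
    return treated_change - control_change


effect = did_estimate(records)
print(f"DiD estimate (resolutions per hour): {effect:.2f}")
```

With a staggered rollout, agents gain access at different dates, so a single pre/post split like this no longer applies cleanly. That is the situation the robust estimators named under Key Methods are designed to handle.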