Generative AI at Work

Brynjolfsson, Li, Raymond

2023 | NBER Working Paper | 736 citations

Experimental evidence | Causal

LLM / Generative AI | Customer service | Junior / entry-level | Human-AI collaboration | Augmentation vs. substitution | Training / upskilling

DOI: 10.3386/w31161

Abstract

We study the staggered introduction of a generative AI-based conversational assistant using data from 5,179 customer support agents. Access to the tool increases productivity, as measured by issues resolved per hour, by 14 percent on average, with the greatest impact on novice and low-skilled workers, and minimal impact on experienced and highly skilled workers. We provide suggestive evidence that the AI model disseminates the potentially tacit knowledge of more able workers and helps newer workers move down the experience curve. In addition, we show that AI assistance improves customer sentiment, reduces requests for managerial intervention, and improves employee retention.

Summary

Brynjolfsson, Li, and Raymond use a staggered rollout field experiment with data from 5,179 customer support agents to study the impact of a GPT-based conversational assistant on worker productivity, learning, and experience of work in a Fortune 500 software firm.

Main Finding

Access to a generative AI assistant increases customer support agent productivity, measured as resolutions per hour, by 14% on average. The effects are highly uneven: a 34% improvement for novice and low-skilled workers but minimal impact on experienced and highly skilled workers. The study also finds improved customer sentiment and reduced worker attrition.

Primary Datasets

Proprietary customer support data

Secondary Datasets

Internal performance metrics

Key Methods: Staggered rollout field experiment with difference-in-differences analysis; event studies using robust DiD estimators (Sun-Abraham, Borusyak et al., Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille); text analysis using LLM embeddings for sentiment and semantic similarity
Sample Period: 2020-2021
Geographic Coverage: US
Sample Size: 3 million chats by 5,179 agents; 1.2 million chats by 1,636 agents post-AI deployment
Level of Analysis: Individual
Occupation Classification: Internal job codes
Industry Classification: N/A
Replication Package: Partial
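
The difference-in-differences logic named under Key Methods can be illustrated with a minimal two-group, two-period sketch. This is a simplification for intuition only: the paper uses a staggered rollout with robust event-study estimators, which this example does not implement. The data, the agent IDs, and the function names `group_mean` and `did_estimate` are all hypothetical.

```python
from statistics import mean

# Hypothetical toy panel, not data from the paper:
# (agent_id, period, has_ai_access_group, resolutions_per_hour)
# period 0 = before rollout, period 1 = after rollout.
ROWS = [
    ("a1", 0, True, 2.0), ("a1", 1, True, 2.4),
    ("a2", 0, True, 1.8), ("a2", 1, True, 2.2),
    ("b1", 0, False, 2.1), ("b1", 1, False, 2.2),
    ("b2", 0, False, 1.9), ("b2", 1, False, 2.0),
]


def group_mean(rows, treated, period):
    """Average outcome for one group (treated or not) in one period."""
    return mean(y for _, p, t, y in rows if t == treated and p == period)


def did_estimate(rows):
    """Classic 2x2 difference-in-differences.

    (treated post - treated pre) - (control post - control pre).
    The control group's change stands in for what would have happened
    to the treated group without the tool (parallel-trends assumption).
    """
    treated_change = group_mean(rows, True, 1) - group_mean(rows, True, 0)
    control_change = group_mean(rows, False, 1) - group_mean(rows, False, 0)
    return treated_change - control_change


if __name__ == "__main__":
    print(f"DiD estimate: {did_estimate(ROWS):.2f} resolutions per hour")
```

With staggered adoption, a simple comparison like this can be biased when treatment effects vary across cohorts or over time, which is presumably why the study reports the robust estimators listed above.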

Notes

5,179 customer support agents [Claude classification]: This is the first study of generative AI impact in a real-world workplace at scale. The AI system is built on GPT and fine-tuned for customer service. A small initial RCT (50 agents, 7 weeks) was followed by staggered rollout. Authors use LLM embeddings (all-MiniLM-L6-v2) for textual similarity analysis and SiEBERT for sentiment analysis. Paper examines software outages to test for durable learning effects. Authors emphasize they cannot observe aggregate employment/wage effects, skill composition changes, or worker compensation.
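
As a rough illustration of the embedding-based textual similarity mentioned in the notes, the sketch below computes cosine similarity between two vectors. Cosine similarity is a common choice for comparing sentence embeddings, but the notes do not say which metric the authors used, so treat it as an assumption. The short vectors are hypothetical stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not code from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns a value in [-1, 1]; values near 1 mean the vectors point
    in nearly the same direction.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical 3-dimensional stand-ins for message embeddings.
    agent_message = [0.2, 0.7, 0.1]
    ai_suggestion = [0.25, 0.65, 0.05]
    print(cosine_similarity(agent_message, ai_suggestion))
```

In practice the vectors would come from an embedding model such as all-MiniLM-L6-v2, and a higher score between an agent's message and a reference text would indicate closer wording or meaning.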