This site is undergoing review. Some annotations were human-generated, some AI-generated — all are being verified.

The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

Peng, Kalliamvakou, Cihon, Demirer

2023 | MIT Exploration of Generative AI (online publication) | 238 citations

Experimental evidence | Causal

LLM / Generative AI | Software / coding | Human-AI collaboration | Augmentation vs. substitution

Abstract

Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogenous effects show promise for AI pair programmers to help people transition into software development careers.

Summary

Cui et al. conduct two randomized field experiments with 1,974 software developers at Microsoft and Accenture to estimate the causal effect of GitHub Copilot (an AI coding assistant) on developer productivity measured through pull requests, commits, builds, and code quality metrics tracked via GitHub version control.

Main Finding

Software developers given access to GitHub Copilot completed 12.92% to 21.83% more pull requests per week at Microsoft and 7.51% to 8.69% more at Accenture. The largest and most precise effects come from the SLATE specification, which gives more weight to periods with higher compliance. Accenture developers also showed an 84% to 107% increase in successful builds.

Primary Datasets

Peng GitHub Copilot Lab

AI-focused

Ziegler GitHub Productivity

AI-focused

Microsoft internal GitHub data (1,663 developers, September 2022-September 2023); Accenture internal GitHub data (311 developers, July 2022-November 2023); GitHub Copilot usage telemetry data

Secondary Datasets

None

Key Methods: Field experiments with randomized assignment to GitHub Copilot access; instrumental variables estimation using treatment assignment as instrument for Copilot adoption; SLATE (Super Local Average Treatment Effect) specification weighting periods by compliance differences
Sample Period: 2022-2023
Geographic Coverage: US (Microsoft); Southeast Asia (Accenture)
Sample Size: 1,974 total developers (1,663 at Microsoft, 311 at Accenture); weekly observations over 7 months (Microsoft) and 16 months (Accenture)
Level of Analysis: Individual
Occupation Classification: None
Industry Classification: None
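
The instrumental-variables logic named under Key Methods can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy inputs, and in particular the compliance-weighting formula are assumptions made for illustration. The first function computes a plain Wald estimate (the effect of assignment on the outcome divided by the effect of assignment on adoption). The second pools per-period estimates with weights equal to each period's treatment-control gap in adoption, which is one simple way to favor high-compliance periods and may differ from the paper's exact SLATE estimator.

```python
from statistics import mean


def wald_iv(assigned, adopted, outcome):
    """Wald/IV estimate using random assignment as the instrument.

    assigned: 0/1 flags for being offered the tool (the instrument)
    adopted:  0/1 flags for actually using the tool (the treatment)
    outcome:  productivity measure, e.g. weekly pull requests
    Returns (itt, first_stage, late).
    """
    treat = [i for i, z in enumerate(assigned) if z == 1]
    ctrl = [i for i, z in enumerate(assigned) if z == 0]

    # Intent-to-treat effect: difference in mean outcomes by assignment.
    itt = mean(outcome[i] for i in treat) - mean(outcome[i] for i in ctrl)
    # First stage: difference in adoption rates by assignment.
    first_stage = mean(adopted[i] for i in treat) - mean(adopted[i] for i in ctrl)

    if first_stage == 0:
        raise ValueError("assignment does not shift adoption; IV is undefined")
    return itt, first_stage, itt / first_stage


def compliance_weighted_iv(periods):
    """Pool per-period (itt, first_stage) pairs, weighting by first_stage.

    Hypothetical weighting scheme: equivalent to using assignment scaled by
    each period's adoption gap as the instrument, so periods where assignment
    barely moved adoption contribute little to the estimate.
    """
    numerator = sum(fs * itt for itt, fs in periods)
    denominator = sum(fs * fs for _, fs in periods)
    if denominator == 0:
        raise ValueError("no period has a nonzero adoption gap")
    return numerator / denominator
```

With low compliance, the plain Wald ratio divides by a small first stage and becomes noisy, which is why a weighting that emphasizes periods with a larger adoption gap can improve precision.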

Notes

arXiv:2302.06590 [Claude classification]: This is a preview/working paper version published online at MIT. Low compliance at Microsoft (initial 8.6% uptake) and organizational changes at Accenture limit precision. Control group at Microsoft given access after experiment ended. Pre-treatment imbalance on commits variable at Accenture. SLATE specification improves precision by weighting periods with larger treatment-control differences in Copilot uptake. Authors note they cannot discuss Copilot's architecture but mention it benefits from GPT-4 gains.