The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Peng, Kalliamvakou, Cihon, Demirer
2023; MIT Exploration of Generative AI (online publication); 238 citations
Experimental evidence; Causal
LLM / Generative AI; Software / coding; Human-AI collaboration; Augmentation vs. substitution
Abstract
Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogeneous effects show promise for AI pair programmers to help people transition into software development careers.
Summary
Cui et al. conduct two randomized field experiments with 1,974 software developers at Microsoft and Accenture to estimate the causal effect of GitHub Copilot (an AI coding assistant) on developer productivity, measured through pull requests, commits, builds, and code quality metrics tracked via GitHub version control.
Main Finding
Software developers given access to GitHub Copilot completed 12.92% to 21.83% more pull requests per week at Microsoft and 7.51% to 8.69% more at Accenture. The largest and most precise effects come from the SLATE specification, which weights periods with higher compliance. Accenture developers also showed an 84% to 107% increase in successful builds.
- Primary Datasets
- Microsoft internal GitHub data (1,663 developers, September 2022-September 2023); Accenture internal GitHub data (311 developers, July 2022-November 2023); GitHub Copilot usage telemetry data
- Key Methods
- Field experiments with randomized assignment to GitHub Copilot access; instrumental variables estimation using treatment assignment as instrument for Copilot adoption; SLATE (Super Local Average Treatment Effect) specification weighting periods by compliance differences
- Sample Period
- 2022-2023
- Geographic Coverage
- US (Microsoft); Southeast Asia (Accenture)
- Sample Size
- 1,974 total developers (1,663 at Microsoft, 311 at Accenture); weekly observations over 7 months (Microsoft) and 16 months (Accenture)
- Level of Analysis
- Individual
- Occupation Classification
- None
- Industry Classification
- None
Notes
arXiv:2302.06590
[Claude classification]: This is a preview/working paper version published online at MIT. Low compliance at Microsoft (initial 8.6% uptake) and organizational changes at Accenture limit precision. Control group at Microsoft given access after experiment ended. Pre-treatment imbalance on commits variable at Accenture. SLATE specification improves precision by weighting periods with larger treatment-control differences in Copilot uptake. Authors note they cannot discuss Copilot's architecture but mention it benefits from GPT-4 gains.
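The Key Methods entry describes instrumental variables estimation with treatment assignment as the instrument for Copilot adoption, and the note above flags low compliance. The sketch below illustrates that general idea with a simple Wald (ratio) estimator on simulated data. It is not the authors' specification and does not reproduce the SLATE period weighting; the function names, the 40% uptake rate, and the effect size of 2.0 are all hypothetical values chosen for illustration.

```python
import random


def wald_iv_estimate(assigned, adopted, outcome):
    """Wald IV estimate: (effect of assignment on outcome) /
    (effect of assignment on adoption)."""
    def mean(values):
        return sum(values) / len(values)

    # Mean outcome and mean adoption within each assignment arm.
    y_treat = mean([y for z, y in zip(assigned, outcome) if z == 1])
    y_ctrl = mean([y for z, y in zip(assigned, outcome) if z == 0])
    d_treat = mean([d for z, d in zip(assigned, adopted) if z == 1])
    d_ctrl = mean([d for z, d in zip(assigned, adopted) if z == 0])

    first_stage = d_treat - d_ctrl  # difference in uptake between arms
    if first_stage == 0:
        raise ValueError("assignment does not shift adoption; estimate undefined")
    return (y_treat - y_ctrl) / first_stage


def simulate(n=20000, uptake=0.4, true_effect=2.0, seed=0):
    """Simulated experiment with one-sided noncompliance (all values hypothetical)."""
    rng = random.Random(seed)
    assigned, adopted, outcome = [], [], []
    for _ in range(n):
        z = rng.randint(0, 1)  # randomized access to the tool
        # Only assigned developers can adopt, and only some of them do.
        d = 1 if (z == 1 and rng.random() < uptake) else 0
        y = 10.0 + true_effect * d + rng.gauss(0, 1)  # e.g. weekly output
        assigned.append(z)
        adopted.append(d)
        outcome.append(y)
    return assigned, adopted, outcome


if __name__ == "__main__":
    z, d, y = simulate()
    print(f"Wald IV estimate: {wald_iv_estimate(z, d, y):.2f}")
```

Because only a fraction of assigned developers adopt the tool, a raw comparison of assigned and unassigned groups understates the effect on adopters. Dividing by the difference in uptake rescales it, and a smaller uptake difference makes the ratio noisier, which is consistent with the note that low compliance limits precision.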