The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers

Cui, Demirer, Jaffe, Musolff, Peng, Salz

2024 | Working paper | 43 citations
Experimental evidence | Causal
LLM / Generative AI | Software / coding | Junior / entry-level | Human-AI collaboration | Augmentation vs. substitution
Summary

Cui et al. analyze three randomized controlled trials at Microsoft, Accenture, and an anonymous Fortune 100 company, involving 4,867 software developers, to estimate the causal effect of access to GitHub Copilot (an AI coding assistant) on developer productivity in real workplace settings.

Main Finding

Using GitHub Copilot causes a 26.08% (SE: 10.3%) increase in weekly completed tasks among software developers. Gains are significantly larger for less experienced developers (shorter tenure and more junior positions), who also adopted Copilot at higher rates.
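To put the reported standard error in context, the point estimate can be turned into an approximate 95% interval. This is a minimal sketch that assumes a normal approximation and that the SE is expressed in the same percentage units as the estimate; the variable names are illustrative, and the interval reported in the paper itself may differ.

```python
# Normal-approximation 95% interval from the reported estimate and SE.
# Assumption: the SE (10.3) is in the same percentage units as the estimate.
estimate = 26.08   # percent increase in weekly completed tasks (from the entry)
std_error = 10.3   # reported standard error

z_crit = 1.96                      # two-sided 95% critical value
lower = estimate - z_crit * std_error
upper = estimate + z_crit * std_error
z_stat = estimate / std_error      # ratio of estimate to its SE

print(f"approx. 95% interval: [{lower:.2f}%, {upper:.2f}%], z = {z_stat:.2f}")
```

Under these assumptions the interval is wide but excludes zero, which is consistent with the statistical power challenges mentioned in the notes.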

Primary Datasets

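Developer-week activity data (pull requests, commits, builds, build success rate) and Copilot usage data from three field experiments at Microsoft, Accenture, and an anonymous Fortune 100 company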

Secondary Datasets

None

Key Methods
Randomized controlled trials (field experiments) with instrumental variable regression using experimental assignment to instrument for actual Copilot usage; weighted IV approach that weights periods by treatment-control adoption differences; developer and week fixed effects
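The core idea of instrumenting actual usage with random assignment can be sketched on simulated data. The example below is a simplified illustration, not the authors' specification: it uses a plain Wald estimator (intent-to-treat effect divided by the adoption gap), omits the developer and week fixed effects and the period weighting, and all data, parameter values, and names such as `wald_iv` and `motivation` are hypothetical.

```python
import numpy as np

def wald_iv(y, d, z):
    """IV estimate with a binary instrument: ITT effect / first-stage adoption gap."""
    y, d, z = map(np.asarray, (y, d, z))
    itt = y[z == 1].mean() - y[z == 0].mean()          # effect of assignment on output
    first_stage = d[z == 1].mean() - d[z == 0].mean()  # effect of assignment on usage
    return itt / first_stage

# Simulated data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
n = 20_000
z = rng.integers(0, 2, size=n)        # random assignment to Copilot access
motivation = rng.normal(size=n)       # unobserved trait driving both usage and output

# Imperfect compliance: assignment raises adoption, but so does motivation.
p_adopt = np.clip(0.1 + 0.5 * z + 0.1 * motivation, 0, 1)
d = (rng.random(n) < p_adopt).astype(int)

true_effect = 0.5
y = 2.0 + true_effect * d + 0.8 * motivation + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()  # confounded user vs. non-user comparison
iv = wald_iv(y, d, z)
print(f"naive difference: {naive:.2f}, IV estimate: {iv:.2f}, truth: {true_effect}")
```

In this simulation the naive comparison of users and non-users overstates the effect because motivated developers both adopt more and produce more, while the IV estimate relies only on the variation in usage induced by random assignment.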
Sample Period
Microsoft: 8 months; Accenture: 4 months; Anonymous Company: 2 months (staggered rollout)
Geographic Coverage
Not specified
Sample Size
4,867 software developers across three experiments (Microsoft: 1,521; Accenture: 316; Anonymous Company: 3,030); developer-week observations
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
Replication Package
Partial
Notes
SSRN Electronic Journal. [Claude classification]: Post-registered as AEARCTR-0014530. Three separate field experiments with different designs: Microsoft (8 months, 50% treatment), Accenture (4 months, 61% treatment), Anonymous Company (2 months, staggered rollout). Imperfect compliance required an IV approach. Statistical power challenges due to large outcome variance and a high fraction of zero-output weeks. Weighted IV estimator places more weight on periods with larger treatment-control adoption differences. Additional abandoned Accenture experiment discussed in appendix (42% layoff, missing usage data). Outcomes: pull requests (primary), commits, builds, build success rate. Microsoft data includes tenure and seniority, allowing heterogeneity analysis.
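The weighting idea in the notes (more weight on periods with larger treatment-control adoption differences) can be illustrated with a small sketch. The specific scheme here, weights proportional to the squared adoption gap, is an assumption chosen for illustration and is not confirmed as the paper's estimator; the function name `weighted_iv` and the three-period numbers are hypothetical.

```python
import numpy as np

def weighted_iv(itt, adoption_gap):
    """Combine per-period Wald ratios, weighting each by its squared adoption gap.

    Assumption: weights proportional to gap**2, so periods where assignment moved
    usage more (a stronger first stage) contribute more to the pooled estimate.
    """
    itt = np.asarray(itt, dtype=float)
    gap = np.asarray(adoption_gap, dtype=float)
    weights = gap**2 / np.sum(gap**2)
    per_period = itt / gap               # period-by-period IV (Wald) estimates
    return float(np.sum(weights * per_period)), weights

# Hypothetical values for three periods (not from the paper).
adoption_gap = [0.10, 0.30, 0.50]   # treatment-minus-control adoption rate
itt_effect = [0.08, 0.15, 0.25]     # treatment-minus-control outcome difference

estimate, weights = weighted_iv(itt_effect, adoption_gap)
unweighted = float(np.mean(np.asarray(itt_effect) / np.asarray(adoption_gap)))
print(f"weights: {np.round(weights, 3)}")
print(f"weighted IV: {estimate:.3f}, unweighted mean of ratios: {unweighted:.3f}")
```

The first period has a small adoption gap, so its ratio is noisy in practice; down-weighting it keeps a weak first stage from dominating the pooled estimate.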