The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers

Cui, Demirer, Jaffe, Musolff, Peng, Salz

2024 | Working paper | 43 citations

Experimental evidence | Causal

LLM / Generative AI | Software / coding | Junior / entry-level | Human-AI collaboration | Augmentation vs. substitution

DOI: 10.2139/ssrn.4945566

Summary

Cui et al. analyze three randomized controlled trials, run at Microsoft, Accenture, and an anonymous Fortune 100 company and covering 4,867 software developers, to estimate the causal effect of access to GitHub Copilot (an AI coding assistant) on developer productivity in real workplace settings.

Main Finding

Using GitHub Copilot causes a 26.08% (SE: 10.3%) increase in weekly completed tasks among software developers, with significantly larger gains for less experienced developers (shorter tenure and more junior positions), who also had higher adoption rates.
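
To give a sense of how precise this estimate is, the sketch below turns the reported point estimate and standard error into an approximate 95% interval. It assumes the standard error is expressed in percentage points and that a normal approximation is adequate; the function name `normal_ci` is hypothetical, and the paper's own intervals may be computed differently.

```python
def normal_ci(estimate, std_error, z=1.96):
    """Approximate confidence interval under a normal approximation.

    Assumes estimate and std_error share the same units (here, percentage points).
    """
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Values taken from the Main Finding above: 26.08% with SE 10.3%.
low, high = normal_ci(26.08, 10.3)
print(f"{low:.2f}% to {high:.2f}%")  # prints "5.89% to 46.27%"
```

Under these assumptions the interval excludes zero but is wide, which is consistent with the statistical power limits mentioned in the notes.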

Primary Datasets

Cui Multi-Company Developers

AI-focused

Secondary Datasets

None

Key Methods: Randomized controlled trials (field experiments) with instrumental variable regression using experimental assignment to instrument for actual Copilot usage; weighted IV approach that weights periods by treatment-control adoption differences; developer and week fixed effects
Sample Period: Microsoft: 8 months; Accenture: 4 months; Anonymous Company: 2 months (staggered rollout)
Geographic Coverage: Not specified
Sample Size: 4,867 software developers across three experiments (Microsoft: 1,521; Accenture: 316; Anonymous Company: 3,030); developer-week observations
Level of Analysis: Individual
Occupation Classification: None
Industry Classification: None
Replication Package: Partial

Notes

SSRN Electronic Journal

[Claude classification]: Post-registered as AEARCTR-0014530. Three separate field experiments with different designs: Microsoft (8 months, 50% treatment), Accenture (4 months, 61% treatment), Anonymous Company (2 months, staggered rollout). Imperfect compliance required an IV approach. Statistical power was limited by large outcome variance and a high fraction of zero-output weeks. The weighted IV estimator places more weight on periods with larger treatment-control adoption differences. An additional, abandoned Accenture experiment is discussed in the appendix (42% layoff, missing usage data). Outcomes: pull requests (primary), commits, builds, build success rate. Microsoft data include tenure and seniority, allowing heterogeneity analysis.
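
The instrumental-variable logic described in the methods and notes can be sketched on simulated data. This is a minimal illustration, not the authors' estimator. The function names, adoption rates, effect size, and the specific weighting rule (each period's outcome gap weighted by its adoption gap) are all assumptions made for this example. It also omits the developer and week fixed effects used in the paper.

```python
import random
from statistics import mean


def simulate_period(n, adopt_if_assigned, adopt_if_control, effect, rng):
    """Simulate one period with imperfect compliance (all numbers hypothetical)."""
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)   # random assignment to Copilot access
        u = rng.random()        # unobserved trait driving both adoption and output
        d = int(u < (adopt_if_assigned if z else adopt_if_control))  # actual usage
        y = 2.0 + effect * d + u + rng.gauss(0.0, 1.0)               # weekly output
        rows.append((z, d, y))
    return rows


def itt_gaps(rows):
    """Return (outcome gap, adoption gap) between assigned and control groups."""
    y1 = mean(y for z, _, y in rows if z == 1)
    y0 = mean(y for z, _, y in rows if z == 0)
    d1 = mean(d for z, d, _ in rows if z == 1)
    d0 = mean(d for z, d, _ in rows if z == 0)
    return y1 - y0, d1 - d0


def wald_iv(rows):
    """IV (Wald) estimate: outcome gap divided by adoption gap."""
    outcome_gap, adoption_gap = itt_gaps(rows)
    if adoption_gap == 0:
        raise ValueError("assignment does not shift adoption; IV is undefined")
    return outcome_gap / adoption_gap


def weighted_iv(periods):
    """Combine periods, weighting each by its treatment-control adoption gap.

    Illustrative rule (an assumption): sum(gap_d * gap_y) / sum(gap_d ** 2).
    """
    gaps = [itt_gaps(rows) for rows in periods]
    numerator = sum(d_gap * y_gap for y_gap, d_gap in gaps)
    denominator = sum(d_gap * d_gap for _, d_gap in gaps)
    if denominator == 0:
        raise ValueError("no adoption gap in any period; IV is undefined")
    return numerator / denominator


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical adoption rates (assigned, control) for three periods.
    rates = [(0.10, 0.05), (0.40, 0.10), (0.70, 0.10)]
    periods = [simulate_period(20000, a, c, 0.5, rng) for a, c in rates]
    pooled = [row for rows in periods for row in rows]

    users = mean(y for _, d, y in pooled if d == 1)
    non_users = mean(y for _, d, y in pooled if d == 0)
    print("naive users vs. non-users:", round(users - non_users, 3))
    print("pooled Wald IV:           ", round(wald_iv(pooled), 3))
    print("adoption-gap-weighted IV: ", round(weighted_iv(periods), 3))
```

In this simulation the naive comparison of users and non-users is biased, because the unobserved trait drives both adoption and output. The two IV estimates rely only on random assignment and should land near the simulated effect of 0.5. Weighting by the adoption gap downweights periods in which assignment barely changed usage, since those periods carry little information about the effect of use.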