
Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology

Agarwal, Moehring, Rajpurkar, Salz

2023 · NBER Working Paper Series · 98 citations
Experimental evidence · Causal · Theoretical model
Computer Vision / Image AI · Healthcare · Human-AI collaboration · Decision-making · Augmentation vs. substitution
Abstract

While Artificial Intelligence (AI) algorithms have achieved performance levels comparable to human experts on various predictive tasks, human experts can still access valuable contextual information not yet incorporated into AI predictions. Humans assisted by AI predictions could therefore outperform both humans alone and AI alone. We conduct an experiment with professional radiologists that varies the availability of AI assistance and contextual information to study the effectiveness of human-AI collaboration and to investigate how to optimize it. Our findings reveal that (i) providing AI predictions does not uniformly increase diagnostic quality, and (ii) providing contextual information does increase quality. Radiologists do not fully capitalize on the potential gains from AI assistance because of large deviations from the benchmark Bayesian model with correct belief updating. The observed errors in belief updating can be explained by radiologists' partially underweighting the AI's information relative to their own and not accounting for the correlation between their own information and AI predictions. In light of these biases, we design a collaborative system between radiologists and AI. Our results demonstrate that, unless the documented mistakes can be corrected, the optimal solution involves assigning cases either to humans or to AI, but rarely to a human assisted by AI.

Summary

Agarwal, Moehring, Rajpurkar, and Salz conduct a randomized field experiment with 227 professional radiologists to study how AI assistance affects diagnostic performance and to quantify biases in how humans combine AI predictions with their own information.

Main Finding

AI assistance does not improve average radiologist performance even though the AI outperforms 78% of radiologists. Radiologists exhibit automation neglect (underweighting AI predictions) and signal dependence neglect (treating the AI's signal and their own as independent), so performance improves only when the AI is highly confident and deteriorates when the AI is uncertain.
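
To fix ideas, here is a stylized Grether-style updating rule in log-odds; the symbols (own signal s, AI prediction a, weights beta_own and beta_AI) are introduced purely for illustration and are not the paper's exact specification.

```latex
% Stylized Grether-style posterior for a pathology y, given the radiologist's
% own signal s and the AI prediction a (illustrative notation only):
\[
\log \frac{\Pr(y=1 \mid s, a)}{\Pr(y=0 \mid s, a)}
  = \log \frac{\Pr(y=1)}{\Pr(y=0)}
  + \beta_{\text{own}}\,\ell(s)
  + \beta_{\text{AI}}\,\ell(a),
\qquad
\ell(x) = \log \frac{\Pr(x \mid y=1)}{\Pr(x \mid y=0)}.
\]
% Correct Bayesian updating with conditionally independent signals implies
% beta_own = beta_AI = 1. Automation neglect corresponds to beta_AI < 1, while
% signal dependence neglect corresponds to adding the two likelihood-ratio
% terms as if s and a were independent even when they carry overlapping information.
```

This one-weight-per-signal caricature is coarser than the model estimated in the paper, but it shows where the two named biases enter the updating rule.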

Primary Datasets

Stanford University Hospital chest X-ray patient records (324 cases); Experimental data from 227 professional radiologists recruited through teleradiology companies

Secondary Datasets

CheXpert AI algorithm predictions (Irvin et al. 2019)

Key Methods
Randomized field experiment with professional radiologists using within-participant design across multiple information treatments (X-ray only, clinical history, AI predictions, both); Grether model estimation to quantify automation neglect and signal dependence neglect; optimal delegation analysis comparing human-only, AI-only, and human+AI modalities.
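
As a rough illustration of the delegation comparison, the Python sketch below (hypothetical data and variable names, not the paper's code or results) assigns each case to whichever modality, human-only, AI-only, or human+AI, has the lowest expected diagnostic loss.

```python
import numpy as np

# Minimal sketch of per-case delegation between human-only, AI-only, and
# human+AI readings. All losses here are random placeholders; in practice
# they would be estimated per case from the experimental data.
rng = np.random.default_rng(0)
n_cases = 1_000

modalities = ["human", "ai", "human+ai"]
# Hypothetical expected loss (e.g., probability of a wrong call) per case and modality.
loss_matrix = rng.uniform(0.05, 0.35, size=(n_cases, len(modalities)))

# Optimal delegation: route each case to its lowest-loss modality.
assignment = loss_matrix.argmin(axis=1)
for i, name in enumerate(modalities):
    print(f"{name:>9}: {(assignment == i).mean():.1%} of cases")

# Average loss under delegation vs. the best single modality used everywhere.
print("delegated avg loss:      ", round(loss_matrix.min(axis=1).mean(), 3))
print("best single-modality avg:", round(loss_matrix.mean(axis=0).min(), 3))
```

With placeholder losses the delegated average is mechanically lower; the paper's finding is that, with the documented biases left uncorrected, this kind of assignment rarely routes a case to a human assisted by AI.
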
Sample Period
2023
Geographic Coverage
US (the majority of radiologists serve US patients; 14% are US-based and 15% US-trained; some radiologists are located in Vietnam)
Sample Size
227 radiologists; 41,920 patient-pathology assessments across three experimental designs
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
Replication Package
Partial
Notes
NBER Working Paper 31422. [Claude classification]: Pre-registered experiment (AEA registry AEARCTR-0009620). Three experimental designs used: (1) within-participant across different patients, (2) within-participant with 2-week washout between repeated patients, (3) same patient read twice sequentially. Diagnostic standard constructed from aggregate of 5 board-certified radiologists from Mount Sinai. Paper uses random forest to estimate conditional distributions for belief updating model. Instrumental variables used to address measurement error in radiologist signals. Approximately 60% of radiologists had previous experience with AI tools. November 2025 revision of July 2023 working paper.