Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology
Agarwal, Moehring, Rajpurkar, Salz
2023 · NBER Working Paper Series · 98 citations
Experimental evidence · Causal · Theoretical model
Computer Vision / Image AI · Healthcare · Human-AI collaboration · Decision-making · Augmentation vs. substitution
Abstract
While Artificial Intelligence (AI) algorithms have achieved performance levels comparable to human experts on various predictive tasks, human experts can still access valuable contextual information not yet incorporated into AI predictions. Humans assisted by AI predictions could therefore outperform either humans alone or AI alone. We conduct an experiment with professional radiologists that varies the availability of AI assistance and contextual information to study the effectiveness of human-AI collaboration and to investigate how to optimize it. Our findings reveal that (i) providing AI predictions does not uniformly increase diagnostic quality, and (ii) providing contextual information does increase quality. Radiologists do not fully capitalize on the potential gains from AI assistance because of large deviations from the benchmark Bayesian model with correct belief updating. The observed errors in belief updating can be explained by radiologists partially underweighting the AI's information relative to their own and not accounting for the correlation between their own information and AI predictions. In light of these biases, we design a collaborative system between radiologists and AI. Our results demonstrate that, unless the documented mistakes can be corrected, the optimal solution involves assigning cases either to humans or to AI, but rarely to a human assisted by AI.
Summary
Agarwal, Moehring, Rajpurkar, and Salz conduct a randomized field experiment with 227 professional radiologists to study how AI assistance affects diagnostic performance and to quantify biases in how humans combine AI predictions with their own information.
Main Finding
AI assistance does not improve average radiologist performance despite the AI outperforming 78% of radiologists. Radiologists exhibit automation neglect (underweighting AI predictions) and signal dependence neglect (treating the AI's signal and their own as independent), so performance improves when the AI is highly confident but worsens when it is uncertain.
Primary Datasets
Stanford University Hospital chest X-ray patient records (324 cases); Experimental data from 227 professional radiologists recruited through teleradiology companies
Secondary Datasets
CheXpert AI algorithm predictions (Irvin et al. 2019)
Key Methods
Randomized field experiment with professional radiologists using a within-participant design across multiple information treatments (X-ray only, clinical history, AI predictions, both); Grether model estimation to quantify automation neglect and signal dependence neglect; optimal delegation analysis comparing human-only, AI-only, and human+AI modalities.
Sample Period
2023
Geographic Coverage
US (majority of radiologists serve US patients; 14% US-based radiologists, 15% US-trained; some radiologists from Vietnam)
Sample Size
227 radiologists; 41,920 patient-pathology assessments across three experimental designs
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
Replication Package
Partial
Notes
NBER Working Paper 31422
[Claude classification]: Pre-registered experiment (AEA registry AEARCTR-0009620). Three experimental designs used: (1) within-participant across different patients, (2) within-participant with 2-week washout between repeated patients, (3) same patient read twice sequentially. Diagnostic standard constructed from aggregate of 5 board-certified radiologists from Mount Sinai. Paper uses random forest to estimate conditional distributions for belief updating model. Instrumental variables used to address measurement error in radiologist signals. Approximately 60% of radiologists had previous experience with AI tools. November 2025 revision of July 2023 working paper.
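The Grether-style belief-updating model mentioned above can be sketched as a simple regression of log posterior odds on log prior odds and the log likelihood ratio of the AI signal; coefficients of 1 correspond to correct Bayesian updating, while a signal coefficient below 1 indicates underweighting (the paper's "automation neglect"). The following is a minimal simulation, not the paper's code; the sample size, noise level, and the simulated signal weight of 0.6 are illustrative assumptions.

```python
# Hedged sketch of Grether-model estimation on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
log_prior_odds = rng.normal(0.0, 1.0, n)
log_lr_ai = rng.normal(0.0, 1.0, n)  # log likelihood ratio of the AI signal

# Simulate a reader who underweights the AI signal (true weight 0.6
# instead of the Bayesian benchmark of 1.0), plus reporting noise.
log_posterior_odds = (1.0 * log_prior_odds + 0.6 * log_lr_ai
                      + rng.normal(0.0, 0.1, n))

# OLS with X = [1, log prior odds, log LR] recovers (c, alpha, beta).
X = np.column_stack([np.ones(n), log_prior_odds, log_lr_ai])
coef, *_ = np.linalg.lstsq(X, log_posterior_odds, rcond=None)
c, alpha, beta = coef
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")  # beta < 1: underweighting
```

In the paper's richer setting the likelihood ratios are themselves estimated (via the random forest noted above) and measurement error is handled with instrumental variables; this sketch only illustrates the interpretation of the updating coefficients.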