This site is a work in progress and has not been widely shared. Content may contain errors. Feedback is welcome.

The ABC's of Who Benefits from Working with AI: Ability, Beliefs, and Calibration

Caplin, Deming, Li, Martín, Marx, Weidmann, Ye

2024 · NBER Working Paper Series · 7 citations
Experimental evidence · Causal
AI (General) · Human-AI collaboration · Decision-making · Augmentation vs. substitution
Summary

Caplin et al. conduct a randomized online experiment with 732 Prolific participants who classify face images as over or under 21 years old. Participants are randomly assigned to receive AI assistance, allowing the authors to study how individual ability and belief calibration jointly determine the gains from working with AI.

Main Finding

AI assistance improves prediction accuracy by 6.9 percentage points on average. Low-ability but well-calibrated individuals gain the most (nearly 10 percentage points). A one-standard-deviation increase in calibration raises the benefit from AI by 1.4 percentage points, about 20% of the average treatment effect. In a counterfactual with perfectly calibrated participants, AI would reduce performance inequality by 61% rather than the observed 34%.

Primary Datasets

IMDB-WIKI face image dataset (for stimuli); Experimental data collected on Prolific platform

Secondary Datasets

Raven's Progressive Matrices test (14 questions for measuring cognitive ability)

Key Methods
Randomized online experiment with 732 participants on the Prolific platform. Participants classified 160 face images as over or under 21 years old; the treatment group received AI predictions for half of the images. The authors measure individual ability and belief calibration from the control block, then estimate heterogeneous treatment effects using regressions with interaction terms.
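The interaction-term design can be sketched as below. The simulated data and coefficient values are illustrative assumptions chosen to mirror the paper's headline numbers, not the authors' actual data or code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 732  # participant count from the paper

# Hypothetical standardized covariates and random assignment to AI assistance
ability = rng.standard_normal(n)
calibration = rng.standard_normal(n)
treated = rng.integers(0, 2, size=n)

# Simulated accuracy (in pp): an assumed +6.9 average AI effect that is
# larger for low-ability, well-calibrated participants
accuracy = (
    70.0 + 5.0 * ability
    + treated * (6.9 - 2.0 * ability + 1.4 * calibration)
    + rng.normal(0.0, 3.0, size=n)
)

# OLS with treatment interactions: columns are
# [const, ability, calibration, treated, treated*ability, treated*calibration]
X = np.column_stack([
    np.ones(n), ability, calibration, treated,
    treated * ability, treated * calibration,
])
beta, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
# beta[3]: average treatment effect; beta[4], beta[5]: heterogeneity terms
```

With this simulated sample, the recovered coefficients land near the assumed values (roughly 6.9, -2.0, and 1.4), which is the sense in which interaction terms identify who benefits most.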
Sample Period
2024
Geographic Coverage
United States
Sample Size
732 participants completing 160 rounds each (116,845 total responses after excluding time-limit exceedances)
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
Notes
NBER Working Paper 33021. [Claude classification]: Pre-registered experiment (https://aspredicted.org/pm63-gvdv.pdf). Uses a binarized scoring rule for incentive-compatible belief elicitation. Measures ability as expected accuracy at the 50% threshold and calibration as the negative absolute value of net confidence (the difference between average confidence and average correctness). Robustness checks include the ORIV methodology to address measurement error and alternative performance measures (AUC, Grether calibration functional form).
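As a rough illustration of the ability and calibration measures described in the notes, a minimal sketch follows. The function names and the exact aggregation are assumptions for exposition, not the paper's code:

```python
import numpy as np

def calibration_score(confidence, correct):
    """Negative absolute net confidence: 0 means perfectly calibrated.

    confidence: stated probability of being right on each round (0-1)
    correct: 1 if the classification was right on that round, else 0
    """
    net = np.mean(confidence) - np.mean(correct)  # over- or under-confidence
    return -abs(net)

def ability_score(confidence):
    """Expected accuracy when classifying at the 50% threshold: assuming
    stated beliefs are correct, the chosen label is right with
    probability max(p, 1 - p) on each round."""
    p = np.asarray(confidence, dtype=float)
    return float(np.mean(np.maximum(p, 1.0 - p)))
```

Under these assumed definitions, a participant whose average confidence matches their average correctness scores 0 (best possible), and any gap is penalized symmetrically.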