The ABC's of Who Benefits from Working with AI: Ability, Beliefs, and Calibration
Caplin, Deming, Li, Martín, Marx, Weidmann, Ye
2024 · NBER Working Paper Series · 7 citations
Experimental evidence · Causal
AI (General) · Human-AI collaboration · Decision-making · Augmentation vs. substitution
Summary
Caplin et al. conduct a randomized online experiment with 732 Prolific participants who classify face images as over/under 21 years old, with random assignment to AI assistance, to study how individual ability and belief calibration jointly determine gains from working with AI.
Main Finding
AI assistance improves prediction accuracy by 6.9 percentage points on average. Low-ability but well-calibrated individuals gain the most (nearly 10 percentage points). A one standard deviation increase in calibration increases AI benefits by 1.4 percentage points (20% of the average treatment effect). In a counterfactual with perfect calibration, AI would reduce performance inequality by 61% instead of 34%.
Primary Datasets
IMDB-WIKI face image dataset (for stimuli); Experimental data collected on Prolific platform
Secondary Datasets
Raven's Progressive Matrices test (14 questions for measuring cognitive ability)
Key Methods
Randomized online experiment with 732 participants on the Prolific platform. Participants classified 160 face images as over/under 21 years old. The treatment group received AI predictions for half of the images. The authors measure individual ability and belief calibration from the control block, then estimate heterogeneous treatment effects using regressions with interaction terms.
Sample Period
2024
Geographic Coverage
United States
Sample Size
732 participants completing 160 rounds each (116,845 total responses after excluding responses that exceeded the time limit)
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
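The heterogeneous-treatment-effect regression described under Key Methods can be sketched in a few lines of numpy. This is an illustrative simulation, not the paper's code: the covariates, noise level, and coefficient values (seeded to roughly match the reported 6.9 pp average effect and 1.4 pp calibration interaction) are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 732  # participants, matching the paper's sample size

# Hypothetical standardized covariates and random AI-assistance assignment
ability = rng.standard_normal(n)
calibration = rng.standard_normal(n)
ai = rng.integers(0, 2, n).astype(float)

# Simulated outcome: baseline accuracy + average AI effect + interactions + noise
accuracy = (0.70 + 0.069 * ai + 0.03 * ability
            + 0.014 * ai * calibration + 0.02 * rng.standard_normal(n))

# OLS with interaction terms: accuracy ~ ai * (ability + calibration)
X = np.column_stack([np.ones(n), ai, ability, calibration,
                     ai * ability, ai * calibration])
beta, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
# beta[1]: average AI treatment effect
# beta[5]: additional gain per SD of calibration under AI assistance
```

With this design, the interaction coefficient `beta[5]` recovers how the AI benefit scales with calibration, which is the quantity behind the "20% of the average treatment effect" finding.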
Notes
NBER Working Paper No. 33021
[Claude classification]: Pre-registered experiment (https://aspredicted.org/pm63-gvdv.pdf). Uses binarized scoring rule for incentive-compatible belief elicitation. Measures ability as expected accuracy at 50% threshold, calibration as negative absolute value of net confidence (difference between confidence and correctness). Robustness checks include ORIV methodology to address measurement error and alternative performance measures (AUC, Grether calibration functional form).
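The ability and calibration measures described above can be illustrated with a minimal sketch. This is a simplification, not the paper's exact construction: ability is proxied here by raw control-block accuracy rather than expected accuracy at the 50% threshold, and all variable names and the toy data are hypothetical.

```python
import numpy as np

def ability_and_calibration(confidence, correct):
    """Summarize a participant's control-block responses (illustrative).

    confidence : stated probability the chosen answer is correct, in [0.5, 1]
    correct    : 1 if the classification was right, else 0
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Ability proxy: realized accuracy on the control block.
    ability = correct.mean()
    # Calibration: negative absolute net confidence, i.e. minus the gap
    # between mean stated confidence and realized accuracy
    # (0 = perfectly calibrated, more negative = worse calibrated).
    calibration = -abs(confidence.mean() - correct.mean())
    return ability, calibration

# Toy example: an overconfident participant (high confidence, 50% accuracy)
a, c = ability_and_calibration([0.9, 0.8, 0.95, 0.85], [1, 0, 1, 0])
```

Under this construction an overconfident participant gets a strictly negative calibration score, consistent with the paper's use of negative absolute net confidence.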