This site is a work in progress and has not been widely shared. Content may contain errors. Feedback is welcome.

The ABC's of Who Benefits from Working with AI: Ability, Beliefs, and Calibration

Caplin, Deming, Li, Martín, Marx, Weidmann, Ye

2024 · NBER Working Paper Series · 7 citations
Experimental evidence · Causal
AI (General) · Human-AI collaboration · Decision-making · Augmentation vs. substitution
Summary

Caplin et al. conduct a randomized online experiment with 732 Prolific participants who classify face images as over or under 21 years old. Participants are randomly assigned to receive AI assistance, allowing the authors to study how individual ability and belief calibration jointly determine the gains from working with AI.

Main Finding

AI assistance improves prediction accuracy by 6.9 percentage points on average. Low-ability but well-calibrated individuals gain the most (nearly 10 percentage points). A one-standard-deviation increase in calibration raises the benefit from AI by 1.4 percentage points, about 20% of the average treatment effect. In a counterfactual with perfectly calibrated participants, AI would reduce performance inequality by 61% rather than the observed 34%.

Primary Datasets

IMDB-WIKI face image dataset (for stimuli); Experimental data collected on Prolific platform

Secondary Datasets

Raven's Progressive Matrices test (14 questions for measuring cognitive ability)

Key Methods
Randomized online experiment with 732 participants on the Prolific platform. Participants classified 160 face images as over or under 21 years old; the treatment group received AI predictions for half of the images. The authors measure individual ability and belief calibration from the control block, then estimate heterogeneous treatment effects using regressions with interaction terms.
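The interaction-term design can be sketched as below. The simulated data and coefficient values are illustrative assumptions chosen to mirror the paper's headline numbers, not the authors' actual data or code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 732  # participant count from the paper

# Hypothetical standardized covariates and random assignment to AI assistance
ability = rng.standard_normal(n)
calibration = rng.standard_normal(n)
treated = rng.integers(0, 2, size=n)

# Simulated accuracy (in pp): an assumed +6.9 average AI effect that is
# larger for low-ability, well-calibrated participants
accuracy = (
    70.0 + 5.0 * ability
    + treated * (6.9 - 2.0 * ability + 1.4 * calibration)
    + rng.normal(0.0, 3.0, size=n)
)

# OLS with treatment interactions: columns are
# [const, ability, calibration, treated, treated*ability, treated*calibration]
X = np.column_stack([
    np.ones(n), ability, calibration, treated,
    treated * ability, treated * calibration,
])
beta, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
# beta[3]: average treatment effect; beta[4], beta[5]: heterogeneity terms
```

With this simulated sample, the recovered coefficients land near the assumed values (roughly 6.9, -2.0, and 1.4), which is the sense in which interaction terms identify who benefits most.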
Sample Period
2024
Geographic Coverage
United States
Sample Size
732 participants completing 160 rounds each (116,845 total responses after excluding time-limit exceedances)
Level of Analysis
Individual
Occupation Classification
None
Industry Classification
None
Notes
NBER Working Paper 33021. [Claude classification]: Pre-registered experiment (https://aspredicted.org/pm63-gvdv.pdf). Uses a binarized scoring rule for incentive-compatible belief elicitation. Measures ability as expected accuracy at the 50% threshold and calibration as the negative absolute value of net confidence (the difference between average confidence and average correctness). Robustness checks include the ORIV methodology to address measurement error and alternative performance measures (AUC, Grether calibration functional form).
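As a rough illustration of the ability and calibration measures described in the notes, a minimal sketch follows. The function names and the exact aggregation are assumptions for exposition, not the paper's code:

```python
import numpy as np

def calibration_score(confidence, correct):
    """Negative absolute net confidence: 0 means perfectly calibrated.

    confidence: stated probability of being right on each round (0-1)
    correct: 1 if the classification was right on that round, else 0
    """
    net = np.mean(confidence) - np.mean(correct)  # over- or under-confidence
    return -abs(net)

def ability_score(confidence):
    """Expected accuracy when classifying at the 50% threshold: assuming
    stated beliefs are correct, the chosen label is right with
    probability max(p, 1 - p) on each round."""
    p = np.asarray(confidence, dtype=float)
    return float(np.mean(np.maximum(p, 1.0 - p)))
```

Under these assumed definitions, a participant whose average confidence matches their average correctness scores 0 (best possible), and any gap is penalized symmetrically.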