
Collaborating with AI Agents: Field Experiments on Teamwork, Productivity, and Performance

Ju, Aral

2025 | arXiv pre-print | 2 citations
Experimental evidence | Interdisciplinary | Causal
LLM / Generative AI | Creative work | Human-AI collaboration | Collective intelligence / teams
Summary

Ju and Aral run a large-scale randomized experiment on their custom Pairit platform to study how collaborating with multimodal AI agents affects teamwork processes, productivity, and ad performance. They randomly assign 2,234 US participants to human-human or human-AI teams. The teams produce 11,024 ads, which are evaluated through human ratings and a field experiment on X.

Main Finding

Compared with human-human teams, human-AI teams produced 50% more ads per worker, with higher text quality but lower image quality. In the field experiment, text quality improved click-through rates and view-through duration, while image quality reduced cost-per-click. The effects were mediated by task-oriented communication (25% increase), delegation (17% increase), and AI recognition.

Primary Datasets

Custom experiment data from Pairit platform (lab experiment); Prolific participant pool; X (Twitter) advertising API data (field experiment); DocSend document view tracking data

Secondary Datasets

None

Key Methods
Large-scale randomized controlled trial using custom Pairit platform; participants randomly assigned to human-human or human-AI teams; mixed methods combining lab experiment, human quality ratings survey, AI quality ratings, and field experiment on X (Twitter) measuring real ad performance
Sample Period
2024-2025
Geographic Coverage
United States
Sample Size
2,234 participants producing 11,024 ads with 182,607 messages, 1,889,559 text edits, 62,119 image edits; separate quality rating survey with 1,195 participants; field experiment with 4,932,373 ad impressions and 7,546 clicks
Level of Analysis
Individual, Task
Occupation Classification
None
Industry Classification
None
Notes
arXiv:2503.18238 [Claude classification]: Combines lab experiment (RCT) with field experiment on X platform. Uses custom-built Pairit platform enabling real-time human-AI collaboration. AI agent can perform equivalent actions to humans (editing, image selection, chat). Field experiment generated 4.9M impressions. GPT-4o-mini used for message labeling and quality ratings (methodological tool). Study preregistered at OSF. No human-alone condition limits ability to isolate AI's marginal contribution beyond human collaboration.
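
The counts in the Sample Size field can be turned into a few rough scale figures. The sketch below is only illustrative arithmetic on the numbers recorded in this entry. The function names are hypothetical, and the pooled averages divide by all 2,234 human participants across both conditions, so they are not the paper's per-worker productivity measure.

```python
# Counts copied from the Sample Size field of this record.
PARTICIPANTS = 2_234
ADS = 11_024
MESSAGES = 182_607
TEXT_EDITS = 1_889_559
IMAGE_EDITS = 62_119
IMPRESSIONS = 4_932_373
CLICKS = 7_546


def per_participant(total: int, participants: int = PARTICIPANTS) -> float:
    """Crude pooled average: a total divided by the number of human participants.

    Assumption: both experimental conditions are pooled together and AI
    teammates are not counted, so this is only a scale indicator.
    """
    return total / participants


def click_through_rate(clicks: int = CLICKS, impressions: int = IMPRESSIONS) -> float:
    """Pooled click-through rate across all ads, as a fraction of impressions."""
    return clicks / impressions


if __name__ == "__main__":
    print(f"ads per participant:        {per_participant(ADS):.2f}")
    print(f"messages per participant:   {per_participant(MESSAGES):.1f}")
    print(f"text edits per participant: {per_participant(TEXT_EDITS):.1f}")
    print(f"image edits per participant: {per_participant(IMAGE_EDITS):.1f}")
    print(f"pooled click-through rate:  {click_through_rate():.3%}")
```

On these counts the pooled figures come to roughly 4.9 ads per participant and a click-through rate of about 0.15% across the field experiment as a whole.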