We'll be the first to say that you cannot replace human feedback with insight from even the most uniquely authentic AI personas. But AI can powerfully augment human-facing research, and it can be a compellingly accurate substitute when human-facing research simply isn't available. The research backs us up.
Out of One, Many: Using Language Models to Simulate Human Samples — Argyle et al.
https://www.cambridge.org/core/journals/political-analysis/article/abs/out-of-one-many-using-language-models-to-simulate-human-samples/035D7C8A55B237942FB6DBAD7CAA4E49
This study demonstrates that GPT‑3, when conditioned on thousands of socio-demographic backstories from real survey participants, can generate “silicon samples” that mimic nuanced response patterns across diverse human subgroups. The authors call this property “algorithmic fidelity” and argue it makes the model a promising proxy for human subpopulations.
“[AI-powered feedback] is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and socio‑cultural context that characterize human attitudes.”
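To make the silicon-sampling idea concrete, here is a minimal sketch of persona conditioning. The backstory template, the field names, and the stubbed model call are our own illustrative assumptions, not the authors' actual pipeline:

```python
# Minimal sketch of persona-conditioned "silicon sampling", loosely in the
# spirit of Argyle et al. Field names and wording are illustrative assumptions.

def build_backstory(profile: dict) -> str:
    """Render a first-person socio-demographic backstory from survey fields."""
    return (
        f"I am a {profile['age']}-year-old {profile['gender']} from {profile['state']}. "
        f"Racially, I identify as {profile['race']}. "
        f"Politically, I consider myself {profile['ideology']}."
    )

def build_prompt(profile: dict, question: str) -> str:
    """Condition the model on the backstory, then ask the survey item."""
    return f"{build_backstory(profile)}\n\nQuestion: {question}\nAnswer:"

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call; wire up your preferred client here."""
    raise NotImplementedError

if __name__ == "__main__":
    respondent = {
        "age": 46, "gender": "woman", "state": "Ohio",
        "race": "white", "ideology": "moderately conservative",
    }
    print(build_prompt(respondent, "Which party did you vote for in 2020?"))
```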
Predicting Results of Social Science Experiments Using Large Language Models — Hewitt et al.
https://samim.io/dl/Predicting%20results%20of%20social%20science%20experiments%20using%20large%20language%20models.pdf
Hewitt and colleagues show that GPT‑4 can simulate how representative American samples respond to treatments across 70 pre-registered experiments (totaling 476 treatment effects), with predicted results correlating strongly with actual outcomes (r = 0.85, and r = 0.90 for unpublished studies). The work suggests LLMs can effectively augment experimental research, especially for early-stage hypothesis testing.
“Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters.”
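The headline number is straightforward to picture: each experiment yields a simulated treatment effect, and those predictions are correlated with the realized effects across studies. A toy sketch, with invented placeholder numbers standing in for real study data:

```python
# Toy sketch of how prediction accuracy is scored in this line of work:
# correlate simulated treatment effects with realized ones across studies.
# The numbers below are invented placeholders, not data from the paper.
import numpy as np

# One simulated and one realized (standardized) treatment effect per study.
predicted_effects = np.array([0.12, -0.05, 0.30, 0.08, -0.15, 0.22])
actual_effects    = np.array([0.10, -0.02, 0.25, 0.11, -0.20, 0.18])

# Pearson correlation between predicted and realized effects, analogous
# to the r = 0.85 reported by Hewitt et al.
r = np.corrcoef(predicted_effects, actual_effects)[0, 1]
print(f"prediction-outcome correlation: r = {r:.2f}")
```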
Automated Social Science: Language Models as Scientist and Subjects — Manning, Zhu & Horton
https://arxiv.org/abs/2404.11794
This methodology paper introduces an approach that combines structural causal models with LLMs to both formulate and test social science hypotheses in silico, showing that LLM-generated simulations align more closely with theoretical predictions when guided by an explicit structural model.
“In short, the LLM knows more than it can (immediately) tell.”
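A rough sketch of the core loop, under our own assumptions (the negotiation scenario, variable names, and prompt wording are illustrative, not the paper's code): an explicit causal graph declares the hypothesized causes, and the simulation intervenes only on those variables rather than letting the LLM free-associate:

```python
# Rough sketch of using an explicit structural causal model to guide
# LLM-based simulation, in the spirit of Manning, Zhu & Horton.
# Scenario, variables, and prompt wording are our own assumptions.
from itertools import product

# A tiny SCM: hypothesized causes -> outcome.
scm = {
    "outcome": "final_price",
    "causes": {
        "buyer_budget": ["$8,000", "$12,000"],
        "seller_reservation": ["$6,000", "$10,000"],
    },
}

def scenario_prompts(scm: dict):
    """Enumerate interventions on the SCM's cause variables and
    render one simulation prompt per combination."""
    names = list(scm["causes"])
    for values in product(*scm["causes"].values()):
        setting = dict(zip(names, values))
        conditions = "; ".join(f"{k} = {v}" for k, v in setting.items())
        yield setting, (
            f"Simulate a used-car negotiation where {conditions}. "
            f"Report only the agreed {scm['outcome']}."
        )

for setting, prompt in scenario_prompts(scm):
    print(setting, "->", prompt)
    # Each prompt would be sent to an LLM; regressing the reported
    # outcome on the interventions then tests the hypothesized paths.
```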
Predicting Field Experiments with Large Language Models — Chen, Hu & Lu
https://arxiv.org/abs/2504.01167
This more recent work extends LLM-based prediction to real-world field experiments, achieving roughly 78% accuracy across 319 economics field studies, though performance varies with participant demographics and social norms.
“Applying this framework to 319 experiments drawn from renowned economics literature yields a notable prediction accuracy of 78%.”
Can Large Language Models Help Predict Results From a Complex Behavioural Science Study? — Lippert et al.
https://www.insead.edu/faculty-research/publications/journal-articles/can-large-language-models-help-predict-results-a
In this study published in Royal Society Open Science, GPT‑4 matched human experts' accuracy in forecasting effect sizes from a behavioral science experiment (r ≈ 0.89 vs. 0.87 for humans), and also improved human forecasts when used as a chatbot aid.
“GPT‑4 ... matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT‑4)... and 0.87 (human experts) between aggregated forecasts and realized effect sizes.”
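Note that the quoted correlations are computed on aggregated forecasts, so the averaging step matters. A compact sketch with invented numbers, showing aggregation before correlation:

```python
# Compact sketch of the aggregation step behind the quoted correlations:
# average each forecaster's predictions per effect, then correlate the
# aggregate with realized effect sizes. Numbers are invented placeholders.
import numpy as np

# rows = forecasters (human experts, or repeated GPT-4 runs); cols = effects
forecasts = np.array([
    [0.30, 0.10, 0.55, 0.20],
    [0.25, 0.05, 0.60, 0.15],
    [0.35, 0.12, 0.50, 0.25],
])
realized = np.array([0.28, 0.02, 0.58, 0.18])

aggregated = forecasts.mean(axis=0)  # one aggregate forecast per effect
r = np.corrcoef(aggregated, realized)[0, 1]
print(f"aggregate-forecast correlation: r = {r:.2f}")
```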