A landmark study from the Federal Reserve Board has produced one of the most consequential findings yet in the debate over AI’s role in knowledge work: when agentic AI systems and human economists are given identical research tasks under identical instructions, the AI systems perform at least as well as their human counterparts and, by structured evaluation, somewhat better.

The research, conducted by Federal Reserve economist Serafin Grundl, pitted three agentic AI systems (Codex running GPT-5.4, Codex running GPT-5.3, and Claude Code with Opus 4.6) against human research teams on a standardized causal inference problem drawn from published academic work. Teams were asked to estimate the employment effects of the Deferred Action for Childhood Arrivals (DACA) program using real survey data, across three progressively constrained task designs. This is not a trivia contest or a coding benchmark. This is the actual craft of empirical economics.

The headline finding is striking in its consistency. When three separate AI models, including Google’s Gemini, were used as independent reviewers to rank submissions from all participants, every reviewer produced the same ordering: GPT-5.4 first, GPT-5.3 second, Claude Code third, and human researchers last. Agreement across reviewer models makes the ranking difficult to dismiss as an artifact of any single reviewer’s bias, particularly given that Gemini had no system of its own among the participants and therefore no entry to favor.
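Reviewer agreement of this kind can be quantified with Kendall's coefficient of concordance (W), which is 1 when all raters produce the same ranking and 0 when their rankings cancel out. The sketch below is illustrative rather than the study's own method: it uses only the ordering reported above, and the reviewer labels other than Gemini are placeholders, since the text does not name the other two reviewers.

```python
from statistics import mean

# Ordering reported in the text, identical for all three reviewers.
ORDER = ["GPT-5.4", "GPT-5.3", "Claude Code", "Humans"]

# Only Gemini is named in the text; the other two keys are placeholder labels.
rankings = {
    "Gemini": ORDER,
    "reviewer_b": ORDER,
    "reviewer_c": ORDER,
}


def kendalls_w(rankings):
    """Kendall's W for complete rankings without ties.

    `rankings` maps each rater to a list of items ordered best to worst.
    """
    raters = list(rankings.values())
    items = raters[0]
    m, n = len(raters), len(items)
    # Sum of the ranks (1 = best) each item received across all raters.
    rank_sums = [sum(r.index(item) + 1 for r in raters) for item in items]
    mean_sum = mean(rank_sums)
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    print(f"Kendall's W = {kendalls_w(rankings):.2f}")
```

Perfect concordance says the reviewers agree with one another. It does not by itself show that the shared ordering is correct.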

On the distribution of estimates, the picture is more nuanced. AI systems and humans converge on similar median causal effect estimates, suggesting no systematic directional distortion in AI outputs. Human estimates, however, show noticeably wider tails (larger standard deviations and broader ranges), indicating that human researchers introduce more idiosyncratic variation in both directions. The AI systems are, in a meaningful statistical sense, more consistent: their estimates are less dispersed around a similar center.
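The comparison described here (similar medians, different spread) comes down to three summary statistics per group. A minimal sketch follows, using synthetic placeholder numbers that are not taken from the study; the function name `dispersion_summary` and both sample lists are invented for illustration.

```python
from statistics import median, stdev


def dispersion_summary(estimates):
    """Return the center and spread of a list of effect estimates."""
    return {
        "median": median(estimates),
        "std": stdev(estimates),  # sample standard deviation
        "range": max(estimates) - min(estimates),
    }


# Synthetic placeholder values, NOT results from the study.
# They are chosen so both groups share a median but differ in spread.
ai_estimates = [0.03, 0.04, 0.04, 0.05, 0.04]
human_estimates = [-0.02, 0.04, 0.01, 0.04, 0.10, 0.07]

ai = dispersion_summary(ai_estimates)
human = dispersion_summary(human_estimates)

if __name__ == "__main__":
    for label, summary in (("AI", ai), ("Human", human)):
        print(label, {k: round(v, 3) for k, v in summary.items()})
```

In this toy data the two groups agree on the median while the human group has the larger standard deviation and range, which is the pattern the paragraph above describes.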

For executives and institutional investors, the implications extend well beyond academia. Empirical economics underpins regulatory analysis, market research, policy evaluation, and investment due diligence. If agentic AI can perform this work at comparable or superior quality, the constraint on scaling analytical capacity shifts from human talent to computational budget. Organizations that have historically been bottlenecked by the availability of senior quantitative researchers should reconsider their operating models now.

The caveat the author flags, that AI systems still make mistakes, is real, but it applies to humans as well, as the wider spread of human estimates suggests. For structured empirical tasks of this kind, the era of AI as a genuine research peer has arrived ahead of most forecasts.

⚠️ CONTRADICTION #9 — This article’s categorical conclusion (“AI as a genuine research peer”) is in apparent tension with Wiki/wiki/cross-cutting/why-ai-cannot-yet-do-real-science.md, which finds AI “falls dramatically short of genuine scientific discovery capabilities.” The resolution is task-type: this study tests structured empirical analysis on well-defined causal inference tasks; the science benchmark tests open-ended hypothesis generation and experimental design. AI excels at the former, struggles with the latter. Both findings are valid; the categorical framing in both articles obscures this distinction.


Source: Raw/trigger-agentic-ai-systems-vs-human-economists.md