AI Economists Are Now Beating Human Researchers

A landmark study from the Federal Reserve Board has produced one of the most consequential findings yet in the debate over AI’s role in knowledge work: when agentic AI systems and human economists are given identical research tasks under identical instructions, the AI systems perform at least as well as their human counterparts and, by structured evaluation, somewhat better.

The research, conducted by Federal Reserve economist Serafin Grundl, pitted three agentic AI systems (Codex running GPT-5.4, Codex running GPT-5.3, and Claude Code with Opus 4.6) against human research teams on a standardized causal inference problem drawn from published academic work. Teams were asked to estimate the employment effects of the DACA immigration program using real survey data, across three progressively more constrained task designs. This is not a trivia contest or a coding benchmark. This is the actual craft of empirical economics.

The headline finding is striking in its consistency. When three separate AI models, including Google’s Gemini, were used as independent reviewers to rank submissions from all participants, every reviewer produced the same ordering: GPT-5.4 first, GPT-5.3 second, Claude Code third, and human researchers last. The stability of this ranking across reviewer models is difficult to dismiss as an artifact or as reviewer bias, particularly given that each AI reviewer ranked competing AI systems ahead of its own outputs.
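One way to put a number on "every reviewer produced the same ordering" is Kendall's coefficient of concordance (W), which is 1 when all raters agree exactly and 0 when their rankings cancel out. The sketch below is illustrative only and is not the study's evaluation code. The ordering is the one reported above; `reviewer_a` and `reviewer_b` are placeholder names, since the text names only Gemini among the three reviewers.

```python
# Ordering reported in the article, best to worst.
ORDER = ["GPT-5.4", "GPT-5.3", "Claude Code", "Human"]

# Only Gemini is named in the text; the other two keys are placeholders.
rankings = {
    "gemini": ORDER,
    "reviewer_a": ORDER,
    "reviewer_b": ORDER,
}


def kendalls_w(rankings):
    """Kendall's W for complete rankings without ties."""
    orders = list(rankings.values())
    items = sorted(orders[0])
    m, n = len(orders), len(items)
    # Rank sum per item: list position 0 corresponds to rank 1.
    rank_sums = [sum(order.index(item) + 1 for order in orders) for item in items]
    mean_sum = sum(rank_sums) / n
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


print(kendalls_w(rankings))
```

With three identical rankings the statistic is at its maximum of 1. Note that W measures agreement among reviewers only; it says nothing about whether the shared ordering is correct.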

On the distribution of estimates, the picture is more nuanced. AI systems and humans converge around similar median causal effect estimates, suggesting no systematic directional distortion in AI outputs. However, human estimates show markedly wider tails, with larger standard deviations and broader ranges, indicating that human researchers introduce more idiosyncratic variation in both directions. The AI systems are, in a meaningful statistical sense, more consistent.
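The distinction between "same center, different spread" can be made concrete with three summary statistics: median, standard deviation, and range. The sketch below uses hypothetical numbers chosen purely to illustrate the pattern described above. They are not the study's estimates, and `summarize` is an illustrative helper rather than anything from the paper.

```python
import statistics


def summarize(estimates):
    """Return the center and two spread measures for a list of estimates."""
    return {
        "median": statistics.median(estimates),
        "stdev": statistics.stdev(estimates),  # sample standard deviation
        "range": max(estimates) - min(estimates),
    }


# HYPOTHETICAL effect estimates, invented for illustration only.
ai_estimates = [0.029, 0.030, 0.031, 0.032, 0.033]
human_estimates = [-0.010, 0.020, 0.031, 0.042, 0.075]

ai = summarize(ai_estimates)
human = summarize(human_estimates)

# Same median, but the human group is far more dispersed.
print(ai)
print(human)
```

In this toy data both groups share a median, so neither is biased relative to the other, yet the human group has a larger standard deviation and range. That is the sense in which similar medians and wider tails can coexist.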

For executives and institutional investors, the implications extend well beyond academia. Empirical economics underpins regulatory analysis, market research, policy evaluation, and investment due diligence. If agentic AI can perform this work at comparable or superior quality, the constraint on scaling analytical capacity shifts from human talent to computational budget. Organizations that have historically been bottlenecked by the availability of senior quantitative researchers should reconsider their operating models now.

The caveat the author flags—that AI systems still make mistakes—is real but applies equally to humans, as the wide-tailed human distribution itself demonstrates. The era of AI as a genuine research peer has arrived ahead of most forecasts.

⚠️ CONTRADICTION #9 — This article’s categorical conclusion (“AI as a genuine research peer”) is in apparent tension with Wiki/wiki/cross-cutting/why-ai-cannot-yet-do-real-science.md, which finds AI “falls dramatically short of genuine scientific discovery capabilities.” The resolution is task-type: this study tests structured empirical analysis on well-defined causal inference tasks; the science benchmark tests open-ended hypothesis generation and experimental design. AI excels at the former, struggles with the latter. Both findings are valid; the categorical framing in both articles obscures this distinction.

📅 POTENTIALLY STALE (July 21, 2026) — the resolution immediately above is broken, and this is the second of the two articles it spans. Ghareeb, Rodrigues et al. (FutureHouse), A multi-agent system for automating scientific discovery, Nature s41586-026-10652-y, May 19 2026 (Raw/A multi-agent system for automating scientific discovery.md, not yet formally ingested) reports “Robin” performing exactly the activity this resolution assigns to the failure column — open-ended hypothesis generation — successfully: a novel ripasudil→dry-AMD hypothesis, confirmed in vitro, with ABCA1 surfaced as a new target. The counterpart flag was applied to why-ai-cannot-yet-do-real-science.md on July 20; this side was left unflagged, so a reader arriving here saw a resolution presented as settled. Flagged here for symmetry pending a joint rewrite of both articles. Note the direction of the correction: it makes this article’s optimistic framing look better supported, not worse — the tension is narrower than CONTRADICTION #9 claims, not wider.

Source: Raw/trigger-agentic-ai-systems-vs-human-economists.md
