


AgentSociety: What Happens When You Simulate Society with 10,000 LLM Agents

May 16, 2026

Tsinghua's AgentSociety runs 10,000 LLM agents generating 5 million interactions — testing political polarization, UBI policy, and hurricane shocks on a simulated population. Here's what it actually demonstrates, and what it still hasn't proven.

Suppose you want to study how a UBI policy reshapes social networks before it's implemented. Or how misinformation cascades differently across ideologically clustered populations versus diffuse ones. Traditional field experiments are expensive, slow, and give you one scenario at a time — and you can't rerun history.

That's the gap AgentSociety is trying to fill. Built by Tsinghua University's FIB Lab, it runs 10,000 LLM agents through a simulated social environment, generating over 5 million interactions. The platform lets researchers inject policy interventions — a hurricane shock, an economic subsidy, a wave of inflammatory content — and watch how they propagate through the population.

This is not an incremental upgrade over previous agent simulations; it makes an entirely different class of question tractable.


The Scale Jump That Changes Everything

In 2023, Stanford's Smallville put 25 LLM agents in a virtual town and let them run. Smallville proved that LLM agents could maintain consistent personas, plan their days, and generate simple emergent social patterns. It was a compelling proof of concept. But 25 agents can't answer: "How does political polarization form at a population level?" or "What does the cascade structure of misinformation look like across a heterogeneous social network?"

AgentSociety jumps to 10,000 agents — not 400, not 1,000, but 10,000 — producing 5 million interactions in a single simulated environment. That scale shift is what opens the door to genuine population dynamics: emergent behavior, cascade effects, and the kind of second-order social phenomena that don't appear in small sandboxes.

The difference isn't just technical. Scale changes which research questions become tractable.


How It Works

The basic architecture

Each agent in AgentSociety carries an individual memory, a psychosocial profile, and the ability to interact with other agents through natural language. Unlike rule-based agent-based models (ABMs), each agent uses an LLM to make decisions — meaning behavior emerges from context and interaction history, not predetermined formulas.

Researchers inject interventions into the simulated environment: seed a network with inflammatory messages, alter the economic conditions for a subpopulation, introduce an external shock. The platform tracks how those perturbations spread, amplify, or dampen across the full 10,000-agent population.
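To make the architecture concrete, here is a minimal sketch of the agent loop described above: each agent holds a profile and a memory, decides via an LLM call, and interventions are seeded into a subset of agents before a population step. All names here (`Agent`, `run_step`, `call_llm`) are illustrative placeholders, not AgentSociety's actual API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Sketch of an LLM-driven agent: memory + psychosocial profile.

    Behavior emerges from context and interaction history, not fixed rules.
    """
    agent_id: int
    profile: dict                       # psychosocial attributes
    memory: list = field(default_factory=list)

    def step(self, observation: str, call_llm) -> str:
        # The prompt carries profile + recent history, so the same
        # observation can produce different actions for different agents.
        prompt = (
            f"Profile: {self.profile}\n"
            f"Recent memory: {self.memory[-5:]}\n"
            f"Observation: {observation}\n"
            "What do you do next?"
        )
        action = call_llm(prompt)
        self.memory.append((observation, action))
        return action

def run_step(agents, intervention, call_llm):
    """Inject a perturbation (e.g. an inflammatory message) into a small
    random subset of agents, then let the whole population act."""
    seeded = random.sample(agents, k=max(1, len(agents) // 100))
    for a in seeded:
        a.memory.append(("intervention", intervention))
    return [a.step("social feed update", call_llm) for a in agents]
```

In a real run, `call_llm` would be a model call and the tracked quantity would be how the injected perturbation spreads, amplifies, or dampens across steps.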

The five phenomena tested

According to the paper, the team ran five specific social experiments:

| Phenomenon | What was studied |
| --- | --- |
| Political polarization | How do agent opinions diverge over time under information exposure? |
| Inflammatory message spread | Does inflammatory content travel faster than neutral content? |
| Universal Basic Income (UBI) | How does a population of agents respond to a UBI intervention? |
| External shock (hurricane) | How does a sudden economic shock propagate through social networks? |
| Urban sustainability | What behavioral patterns emerge in response to environmental policy? |

For each, the team compared simulation outputs against findings from real-world empirical social science research — and reported that results aligned directionally with established findings.


What's Genuinely New: Auditable Agent Behavior

One underappreciated aspect of AgentSociety's place in the broader field is the push toward auditable agents: systems where you can inspect why an agent made a decision, not just observe what it did.

This connects to a parallel line of work: the Multi-Agent Psychological Simulation System (November 2025) goes further by embedding established psychological theories — self-efficacy, mindset theory, social constructivism — directly into agent architecture. Agents generate internal cognitive-affective states before producing observable behavior. The result: agent behavior that's not just plausible but theoretically traceable.

AgentSociety is moving in this direction. The transition from "what did the agent do" to "what did the agent reason through before acting" is methodologically significant. It transforms agent simulation from a black box into something closer to a scientific instrument.


The Honest Caveats

This is where most writeups about AgentSociety go quiet, so it's worth being direct.

Alignment is not validation

The paper reports that simulation outputs "align" with real-world social science findings — but that alignment covers five specific phenomena, was reported by the same team that built the system, and has not been independently replicated by any external research group.

Independent replication in social science is notoriously difficult even for traditional field experiments. For LLM-based social simulation, it's an open methodological problem. The AgentSociety results are directionally interesting and worth taking seriously — but "self-reported alignment on five phenomena" is not the same as "validated behavioral model."

The validation gap is the field's central problem

Zoom out: across all three major domains of multi-agent simulation — human behavior, business dynamics, consumer markets — the same pattern holds. Simulations are getting larger; the gap between simulation output and real-world ground-truth data is not narrowing. The field is generating more plausible outputs, not more validated ones.

AgentSociety is the most ambitious system in this space right now. That makes its validation gap the most important methodological question to answer next.

Long-horizon persona coherence is unsolved

LLM agents drift. Run a simulation over extended periods of simulated time and an agent's persona at step 1,000 will have drifted meaningfully from its persona at step 10. No architecture has solved this, and it's the binding constraint on every application domain — from social simulation to business strategy games.

Computational cost is real

10,000 agents interacting through LLM calls means a lot of API requests per simulation run. Anyone planning to run AgentSociety for their own research needs to budget for this. The field is developing workarounds — caching, smaller models for lower-stakes agent nodes, hybrid architectures — but nothing is standardized yet.
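A back-of-envelope cost estimate is worth doing before committing to a run. The sketch below multiplies agents, steps, and tokens per call; every default number is a placeholder to be replaced with your model's real pricing and your measured token counts.

```python
def estimate_run_cost(n_agents: int, steps: int, calls_per_agent_step: int = 1,
                      tokens_per_call: int = 1_500,
                      usd_per_mtok: float = 0.50) -> float:
    """Rough USD cost of a simulation run in LLM API calls.

    All defaults are illustrative assumptions, not real prices.
    """
    total_calls = n_agents * steps * calls_per_agent_step
    total_tokens = total_calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_mtok

# A 10,000-agent run over 100 steps at these placeholder rates:
estimate_run_cost(10_000, 100)  # → 750.0 (USD)
```

Even with generous caching, the estimate scales linearly in agents and steps, which is why prototyping at small scale first is the standard advice.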


The Open-Source Angle

AgentSociety is open source. That matters more than it might seem.

The central methodological gap — the absence of independent replication — can only be closed if other teams can actually run the system. An open GitHub repo is the necessary precondition for that work to happen. The FIB Lab also runs the AgentSociety Challenge Workshop, an active community benchmark pushing the evaluation frontier.

Compare this to large simulation systems that remain closed: the field can't build on what it can't examine. AgentSociety's openness is what makes the next generation of work — including critical replication — possible.


Parallel Developments Worth Watching

AgentSociety isn't the only serious work in this space. Three parallel directions are worth tracking:

Modular frameworks for reusability. The Generalized Multi-Agent Social Simulation Framework (October 2025) takes a different angle: instead of one large platform, build modular, object-oriented components that researchers can compose for different simulation scenarios. Less ambitious in scale, but potentially more reusable across research groups.

Survey-calibrated agent personas. LLM Agents Grounded in Self-Reports calibrates agent personas from individual survey data — meaning agents are initialized to reflect real demographic and psychographic profiles rather than generic LLM defaults. This is the right methodological direction for anyone who wants simulation outputs that generalize beyond their training set.

Validated habit formation. The Toulouse multimodal transport simulation (October 2025) demonstrated LLM agents forming travel habits over time and validated those habits against empirical mobility data — the first convincing demonstration of habit formation in LLM agents with real-data comparison.

The common thread: the field is converging on the idea that agents need to be grounded — in psychological theory, in survey data, in empirical behavioral observations — rather than left to generate plausible-but-unanchored behavior.
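As a concrete illustration of the survey-grounding idea, here is a minimal sketch that turns survey rows into agent profiles instead of generic LLM defaults. The column names (`age`, `income`, `values`) are invented for illustration; a real pipeline would map whatever fields the survey instrument actually collected.

```python
import csv
from io import StringIO

def personas_from_survey(csv_text: str) -> list[dict]:
    """Build one agent profile per survey respondent.

    Hypothetical schema: each row has an age, an income bracket, and a
    semicolon-separated list of self-reported values.
    """
    rows = csv.DictReader(StringIO(csv_text))
    return [
        {
            "age": int(r["age"]),
            "income_bracket": r["income"],
            "stated_values": r["values"].split(";"),
        }
        for r in rows
    ]
```

The payoff of this approach is that the simulated population's demographic and psychographic distribution matches a real sample by construction, rather than whatever the base model's defaults happen to be.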


Common Mistakes

1. Treating "plausible" as "valid"

Simulation output that looks like real social behavior is not the same as output that's been calibrated against real social data. AgentSociety's results are plausible and directionally aligned with social science findings — but "looks right" and "has been validated" are different claims. Don't cite simulation results as evidence for policy without a clear validation chain.

2. Ignoring the WEIRD baseline problem

Most large LLMs are trained on English-language data with Western, Educated, Industrialized, Rich, Democratic behavioral baselines. Simulating populations outside that profile (Southeast Asian societies, non-English-speaking communities, low-income urban populations) risks producing results that are systematically biased in ways the system does not flag. This is an open gap the field hasn't addressed.

3. Underestimating long-horizon drift

If you run a simulation across many steps of simulated time, expect agents to drift from their initial personas. This isn't a bug in AgentSociety specifically — it's an unsolved problem across all LLM-based agent architectures. Design your experiments accordingly: shorter timelines with more runs, or monitor persona consistency explicitly.
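Monitoring persona consistency explicitly can be as simple as checking, at intervals, how much of an agent's declared profile is still reflected in its recent output. The keyword-overlap check below is a crude placeholder; a real setup would likely use embedding similarity between the baseline persona description and recent actions.

```python
def persona_consistency(baseline_traits: set[str],
                        recent_actions: list[str]) -> float:
    """Fraction of declared traits still visible in recent outputs.

    Illustrative drift monitor only: keyword matching, not semantics.
    """
    text = " ".join(recent_actions).lower()
    hits = sum(1 for t in baseline_traits if t.lower() in text)
    return hits / len(baseline_traits) if baseline_traits else 1.0
```

In practice you would log this score per agent per checkpoint and either re-anchor the persona (re-inject the profile into context) or flag the run when the score drops below a chosen threshold.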

4. Assuming five validated phenomena generalize

The alignment reported in the paper covers five specific social phenomena. It doesn't follow that AgentSociety will produce valid outputs for other phenomena you might want to study. Each new domain needs its own validation.

5. Treating open source as equivalent to replicable

The GitHub repo provides code. But independent replication also requires data, compute, and methodological clarity about how the original team compared results to ground truth. No external group has published a replication study for AgentSociety. That gap is real, and the open-source repo is a precondition for closing it — not a substitute for the work itself.

6. Ignoring cost when planning research

Running 10,000 agents through millions of interactions is expensive in LLM API calls. Prototype at small scale first. Map out cost estimates before committing to a research design that requires full-scale simulation.


What You Can Do Now

If you're a practitioner who wants to engage with AgentSociety seriously:

Start with the paper. Read arXiv:2502.08691 before running anything. Focus on the methodology sections — specifically, how the team measured alignment against real-world data. Understanding their choices there is essential context for interpreting any output you generate.

Clone the repo and run small. The GitHub repo is public. Start with a small-scale scenario to understand the platform's mechanics before scaling up. This also lets you estimate compute costs for your specific use case.

Design a validation plan before running experiments. What ground-truth data could you compare your simulation outputs against? A simulation without a validation plan produces interesting visualizations, not research findings. The earlier you define this, the more useful your experimental design will be.
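A validation plan can start very simply: before running anything, list the effects your interventions should produce according to the empirical literature, then check whether the simulated effects at least point the same way. The sketch below scores directional agreement; the metric names are invented for illustration.

```python
def directional_alignment(sim_effects: dict[str, float],
                          empirical_signs: dict[str, int]) -> float:
    """Share of metrics where the simulated effect sign matches the
    sign reported in empirical studies (+1 / -1).

    A coarse first-pass check, not a substitute for calibration
    against ground-truth data.
    """
    shared = set(sim_effects) & set(empirical_signs)
    agree = sum(
        1 for k in shared
        if (sim_effects[k] > 0) == (empirical_signs[k] > 0)
    )
    return agree / len(shared) if shared else 0.0
```

Passing this check is a weak necessary condition, not validation; it mirrors the "directional alignment" the AgentSociety paper itself reports, which is exactly why independent replication still matters.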

Follow the Challenge Workshop. The AgentSociety Challenge is where evaluation standards for social simulation are actively being negotiated. Staying close to that conversation puts you ahead of the methodological curve.

AgentSociety sits at Technology Readiness Level 4: validated in the lab by one team, not yet independently replicated at scale, not yet production-ready for policy applications. The distance from here to a tool that policymakers can trust remains significant. But this is exactly the moment when the next research contributions will matter most — and the open platform makes it possible for those contributions to come from anywhere.


Key takeaways:

  • AgentSociety (Tsinghua FIB Lab) is the largest current social simulation platform: 10,000 LLM agents, 5 million interactions, open source on GitHub
  • It marks a qualitative shift from small sandboxes like Smallville (25 agents) to population-scale simulation capable of studying genuine social dynamics
  • Five phenomena tested: political polarization, inflammatory content spread, UBI policy response, hurricane shock propagation, urban sustainability — with results reported to align with empirical social science
  • Critical caveat: alignment is self-reported, no independent replication exists yet; WEIRD LLM bias is unaddressed; long-horizon persona coherence remains an open engineering problem
  • Practitioners can engage today through the open-source repo — but any serious use requires a clear validation plan before drawing conclusions
Tags: multi-agents · human-simulation · llm · generative-agents

Sources

  1. AgentSociety — arXiv:2502.08691
  2. AgentSociety HTML full text
  3. AgentSociety GitHub repo
  4. AgentSociety Challenge Workshop
  5. Multi-Agent Psychological Simulation System — arXiv:2511.02606
  6. LLM Agents Grounded in Self-Reports — arXiv:2411.10109
  7. Multimodal Transport Behavior Simulation (Toulouse) — arXiv:2510.19497
  8. Generalized Multi-Agent Social Simulation Framework — arXiv:2510.06225