human-simulation · English · 8 min
Simulating Human Behavior with AI: From Markov Chains to LLM Personas
March 2, 2026
From simple Markov chains to 25 AI agents spontaneously organizing a Valentine's Day party — this is the technical arc of one of data science's most interesting problems: teaching machines to understand people.
In 2023, a Stanford research team placed 25 AI agents inside a simulated town and told one of them: "Throw a Valentine's Day party." No script was written. No rules specified who should invite whom. The agents spread invitations on their own, asked each other out, and coordinated attendance — social behavior emerging from simple individual rules.
That experiment proved something important: LLMs do not just generate text. Given the right architecture, they can play the role of a person convincingly enough to make social behavior worth studying. For data scientists, the practical question follows: how does this actually work, and when should you reach for it?
What human behavior simulation is — and why it matters
Human behavior simulation means building systems that can replicate how people make decisions, respond to stimuli, and interact with each other — in a controlled environment, without requiring real participants.
That sounds academic, but the applications are concrete. Want to know whether a redesigned checkout flow will lift conversion rates? Instead of running an A/B test for three weeks at the expense of real customers, you can first run it on synthetic users. Social Simulacra (2022) demonstrated this at the community level: researchers generated thousands of synthetic members with diverse behavioral profiles, letting product teams stress-test moderation policies before going live. Zooming out further, Epidemic Modeling with Generative Agents (2023) had each citizen-agent reason — via LLM — about whether to wear a mask or attend a gathering, and the model reproduced realistic epidemic wave patterns that could be used to evaluate interventions.
The governing analogy: simulating human behavior is to social systems what physics simulation is to engineering. You test your design in a cheap, safe virtual environment before exposing the real world to it.
Four generations of technique
Generation 1: Rule-based agents
The simplest starting point is hand-coded IF-THEN logic: if inventory < 10: reorder. ELIZA (1966) worked exactly this way, with keyword detection and scripted responses.
The upside is full interpretability and determinism. The downside is that you can only simulate what you explicitly coded for, and maintenance costs grow fast as the system scales. This approach still makes sense for compliance workflows or supply-chain simulation with known decision trees — anywhere that the full decision space can be specified in advance.
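To make Generation 1 concrete, a minimal sketch (the thresholds and actions are invented for illustration):

# Minimal rule-based agent: the entire behavior space is hand-coded.
# Thresholds and actions here are hypothetical.
def reorder_agent(inventory: int, pending_orders: int) -> str:
    if inventory < 10 and pending_orders == 0:
        return "reorder"
    if inventory > 100:
        return "pause_purchasing"
    return "wait"

print(reorder_agent(inventory=7, pending_orders=0))  # -> "reorder"

Every behavior the agent can exhibit appears explicitly in the source, which is both the strength and the ceiling of this generation.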
Generation 2: Statistical models — Markov chains and friends
The key step forward: instead of hand-coding behavior, learn it from data. A Markov chain models behavior as a sequence of state transitions with learned probabilities. With clickstream data, you can estimate that 30% of users move from a product page to the cart, 60% proceed from cart to checkout, and 25% abandon at checkout.
Hidden Markov Models extend this further — modeling latent states like "browsing mode" vs. "purchase-intent mode" that aren't directly observable. Survival models (Cox, Weibull) answer the when question: at what point does a user churn, or make that first purchase?
This family of methods is mathematically rigorous, interpretable, and well-suited to problems like customer lifetime value modeling, A/B test pre-simulation, and website traffic forecasting. The core limitation is the Markov assumption: only recent history matters. Complex social dynamics, contextual reasoning, and genuinely novel situations all strain these models.
Generation 3: ML-based approaches
Sequence models — LSTMs, Transformers — learn directly from historical behavioral logs, predicting the next action given the full sequence of prior actions. Behavioral cloning trains a model on human demonstrations to replicate decision-making without explicit rules.
These approaches capture non-linear patterns at scale and improve as data grows. The trade-off: they are black-box, require large labeled datasets, and cannot generalize to truly novel situations. If you are building a product that has never existed before, there is no behavioral history to learn from.
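As a sketch of the sequence-model idea: a small PyTorch LSTM that predicts the next action ID from the session so far. Vocabulary size and dimensions are placeholders:

import torch
import torch.nn as nn

class NextActionLSTM(nn.Module):
    """Predict the next action ID from the sequence of prior action IDs."""
    def __init__(self, n_actions: int = 50, emb: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_actions, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, action_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(action_ids)       # (batch, seq, emb)
        out, _ = self.lstm(x)            # (batch, seq, hidden)
        return self.head(out[:, -1, :])  # logits over the next action

model = NextActionLSTM()
sessions = torch.randint(0, 50, (8, 20))  # 8 sessions of 20 action IDs each
logits = model(sessions)                  # shape (8, 50)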
Generation 4: LLM persona agents
This is where things get genuinely new. The core idea: give an LLM a detailed persona description — demographics, personality traits, goals, context — and let it reason about what that person would do in a given situation.
The foundational architecture comes from Park et al.'s Generative Agents (2023), which introduced three components that most subsequent work builds on:
- Memory stream: External storage of all an agent's past experiences in natural language, retrieved via semantic search when relevant to a current decision.
- Reflection: Periodic synthesis of raw memories into higher-level insights — from "Klaus asked me about the deadline again" to "Klaus seems stressed about work."
- Planning: Translation of goals and current context into concrete schedules and actions.
The upside over earlier approaches is significant: LLM agents can handle situations they have never encountered before, produce natural-language reasoning traces you can inspect, and require no behavioral training data to define a persona. The downsides are real too: hallucination risk, non-trivial per-query cost, and results that depend heavily on which base model and prompt you use.
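To make the memory stream concrete, here is a minimal sketch of the retrieval scoring Park et al. describe: each memory is scored on recency, importance, and relevance, and the top-scoring entries are injected into the prompt. The decay rate, weights, and the toy cosine_sim stand-in are illustrative simplifications, not the paper's exact implementation:

from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    importance: float  # 1-10, rated by the LLM when the memory is stored
    age_hours: float   # time since the memory was last accessed

def cosine_sim(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity between query and memory."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def retrieval_score(m: Memory, query: str, decay: float = 0.995) -> float:
    recency = decay ** m.age_hours           # exponential decay over time
    relevance = cosine_sim(query, m.text)    # semantic match to the query
    importance = m.importance / 10           # normalized importance
    return recency + importance + relevance  # equal weights

memories = [
    Memory("Klaus asked me about the deadline again", importance=4, age_hours=2),
    Memory("Had coffee at Hobbs Cafe", importance=2, age_hours=30),
]
print(max(memories, key=lambda m: retrieval_score(m, "how is Klaus feeling about work")).text)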
Practical tools to get started
TinyTroupe — Microsoft Research
TinyTroupe is a Python library for creating TinyPerson agents — LLM-powered personas defined by background, personality, and goals. The design is deliberately "tiny" relative to full Generative Agent architectures: simpler, faster, and cost-effective for business use cases.
A data scientist can pip install tinytroupe and run a synthetic focus group in an afternoon. Core use cases: virtual focus groups for product research, simulating consumer decisions before A/B tests, testing AI system behavior with diverse user types.
AgentSims — open-source sandbox
AgentSims (2023) is an open-source sandbox with a GUI for building and evaluating agent simulations without heavy coding. It supports pluggable memory systems, planning modules, and tool-use. It is the most accessible entry point for practitioners who want to experiment before committing to building their own framework.
A concrete example: Simulating a customer purchase journey
An e-commerce company wants to understand how different customer segments respond to a new checkout flow that includes a one-click upsell offer.
Traditional approach: A/B test on live users — takes weeks, affects real customers, burns through experiment budget.
LLM simulation approach using TinyTroupe:
Step 1 — Define personas:
- Nguyen Van A, 32, Ho Chi Minh City, works in marketing, budget-conscious, researches purchases carefully
- Tran Thi B, 45, Hanoi, small business owner, values convenience over price, time-poor
- Le Van C, 22, student, impulse buyer, highly responsive to social proof and discounts
Step 2 — Set the environment: Describe the checkout page, the upsell offer (add a 3-month warranty for 15% of the product price), and the trust signals visible on the page.
Step 3 — Run the simulation: Each persona agent reasons through the offer. Nguyen Van A asks clarifying questions about warranty terms, reads carefully, and hesitates at the price. Tran Thi B adds the warranty immediately — it saves time later. Le Van C ignores the warranty entirely but clicks into a "customers also bought" section.
Step 4 — Extract insights: Aggregate responses to estimate acceptance rates by segment, identify friction points (what were agents asking about? which copy caused confusion?), and iterate on the offer design before running the real A/B test.
# Illustrative pseudocode — check TinyTroupe docs for the exact API
from tinytroupe.agent import TinyPerson
from tinytroupe.environment import TinyWorld
# Define persona
nguyen = TinyPerson("Nguyen Van A")
nguyen.define("age", 32)
nguyen.define("occupation", "marketing professional")
nguyen.define("personality", "analytical, budget-conscious, researches before buying")
nguyen.define("goal", "make a good purchase decision without overspending")
# Set up environment
world = TinyWorld("E-commerce checkout page", [nguyen])
world.broadcast(
"You are on the checkout page for a laptop. "
"There is an offer to add a 3-month warranty for 15% extra. "
"The page shows 4.2-star reviews. What do you do?"
)
# Run one simulation step: the agent reasons about the stimulus and acts
world.run(1)

# Inspect the agent's reasoning trace (method names may vary by version)
nguyen.pp_current_interactions()
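Step 4's aggregation is then ordinary data wrangling. A sketch, assuming each simulation run has been parsed into a record with hypothetical segment, accepted_upsell, and questions_asked fields:

import pandas as pd

# One row per simulated persona run; the field names are hypothetical and
# would be parsed out of each agent's action log.
runs = pd.DataFrame([
    {"segment": "budget-conscious", "accepted_upsell": False, "questions_asked": 3},
    {"segment": "convenience",      "accepted_upsell": True,  "questions_asked": 0},
    {"segment": "impulse",          "accepted_upsell": False, "questions_asked": 0},
])

# Estimated acceptance rate and a friction signal per segment.
print(runs.groupby("segment").agg(
    acceptance_rate=("accepted_upsell", "mean"),
    avg_questions=("questions_asked", "mean"),
))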
Limitations worth understanding before trusting results
LLM persona simulation is compelling — but several fundamental issues matter before you act on the output.
Hallucination and behavioral confabulation: A persona defined as "budget-conscious" may still accept an expensive upsell if the prompt framing triggers a different reasoning pattern. There is no guarantee the agent's stated reasoning matches what a real person would actually do.
Base model dependency: The Silicon Society Cookbook (2026) found that swapping GPT-4 for Claude or Llama — with identical personas and prompts — produces qualitatively different emergent social behaviors. Simulation results do not transfer across models without revalidation.
The validation gap: Most published work validates against held-out survey items, which are easy to overfit. Validation against real longitudinal behavioral data — the kind that would actually confirm behavioral fidelity — is expensive and rarely done. The Silicon Society Cookbook (2026) identifies this as the central unsolved problem of the field.
Persona stereotyping: LLMs trained on internet text carry demographic stereotypes. A persona described as "elderly rural woman" may generate behavior driven by those stereotypes rather than realistic individual variation. SPIRIT (2026) attempts to address this by grounding personas in individual social media traces — but that surfaces a different problem: privacy.
The practical decision for DS practitioners: do I need behavioral fidelity or behavioral plausibility? Market size estimation needs plausibility — a directionally correct answer. Policy impact assessment needs fidelity — behavior that approximates what real people would measurably do. These different requirements lead to very different tool choices.
Where the field is heading
From 25 agents in Stanford's Smallville in 2023, the trajectory points toward large-scale infrastructure. S-Researcher (2026) supports up to 100,000 concurrent LLM agents running autonomous social science experiments — replacing expensive longitudinal studies with synthetic populations that can be queried on demand.
Computational scale is no longer the bottleneck. The current bottleneck is evaluation: how do you know your synthetic humans are behaving like real ones? That is the question practitioners working in this space need to start asking — not "what can simulation do?" but "how do I know my simulation is right?"
The best current practice remains a complementary approach: Markov chains and sequence models for problems with historical behavioral data, LLM personas for genuinely novel scenarios or when you need natural-language reasoning traces. These methods are not competing — they cover different failure modes, and the strongest simulations tend to use both.
Common Mistakes
- Trusting simulation results without real-world validation. Behavioral plausibility (results that sound reasonable) is not behavioral fidelity (results that match measurable real behavior). Use simulation to generate and narrow hypotheses before a real-world experiment, not as a substitute for one.
- Expecting consistent results when switching base models. The same persona and prompt on GPT-4 vs. Claude vs. Llama produces qualitatively different emergent social behaviors. Changing the underlying model requires revalidating your results from scratch.
- Reaching for LLM personas when statistical models are cheaper and sufficient. For problems with adequate historical data and non-novel behavior, Markov chains or sequence models are typically faster, less expensive, and far more interpretable. LLM personas earn their cost in genuinely novel scenarios.
- Not checking for stereotype artifacts in persona output. LLMs trained on internet text encode demographic stereotypes. For personas built around specific population groups, inspect outputs for whether behavior is driven by stereotypes rather than realistic individual variation.
- Using synthetic focus group results to make large decisions without A/B testing. The strongest use of LLM simulation is narrowing the design space cheaply — not replacing real-world experiments entirely. Simulation generates hypotheses; real users validate them.
Key takeaways:
- Four generations of technique: Rule-based → Markov/Statistical → ML sequence models → LLM persona agents — each generation addresses the limitations of its predecessor
- LLM personas are strongest for genuinely novel scenarios and when you need natural-language reasoning traces; statistical models outperform when sufficient historical data exists
- TinyTroupe and AgentSims are practical entry points that don't require building an agent framework from scratch
- The current bottleneck is validation: behavioral fidelity is hard to measure and rarely validated rigorously in published research
- The strongest approach combines both: statistical models for problems with data, LLM personas for novel situations — they are complementary, not competing
Sources
- Generative Agents: Interactive Simulacra of Human Behavior — arXiv
- Social Simulacra: Creating Populated Prototypes for Social Computing Systems — arXiv
- LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals — arXiv
- Persona-Based Simulation of Human Opinion at Population Scale (SPIRIT) — arXiv
- AgentSims: An Open-Source Sandbox for Large Language Model Evaluation — arXiv
- TinyTroupe — Microsoft Research GitHub
- Epidemic Modeling with Generative Agents — arXiv
- The Silicon Society Cookbook: Design Space of LLM-based Social Simulations — arXiv
- LLM Agents as Social Scientists: S-Researcher — arXiv