
multi-agents · English · 8 min


Why Most Agentic AI Projects Fail at the Last Step

May 16, 2026

Data as of May 2026

Andrew Ng and Laurence Moroney give us a four-step framework for agentic workflows — Intent, Planning, Tools, Reflection. Engineers are reasonably good at the first three. The fourth is where everything quietly collapses.

This field moves fast. Some figures, models, or tools may have been updated since this was written.

There is a moment in the Stanford CS230 lecture that made me stop — not because it was wrong, but because it stopped at exactly the wrong place.

Laurence Moroney, Director of AI at Arm and formerly Lead AI Advocate at Google for a decade, describes agentic workflows using four steps: Intent (understand what the user wants), Planning (break it into steps), Tools (grant the model access to capabilities), and Reflection (verify that the output actually met the original intent). It is one of the most grounded, concrete frameworks for agent system design currently in circulation, and it cuts through the fog of "agents will do everything" hype.

And then it stops, right after naming Reflection. Not a word on how to make that step actually work.

That is precisely where every agentic AI project I have seen either stalls in pilot or fails silently in production.


The Right Framework — Missing Half of What Matters

The first three steps — Intent, Planning, Tools — have been engineered to a relatively mature state. Prompt engineering for intent extraction, chain-of-thought for planning, function calling for tool use: all three are well-documented, benchmarked, and supported by libraries. You can learn them from OpenAI or LangChain documentation in an afternoon.

Reflection is different. Conceptually it is simple: the agent asks itself, "Does this output actually answer the original question?" But in practice, that question requires an evaluation mechanism — and this is the technical gap that neither Ng nor Moroney addresses.

The problem is not that models cannot reflect. The problem is that we do not know how to measure whether reflection succeeded.

A concrete example makes this clearer.

Scenario: A contract analysis agent for a law firm. The user asks: "Which clauses in this contract create IP ownership risk for our side?"

  • Intent: Identify IP clauses with risk.
  • Planning: Split contract into sections, prioritize IP clauses, flag issues.
  • Tools: Document reader, legal knowledge base, clause classifier.
  • Reflection (per the framework): The agent confirms it has answered the question.

But "confirms" in what sense? An agent reporting that it completed the task is self-report, not verification, and large language models are notoriously confident even when wrong. If the agent missed an implied arbitration clause that could override the client's IP rights and still marked itself "task complete", reflection has failed completely and silently.

In legal, medical, or financial contexts — exactly the domains Moroney identifies as the greatest "Small AI" opportunity because of data sovereignty and IP protection — this kind of silent failure is not an edge case. It is the baseline risk.


How Reflection Fails in Practice

Reflection in agentic systems breaks down in three recurring patterns:

Pattern 1: Self-report substituted for verification. The agent is asked to assess whether it completed the task, which is equivalent to asking students to grade their own exams. LLMs tend to be poorly calibrated about their own output: they are often overconfident, particularly when the prompt frames the question as task completion.

Pattern 2: Undetected intent drift. In workflows with multiple planning steps, the original intent gradually gets "diluted" through successive refinements. Example: "Find IP risks" becomes "identify all contract risks" becomes "summarize the contract" — each step is logical, but the goal has shifted. The agent completes a transformed version of what was originally asked — and because no representation of the original intent has been preserved for comparison, reflection has no reference point.

Pattern 3: Unspecified evaluation criteria. "Does the output meet the intent?" is an open question. In most real-world domains, the answer depends on business-specific criteria the agent was never given. Without a concrete rubric, reflection reduces to surface-level pattern matching — checking whether the output looks correct rather than is correct.
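Patterns 2 and 3 share a root cause: reflection has nothing fixed to compare against. A minimal sketch of one way to address that, assuming a hypothetical `IntentRecord` that freezes the verbatim request together with an explicit rubric. The class names and the keyword checks are illustrative stand-ins, not part of any framework discussed here; a real rubric would be written by a domain expert and would be far richer than substring tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RubricItem:
    name: str
    check: Callable[[str], bool]  # returns True when the output satisfies this item


@dataclass(frozen=True)
class IntentRecord:
    original_request: str            # verbatim user request, never rewritten by planning steps
    rubric: tuple[RubricItem, ...]   # explicit criteria, agreed before the agent runs


def failed_items(intent: IntentRecord, output: str) -> list[str]:
    """Return the names of rubric items the output does not satisfy."""
    return [item.name for item in intent.rubric if not item.check(output)]


# Toy rubric for the contract scenario (hypothetical keyword checks).
intent = IntentRecord(
    original_request="Which clauses in this contract create IP ownership risk for our side?",
    rubric=(
        RubricItem("mentions IP assignment", lambda out: "assignment" in out.lower()),
        RubricItem("mentions arbitration", lambda out: "arbitration" in out.lower()),
        RubricItem("cites clause numbers", lambda out: "clause" in out.lower()),
    ),
)

drifted_output = "Summary: the contract covers payment terms and an IP assignment in Clause 7."
print(failed_items(intent, drifted_output))  # ['mentions arbitration']
```

Because the record is frozen, intermediate planning steps can read the intent but cannot overwrite it, so the final check still runs against what the user originally asked.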

In my work with enterprise clients in Japan, where tolerance for production failures is near zero and audit trails are a strong requirement, these are the three breakdown points I have seen most consistently. Not because engineers are careless, but because the widely used frameworks offer no pattern for handling any of them.


What Actually Makes Reflection Work

Reliable reflection requires a layer that sits outside the Intent-Plan-Tools-Reflect loop: an independent evaluation layer.

For legal, medical, or financial agents, this is not optional — it is a regulatory necessity.

An independent evaluator is not a novel idea in software. It is basic testing discipline: don't use the code to test itself. But in agentic systems, the principle gets bypassed because a framework with a built-in "Reflection" step sounds like enough.

In practice, the evaluation layer needs to answer three questions, each independent of the agent's own assessment:

  1. Completeness: Does the output cover the full scope of the intent? (Not "did it answer" but "did it answer fully")
  2. Accuracy: For the verifiable parts of the output, are they correct? (Factual grounding, citation integrity, logical consistency)
  3. Business fitness: Is the output usable in the specific operational context? (This is the question the model cannot answer without a domain-specific rubric.)

Questions 1 and 2 can be handled by a separate evaluator model: a smaller model or a separately prompted LLM judge with no visibility into how the agent reached its output. Question 3 requires either human-in-the-loop review or a rubric encoded in advance by a domain expert.
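The three questions can be sketched as a small evaluation function. This is an illustration under stated assumptions, not a reference implementation: `Judge` is a hypothetical callable that wraps whatever evaluator model you use and replies with `PASS` or `FAIL: <reason>`, and `stub_judge` is a keyword-based stand-in so the example runs without a model. The important property is that the judge prompt is built only from the intent and the final output, never from the agent's trace or self-assessment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: takes a prompt, returns "PASS" or "FAIL: <reason>".
Judge = Callable[[str], str]


@dataclass
class EvaluationReport:
    completeness: str
    accuracy: str
    business_fitness: str

    @property
    def passed(self) -> bool:
        return all(v == "PASS" for v in (self.completeness, self.accuracy, self.business_fitness))


def judge_prompt(question: str, intent: str, output: str) -> str:
    # Only the intent and the final output go in; no agent reasoning or self-report.
    return (
        f"The user asked: {intent}\n"
        f"Candidate answer: {output}\n"
        f"{question} Reply PASS or FAIL: <what is missing or wrong>."
    )


def evaluate(
    intent: str,
    output: str,
    judge: Judge,
    rubric_check: Optional[Callable[[str], bool]] = None,
) -> EvaluationReport:
    completeness = judge(judge_prompt("Does the answer cover the full scope of the request?", intent, output))
    accuracy = judge(judge_prompt("Are the verifiable claims in the answer correct?", intent, output))
    if rubric_check is None:
        # No encoded rubric: business fitness is routed to a human reviewer.
        fitness = "NEEDS_HUMAN_REVIEW"
    else:
        fitness = "PASS" if rubric_check(output) else "FAIL: rubric not satisfied"
    return EvaluationReport(completeness, accuracy, fitness)


def stub_judge(prompt: str) -> str:
    # Stand-in for a real evaluator model, for demonstration only.
    return "PASS" if "arbitration" in prompt.lower() else "FAIL: arbitration clause not addressed"


report = evaluate(
    "Which clauses in this contract create IP ownership risk for our side?",
    "Clause 7 assigns all IP to the vendor.",
    stub_judge,
)
print(report.passed, report.business_fitness)
```

In this sketch a missing rubric does not silently pass: the report stays in a not-passed state until a human or an encoded rubric answers the business-fitness question.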

In multi-agent pipelines for business process automation, this evaluation layer is not an optional add-on — it is a prerequisite for production deployment. Without it, "Reflection" in the framework is just a step that generates false confidence.


Stanford's Advice Is Right — But You Need a Different Map

To be clear: the career advice from Ng and Moroney is fundamentally sound. The Three Pillars (Depth, Business Focus, Delivery Bias) are a solid framework for any AI practitioner trying to survive past 2023. The "Big AI vs. Small AI" bifurcation, the divergence between large centralized models (OpenAI, Google, Anthropic) and smaller, self-hosted, domain-specialized ones, is a real structural prediction, not marketing.

But that map was drawn from Silicon Valley. Reading it from Vietnam or Tokyo requires adjustment.

The "Small AI" opportunity is larger here than there. Data sovereignty, IP protection, and cloud inference cost constraints are not theoretical in Japanese or Vietnamese enterprises — they are real procurement blockers. That means on-device, self-hosted, and fine-tuned models are not niche plays. They are the entry point for enterprise AI in this region.

The business alignment bar is higher. Enterprise procurement in Japan and Vietnam is slower and more demanding than in Silicon Valley. A polished pilot is not enough: clients want to see that you understand their risk profile, not just the model's capabilities. The "Trusted Advisor" Moroney describes, someone who can manage deployment risks like hallucination, security vulnerability, and reputational harm, is not an advanced career level. It is the minimum requirement to be taken seriously in enterprise settings.

The multi-agent systems skill gap is more pronounced. If this is where real ROI for business process automation lies — and I believe it is — then the practitioner who masters agentic workflow evaluation, not just construction, will have unusual leverage. Not because they are smarter, but because this gap is being systematically ignored by most current frameworks and curricula.


Common Mistakes

1. Letting the agent self-report "task complete" without independent verification. Fix: Separate evaluator from executor. Use a separate LLM judge or rule-based checker to assess output — never let the agent grade its own work.

2. Not preserving a representation of the original intent. Fix: Serialize the intent statement at the start of the pipeline and pass it to the evaluation layer at the end — do not let it be overwritten by intermediate planning steps.

3. Using reflection prompts that are too vague. Weak: "Did you complete the task?" Better: "The user asked for X. Assess the output for completeness, accuracy, and business fitness. List what is missing."

4. Skipping business fitness because it can't be fully automated. Fix: Encode domain criteria into a rubric before deployment, even if that rubric is imperfect. A hard checklist with five items beats nothing.

5. Treating Reflection as a terminal step rather than a loop. Fix: Design the pipeline so Reflection can trigger re-Planning. It should not just flag errors, but return to step 2 with the evaluator's findings attached, while the original intent statement stays unchanged.

6. Applying the framework without adjusting for high-stakes domains. In law, healthcare, and finance, false confidence from self-reporting is not a UX failure — it is a liability. The evaluation layer is not optional in these contexts.
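Several of these fixes (a separate evaluator, a preserved intent, and Reflection as a loop) can be combined in one small control loop. The sketch below is an assumption-laden illustration: `Executor` and `Evaluator` are hypothetical callables standing in for your agent and your independent evaluation layer, and the toy implementations exist only so the example runs.

```python
from typing import Callable

# Hypothetical interfaces, not taken from any specific library.
Executor = Callable[[str, list[str]], str]    # (original intent, feedback so far) -> output
Evaluator = Callable[[str, str], list[str]]   # (original intent, output) -> problems; empty means pass


def run_with_reflection(
    intent: str,
    execute: Executor,
    evaluate: Evaluator,
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Run the executor, check it independently, and re-plan with feedback until it passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    output = ""
    problems: list[str] = []
    for _ in range(max_rounds):
        output = execute(intent, feedback)    # re-plan using accumulated feedback
        problems = evaluate(intent, output)   # independent check against the unchanged intent
        if not problems:
            return output, []
        feedback.extend(problems)
    return output, problems  # still unresolved: escalate to a human reviewer


def toy_execute(intent: str, feedback: list[str]) -> str:
    answer = "Clause 7 assigns all IP to the vendor."
    if any("arbitration" in f for f in feedback):
        answer += " Clause 12 (arbitration) may override these rights."
    return answer


def toy_evaluate(intent: str, output: str) -> list[str]:
    return [] if "arbitration" in output.lower() else ["arbitration clause not addressed"]


output, unresolved = run_with_reflection(
    "Which clauses in this contract create IP ownership risk for our side?",
    toy_execute,
    toy_evaluate,
)
print(output)
print(unresolved)  # []
```

The intent string is passed unchanged into every round, and the loop is bounded, so an agent that cannot satisfy the evaluator ends in an explicit escalation rather than a silent "task complete".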


Key takeaways:

  • Moroney's Intent-Plan-Tools-Reflect framework is the right structure — but it stops at the hardest part.
  • Reflection fails because agents self-assess their own output, lack a reference point for the original intent, and have no domain-specific evaluation criteria to work against.
  • An independent evaluation layer — separated from the agent executor — is the necessary condition for Reflection to work in production.
  • The "Small AI" opportunity is meaningfully larger in Asian markets than the Stanford lecture frames it: data sovereignty and enterprise procurement dynamics create a structural advantage for on-device and self-hosted models.
  • The practitioner who masters agentic workflow evaluation — not just construction — is targeting a gap that most current curricula are still ignoring.
agentic-ai · multi-agents · ai-career · llm · production-ai · vietnamese-context

Sources

  1. Stanford CS230 | Autumn 2025 | Lecture 9: Career Advice in AI — Stanford Online / CS230