What AI Agents Are Actually Bad At

An AI agent goes live on day one. The demo worked. The workflow looked clean. Three days in, the agent produces outputs that seem right but aren't.

The assumption was that complexity would be the problem — that hard workflows would break and simple ones would run. That assumption is wrong. Complexity is not what breaks AI agents in production. Vagueness is.

AI agents fail at vague tasks, not difficult ones

A difficult task can be automated if every decision point inside it is defined. A simple task breaks the moment it requires the agent to know something nobody specified.

"Send a follow-up email to any lead who hasn't replied in five business days" is difficult enough to feel like it requires judgment. It runs reliably because the trigger is defined, the input is a CRM record, the output is one email, and the conditions are explicit. The agent makes no decisions that weren't already made in the brief.

"Handle customer communication" is simple enough to feel like it should be straightforward. It fails in production because "handle customer communication" is not a task. It is a category containing hundreds of tasks — each with its own inputs, outputs, and edge cases — none of which were specified.

The failure mode is underspecification. The agent delivers exactly what it was given. When what it was given is incomplete, the output reflects that.
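The follow-up email rule above is specific enough to be written down directly. The sketch below is one possible way to express it, not a reference implementation. The `Lead` record and its field names are hypothetical stand-ins for a CRM record, and the business-day count assumes a Monday to Friday week with no public holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Lead:
    # Hypothetical CRM record; field names are illustrative assumptions.
    email: str
    last_outbound: date          # date the last email was sent to the lead
    last_reply: Optional[date]   # None if the lead has never replied


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start`, up to and including `end` (no holidays)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


def needs_follow_up(lead: Lead, today: date, threshold: int = 5) -> bool:
    """True when the lead has not replied since the last outbound email
    and at least `threshold` business days have passed."""
    replied = lead.last_reply is not None and lead.last_reply >= lead.last_outbound
    if replied:
        return False
    return business_days_between(lead.last_outbound, today) >= threshold
```

Every decision in the function (what counts as a reply, what counts as a business day, what the threshold is) was made before the agent runs, which is the property the brief needs.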

The four failure patterns in production

Four patterns account for most agent failures after the first week in production.

Ambiguous inputs — The agent receives something it wasn't designed for. A customer writes in a language the agent wasn't briefed for. An order arrives with two line items instead of one. A form is submitted with a required field blank. The input isn't wrong — it just wasn't anticipated. The agent produces output that looks correct but isn't, because the case it encountered was never defined.
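One way to keep unanticipated inputs from producing plausible but wrong output is to check each input against the brief before the agent sees it. The sketch below is a minimal illustration under stated assumptions: the supported languages, the required field names, and the `review_queue` destination are all hypothetical.

```python
from typing import Any

# Assumptions for illustration: the brief covers English and Spanish,
# single-line-item orders, and the required fields listed here.
SUPPORTED_LANGUAGES = {"en", "es"}
REQUIRED_FIELDS = ("customer_email", "order_id", "language", "line_items")


def check_input(order: dict[str, Any]) -> list[str]:
    """Return the reasons this input falls outside the brief (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not order.get(field):  # missing, None, or blank
            problems.append(f"missing required field: {field}")
    language = order.get("language")
    if language and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {language}")
    items = order.get("line_items") or []
    if len(items) > 1:
        problems.append(f"expected 1 line item, got {len(items)}")
    return problems


def route(order: dict[str, Any]) -> str:
    """Send only inputs the brief covers to the agent; everything else to review."""
    return "review_queue" if check_input(order) else "agent"
```

An order with two line items would be routed to review here instead of being processed as if it had one.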

Context-dependent judgment — The task requires knowing something the agent was never given. "Follow up with this lead" sounds defined. But the right tone depends on how the lead arrived, how long they've been in the pipeline, and whether they had a difficult exchange last quarter. A person who has worked in the business for six months navigates this automatically. An agent with no access to that history cannot.

Moving scope — The task definition drifts because the business changes. The agent was briefed for how the workflow worked in February. By April, the team added a step, changed a field name, or started routing a new case type. Nobody updated the brief. The agent continues running the old version of the workflow.
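Drift of this kind can sometimes be caught mechanically by recording which fields the brief was written against and comparing them to live records. The sketch below assumes a hypothetical `Brief` object and made-up field names. It only detects renamed, removed, or added fields, not new steps or new case types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    # Hypothetical brief metadata; names are illustrative assumptions.
    version: str
    expected_fields: frozenset[str]


def detect_drift(brief: Brief, record: dict) -> dict[str, set[str]]:
    """Compare a live record's fields to the fields the brief was written for."""
    actual = set(record)
    expected = set(brief.expected_fields)
    return {
        "missing": expected - actual,      # renamed or removed since the brief
        "unexpected": actual - expected,   # added since the brief
    }
```

For example, a brief written against `stage` would report `stage` as missing and `pipeline_stage` as unexpected once the team renames the field, which is a signal to update the brief before the agent keeps running the old workflow.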

Multi-party coordination — The task involves waiting on another person or system, then acting based on the response. "Send a proposal, then follow up if no response" seems simple. But what if the prospect replies with a question instead of a decision? What if they respond to the wrong thread? The agent was briefed for one path. Production contains several.
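The proposal example can be made explicit by listing every reply outcome the brief covers and giving everything else a default. The sketch below is an illustration only: the outcome names and action names are hypothetical, and how a reply gets classified into one of these outcomes is assumed to happen elsewhere.

```python
from enum import Enum


class Reply(Enum):
    # Hypothetical outcomes after a proposal is sent.
    NONE = "none"
    ACCEPTED = "accepted"
    DECLINED = "declined"
    QUESTION = "question"
    WRONG_THREAD = "wrong_thread"


# Paths the brief explicitly covers. Action names are illustrative.
NEXT_ACTION = {
    Reply.NONE: "send_follow_up",
    Reply.ACCEPTED: "notify_owner",
    Reply.DECLINED: "close_out",
    Reply.QUESTION: "escalate_to_human",
}


def next_action(reply: Reply) -> str:
    """Return the briefed action, escalating any outcome the brief omits."""
    return NEXT_ACTION.get(reply, "escalate_to_human")
```

The default matters more than the table: a reply on the wrong thread has no entry, so it escalates instead of being treated as no response.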

[Figure: four failure mode cards (ambiguous inputs, context-dependent judgment, moving scope, multi-party coordination), each with a one-line description of how the failure occurs. Caption: These four patterns account for most production failures. All four are visible before an agent is built.]

Tasks that look automatable but regularly break

Agents don't fail on hard tasks. They fail on vague ones.

An agent is not bad at customer communication. It is bad at "handle customer communication," a phrase that contains hundreds of tasks it was never given.

Some workflow categories appear on nearly every business owner's automation list. They consistently underperform because their apparent simplicity hides structural problems.

"Manage the inbox" — Every message is different. An agent can handle a specific message type — refund requests, delivery questions, account changes — when that type is isolated and specified. The whole inbox is not a task. It is a category.

"Schedule meetings" — Looks mechanical. Contains preference logic. What if two time slots are available but one falls right before a client call the agent has no visibility into? What if the other party prefers mornings and is in a different time zone? A person applies these rules without being asked. An agent applies none of them unless they are written down.

"Summarize this week's activity" — What counts as this week? Which activities matter? For which audience? A founder summarizing for themselves includes different items than a summary going to an investor. The agent needs a defined scope and a fixed template — not a general instruction.

"Monitor and respond to leads" — Monitoring is automatable. The response depends on how the lead arrived, what they said, and what stage they are at. Combining both in one instruction produces an agent that responds to every lead with the same logic.

How to use these limits before you commit

The four failure patterns above are detectable before a single line of agent code is written. For any workflow under consideration, four questions reveal where the gaps are.

Can every input this agent will receive be described in a single paragraph? If the answer involves "it depends on who's writing in," the input space is undefined.

Can success be evaluated without a person reading the output? If the only way to know whether the agent got it right is for someone to read each result and judge it, success has not been defined. The task needs a checkable criterion before an agent can handle it reliably.

How often does an exception occur, and what happens when it does? If exceptions surface more than once a week and each one is handled differently, the workflow has undefined behavior at the edges. Those edges appear in production.

What does the agent do when it can't decide? Every brief needs an explicit escalation path — a named action for inputs that don't match the expected pattern. Not "the agent will handle it." A specific step: flag for review, route to a queue, send a holding message.

These questions don't rule out automation. They identify what needs to be defined first. A workflow that fails all four can still be automated — after the inputs are bounded, success is measurable, exceptions are capped, and the escalation path is named.
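The four questions can be recorded as a small checklist so that each workflow gets the same review. The sketch below is one possible encoding, assuming a hypothetical `WorkflowAssessment` record filled in by whoever scopes the workflow. The thresholds follow the text above: exceptions more than once a week that are each handled differently count as a gap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowAssessment:
    # Hypothetical record of the answers to the four questions.
    inputs_fit_one_paragraph: bool
    success_checkable_without_reading: bool
    exceptions_per_week: float
    exceptions_handled_consistently: bool
    escalation_action: Optional[str]  # e.g. "flag for review"; None if unnamed


def gaps(a: WorkflowAssessment) -> list[str]:
    """List what still needs to be defined before the workflow is automated."""
    found = []
    if not a.inputs_fit_one_paragraph:
        found.append("bound the inputs")
    if not a.success_checkable_without_reading:
        found.append("define a measurable success check")
    if a.exceptions_per_week > 1 and not a.exceptions_handled_consistently:
        found.append("define exception handling")
    if not a.escalation_action:
        found.append("name an escalation path")
    return found
```

An empty list does not prove the workflow is safe to automate. It only means none of the four gaps described here was found.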
