UNDER THE HOOD

Building Enterprise AI Agents from Scratch

From first principles to production deployment. Context engineering, event-driven orchestration, dynamic tool use, multi-agent patterns, and the architecture decisions that separate demos from systems that run at enterprise scale.

01

Beyond chatbots

What is an AI agent?

A chatbot answers questions. An agent gets things done. The difference is what happens between receiving a request and sending a response. A chatbot takes your input, runs it through a model, and returns text. An agent can look things up, run code, call APIs, remember past interactions, and decide what to do next, without being told each step.

The word "agent" gets used loosely. At the thin end you have a chatbot with one tool attached. At the thick end you have something that can decompose a goal into subtasks, spin up sub-agents to handle each one, and reassemble the results, running for minutes or hours with no human in the loop. Both are called agents. The architecture is very different.

Four properties that actually distinguish an agent from a pipeline: it pursues an objective rather than following a script (goal-directed), it can take actions in the world (tool-wielding), it remembers what it has done (stateful), and it can notice when something went wrong and adjust (self-correcting). Remove any one of these and you have something weaker.

Agent types

Pipeline: a fixed sequence of LLM calls with no branching or tool use. Input goes in, output comes out. Works well for summarization, translation, or formatting tasks where the steps are always the same.

Input → LLM call → Output

The moment you give an LLM a tool that can change state in the world, you have an agent. Everything else is architecture.

02

The execution loop

The agent runtime

Every agent runs some version of the same loop: perceive, think, act, observe. Perceive means assembling the current context. Think is the LLM call. Act is executing whatever the model decided. Observe means capturing the result and deciding whether to loop again.

The runtime is the infrastructure that manages this loop. It handles tool dispatch, state tracking, termination conditions, and error recovery. Most agent frameworks — LangChain, LlamaIndex, CrewAI — are essentially opinions about how to implement this loop and what defaults to provide.

Three execution modes matter in practice: single-shot (one LLM call, no loop, suited for classification or formatting tasks), multi-turn (loop with human input each iteration, suited for interactive assistants), and multi-step (loop runs autonomously until a goal is met or a budget runs out). Most production agents use multi-step mode, which is also where most of the failure modes live.
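The perceive-think-act-observe loop can be sketched in a few lines. This is a minimal illustration, not any framework's API: `call_llm` is a stand-in for a real model call, and the step budget stands in for a proper termination policy.

```python
# Minimal sketch of the agent loop. `call_llm` and TOOLS are hypothetical
# stand-ins; a real runtime calls a model and dispatches real functions.

def call_llm(context):
    # Stand-in for the "think" step. Returns either a final answer or a
    # tool request; in practice the model makes this decision.
    if any(step["tool"] == "search" for step in context["history"]):
        return {"type": "final", "text": "answer based on search"}
    return {"type": "tool", "name": "search", "args": {"q": context["task"]}}

TOOLS = {"search": lambda q: f"results for {q!r}"}

def run_agent(task, max_steps=5):
    context = {"task": task, "history": []}       # perceive: assemble context
    for _ in range(max_steps):                    # budget = termination condition
        decision = call_llm(context)              # think
        if decision["type"] == "final":
            return decision["text"]
        result = TOOLS[decision["name"]](**decision["args"])  # act
        context["history"].append(                # observe: fold result back in
            {"tool": decision["name"], "result": result})
    return "budget exhausted"
```

Note that the budget check, the history append, and the final-answer exit are all loop management, not model logic: exactly where the failure modes described below tend to live.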

The agent loop

Perceive

The agent assembles its current context: the user message, available tool definitions, results from the last iteration, and anything pulled from memory. This assembled input is what the LLM sees.

Includes: system prompt, conversation history, tool schemas, memory summaries, retrieved documents, prior tool results. Managing what gets included and what doesn't is most of the runtime's job.

Most agent bugs live in the loop management, not the LLM. Wrong termination conditions, missing error handling, context growing unbounded — the loop is where things go wrong.

03

The real discipline

Context engineering

"Prompt engineering" is the wrong frame. A prompt is a single input. Context is everything the model sees when it makes a decision — and for agents, that changes constantly. The system prompt, conversation history, retrieved documents, tool results, memory summaries, and the current task description all have to fit into one fixed-size window. Managing that is context engineering.

The context window is the agent's working memory. It's finite. Once you exceed it, something gets cut — and what gets cut matters. A naive implementation that just appends everything will eventually fail on long-running tasks. The interesting work is deciding what to include, in what form, and how to compress what can't fit.

Four techniques work in practice: priority-based pruning (recent and relevant content stays, old and tangential content goes), summarization chains (compress history into a dense summary when the window fills), entity extraction (pull key facts into structured storage rather than keeping raw text), and dynamic injection (only add retrieved documents when the current task actually needs them).
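Priority-based pruning, the first of those techniques, can be sketched as a greedy fit against a token budget. The 4-characters-per-token estimate and the priority scores are illustrative; real systems use a tokenizer and a scoring model.

```python
# Priority-based pruning sketch: keep the highest-priority context items
# that fit the budget, then re-emit survivors in their original order.

def estimate_tokens(text):
    return max(1, len(text) // 4)   # rough 4-chars-per-token heuristic

def prune(items, budget):
    """items: list of (priority, text); higher priority survives first."""
    kept, used = [], 0
    for priority, text in sorted(items, key=lambda it: -it[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append((priority, text))
            used += cost
    # Original order preserved so the assembled prompt stays coherent.
    return [text for p, text in items if (p, text) in kept]
```

The re-emit step matters: pruning decides *what* survives, but the prompt should still read in conversation order.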

Context window — 128K tokens

Approximate breakdown for a mid-complexity agent task

Compression strategies

When you can't fit everything in, these four strategies handle most real-world cases. They're not mutually exclusive — most production agents use at least two of them together.

Sliding window

Loses old context

Keep only the last N conversation turns. Simple, predictable, cheap to implement. Fails on tasks where something said 20 turns ago is still relevant — a constraint mentioned early in a planning task, for example.
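A sliding window is a few lines with a bounded deque; the eviction is automatic, which is both the appeal and the failure mode.

```python
from collections import deque

# Sliding-window memory: keep only the last N turns. Older turns are
# evicted silently, which is exactly the failure mode described above.

class SlidingWindow:
    def __init__(self, max_turns):
        self.turns = deque(maxlen=max_turns)   # deque drops oldest automatically

    def add(self, role, text):
        self.turns.append({"role": role, "text": text})

    def context(self):
        return list(self.turns)
```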

Summarization buffer

Extra latency and cost on compression

When the window fills, a secondary LLM call compresses the oldest turns into a dense summary. The summary replaces the raw turns. More expensive but preserves meaning better than a hard truncation.

Entity extraction

Requires a good extraction prompt

Pull named entities, facts, and decisions out of the conversation into a structured store as the task progresses. Retrieve specific entities when needed rather than keeping all raw text. Works well for long planning tasks with many references.
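The storage side of entity extraction can be as simple as a keyed fact store; the extraction itself would be an LLM call in practice, so `record` is fed manually in this sketch.

```python
# Entity store sketch: extracted facts go into structured storage keyed
# by entity, so the agent can recall them without keeping raw transcript.

class EntityStore:
    def __init__(self):
        self.facts = {}   # entity name -> list of facts

    def record(self, entity, fact):
        self.facts.setdefault(entity, []).append(fact)

    def recall(self, entity):
        # Retrieve only what the current step needs, not the whole history.
        return self.facts.get(entity, [])
```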

Dynamic injection

Requires per-step context assembly

Don't add content to context until it's actually needed for the current step. RAG results only injected when a lookup is requested. Tool schemas only included for tools the current task could plausibly use.

The best agent architectures are really just context engineering frameworks. The LLM is the easy part.

The Complete Prompt & Context Engineering Blueprint

17 techniques with interactive diagrams — constraint-first prompting, few-shot, chain-of-thought, session management, and more

04

The DLL Hell problem

Static vs dynamic orchestration

Static orchestration means declaring every tool in advance. Before the agent runs, you write JSON schemas for each function, register them with the framework, and the agent picks from that list. This works fine until you need something that wasn't pre-registered — then you're back to adding schemas, redeploying, and hoping the version matches. It's the DLL Hell problem: not because the code is bad, but because every capability has to be explicitly wired in before it can be used.

Dynamic orchestration flips this. Instead of a tool registry, the agent has a code execution environment. When it needs to do something — parse a file format you didn't anticipate, call an API that wasn't pre-integrated — it writes the code to do it and runs it in a sandbox. The capability doesn't have to exist before the agent needs it.

Neither is strictly better. Static is auditable, testable, and appropriate for regulated environments. Dynamic is flexible and lower overhead for exploratory work, but harder to secure and harder to reason about. Most production systems end up with both: a small set of registered tools for common operations, plus a code interpreter for the long tail.

Execution flow
1

Schema authored

A developer writes a JSON schema for each function the agent might call. Name, description, parameter types, required fields.

2

Tools registered

Schemas are loaded into the framework at startup or deploy time. The agent can only call what's on this list.

3

Agent receives task

The task arrives. The agent reasons about which registered tool to use. If none fits, it has to improvise with what exists.

4

Tool invoked

The framework routes the call to the matching handler, validates arguments against the schema, and returns the result.

5

New capability needed?

Add a schema, update the registry, redeploy. Until then, that capability doesn't exist for the agent.
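Step 1 of this flow, schema authoring, typically looks like the following. The JSON-Schema-style `parameters` block is the convention most function-calling APIs use; the `get_invoice` function and its fields are invented for illustration.

```python
# A registered tool schema in the common JSON-Schema parameters style.
# Everything here is read by the model, so names and descriptions are
# written in plain language, not programmer shorthand.

get_invoice_schema = {
    "name": "get_invoice",
    "description": "Fetches an invoice by ID and returns it as a JSON "
                   "object with amount, currency, and due date.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Internal invoice identifier",
            },
            "include_line_items": {"type": "boolean", "default": False},
        },
        "required": ["invoice_id"],   # optional params stay off this list
    },
}
```

Every capability the agent has must exist as one of these objects before deploy time, which is the rigidity the section describes.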

The DLL Hell analogy is exact. Both problems come from requiring all capabilities to be declared before they're needed.

05

Choreography over pipelines

Event-driven architecture

A pipeline is a sequence. Request comes in, step 1 runs, step 2 runs, response goes out. This works until one step is slow, one step fails, or you need two things to happen at the same time. With agents, these problems come up constantly — tasks that involve waiting for external APIs, subtasks that can run in parallel, and failures that need compensation logic.

Event-driven architecture solves this by decoupling the components. Instead of step 1 calling step 2 directly, it publishes an event. Anything subscribed to that event picks it up and handles it. The publisher doesn't know who's listening. The subscriber doesn't know who published. This makes the system easier to extend and more resilient when parts fail independently.

Three patterns come up in practice: fan-out for parallel execution of independent tasks, saga for multi-step workflows with rollback logic, and dead letter queues for handling failures that can't be automatically recovered. Knowing which pattern fits a given problem is most of the design work.

Event patterns

One event triggers multiple independent handlers running in parallel. Use when a task can be cleanly decomposed and the results don't depend on each other.
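Fan-out reduces to a publish call that runs every subscriber concurrently. This sketch uses in-process `asyncio`; a production system would put a real message broker between publisher and subscribers, and the handlers here are stand-ins for agent subtasks.

```python
import asyncio

# Fan-out sketch: one event triggers several independent handlers in
# parallel. Publisher and subscribers don't know about each other.

SUBSCRIBERS = {}   # event name -> list of async handlers

def subscribe(event, handler):
    SUBSCRIBERS.setdefault(event, []).append(handler)

async def publish(event, payload):
    handlers = SUBSCRIBERS.get(event, [])
    # gather() runs all handlers concurrently; results are independent,
    # so completion order doesn't matter.
    return await asyncio.gather(*(h(payload) for h in handlers))

async def summarize(doc):
    return f"summary of {doc}"

async def classify(doc):
    return f"class of {doc}"
```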

The message broker is not an implementation detail. It's the backbone that lets agents fail independently and recover without taking down the whole system.

06

Planner and executor

Multi-agent orchestration

One agent trying to do everything gets expensive and slow. Tasks that naturally decompose — research then write then review, or extract then classify then summarize — are better handled by specialized agents that each do one thing well.

The planner-executor pattern is the most common starting point. An orchestrator takes the user's goal, breaks it into subtasks, and dispatches each to a specialist. The specialists return results, the orchestrator assembles them. The orchestrator doesn't execute. The specialists don't plan. Separation of concerns, at the agent level.

Getting this right is harder than it looks. The orchestrator needs to know what each specialist can handle, deal with failures when one doesn't respond, and combine results that may arrive in a different order than expected. Communication patterns matter too: shared blackboard (all agents read and write to shared context), message passing (direct point-to-point), and auction-based (specialists bid on tasks) each have different tradeoffs for visibility and coupling.
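Stripped to its skeleton, planner-executor is a dispatch table plus a plan. In this sketch the decomposition is hardcoded and the specialists are plain functions; in a real system the plan comes from an LLM planning call and the specialists are agents themselves.

```python
# Planner-executor sketch: the orchestrator decomposes and assembles,
# the specialists only execute. All names here are illustrative.

SPECIALISTS = {
    "research": lambda goal: f"notes on {goal}",
    "write":    lambda notes: f"draft from ({notes})",
    "review":   lambda draft: f"reviewed: {draft}",
}

def orchestrate(goal):
    plan = ["research", "write", "review"]   # planner output, hardcoded here
    result = goal
    for step in plan:
        result = SPECIALISTS[step](result)   # dispatch to one specialist
    return result
```

The separation of concerns is visible in the shape: the orchestrator never does the work, and no specialist sees the plan.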

Orchestration flow
1

User request

2

Orchestrator decomposes

3

Research agent

4

Analysis agent

5

Writer agent

6

Orchestrator assembles

7

Final response

Communication patterns

How agents talk to each other matters as much as how they're decomposed. Three patterns come up in practice, each with different tradeoffs for coupling, visibility, and coordination overhead.

Shared blackboard

Low overhead, high coupling

All agents read and write to a shared context object. Simple to implement. Easy to debug — you can inspect the blackboard at any point and see the full state. The downside is coupling: every agent depends on the same data structure, and schema changes break everyone.

Message passing

Medium overhead, lower coupling

Agents communicate via direct messages. The orchestrator sends tasks; specialists send results back. Explicit dependencies. Easier to test in isolation. Works well when the orchestrator maintains clear ownership of coordination.

Auction-based

High overhead, very low coupling

Tasks are published to a pool. Specialist agents bid on tasks based on their current capacity and confidence. The orchestrator picks the winning bid. More complex to implement but naturally handles load balancing and gracefully degrades when specialists are unavailable.

The orchestrator is only as good as its task decomposition. A bad split means specialists work on the wrong thing or produce results that don't fit together.

07

How agents act

Tool use and function calling

Function calling is how an LLM communicates intent to act. The model doesn't call functions directly — it produces a structured JSON object saying which function it wants to call and with what arguments. The runtime intercepts that, executes the actual function, and feeds the result back into the context. The LLM never touches your code; it just asks for things.

Tool schema design matters more than most people expect. A schema that's too generic produces ambiguous calls. A schema that's too rigid breaks when input varies. The name and description are read by the model, so they need to be clear in natural language — not programmer shorthand. "search_web" tells the model nothing. "search_web(query: string): searches the internet and returns the top 5 results as plain text" is actually usable.

The code interpreter pattern — giving the agent a Python sandbox instead of individual tool functions — handles the long tail. Instead of registering a function for every operation you anticipate, the agent writes code to do whatever it needs. This works well for data manipulation, API calls, and file processing. It's harder to audit and harder to secure, but for exploratory tasks it's often the right call.

Function call walkthrough
Step 1 of 6

The user sends a message: "What's the current weather in Berlin and should I bring an umbrella?" The model doesn't know current weather — it needs to call a tool.
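The runtime side of a walkthrough like this can be sketched as follows. The model's tool-call output format is simplified, and `get_weather` is a mock; the point is the separation: the model emits structured JSON, the runtime parses, dispatches, and feeds the result back as data.

```python
import json

# Function-call round trip, runtime side. The model never executes code;
# it emits a structured request that the runtime intercepts.

def get_weather(city):
    return {"city": city, "condition": "rain", "temp_c": 11}

TOOLS = {"get_weather": get_weather}

# Simplified shape of a model's tool-call output:
model_output = json.dumps(
    {"tool": "get_weather", "arguments": {"city": "Berlin"}})

def dispatch(raw):
    call = json.loads(raw)              # validate structure before anything else
    fn = TOOLS[call["tool"]]            # route to the registered handler
    result = fn(**call["arguments"])    # execute with the model's arguments
    # The result goes back into context as data, not as instructions.
    return {"role": "tool", "content": json.dumps(result)}
```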

Schema design rules

Name functions in plain language

✗ proc_doc_v2
✓ extract_text_from_pdf

Describe the return value, not just the inputs

✗ searches documents
✓ searches documents and returns the top 5 matching passages as plain text strings

Mark optional parameters as optional

✗ { query: string, limit: number }
✓ { query: string, limit?: number = 10 }

Use enums for constrained values

✗ { format: string }
✓ { format: 'json' | 'markdown' | 'plain' }

The tool schema is a UX problem. You're designing an interface for a model that reasons in natural language, not for a programmer who reads docs.

08

Three kinds of memory

Memory and state

Agents have three distinct memory problems. Short-term memory is the context window — what the model sees right now. It's fast and precise but finite and gone when the session ends. Working memory is the state the agent accumulates during a task. Long-term memory is knowledge that persists across sessions.

Managing these separately matters. Short-term fills up. Working memory needs to be checkpointed so a long task can resume after a crash. Long-term memory needs a retrieval mechanism, because you can't load all of it into every context.

Common patterns in practice: sliding window (keep the last N turns, drop the oldest), summarization buffer (when the window fills, compress history into a dense summary and keep that), and entity extraction (pull out named entities and facts into structured storage so they're retrievable without keeping raw conversation text). Redis handles session state. Vector databases handle semantic retrieval. A SQL store or key-value store handles structured facts.

Memory types

Short-term memory is everything currently in the context window. It's the only memory the model can see directly. It disappears when the session ends. Fast and precise — but finite. A task running 20+ turns will hit limits if you don't manage what stays in. Most agents that "forget" things mid-task are experiencing short-term memory problems.

Size: 128K–1M tokens depending on model
Overflow: Oldest content is truncated silently. The model never knows it happened.
Pattern: Sliding window or summarization buffer

Memory bugs are some of the hardest to find. The agent behaves differently after 10 turns because the context drifted — not because the code changed.

09

The knowledge backbone

RAG for agents

A language model trained months ago doesn't know what happened yesterday. It also doesn't know your internal documentation, your codebase, or your customer data. RAG — retrieval-augmented generation — is how you give an agent access to knowledge that isn't baked into its weights.

The basic pattern: the agent formulates a retrieval query, searches a vector index, gets back the most relevant chunks, injects them into the context, and answers from that. This is straightforward. The harder question is when the agent should retrieve versus when it should call a tool. Retrieval is for reading static knowledge. Tools are for taking actions or fetching live data. An agent that can't distinguish these will either retrieve when it should act, or act when a retrieval would have been enough.

Agentic RAG goes further: query rewriting (rephrase the question before searching, because raw user queries are rarely good search queries), multi-hop retrieval (use the result of one retrieval to inform the next query), and self-RAG (the model evaluates whether the retrieved content is actually relevant before using it). These patterns increase quality but also increase latency and cost. Worth it for high-stakes tasks. Probably not for a simple Q&A assistant.
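Two of those patterns, query rewriting and a self-RAG-style relevance gate, can be sketched together. The rewriter, retriever, and document set here are toy stand-ins; in practice the rewrite and the relevance check are both model calls, and retrieval hits a vector index.

```python
# Agentic RAG sketch: rewrite the raw query, retrieve, then gate on
# relevance before answering. All components are illustrative stand-ins.

DOCS = {
    "refund policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}

def rewrite(query):
    # A real rewriter turns chatty input into a dense search query.
    return query.lower().replace("what is your ", "").rstrip("?")

def retrieve(query):
    # Stand-in for a vector search over chunked documents.
    return [text for key, text in DOCS.items() if key in query]

def answer(query):
    chunks = retrieve(rewrite(query))
    if not chunks:                      # self-RAG style relevance gate
        return "no relevant context found"
    return f"Based on: {chunks[0]}"
```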

RAG decision flow
1

Question arrives

The user asks something. The agent needs to decide whether it can answer from its own knowledge or needs to look it up.

2

In training data?

3

Formulate retrieval query

4

Search vector index

5

Inject and answer

The retrieval quality ceiling is the chunk quality floor. If documents are poorly chunked or the embeddings don't capture meaning well, better retrieval algorithms won't fix it.

10

Trust boundaries

Security and guardrails

An agent that can take actions in the world is a larger attack surface than a chatbot. Prompt injection — getting the model to follow instructions embedded in retrieved content or tool results instead of the user's original intent — is the most common problem. A user asks the agent to summarize a document. The document contains hidden instructions. The agent follows them.

The core trust boundary question: what inputs does the agent treat as instructions versus data? User messages are instructions. Retrieved documents are data. Tool results are data. These need to be kept separate in the context and handled differently. A model that can't distinguish "the user asked me to do X" from "this document says to do X" is not safe for production.

Defense requires multiple layers: input validation before the context is assembled, output filtering before responses are sent, tool permission scoping (not every tool should be available to every task), sandboxed execution for any generated code, and credential management through a secrets vault rather than embedding keys in prompts or environment variables.
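One concrete piece of that layering is tagging untrusted content as data before it enters the context. The substring blocklist below is deliberately naive, a real scanner would use a classifier, but the envelope structure is the point: retrieved text never enters the context under an instruction-bearing role.

```python
# Trust-boundary sketch: wrap retrieved content in a data envelope and
# flag suspicious patterns. The marker list is illustrative, not a
# complete defense against prompt injection.

SUSPICIOUS = ("ignore previous instructions", "you are now")

def wrap_as_data(source, text):
    flagged = any(marker in text.lower() for marker in SUSPICIOUS)
    return {
        "role": "data",        # never "system" or "user"
        "source": source,
        "content": text,
        "flagged": flagged,    # downstream can quarantine or drop it
    }
```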

Attack vectors and defenses

The hardest guardrail to implement is also the most important: making sure the model knows what it's allowed to do and consistently refuses to do what it isn't.

11

Tracing decisions

Observability and debugging

Debugging a traditional function is tractable. You have inputs, outputs, and deterministic code. Debugging an agent is different because the "code" — the model's reasoning — is opaque. You can see what the model decided but not why. The same input can produce different outputs on different runs.

This means observability needs to capture more than logs. You need traces that show the full execution path: which tool was called with which arguments, what it returned, how the context changed at each step, how long each LLM call took, and how many tokens were used. Without this, debugging a multi-step failure means guessing.

OpenTelemetry works for agents. Each agent step becomes a span. Tool calls are child spans. LLM calls record token usage, latency, and which model was used. You can correlate a user complaint to the specific trace that produced it and see exactly what the model was looking at when it made the wrong decision.

Trace waterfall
Span types: root · llm · tool
What to capture per LLM call

A minimal trace for each LLM call should include at least these fields. Anything less and you'll struggle to reproduce failures or understand cost at scale.

trace_id (string): Correlation ID linking all spans in one agent run
span_id (string): Unique identifier for this specific call
model (string): Which model version handled the call — gpt-4o, claude-3-5-sonnet, etc.
prompt_tokens (integer): Input token count — drives cost
completion_tokens (integer): Output token count — also drives cost, often forgotten
latency_ms (integer): Wall clock time from request to response
tool_calls (array): List of any function calls the model requested in this response
finish_reason (string): stop, length, tool_calls, or content_filter — tells you why the model stopped
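The field list above maps naturally onto a small record type; in production these would become span attributes on an OpenTelemetry span, but the dataclass is enough to show the shape.

```python
from dataclasses import dataclass, field

# Minimal trace record carrying the fields listed above. In a real
# system each of these becomes a span attribute.

@dataclass
class LLMCallTrace:
    trace_id: str
    span_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: int
    finish_reason: str
    tool_calls: list = field(default_factory=list)

    @property
    def total_tokens(self):
        # Both counts drive cost; track the sum explicitly.
        return self.prompt_tokens + self.completion_tokens
```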

Add trace IDs from the start. Retrofitting distributed tracing into an agent system that grew without it is genuinely painful.

12

When things go wrong

Error handling and self-correction

Agents fail in ways traditional software doesn't. A function either runs or throws an exception. An agent can run, produce output, and still be completely wrong — and it won't raise an exception because from a runtime perspective, nothing went wrong. The model just reasoned to a bad conclusion.

Self-correction is when the agent detects this and tries again. Some frameworks feed the bad output back to the model with a prompt like "here's what you produced, here's why it's wrong, try again." This works for certain classes of error — malformed JSON, a tool call with bad arguments, an obviously incorrect answer. It doesn't work for subtle reasoning failures where the model can't recognize that it's wrong.

Practical patterns: retry with exponential backoff for transient failures (API rate limits, network timeouts), fallback models (if the primary fails, try a smaller or different model), graceful degradation (return a partial answer rather than failing completely), and circuit breakers for external services (stop calling a service that's consistently failing and return a cached response or error early).
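The first of those patterns, retry with exponential backoff, fits in a dozen lines. Catching only `TimeoutError` here stands in for whatever exception class marks transient failures in your stack; the attempt count and base delay are illustrative.

```python
import time

# Retry-with-backoff sketch for transient failures. Only the transient
# error class is retried; anything else propagates immediately.

def with_retry(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:                  # transient class only
            if attempt == attempts - 1:
                raise                         # budget exhausted: give up
            time.sleep(base_delay * 2 ** attempt)   # 1x, 2x, 4x ...
```

A circuit breaker is the complementary pattern: instead of retrying harder, it stops calling a service that keeps failing and returns a cached response or an early error.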

Error recovery flow

Execute step

Agent runs a task step — tool call, LLM call, or external API request.

Validate result

Valid

Invalid

Transient error?

Retry with backoff

Logic error?

Self-correct

Degrade gracefully

Validate outputs structurally, not just syntactically. A JSON schema check catches malformed output. Only a semantic check catches an answer that's technically valid but wrong.

13

Scaling past one request

Task queuing and async execution

A synchronous agent blocks while it runs. If a task takes 30 seconds, the HTTP connection stays open for 30 seconds. This doesn't scale past a handful of concurrent users. The fix is moving long-running work to a queue.

The pattern: an API endpoint receives the request and immediately returns a job ID. The task goes into a queue. A pool of workers picks up tasks and processes them. When a task finishes, the result is stored somewhere the client can poll for it or receive via webhook. The user gets a response in milliseconds; the actual work happens asynchronously.

Transient agent lifecycle fits naturally here. A worker spins up an agent instance for the task, the agent runs to completion, and the instance is cleaned up. This keeps memory usage bounded and makes each task independently restartable if the worker crashes mid-execution. Priority queues let urgent tasks skip ahead. Dead letter queues hold failed tasks for retry or manual inspection.
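The queue-and-worker pattern can be sketched in-process with the standard library; a real deployment would put Redis, SQS, or RabbitMQ between the API and the workers, and the "processing" here stands in for a transient agent run.

```python
import queue
import threading

# Queue-and-worker sketch: the API thread enqueues and returns a job ID
# immediately; workers drain the queue and write results to a store.

tasks = queue.Queue()
results = {}   # stand-in for a result store the client polls

def worker():
    while True:
        job_id, payload = tasks.get()
        if job_id is None:               # poison pill shuts the worker down
            break
        results[job_id] = f"processed {payload}"   # transient agent run
        tasks.task_done()

def submit(job_id, payload):
    tasks.put((job_id, payload))
    return job_id                        # client polls results[job_id] later
```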

Task queue (2 workers active)
Summarize contract #441 · done
Analyze sales data Q4 · done
Generate report draft · worker-1
Review flagged emails · worker-2
Classify support tickets · queued
Extract action items · queued
Audit billing records · queued

Architecture

API gateway → Job ID returned → Queue → Worker pool → Result store

Design for restartability from the start. An agent task should be safe to restart at any checkpoint without producing duplicate side effects.

14

The gap nobody warns you about

From prototype to production

A demo agent that works on five prepared test cases is different from a production agent that runs on everything users actually send. The gap between these two things is where most agent projects get stuck.

The prototype handles the happy path. Production handles the rest: malformed inputs, edge cases the demo never saw, rate limits from LLM providers, tool failures, latency spikes, prompt injection attempts, and users who push on the edges deliberately. Building for this requires evaluation before deployment, observability after deployment, and a way to roll back when something goes wrong.

Evaluation means building a test set from real user queries — not the ones you thought up — measuring task completion rate and tool accuracy, and having a baseline to compare against when you make changes. Cost optimization comes after you understand real usage patterns: model routing (cheaper model for simple tasks), response caching (some queries recur), prompt caching (avoid re-processing the same system prompt on every call). None of this is worth doing before you know what actual traffic looks like.

Prototype
Handles the happy path only
Test cases are hand-picked by the developer
Synchronous, single-user execution
One hardcoded model
No monitoring or alerting
Prompt injection untested
Production
Error handling and graceful degradation
Tested on real, adversarial user queries
Async queue with multiple workers
Model routing by task complexity and cost
Full distributed tracing and alerting
Input validation and output filtering
Evaluation before deployment

A good evaluation set is the difference between deploying confidently and deploying hopefully. Building one takes time but prevents a class of production incidents that monitoring alone can't catch.

1

Collect real queries

Run a closed beta or shadow traffic session. Collect 200-500 actual user inputs. Don't write test cases yourself — you'll test for what you thought of, not what users send.

2

Define success criteria

For each query type, define what a correct answer looks like. Task completion rate, tool accuracy, hallucination rate, latency p95. These become your baseline metrics.

3

Run against your baseline

Execute the full eval set against the current agent. Record every output. Store traces. This is your regression anchor — every future change gets compared against it.

4

Red-team the edge cases

Find the 10% of inputs where the agent fails, behaves unexpectedly, or produces something you wouldn't want a user to see. Fix those before launch, not after.
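Steps 2 and 3 above can be sketched as a minimal harness: run the collected set, compute a completion rate, and compare it against the stored baseline. The substring-match grader and the tolerance value are illustrative; real evals use per-query graders.

```python
# Minimal eval harness sketch. `agent` is any callable taking a query
# string and returning an answer string; the grader is a stand-in.

def evaluate(agent, eval_set):
    """eval_set: list of (query, expected_fragment) pairs."""
    passed = sum(1 for query, expected in eval_set
                 if expected in agent(query))
    return passed / len(eval_set)

def regression_check(current_rate, baseline_rate, tolerance=0.02):
    # Fail the deploy if completion rate drops more than the tolerance.
    return current_rate >= baseline_rate - tolerance
```

Every future change to prompts, tools, or models gets run through `evaluate` and gated on `regression_check` against the recorded baseline.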

Production architecture

The most useful thing you can do before deploying is run 100 real user queries through the agent and manually review every output. You will find problems you didn't anticipate. You always do.