Agent Frameworks Are Not the Architecture

Agent PoCs often work surprisingly fast.

You call an LLM.

You connect a tool.

You ask for structured output.

You add tracing.

The demo works.

Then the system gets slightly more complex, and a different problem appears:

when the output is wrong, you cannot tell where the failure lives.

Is it the prompt?

The context?

The workflow?

The schema?

The reflection step?

The tool call?

At that point, “the agent failed” stopped being a useful debugging signal.

That was the moment I stopped treating the Agent Framework as the architecture.

This is not a framework comparison.

I am not trying to prove that LangGraph is better than Pydantic AI, or that one Agent Framework should replace another. Frameworks are useful. I still use them.

The lesson I took from building my small contract-question-agent project was different:

Agent frameworks are useful, but they should not become the architecture.

The architecture lives in the responsibility boundaries.

Where does orchestration live?

Where does execution live?

Where does state live?

Where does context enter?

Where are boundaries checked?

Where does the system become observable?

Those questions became more useful than asking which framework should own everything.

The framework became a black box

Agent Frameworks are attractive for a good reason.

They give you a place to put model calls, tools, structured output, memory, state, tracing, retries, and sometimes UI events. In the beginning, that feels like progress. The system has one conceptual center. The code feels easier to explain.

But as my project became more specific, that convenience started hiding the boundaries I needed to see.

My project is called contract-question-agent. It is not designed to answer whether a contract clause is legal, enforceable, fair, risky, or safe to sign.

Instead, it turns a vague concern about a contract clause into structured verification questions a human reviewer can raise before relying on the clause.

In other words:

Not verdicts, but verification questions.

Not legal conclusions, but review prompts.

Not final judgment, but better questions before expert review.

That narrowness is intentional.

For this kind of task, I do not want an agent that confidently produces a polished answer. I want a system that keeps uncertainty visible. I want to know which clause type was handled, which review lenses were considered, which output contract was enforced, and whether the final result still follows the task thesis.

A universal agent abstraction can make it too easy to collapse all of that into one final response.

The output may look clean. But the system becomes harder to inspect.

Which context was used?

Which assumptions were introduced?

Which part of the workflow decided the input was in scope?

Which part made sure the output did not become legal advice?

Which part should be fixed when the result feels wrong?

When everything belongs to the framework, no failure belongs anywhere specific.

That was the real debugging problem.

What I tried

I started with a broad Agent Framework abstraction because it seemed like the natural place to put agent behavior.

I expected it to help with model calls, structured output, skills, and interface concerns. At first, it did help. The early version moved faster because there was an obvious abstraction to reach for.

But the more the project became about a specific runtime shape, the more responsibilities started to overlap.

The workflow wanted one owner.

The model calls wanted another boundary.

The output schema wanted a contract.

The prompt wanted a visible surface.

The context provider needed to stay controlled.

The task thesis needed to be checked after generation.

The UI needed to show state, not just final prose.

A single “agent” abstraction was no longer the cleanest center of the system.

It became too easy for workflow control, model execution, context selection, reflection, tracing, and interface status to blur together.

And once those responsibilities blurred, debugging degraded into the least useful loop in LLM engineering:

change the prompt and try again

Sometimes the prompt really is the problem.

But often it is not.

Sometimes the schema is too weak.

Sometimes the context candidates are poor.

Sometimes the scope classifier allowed the wrong input through.

Sometimes the reflection step is not strict enough.

Sometimes the workflow routed correctly, but the UI hid the actual state.

Sometimes the system generated an answer when it should have stopped earlier.

If all of those failures look like “the agent gave a bad answer,” the system is not debuggable enough.

What remained after removing the magic

After removing the idea that one framework should own everything, the system became less magical and more inspectable.

What remained was not one perfect Agent Framework.

What remained was a responsibility map:

Framework comparison asks:
  Which framework should own the agent?

This article asks:
  Which responsibility should own each boundary?

        ┌───────────────────────┐
        │       Workflow        │  LangGraph
        │ state / route / guard │
        └───────────┬───────────┘
                    │
        ┌───────────▼───────────┐
        │       Execution       │  Pydantic AI / OpenRouter
        │ structured model call │
        └───────────┬───────────┘
                    │
        ┌───────────▼───────────┐
        │       Contracts       │  Pydantic / Skill / Reflection
        │ schema / thesis check │
        └───────────┬───────────┘
                    │
        ┌───────────▼───────────┐
        │     Observability     │  Langfuse / AG-UI / Streamlit
        │ trace / event / state │
        └───────────────────────┘

In the implementation, that map looks like this:

LangGraph:
  workflow orchestration
  state transitions
  routing
  retry paths

Pydantic AI:
  node-internal structured model calls
  classification
  generation
  reflection calls

Pydantic:
  schema contracts
  output validation

Jinja:
  prompt surface
  skill injection
  candidate context rendering

MCP:
  controlled candidate review lenses

Skill:
  task thesis
  operating spec

Reflection:
  thesis compliance checking

Langfuse:
  machine observability
  spans
  generation usage

FastAPI:
  HTTP boundary

AG-UI / Streamlit:
  human-facing workflow state
  run viewer

This is not as exciting as saying “I built an autonomous agent.”

But it is much easier to debug.

The important shift was that each part had a job. The workflow did not need to own model execution. The model client did not need to own orchestration. The prompt did not need to carry every safety rule alone. The UI did not need to pretend it was a chat interface.

The system became more ordinary.

And that was the point.

Orchestration is not execution

The biggest architectural lesson was simple:

Orchestration is orchestration. Execution is execution.

The component that owns workflow state and transitions does not have to be the same component that performs structured model calls.

In my current design, LangGraph owns the workflow.

It defines business-readable nodes such as:

LOAD_CLAUSE_SPANS
FILTER_RECORDS
PRECHECK_INPUT
CLASSIFY_SCOPE
GENERATE_MINIMAL_QUESTIONS
REFLECT_AGAINST_SKILL_THESIS
SAFETY_CHECK
WRITE_OUTPUT

Those names matter. They make the runtime easier to talk about.

If input fails deterministic precheck, the workflow stops before LLM classification.

If scope classification says the input is out of scope, the workflow stops before generation.

If reflection fails the output against the skill thesis, the workflow can request regeneration.

If safety checks fail, the system can report that boundary clearly.

That is orchestration.
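
As a minimal sketch of those stop conditions, the following uses only the Python standard library, not LangGraph's API. The node names come from the list above. `RunState`, the `run` loop, and the keyword-based scope check are hypothetical stand-ins; in the project the scope decision is a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunState:
    clause_text: str
    in_scope: Optional[bool] = None
    questions: list[str] = field(default_factory=list)
    stopped_at: Optional[str] = None    # node that halted the run, if any
    stop_reason: Optional[str] = None
    visited: list[str] = field(default_factory=list)


def precheck_input(state: RunState) -> RunState:
    # Deterministic check: no model call is needed to reject empty input.
    if not state.clause_text.strip():
        state.stopped_at = "PRECHECK_INPUT"
        state.stop_reason = "empty input"
    return state


def classify_scope(state: RunState) -> RunState:
    # Placeholder rule standing in for an LLM classification call.
    state.in_scope = "shall" in state.clause_text.lower()
    if not state.in_scope:
        state.stopped_at = "CLASSIFY_SCOPE"
        state.stop_reason = "out of scope"
    return state


def generate_minimal_questions(state: RunState) -> RunState:
    # Placeholder output standing in for a structured generation call.
    state.questions = ["Which party does this obligation bind, and when?"]
    return state


PIPELINE: list[tuple[str, Callable[[RunState], RunState]]] = [
    ("PRECHECK_INPUT", precheck_input),
    ("CLASSIFY_SCOPE", classify_scope),
    ("GENERATE_MINIMAL_QUESTIONS", generate_minimal_questions),
]


def run(clause_text: str) -> RunState:
    state = RunState(clause_text=clause_text)
    for name, node in PIPELINE:
        state.visited.append(name)
        state = node(state)
        if state.stopped_at is not None:
            break  # stop before any later node runs
    return state
```

Because the state records `stopped_at` and `visited`, an empty input reports PRECHECK_INPUT as the stopping node and shows that no later node ran, instead of surfacing as a vague bad answer.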

Pydantic AI, on the other hand, is used inside nodes for structured model calls. It handles classification, generation, and reflection through explicit output schemas.

That is execution.

For this particular repository, Pydantic AI is intentionally boring.

A raw provider SDK with vanilla Pydantic validation could probably handle the current execution layer too. That is fine. The point is not that Pydantic AI is required.

The point is that model execution is kept behind a replaceable boundary.
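
One way to picture that boundary, as a sketch under assumptions rather than the repository's code: the node depends on a small interface, and the output contract is checked before anything reaches the workflow. `ModelClient`, `CannedClient`, `ScopeDecision`, and `run_scope_classification` are hypothetical names, and the hand-written validation stands in for what Pydantic would do.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ScopeDecision:
    in_scope: bool
    clause_type: str

    @classmethod
    def from_json(cls, raw: str) -> "ScopeDecision":
        # Enforce the output contract; a malformed reply raises ValueError.
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        in_scope = data.get("in_scope")
        clause_type = data.get("clause_type")
        if not isinstance(in_scope, bool):
            raise ValueError("in_scope must be a boolean")
        if not isinstance(clause_type, str) or not clause_type:
            raise ValueError("clause_type must be a non-empty string")
        return cls(in_scope=in_scope, clause_type=clause_type)


class ModelClient(Protocol):
    # The only thing a workflow node needs to know about model execution.
    def complete(self, prompt: str) -> str: ...


class CannedClient:
    """Test double; a real client would call a provider SDK here."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def run_scope_classification(client: ModelClient, clause_text: str) -> ScopeDecision:
    raw = client.complete(f"Classify this clause:\n{clause_text}")
    return ScopeDecision.from_json(raw)
```

Swapping the client implementation leaves `run_scope_classification` and the schema untouched, which is the property the boundary is meant to protect.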

Today that boundary handles structured classification, generation, and reflection calls. In related experiments, I use the same idea to keep LLMs as replaceable transformation components rather than autonomous evaluators or workflow owners.

What I want to avoid is letting the execution layer also become the architecture.

Keeping those responsibilities separate made the system easier to change.

I can keep the LangGraph workflow while changing the model-call internals. Or, in the future, I can rethink the orchestration layer without rewriting every schema, prompt, and context boundary.

This is a different kind of framework flexibility.

It is not “choose the best framework.”

It is “avoid letting one framework absorb every responsibility.”

MCP as controlled context provider

MCP is often discussed in the context of tool use: give the model tools, let it decide when to call them, and let the agent act.

That is not how I use it here.

In this project, MCP is a controlled context provider.

The application, not the model, calls a review-lens capability deterministically based on the clause type. The orchestrator retrieves candidate lenses, injects them into the prompt, and asks the model to select from that visible set.

The model is not blindly reaching for tools. It is choosing from context the runtime has already made observable.

The shape is closer to this:

clause_type
-> lookup_clause_review_hints
-> candidate review lenses
-> prompt rendering
-> selected_review_lenses
-> verification questions

This may look less autonomous.

That is intentional.

For a contract review support task, I do not want the model to silently decide which external context to retrieve and why. I want the system to expose what context was available, what was selected, and how it influenced the questions.
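
The flow above can be sketched in a few lines. This is an illustration only: `lookup_clause_review_hints` takes its name from the flow, but here it is a plain dictionary lookup rather than an MCP call, the lens strings are placeholders, and `string.Template` stands in for the Jinja prompt surface. `render_prompt` and `check_selection` are hypothetical helpers.

```python
from string import Template

# Placeholder lens data; the real candidates come from the review-lens provider.
REVIEW_HINTS = {
    "indemnification": [
        "scope of covered losses",
        "caps and exclusions",
        "notice and defense procedure",
    ],
    "termination": ["notice period", "cure rights"],
}

PROMPT = Template(
    "Clause type: $clause_type\n"
    "Candidate review lenses:\n"
    "$lenses\n"
    "Select only from the candidates above."
)


def lookup_clause_review_hints(clause_type: str) -> list[str]:
    # Called by the application, deterministically, based on the clause type.
    return list(REVIEW_HINTS.get(clause_type, []))


def render_prompt(clause_type: str, candidates: list[str]) -> str:
    # The candidate set is rendered into the prompt, so it stays visible.
    lenses = "\n".join(f"- {c}" for c in candidates)
    return PROMPT.substitute(clause_type=clause_type, lenses=lenses)


def check_selection(candidates: list[str], selected: list[str]) -> list[str]:
    # The model may only choose from context the runtime already exposed.
    unknown = [s for s in selected if s not in candidates]
    if unknown:
        raise ValueError(f"lenses outside the candidate set: {unknown}")
    return selected
```

The check at the end is what turns "the model picked something odd" into a nameable failure: a selected lens that was never offered is rejected before it can shape the questions.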

MCP still gives me a capability boundary.

But the current workflow remains controlled.

That boundary also leaves room for future designs. If model-controlled tool use becomes the better fit later, the same capability can move closer to an agent loop. For now, the system keeps context entry deterministic because observability matters more than autonomy.

Skill is not just a prompt

Another boundary that became important was Skill.

It is tempting to treat a skill as “some extra prompt text.”

For this project, that was too weak.

I started treating Skill as a task-level operating thesis.

For contract-question-agent, the thesis is roughly:

do not return verdicts
do not give legal conclusions
do not decide whether a clause is enforceable, fair, risky, or safe to sign
return verification questions
preserve uncertainty
keep the output review-oriented

That thesis is injected into the prompt surface.

But it does not stop there.

A separate reflection step checks whether the generated output still follows the thesis. If the result drifts toward verdicts, legal conclusions, or overconfident advice, the runtime can catch that as a boundary failure.

In practice, the reflection step returns a structured result, not just prose feedback. It can mark the output as passed or failed, attach violation types, and request regeneration when the generated questions drift away from the thesis.

That keeps the Skill from becoming a decorative prompt paragraph.

The thesis becomes something the workflow can route on.
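
A rough sketch of what routing on the thesis could look like, assuming the node order in the earlier list is the execution order. `ReflectionResult`, `Violation`, `MAX_REGENERATIONS`, and the `STOP_WITH_BOUNDARY_FAILURE` label are hypothetical; the source does not say what happens when regeneration attempts run out.

```python
from dataclasses import dataclass
from enum import Enum


class Violation(str, Enum):
    VERDICT = "verdict"
    LEGAL_CONCLUSION = "legal_conclusion"
    OVERCONFIDENT_ADVICE = "overconfident_advice"


@dataclass(frozen=True)
class ReflectionResult:
    passed: bool
    violations: tuple[Violation, ...] = ()


# Assumed retry budget, chosen only for illustration.
MAX_REGENERATIONS = 2


def next_node(result: ReflectionResult, attempts: int) -> str:
    """Pick the next workflow node from a structured reflection result."""
    if result.passed:
        return "SAFETY_CHECK"
    if attempts < MAX_REGENERATIONS:
        # Thesis drift detected: ask for another generation pass.
        return "GENERATE_MINIMAL_QUESTIONS"
    # Retry budget exhausted: report the boundary failure instead of looping.
    return "STOP_WITH_BOUNDARY_FAILURE"
```

The routing function never reads prose feedback. It only needs `passed` and the attempt count, while `violations` remains available for traces and the run viewer.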

That distinction matters.

A prompt instruction asks the model to behave.

A runtime boundary gives the system a place to inspect whether it did.

Debugging got clearer

Before separating responsibilities, a bad output felt vague.

The agent output is bad.
Maybe the prompt is wrong.
Maybe the tool call failed.
Maybe the context is weak.
Maybe the model just missed it.

After separating responsibilities, the failure became easier to name.

PRECHECK_INPUT rejected the input.
CLASSIFY_SCOPE marked it out of scope.
The schema failed validation.
MCP candidate lenses were weak.
Reflection rejected thesis compliance.
Safety check caught a boundary issue.
The workflow routed correctly, but the UI hid state.

The output can still be wrong.

This design does not magically make LLMs reliable.

But it gives the failure a location.

That is the difference between staring at a vague answer and opening the next pull request with a clear target.

When responsibilities are explicit, failure becomes a signal.

The UI is not a chat surface

One surprisingly useful decision was to avoid treating the UI as a chat UI.

The Streamlit viewer in this project is not meant to be a general conversation surface. It is a run viewer.

It calls a FastAPI boundary, reads AG-UI-compatible server-sent events, and shows workflow state: which node is running, which node finished, and what safe state snapshot is available.
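
To make the run-viewer idea concrete, here is a small sketch of folding a server-sent event stream into workflow state. The event type names (`STEP_STARTED`, `STEP_FINISHED`, `STATE_SNAPSHOT`) and the `stepName` / `snapshot` fields are assumptions modeled on AG-UI-style step and snapshot events, not a statement of what the project's API emits. `parse_sse` and `fold_events` are hypothetical helpers, and no HTTP client or Streamlit code is shown.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON payload per event; events end at a blank line."""
    buffer: list[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            buffer.append(line[len("data:"):].strip())
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        yield json.loads("\n".join(buffer))


def fold_events(events: Iterable[dict]) -> dict:
    """Reduce events to what a run viewer shows: running, finished, snapshot."""
    view = {"running": None, "finished": [], "snapshot": None}
    for event in events:
        kind = event.get("type")
        if kind == "STEP_STARTED":
            view["running"] = event["stepName"]
        elif kind == "STEP_FINISHED":
            view["finished"].append(event["stepName"])
            if view["running"] == event["stepName"]:
                view["running"] = None
        elif kind == "STATE_SNAPSHOT":
            view["snapshot"] = event["snapshot"]
    return view
```

The viewer state is a plain reduction over events, so the same stream could drive a different front end without touching the workflow.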

That choice matches the architecture.

If the goal is observability, the human should see runtime state, not just final prose.

A chat interface can be useful. But for debugging an agent runtime, a chat window often hides the exact information you need most.

The interface should reflect the system boundary.

GitHub repo as implementation proof

This is not just a conceptual distinction.

The repository is not an appendix to the essay.

It is the implementation proof.

I am building these boundaries in a small open-source design experiment called contract-question-agent:

https://github.com/mofuteq/contract-question-agent

The repository is not a production-ready legal AI product.

It is a design lab for observable Agent Runtime architecture around a narrow task: turning contract concerns into verification questions instead of legal verdicts.

In the repo, you can inspect how the responsibilities are separated:

workflows/
  LangGraph orchestration and state transitions

model_client/
  Pydantic AI / OpenRouter structured model calls

schemas.py
  Pydantic output contracts

prompts/
  Jinja prompt surfaces

skills/
  contract verification-question skill thesis

mcp/
  candidate review-lens provider

api/
  FastAPI boundary and AG-UI event stream

viewer/
  Streamlit run viewer

tracing.py
  optional Langfuse tracing helpers

The repo exists because architecture essays are more useful when the reader can inspect the shape in code.

The code is the proof that the article is not just saying “use better abstractions.”

It is trying to make the responsibility boundaries visible.

If you read the repo after this essay, the useful question is not “is this a production-ready template?”

It is:

Can I see where each responsibility lives?

Closing

I still use Agent Frameworks.

This article is not an argument against them.

But I no longer want one framework to own every part of the system just because the word “agent” appears in the project.

Before asking which Agent Framework to use, I now ask:

Where does orchestration live?

Where does execution live?

Where does state live?

Where does context enter?

Where are boundaries checked?

Where does reflection happen?

Where does the UI expose state?

Where does the system become observable?

Those questions made debugging clearer than another framework comparison would have.

The framework is useful.

But it is not the architecture.

Support the ongoing experiments

If these architectural notes helped you think more clearly about agent systems, you can support the ongoing experiments here:

Buy Me a Coffee

Support goes toward LLM API credits, tracing tools, and small open-source design experiments.
