自建 AI Agent 一定要六層工具全用嗎?

不用。最低限度是框架層(例如 Pydantic AI)加上評測(Promptfoo)與可觀測性(AgentOps),確保它可量化、看得見。RAG、沙箱、部署等到真的有需求再補上,避免一開始就背太多工具。

Pydantic AI 和 LangChain 該選哪個?

流程單純、輸出要結構化、團隊習慣寫型別,優先試 Pydantic AI,它把輸出用 schema 綁死較好維護。要串大量現成整合或團隊已熟 LangChain,再用 LangChain,但要承擔它抽象層厚、版本變動快的成本。

什麼情況下 Agent 才需要執行沙箱?

當 Agent 會「自己產生程式碼並執行」或執行不可控的系統指令時,就需要 Blaxel 這類沙箱做隔離。如果它只是查資料、呼叫固定 API,風險較低,可以先不用。

可觀測性工具值得在專案早期就導入嗎?

值得。多步驟 Agent 出錯時,沒有 AgentOps、Langfuse 這類工具記錄執行軌跡,你很難知道是哪一步出問題。早點接等於買保險,事後補通常代價更大。

想自己做一個 AI Agent,該準備哪些工具?2026 開發者工具鏈全圖解

Demo 跑得漂亮、上線就翻車,問題九成出在工具鏈缺了一層。這篇把 2026 自建 AI Agent 的開發堆疊拆成六層:框架、RAG、沙箱、可觀測性、評測、部署,逐層說明各自解決什麼問題、什麼時候才需要它。

On Friday afternoon, a friend sent me his Agent demo video: the user typed a sentence, and the Agent searched for information, called three APIs, and returned a clear and concise summary. Impressive. I asked him, "Is it online?" He was silent for two seconds before responding, "The version that went online yesterday sent the customer's order number to the refund process by mistake, and we don't know which step went wrong."

This is almost the same pitfall that every self-built Agent team will encounter. Demos are a smooth road, but the production environment is a net - if any node fails, the entire chain will break, and you often don't even know where it broke. The difference lies not in how smart the model is, but in whether the toolchain behind it is complete.

Why 2026 is about the "toolchain" rather than "a framework"

In 2023 and 2024, everyone was asking "which framework to use to build an Agent." By 2026, this question is no longer relevant. The framework is just the top layer. A real Agent that can go online, be maintained, and debugged when something goes wrong has a clearly divided stack behind it - just like backend engineering, which requires not only a web framework but also databases, caching, logging, monitoring, and CI/CD.

The special thing about Agents is that they are "uncertain." Given the same input, the model may give different steps; it will decide whether or not to call tools and which ones to call. This uncertainty makes the traditional development habit of "writing and running, then checking the stack trace when it fails" completely ineffective. You need a new layer of tools to deal with problems like "why it did this," "whether it did it correctly," and "what issues it encountered online."

Key point: breaking down the toolchain into six layers

I'm used to breaking down the toolchain for self-built Agents into six layers, from top to bottom:

Framework layer: determines how the Agent defines, calls tools, and manages multi-step processes.
Retrieval layer (RAG): allows the Agent to read private data, rather than just relying on the knowledge built into the model.
Execution sandbox layer: provides a controlled environment for the Agent to run code and execute commands.
Observability layer: records every step of reasoning and every tool call, making it visible what the Agent is thinking.
Evaluation and security layer: uses a fixed set of test cases to measure the Agent's performance before going online and conducts red team testing.
Deployment layer: deploys the entire system stably and scalably online.

Not every project needs all six layers, but you should at least know what each layer does and which one you're missing.

Breaking down each layer: what problems they solve and when they're needed

Framework layer: Pydantic AI and LangChain

The framework helps you handle "how to glue models, tools, and processes together." Pydantic AI is a popular choice among Python engineers in the past two years because it binds outputs to type schemas - the Agent returns a structured, validated object that you can use directly, rather than a string that needs to be parsed. For backend engineers who are used to writing typed code, it's easy to pick up.

LangChain is another extreme: it has the largest ecosystem and integrates the most tools, making it easy to connect to almost anything. However, its abstraction layer is thick, and its versions change quickly, making it overkill for small projects. My suggestion is to try Pydantic AI first if your process is simple and your output needs to be structured; if you need to integrate many existing tools and your team is already familiar with LangChain, use it. Don't choose a tool just because "everyone is using it."

Retrieval layer: RAGFlow

If your Agent doesn't have RAG, it can only rely on the knowledge built into the model during training, and it will start to falter when faced with your company's internal files or latest product specifications. RAGFlow handles the most underestimated part of the chain: document chunking, parsing, vectorization, and retrieval. It's particularly good at handling PDFs, tables, and complex documents, which is common in Taiwanese enterprises with many PDF contracts and specification books. When do you need it? Whenever your Agent needs to answer "things that only your company knows."

Execution sandbox layer: Blaxel

When your Agent starts to "write code and execute it," danger arises. It may run into an infinite loop, read files it shouldn't, or call external networks. Blaxel provides a sandboxed environment for the Agent to run these uncontrollable actions, so even if it fails, it won't crash your main machine. If your Agent only searches for data or calls fixed APIs, you may not need it; but once it needs to dynamically execute code, a sandbox is no longer optional, it's necessary.

Observability layer: AgentOps and Langfuse

This is the layer that my friend at the beginning was lacking the most. When the Agent runs a multi-step process, you need to see what it's thinking, what tools it called, and which step went wrong. AgentOps focuses on the Agent's execution trajectory, recording every step and every tool call into a replayable timeline, which is lifesaving when debugging multi-step processes. Langfuse is more focused on online monitoring and analysis, suitable for tracking every conversation and every cost in the production environment. The two have slightly overlapping but different focuses, which I'll discuss in more detail in another article about evaluation and security.

Evaluation and security layer: Promptfoo

The most frightening thing about Agents is not when they crash, but when they "silently do something wrong" - they return an answer that looks reasonable but is actually incorrect, and no one notices. Promptfoo allows you to fix a set of test cases, run them every time you change the prompt or model, and quantify whether the answer quality has improved or deteriorated, while also conducting red team testing to find vulnerabilities that can be bypassed. The spirit of this layer is to turn "I think it's better" into "the numbers say it's better." To understand the complete method, you can read the follow-up article 〈AI Agent evaluation and security checks before going online〉.

Deployment layer: Northflank

Finally, there's deploying the entire system online. Agent deployment is more complicated than general web services: it requires long-term execution, background tasks, and sometimes managing sandbox containers. Northflank helps you handle containerization, scaling, and CI/CD, so you don't have to build Kubernetes from scratch. Small teams can use it to save the manpower of a DevOps engineer.

Three types of teams, three ways to play

Taiwanese individual developers: don't try to do all six layers at once. First, use Pydantic AI to build the core process, then connect Promptfoo to ensure it doesn't get worse with changes, and add other layers as needed. One person can only maintain a limited number of tools, so be restrained.

Startup teams: observability is key. I've seen too many teams leave AgentOps and Langfuse until "later," only to spend three days searching through logs when something goes wrong. Connect them early, which is like buying insurance. For the deployment layer, directly use a managed platform like Northflank to save time and focus on polishing the product.

Enterprises: RAG and security are crucial. Your data is sensitive, and compliance requirements are high, so RAGFlow's private deployment capabilities, Promptfoo's red team testing, and sandbox isolation are all directly related to the questions that the risk control department will ask. It's recommended to find a small scenario to walk through all six layers first, then replicate it horizontally.

Practical approach: where to start

If you're just starting, my order is: framework (Pydantic AI) → evaluation (Promptfoo) → observability (AgentOps) → add RAG (RAGFlow) and sandbox (Blaxel) as needed → finally deploy (Northflank).

Note that evaluation and observability come before RAG. Because an Agent without evaluation and visibility is just piling uncontrollable things higher. First, make it "quantifiable and visible," then make it "stronger." If you haven't done any Agent development before, you can start by reading the introductory article 〈How to build your own AI Agent〉 to establish basic concepts, then come back to this toolchain.

TheAI Academy summary and comment

The competition for self-built Agents in 2026 is no longer about "whose model is smarter," but about "whose toolchain is more complete." Anyone can do a demo, but being able to deploy an Agent online, debug it when something goes wrong, and ensure it doesn't deteriorate with changes - that's the real skill.

"Many people can build Agents, but few can deploy them online safely and maintain them - the latter is the truly valuable engineering capability in 2026."

My specific suggestion for Taiwanese readers is: don't be intimidated by the number of tools, and don't be greedy and try to install everything at once. Choose a small task, and first walk through the framework, evaluation, and observability layers - these three layers are enough to make you better than 90% of people who only know how to do demos. RAG, sandbox, and deployment can be added later when you really encounter the wall.