The age of chatbot-only AI is over. AI agents are already a reality — intelligent systems that can reason, act, use tools, and collaborate to execute complex multi-step tasks autonomously. This guide walks through the complete architecture and implementation of a production-ready 4-agent research pipeline on Azure AI Foundry, from design decisions to live deployment.
By the end of this guide you will have a fully working pipeline that accepts a research topic, searches the web via Bing, writes a structured report, scores it on four quality dimensions, and retries automatically if the quality falls short — all running in Microsoft Foundry with live traces in Application Insights.
01 What We Are Building
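The pipeline is a 4-agent system where each agent has a single, well-defined role:
| Agent | Role |
|---|---|
| Researcher | Searches the web via Bing and returns findings with sources, confidence levels, and gaps |
| Writer | Writes the structured report using only the findings it is given |
| Critic | Scores the draft on grounding, coverage, format, and audience fit |
| Orchestrator | Calls the other three in sequence, manages shared state, and decides whether to retry or deliver |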
The full output includes: a grounded report with inline citations, quality scores across all four dimensions, sources used, identified gaps, number of iterations taken, and live traces in Application Insights showing every agent call, token usage, and latency.
02 What Is an AI Agent?
AI agents are applications that use language models to understand what you need and then take action on your behalf. They can answer questions, make decisions, and complete tasks automatically. What sets agents apart from chatbots is that they keep track of the conversation and can carry out actions, not just reply.
An agent has three core components:
| Component | What it provides |
|---|---|
| Model | A deployed generative AI model (GPT-4o, Claude Sonnet, etc.) that enables the agent to reason and generate natural language responses |
| Knowledge | Data sources that ground prompts with context — Bing search results, Azure AI Search indices, your own documents |
| Tools | Programmatic functions that let the agent take action — built-in tools like Bing and AI Search, a code interpreter, or custom functions via Azure Functions |
03 Security Considerations
Because agents can access sensitive data, make decisions, and act independently, security must be designed in from the start — not added afterward. The five risks that matter most in production multi-agent systems:
- Prompt Injection and Manipulation Malicious content in web search results or documents can attempt to override agent instructions mid-pipeline.
- Unauthorized Access and Privilege Escalation Agents that have more permissions than their role requires become a broad attack surface.
- Over-Reliance on Autonomous Actions High-stakes decisions made without human oversight can cause irreversible damage.
- Data Poisoning Corrupted training data or tampered knowledge sources degrade output quality silently over time.
- Inadequate Auditability Without comprehensive logging, you cannot reconstruct what happened when something goes wrong.
04 Why Azure AI Foundry Agent Service
You can build agents using a managed platform like Microsoft Foundry, or use frameworks such as LangGraph, CrewAI, or Google ADK — the right choice depends on how much control and orchestration you need.
Microsoft Foundry Agent Service is a fully managed service that provides both a visual agent development experience in the portal and a code-first development experience using the Foundry SDK. It handles the underlying compute, storage, and security so you can focus on agent design.
The key question is when to stay in the no-code portal versus moving to custom code:
| Use the Foundry Portal (no code) | Write custom code |
|---|---|
| Single agent use cases | Multi-agent with complex coordination |
| Simple linear workflows | Conditional logic and dynamic routing |
| Basic tool usage (Bing, AI Search) | Custom APIs or unsupported integrations |
| Quick prototyping and demos | Production-grade control and flexibility |
| Drag-and-drop multi-agent wiring | Advanced orchestration (loops, retries, critic flows) |
| Built-in testing via playground | Custom evaluation, logging, monitoring |
| Standard deployments (Teams, M365) | Custom UI (web apps, dashboards) |
This guide uses the portal for setup and agent creation, then shows you the code path for programmatic access and web app deployment.
05 Pipeline Architecture
The orchestrator coordinates the overall flow — it decomposes the task, manages state, calls sub-agents in sequence, evaluates the output quality, and decides whether to retry or deliver. Here is the full flow:
06 Five Design Decisions That Shape Everything
When designing a multi-agent system, a handful of decisions shape everything — from how agents collaborate to how the system handles failures. Get these right before writing a single line of code.
1. Orchestration Strategy — Static vs. Dynamic Routing
Should the orchestrator use a fixed sequence (researcher always runs first, then writer) or dynamic routing (the LLM decides what to run next)? Fixed is more predictable and easier to debug. Dynamic is more powerful but harder to govern.
2. Shared State Format — Your Most Important Design Decision
What does the researcher hand off to the writer? A raw dump of search results? A structured JSON summary? A markdown outline? This interface contract deserves the same design attention as an API schema. Define it explicitly. Here is the state object used in this pipeline:
{
  "request": {
    "topic": "Impact of AI agents on enterprise productivity",
    "audience": "C-suite executives, non-technical",
    "format": "report",
    "word_limit": 600,
    "depth": "standard"
  },
  "search_queries": ["queries run so far"],
  "findings": [
    {
      "fact": "stated fact clearly",
      "source_url": "https://...",
      "source_title": "Title of the source",
      "confidence": "high"
    }
  ],
  "gaps": ["topic areas where reliable information could not be found"],
  "draft": "writer output",
  "quality_score": {
    "grounding": 0.0,
    "coverage": 0.0,
    "format": 0.0,
    "audience_fit": 0.0,
    "overall": 0.0,
    "feedback": "...",
    "retry_needed": true
  },
  "iteration": 1,
  "max_iterations": 3,
  "final_output": null
}
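One way to make this contract explicit in code is a set of typed structures. The following is a minimal sketch using Python dataclasses. The class names (`Finding`, `QualityScore`, `PipelineState`) and the `new_state` helper are illustrative assumptions, not part of any Foundry SDK.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional
import json


@dataclass
class Finding:
    fact: str
    source_url: str
    source_title: str
    confidence: str  # "high", "medium", or "low"


@dataclass
class QualityScore:
    grounding: float = 0.0
    coverage: float = 0.0
    format: float = 0.0
    audience_fit: float = 0.0
    overall: float = 0.0
    feedback: str = ""
    retry_needed: bool = True


@dataclass
class PipelineState:
    request: dict
    search_queries: list = field(default_factory=list)
    findings: list = field(default_factory=list)
    gaps: list = field(default_factory=list)
    draft: str = ""
    quality_score: QualityScore = field(default_factory=QualityScore)
    iteration: int = 1
    max_iterations: int = 3
    final_output: Optional[str] = None


def new_state(topic, audience, fmt="report", word_limit=600, depth="standard"):
    """Build the initial state for one pipeline run (hypothetical helper)."""
    request = {"topic": topic, "audience": audience, "format": fmt,
               "word_limit": word_limit, "depth": depth}
    return PipelineState(request=request)


state = new_state("Impact of AI agents on enterprise productivity",
                  "C-suite executives, non-technical")
state.findings.append(Finding("stated fact clearly", "https://example.com",
                              "Title of the source", "high"))
# asdict() recurses into nested dataclasses, so the result mirrors the JSON shape above.
print(json.dumps(asdict(state), indent=2))
```

Keeping the schema in one place means every agent handoff can be serialized and logged in the same shape.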
3. Tool Boundaries — One Role, Minimum Tools
Each agent should only have tools relevant to its role. More tools means more decision overhead for the model and more chances it picks the wrong one. The architect's rule of thumb: keep total tools under 6 per agent.
| Agent | Essential tools | Optional (add only when needed) |
|---|---|---|
| Researcher | Web search (Bing), Fetch URL, Save notes to state | DB query, Knowledge graph, Email/Calendar via Graph API |
| Writer | None — pure LLM generation from structured input | Formatting templates, document output |
| Critic | None — pure LLM scoring from state | – |
| Orchestrator | Call sub-agents (Researcher, Writer, Critic) | – |
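If you manage agent configuration in code, the boundaries in the table can be enforced with a simple allowlist. This is a sketch only: the tool identifiers below are hypothetical names standing in for the "Essential tools" column, and `validate_tools` is not a Foundry API.

```python
# Hypothetical tool identifiers mirroring the "Essential tools" column.
ALLOWED_TOOLS = {
    "researcher": {"web_search", "fetch_url", "save_notes"},
    "writer": set(),
    "critic": set(),
    "orchestrator": {"call_researcher", "call_writer", "call_critic"},
}
MAX_TOOLS_PER_AGENT = 5  # "under 6", per the rule of thumb


def validate_tools(agent, requested):
    """Return the requested tools as a set if they respect the agent's boundary."""
    requested = set(requested)
    extra = requested - ALLOWED_TOOLS[agent]
    if extra:
        raise ValueError(f"{agent} may not use: {sorted(extra)}")
    if len(requested) > MAX_TOOLS_PER_AGENT:
        raise ValueError(f"{agent} has {len(requested)} tools; keep it under 6")
    return requested


print(validate_tools("researcher", ["web_search", "fetch_url"]))
```

Running a check like this at startup catches a writer that has quietly been given search access before it reaches production.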
4. Failure Handling — Quality Gates Between Agents
What happens if the researcher finds nothing useful? Does the writer get empty state and hallucinate, or does the orchestrator detect low-quality research and retry? You need explicit quality gates between every agent handoff. This is where most production incidents originate — invest in it.
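A gate between the researcher and the writer can be as small as the function below. It is a sketch under assumed thresholds: `min_findings` and `min_reliable` are illustrative defaults, not values from this pipeline, and the finding fields follow the state object defined earlier.

```python
def research_gate(findings, min_findings=3, min_reliable=2):
    """Decide whether research output is good enough to hand to the writer.

    Returns (passed, reason). The thresholds are illustrative defaults.
    """
    if len(findings) < min_findings:
        return False, f"only {len(findings)} findings; need at least {min_findings}"
    reliable = [f for f in findings if f.get("confidence") in ("high", "medium")]
    if len(reliable) < min_reliable:
        return False, f"only {len(reliable)} high/medium-confidence findings"
    missing_source = [f for f in findings if not f.get("source_url")]
    if missing_source:
        return False, f"{len(missing_source)} findings have no source URL"
    return True, "ok"


passed, reason = research_gate([])
print(passed, reason)
```

When the gate fails, the reason string can be passed back to the researcher as feedback for the retry rather than letting the writer work from empty state.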
5. Observability From Day One
Log every tool call, every inter-agent message, and every state transition. Application Insights combined with Foundry's trace logging makes this straightforward. The trace log is your only way to reconstruct what happened when the pipeline produces a bad result — and it will.
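For custom orchestration code, the same discipline can start with one structured log record per event. The sketch below uses only Python's standard `logging` module; the record fields and the `log_event` helper are assumptions for illustration, and wiring these records into Application Insights is left out.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("research_pipeline")


def log_event(run_id, agent, event, **fields):
    """Emit one JSON trace record and return it as a dict."""
    record = {"ts": time.time(), "run_id": run_id, "agent": agent,
              "event": event, **fields}
    logger.info(json.dumps(record))
    return record


run_id = str(uuid.uuid4())  # one id per pipeline run ties all records together
log_event(run_id, "orchestrator", "state_transition", from_step="research", to_step="write")
log_event(run_id, "researcher", "tool_call", tool="web_search", query="AI agents productivity")
```

A shared `run_id` on every record is what lets you reconstruct a single pipeline run from interleaved logs.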
07 The Orchestrator in Detail
The orchestrator has four distinct responsibilities. Being precise about each one before you write any code is the architect's most important contribution to the system.
- Task Decomposition When a complex request arrives, the orchestrator breaks it into sub-tasks and decides which agent handles each. This is a single LLM call with a system prompt like "you are a task router, here are your available agents" — no magic, just prompt engineering.
- Sequence and Routing The orchestrator decides the order agents run in. Static: always researcher → writer → critic. Dynamic: the LLM decides at runtime. Default to static unless you have a clear reason not to.
- State Management The orchestrator owns the shared state object — the running memory of the pipeline. It decides what each agent sees. The researcher writes findings into state. The orchestrator then passes only the relevant parts to the writer. Treat this like an API contract.
- Quality Gating and Loop Control After each agent runs, the orchestrator evaluates: is the output good enough to proceed, or should it retry? This is where bad research is caught before it flows silently into a polished-looking but wrong report.
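The four responsibilities above can be sketched as a plain Python loop. This is an illustration of the static sequence with bounded retries, not how Foundry's connected agents execute internally. The agents are passed in as ordinary callables, and their signatures are assumptions made for this example.

```python
def run_pipeline(request, researcher, writer, critic, max_iterations=3, threshold=0.75):
    """Static research -> write -> critique loop with bounded retries.

    Assumed callable signatures (hypothetical):
      researcher(request, feedback) -> {"findings": [...], "gaps": [...]}
      writer(request, research, feedback) -> draft string
      critic(request, research, draft) -> dict with "overall", "coverage", "feedback"
    """
    state = {"request": request, "iteration": 0, "best": None}
    research = researcher(request, None)
    feedback = None
    while state["iteration"] < max_iterations:
        state["iteration"] += 1
        draft = writer(request, research, feedback)
        score = critic(request, research, draft)
        # Remember the best draft so far in case no iteration passes the gate.
        if state["best"] is None or score["overall"] > state["best"][1]["overall"]:
            state["best"] = (draft, score)
        if score["overall"] >= threshold:
            break
        feedback = score["feedback"]
        # Low coverage is a research problem; anything else is a writing problem.
        if score["coverage"] < 0.6 and state["iteration"] < max_iterations:
            research = researcher(request, feedback)
    draft, score = state["best"]
    return {"final_output": draft, "quality_score": score,
            "iterations": state["iteration"], "gaps": research["gaps"]}


# Stub agents so the loop can be exercised without any model calls.
def fake_researcher(request, feedback):
    return {"findings": [{"fact": "f", "source_url": "https://example.com",
                          "source_title": "t", "confidence": "high"}], "gaps": []}


def fake_writer(request, research, feedback):
    return "draft v2" if feedback else "draft v1"


def fake_critic(request, research, draft):
    overall = 0.8 if draft == "draft v2" else 0.5
    return {"overall": overall, "coverage": 0.9, "feedback": "tighten the summary"}


result = run_pipeline({"topic": "AI agents"}, fake_researcher, fake_writer, fake_critic)
print(result["iterations"], result["quality_score"]["overall"])  # prints: 2 0.8
```

Testing the loop with stubs like these lets you verify retry and termination behavior before spending tokens on real agents.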
08 The Four System Prompts
Researcher Agent
researcher-agent system prompt:
You are a diligent research assistant.
When given a topic, search the web thoroughly to find high-quality,
relevant sources. For each distinct fact you find, record:
- The fact itself (clear and specific)
- The source URL
- The source title
- Your confidence level: high, medium, or low
After researching, return your findings as structured JSON at the
end of your response in this exact format:
{
"findings": [
{
"fact": "state the fact clearly",
"source_url": "https://...",
"source_title": "Title of the source",
"confidence": "high"
}
],
"gaps": [
"describe any topic area where you could not find reliable information"
]
}
Rules:
- Only record facts you found in actual sources
- Do not invent or assume any statistics or claims
- Mark anything uncertain as confidence: "low"
- Always include the gaps array even if empty
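Because this prompt asks for JSON at the end of a free-text response, code that consumes the researcher's output has to locate the JSON object first. The helper below is one possible approach using only the standard library. `extract_findings` is a hypothetical name, and the approach assumes the first parseable object with a "findings" key is the one you want.

```python
import json


def extract_findings(response_text):
    """Return the first JSON object in the text that has a "findings" key."""
    decoder = json.JSONDecoder()
    start = response_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index.
            obj, _ = decoder.raw_decode(response_text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "findings" in obj:
            obj.setdefault("gaps", [])  # the prompt asks for gaps even if empty
            return obj
        start = response_text.find("{", start + 1)
    raise ValueError("no findings JSON found in researcher response")


text = (
    "I searched {several} sites.\n"
    '{"findings": [{"fact": "x", "source_url": "https://example.com", '
    '"source_title": "t", "confidence": "low"}]}'
)
print(extract_findings(text)["findings"][0]["fact"])  # prints: x
```

Raising on a missing object, rather than returning an empty result, gives the orchestrator a clear signal to retry the research step.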
Writer Agent
writer-agent system prompt:
You are a professional research writer.
You will receive:
1. A topic
2. A target audience
3. A requested output format (report, bullets, or memo)
4. A word limit
5. Research findings in JSON format (facts with sources)
6. A list of gaps (things that could not be found)
Your job is to write a polished output using ONLY the facts provided.
FORMAT RULES:
- If format is "report": use ## headers, bullet points under each
section, and a numbered Sources list at the end
- If format is "bullets": write 5-8 concise bullet points, lead
with the most important finding, no headers
- If format is "memo": use To/From/Subject/Date header,
end with exactly 3 recommended actions
GROUNDING RULES (critical):
- Only use facts present in the research findings provided
- If a finding has confidence "low", add "(unverified)" after it
- Never invent statistics, quotes, or claims not in the findings
- If findings are insufficient, say so explicitly — do not fill
gaps with assumptions
CITATION RULES:
- Add [1] [2] etc after each key claim in the text
- List all sources numerically at the end
GAPS:
- If there are gaps, add a short "Limitations" section at the end
listing what information was not available
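The citation rules in this prompt can be spot-checked mechanically before the draft reaches the critic. The sketch below is a hypothetical helper, not part of the pipeline as described. It assumes the text passed in uses `[n]` markers only for inline citations, so a Sources list that also uses brackets should be stripped first.

```python
import re


def check_citations(draft, num_sources):
    """Return a list of problems with the [n] citation markers in a draft."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", draft)}
    problems = []
    if not cited:
        problems.append("draft contains no citation markers")
    out_of_range = sorted(n for n in cited if n < 1 or n > num_sources)
    if out_of_range:
        problems.append(f"citations with no matching source: {out_of_range}")
    uncited = sorted(set(range(1, num_sources + 1)) - cited)
    if uncited:
        problems.append(f"sources never cited: {uncited}")
    return problems


print(check_citations("Agents help [1]. They scale [3].", num_sources=2))
```

A deterministic check like this is cheap and complements the critic's grounding score, which relies on model judgment.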
Critic Agent
critic-agent system prompt:
You are a strict quality reviewer for AI-generated research reports.
You will receive:
1. The original research request (topic, audience, format, word limit)
2. The research findings that were available
3. The draft report written by the writer
Score the draft on each dimension from 0.0 to 1.0:
GROUNDING (0-1)
Does every factual claim trace back to a provided finding?
Penalise any claim not verifiable in the findings.
COVERAGE (0-1)
Does the draft address the full scope of the topic?
Note any significant areas that were missed.
FORMAT (0-1)
Does the output exactly match the requested format, length,
and structural requirements?
AUDIENCE FIT (0-1)
Is the tone, vocabulary, and depth right for the stated audience?
IMPORTANT: Return ONLY valid JSON. No markdown. No explanation
before or after. Just the JSON object:
{
"grounding": 0.0,
"coverage": 0.0,
"format": 0.0,
"audience_fit": 0.0,
"overall": 0.0,
"feedback": "Specific, actionable notes telling the other agents what to fix",
"retry_needed": true
}
Set retry_needed to false only if overall score is 0.75 or above.
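Each scoring dimension catches a different class of failure. A grounding score below 0.6 should trigger a retry regardless of the overall score; this is your hallucination firewall. As written, the critic prompt above and the orchestrator prompt below base the retry decision on the overall score alone, so add the grounding rule to one of them if you want it enforced. Low coverage tells the orchestrator to send the researcher back with a more targeted query, not to rewrite the draft. Low audience fit means the writing needs to change, not the research.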
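If you consume the critic's output in code, these rules are easy to apply deterministically instead of trusting the model's own `retry_needed` flag. The following is a sketch under that assumption. The function names are hypothetical, and the thresholds are the 0.75 and 0.6 values quoted in this section.

```python
import json

DIMENSIONS = ("grounding", "coverage", "format", "audience_fit", "overall")


def parse_critic_score(raw, pass_threshold=0.75, grounding_floor=0.6):
    """Parse the critic's JSON and recompute retry_needed from the numbers."""
    score = json.loads(raw)
    for name in DIMENSIONS:
        value = score.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a number between 0 and 1, got {value!r}")
    # Grounding firewall: a low grounding score forces a retry even if overall passes.
    score["retry_needed"] = (score["overall"] < pass_threshold
                             or score["grounding"] < grounding_floor)
    return score


def next_agent(score, coverage_floor=0.6):
    """Map a parsed score to the next step: deliver, researcher, or writer."""
    if not score["retry_needed"]:
        return "deliver"
    if score["coverage"] < coverage_floor:
        return "researcher"  # missing material: go back for more research
    return "writer"  # material is there: the draft itself needs work


raw = ('{"grounding": 0.5, "coverage": 0.9, "format": 0.9, "audience_fit": 0.9, '
       '"overall": 0.8, "feedback": "two claims lack sources", "retry_needed": false}')
print(next_agent(parse_critic_score(raw)))  # prints: writer
```

In the example, the overall score passes but the grounding score does not, so the draft goes back to the writer.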
Orchestrator Agent
orchestrator-agent system prompt:
You are the orchestrator of a research pipeline.
You manage three specialist agents: Researcher, Writer, and Critic.
When the user gives you a research topic, follow these steps:
STEP 1 — RESEARCH
Call the researcher-agent with:
- The topic
- The requested depth (brief/standard/deep)
STEP 2 — WRITE
Pass the researcher's findings to the writer-agent with:
- The topic, audience, format, word limit
- The full findings JSON from the researcher
STEP 3 — EVALUATE
Pass the draft to the critic-agent with:
- The original request details
- The findings that were available
- The draft from the writer
STEP 4 — DECIDE
Read the critic's JSON score:
- If overall >= 0.75 OR retry_needed is false: deliver the draft
- If overall < 0.75 AND you have run fewer than 3 iterations:
  - If coverage < 0.6: call researcher-agent again with the critic's feedback, then repeat Step 2 with the new findings
  - Otherwise: call writer-agent again with the critic's feedback
  - Then go back to Step 3
- If you have already run 3 iterations: deliver the best draft
  and note the quality score
STEP 5 — DELIVER
Return the final report to the user followed by:
- Quality score (overall)
- Number of iterations taken
- Any gaps identified by the researcher
Always be transparent — if quality is below 0.75 after retries,
tell the user and explain what information was missing.
09 Model Selection Per Agent
The researcher and writer do not need to use the same model. The researcher benefits from a model with strong tool-use and reasoning — GPT-4o or Claude Sonnet. The writer is doing pure generation from structured input, so a faster, cheaper model (GPT-4o-mini or Claude Haiku) often produces equally good results at a fraction of the cost.
| Agent | Recommended model | Reason |
|---|---|---|
| Orchestrator | GPT-4o | Needs strong reasoning for routing and quality gate decisions |
| Researcher | GPT-4o | Tool-use accuracy and source evaluation require the strongest model |
| Writer | GPT-4o-mini | Pure generation from structured input; far cheaper per token, with quality that is often comparable |
| Critic | GPT-4o-mini | Structured JSON scoring from provided text — cost-effective |
Foundry's model router can handle per-agent model selection automatically, routing to the cheaper model when the task is generation-only. For simplicity in this walkthrough, we deploy one model shared across all four agents.
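If you later move from the shared deployment to per-agent models, keeping the mapping in one place makes the switch a configuration change. This is a small sketch, assuming each deployment is named after its model. The `deployment_for` helper is hypothetical and not part of the Foundry SDK.

```python
SHARED_DEPLOYMENT = "gpt-4o"  # the single deployment used in this walkthrough

# Recommended per-agent models from the table; deployment names are assumed
# to match the model names.
PER_AGENT_DEPLOYMENT = {
    "orchestrator": "gpt-4o",
    "researcher": "gpt-4o",
    "writer": "gpt-4o-mini",
    "critic": "gpt-4o-mini",
}


def deployment_for(agent, use_per_agent_models=False):
    """Return the deployment name an agent should be created with."""
    if use_per_agent_models:
        return PER_AGENT_DEPLOYMENT[agent]
    return SHARED_DEPLOYMENT


print(deployment_for("writer"), deployment_for("writer", use_per_agent_models=True))
```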
10 Step-by-Step Setup on Azure AI Foundry
Create agents in this exact order: Researcher → Writer → Critic → Orchestrator. The orchestrator must be created last because it references the other three by ID.
Phase 0 — Azure Environment
- Create a resource group Go to portal.azure.com → Resource groups → Create. Name: rg-research-pipeline. Region: East US 2 (recommended; most Foundry features are available here).
- Create an AI Hub and Project Go to ai.azure.com → + Create new → AI hub resource. Hub name: hub-research-pipeline. Project name: project-research-pipeline. Resource group: rg-research-pipeline.
Phase 1 — Deploy a Model
In your project → left sidebar → Models + endpoints → + Deploy model → Deploy base model. Search for gpt-4o. Set the tokens-per-minute rate limit to 50K (enough for four agents running in sequence). All four agents share this single deployment.
Phase 2 — Create and Connect Bing Search
Only the researcher agent needs Bing. Go to portal.azure.com → Create a resource → search Bing Search → select Grounding with Bing Search. Use the S1 pricing tier — the free tier does not support grounding.
After creating the resource, go to ai.azure.com → your project → Management Center → Connected resources → New Connection → Grounding with Bing Search. Select your bing-search-pipeline resource and click Add connection.
Phase 3 — Application Insights (Monitoring)
Go to portal.azure.com → Create a resource → Application Insights. Name: insights-research-pipeline. Region: East US 2. Resource mode: Workspace-based. Copy the Connection String from the Overview page for later use.
Connect it the same way as Bing: Foundry portal → Management Center → Connected resources → New Connection → Application Insights.
This gives you live traces of every agent call, token usage, latency, and errors in a single dashboard.
Phase 4 — Content Filters
Content filters protect the pipeline from harmful inputs and outputs. Configure once and it applies to all agents using gpt-4o.
Go to ai.azure.com → Guardrails + controls → Create a custom content filter. Name: filter-research-pipeline. On both the Input and Output filter tabs set: Hate: Medium, Violence: Medium, Sexual: High block, Self-harm: High block.
Then attach it to your model: Models + endpoints → gpt-4o → Edit → Content filter → select filter-research-pipeline → Submit changes.
Phase 5 — Create the Four Agents
For each agent: ai.azure.com → your project → Agents → + New agent. Set the agent name, select gpt-4o as the deployment, and paste the corresponding system prompt from Section 08.
For the Researcher agent only: under Knowledge → + Add → Grounding with Bing Search → select the Bing connection created in Phase 2 (it may be listed as bingsearchpipeline). Test it in the Playground by typing: "Research the topic AI agents in enterprise software 2025". You should see it call BingSearch and return structured JSON findings.
For the Orchestrator agent only: under Connected Agents, add all three other agents (researcher-agent, writer-agent, critic-agent) with their activation instructions. The orchestrator is the agent the user talks to — it must be created last.
Phase 6 — End-to-End Test in the Playground
Open the orchestrator agent in the Playground and send this test message:
Research topic: Impact of AI agents on enterprise productivity in 2025
Audience: C-suite executives, non-technical
Format: report
Depth: brief
Word limit: 400
Click Thread logs — you should see: orchestrator calls researcher → researcher calls Bing → returns findings JSON → orchestrator calls writer → returns draft → orchestrator calls critic → returns quality scores → orchestrator delivers final report. If all four agents fire in sequence, the pipeline is working.
11 Calling the Pipeline from Code
Once your agents are working in the Playground, you can call the orchestrator programmatically using the Azure AI Projects SDK:
.env
AZURE_AI_PROJECT_ENDPOINT=https://hub-research-pipeline.services.ai.azure.com/api/projects/project-research-pipeline
ORCHESTRATOR_AGENT_ID=asst_xxxxxxxxxxxx
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
Terminal
pip install azure-ai-projects azure-identity
az login
run_pipeline.py
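import os

from azure.ai.agents.models import ListSortOrder
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

# Method names below follow azure-ai-projects 1.x, which matches the
# endpoint-based constructor. Earlier betas used client.agents.create_thread(),
# create_message(), create_and_process_run(), and list_messages() instead.
# The variables from .env must already be exported into the environment.
client = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

# Create a thread and send a message to the orchestrator
thread = client.agents.threads.create()
client.agents.messages.create(
    thread_id=thread.id,
    role="user",
    content="""
Research topic: Impact of AI agents on enterprise productivity in 2025
Audience: C-suite executives, non-technical
Format: report
Depth: standard
Word limit: 600
""",
)

# Run the orchestrator and block until the run reaches a terminal state
run = client.agents.runs.create_and_process(
    thread_id=thread.id,
    agent_id=os.environ["ORCHESTRATOR_AGENT_ID"],
)
if run.status == "failed":
    raise RuntimeError(f"Run failed: {run.last_error}")

# Get the response: newest first, so the first assistant message is the final report
messages = client.agents.messages.list(thread_id=thread.id, order=ListSortOrder.DESCENDING)
last = next(m for m in messages if m.role == "assistant")
print(last.text_messages[-1].text.value)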
You can also export ready-to-run code directly from the Foundry portal: Agents → your agent → </> View code (top toolbar). This generates a fully working script in Python, C#, or JavaScript that connects to your project, creates a thread, sends a message, and prints the response.
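When calling the orchestrator from code, it helps to build the request message from validated fields rather than hand-editing a string. The helper below is a sketch with a hypothetical name, `build_request_message`. The allowed formats and depths are the ones named in the writer and orchestrator prompts.

```python
def build_request_message(topic, audience, fmt="report", depth="standard", word_limit=600):
    """Format a research request in the layout used for the Playground test."""
    if fmt not in ("report", "bullets", "memo"):
        raise ValueError(f"unsupported format: {fmt}")
    if depth not in ("brief", "standard", "deep"):
        raise ValueError(f"unsupported depth: {depth}")
    return "\n".join([
        f"Research topic: {topic}",
        f"Audience: {audience}",
        f"Format: {fmt}",
        f"Depth: {depth}",
        f"Word limit: {word_limit}",
    ])


print(build_request_message(
    "Impact of AI agents on enterprise productivity in 2025",
    "C-suite executives, non-technical",
))
```

The returned string can be passed as the `content` of the user message sent to the orchestrator thread.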
12 Monitoring in Application Insights
After each pipeline run, go to your Foundry project → Monitoring (left sidebar). You should see a summary showing total token count, prompt token count, and completion token count for the run. For deeper dashboards — per-agent latency, error rates, retry frequency — open the insights-research-pipeline resource directly in the Azure portal.
The combination of Foundry trace logging and Application Insights gives you full observability: which agent was called, what it received, what it returned, how long it took, and how many tokens it consumed. This is not optional infrastructure — it is your only way to diagnose failures, control costs, and build confidence in the system over time.
The best time to build your first AI agent was last year. The second best time is today.
Start with the four agent prompts, the shared state schema, and the maximum iteration count. Get those three things right and the rest of the system — the Foundry setup, the monitoring, the code integration — is mechanical execution. The architecture is the hard part, and it lives in the prompts.