The age of chatbot-only AI is over. AI agents are already a reality — intelligent systems that can reason, act, use tools, and collaborate to execute complex multi-step tasks autonomously. This guide walks through the complete architecture and implementation of a production-ready 4-agent research pipeline on Azure AI Foundry, from design decisions to live deployment.

By the end of this guide you will have a fully working pipeline that accepts a research topic, searches the web via Bing, writes a structured report, scores it on four quality dimensions, and retries automatically if the quality falls short — all running in Microsoft Foundry with live traces in Application Insights.

01 What We Are Building

The pipeline is a 4-agent system where each agent has a single, well-defined role:

Orchestrator
Routes tasks, reads quality scores, decides when to retry or deliver the final output. The only agent the user talks to directly.
Researcher
Searches the web with Bing, collects facts from sources, returns structured JSON findings with confidence levels.
Writer
Transforms research findings into a formatted, grounded report, bullets, or memo — using only facts from the researcher's output.
Critic
Evaluates the draft on four dimensions (grounding, coverage, format, audience fit), returns a quality score and retry decision.

The full output includes: a grounded report with inline citations, quality scores across all four dimensions, sources used, identified gaps, number of iterations taken, and live traces in Application Insights showing every agent call, token usage, and latency.

02 What Is an AI Agent?

AI agents are smart applications that use language models to understand what you need and then take action to help you. They can answer questions, make decisions, and complete tasks automatically. What makes agents different from chatbots is that they remember your conversation and can actually do things — not just respond to them.

An agent has three core components:

ComponentWhat it provides
Model A deployed generative AI model (GPT-4o, Claude Sonnet, etc.) that enables the agent to reason and generate natural language responses
Knowledge Data sources that ground prompts with context — Bing search results, Azure AI Search indices, your own documents
Tools Programmatic functions that let the agent take action — built-in tools like Bing and AI Search, a code interpreter, or custom functions via Azure Functions

03 Security Considerations

Because agents can access sensitive data, make decisions, and act independently, security must be designed in from the start — not added afterward. The five risks that matter most in production multi-agent systems:

  1. Prompt Injection and Manipulation Malicious content in web search results or documents can attempt to override agent instructions mid-pipeline.
  2. Unauthorized Access and Privilege Escalation Agents that have more permissions than their role requires become a broad attack surface.
  3. Over-Reliance on Autonomous Actions High-stakes decisions made without human oversight can cause irreversible damage.
  4. Data Poisoning Corrupted training data or tampered knowledge sources degrade output quality silently over time.
  5. Inadequate Auditability Without comprehensive logging, you cannot reconstruct what happened when something goes wrong.
Mitigation checklist Enforce RBAC and least-privilege permissions. Validate and filter all inputs. Gate sensitive operations behind human-in-the-loop approvals. Log every agent action with full traceability. Audit third-party integrations regularly. Retrain and validate models continuously to detect data drift.

04 Why Azure AI Foundry Agent Service

You can build agents using a managed platform like Microsoft Foundry, or use frameworks such as LangGraph, CrewAI, or Google ADK — the right choice depends on how much control and orchestration you need.

Microsoft Foundry Agent Service is a fully managed service that provides both a visual agent development experience in the portal and a code-first development experience using the Foundry SDK. It handles the underlying compute, storage, and security so you can focus on agent design.

The key question is when to stay in the no-code portal versus moving to custom code:

Use the Foundry Portal (no code)Write custom code
Single agent use casesMulti-agent with complex coordination
Simple linear workflowsConditional logic and dynamic routing
Basic tool usage (Bing, AI Search)Custom APIs or unsupported integrations
Quick prototyping and demosProduction-grade control and flexibility
Drag-and-drop multi-agent wiringAdvanced orchestration (loops, retries, critic flows)
Built-in testing via playgroundCustom evaluation, logging, monitoring
Standard deployments (Teams, M365)Custom UI (web apps, dashboards)

This guide uses the portal for setup and agent creation, then shows you the code path for programmatic access and web app deployment.

05 Pipeline Architecture

The orchestrator coordinates the overall flow — it decomposes the task, manages state, calls sub-agents in sequence, evaluates the output quality, and decides whether to retry or deliver. Here is the full flow:

Task Arrives "Research topic X" Orchestrator Routes · Manages state Controls retry loop Researcher Bing search · JSON facts Writer Report · Bullets · Memo Critic Scores 4 dimensions Retry if score < 0.75 Deliver if ≥ 0.75
Fig. 1 — The 4-agent pipeline. The orchestrator manages all state and the retry loop. The user only ever interacts with the orchestrator.

06 Five Design Decisions That Shape Everything

When designing a multi-agent system, a handful of decisions shape everything — from how agents collaborate to how the system handles failures. Get these right before writing a single line of code.

1. Orchestration Strategy — Static vs. Dynamic Routing

Should the orchestrator use a fixed sequence (researcher always runs first, then writer) or dynamic routing (the LLM decides what to run next)? Fixed is more predictable and easier to debug. Dynamic is more powerful but harder to govern.

Default to static routing Static orchestration is not a limitation — it is a deliberate governance decision. It covers 80% of use cases and makes the system far easier to debug, trace, and audit. Only add dynamic routing when you have a clear need and the governance model to support it.

2. Shared State Format — Your Most Important Design Decision

What does the researcher hand off to the writer? A raw dump of search results? A structured JSON summary? A markdown outline? This interface contract deserves the same design attention as an API schema. Define it explicitly. Here is the state object used in this pipeline:

{
  "request": {
    "topic": "Impact of AI agents on enterprise productivity",
    "audience": "C-suite executives, non-technical",
    "format": "report",
    "word_limit": 600,
    "depth": "standard"
  },
  "search_queries": ["queries run so far"],
  "findings": [
    {
      "fact": "stated fact clearly",
      "source_url": "https://...",
      "source_title": "Title of the source",
      "confidence": "high"
    }
  ],
  "gaps": ["topic areas where reliable information could not be found"],
  "draft": "writer output",
  "quality_score": {
    "grounding": 0.0,
    "coverage": 0.0,
    "format": 0.0,
    "audience_fit": 0.0,
    "overall": 0.0,
    "feedback": "...",
    "retry_needed": true
  },
  "iteration": 1,
  "max_iterations": 3,
  "final_output": null
}

3. Tool Boundaries — One Role, Minimum Tools

Each agent should only have tools relevant to its role. More tools means more decision overhead for the model and more chances it picks the wrong one. The architect's rule of thumb: keep total tools under 6 per agent.

AgentEssential toolsOptional (add only when needed)
Researcher Web search (Bing), Fetch URL, Save notes to state DB query, Knowledge graph, Email/Calendar via Graph API
Writer None — pure LLM generation from structured input Formatting templates, document output
Critic None — pure LLM scoring from state
Orchestrator Call sub-agents (Researcher, Writer, Critic)

4. Failure Handling — Quality Gates Between Agents

What happens if the researcher finds nothing useful? Does the writer get empty state and hallucinate, or does the orchestrator detect low-quality research and retry? You need explicit quality gates between every agent handoff. This is where most production incidents originate — invest in it.

5. Observability From Day One

Log every tool call, every inter-agent message, and every state transition. Application Insights combined with Foundry's trace logging makes this straightforward. The trace log is your only way to reconstruct what happened when the pipeline produces a bad result — and it will.

07 The Orchestrator in Detail

The orchestrator has four distinct responsibilities. Being precise about each one before you write any code is the architect's most important contribution to the system.

  1. Task Decomposition When a complex request arrives, the orchestrator breaks it into sub-tasks and decides which agent handles each. This is a single LLM call with a system prompt like "you are a task router, here are your available agents" — no magic, just prompt engineering.
  2. Sequence and Routing The orchestrator decides the order agents run in. Static: always researcher → critic → writer. Dynamic: the LLM decides at runtime. Default to static unless you have a clear reason not to.
  3. State Management The orchestrator owns the shared state object — the running memory of the pipeline. It decides what each agent sees. The researcher writes findings into state. The orchestrator then passes only the relevant parts to the writer. Treat this like an API contract.
  4. Quality Gating and Loop Control After each agent runs, the orchestrator evaluates: is the output good enough to proceed, or should it retry? This is where bad research is caught before it flows silently into a polished-looking but wrong report.
The orchestrator is a single point of failure and cost Every agent call routes through it. If the orchestrator's prompt is poorly designed, every downstream agent suffers. If it loops too aggressively, token costs spike. If it has no exit condition, you get infinite loops. Always design it with a maximum iteration count and explicit exit criteria first.

08 The Four System Prompts

Researcher Agent

researcher-agent — system prompt
You are a diligent research assistant.
When given a topic, search the web thoroughly to find high-quality,
relevant sources. For each distinct fact you find, record:
- The fact itself (clear and specific)
- The source URL
- The source title
- Your confidence level: high, medium, or low

After researching, return your findings as structured JSON at the
end of your response in this exact format:
{
  "findings": [
    {
      "fact": "state the fact clearly",
      "source_url": "https://...",
      "source_title": "Title of the source",
      "confidence": "high"
    }
  ],
  "gaps": [
    "describe any topic area where you could not find reliable information"
  ]
}

Rules:
- Only record facts you found in actual sources
- Do not invent or assume any statistics or claims
- Mark anything uncertain as confidence: "low"
- Always include the gaps array even if empty

Writer Agent

writer-agent — system prompt
You are a professional research writer.
You will receive:
1. A topic
2. A target audience
3. A requested output format (report, bullets, or memo)
4. A word limit
5. Research findings in JSON format (facts with sources)
6. A list of gaps (things that could not be found)

Your job is to write a polished output using ONLY the facts provided.

FORMAT RULES:
- If format is "report": use ## headers, bullet points under each
  section, and a numbered Sources list at the end
- If format is "bullets": write 5-8 concise bullet points, lead
  with the most important finding, no headers
- If format is "memo": use To/From/Subject/Date header,
  end with exactly 3 recommended actions

GROUNDING RULES (critical):
- Only use facts present in the research findings provided
- If a finding has confidence "low", add "(unverified)" after it
- Never invent statistics, quotes, or claims not in the findings
- If findings are insufficient, say so explicitly — do not fill
  gaps with assumptions

CITATION RULES:
- Add [1] [2] etc after each key claim in the text
- List all sources numerically at the end

GAPS:
- If there are gaps, add a short "Limitations" section at the end
  listing what information was not available

Critic Agent

critic-agent — system prompt
You are a strict quality reviewer for AI-generated research reports.
You will receive:
1. The original research request (topic, audience, format, word limit)
2. The research findings that were available
3. The draft report written by the writer

Score the draft on each dimension from 0.0 to 1.0:

GROUNDING (0-1)
Does every factual claim trace back to a provided finding?
Penalise any claim not verifiable in the findings.

COVERAGE (0-1)
Does the draft address the full scope of the topic?
Note any significant areas that were missed.

FORMAT (0-1)
Does the output exactly match the requested format, length,
and structural requirements?

AUDIENCE FIT (0-1)
Is the tone, vocabulary, and depth right for the stated audience?

IMPORTANT: Return ONLY valid JSON. No markdown. No explanation
before or after. Just the JSON object:
{
  "grounding": 0.0,
  "coverage": 0.0,
  "format": 0.0,
  "audience_fit": 0.0,
  "overall": 0.0,
  "feedback": "Specific, actionable notes telling the other agents what to fix",
  "retry_needed": true
}

Set retry_needed to false only if overall score is 0.75 or above.

Each scoring dimension catches a different class of failure. A grounding score below 0.6 should trigger a retry regardless of the overall score — this is your hallucination firewall. Low coverage tells the orchestrator to send the researcher back with a more targeted query, not to rewrite the draft. Low audience fit means the writing needs to change, not the research.

Orchestrator Agent

orchestrator-agent — system prompt
You are the orchestrator of a research pipeline.
You manage three specialist agents: Researcher, Writer, and Critic.

When the user gives you a research topic, follow these steps:

STEP 1 — RESEARCH
Call the researcher-agent with:
- The topic
- The requested depth (brief/standard/deep)

STEP 2 — WRITE
Pass the researcher's findings to the writer-agent with:
- The topic, audience, format, word limit
- The full findings JSON from the researcher

STEP 3 — EVALUATE
Pass the draft to the critic-agent with:
- The original request details
- The findings that were available
- The draft from the writer

STEP 4 — DECIDE
Read the critic's JSON score:
- If overall >= 0.75 OR retry_needed is false: deliver the draft
- If overall < 0.75 AND you have run fewer than 3 iterations:
  - If coverage < 0.6: call researcher-agent again with the critic's feedback
  - Otherwise: call writer-agent again with the critic's feedback
  - Then go back to Step 3
- If you have already retried 3 times: deliver the best draft
  and note the quality score

STEP 5 — DELIVER
Return the final report to the user followed by:
- Quality score (overall)
- Number of iterations taken
- Any gaps identified by the researcher

Always be transparent — if quality is below 0.75 after retries,
tell the user and explain what information was missing.

09 Model Selection Per Agent

The researcher and writer do not need to use the same model. The researcher benefits from a model with strong tool-use and reasoning — GPT-4o or Claude Sonnet. The writer is doing pure generation from structured input, so a faster, cheaper model (GPT-4o-mini or Claude Haiku) often produces equally good results at a fraction of the cost.

AgentRecommended modelReason
OrchestratorGPT-4oNeeds strong reasoning for routing and quality gate decisions
ResearcherGPT-4oTool-use accuracy and source evaluation require the strongest model
WriterGPT-4o-miniPure generation from structured input — 10× cheaper, same quality
CriticGPT-4o-miniStructured JSON scoring from provided text — cost-effective

Foundry's model router can handle per-agent model selection automatically, routing to the cheaper model when the task is generation-only. For simplicity in this walkthrough, we deploy one model shared across all four agents.

10 Step-by-Step Setup on Azure AI Foundry

Create agents in this exact order: Researcher → Writer → Critic → Orchestrator. The orchestrator must be created last because it references the other three by ID.

Phase 0 — Azure Environment

  1. Create a resource group Go to portal.azure.com → Resource groups → Create. Name: rg-research-pipeline. Region: East US 2 (recommended — most Foundry features available here).
  2. Create an AI Hub and Project Go to ai.azure.com → + Create new → AI hub resource. Hub name: hub-research-pipeline. Project name: project-research-pipeline. Resource group: rg-research-pipeline.
Region matters Not all Azure regions support all models and Foundry features. If deployment errors occur, switch to East US 2.

Phase 1 — Deploy a Model

In your project → left sidebar → Models + endpoints → + Deploy model → Deploy base model. Search for gpt-4o. Set token per minute rate limit to 50K (enough for four agents running in sequence). All four agents share this single deployment.

Phase 2 — Create and Connect Bing Search

Only the researcher agent needs Bing. Go to portal.azure.com → Create a resource → search Bing Search → select Grounding with Bing Search. Use the S1 pricing tier — the free tier does not support grounding.

Cost note S1 Bing costs approximately $3 per 1,000 queries. In practice a single pipeline run typically triggers 5–15 Bing calls depending on depth setting.

After creating the resource, go to ai.azure.com → your project → Management Center → Connected resources → New Connection → Grounding with Bing Search. Select your bing-search-pipeline resource and click Add connection.

Phase 3 — Application Insights (Monitoring)

Go to portal.azure.com → Create a resource → Application Insights. Name: insights-research-pipeline. Region: East US 2. Resource mode: Workspace-based. Copy the Connection String from the Overview page for later use.

Connect it the same way as Bing: Foundry portal → Management Center → Connected resources → New Connection → Application Insights.

This gives you live traces of every agent call, token usage, latency, and errors in a single dashboard.

Phase 4 — Content Filters

Content filters protect the pipeline from harmful inputs and outputs. Configure once and it applies to all agents using gpt-4o.

Go to ai.azure.com → Guardrails controls → Create a custom content filter. Name: filter-research-pipeline. On both Input and Output filter tabs set: Hate: Medium, Violence: Medium, Sexual: High block, Self-harm: High block.

Then attach it to your model: Models + endpoints → gpt-4o → Edit → Content filter → select filter-research-pipeline → Submit changes.

Phase 5 — Create the Four Agents

For each agent: ai.azure.com → your project → Agents → + New agent. Set the agent name, select gpt-4o as the deployment, and paste the corresponding system prompt from Section 08.

For the Researcher agent only: under Knowledge → + Add → Grounding with Bing Search → select bingsearchpipeline. Test it in the Playground by typing: "Research the topic AI agents in enterprise software 2025" — you should see it call BingSearch and return structured JSON findings.

For the Orchestrator agent only: under Connected Agents, add all three other agents (researcher-agent, writer-agent, critic-agent) with their activation instructions. The orchestrator is the agent the user talks to — it must be created last.

Phase 6 — End-to-End Test in the Playground

Open the orchestrator agent in the Playground and send this test message:

Research topic: Impact of AI agents on enterprise productivity in 2025
Audience: C-suite executives, non-technical
Format: report
Depth: brief
Word limit: 400

Click Thread logs — you should see: orchestrator calls researcher → researcher calls Bing → returns findings JSON → orchestrator calls writer → returns draft → orchestrator calls critic → returns quality scores → orchestrator delivers final report. If all four agents fire in sequence, the pipeline is working.

11 Calling the Pipeline from Code

Once your agents are working in the Playground, you can call the orchestrator programmatically using the Azure AI Projects SDK:

.env
AZURE_AI_PROJECT_ENDPOINT=https://hub-research-pipeline.services.ai.azure.com/api/projects/project-research-pipeline
ORCHESTRATOR_AGENT_ID=asst_xxxxxxxxxxxx
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
Terminal
pip install azure-ai-projects azure-identity
az login
run_pipeline.py
import os
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import MessageTextContent
from azure.identity import DefaultAzureCredential

client = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential()
)

# Create a thread and send a message to the orchestrator
thread = client.agents.create_thread()
client.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="""
    Research topic: Impact of AI agents on enterprise productivity in 2025
    Audience: C-suite executives, non-technical
    Format: report
    Depth: standard
    Word limit: 600
    """
)

run = client.agents.create_and_process_run(
    thread_id=thread.id,
    agent_id=os.environ["ORCHESTRATOR_AGENT_ID"]
)

# Get the response
messages = client.agents.list_messages(thread_id=thread.id)
last = next(m for m in messages if m.role == "assistant")
print(last.content[0].text.value)

You can also export ready-to-run code directly from the Foundry portal: Agents → your agent → </> View code (top toolbar). This generates a fully working script in Python, C#, or JavaScript that connects to your project, creates a thread, sends a message, and prints the response.

Protect your credentials The exported code will contain your project endpoint. Never hardcode your API key or endpoint directly in source code. Always use environment variables or Azure Key Vault.

12 Monitoring in Application Insights

After each pipeline run, go to your Foundry project → Monitoring (left sidebar). You should see a summary showing total token count, prompt token count, and completion token count for the run. For deeper dashboards — per-agent latency, error rates, retry frequency — open the insights-research-pipeline resource directly in the Azure portal.

The combination of Foundry trace logging and Application Insights gives you full observability: which agent was called, what it received, what it returned, how long it took, and how many tokens it consumed. This is not optional infrastructure — it is your only way to diagnose failures, control costs, and build confidence in the system over time.

The best time to build your first AI agent was last year. The second best time is today.

Start with the four agent prompts, the shared state schema, and the maximum iteration count. Get those three things right and the rest of the system — the Foundry setup, the monitoring, the code integration — is mechanical execution. The architecture is the hard part, and it lives in the prompts.