Cleo Documentation

Overview

Master your AI workspace with intelligent agents, seamless integrations, and powerful tools

Cleo centralizes intelligent agents, connected tools, and productive workflows in a single premium experience. Use the side menu to explore each area.

Quick Start

Launch in minutes and build momentum

  1. Create your account / sign in.
    Open the app and go to Settings → API & Keys.
  2. Configure your model keys.
    Enter at least one key (OpenAI, Anthropic, Groq, or OpenRouter). Cleo auto-detects availability and latency.
    # Example (.env.local)
    OPENAI_API_KEY=sk-...
    ANTHROPIC_API_KEY=...
    GROQ_API_KEY=...
    OPENROUTER_API_KEY=...
  3. Create your first agent.
    Go to Agents and click "New Agent". Choose the specialist role for focused tasks.

    Config JSON (UI equivalent)

    {
      "name": "Research Scout",
      "description": "Busca y resume información actual",
      "role": "specialist",
      "model": "gpt-4o-mini",
      "temperature": 0.4,
      "tools": ["web_search", "web_fetch"],
      "prompt": "Eres un agente que verifica, contrasta y sintetiza fuentes creíbles." ,
      "memoryEnabled": true,
      "memoryType": "short_term"
    }

    Creating via the API (POST)

curl -X POST https://api.tu-dominio.com/api/agents/create \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "Research Scout",
    "description": "Searches and summarizes current information",
    "role": "specialist",
    "model": "gpt-4o-mini",
    "tools": ["web_search", "web_fetch"],
    "prompt": "You are an agent that verifies and synthesizes reliable sources",
    "memoryEnabled": true,
    "memoryType": "short_term"
  }'
  4. Run a test prompt.
    Select the newly created agent in the conversation panel and ask: "Summarize current AI trends for edge computing in 5 bullet points."
    curl -X POST https://api.tu-dominio.com/api/agents/execute \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer <TOKEN>' \
      -d '{ "agentId": "<AGENT_ID>", "input": "Summarize current AI trends for edge computing in 5 bullet points" }'
  5. Create a mini chain (workflow).
    Add a second evaluator agent (evaluator role) to refine quality. The supervisor can delegate automatically.

    1. Specialist: collects and synthesizes raw information.

    2. Evaluator: verifies, removes bias, adds structure.

    3. Final Output: the supervisor integrates and delivers.

  6. Save and reuse.
    Export the agent configuration or clone it for new variants (low temperature for data, high for ideation).

Validation checklist

  • Valid model key
  • First agent created
  • Successful execution
  • Delegation configured
  • Workflow saved
  • Temperature tuning tested

Quick tips

  • Temperature 0.2–0.4: stable / factual responses. 0.7–0.9: ideation / creativity.
  • Include a clear objective in the prompt: it improves delegation.
  • Enable short-term memory for session context; avoid long-term memory unless you need persistence.
  • Limit tools: 2–3 per agent max for precision.

Agents

Design, specialize and orchestrate autonomous assistants

Agents in Cleo are modular, typed entities with a defined role, model, prompt, and an allowed tool set. The multi‑agent graph routes tasks between them via the supervisor.

Core Roles

Supervisor

Routes tasks, decides delegation, aggregates final response.

Specialist

Domain‑focused (research, code, analysis, planning, data).

Worker

Executes atomic sub‑tasks (fetch, transform, extract).

Evaluator

Reviews quality, bias, structure & can request rewrites.

Minimal Config
{
  "name": "Data Analyst",
  "role": "specialist",
  "model": "gpt-4o-mini",
  "temperature": 0.2,
  "tools": ["python_runner", "chart_builder"],
  "prompt": "Eres un analista de datos. Devuelve análisis concisos y verificables.",
  "memoryEnabled": false
}
Expanded Config
{
  "name": "Research Planner",
  "description": "Breaks down broad objectives into structured research tasks",
  "role": "specialist",
  "model": "claude-3-5-sonnet",
  "temperature": 0.4,
  "tools": ["web_search", "web_fetch", "notion_write"],
  "prompt": "Actúa como un planificador estratégico. Divide objetivos complejos en pasos claros priorizados.",
  "objective": "Transform vague goals into actionable research sequences",
  "customInstructions": "Always ask clarifying questions if scope is ambiguous.",
  "memoryEnabled": true,
  "memoryType": "short_term",
  "stopConditions": ["[FINAL]"],
  "toolSchemas": { "notion_write": { "properties": { "page": {"type": "string"} } } }
}
Lifecycle
  1. Registration: Agent definition stored; supervisor graph updated.
  2. Invocation: User or supervisor dispatches request.
  3. Reasoning / Tooling: Model generates intermediate thoughts & tool calls.
  4. Delegation (optional): Supervisor re-routes if another agent is better suited.
  5. Evaluation (optional): Evaluator reviews & refines.
  6. Finalization: Response aggregated and returned.
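
The optional phases make this lifecycle easy to express as a small pipeline. Below is a minimal TypeScript sketch; the Phase and RunContext shapes are illustrative assumptions, not Cleo's actual internals (registration is omitted since it happens once, outside the request path).

// Sketch (TypeScript): hypothetical types; real lifecycle hooks may differ
type Phase = 'invocation' | 'reasoning' | 'delegation' | 'evaluation' | 'finalization'

interface RunContext {
  agentId: string
  input: string
  needsDelegation?: boolean
  needsReview?: boolean
}

async function runLifecycle(
  ctx: RunContext,
  step: (phase: Phase, state: RunContext) => Promise<RunContext>,
): Promise<RunContext> {
  let state = await step('invocation', ctx)
  state = await step('reasoning', state)                              // thoughts + tool calls
  if (state.needsDelegation) state = await step('delegation', state)  // supervisor re-routes
  if (state.needsReview) state = await step('evaluation', state)      // evaluator refines
  return step('finalization', state)                                  // aggregate and return
}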

Specialization Patterns

  • Splitter: Breaks tasks → sub-prompts (planner)
  • Researcher: Multi-source synthesis + credibility scoring
  • Extractor: Structured JSON output from messy text
  • Synthesizer: Combines multi-agent outputs
  • Reviewer: Style, tone & factual QA

Delegation Heuristics

  • Detect domain keywords ("analyze", "plan", "buscar")
  • Check tool availability match
  • Fallback to generalist if confidence < threshold
  • Escalate to evaluator on low coherence
  • Stop chain if cost/time limit exceeded
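
Encoded as a scoring function, these heuristics look roughly like the sketch below; the agent shape, weights, threshold, and "generalist" fallback are all assumptions for illustration.

// Sketch (TypeScript): illustrative routing score combining keyword and tool matches
interface AgentProfile {
  id: string
  keywords: string[]
  tools: string[]
}

function pickAgent(query: string, requiredTools: string[], agents: AgentProfile[], threshold = 0.5): string {
  const q = query.toLowerCase()
  const scored = agents.map((agent) => {
    const keywordHits = agent.keywords.filter((k) => q.includes(k)).length
    const toolMatch = requiredTools.every((t) => agent.tools.includes(t)) ? 1 : 0
    // Tool availability is weighted above keyword overlap: a match without tools is useless
    return { agent, score: 0.6 * toolMatch + 0.4 * Math.min(keywordHits / 2, 1) }
  })
  scored.sort((a, b) => b.score - a.score)
  const best = scored[0]
  // Fall back to a generalist when confidence stays below threshold
  return best && best.score >= threshold ? best.agent.id : 'generalist'
}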

Best Practices

  • One primary objective per agent
  • 2–5 tools max; avoid overloading
  • Lower temperature for evaluators (0–0.2)
  • Use explicit stop tokens in multi-step outputs
  • Tag agents (e.g. research)

When to Create a New Agent?

  • Recurring task with distinct style or constraints
  • Needs unique tool combo (e.g. Notion + Web + Python)
  • Different temperature / risk tolerance required
  • Output format radically different (JSON vs narrative)
  • Separate audit / logging channel needed

Prompt Examples

High‑quality prompt patterns for reliable outputs

A curated set of production‑grade prompt archetypes covering system conditioning, structured extraction, reasoning, delegation, and evaluation. All outputs are designed for deterministic parsing and multi‑agent chaining.

Structured Research Synthesizer

Reliable multi-source synthesis with explicit output schema.

You are a senior research synthesis agent.
Goal: Produce a concise, unbiased summary.
Rules:
- Validate each claim with at least 2 sources.
- If contradiction exists, surface it explicitly.
- Output strict JSON with keys: summary, key_points[], risks[], sources[].
- Do NOT hallucinate.
Return only JSON.
System · Model: claude-sonnet / gpt-4o-mini · Great for evaluator + specialist pairing

Planner Decomposition

Break down vague objective into prioritized task plan.

You are a strategic planning agent.
Input: A vague objective.
Transform into: { objective, clarifying_questions[], tasks[ {id, title, rationale, dependencies[]} ], risks[], success_criteria[] }
Always ask questions first if scope ambiguous.
Return JSON only.
Role · Model: gpt-4o-mini / claude-haiku · Feed tasks into worker agents

Constrained Reasoning Steps

Encourages explicit internal reasoning with bounded length.

You will solve the problem using structured reasoning.
Format:
THOUGHT[1]: ...
THOUGHT[2]: ...
FINAL: <answer>
Keep each THOUGHT under 25 tokens. If uncertain, state assumptions.
Chain-of-Thought · Model: gpt-4o-mini (temperature 0.3) · Pairs well with evaluator agent

Robust Field Extraction

Turns messy text into typed structured record.

Extract fields from input text.
Output strictly JSON: { company: string|null, country: string|null, employees: number|null, funding_stage: enum[seed,series_a,series_b,growth]|null }
If missing set null. Never guess.
Return ONLY JSON.
Extraction · Model: claude-haiku / gpt-4o-mini · Use temperature 0–0.2

Supervisor Delegation Pattern

Supervisor decides whether to route to research or analysis agent.

You are SUPERVISOR.
Agents: research_agent (web_search, web_fetch), analysis_agent (python_runner, chart_builder)
User query: <INSERT>
Evaluate intent:
IF requires external info -> delegate:research_agent with objective
ELSE IF numeric / data transformation -> delegate:analysis_agent
ELSE respond directly.
Return JSON: { mode: direct|delegate, target_agent?: string, rationale: string, objective?: string }
Delegation · Model: gpt-4o / claude-sonnet · Use inside orchestration layer

Quality & Fact Reviewer

Evaluator that flags factual uncertainty and style issues.

You are an evaluator.
Input: draft_response + original_request.
Tasks:
1. Score factuality (0-1)
2. List potential hallucinations (if any)
3. Suggest style improvements
4. If rewrite needed, provide improved_response.
Return JSON: { factuality: number, hallucinations: string[], improvements: string[], improved_response?: string }
Evaluation · Model: claude-sonnet / gpt-4o-mini · Trigger on low confidence

Guidelines

  • Prefer explicit JSON schemas for extraction & handoff.
  • Bound reasoning tokens: reduces drift + cost.
  • Separate evaluation from generation for higher factuality.
  • Use lower temperature for system / evaluator prompts.
  • Never mix natural language + JSON in machine‑consumable outputs.

Model Strategy

Choose the optimal model per intent, cost and latency

Model selection in Cleo balances latency, determinism, reasoning depth and cost. Use fast tiers for routing & control loops, balanced for planning & synthesis, and escalate only when confidence or structure thresholds fail.

Tier | Models | Latency | Cost | Ideal For
Ultra Fast | gpt-4o-mini, claude-haiku, mistral-small | 50–250ms | Low | Routing, delegation heuristics, light classification
Balanced | gpt-4o, claude-sonnet, gemini-1.5-pro | 300–1200ms | Medium | General reasoning, planning, structured synthesis
Heavy Reasoning | claude-opus, oatmega-70b (open) | 1.5–4s | High | Complex multi-hop reasoning, deep evaluation passes
Specialized | embedding-small, vision-model, audio-large | Varies | Variable | Vector search, OCR, multimodal enrichment

Selection Heuristics

  • Extraction (strict JSON): Small deterministic (gpt-4o-mini) → escalate only on parse failure
  • Multi-hop reasoning: Start Balanced (gpt-4o / sonnet), escalate to opus only if reasoning depth score < threshold
  • Cost sensitive batch tasks: Use open smaller models + caching + batch API
  • Delegation routing: Ultra Fast tier for low latency control loop
  • Evaluation / Fact QA: Balanced model at low temperature (0–0.3) for consistency
  • Creative ideation: Increase temperature 0.7–0.9 on Balanced tier before using Heavy

Fallback Cascade Pattern

// Pseudocode: callModel, enhance, and enrichWithCritique are orchestration-layer helpers
async function smartInvoke(task) {
  // Tier 1: fast attempt
  const fast = await callModel('gpt-4o-mini', task, { timeout: 1800 })
  if (fast.parsed && fast.confidence >= 0.82) return fast

  // Tier 2: balanced refinement
  const balanced = await callModel('gpt-4o', enhance(fast, task), { temperature: 0.4 })
  if (balanced.confidence >= 0.9) return balanced

  // Tier 3: heavy reasoning escalation
  return callModel('claude-opus', enrichWithCritique(balanced, task), { maxTokens: 1200 })
}
  • Escalate only when parse fails or confidence < threshold.
  • Propagate critique context instead of raw hallucinated text.
  • Track token + cost metrics per tier for optimization.

Caching & Cost Control

  • Deduplicate identical structured extraction prompts via hash cache.
  • Use temperature 0–0.3 for parse‑critical tasks to reduce retries.
  • Persist intermediate balanced-tier outputs for heavy escalation reuse.
  • Track token usage per agent role to spot misalignment.
  • Batch low priority tasks during off-peak windows.
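
The deduplication idea in the first bullet can be as simple as hashing the normalized prompt; a minimal sketch, where the cache shape and normalization are assumptions:

// Sketch (TypeScript): hash cache for identical structured-extraction prompts
import { createHash } from 'node:crypto'

const extractionCache = new Map<string, string>()

async function cachedExtract(prompt: string, run: (p: string) => Promise<string>): Promise<string> {
  // Normalize before hashing so trivially different prompts share a cache key
  const key = createHash('sha256').update(prompt.trim().toLowerCase()).digest('hex')
  const hit = extractionCache.get(key)
  if (hit !== undefined) return hit  // identical prompt: skip the model call entirely
  const result = await run(prompt)
  extractionCache.set(key, result)
  return result
}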

Confidence Signals

  • Structural: JSON schema validation pass/fail.
  • Self-estimated certainty: Model returns numeric confidence (sanity bound).
  • Evaluator score: Independent pass for factuality & coherence.
  • Time budget: Abort escalation if nearing SLA limit.
  • Cost guardrail: Hard ceiling per user/session triggers degrade mode.
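
Combined into a single escalation decision, these signals might look like the following sketch; the weights and cutoffs are illustrative assumptions, not Cleo defaults.

// Sketch (TypeScript): aggregating confidence signals into an escalate/stop decision
interface Signals {
  schemaValid: boolean     // structural: JSON schema validation pass/fail
  selfConfidence: number   // model's self-estimate, sanity-bounded below
  evaluatorScore?: number  // independent factuality/coherence pass, if run
  msRemaining: number      // time budget left before the SLA limit
  costRemaining: number    // budget left before the cost guardrail
}

function shouldEscalate(s: Signals): boolean {
  // Abort escalation when nearing the SLA or cost ceiling, regardless of quality
  if (s.msRemaining < 500 || s.costRemaining <= 0) return false
  const self = Math.max(0, Math.min(1, s.selfConfidence))
  const score = (s.schemaValid ? 0.4 : 0) + 0.3 * self + 0.3 * (s.evaluatorScore ?? self)
  return score < 0.75
}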

Tool Safety

Approval workflows and secure execution model

Tool execution is governed by scoped permissions, real‑time policy checks, human approval escalation and immutable audit trails. Minimize blast radius by constraining agents to least privilege.

Permission Scopes

Scope | Description
read | Non-destructive retrieval (fetch, search, list)
write | Create or modify content (notion_write, file_save)
execute | Run code or transformations (python_runner, script_exec)
network | Outbound web requests (web_fetch, api_call)
sensitive | Access to PII / internal systems; requires explicit approval

Approval Workflow

Agent tool call → policy check
  | pass (auto) if scope ∈ allowed && risk < threshold
  | queue if scope=sensitive OR confidence < 0.75
Queue item → human approve/deny → audit log entry → continue/abort
  • Human queue stored with TTL; stale requests auto‑expire.
  • UI shows diff / requested arguments for clarity.
  • Denied calls propagate structured error to agent for graceful fallback.
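
The pass/queue branch above reduces to a pure function; a minimal sketch, assuming these scope names and the 0.75 confidence cutoff shown in the flow:

// Sketch (TypeScript): policy check mirroring the approval flow above
type Scope = 'read' | 'write' | 'execute' | 'network' | 'sensitive'

interface ToolCall {
  tool: string
  scope: Scope
  risk: number        // normalized 0–1 risk estimate
  confidence: number  // agent/router confidence in the call
}

function policyCheck(call: ToolCall, allowed: Scope[], riskThreshold = 0.5): 'auto' | 'queue' | 'deny' {
  if (!allowed.includes(call.scope)) return 'deny'                          // outside least privilege
  if (call.scope === 'sensitive' || call.confidence < 0.75) return 'queue'  // human approval path
  return call.risk < riskThreshold ? 'auto' : 'queue'
}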

Risk Classification

Level | Scopes | Examples
Low | read | Public info retrieval, static asset fetch
Medium | write or execute | Content mutation, code run with sandbox
High | network or sensitive | External exfiltration vectors, PII read
Critical | sensitive + execute | Potential lateral movement or data leakage

Rate Limits

Tool | Quota | Burst | Notes
web_fetch | 60 / 5m | Burst 5 | Exponential backoff after 429
python_runner | 20 / 10m | Serialized | Workspace CPU guard
notion_write | 40 / 10m | Burst 3 | Queue + retry jitter
email_send | 100 / 1h | Burst 10 | DMARC compliance + delay
vector_search | 200 / 5m | Parallel | Cache layer w/ LRU
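
The exponential backoff noted for web_fetch can be wrapped around any tool call; a minimal sketch, where the retry counts, delays, and the err.status shape are assumptions:

// Sketch (TypeScript): exponential backoff with jitter for 429 responses
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) throw err
      // Jitter prevents synchronized retries from re-triggering the limit
      const delay = Math.min(30_000, 2 ** attempt * 500) + Math.random() * 250
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
}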

Audit Log Schema

Event | Fields
Tool Call Start | timestamp, agentId, tool, argsHash, scope
Tool Call End | duration, success, errorType, tokensUsed
Escalation | previousTool, rationale, newScope
Approval Decision | approverId, decision, latency, justification
Anomaly Flag | patternType, severity, correlationId
  • All entries carry a correlationId for tracing cross-agent flows.
  • High severity anomalies trigger webhook + optional Slack alert.
  • Logs are immutable append-only; retention tiered (hot → warm → archive).
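
Typed out, the schema above maps naturally onto a discriminated union; the field names follow the table, while the event tags are illustrative assumptions.

// Sketch (TypeScript): audit entries as a discriminated union keyed by event type
interface AuditBase {
  correlationId: string  // present on every entry for cross-agent tracing
}

type AuditEntry = AuditBase & (
  | { event: 'tool_call_start'; timestamp: number; agentId: string; tool: string; argsHash: string; scope: string }
  | { event: 'tool_call_end'; duration: number; success: boolean; errorType?: string; tokensUsed: number }
  | { event: 'escalation'; previousTool: string; rationale: string; newScope: string }
  | { event: 'approval_decision'; approverId: string; decision: 'approve' | 'deny'; latency: number; justification: string }
  | { event: 'anomaly_flag'; patternType: string; severity: 'low' | 'medium' | 'high' }
)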

Best Practices

  • Create separate agents for high‑risk tools (isolate scope).
  • Hash + diff args for write operations to show intent clarity.
  • Enable the human approval queue only for sensitive + execute scopes, not routine writes.
  • Alert on unusual burst patterns (entropy of tool sequence).
  • Rotate API keys & enforce per‑agent tokens when possible.

Multi-Agent

Delegation, supervision and collaboration patterns

Cleo orchestrates agents through an adaptive supervisor that performs intent routing, delegation, arbitration and evaluation. The system emphasizes minimal escalation, deterministic structure, and explicit confidence signals.

Conceptual Flow

User Input
   ↓
[ Supervisor ] -- intent classification --> ( route )
   |--> direct answer (low complexity)
   |--> Specialist A (research)
   |--> Specialist B (analysis)
   |        ↓
   |     Worker agents (extraction, transform)
   |        ↓
   |<-- aggregated partial outputs
   |        ↓
   |--> Evaluator (quality / factuality / style)
   |        ↓ (approve / request revision)
Final Response --> User

Graph edges represent potential delegation; actual path chosen by heuristics (intent, tool availability, confidence, cost budget).

Orchestration Phases

  1. Intake: Normalize user input; detect language; strip PII if required.
  2. Intent Classification: Light fast model or rules to map to domain + complexity level.
  3. Routing Decision: Select direct response vs delegation; choose specialist set.
  4. Task Decomposition: Optional planner expansion into structured sub‑tasks.
  5. Execution: Specialists + workers perform reasoning + tool calls.
  6. Synthesis: Combine multi‑agent outputs (order, conflict resolution).
  7. Evaluation: Quality, factuality, coherence, style normalization.
  8. Finalization: Formatting, safe content filters, response packaging.

Routing Strategies

  • Keyword + Tool Match: Map intent tokens to agents whose tool set intersects required capability.
  • Confidence Threshold: If classifier confidence < τ → escalate to generalist or ask clarification.
  • Cost-Aware Routing: Prefer cheapest capable agent unless complexity score > threshold.
  • Adaptive Feedback: Evaluator signals misrouting; update routing weights incrementally.
  • Composite Voting: Sample 2 light models for classification; use consensus or escalate.
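
The composite-voting strategy is the simplest to sketch: run two light classifiers in parallel and escalate on disagreement. The model names and the classify signature below are assumptions.

// Sketch (TypeScript): consensus between two fast classifiers, escalate otherwise
async function classifyWithConsensus(
  query: string,
  classify: (model: string, q: string) => Promise<string>,
): Promise<{ intent: string; escalate: boolean }> {
  const [a, b] = await Promise.all([
    classify('gpt-4o-mini', query),
    classify('claude-haiku', query),
  ])
  // Agreement between two cheap models is treated as sufficient confidence
  if (a === b) return { intent: a, escalate: false }
  return { intent: a, escalate: true }  // disagreement: escalate to a stronger tier
}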

Arbitration Patterns

  • Evaluator Gate: Evaluator must approve if risk score > R or novelty flag set.
  • Dual Response Compare: Two specialists produce outputs → evaluator chooses or merges.
  • Progressive Refinement: Draft → critique → improved draft (limit N cycles).
  • Conflict Resolution: If contradictory claims → request sources or escalate to higher tier model.
  • Time Budget Abort: If cumulative execution time > SLA threshold → degrade gracefully.
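
Progressive refinement is the pattern most prone to cost spirals, so the cycle cap matters; a minimal sketch, assuming generate/critique callbacks and N = 2:

// Sketch (TypeScript): draft → critique → improved draft, bounded at maxCycles
async function progressiveRefine(
  input: string,
  generate: (prompt: string) => Promise<string>,
  critique: (draft: string) => Promise<{ ok: boolean; notes: string }>,
  maxCycles = 2,
): Promise<string> {
  let draft = await generate(input)
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const review = await critique(draft)
    if (review.ok) break  // evaluator gate passed; stop refining
    draft = await generate(`${input}\n\nRevise according to: ${review.notes}`)
  }
  return draft
}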

Supervision Loops

  • Light Supervision: Supervisor only delegates & aggregates; no evaluator unless uncertainty flagged.
  • Inline Evaluation: Evaluator reviews each intermediate artifact before next stage.
  • Periodic Audit: Every N tasks, sample outputs for deeper factual QA.
  • Escalation Ladder: Uncertain → evaluator → heavy model → human (optional).
  • Self-Critique Injection: Agent produces THOUGHT + CRITIQUE internally before FINAL output.

Optimization Tips

  • Cache classification & routing decisions by normalized query signature.
  • Short‑circuit evaluator when structural parse already passes high confidence.
  • Limit refinement loops (N ≤ 2) to prevent cost spirals.
  • Track per‑role token + latency metrics to prune underperforming agents.
  • Fallback to single‑agent mode in degraded / high load states.

Image Generation

Creative rendering with model selection & limits

Troubleshooting

Common issues, diagnostics and recovery steps

Use this guide to quickly isolate issues across routing, delegation, tooling, memory and cost. Patterns are designed for rapid triage with structured remediation.

Area | Symptom | Likely Cause | Action
Connection | Intermittent 504 / timeouts | Model provider latency spike | Fail over to secondary key; check status page
Agents | Delegation never triggers | Routing heuristic confidence too strict | Lower threshold or add domain keywords
Tools | Frequent 429 on web_fetch | Rate limit exceeded | Introduce jitter & batch queries
Memory | Early context truncation | Max tokens too low | Increase maxTokens or enable streaming summarizer
Costs | Sudden token usage spike | Escalation loop / evaluator recursion | Cap refinement cycles; add safeguard counter
Output | Invalid JSON parse | Temperature too high or missing schema framing | Add explicit JSON schema + reduce temperature

API Diagnostics

  • List agents: GET /api/agents
  • Recreate orchestrator: POST /api/agents/register?recreate=true
  • Execute agent: POST /api/agents/execute { agentId, input }
  • List tasks: GET /api/agent-tasks
  • Check metrics: GET /api/agents/metrics
  • Reset thread: POST /api/threads/reset { threadId }
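
A quick way to exercise several of these endpoints together, sketched with fetch; the base URL and token are placeholders, and the response shapes are assumptions:

// Sketch (TypeScript): basic health pass over the diagnostics endpoints
const BASE = 'https://api.tu-dominio.com'
const headers = { 'Content-Type': 'application/json', Authorization: 'Bearer <TOKEN>' }

async function diagnose(agentId: string) {
  const agents = await fetch(`${BASE}/api/agents`, { headers }).then((r) => r.json())
  const metrics = await fetch(`${BASE}/api/agents/metrics`, { headers }).then((r) => r.json())
  const run = await fetch(`${BASE}/api/agents/execute`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ agentId, input: 'healthcheck: reply OK' }),
  }).then((r) => r.json())
  return { agents, metrics, run }
}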

Error Taxonomy

Code | Meaning
routing.miss | Supervisor selected a suboptimal agent; adjust thresholds
delegation.timeout | Worker exceeded execution window; raise timeout or optimize task
tool.rate_limited | 429 from provider; apply backoff + queue
model.hallucination | Low factual confidence; trigger evaluator rewrite
parse.failure | Invalid JSON; enforce schema & retry at lower temperature
memory.overflow | Too many tokens; compress older context
cost.guardrail | Budget exceeded; degrade to fast tier + reduce depth

Error codes are structured to allow automated remediation triggers.
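
A remediation dispatcher can key directly off these codes; the action names below are illustrative, not a built-in Cleo mapping.

// Sketch (TypeScript): routing taxonomy codes to automated remediation actions
type Remediation =
  | 'lower_threshold' | 'raise_timeout' | 'backoff_queue' | 'evaluator_rewrite'
  | 'retry_low_temp' | 'compress_context' | 'degrade_tier'

const remediationMap: Record<string, Remediation> = {
  'routing.miss': 'lower_threshold',
  'delegation.timeout': 'raise_timeout',
  'tool.rate_limited': 'backoff_queue',
  'model.hallucination': 'evaluator_rewrite',
  'parse.failure': 'retry_low_temp',
  'memory.overflow': 'compress_context',
  'cost.guardrail': 'degrade_tier',
}

function remediate(code: string): Remediation | 'manual_triage' {
  // Unknown codes fall through to human triage instead of guessing
  return remediationMap[code] ?? 'manual_triage'
}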

Recovery Playbooks

Routing Broken

  1. Enable debug routing logs
  2. Lower confidence threshold 0.85 → 0.7
  3. Add explicit keyword mapping
  4. Rebuild orchestrator

Escalation Loop

  1. Set max refinement cycles = 2
  2. Add token guard
  3. Log evaluator triggers
  4. Fallback to balanced tier

High Latency

  1. Activate streaming
  2. Switch to fast tier
  3. Enable partial synthesis
  4. Batch similar requests

JSON Failures

  1. Wrap schema in fenced block
  2. Remove narrative instructions
  3. Lower temperature
  4. Add validator + retry
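
Steps 3 and 4 of this playbook combine into a validate-then-retry loop; a minimal sketch, assuming an ask(temperature) callback and plain JSON.parse as the validator:

// Sketch (TypeScript): retry JSON parsing at progressively lower temperature
async function parseWithRetry(
  ask: (temperature: number) => Promise<string>,
  maxAttempts = 3,
): Promise<unknown> {
  let temperature = 0.4
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(temperature)
    try {
      return JSON.parse(raw)  // swap in a schema validator for stricter checks
    } catch {
      temperature = Math.max(0, temperature - 0.2)  // playbook step 3: lower temperature
    }
  }
  throw new Error('parse.failure: invalid JSON after retries')
}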

Tool Flood

  1. Apply per-agent rate limiter
  2. Throttle high-frequency tool
  3. Introduce queue + jitter
  4. Alert on anomaly

Memory Drift

  1. Shorten conversation window
  2. Enable summarizer
  3. Disable long_term memory temporarily
  4. Reset thread context

Preventative Monitoring

  • Alert on escalation chain length > 2.
  • Track JSON parse failure rate; auto‑lower temperature if spike detected.
  • Log per‑tool p95 latency & throttle anomalies.
  • Capture evaluator disagreement rate as drift signal.
  • Budget guard: emit event at 80% daily cost threshold.

Frequently Asked Questions

Answers to recurring questions
