Cleo Documentation

Overview

Master your AI workspace with intelligent agents, seamless integrations, and powerful tools

Cleo centralizes intelligent agents, connected tools, and productive workflows in a single premium experience. Use the side menu to explore each area.

Quick Start

Launch in minutes and build momentum

  1. Create your account / sign in.
    Open the app and go to Settings → API & Keys.
  2. Configure your model keys.
    Enter at least one key (OpenAI, Anthropic, Groq, or OpenRouter). Cleo auto-detects availability and latency.
    # Example (.env.local)
    OPENAI_API_KEY=sk-...
    ANTHROPIC_API_KEY=...
    GROQ_API_KEY=...
    OPENROUTER_API_KEY=...
  3. Create your first agent.
    Go to Agents and click "New Agent". Choose the specialist role for focused tasks.

    Config JSON (UI equivalent)

    {
      "name": "Research Scout",
      "description": "Busca y resume información actual",
      "role": "specialist",
      "model": "gpt-4o-mini",
      "temperature": 0.4,
      "tools": ["web_search", "web_fetch"],
      "prompt": "Eres un agente que verifica, contrasta y sintetiza fuentes creíbles." ,
      "memoryEnabled": true,
      "memoryType": "short_term"
    }

    Creating via the API (POST)

curl -X POST https://api.tu-dominio.com/api/agents/create \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "Research Scout",
    "description": "Searches and summarizes current information",
    "role": "specialist",
    "model": "gpt-4o-mini",
    "tools": ["web_search", "web_fetch"],
    "prompt": "You are an agent that verifies and synthesizes reliable sources",
    "memoryEnabled": true,
    "memoryType": "short_term"
  }'
  4. Run a test prompt.
    Select the newly created agent in the conversation panel and ask: "Summarize current AI trends for edge computing in 5 bullet points."
    curl -X POST https://api.tu-dominio.com/api/agents/execute \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer <TOKEN>' \
      -d '{ "agentId": "<AGENT_ID>", "input": "Summarize current AI trends for edge computing in 5 bullet points" }'
  5. Create a mini chain (workflow).
    Add a second evaluator agent (evaluator role) to refine quality. The supervisor can delegate automatically.

    1. Specialist: collects and synthesizes raw information.

    2. Evaluator: verifies, removes bias, adds structure.

    3. Final Output: the supervisor integrates and delivers.

  6. Save and reuse.
    Export the agent configuration or clone it for new variants (low temperature for data, high for ideation).

Validation checklist

  • Valid model key
  • First agent created
  • Successful execution
  • Delegation configured
  • Workflow saved
  • Temperature tuning tested

Quick tips

  • Temperature 0.2–0.4: stable / factual responses. 0.7–0.9: ideation / creativity.
  • Include a clear objective in the prompt: it improves delegation.
  • Enable short-term memory for session context; avoid long-term memory unless you need persistence.
  • Limit tools: 2–3 per agent max for precision.

Agents

Design, specialize and orchestrate autonomous assistants

Agents in Cleo are modular, typed entities with a defined role, model, prompt, and an allowed tool set. The multi‑agent graph routes tasks between them via the supervisor.

Core Roles

Supervisor

Routes tasks, decides delegation, aggregates final response.

Specialist

Domain‑focused (research, code, analysis, planning, data).

Worker

Executes atomic sub‑tasks (fetch, transform, extract).

Evaluator

Reviews quality, bias, structure & can request rewrites.

Minimal Config
{
  "name": "Data Analyst",
  "role": "specialist",
  "model": "gpt-4o-mini",
  "temperature": 0.2,
  "tools": ["python_runner", "chart_builder"],
  "prompt": "Eres un analista de datos. Devuelve análisis concisos y verificables.",
  "memoryEnabled": false
}
Expanded Config
{
  "name": "Research Planner",
  "description": "Breaks down broad objectives into structured research tasks",
  "role": "specialist",
  "model": "claude-3-5-sonnet",
  "temperature": 0.4,
  "tools": ["web_search", "web_fetch", "notion_write"],
  "prompt": "Actúa como un planificador estratégico. Divide objetivos complejos en pasos claros priorizados.",
  "objective": "Transform vague goals into actionable research sequences",
  "customInstructions": "Always ask clarifying questions if scope is ambiguous.",
  "memoryEnabled": true,
  "memoryType": "short_term",
  "stopConditions": ["[FINAL]"],
  "toolSchemas": { "notion_write": { "properties": { "page": {"type": "string"} } } }
}
Lifecycle
  1. Registration: Agent definition stored; supervisor graph updated.
  2. Invocation: User or supervisor dispatches request.
  3. Reasoning / Tooling: Model generates intermediate thoughts & tool calls.
  4. Delegation (optional): Supervisor re-routes if another agent is better suited.
  5. Evaluation (optional): Evaluator reviews & refines.
  6. Finalization: Response aggregated and returned.
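
The optional phases make this lifecycle easy to express as a small pipeline. Below is a minimal TypeScript sketch; the Phase and RunContext shapes are illustrative assumptions, not Cleo's actual internals (registration is omitted since it happens once, outside the request path).

// Sketch (TypeScript): hypothetical types; real lifecycle hooks may differ
type Phase = 'invocation' | 'reasoning' | 'delegation' | 'evaluation' | 'finalization'

interface RunContext {
  agentId: string
  input: string
  needsDelegation?: boolean
  needsReview?: boolean
}

async function runLifecycle(
  ctx: RunContext,
  step: (phase: Phase, state: RunContext) => Promise<RunContext>,
): Promise<RunContext> {
  let state = await step('invocation', ctx)
  state = await step('reasoning', state)                              // thoughts + tool calls
  if (state.needsDelegation) state = await step('delegation', state)  // supervisor re-routes
  if (state.needsReview) state = await step('evaluation', state)      // evaluator refines
  return step('finalization', state)                                  // aggregate and return
}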

Specialization Patterns

  • Splitter: Breaks tasks → sub-prompts (planner)
  • Researcher: Multi-source synthesis + credibility scoring
  • Extractor: Structured JSON output from messy text
  • Synthesizer: Combines multi-agent outputs
  • Reviewer: Style, tone & factual QA

Delegation Heuristics

  • Detect domain keywords ("analyze", "plan", "buscar")
  • Check tool availability match
  • Fallback to generalist if confidence < threshold
  • Escalate to evaluator on low coherence
  • Stop chain if cost/time limit exceeded
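
Encoded as a scoring function, these heuristics look roughly like the sketch below; the agent shape, weights, threshold, and "generalist" fallback are all assumptions for illustration.

// Sketch (TypeScript): illustrative routing score combining keyword and tool matches
interface AgentProfile {
  id: string
  keywords: string[]
  tools: string[]
}

function pickAgent(query: string, requiredTools: string[], agents: AgentProfile[], threshold = 0.5): string {
  const q = query.toLowerCase()
  const scored = agents.map((agent) => {
    const keywordHits = agent.keywords.filter((k) => q.includes(k)).length
    const toolMatch = requiredTools.every((t) => agent.tools.includes(t)) ? 1 : 0
    // Tool availability is weighted above keyword overlap: a match without tools is useless
    return { agent, score: 0.6 * toolMatch + 0.4 * Math.min(keywordHits / 2, 1) }
  })
  scored.sort((a, b) => b.score - a.score)
  const best = scored[0]
  // Fall back to a generalist when confidence stays below threshold
  return best && best.score >= threshold ? best.agent.id : 'generalist'
}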

Best Practices

  • One primary objective per agent
  • 2–5 tools max; avoid overloading
  • Lower temperature for evaluators (0–0.2)
  • Use explicit stop tokens in multi-step outputs
  • Tag agents (e.g. research)

When to Create a New Agent?

  • Recurring task with distinct style or constraints
  • Needs unique tool combo (e.g. Notion + Web + Python)
  • Different temperature / risk tolerance required
  • Output format radically different (JSON vs narrative)
  • Separate audit / logging channel needed

Prompt Examples

High‑quality prompt patterns for reliable outputs

A curated set of production‑grade prompt archetypes covering system conditioning, structured extraction, reasoning, delegation, and evaluation. All outputs are designed for deterministic parsing and multi‑agent chaining.

Structured Research Synthesizer

Reliable multi-source synthesis with explicit output schema.

You are a senior research synthesis agent.
Goal: Produce a concise, unbiased summary.
Rules:
- Validate each claim with at least 2 sources.
- If contradiction exists, surface it explicitly.
- Output strict JSON with keys: summary, key_points[], risks[], sources[].
- Do NOT hallucinate.
Return only JSON.
System · Model: claude-sonnet / gpt-4o-mini · Great for evaluator + specialist pairing

Planner Decomposition

Break down vague objective into prioritized task plan.

You are a strategic planning agent.
Input: A vague objective.
Transform into: { objective, clarifying_questions[], tasks[ {id, title, rationale, dependencies[]} ], risks[], success_criteria[] }
Always ask questions first if scope ambiguous.
Return JSON only.
Role · Model: gpt-4o-mini / claude-haiku · Feed tasks into worker agents

Constrained Reasoning Steps

Encourages explicit internal reasoning with bounded length.

You will solve the problem using structured reasoning.
Format:
THOUGHT[1]: ...
THOUGHT[2]: ...
FINAL: <answer>
Keep each THOUGHT under 25 tokens. If uncertain, state assumptions.
Chain-of-Thought · Model: gpt-4o-mini (temperature 0.3) · Pairs well with evaluator agent

Robust Field Extraction

Turns messy text into typed structured record.

Extract fields from input text.
Output strictly JSON: { company: string|null, country: string|null, employees: number|null, funding_stage: enum[seed,series_a,series_b,growth]|null }
If missing set null. Never guess.
Return ONLY JSON.
Extraction · Model: claude-haiku / gpt-4o-mini · Use temperature 0–0.2

Supervisor Delegation Pattern

Supervisor decides whether to route to research or analysis agent.

You are SUPERVISOR.
Agents: research_agent (web_search, web_fetch), analysis_agent (python_runner, chart_builder)
User query: <INSERT>
Evaluate intent:
IF requires external info -> delegate:research_agent with objective
ELSE IF numeric / data transformation -> delegate:analysis_agent
ELSE respond directly.
Return JSON: { mode: direct|delegate, target_agent?: string, rationale: string, objective?: string }
Delegation · Model: gpt-4o / claude-sonnet · Use inside orchestration layer

Quality & Fact Reviewer

Evaluator that flags factual uncertainty and style issues.

You are an evaluator.
Input: draft_response + original_request.
Tasks:
1. Score factuality (0-1)
2. List potential hallucinations (if any)
3. Suggest style improvements
4. If rewrite needed, provide improved_response.
Return JSON: { factuality: number, hallucinations: string[], improvements: string[], improved_response?: string }
Evaluation · Model: claude-sonnet / gpt-4o-mini · Trigger on low confidence

Guidelines

  • Prefer explicit JSON schemas for extraction & handoff.
  • Bound reasoning tokens: reduces drift + cost.
  • Separate evaluation from generation for higher factuality.
  • Use lower temperature for system / evaluator prompts.
  • Never mix natural language + JSON in machine‑consumable outputs.

Model Strategy

Choose the optimal model per intent, cost and latency

Model selection in Cleo balances latency, determinism, reasoning depth and cost. Use fast tiers for routing & control loops, balanced for planning & synthesis, and escalate only when confidence or structure thresholds fail.

Tier | Models | Latency | Cost | Ideal For
Ultra Fast | gpt-4o-mini, claude-haiku, mistral-small | 50–250ms | Low | Routing, delegation heuristics, light classification
Balanced | gpt-4o, claude-sonnet, gemini-1.5-pro | 300–1200ms | Medium | General reasoning, planning, structured synthesis
Heavy Reasoning | claude-opus, oatmega-70b (open) | 1.5–4s | High | Complex multi-hop reasoning, deep evaluation passes
Specialized | embedding-small, vision-model, audio-large | Varies | Variable | Vector search, OCR, multimodal enrichment

Selection Heuristics

  • Extraction (strict JSON): Small deterministic (gpt-4o-mini) → escalate only on parse failure
  • Multi-hop reasoning: Start Balanced (gpt-4o / sonnet), escalate to opus only if reasoning depth score < threshold
  • Cost sensitive batch tasks: Use open smaller models + caching + batch API
  • Delegation routing: Ultra Fast tier for low latency control loop
  • Evaluation / Fact QA: Balanced model at low temperature (0–0.3) for consistency
  • Creative ideation: Increase temperature 0.7–0.9 on Balanced tier before using Heavy

Fallback Cascade Pattern

// Pseudocode: callModel, enhance, and enrichWithCritique are orchestration-layer helpers
async function smartInvoke(task) {
  // Tier 1: fast attempt
  const fast = await callModel('gpt-4o-mini', task, { timeout: 1800 })
  if (fast.parsed && fast.confidence >= 0.82) return fast

  // Tier 2: balanced refinement
  const balanced = await callModel('gpt-4o', enhance(fast, task), { temperature: 0.4 })
  if (balanced.confidence >= 0.9) return balanced

  // Tier 3: heavy reasoning escalation
  return callModel('claude-opus', enrichWithCritique(balanced, task), { maxTokens: 1200 })
}
  • Escalate only when parse fails or confidence < threshold.
  • Propagate critique context instead of raw hallucinated text.
  • Track token + cost metrics per tier for optimization.

Caching & Cost Control

  • Deduplicate identical structured extraction prompts via hash cache.
  • Use temperature 0–0.3 for parse‑critical tasks to reduce retries.
  • Persist intermediate balanced-tier outputs for heavy escalation reuse.
  • Track token usage per agent role to spot misalignment.
  • Batch low priority tasks during off-peak windows.
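
The deduplication idea in the first bullet can be as simple as hashing the normalized prompt; a minimal sketch, where the cache shape and normalization are assumptions:

// Sketch (TypeScript): hash cache for identical structured-extraction prompts
import { createHash } from 'node:crypto'

const extractionCache = new Map<string, string>()

async function cachedExtract(prompt: string, run: (p: string) => Promise<string>): Promise<string> {
  // Normalize before hashing so trivially different prompts share a cache key
  const key = createHash('sha256').update(prompt.trim().toLowerCase()).digest('hex')
  const hit = extractionCache.get(key)
  if (hit !== undefined) return hit  // identical prompt: skip the model call entirely
  const result = await run(prompt)
  extractionCache.set(key, result)
  return result
}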

Confidence Signals

  • Structural: JSON schema validation pass/fail.
  • Self-estimated certainty: Model returns numeric confidence (sanity bound).
  • Evaluator score: Independent pass for factuality & coherence.
  • Time budget: Abort escalation if nearing SLA limit.
  • Cost guardrail: Hard ceiling per user/session triggers degrade mode.
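
Combined into a single escalation decision, these signals might look like the following sketch; the weights and cutoffs are illustrative assumptions, not Cleo defaults.

// Sketch (TypeScript): aggregating confidence signals into an escalate/stop decision
interface Signals {
  schemaValid: boolean     // structural: JSON schema validation pass/fail
  selfConfidence: number   // model's self-estimate, sanity-bounded below
  evaluatorScore?: number  // independent factuality/coherence pass, if run
  msRemaining: number      // time budget left before the SLA limit
  costRemaining: number    // budget left before the cost guardrail
}

function shouldEscalate(s: Signals): boolean {
  // Abort escalation when nearing the SLA or cost ceiling, regardless of quality
  if (s.msRemaining < 500 || s.costRemaining <= 0) return false
  const self = Math.max(0, Math.min(1, s.selfConfidence))
  const score = (s.schemaValid ? 0.4 : 0) + 0.3 * self + 0.3 * (s.evaluatorScore ?? self)
  return score < 0.75
}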

Tool Safety

Approval workflows and secure execution model

Tool execution is governed by scoped permissions, real‑time policy checks, human approval escalation and immutable audit trails. Minimize blast radius by constraining agents to least privilege.

Permission Scopes

Scope | Description
read | Non-destructive retrieval (fetch, search, list)
write | Create or modify content (notion_write, file_save)
execute | Run code or transformations (python_runner, script_exec)
network | Outbound web requests (web_fetch, api_call)
sensitive | Access to PII / internal systems; requires explicit approval

Approval Workflow

Agent tool call → policy check
  | pass (auto) if scope ∈ allowed && risk < threshold
  | queue if scope=sensitive OR confidence < 0.75
Queue item → human approve/deny → audit log entry → continue/abort
  • Human queue stored with TTL; stale requests auto‑expire.
  • UI shows diff / requested arguments for clarity.
  • Denied calls propagate structured error to agent for graceful fallback.
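
The pass/queue branch above reduces to a pure function; a minimal sketch, assuming these scope names and the 0.75 confidence cutoff shown in the flow:

// Sketch (TypeScript): policy check mirroring the approval flow above
type Scope = 'read' | 'write' | 'execute' | 'network' | 'sensitive'

interface ToolCall {
  tool: string
  scope: Scope
  risk: number        // normalized 0–1 risk estimate
  confidence: number  // agent/router confidence in the call
}

function policyCheck(call: ToolCall, allowed: Scope[], riskThreshold = 0.5): 'auto' | 'queue' | 'deny' {
  if (!allowed.includes(call.scope)) return 'deny'                          // outside least privilege
  if (call.scope === 'sensitive' || call.confidence < 0.75) return 'queue'  // human approval path
  return call.risk < riskThreshold ? 'auto' : 'queue'
}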

Risk Classification

Level | Scopes | Examples
Low | read | Public info retrieval, static asset fetch
Medium | write or execute | Content mutation, code run with sandbox
High | network or sensitive | External exfiltration vectors, PII read
Critical | sensitive + execute | Potential lateral movement or data leakage

Rate Limits

Tool | Quota | Burst | Notes
web_fetch | 60 / 5m | Burst 5 | Exponential backoff after 429
python_runner | 20 / 10m | Serialized | Workspace CPU guard
notion_write | 40 / 10m | Burst 3 | Queue + retry jitter
email_send | 100 / 1h | Burst 10 | DMARC compliance + delay
vector_search | 200 / 5m | Parallel | Cache layer w/ LRU
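
The exponential backoff noted for web_fetch can be wrapped around any tool call; a minimal sketch, where the retry counts, delays, and the err.status shape are assumptions:

// Sketch (TypeScript): exponential backoff with jitter for 429 responses
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) throw err
      // Jitter prevents synchronized retries from re-triggering the limit
      const delay = Math.min(30_000, 2 ** attempt * 500) + Math.random() * 250
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
}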

Audit Log Schema

Event | Fields
Tool Call Start | timestamp, agentId, tool, argsHash, scope
Tool Call End | duration, success, errorType, tokensUsed
Escalation | previousTool, rationale, newScope
Approval Decision | approverId, decision, latency, justification
Anomaly Flag | patternType, severity, correlationId
  • All entries carry a correlationId for tracing cross-agent flows.
  • High severity anomalies trigger webhook + optional Slack alert.
  • Logs are immutable append-only; retention tiered (hot → warm → archive).
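
Typed out, the schema above maps naturally onto a discriminated union; the field names follow the table, while the event tags are illustrative assumptions.

// Sketch (TypeScript): audit entries as a discriminated union keyed by event type
interface AuditBase {
  correlationId: string  // present on every entry for cross-agent tracing
}

type AuditEntry = AuditBase & (
  | { event: 'tool_call_start'; timestamp: number; agentId: string; tool: string; argsHash: string; scope: string }
  | { event: 'tool_call_end'; duration: number; success: boolean; errorType?: string; tokensUsed: number }
  | { event: 'escalation'; previousTool: string; rationale: string; newScope: string }
  | { event: 'approval_decision'; approverId: string; decision: 'approve' | 'deny'; latency: number; justification: string }
  | { event: 'anomaly_flag'; patternType: string; severity: 'low' | 'medium' | 'high' }
)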

Best Practices

  • Create separate agents for high‑risk tools (isolate scope).
  • Hash + diff args for write operations to show intent clarity.
  • Enable the human approval queue only for sensitive + execute scopes, not routine writes.
  • Alert on unusual burst patterns (entropy of tool sequence).
  • Rotate API keys & enforce per‑agent tokens when possible.

Multi-Agent

Delegation, supervision and collaboration patterns

Cleo orchestrates agents through an adaptive supervisor that performs intent routing, delegation, arbitration and evaluation. The system emphasizes minimal escalation, deterministic structure, and explicit confidence signals.

Conceptual Flow

User Input
   ↓
[ Supervisor ] -- intent classification --> ( route )
   |--> direct answer (low complexity)
   |--> Specialist A (research)
   |--> Specialist B (analysis)
   |        ↓
   |     Worker agents (extraction, transform)
   |        ↓
   |<-- aggregated partial outputs
   |        ↓
   |--> Evaluator (quality / factuality / style)
   |        ↓ (approve / request revision)
Final Response --> User

Graph edges represent potential delegation; actual path chosen by heuristics (intent, tool availability, confidence, cost budget).

Orchestration Phases

  1. Intake: Normalize user input; detect language; strip PII if required.
  2. Intent Classification: Light fast model or rules to map to domain + complexity level.
  3. Routing Decision: Select direct response vs delegation; choose specialist set.
  4. Task Decomposition: Optional planner expansion into structured sub‑tasks.
  5. Execution: Specialists + workers perform reasoning + tool calls.
  6. Synthesis: Combine multi‑agent outputs (order, conflict resolution).
  7. Evaluation: Quality, factuality, coherence, style normalization.
  8. Finalization: Formatting, safe content filters, response packaging.

Routing Strategies

  • Keyword + Tool Match: Map intent tokens to agents whose tool set intersects required capability.
  • Confidence Threshold: If classifier confidence < τ → escalate to generalist or ask clarification.
  • Cost-Aware Routing: Prefer cheapest capable agent unless complexity score > threshold.
  • Adaptive Feedback: Evaluator signals misrouting; update routing weights incrementally.
  • Composite Voting: Sample 2 light models for classification; use consensus or escalate.
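
The composite-voting strategy is the simplest to sketch: run two light classifiers in parallel and escalate on disagreement. The model names and the classify signature below are assumptions.

// Sketch (TypeScript): consensus between two fast classifiers, escalate otherwise
async function classifyWithConsensus(
  query: string,
  classify: (model: string, q: string) => Promise<string>,
): Promise<{ intent: string; escalate: boolean }> {
  const [a, b] = await Promise.all([
    classify('gpt-4o-mini', query),
    classify('claude-haiku', query),
  ])
  // Agreement between two cheap models is treated as sufficient confidence
  if (a === b) return { intent: a, escalate: false }
  return { intent: a, escalate: true }  // disagreement: escalate to a stronger tier
}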

Arbitration Patterns

  • Evaluator Gate: Evaluator must approve if risk score > R or novelty flag set.
  • Dual Response Compare: Two specialists produce outputs → evaluator chooses or merges.
  • Progressive Refinement: Draft → critique → improved draft (limit N cycles).
  • Conflict Resolution: If contradictory claims → request sources or escalate to higher tier model.
  • Time Budget Abort: If cumulative execution time > SLA threshold → degrade gracefully.
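
Progressive refinement is the pattern most prone to cost spirals, so the cycle cap matters; a minimal sketch, assuming generate/critique callbacks and N = 2:

// Sketch (TypeScript): draft → critique → improved draft, bounded at maxCycles
async function progressiveRefine(
  input: string,
  generate: (prompt: string) => Promise<string>,
  critique: (draft: string) => Promise<{ ok: boolean; notes: string }>,
  maxCycles = 2,
): Promise<string> {
  let draft = await generate(input)
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const review = await critique(draft)
    if (review.ok) break  // evaluator gate passed; stop refining
    draft = await generate(`${input}\n\nRevise according to: ${review.notes}`)
  }
  return draft
}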

Supervision Loops

  • Light Supervision: Supervisor only delegates & aggregates; no evaluator unless uncertainty flagged.
  • Inline Evaluation: Evaluator reviews each intermediate artifact before next stage.
  • Periodic Audit: Every N tasks, sample outputs for deeper factual QA.
  • Escalation Ladder: Uncertain → evaluator → heavy model → human (optional).
  • Self-Critique Injection: Agent produces THOUGHT + CRITIQUE internally before FINAL output.

Optimization Tips

  • Cache classification & routing decisions by normalized query signature.
  • Short‑circuit evaluator when structural parse already passes high confidence.
  • Limit refinement loops (N ≤ 2) to prevent cost spirals.
  • Track per‑role token + latency metrics to prune underperforming agents.
  • Fallback to single‑agent mode in degraded / high load states.

Image Generation

Creative rendering with model selection & limits

Troubleshooting

Common issues, diagnostics and recovery steps

Use this guide to quickly isolate issues across routing, delegation, tooling, memory and cost. Patterns are designed for rapid triage with structured remediation.

Area | Symptom | Likely Cause | Action
Connection | Intermittent 504 / timeouts | Model provider latency spike | Fail over to secondary key; check status page
Agents | Delegation never triggers | Routing heuristic confidence too strict | Lower threshold or add domain keywords
Tools | Frequent 429 on web_fetch | Rate limit exceeded | Introduce jitter & batch queries
Memory | Early context truncation | Max tokens too low | Increase maxTokens or enable streaming summarizer
Costs | Sudden token usage spike | Escalation loop / evaluator recursion | Cap refinement cycles; add safeguard counter
Output | Invalid JSON parse | Temperature too high or missing schema framing | Add explicit JSON schema + reduce temperature

API Diagnostics

  • List agents: GET /api/agents
  • Recreate orchestrator: POST /api/agents/register?recreate=true
  • Execute agent: POST /api/agents/execute { agentId, input }
  • List tasks: GET /api/agent-tasks
  • Check metrics: GET /api/agents/metrics
  • Reset thread: POST /api/threads/reset { threadId }
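
A quick way to exercise several of these endpoints together, sketched with fetch; the base URL and token are placeholders, and the response shapes are assumptions:

// Sketch (TypeScript): basic health pass over the diagnostics endpoints
const BASE = 'https://api.tu-dominio.com'
const headers = { 'Content-Type': 'application/json', Authorization: 'Bearer <TOKEN>' }

async function diagnose(agentId: string) {
  const agents = await fetch(`${BASE}/api/agents`, { headers }).then((r) => r.json())
  const metrics = await fetch(`${BASE}/api/agents/metrics`, { headers }).then((r) => r.json())
  const run = await fetch(`${BASE}/api/agents/execute`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ agentId, input: 'healthcheck: reply OK' }),
  }).then((r) => r.json())
  return { agents, metrics, run }
}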

Error Taxonomy

Code | Meaning
routing.miss | Supervisor selected a suboptimal agent; adjust thresholds
delegation.timeout | Worker exceeded execution window; raise timeout or optimize task
tool.rate_limited | 429 from provider; apply backoff + queue
model.hallucination | Low factual confidence; trigger evaluator rewrite
parse.failure | Invalid JSON; enforce schema & retry at lower temperature
memory.overflow | Too many tokens; compress older context
cost.guardrail | Budget exceeded; degrade to fast tier + reduce depth

Error codes are structured to allow automated remediation triggers.
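
A remediation dispatcher can key directly off these codes; the action names below are illustrative, not a built-in Cleo mapping.

// Sketch (TypeScript): routing taxonomy codes to automated remediation actions
type Remediation =
  | 'lower_threshold' | 'raise_timeout' | 'backoff_queue' | 'evaluator_rewrite'
  | 'retry_low_temp' | 'compress_context' | 'degrade_tier'

const remediationMap: Record<string, Remediation> = {
  'routing.miss': 'lower_threshold',
  'delegation.timeout': 'raise_timeout',
  'tool.rate_limited': 'backoff_queue',
  'model.hallucination': 'evaluator_rewrite',
  'parse.failure': 'retry_low_temp',
  'memory.overflow': 'compress_context',
  'cost.guardrail': 'degrade_tier',
}

function remediate(code: string): Remediation | 'manual_triage' {
  // Unknown codes fall through to human triage instead of guessing
  return remediationMap[code] ?? 'manual_triage'
}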

Recovery Playbooks

Routing Broken

  1. Enable debug routing logs
  2. Lower confidence threshold 0.85 → 0.7
  3. Add explicit keyword mapping
  4. Rebuild orchestrator

Escalation Loop

  1. Set max refinement cycles = 2
  2. Add token guard
  3. Log evaluator triggers
  4. Fallback to balanced tier

High Latency

  1. Activate streaming
  2. Switch to fast tier
  3. Enable partial synthesis
  4. Batch similar requests

JSON Failures

  1. Wrap schema in fenced block
  2. Remove narrative instructions
  3. Lower temperature
  4. Add validator + retry
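
Steps 3 and 4 of this playbook combine into a validate-then-retry loop; a minimal sketch, assuming an ask(temperature) callback and plain JSON.parse as the validator:

// Sketch (TypeScript): retry JSON parsing at progressively lower temperature
async function parseWithRetry(
  ask: (temperature: number) => Promise<string>,
  maxAttempts = 3,
): Promise<unknown> {
  let temperature = 0.4
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(temperature)
    try {
      return JSON.parse(raw)  // swap in a schema validator for stricter checks
    } catch {
      temperature = Math.max(0, temperature - 0.2)  // playbook step 3: lower temperature
    }
  }
  throw new Error('parse.failure: invalid JSON after retries')
}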

Tool Flood

  1. Apply per-agent rate limiter
  2. Throttle high-frequency tool
  3. Introduce queue + jitter
  4. Alert on anomaly

Memory Drift

  1. Shorten conversation window
  2. Enable summarizer
  3. Disable long_term memory temporarily
  4. Reset thread context

Preventative Monitoring

  • Alert on escalation chain length > 2.
  • Track JSON parse failure rate; auto‑lower temperature if spike detected.
  • Log per‑tool p95 latency & throttle anomalies.
  • Capture evaluator disagreement rate as drift signal.
  • Budget guard: emit event at 80% daily cost threshold.

Frequently Asked Questions

Answers to recurring questions
