Assistants
An assistant is a callable voice agent. You define:
- Name and system prompt
- Personality sliders (formality, assertiveness, empathy, talkativeness)
- Reasoning mode: built-in LLM, webhook, or MCP
- Telephony direction: inbound only, outbound only, or both
Assistants do not contain business logic. They handle the call. Your backend handles the truth.
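A minimal sketch of an assistant definition in TypeScript. Every field name here is an assumption for illustration, not Aployee's documented schema:

```typescript
// Illustrative assistant definition. Field names and values are
// assumptions for this sketch, not Aployee's actual config schema.
const assistant = {
  name: "Order Desk",
  systemPrompt: "You answer for Example Pizza. Confirm every item back to the caller.",
  personality: {
    formality: "casual",
    assertiveness: "medium",
    empathy: "high",
    talkativeness: "low",
  },
  reasoning: { mode: "webhook", url: "https://api.example.com/aployee/reason" },
  telephony: { direction: "both" }, // "inbound" | "outbound" | "both"
};
```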
Conversation Engine
Aployee controls the live interaction. Responsibilities include:
- Transcribing speech incrementally
- Managing turn-taking with sub-300ms latency
- Handling interruptions cleanly
- Maintaining short-term conversational state
- Driving the assistant's personality
Aployee produces the words and timing. You keep the business rules.
Dual-Agent Architecture
Every call runs two processes:
Conversation Agent
Produces immediate, low-latency responses. Handles speech, turn-taking, and interruptions. Keeps the dialogue flowing naturally.
Reasoning Agent
Runs asynchronously via the built-in LLM, a webhook, or MCP. Complex reasoning executes in parallel without blocking speech.
The conversation agent keeps talking while the reasoning agent works. When reasoning completes, results blend into the next natural turn.
Outcome: Calls feel human even when your backend is slow or complex.
Reasoning Modes
| Mode | Best For | How It Works |
|---|---|---|
| Built-in | Prototypes, MVPs | Aployee's built-in LLM handles reasoning directly; no external endpoint needed. Good for demos. |
| Webhook | Enterprise, custom logic | Your backend receives utterance + state + context and returns updated state + an optional reply. The "bring your own LLM" path. |
| MCP | Tool integrations | Model Context Protocol for external tools. State stays in Aployee; tools live wherever you want. |
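A sketch of a webhook-mode backend. It assumes a JSON request of `{ utterance, state, context }` and a JSON reply of `{ state, reply }`, matching the description above; the exact wire format is an assumption:

```typescript
// Minimal reasoning webhook. Payload and reply shapes follow the table
// above (utterance + state + context in, updated state + optional reply
// out) but are otherwise assumptions, not Aployee's documented contract.
import { createServer } from "node:http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const { utterance, state } = JSON.parse(body); // context is also available

    // Your business logic lives here: CRM lookups, your own LLM, etc.
    const updatedState = { ...state, lastUtterance: utterance };
    const reply = /order/i.test(utterance)
      ? "Sure - what's your order number?"
      : undefined; // no reply: the conversation agent keeps the floor

    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ state: updatedState, reply }));
  });
}).listen(8080);
```

Returning no reply lets the conversation agent keep speaking on its own; returning one blends your answer into the next natural turn.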
State Model
Aployee maintains a JSON state object per call. This stores:
- Key entities mentioned
- Call progress markers
- Conversation summary
Every webhook call includes this state. Your webhook can modify it or store durable state in your own system. Aployee treats your returned state as the source of truth.
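An illustrative state object, assuming keys along the lines of the list above; the actual shape is Aployee's:

```typescript
// Illustrative per-call state. Keys are examples, not a required schema.
const state = {
  entities: { customerName: "Dana", orderId: "A-1042" },      // key entities mentioned
  progress: { identityVerified: true, issueResolved: false }, // call progress markers
  summary: "Dana wants to change the delivery address on order A-1042.",
};
```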
Behavioral Controls
These settings override the prompt and the personality sliders. Explicit config, no guesswork:
- max_call_duration_seconds
- silence_timeout_ms
- max_consecutive_silences
- escalation_rules
- disconnect_rules
- allowed_phrases / banned_phrases
- transfer_target
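One sketch with all of these keys. The keys are the documented controls; the values and rule shapes are assumptions:

```typescript
// Keys match the documented controls above; values and the shape of
// the rule objects are assumptions for this sketch.
const behavior = {
  max_call_duration_seconds: 600,
  silence_timeout_ms: 4000,
  max_consecutive_silences: 3,
  escalation_rules: [{ when: "caller_requests_human", action: "transfer" }],
  disconnect_rules: [{ when: "silence_budget_exhausted", action: "hang_up" }],
  allowed_phrases: [],
  banned_phrases: ["guarantee", "legal advice"],
  transfer_target: "sip:support@example.com",
};
```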
Personality Model
Externally you set sliders:
| Slider | Options |
|---|---|
| formality | casual, neutral, formal |
| assertiveness | low, medium, high |
| empathy | low, medium, high |
| talkativeness | low, medium, high |
Internally these map to DISC-driven behaviors. You get predictable speech patterns without needing to know any psychological models.
Telephony
Aployee abstracts carriers and audio pipelines. You get:
- Inbound numbers
- Outbound calling
- Transfers to phones or SIP endpoints
- Recordings
- Real-time transcripts
No carrier complexity. No codec or jitter handling. No SIP behaviors to tune.
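Placing an outbound call might look like this. The endpoint, auth header, and body fields are hypothetical stand-ins, not the real API; check the API reference for the actual contract:

```typescript
// Hypothetical REST call to start an outbound call. The URL and body
// fields are assumptions, not Aployee's documented API.
const res = await fetch("https://api.aployee.example/v1/calls", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.APLOYEE_API_KEY}`,
  },
  body: JSON.stringify({ assistant_id: "asst_123", to: "+15551234567" }),
});
console.log(await res.json()); // e.g. call id and status
```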
Observability
Each call exposes:
- Timeline with user speech, assistant speech, reasoning events, webhook calls
- Waveform visualization
- Full transcript
- Latency breakdown: ASR, TTS, conversation agent, reasoning agent, webhook, MCP
- Call summary
Debug behavior without guessing.
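An illustrative slice of timeline data; the real event schema may differ:

```typescript
// Illustrative timeline entries. The event shapes are assumptions,
// shown only to convey what the latency breakdown covers.
const timeline = [
  { at_ms: 13_900, kind: "user_speech", text: "Can you move my appointment?" },
  { at_ms: 14_150, kind: "webhook_call", latency_ms: { asr: 180, webhook: 640, tts: 110 } },
  { at_ms: 15_020, kind: "assistant_speech", text: "Thursday at 3pm is open." },
];
```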
Next steps
- Follow the quickstart guide
- Configure reasoning webhooks
- Set up MCP integration