Assistants
An assistant is a callable voice agent. You define:
- Name and system prompt
- Personality sliders (formality, assertiveness, empathy, talkativeness)
- Reasoning mode: built-in LLM, webhook, or MCP
- Telephony direction: inbound only, outbound only, or both
Assistants do not contain business logic. They handle the call. Your backend handles the truth.
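A minimal sketch of an assistant definition in TypeScript. Every field name here is an assumption for illustration, not Aployee's documented schema:

```typescript
// Illustrative assistant definition. Field names and values are
// assumptions for this sketch, not Aployee's actual config schema.
const assistant = {
  name: "Order Desk",
  systemPrompt: "You answer for Example Pizza. Confirm every item back to the caller.",
  personality: {
    formality: "casual",
    assertiveness: "medium",
    empathy: "high",
    talkativeness: "low",
  },
  reasoning: { mode: "webhook", url: "https://api.example.com/aployee/reason" },
  telephony: { direction: "both" }, // "inbound" | "outbound" | "both"
};
```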
Conversation Engine
Aployee controls the live interaction. Responsibilities include:
- Transcribing speech incrementally
- Managing turn-taking with sub-300ms latency
- Handling interruptions cleanly
- Maintaining short-term conversational state
- Driving the assistant's personality
Aployee produces the words and timing. You keep the business rules.
Dual-Agent Architecture
Every call runs two processes:
Conversation Agent
Produces immediate, low-latency responses. Handles speech, turn-taking, and interruptions. Keeps the dialogue flowing naturally.
Reasoning Agent
Runs asynchronously via the built-in LLM, a webhook, or MCP. Complex reasoning executes in parallel without blocking speech.
The conversation agent keeps talking while the reasoning agent works. When reasoning completes, results blend into the next natural turn.
Outcome: Calls feel human even when your backend is slow or complex.
Reasoning Modes
| Mode | Best For | How It Works |
|---|---|---|
| Built-in | Prototypes, MVPs | Aployee's built-in LLM handles reasoning directly; no external endpoint needed. Good for demos. |
| Webhook | Enterprise, custom logic | Your backend receives utterance + state + context and returns updated state + an optional reply. The "bring your own LLM" path. |
| MCP | Tool integrations | Model Context Protocol for external tools. State stays in Aployee; tools live wherever you want. |
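A sketch of a webhook-mode backend. It assumes a JSON request of `{ utterance, state, context }` and a JSON reply of `{ state, reply }`, matching the description above; the exact wire format is an assumption:

```typescript
// Minimal reasoning webhook. Payload and reply shapes follow the table
// above (utterance + state + context in, updated state + optional reply
// out) but are otherwise assumptions, not Aployee's documented contract.
import { createServer } from "node:http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const { utterance, state } = JSON.parse(body); // context is also available

    // Your business logic lives here: CRM lookups, your own LLM, etc.
    const updatedState = { ...state, lastUtterance: utterance };
    const reply = /order/i.test(utterance)
      ? "Sure - what's your order number?"
      : undefined; // no reply: the conversation agent keeps the floor

    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ state: updatedState, reply }));
  });
}).listen(8080);
```

Returning no reply lets the conversation agent keep speaking on its own; returning one blends your answer into the next natural turn.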
State Model
Aployee maintains a JSON state object per call. This stores:
- Key entities mentioned
- Call progress markers
- Conversation summary
Every webhook call includes this state. Your webhook can modify it or store durable state in your own system. Aployee treats your returned state as the source of truth.
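An illustrative state object, assuming keys along the lines of the list above; the actual shape is Aployee's:

```typescript
// Illustrative per-call state. Keys are examples, not a required schema.
const state = {
  entities: { customerName: "Dana", orderId: "A-1042" },      // key entities mentioned
  progress: { identityVerified: true, issueResolved: false }, // call progress markers
  summary: "Dana wants to change the delivery address on order A-1042.",
};
```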
Behavioral Controls
These settings override the prompt and the personality sliders. Explicit config, no guesswork:
- max_call_duration_seconds
- silence_timeout_ms
- max_consecutive_silences
- escalation_rules
- disconnect_rules
- allowed_phrases / banned_phrases
- transfer_target
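One sketch with all of these keys. The keys are the documented controls; the values and rule shapes are assumptions:

```typescript
// Keys match the documented controls above; values and the shape of
// the rule objects are assumptions for this sketch.
const behavior = {
  max_call_duration_seconds: 600,
  silence_timeout_ms: 4000,
  max_consecutive_silences: 3,
  escalation_rules: [{ when: "caller_requests_human", action: "transfer" }],
  disconnect_rules: [{ when: "silence_budget_exhausted", action: "hang_up" }],
  allowed_phrases: [],
  banned_phrases: ["guarantee", "legal advice"],
  transfer_target: "sip:support@example.com",
};
```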
Personality Model
Externally you set sliders:
| Slider | Options |
|---|---|
| formality | casual, neutral, formal |
| assertiveness | low, medium, high |
| empathy | low, medium, high |
| talkativeness | low, medium, high |
Internally these map to DISC-driven behaviors. You get predictable speech patterns without needing to know any psychological models.
Telephony
Aployee abstracts carriers and audio pipelines. You get:
- Inbound numbers
- Outbound calling
- Transfers to phones or SIP endpoints
- Recordings
- Real-time transcripts
No carrier complexity. No codec or jitter handling. No SIP behaviors to tune.
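Placing an outbound call might look like this. The endpoint, auth header, and body fields are hypothetical stand-ins, not the real API; check the API reference for the actual contract:

```typescript
// Hypothetical REST call to start an outbound call. The URL and body
// fields are assumptions, not Aployee's documented API.
const res = await fetch("https://api.aployee.example/v1/calls", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.APLOYEE_API_KEY}`,
  },
  body: JSON.stringify({ assistant_id: "asst_123", to: "+15551234567" }),
});
console.log(await res.json()); // e.g. call id and status
```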
Observability
Each call exposes:
- Timeline with user speech, assistant speech, reasoning events, webhook calls
- Waveform visualization
- Full transcript
- Latency breakdown: ASR, TTS, conversation agent, reasoning agent, webhook, MCP
- Call summary
Debug behavior without guessing.
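An illustrative slice of timeline data; the real event schema may differ:

```typescript
// Illustrative timeline entries. The event shapes are assumptions,
// shown only to convey what the latency breakdown covers.
const timeline = [
  { at_ms: 13_900, kind: "user_speech", text: "Can you move my appointment?" },
  { at_ms: 14_150, kind: "webhook_call", latency_ms: { asr: 180, webhook: 640, tts: 110 } },
  { at_ms: 15_020, kind: "assistant_speech", text: "Thursday at 3pm is open." },
];
```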
Next steps
- Follow the quickstart guide
- Configure reasoning webhooks
- Set up MCP integration