Detailed Sequence Diagram
Sequence Phases
- Phase 1: Authentication (Steps 1-7)
- Phase 2: Conversation Loop (Steps 8-31)
- Phase 3: Transfer Request (Steps 32-38)
- Phase 4: Twilio Transfer (Steps 39-52)
Phase 1: Authentication (Steps 1-7)
User authentication and session creation
User Browser → WebSocket Server: Connect WebSocket
User opens the Convonet WebRTC voice assistant UI and establishes a WebSocket connection via Socket.IO to the Flask server.
WebSocket Server → PIN Auth: Request Authentication
The WebSocket server requests PIN authentication from the authentication module.
PIN Auth → PostgreSQL: Validate PIN
The PIN authentication module validates the user's PIN against the PostgreSQL database using SQLAlchemy. Note: Authentication is handled via PostgreSQL, not FusionPBX.
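A minimal sketch of what this PIN check could look like with SQLAlchemy; the User model, table name, and hashed-PIN column are assumptions, not Convonet's actual schema:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, Session
from werkzeug.security import check_password_hash

Base = declarative_base()

class User(Base):
    __tablename__ = "users"          # assumed table name
    id = Column(Integer, primary_key=True)
    name = Column(String)
    team = Column(String)
    pin_hash = Column(String)        # assumes PINs are stored hashed, not in plain text

def validate_pin(db: Session, pin: str):
    """Return the matching user record, or None if the PIN matches nobody."""
    for user in db.query(User).all():            # user table assumed small enough to scan
        if check_password_hash(user.pin_hash, pin):
            return user
    return None
```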
PostgreSQL → PIN Auth: User Data
PostgreSQL returns user data including user ID, name, and team associations.
PIN Auth → Redis: Create Session
A new session is created in Redis with a unique session ID, storing user context and authentication status.
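A minimal sketch of session creation with redis-py; the key layout, payload fields, and TTL are assumptions:

```python
import json
import uuid

import redis

r = redis.Redis(decode_responses=True)

def create_session(user_id: int, name: str, teams: list[str]) -> str:
    """Create a Redis-backed session and return its unique ID."""
    session_id = str(uuid.uuid4())
    r.set(
        f"session:{session_id}",                 # assumed key naming scheme
        json.dumps({"user_id": user_id, "name": name,
                    "teams": teams, "authenticated": True}),
        ex=3600,                                 # 1-hour expiry (assumption)
    )
    return session_id
```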
PIN Auth → WebSocket Server: Authenticated
The authentication module confirms successful authentication to the WebSocket server.
WebSocket Server → User Browser: Session ID
The WebSocket server sends the session ID to the user's browser. The user is now authenticated and ready to interact with the voice assistant.
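A sketch of the server-side acknowledgement using Flask-SocketIO; the event names are assumptions, and validate_pin / create_session refer to the helpers sketched in the earlier steps:

```python
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on("authenticate")
def handle_authenticate(data):
    # db_session: a SQLAlchemy Session (construction omitted);
    # validate_pin / create_session come from the sketches above
    user = validate_pin(db_session, data["pin"])
    if user is None:
        emit("auth_failed", {"reason": "invalid PIN"})    # event name is an assumption
        return
    session_id = create_session(user.id, user.name, [user.team])
    emit("authenticated", {"session_id": session_id})     # browser keeps this for later events
```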
Phase 2: Normal Conversation Loop (Steps 8-31)
Audio capture, transcription, AI processing, and response generation
User Browser → WebSocket Server: Start Recording
User initiates voice recording in the browser, beginning the audio capture process.
WebSocket Server → WebRTC Voice Server: Audio Chunks (WebRTC)
Audio chunks are streamed from the browser via WebRTC through the WebSocket server to the WebRTC Voice Server.
WebRTC Voice Server → Redis: Buffer Audio Data
Each session's audio chunks are appended to a Redis buffer (convonet.redis_manager) so downstream workers can read them for processing.
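A minimal buffering sketch using redis-py; the key scheme and expiry are assumptions (the real logic lives in convonet.redis_manager):

```python
import redis

r = redis.Redis()

def buffer_audio(session_id: str, chunk: bytes) -> None:
    """Append an incoming WebRTC audio chunk to the session's Redis buffer."""
    key = f"audio:{session_id}"      # assumed key scheme
    r.append(key, chunk)             # Redis APPEND keeps a contiguous byte buffer per session
    r.expire(key, 300)               # drop stale buffers after 5 minutes (assumption)
```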
Redis → Deepgram STT: Send Audio Buffer → Transcribed Text
The WebRTC worker reads the current buffer and forwards it to Deepgram (not Twilio) for high-fidelity transcription; Deepgram returns the recognized text to the WebRTC service.
Important: Deepgram STT is used for WebRTC audio, not Twilio transcription.
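A sketch of handing the buffer to Deepgram's pre-recorded /v1/listen REST endpoint; the model name and audio encoding are assumptions, and Convonet may use Deepgram's streaming interface instead:

```python
import os

import requests

def transcribe(audio: bytes) -> str:
    """Send the buffered audio to Deepgram and return the recognized text."""
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2", "smart_format": "true"},   # model choice is an assumption
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/wav",                      # assumed encoding of the buffer
        },
        data=audio,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```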
WebRTC Voice Server → LangGraph → Claude LLM: Process Intent
The transcript enters the LangGraph assistant (Convonet's LangGraph state machine). LangGraph calls Claude (the LLM) for reasoning and response generation, and the LLM returns its response and any tool calls to LangGraph.
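A minimal sketch of a LangGraph graph whose single node calls Claude through the anthropic SDK; the state shape, node name, and model id are assumptions, and Convonet's real graph also wires in tool and transfer nodes:

```python
from typing import TypedDict

import anthropic
from langgraph.graph import StateGraph, END

class AssistantState(TypedDict):
    transcript: str
    reply: str

claude = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def call_llm(state: AssistantState) -> AssistantState:
    """Single LangGraph node: ask Claude for a reply to the transcript."""
    msg = claude.messages.create(
        model="claude-sonnet-4-20250514",        # model id is an assumption
        max_tokens=512,
        messages=[{"role": "user", "content": state["transcript"]}],
    )
    return {"transcript": state["transcript"], "reply": msg.content[0].text}

graph = StateGraph(AssistantState)
graph.add_node("llm", call_llm)
graph.set_entry_point("llm")
graph.add_edge("llm", END)
assistant = graph.compile()

result = assistant.invoke({"transcript": "What's on my calendar today?", "reply": ""})
```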
Tool Execution (If Needed)
If the LLM decides to execute an action, LangGraph invokes registered tools:
- Database Operations: PostgreSQL queries/updates via SQLAlchemy
- Calendar Operations: Google Calendar/OAuth workflows
- PBX Metadata: FusionPBX lookups for call metadata
Tool results are injected back into the LangGraph state, and Claude composes the final reply.
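A sketch of how returned tool calls could be dispatched; the tool names mirror the list above, but the registry and call format are assumptions about Convonet's internals:

```python
def query_database(args: dict) -> dict: ...        # PostgreSQL queries/updates via SQLAlchemy
def check_calendar(args: dict) -> dict: ...        # Google Calendar / OAuth workflows
def lookup_pbx_metadata(args: dict) -> dict: ...   # FusionPBX call-metadata lookups

TOOLS = {
    "query_database": query_database,
    "check_calendar": check_calendar,
    "lookup_pbx_metadata": lookup_pbx_metadata,
}

def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested tool and collect results for injection back into the graph state."""
    return [{"tool": call["name"], "result": TOOLS[call["name"]](call.get("args", {}))}
            for call in tool_calls]
```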
LangGraph → Claude LLM → LangGraph: Generate Final Response
LangGraph sends the accumulated context to the Claude LLM for final response generation. The LLM returns the response text to LangGraph, which passes it to the WebRTC Voice Server.
Deepgram TTS → Redis → WebSocket → User: Audio Response
Final text is synthesized via Deepgram TTS. The audio response is streamed back over Socket.IO to the browser for playback. Audio is buffered in Redis before streaming to ensure smooth playback.
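A sketch of synthesis via Deepgram's /v1/speak endpoint followed by a Socket.IO push to the caller; the voice model, event name, and room layout are assumptions, and the Redis buffering step is omitted:

```python
import os

import requests

def synthesize(text: str) -> bytes:
    """Turn the final reply text into audio with Deepgram TTS."""
    resp = requests.post(
        "https://api.deepgram.com/v1/speak",
        params={"model": "aura-asteria-en"},     # assumed voice model
        headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.content                          # raw audio bytes

def send_reply(socketio, session_id: str, text: str) -> None:
    """Emit the synthesized audio to the caller's Socket.IO room."""
    socketio.emit("assistant_audio", {"audio": synthesize(text)}, to=session_id)
```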
Phase 3: Transfer Request (Steps 32-38)
User requests transfer and system initiates transfer process
User Browser → WebSocket Server: "Transfer to Agent"
User says "I need a human" or requests transfer to a human agent.
WebSocket Server → WebRTC Voice Server → LangGraph: Transfer Intent
Transfer intent is passed through the WebRTC Voice Server to LangGraph for detection and processing.
LangGraph → Claude LLM: Detect Transfer Intent → Transfer Command
LangGraph detects the transfer intent and confirms it with the Claude LLM, which returns a transfer command to LangGraph.
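A sketch of transfer-intent routing with a LangGraph conditional edge; this is a fresh, deliberately naive graph rather than Convonet's real one, and the keyword check stands in for the LLM's judgment:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class TransferState(TypedDict):
    transcript: str
    reply: str

def detect_intent(state: TransferState) -> TransferState:
    return state                        # the Claude call from the Phase 2 sketch would run here

def route(state: TransferState) -> str:
    # Naive keyword check standing in for the LLM's transfer decision
    return "transfer" if "human" in state["transcript"].lower() else "respond"

g = StateGraph(TransferState)
g.add_node("detect", detect_intent)
g.add_node("transfer", lambda s: {"reply": "Transferring you to an agent now."})
g.add_node("respond", lambda s: s)
g.set_entry_point("detect")
g.add_conditional_edges("detect", route, {"transfer": "transfer", "respond": "respond"})
g.add_edge("transfer", END)
g.add_edge("respond", END)
transfer_graph = g.compile()
```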
LangGraph → Redis: Set Transfer Flag
LangGraph sets a transfer flag in Redis to indicate that a transfer is in progress.
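A minimal sketch of the transfer flag; the key name, default target extension, and expiry are assumptions:

```python
import redis

r = redis.Redis(decode_responses=True)

def set_transfer_flag(session_id: str, target_extension: str = "2001") -> None:
    """Mark the session as transferring so the normal conversation loop pauses."""
    # The 2-minute expiry lets the flag lapse if the bridge never completes (assumption)
    r.set(f"transfer:{session_id}", target_extension, ex=120)
```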
LangGraph → WebRTC Voice Server → WebSocket Server → User: Transfer Initiated
Transfer initiated signal flows back through the system. WebRTC Voice Server sends transfer event to WebSocket Server, which notifies the user browser. User sees transfer status update.
Phase 4: Twilio Transfer Flow (Steps 39-52)
Call bridging to FusionPBX and agent dashboard connection
WebRTC Voice Server → Twilio API: POST /voice_assistant/transfer_bridge
The WebRTC backend calls the Convonet /twilio/voice_assistant/transfer_bridge endpoint. Twilio then dials the configured SIP URI (sip:2001@FREEPBX_DOMAIN;transport=udp) or a trunk number, depending on the .env configuration.
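A sketch of a transfer_bridge handler returning TwiML that dials the agent's SIP URI on FusionPBX; the route path matches the one named above, but the handler body is an assumption:

```python
from flask import Flask
from twilio.twiml.voice_response import VoiceResponse, Dial

app = Flask(__name__)

@app.route("/twilio/voice_assistant/transfer_bridge", methods=["POST"])
def transfer_bridge():
    resp = VoiceResponse()
    dial = Dial()
    # FREEPBX_DOMAIN would come from .env; transport=udp per the configuration described above
    dial.sip("sip:2001@FREEPBX_DOMAIN;transport=udp")
    resp.append(dial)
    return str(resp), 200, {"Content-Type": "text/xml"}
```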
Twilio → FusionPBX: SIP INVITE to Extension 2001
Twilio dials FusionPBX (running on Google Cloud) via SIP INVITE. FusionPBX routes the call to the target agent extension (e.g., 2001).
FusionPBX → Agent Dashboard: Ring Extension 2001
FusionPBX rings extension 2001. The agent dashboard (JsSIP client) registers with FusionPBX over WSS (wss://<fusionpbx>:7443) and receives the incoming call notification.
Agent Dashboard → PostgreSQL: Fetch User Info → Show User Info Popup
The agent dashboard fetches the caller record from PostgreSQL via a REST API, pops it on screen, and shows ringing controls alongside user information (name, context, call history).
Agent Answers → Conversation Begins
Agent answers the call via JsSIP client. FusionPBX notifies Twilio that the call is connected. Twilio bridges the audio between user leg and agent leg. Live conversation begins between the user and agent.
Note: Once Twilio bridges the call to FusionPBX, the audio is now a PSTN/SIP leg (not the original WebRTC stream). Deepgram STT may still be used on the server side for context logging if needed.
Alternative: Agent Rejects/Timeout
If the agent rejects the call or it times out (see the sketch after this list):
- FusionPBX notifies Twilio that the call failed
- Twilio sends transfer failed notification to WebRTC Voice Server
- WebRTC Voice Server sends transfer error event to WebSocket Server
- User browser displays transfer failed message
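A minimal sketch of that failure path: clear the transfer flag from Phase 3 and push an error event to the browser; the event name is an assumption:

```python
import redis

r = redis.Redis()

def handle_transfer_failed(socketio, session_id: str, reason: str) -> None:
    """Roll the session back to normal conversation and tell the browser why."""
    r.delete(f"transfer:{session_id}")           # clear the flag set during Phase 3
    socketio.emit("transfer_failed", {"reason": reason}, to=session_id)
```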
Key Sequence Points
Authentication Path
PIN authentication uses PostgreSQL (not FusionPBX). Sessions are stored in Redis for fast access.
Audio Processing
Deepgram STT is used for WebRTC audio transcription, not Twilio. Redis buffers audio chunks for processing.
AI Orchestration
LangGraph coordinates between the Claude LLM and the registered (MCP) tools. Tool execution happens before final response generation.
Transfer Mechanism
Twilio bridges the WebRTC user leg to the SIP agent leg. Audio transitions from WebRTC to PSTN/SIP stream.