Voice AI Architecture Guide: Modular vs End-to-End Architectures
The Sandwich architecture composes three distinct components (speech-to-text, a text-based agent, text-to-speech), balancing performance, controllability, and access to modern model capabilities. Speech-to-speech architectures instead use multimodal models that process audio input and generate audio output natively, with no intermediate text representation. Trade-offs include latency (500-700ms for a streaming Sandwich vs potentially sub-200ms for speech-to-speech), customization (high modularity vs limited provider choice), and cost (pay-per-component vs bundled pricing).
Vapi's strength: Plug in your preferred STT, LLM, and TTS providers and control every part of the request-response cycle through its modular Sandwich architecture.
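For example, one assistant definition can name a different provider for each layer. The sketch below is illustrative only: the field names approximate the general shape of Vapi's assistant API, and the endpoint path, schema, and provider keys are assumptions to verify against Vapi's current documentation.

```python
import os
import requests

# Illustrative assistant definition: one provider per layer.
# Field names approximate Vapi's assistant schema; verify against the docs.
assistant = {
    "transcriber": {"provider": "deepgram", "model": "nova-2"},      # STT layer
    "model": {
        "provider": "openai",
        "model": "gpt-4",
        "messages": [{"role": "system", "content": "You are a helpful phone agent."}],
    },                                                               # LLM layer
    "voice": {"provider": "11labs", "voiceId": "<your-voice-id>"},   # TTS layer
}

resp = requests.post(
    "https://api.vapi.ai/assistant",   # endpoint path is an assumption
    headers={"Authorization": f"Bearer {os.environ['VAPI_API_KEY']}"},
    json=assistant,
)
resp.raise_for_status()
print(resp.json())
```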
Sandwich Architecture Explained
Three-Layer Composition
Layer 1 - Speech-to-Text (STT):
- Input: User's spoken audio
- Process: Convert audio waveform to text transcription
- Output: Transcribed text string
- Providers: Deepgram, AssemblyAI, OpenAI Whisper, Google STT
- Latency: 200-600ms depending on provider
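As a concrete example of this layer, here is a minimal transcription call using OpenAI Whisper (one of the providers above) via the openai-python v1 SDK; the audio file path is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Layer 1: audio in, transcribed text out.
with open("user_turn.wav", "rb") as audio_file:  # placeholder path
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text)  # transcribed text string handed to the LLM layer
```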
Layer 2 - Language Model (LLM):
- Input: Transcribed text from STT layer
- Process: Understand intent, maintain context, generate response
- Output: Text response
- Providers: OpenAI GPT-4/GPT-3.5, Anthropic Claude, Google Gemini
- Latency: 200-800ms depending on model and prompt
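A minimal Layer 2 call using OpenAI's chat completions API; the system prompt and user transcript below are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Layer 2: transcribed text in, response text out. Conversation context
# is carried across turns in the messages list.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise phone agent."},
        {"role": "user", "content": "What are your opening hours?"},  # STT output
    ],
)

reply = response.choices[0].message.content  # text handed to the TTS layer
print(reply)
```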
Layer 3 - Text-to-Speech (TTS):
- Input: LLM's text response
- Process: Synthesize natural-sounding speech audio
- Output: Audio waveform
- Providers: ElevenLabs, PlayHT, OpenAI TTS, Amazon Polly
- Latency: 150-400ms depending on voice
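And a minimal Layer 3 call using OpenAI TTS (one of the providers above). The voice name and output path are placeholders, and the response-handling attribute should be checked against your openai-python version.

```python
from openai import OpenAI

client = OpenAI()

# Layer 3: response text in, synthesized audio out.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # placeholder voice
    input="We're open 9am to 5pm, Monday through Friday.",
)

with open("reply.mp3", "wb") as f:  # placeholder output path
    f.write(speech.content)         # raw audio bytes
```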
Why "Sandwich"?
The language model is "sandwiched" between two modality conversion layers (audio→text, text→audio). This separates conversational logic from speech processing.
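The composition itself is just a function pipeline. Here is a minimal sketch with toy stand-ins for the three layers (stt, llm, and tts are hypothetical placeholders, not provider SDK calls):

```python
def stt(audio: bytes) -> str:
    """Layer 1 stand-in: audio waveform -> transcript."""
    return "what are your opening hours?"  # placeholder output

def llm(transcript: str, history: list[dict]) -> str:
    """Layer 2 stand-in: transcript + context -> text response."""
    return "We're open 9am to 5pm, Monday through Friday."

def tts(text: str) -> bytes:
    """Layer 3 stand-in: text response -> audio waveform."""
    return text.encode()  # placeholder "audio"

def handle_turn(user_audio: bytes, history: list[dict]) -> bytes:
    transcript = stt(user_audio)                    # audio -> text
    history.append({"role": "user", "content": transcript})
    reply = llm(transcript, history)                # text -> text
    history.append({"role": "assistant", "content": reply})
    return tts(reply)                               # text -> audio

audio_out = handle_turn(b"<pcm audio>", history=[])
```

Because conversational state lives entirely in the middle layer, either conversion layer can be swapped without touching the agent logic.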
Sequential vs Streaming Processing
Sequential (slow):
- Wait for user to finish speaking completely
- Wait for complete STT transcription
- Wait for complete LLM response generation
- Wait for complete TTS synthesis
- Play audio
Total: 300ms (STT) + 500ms (LLM) + 300ms (TTS) = 1100ms
Streaming (fast):
- STT streams partial transcription as user speaks
- LLM begins generating from partial transcript
- TTS synthesizes first tokens while LLM continues
- Audio plays immediately
Total: 500-700ms through overlap
Vapi implementation: Streams between every layer, achieving 500-700ms voice-to-voice latency.
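To make the overlap concrete, here is a minimal sketch using Python generators as stand-ins for provider streaming APIs (stream_stt, stream_llm, and stream_tts are hypothetical; real providers expose equivalents over WebSockets or server-sent events):

```python
from typing import Iterator

def stream_stt(audio_chunks: Iterator[bytes]) -> Iterator[str]:
    """Yield partial transcripts while the user is still speaking."""
    for i, _chunk in enumerate(audio_chunks):
        yield f"partial transcript {i}"  # stand-in output

def stream_llm(partials: Iterator[str]) -> Iterator[str]:
    """Yield response tokens as transcript fragments arrive."""
    for partial in partials:
        yield f"token for [{partial}]"   # stand-in output

def stream_tts(tokens: Iterator[str]) -> Iterator[bytes]:
    """Yield audio chunks per token, before the full response exists."""
    for token in tokens:
        yield token.encode()             # stand-in "audio"

def voice_to_voice(audio_chunks: Iterator[bytes]) -> Iterator[bytes]:
    # Each stage lazily consumes the previous stage's stream, so all
    # three layers run overlapped rather than back-to-back.
    return stream_tts(stream_llm(stream_stt(audio_chunks)))

for audio_out in voice_to_voice(iter([b"chunk-1", b"chunk-2"])):
    print(audio_out)  # first audio is available before input is exhausted
```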