Voice AI Architecture Guide: Modular vs End-to-End Architectures

The Sandwich architecture composes three distinct components (speech-to-text, a text-based agent, text-to-speech), balancing performance, controllability, and modern model capabilities. Speech-to-speech architectures instead use multimodal models that process audio input and generate audio output natively, with no intermediate text representation. The trade-offs include latency (roughly 500-700ms voice-to-voice for the Sandwich vs potentially sub-200ms for speech-to-speech), customization (high modularity vs limited provider choice), and cost (pay-per-component vs bundled pricing).

Vapi's strength: Plug in your preferred STT, LLM, and TTS providers and control every part of the request-response cycle through its modular Sandwich architecture.
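
As a sketch of what this modularity looks like in practice, the configuration below pairs one provider per layer. The field names follow the general shape of Vapi's assistant object but are illustrative assumptions; check Vapi's current API reference for the exact schema.

  // Illustrative assistant configuration: one provider per layer.
  // Field names and provider keys are assumptions, not verified schema.
  const assistantConfig = {
    transcriber: {                 // Layer 1: STT
      provider: "deepgram",
      model: "nova-2",
    },
    model: {                       // Layer 2: LLM
      provider: "openai",
      model: "gpt-4",
      systemPrompt: "You are a helpful voice assistant.",
    },
    voice: {                       // Layer 3: TTS
      provider: "11labs",
      voiceId: "example-voice-id", // hypothetical placeholder
    },
  };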

Sandwich Architecture Explained

Three-Layer Composition

Layer 1 - Speech-to-Text (STT):

  • Input: User's spoken audio
  • Process: Convert audio waveform to text transcription
  • Output: Transcribed text string
  • Providers: Deepgram, AssemblyAI, OpenAI Whisper, Google STT
  • Latency: 200-600ms depending on provider
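
A minimal sketch of how streaming STT typically works over a WebSocket. The endpoint URL and the { text, isFinal } message shape are hypothetical, not any specific provider's API.

  import WebSocket from "ws";

  function streamTranscription(audioChunks: AsyncIterable<Buffer>): void {
    const socket = new WebSocket("wss://stt.example.com/stream"); // hypothetical endpoint

    socket.on("open", async () => {
      // Forward raw audio frames as they are captured rather than
      // buffering the entire utterance.
      for await (const chunk of audioChunks) socket.send(chunk);
    });

    socket.on("message", (data) => {
      const result = JSON.parse(data.toString()); // assumed { text, isFinal } shape
      // Partial transcripts arrive while the user is still speaking.
      console.log(result.isFinal ? "final:" : "partial:", result.text);
    });
  }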

Layer 2 - Language Model (LLM):

  • Input: Transcribed text from STT layer
  • Process: Understand intent, maintain context, generate response
  • Output: Text response
  • Providers: OpenAI GPT-4/GPT-3.5, Anthropic Claude, Google Gemini
  • Latency: 200-800ms depending on model and prompt
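
The sketch below uses the OpenAI Node SDK's streaming mode as one concrete example: with stream: true, tokens arrive incrementally and can be forwarded to TTS as they are generated (here they are simply written to stdout).

  import OpenAI from "openai";

  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

  async function streamReply(transcript: string): Promise<void> {
    const stream = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: transcript }],
      stream: true, // deltas arrive as they are generated
    });

    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content ?? "";
      process.stdout.write(token); // forward each token downstream immediately
    }
  }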

Layer 3 - Text-to-Speech (TTS):

  • Input: LLM's text response
  • Process: Synthesize natural-sounding speech audio
  • Output: Audio waveform
  • Providers: ElevenLabs, PlayHT, OpenAI TTS, Amazon Polly
  • Latency: 150-400ms depending on voice
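
A sketch of incremental synthesis, assuming hypothetical synthesize() and play() wrappers. Buffering LLM tokens into sentence-sized chunks is a common compromise, since synthesizing very short fragments tends to hurt prosody.

  async function speakAsGenerated(
    textChunks: AsyncIterable<string>,
    synthesize: (text: string) => Promise<Buffer>, // hypothetical TTS wrapper
    play: (audio: Buffer) => void,                 // hypothetical audio sink
  ): Promise<void> {
    let pending = "";
    for await (const chunk of textChunks) {
      pending += chunk;
      const boundary = pending.search(/[.!?]\s/); // flush at sentence boundaries
      if (boundary !== -1) {
        const sentence = pending.slice(0, boundary + 1);
        pending = pending.slice(boundary + 2);
        play(await synthesize(sentence)); // first audio starts before the LLM finishes
      }
    }
    if (pending.trim()) play(await synthesize(pending)); // flush any remainder
  }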

Why "Sandwich"?

The language model is "sandwiched" between two modality conversion layers (audio→text, text→audio). This separates conversational logic from speech processing.
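
To make the separation concrete, the illustrative interfaces below capture the contracts between layers; any provider satisfying an interface can be swapped in without touching the other two layers.

  // Each layer depends only on plain text or audio at its boundary,
  // so providers are interchangeable behind these (illustrative) interfaces.
  interface SpeechToText { transcribe(audio: Buffer): Promise<string>; }
  interface Agent        { respond(text: string): Promise<string>; }
  interface TextToSpeech { synthesize(text: string): Promise<Buffer>; }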

Sequential vs Streaming Processing

Sequential (slow):

  1. Wait for user to finish speaking completely
  2. Wait for complete STT transcription
  3. Wait for complete LLM response generation
  4. Wait for complete TTS synthesis
  5. Play the audio

Total: 300ms (STT) + 500ms (LLM) + 300ms (TTS) = 1100ms
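
A sketch of this sequential composition, with hypothetical single-shot provider wrappers; each await blocks until its stage completes, so the latencies add.

  // Hypothetical single-shot wrappers (not a real SDK).
  declare function transcribe(audio: Buffer): Promise<string>;
  declare function generateReply(text: string): Promise<string>;
  declare function synthesizeSpeech(text: string): Promise<Buffer>;

  async function sequentialTurn(audioIn: Buffer): Promise<Buffer> {
    const transcript = await transcribe(audioIn);  // ~300ms: full STT pass
    const reply = await generateReply(transcript); // ~500ms: full LLM response
    return synthesizeSpeech(reply);                // ~300ms: full TTS pass
  }                                                // total ≈ 1100ms before playback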

Streaming (fast):

  1. STT streams partial transcription as user speaks
  2. LLM begins generating from partial transcript
  3. TTS synthesizes first tokens while LLM continues
  4. Audio plays immediately

Total: 500-700ms, achieved by overlapping the stages
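
The same pipeline with overlapped stages, again using hypothetical streaming wrappers; composing async iterables lets each stage start working on its first input while upstream stages are still producing.

  // Hypothetical streaming wrappers (not a real SDK).
  declare function streamTranscripts(audio: AsyncIterable<Buffer>): AsyncIterable<string>;
  declare function streamTokens(partials: AsyncIterable<string>): AsyncIterable<string>;
  declare function streamSpeech(tokens: AsyncIterable<string>): AsyncIterable<Buffer>;

  async function streamingTurn(
    mic: AsyncIterable<Buffer>,
    play: (audio: Buffer) => void,
  ): Promise<void> {
    // STT, LLM, and TTS all run concurrently; the first audio chunk can
    // play while later tokens are still being generated.
    for await (const audioOut of streamSpeech(streamTokens(streamTranscripts(mic)))) {
      play(audioOut);
    }
  }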

Vapi implementation: Streams between every layer, achieving 500-700ms voice-to-voice latency.