Voice AI Architecture Guide: Modular vs End-to-End Architectures

The Sandwich architecture composes three distinct components (speech-to-text, a text-based agent, text-to-speech), balancing performance, controllability, and modern model capabilities. Speech-to-speech architectures instead use multimodal models that process audio input and generate audio output natively, with no intermediate text representation. The trade-offs include latency (roughly 500-700ms voice-to-voice for the Sandwich vs potentially sub-200ms for speech-to-speech), customization (high modularity vs limited provider choice), and cost (pay-per-component vs bundled pricing).

Vapi's strength: Plug in your preferred STT, LLM, and TTS providers and control every part of the request-response cycle through its modular Sandwich architecture.
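
As a sketch of what this modularity looks like in practice, the configuration below pairs one provider per layer. The field names follow the general shape of Vapi's assistant object but are illustrative assumptions; check Vapi's current API reference for the exact schema.

  // Illustrative assistant configuration: one provider per layer.
  // Field names and provider keys are assumptions, not verified schema.
  const assistantConfig = {
    transcriber: {                 // Layer 1: STT
      provider: "deepgram",
      model: "nova-2",
    },
    model: {                       // Layer 2: LLM
      provider: "openai",
      model: "gpt-4",
      systemPrompt: "You are a helpful voice assistant.",
    },
    voice: {                       // Layer 3: TTS
      provider: "11labs",
      voiceId: "example-voice-id", // hypothetical placeholder
    },
  };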

Sandwich Architecture Explained

Three-Layer Composition

Layer 1 - Speech-to-Text (STT):

  • Input: User's spoken audio
  • Process: Convert audio waveform to text transcription
  • Output: Transcribed text string
  • Providers: Deepgram, AssemblyAI, OpenAI Whisper, Google STT
  • Latency: 200-600ms depending on provider
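
A minimal sketch of how streaming STT typically works over a WebSocket. The endpoint URL and the { text, isFinal } message shape are hypothetical, not any specific provider's API.

  import WebSocket from "ws";

  function streamTranscription(audioChunks: AsyncIterable<Buffer>): void {
    const socket = new WebSocket("wss://stt.example.com/stream"); // hypothetical endpoint

    socket.on("open", async () => {
      // Forward raw audio frames as they are captured rather than
      // buffering the entire utterance.
      for await (const chunk of audioChunks) socket.send(chunk);
    });

    socket.on("message", (data) => {
      const result = JSON.parse(data.toString()); // assumed { text, isFinal } shape
      // Partial transcripts arrive while the user is still speaking.
      console.log(result.isFinal ? "final:" : "partial:", result.text);
    });
  }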

Layer 2 - Language Model (LLM):

  • Input: Transcribed text from STT layer
  • Process: Understand intent, maintain context, generate response
  • Output: Text response
  • Providers: OpenAI GPT-4/GPT-3.5, Anthropic Claude, Google Gemini
  • Latency: 200-800ms depending on model and prompt
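
The sketch below uses the OpenAI Node SDK's streaming mode as one concrete example: with stream: true, tokens arrive incrementally and can be forwarded to TTS as they are generated (here they are simply written to stdout).

  import OpenAI from "openai";

  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

  async function streamReply(transcript: string): Promise<void> {
    const stream = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: transcript }],
      stream: true, // deltas arrive as they are generated
    });

    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content ?? "";
      process.stdout.write(token); // forward each token downstream immediately
    }
  }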

Layer 3 - Text-to-Speech (TTS):

  • Input: LLM's text response
  • Process: Synthesize natural-sounding speech audio
  • Output: Audio waveform
  • Providers: ElevenLabs, PlayHT, OpenAI TTS, Amazon Polly
  • Latency: 150-400ms depending on voice
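
A sketch of incremental synthesis, assuming hypothetical synthesize() and play() wrappers. Buffering LLM tokens into sentence-sized chunks is a common compromise, since synthesizing very short fragments tends to hurt prosody.

  async function speakAsGenerated(
    textChunks: AsyncIterable<string>,
    synthesize: (text: string) => Promise<Buffer>, // hypothetical TTS wrapper
    play: (audio: Buffer) => void,                 // hypothetical audio sink
  ): Promise<void> {
    let pending = "";
    for await (const chunk of textChunks) {
      pending += chunk;
      const boundary = pending.search(/[.!?]\s/); // flush at sentence boundaries
      if (boundary !== -1) {
        const sentence = pending.slice(0, boundary + 1);
        pending = pending.slice(boundary + 2);
        play(await synthesize(sentence)); // first audio starts before the LLM finishes
      }
    }
    if (pending.trim()) play(await synthesize(pending)); // flush any remainder
  }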

Why "Sandwich"?

The language model is "sandwiched" between two modality conversion layers (audio→text, text→audio). This separates conversational logic from speech processing.
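
To make the separation concrete, the illustrative interfaces below capture the contracts between layers; any provider satisfying an interface can be swapped in without touching the other two layers.

  // Each layer depends only on plain text or audio at its boundary,
  // so providers are interchangeable behind these (illustrative) interfaces.
  interface SpeechToText { transcribe(audio: Buffer): Promise<string>; }
  interface Agent        { respond(text: string): Promise<string>; }
  interface TextToSpeech { synthesize(text: string): Promise<Buffer>; }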

Sequential vs Streaming Processing

Sequential (slow):

  1. Wait for user to finish speaking completely
  2. Wait for complete STT transcription
  3. Wait for complete LLM response generation
  4. Wait for complete TTS synthesis
  5. Play the audio

Total: 300ms (STT) + 500ms (LLM) + 300ms (TTS) = 1100ms
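
A sketch of this sequential composition, with hypothetical single-shot provider wrappers; each await blocks until its stage completes, so the latencies add.

  // Hypothetical single-shot wrappers (not a real SDK).
  declare function transcribe(audio: Buffer): Promise<string>;
  declare function generateReply(text: string): Promise<string>;
  declare function synthesizeSpeech(text: string): Promise<Buffer>;

  async function sequentialTurn(audioIn: Buffer): Promise<Buffer> {
    const transcript = await transcribe(audioIn);  // ~300ms: full STT pass
    const reply = await generateReply(transcript); // ~500ms: full LLM response
    return synthesizeSpeech(reply);                // ~300ms: full TTS pass
  }                                                // total ≈ 1100ms before playback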

Streaming (fast):

  1. STT streams partial transcription as user speaks
  2. LLM begins generating from partial transcript
  3. TTS synthesizes first tokens while LLM continues
  4. Audio plays immediately

Total: 500-700ms, achieved by overlapping the stages
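
The same pipeline with overlapped stages, again using hypothetical streaming wrappers; composing async iterables lets each stage start working on its first input while upstream stages are still producing.

  // Hypothetical streaming wrappers (not a real SDK).
  declare function streamTranscripts(audio: AsyncIterable<Buffer>): AsyncIterable<string>;
  declare function streamTokens(partials: AsyncIterable<string>): AsyncIterable<string>;
  declare function streamSpeech(tokens: AsyncIterable<string>): AsyncIterable<Buffer>;

  async function streamingTurn(
    mic: AsyncIterable<Buffer>,
    play: (audio: Buffer) => void,
  ): Promise<void> {
    // STT, LLM, and TTS all run concurrently; the first audio chunk can
    // play while later tokens are still being generated.
    for await (const audioOut of streamSpeech(streamTokens(streamTranscripts(mic)))) {
      play(audioOut);
    }
  }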

Vapi implementation: Streams between every layer, achieving 500-700ms voice-to-voice latency.