Voice AI Architecture Guide: Speech-to-Speech Architecture

Direct Audio Processing

• Single model: Processes audio input and generates audio output without intermediate text
• Examples: GPT-4o with audio mode, specialized speech-to-speech models
• Advantage: Potentially lower latency by eliminating two conversion steps
• Limitation: Tightly coupled, limited provider choice, less control

How It Works

• Input: User's audio waveform
• Processing: Multimodal model trained on both audio understanding and audio generation
• Output: Response audio waveform, returned directly
• No intermediate text: Skips the STT and TTS layers entirely
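
As a minimal sketch of that flow, assuming a hypothetical client interface (the `SpeechToSpeechClient` type and its `respond` method are placeholders, not a specific provider's SDK):

```typescript
// Hypothetical speech-to-speech round trip: audio in, audio out, no text layer.
// The client interface, method name, and format field are illustrative placeholders.
import { readFile, writeFile } from "node:fs/promises";

interface SpeechToSpeechClient {
  respond(input: { audio: Buffer; format: "wav" | "pcm16" }): Promise<Buffer>;
}

async function voiceTurn(
  client: SpeechToSpeechClient,
  inputPath: string,
  outputPath: string,
): Promise<void> {
  const userAudio = await readFile(inputPath);    // user's audio waveform
  const responseAudio = await client.respond({    // single multimodal model call
    audio: userAudio,
    format: "wav",
  });                                             // no intermediate STT or TTS step
  await writeFile(outputPath, responseAudio);     // response audio waveform
}
```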

Theoretical Latency Advantage

• Sandwich: STT (200ms) + LLM (400ms) + TTS (200ms) = 800ms
• Speech-to-speech: Single model (300-400ms) = 50-60% reduction

Reality: Current speech-to-speech models don't achieve the theoretical minimum yet. GPT-4o audio mode delivers 600-800ms latency, similar to an optimized Sandwich.
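
To make the comparison concrete, here is the arithmetic from this section as a tiny script; the numbers are the illustrative budgets quoted above, not benchmarks:

```typescript
// Latency budgets from this section, in milliseconds. Illustrative, not measured.
const sandwich = { stt: 200, llm: 400, tts: 200 };
const sandwichSequential = sandwich.stt + sandwich.llm + sandwich.tts; // 800 ms if run strictly in sequence

const speechToSpeechTheoretical = { min: 300, max: 400 }; // theoretical single-model range
const speechToSpeechObserved = { min: 600, max: 800 };    // GPT-4o audio mode today

console.log(`Sandwich (sequential): ${sandwichSequential} ms`);
console.log(`Speech-to-speech (theoretical): ${speechToSpeechTheoretical.min}-${speechToSpeechTheoretical.max} ms`);
console.log(`Speech-to-speech (observed): ${speechToSpeechObserved.min}-${speechToSpeechObserved.max} ms`);
```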

Trade-Offs

• Pro: Simpler architecture, potential latency advantage, preserves audio nuance
• Con: Limited provider choice, less customization, components cannot be swapped independently


Architecture Decision Framework

Start with these questions; a code sketch of the same decision logic follows the list:

1. Do you need to optimize individual components?

  • Yes → Modular (Sandwich)
  • No → Either works

2. Do you have specific provider requirements?

  • Yes (need Whisper for accents, ElevenLabs for voice quality) → Modular
  • No (generic requirements) → Either works

3. Is cost optimization critical?

  • Yes (need to control STT/LLM/TTS spending independently) → Modular
  • No (bundled pricing acceptable) → Either works

4. Do you need sub-300ms latency?

  • Yes (wait for mature speech-to-speech) → Future planning required
  • No (500-700ms acceptable) → Modular works today

5. What's your team size and expertise?

  • Small team, limited expertise → Speech-to-speech (simpler)
  • Larger team, high expertise → Modular (more control)

6. How important is customization?

  • Very important → Modular
  • Not important → Speech-to-speech
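
A sketch of the six questions above restated as code; the requirement fields are named for illustration only:

```typescript
// The decision framework above, restated as code. Field names are illustrative.
interface Requirements {
  optimizeIndividualComponents: boolean;
  specificProviderRequirements: boolean;
  granularCostControl: boolean;
  needsSub300msLatency: boolean;
  smallTeamLimitedExpertise: boolean;
  customizationCritical: boolean;
}

type Architecture = "modular-sandwich" | "speech-to-speech" | "either";

function recommendArchitecture(r: Requirements): Architecture {
  // Sub-300ms means betting on speech-to-speech maturing (future planning required).
  if (r.needsSub300msLatency) return "speech-to-speech";
  // Any need for component-level control points to the modular Sandwich.
  if (
    r.optimizeIndividualComponents ||
    r.specificProviderRequirements ||
    r.granularCostControl ||
    r.customizationCritical
  ) {
    return "modular-sandwich";
  }
  // With generic requirements, a small team may prefer the simpler option.
  if (r.smallTeamLimitedExpertise) return "speech-to-speech";
  return "either";
}
```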

Recommendation Matrix

Requirement → Architecture
• Multi-language with provider optimization per language → Modular
• Cost optimization with granular control → Modular
• Rapid prototyping with minimal complexity → Speech-to-speech
• Current production deployment at scale → Modular
• Future sub-300ms latency requirement → Speech-to-speech (future)
• A/B testing provider combinations → Modular
• Simple consumer application → Speech-to-speech
• Enterprise with compliance requirements → Modular

FAQ: Voice AI Architecture

What is the Sandwich architecture for voice AI?

The Sandwich architecture for voice AI composes three distinct components: speech-to-text (STT) converting audio to text, a language model (LLM) processing text and generating responses, and text-to-speech (TTS) synthesizing audio output. The LLM is "sandwiched" between two modality conversion layers. This separates conversational logic from speech processing, enabling independent optimization of each component. Vapi streams between all layers, achieving 500-700ms voice-to-voice latency through overlapping processing rather than sequential execution.
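
A rough sketch of how that overlap can be expressed with async generators; the stage functions passed in are placeholders for streaming provider SDK calls, not any particular vendor's API:

```typescript
// Overlapping STT -> LLM -> TTS streaming, sketched with async iterables.
// The three stage functions are placeholders for streaming provider SDK calls.
type AudioChunk = Uint8Array;
type StreamStage<In, Out> = (input: AsyncIterable<In>) => AsyncIterable<Out>;

function voicePipeline(
  stt: StreamStage<AudioChunk, string>, // audio chunks -> partial transcripts
  llm: StreamStage<string, string>,     // partial transcripts -> response tokens
  tts: StreamStage<string, AudioChunk>, // response tokens -> synthesized audio chunks
): StreamStage<AudioChunk, AudioChunk> {
  // Each stage starts consuming as soon as the previous one emits, so the
  // stages overlap instead of running 200 + 400 + 200 ms back to back.
  return (userAudio) => tts(llm(stt(userAudio)));
}
```

Because each stage consumes partial output from the previous one, the end-to-end delay tracks the slowest stage plus buffering overhead rather than the sum of all three stages.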

What is speech-to-speech AI?

Speech-to-speech AI uses multimodal models that process audio input and generate audio output directly, without an intermediate text representation. A single model like GPT-4o in audio mode handles the entire voice pipeline, eliminating separate STT and TTS layers. The theoretical advantage is lower latency (potentially 300-400ms vs 500-700ms for Sandwich) from skipping two conversion steps. Current reality: GPT-4o achieves 600-800ms latency, similar to an optimized Sandwich architecture. Trade-offs include limited provider choice, less customization, and the inability to swap components independently.

Which voice AI architecture has lower latency?

Current production latency is comparable: the modular Sandwich architecture achieves 500-700ms through streaming (Vapi), while speech-to-speech delivers 600-800ms (GPT-4o audio mode). The theoretical minimum favors speech-to-speech at 300-400ms, but current models do not reach it in practice. The Sandwich architecture can reach 400-600ms with optimal provider combinations (Deepgram STT 200ms + GPT-3.5 300ms + PlayHT 200ms). Future speech-to-speech models may achieve sub-300ms latency, creating an advantage over modular approaches.

Can I mix different STT, LLM, and TTS providers?

Yes, the modular Sandwich architecture lets you mix STT, LLM, and TTS providers. Choose Deepgram for fast English transcription, Whisper for multilingual accuracy, GPT-4 for complex reasoning, GPT-3.5 for cost optimization, ElevenLabs for voice quality, or OpenAI TTS for efficiency. Vapi integrates 10+ STT providers, 15+ LLM models, and 8+ TTS providers, all swappable through dashboard configuration. This enables language-specific routing, cost optimization, and A/B testing of provider combinations on production traffic.
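
As an illustration of what such a provider mix might look like as configuration, here is a sketch; the object shape and field names are assumptions for discussion, not Vapi's actual assistant schema. The provider and model names follow the examples above:

```typescript
// Illustrative provider-mix configuration. This shape is a sketch for discussion,
// not Vapi's actual assistant schema.
interface VoicePipelineConfig {
  transcriber: { provider: string; model?: string; language?: string };
  model: { provider: string; model: string };
  voice: { provider: string; voiceId?: string };
}

// Fast English pipeline: latency- and cost-optimized providers at each layer.
const englishFast: VoicePipelineConfig = {
  transcriber: { provider: "deepgram", language: "en" },
  model: { provider: "openai", model: "gpt-3.5-turbo" },
  voice: { provider: "playht" },
};

// Multilingual pipeline: accuracy and voice quality prioritized over cost.
const multilingual: VoicePipelineConfig = {
  transcriber: { provider: "openai-whisper" },
  model: { provider: "openai", model: "gpt-4" },
  voice: { provider: "elevenlabs" },
};
```

Keeping each layer as an independent setting is what makes language-specific routing and A/B tests of provider combinations straightforward.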

What is hybrid voice AI architecture?

Hybrid voice AI architecture combines on-device processing for simple queries with cloud processing for complex reasoning. The on-device layer handles wake word detection, simple queries, preliminary transcription, and cached responses with sub-200ms latency, and keeps working offline. The cloud handles complex reasoning, knowledge-intensive queries, business system integration, and personalization requiring customer data. By 2026, hybrid architectures will balance latency, privacy, connectivity limitations, and cost by putting robust spatial awareness and fast decision-making on device, with the cloud used selectively.
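
A simplified sketch of the routing decision a hybrid deployment makes per turn; the `Turn` fields and heuristics are illustrative placeholders:

```typescript
// Hybrid routing sketch: simple turns stay on device, complex turns go to the cloud.
// The Turn fields and classification heuristic are illustrative placeholders.
type Route = "on-device" | "cloud";

interface Turn {
  transcript: string;        // preliminary on-device transcription
  isWakeWordOnly: boolean;   // e.g. just the wake word, no query
  hasCachedResponse: boolean;
  isOffline: boolean;
}

function routeTurn(turn: Turn): Route {
  // Wake words and cached answers take the sub-200ms on-device path.
  if (turn.isWakeWordOnly || turn.hasCachedResponse) return "on-device";
  // Without connectivity, degrade gracefully to on-device handling.
  if (turn.isOffline) return "on-device";
  // Knowledge-intensive or personalized queries need cloud reasoning and business systems.
  return "cloud";
}
```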

Should I use modular or end-to-end voice AI?

Use the modular Sandwich architecture when you need to optimize individual components, have specific provider requirements (Whisper for accents, ElevenLabs for voices), require cost optimization with granular control, need A/B testing of provider combinations, or have compliance requirements needing component-level logging. Use end-to-end speech-to-speech when you're rapid prototyping with minimal complexity, building a simple consumer application, working with a small team and limited expertise, or when a future sub-300ms latency requirement justifies betting on emerging technology. Current production deployments favor modular for flexibility and control.

What voice AI architecture does Vapi use?

Vapi uses a modular Sandwich architecture with provider flexibility, enabling developers to plug in preferred STT, LLM, and TTS providers and control every part of the request-response cycle. The platform integrates 10+ STT providers, 15+ LLM models, and 8+ TTS providers, swappable through dashboard configuration. Vapi's orchestration layer streams between components to achieve 500-700ms voice-to-voice latency, implements automatic failover during provider outages, and provides unified monitoring across all components. This architecture balances performance, controllability, and modern model capabilities.
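
As a generic illustration of provider failover (not Vapi's internal implementation), an orchestration layer might race the primary provider against a timeout and fall back to a secondary on error:

```typescript
// Generic provider-failover sketch; not Vapi's internal implementation.
// Tries the primary provider, then falls back to a secondary on error or timeout.
async function withFailover<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  timeoutMs = 2000,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("provider timeout")), timeoutMs);
  });
  try {
    return await Promise.race([primary(), timeout]);
  } catch {
    return fallback(); // e.g. a secondary TTS provider with a comparable voice
  } finally {
    clearTimeout(timer);
  }
}
```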

Will speech-to-speech replace the Sandwich architecture?

Speech-to-speech will complement rather than replace the Sandwich architecture. Both will coexist, serving different use cases. Speech-to-speech favors simplicity and potential latency advantages (sub-300ms when mature) for consumer applications. Sandwich favors customization, provider flexibility, and cost optimization for enterprise deployments. Vapi's migration path: add speech-to-speech as a provider option when it matures, enable A/B testing against Sandwich configurations, support gradual migration for use cases that benefit from end-to-end, and maintain Sandwich for use cases requiring component-level control.