Voice AI Architecture Guide: Architectural Trade-Offs Comparison
Latency
Sandwich architecture:
- Current: 500-700ms with streaming (Vapi)
- Sequential: 1000-1500ms without streaming
- Optimized: 400-600ms with fastest provider combination
Speech-to-speech:
- Current: 600-800ms (GPT-4o audio mode)
- Theoretical: 300-400ms (not yet achieved in production)
- Future: Sub-300ms as models improve
Winner: Currently a tie; the sandwich is potentially faster with an optimal provider combination
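To make the sandwich numbers concrete, time-to-first-audio can be budgeted per stage. A minimal sketch; every figure below is an illustrative assumption, not a measured provider benchmark:

```python
# Rough time-to-first-audio budget for a streaming sandwich pipeline.
# All figures are illustrative assumptions, not measured benchmarks.
SANDWICH_BUDGET_MS = {
    "stt_first_partial": 150,   # streaming STT emits partial transcripts early
    "llm_first_token": 250,     # LLM time-to-first-token
    "tts_first_audio": 150,     # TTS time-to-first-audio-chunk
    "network_overhead": 100,    # transport hops between the three providers
}

print(f"Estimated time-to-first-audio: {sum(SANDWICH_BUDGET_MS.values())} ms")
# -> 650 ms, inside the 500-700ms range above
```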
Customization
Sandwich architecture:
- Choose any STT provider (Deepgram for speed, Whisper for accents)
- Choose any LLM (GPT-4 for capability, GPT-3.5 for cost)
- Choose any TTS voice (ElevenLabs for quality, OpenAI for efficiency)
- Swap components independently based on use case
Speech-to-speech:
- Limited to providers offering end-to-end models
- Cannot mix and match components
- Less flexibility for optimization
Winner: Sandwich architecture for customization
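In code, the mix-and-match property amounts to a pipeline assembled from interchangeable components. A minimal sketch; the provider wrapper classes named in the comments (DeepgramSTT, ElevenLabsTTS, and so on) are hypothetical placeholders, not real SDK classes:

```python
from dataclasses import dataclass
from typing import Protocol

# Each layer is defined by an interface, not a vendor.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class SandwichPipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # speech -> text
        reply = self.llm.complete(text)     # text -> text
        return self.tts.synthesize(reply)   # text -> speech

# Swapping a component is a one-argument change (wrapper classes hypothetical):
#   SandwichPipeline(stt=DeepgramSTT(), llm=GPT4(), tts=ElevenLabsTTS())  # speed/quality
#   SandwichPipeline(stt=WhisperSTT(), llm=GPT35(), tts=OpenAITTS())      # accents/cost
```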
Cost
Sandwich architecture:
- Pay per component: STT + LLM + TTS
- Optimize by selecting cost-efficient providers
- Typical: $0.05-0.15 per conversation minute
- Granular control over cost/quality trade-offs
Speech-to-speech:
- Bundled pricing
- Cannot optimize individual components
- Pricing model still emerging
- May be cheaper or more expensive depending on provider
Winner: Sandwich for cost control and optimization
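Per-component pricing makes the cost model easy to express directly. The rates below are placeholder assumptions, not any provider's actual pricing; substitute real rates to compare combinations:

```python
# Illustrative per-minute cost model for the sandwich stack.
# All rates are placeholder assumptions, not real provider pricing.
def sandwich_cost_per_minute(
    stt_per_min: float = 0.010,        # assumed STT rate, $/audio-minute
    llm_per_1k_tokens: float = 0.002,  # assumed LLM rate, $/1K tokens
    tokens_per_min: int = 400,         # assumed tokens processed per minute
    tts_per_1k_chars: float = 0.10,    # assumed TTS rate, $/1K characters
    chars_per_min: int = 700,          # assumed characters spoken per minute
) -> float:
    return (
        stt_per_min
        + llm_per_1k_tokens * tokens_per_min / 1000
        + tts_per_1k_chars * chars_per_min / 1000
    )

print(f"${sandwich_cost_per_minute():.3f}/minute")
# -> $0.081/minute with these placeholder rates, inside the $0.05-0.15 range
```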
Complexity
Sandwich architecture:
- Three provider integrations
- Three components to monitor and optimize
- More failure modes (any layer can fail)
- Requires orchestration layer
Speech-to-speech:
- Single provider integration
- Single component to monitor
- Simpler failure modes
- Less orchestration needed
Winner: Speech-to-speech for simplicity
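"More failure modes" is concrete: each layer fails independently, so the orchestration layer needs per-stage handling. A minimal sketch, reusing the SandwichPipeline interface from the customization section; the exception types are hypothetical:

```python
# Per-layer failure handling in the orchestration layer.
# Exception types are hypothetical; assumes the SandwichPipeline interface above.
class STTError(Exception): ...
class LLMError(Exception): ...
class TTSError(Exception): ...

def handle_turn(pipeline, audio: bytes) -> bytes:
    try:
        text = pipeline.stt.transcribe(audio)
    except STTError:
        return pipeline.tts.synthesize("Sorry, I didn't catch that.")
    try:
        reply = pipeline.llm.complete(text)
    except LLMError:
        return pipeline.tts.synthesize("I'm having trouble answering right now.")
    try:
        return pipeline.tts.synthesize(reply)
    except TTSError:
        raise  # no audio path left; surface to the caller for retry or failover
```

A speech-to-speech stack collapses all three try/except blocks into one, which is the simplicity advantage in practice.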
Comparison Table
| Dimension | Sandwich | Speech-to-Speech | Winner |
|---|---|---|---|
| Current latency | 500-700ms | 600-800ms | Tie |
| Theoretical minimum | 400-600ms | 300-400ms | Speech-to-speech |
| Provider choice | High | Low | Sandwich |
| Customization | High | Low | Sandwich |
| Cost optimization | High | Low | Sandwich |
| Complexity | Higher | Lower | Speech-to-speech |
| Production maturity | High | Medium | Sandwich |
Hybrid Architectures: On-Device + Cloud
The 2026 Trend
By 2026, privacy, latency, connectivity, and cost constraints will push OEMs toward hybrid voice AI architectures that keep robust local awareness and fast decision-making on device, with the cloud used selectively.
Drivers:
- Privacy concerns (sensitive data doesn't leave device)
- Latency requirements (sub-200ms for simple queries)
- Connectivity limitations (works offline)
- Cost optimization (reduce cloud API costs)
Two-Tier Processing
Tier 1 - On-Device (fast):
- Wake word detection (<100ms)
- Simple queries ("What time is it?", "Set a timer")
- Preliminary transcription and intent detection
- Cached responses for common queries
- Works offline
Tier 2 - Cloud (capable):
- Complex reasoning requiring large context
- Knowledge-intensive queries
- Integration with business systems
- Personalization requiring customer data
- Requires connectivity
Routing Logic
Route to device when:
- Query matches known simple pattern
- User preference for privacy
- Network connectivity poor
- Cost optimization prioritized
Route to cloud when:
- Query requires external data
- Complexity exceeds on-device capability
- Personalization needs customer context
- High-quality response critical
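These rules translate directly into a routing function that implements the two-tier split. The QueryContext fields and the 0.3 network threshold below are illustrative assumptions; in practice the pattern match would come from an on-device intent classifier:

```python
from dataclasses import dataclass

@dataclass
class QueryContext:
    matches_simple_pattern: bool  # e.g. "what time is it", "set a timer"
    needs_external_data: bool     # knowledge or business-system lookups
    needs_personalization: bool   # requires cloud-held customer context
    user_prefers_privacy: bool
    network_quality: float        # 0.0 (offline) to 1.0 (excellent)

def route(ctx: QueryContext) -> str:
    # Cloud is mandatory when the device simply cannot answer.
    if ctx.needs_external_data or ctx.needs_personalization:
        return "cloud" if ctx.network_quality > 0.0 else "device_degraded"
    # Otherwise prefer the device for simple, private, or poorly connected cases.
    if ctx.matches_simple_pattern or ctx.user_prefers_privacy or ctx.network_quality < 0.3:
        return "device"
    return "cloud"
```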
Implementation
- Edge models: quantized, distilled versions of cloud models (1-10GB)
- Sync strategy: download updated models weekly or monthly
- Fallback: cloud handling when the device model is uncertain
- Seamless UX: the user doesn't know which tier processed the request
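The fallback rule typically keys off a confidence score from the edge model. A minimal sketch, assuming both model interfaces and the 0.8 threshold as placeholders:

```python
# Confidence-gated fallback: answer on-device when the edge model is sure,
# escalate to the cloud otherwise. The 0.8 threshold and both model
# interfaces are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.8

def answer(query: str, edge_model, cloud_client) -> str:
    response, confidence = edge_model.infer(query)  # hypothetical interface
    if confidence >= CONFIDENCE_THRESHOLD:
        return response                   # Tier 1: on-device, fast, private
    return cloud_client.complete(query)   # Tier 2: cloud, more capable
```

Because both paths return the same response type, the caller never sees which tier handled the request, which is what makes the UX seamless.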