Voice AI Architecture Guide: Architectural Trade-Offs Comparison
Latency
Sandwich architecture:
- Current: 500-700ms with streaming (Vapi)
- Sequential: 1000-1500ms without streaming
- Optimized: 400-600ms with fastest provider combination
Speech-to-speech:
- Current: 600-800ms (GPT-4o audio mode)
- Theoretical: 300-400ms (not yet achieved in production)
- Future: Sub-300ms as models improve
Winner: Currently a tie; the sandwich is potentially faster with an optimal provider combination
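To make the sandwich numbers concrete, time-to-first-audio can be budgeted per stage. A minimal sketch; every figure below is an illustrative assumption, not a measured provider benchmark:

```python
# Rough time-to-first-audio budget for a streaming sandwich pipeline.
# All figures are illustrative assumptions, not measured benchmarks.
SANDWICH_BUDGET_MS = {
    "stt_first_partial": 150,   # streaming STT emits partial transcripts early
    "llm_first_token": 250,     # LLM time-to-first-token
    "tts_first_audio": 150,     # TTS time-to-first-audio-chunk
    "network_overhead": 100,    # transport hops between the three providers
}

print(f"Estimated time-to-first-audio: {sum(SANDWICH_BUDGET_MS.values())} ms")
# -> 650 ms, inside the 500-700ms range above
```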
Customization
Sandwich architecture:
- Choose any STT provider (Deepgram for speed, Whisper for accents)
- Choose any LLM (GPT-4 for capability, GPT-3.5 for cost)
- Choose any TTS voice (ElevenLabs for quality, OpenAI for efficiency)
- Swap components independently based on use case
Speech-to-speech:
- Limited to providers offering end-to-end models
- Cannot mix and match components
- Less flexibility for optimization
Winner: Sandwich architecture for customization
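In code, the mix-and-match property amounts to a pipeline assembled from interchangeable components. A minimal sketch; the provider wrapper classes named in the comments (DeepgramSTT, ElevenLabsTTS, and so on) are hypothetical placeholders, not real SDK classes:

```python
from dataclasses import dataclass
from typing import Protocol

# Each layer is defined by an interface, not a vendor.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class SandwichPipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # speech -> text
        reply = self.llm.complete(text)     # text -> text
        return self.tts.synthesize(reply)   # text -> speech

# Swapping a component is a one-argument change (wrapper classes hypothetical):
#   SandwichPipeline(stt=DeepgramSTT(), llm=GPT4(), tts=ElevenLabsTTS())  # speed/quality
#   SandwichPipeline(stt=WhisperSTT(), llm=GPT35(), tts=OpenAITTS())      # accents/cost
```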
Cost
Sandwich architecture:
- Pay per component: STT + LLM + TTS
- Optimize by selecting cost-efficient providers
- Typical: $0.05-0.15 per conversation minute
- Granular control over cost/quality trade-offs
Speech-to-speech:
- Bundled pricing
- Cannot optimize individual components
- Pricing model still emerging
- May be cheaper or more expensive depending on provider
Winner: Sandwich for cost control and optimization
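Per-component pricing makes the cost model easy to express directly. The rates below are placeholder assumptions, not any provider's actual pricing; substitute real rates to compare combinations:

```python
# Illustrative per-minute cost model for the sandwich stack.
# All rates are placeholder assumptions, not real provider pricing.
def sandwich_cost_per_minute(
    stt_per_min: float = 0.010,        # assumed STT rate, $/audio-minute
    llm_per_1k_tokens: float = 0.002,  # assumed LLM rate, $/1K tokens
    tokens_per_min: int = 400,         # assumed tokens processed per minute
    tts_per_1k_chars: float = 0.10,    # assumed TTS rate, $/1K characters
    chars_per_min: int = 700,          # assumed characters spoken per minute
) -> float:
    return (
        stt_per_min
        + llm_per_1k_tokens * tokens_per_min / 1000
        + tts_per_1k_chars * chars_per_min / 1000
    )

print(f"${sandwich_cost_per_minute():.3f}/minute")
# -> $0.081/minute with these placeholder rates, inside the $0.05-0.15 range
```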
Complexity
Sandwich architecture:
- Three provider integrations
- Three components to monitor and optimize
- More failure modes (any layer can fail)
- Requires orchestration layer
Speech-to-speech:
- Single provider integration
- Single component to monitor
- Simpler failure modes
- Less orchestration needed
Winner: Speech-to-speech for simplicity
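"More failure modes" is concrete: each layer fails independently, so the orchestration layer needs per-stage handling. A minimal sketch, reusing the SandwichPipeline interface from the customization section; the exception types are hypothetical:

```python
# Per-layer failure handling in the orchestration layer.
# Exception types are hypothetical; assumes the SandwichPipeline interface above.
class STTError(Exception): ...
class LLMError(Exception): ...
class TTSError(Exception): ...

def handle_turn(pipeline, audio: bytes) -> bytes:
    try:
        text = pipeline.stt.transcribe(audio)
    except STTError:
        return pipeline.tts.synthesize("Sorry, I didn't catch that.")
    try:
        reply = pipeline.llm.complete(text)
    except LLMError:
        return pipeline.tts.synthesize("I'm having trouble answering right now.")
    try:
        return pipeline.tts.synthesize(reply)
    except TTSError:
        raise  # no audio path left; surface to the caller for retry or failover
```

A speech-to-speech stack collapses all three try/except blocks into one, which is the simplicity advantage in practice.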
Comparison Table
| Dimension | Sandwich | Speech-to-Speech | Winner |
|---|---|---|---|
| Current latency | 500-700ms | 600-800ms | Tie |
| Theoretical minimum | 400-600ms | 300-400ms | Speech-to-speech |
| Provider choice | High | Low | Sandwich |
| Customization | High | Low | Sandwich |
| Cost optimization | High | Low | Sandwich |
| Complexity | Higher | Lower | Speech-to-speech |
| Production maturity | High | Medium | Sandwich |
Hybrid Architectures: On-Device + Cloud
The 2026 Trend
By 2026, privacy, latency, connectivity, and cost constraints will push OEMs toward hybrid voice AI architectures that keep robust local awareness and fast decision-making on device, with the cloud used selectively.
Drivers:
- Privacy concerns (sensitive data doesn't leave device)
- Latency requirements (sub-200ms for simple queries)
- Connectivity limitations (works offline)
- Cost optimization (reduce cloud API costs)
Two-Tier Processing
Tier 1 - On-Device (fast):
- Wake word detection (<100ms)
- Simple queries ("What time is it?", "Set a timer")
- Preliminary transcription and intent detection
- Cached responses for common queries
- Works offline
Tier 2 - Cloud (capable):
- Complex reasoning requiring large context
- Knowledge-intensive queries
- Integration with business systems
- Personalization requiring customer data
- Requires connectivity
Routing Logic
Route to device when:
- Query matches known simple pattern
- User preference for privacy
- Network connectivity poor
- Cost optimization prioritized
Route to cloud when:
- Query requires external data
- Complexity exceeds on-device capability
- Personalization needs customer context
- High-quality response critical
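These rules translate directly into a routing function that implements the two-tier split. The QueryContext fields and the 0.3 network threshold below are illustrative assumptions; in practice the pattern match would come from an on-device intent classifier:

```python
from dataclasses import dataclass

@dataclass
class QueryContext:
    matches_simple_pattern: bool  # e.g. "what time is it", "set a timer"
    needs_external_data: bool     # knowledge or business-system lookups
    needs_personalization: bool   # requires cloud-held customer context
    user_prefers_privacy: bool
    network_quality: float        # 0.0 (offline) to 1.0 (excellent)

def route(ctx: QueryContext) -> str:
    # Cloud is mandatory when the device simply cannot answer.
    if ctx.needs_external_data or ctx.needs_personalization:
        return "cloud" if ctx.network_quality > 0.0 else "device_degraded"
    # Otherwise prefer the device for simple, private, or poorly connected cases.
    if ctx.matches_simple_pattern or ctx.user_prefers_privacy or ctx.network_quality < 0.3:
        return "device"
    return "cloud"
```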
Implementation
- Edge models: quantized, distilled versions of cloud models (1-10GB)
- Sync strategy: download updated models weekly or monthly
- Fallback: cloud handling when the device model is uncertain
- Seamless UX: the user doesn't know which tier processed the request
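The fallback rule typically keys off a confidence score from the edge model. A minimal sketch, assuming both model interfaces and the 0.8 threshold as placeholders:

```python
# Confidence-gated fallback: answer on-device when the edge model is sure,
# escalate to the cloud otherwise. The 0.8 threshold and both model
# interfaces are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.8

def answer(query: str, edge_model, cloud_client) -> str:
    response, confidence = edge_model.infer(query)  # hypothetical interface
    if confidence >= CONFIDENCE_THRESHOLD:
        return response                   # Tier 1: on-device, fast, private
    return cloud_client.complete(query)   # Tier 2: cloud, more capable
```

Because both paths return the same response type, the caller never sees which tier handled the request, which is what makes the UX seamless.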