Voice AI Architecture Guide: Architectural Trade-Offs Comparison

Latency

Sandwich architecture:

  • Current: 500-700ms with streaming (Vapi)
  • Sequential: 1000-1500ms without streaming
  • Optimized: 400-600ms with fastest provider combination

Speech-to-speech:

  • Current: 600-800ms (GPT-4o audio mode)
  • Theoretical: 300-400ms (not yet achieved in production)
  • Future: Sub-300ms as models improve

Winner: Currently tied; sandwich is potentially faster with an optimal provider combination
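The sandwich numbers above can be sketched as a simple latency budget. With streaming, stages overlap, so perceived latency is roughly the sum of each stage's time-to-first-output rather than full processing time. The per-stage figures below are illustrative assumptions, not vendor benchmarks.

```python
# Rough end-to-end latency budget for a streaming sandwich pipeline.
# All numbers are illustrative assumptions, not measured benchmarks.
SANDWICH_STAGES_MS = {
    "stt_first_partial": 150,   # streaming STT emits partial transcripts early
    "llm_first_token": 250,     # time to first token, not the full response
    "tts_first_audio": 150,     # time to first synthesized audio chunk
    "network_overhead": 100,    # round trips between the three components
}

def streaming_latency_ms(stages: dict) -> int:
    """Perceived latency: sum of time-to-first-output across stages."""
    return sum(stages.values())

print(streaming_latency_ms(SANDWICH_STAGES_MS))  # 650 - inside the 500-700ms band
```

Removing streaming means each stage must finish before the next starts, which is why the sequential figure roughly doubles.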

Customization

Sandwich architecture:

  • Choose any STT provider (Deepgram for speed, Whisper for accents)
  • Choose any LLM (GPT-4 for capability, GPT-3.5 for cost)
  • Choose any TTS voice (ElevenLabs for quality, OpenAI for efficiency)
  • Swap components independently based on use case

Speech-to-speech:

  • Limited to providers offering end-to-end models
  • Cannot mix and match components
  • Less flexibility for optimization

Winner: Sandwich architecture for customization
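The component-swapping advantage comes from programming against interfaces rather than concrete providers. A minimal sketch of that pattern, using hypothetical stand-in interfaces (not any real SDK's API):

```python
# Sketch of swappable STT/LLM/TTS interfaces for a sandwich pipeline.
# These Protocols are hypothetical stand-ins, not real provider SDKs.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Orchestrates the three layers; any conforming provider plugs in."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)    # speech -> text
        reply = self.llm.complete(text)      # text -> text
        return self.tts.synthesize(reply)    # text -> speech
```

Swapping Deepgram for Whisper, or ElevenLabs for OpenAI TTS, then only means passing a different adapter to the constructor; the rest of the pipeline is untouched.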

Cost

Sandwich architecture:

  • Pay per component: STT + LLM + TTS
  • Optimize by selecting cost-efficient providers
  • Typical: $0.05-0.15 per conversation minute
  • Granular control over cost/quality trade-offs

Speech-to-speech:

  • Bundled pricing
  • Cannot optimize individual components
  • Pricing model still emerging
  • May be cheaper or more expensive depending on provider

Winner: Sandwich for cost control and optimization
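Per-component pricing makes the cost per conversation minute a straightforward sum. The unit prices and usage rates below are hypothetical, chosen only to land inside the quoted $0.05-0.15 band; check each provider's current pricing.

```python
# Illustrative per-minute cost model for the sandwich stack.
# All unit prices and usage rates below are hypothetical assumptions.
def sandwich_cost_per_minute(stt_per_min: float,
                             llm_per_1k_tokens: float,
                             tokens_per_min: int,
                             tts_per_1k_chars: float,
                             chars_per_min: int) -> float:
    stt = stt_per_min
    llm = llm_per_1k_tokens * tokens_per_min / 1000
    tts = tts_per_1k_chars * chars_per_min / 1000
    return round(stt + llm + tts, 4)

# Hypothetical: $0.01/min STT, $0.01/1K LLM tokens at 500 tokens/min,
# $0.10/1K TTS characters at 700 characters/min
cost = sandwich_cost_per_minute(0.01, 0.01, 500, 0.10, 700)
print(cost)  # 0.085 - within the typical $0.05-0.15 range
```

Note that in this (hypothetical) breakdown TTS dominates, which is why swapping only the voice provider is often the biggest single cost lever.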

Complexity

Sandwich architecture:

  • Three provider integrations
  • Three components to monitor and optimize
  • More failure modes (any layer can fail)
  • Requires orchestration layer

Speech-to-speech:

  • Single provider integration
  • Single component to monitor
  • Simpler failure modes
  • Less orchestration needed

Winner: Speech-to-speech for simplicity
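The "more failure modes" point is concrete: each of the three layers can fail independently, so the orchestration layer typically wraps every stage with error handling and a fallback provider. A minimal sketch of that pattern (provider callables are hypothetical):

```python
# Minimal sketch of why the sandwich needs an orchestration layer:
# each stage can fail independently and may need a fallback provider.
# The primary/fallback callables here are hypothetical stand-ins.
def run_stage(primary, fallback, payload):
    """Try the primary provider for a stage; fall back once on failure."""
    try:
        return primary(payload)
    except Exception:
        return fallback(payload)
```

A speech-to-speech system has one integration and one failure domain, which is exactly the simplicity advantage claimed above; the price is that when the single provider degrades, there is no per-stage fallback to reach for.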

Comparison Table

Dimension             Sandwich    Speech-to-Speech   Winner
Current latency       500-700ms   600-800ms          Tie
Theoretical minimum   400-600ms   300-400ms          Speech-to-speech
Provider choice       High        Low                Sandwich
Customization         High        Low                Sandwich
Cost optimization     High        Low                Sandwich
Complexity            Higher      Lower              Speech-to-speech
Production maturity   High        Medium             Sandwich

Hybrid Architectures: On-Device + Cloud

The 2026 Trend

By 2026, privacy, latency, connectivity, and cost constraints are expected to push OEMs toward hybrid voice AI architectures that keep robust speech understanding and fast decision-making on device, with the cloud used selectively for heavier workloads.

Drivers:

  • Privacy concerns (sensitive data doesn't leave device)
  • Latency requirements (sub-200ms for simple queries)
  • Connectivity limitations (works offline)
  • Cost optimization (reduce cloud API costs)

Two-Tier Processing

Tier 1 - On-Device (fast):

  • Wake word detection (<100ms)
  • Simple queries ("What time is it?", "Set a timer")
  • Preliminary transcription and intent detection
  • Cached responses for common queries
  • Works offline

Tier 2 - Cloud (capable):

  • Complex reasoning requiring large context
  • Knowledge-intensive queries
  • Integration with business systems
  • Personalization requiring customer data
  • Requires connectivity

Routing Logic

Route to device when:

  • Query matches known simple pattern
  • User preference for privacy
  • Network connectivity poor
  • Cost optimization prioritized

Route to cloud when:

  • Query requires external data
  • Complexity exceeds on-device capability
  • Personalization needs customer context
  • High-quality response critical
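The routing criteria above can be sketched as a single decision function. The patterns, flags, and default-to-cloud behavior are illustrative assumptions, not a production routing policy.

```python
# Hedged sketch of the two-tier routing decision described above.
# The simple-query patterns and routing flags are illustrative assumptions.
SIMPLE_PATTERNS = ("what time", "set a timer", "volume")

def route(query: str, *, online: bool, privacy_mode: bool,
          needs_external_data: bool) -> str:
    if not online:
        return "device"      # poor/no connectivity: device is the only option
    if privacy_mode:
        return "device"      # user preference: sensitive data stays local
    if needs_external_data:
        return "cloud"       # query requires external or customer data
    if any(p in query.lower() for p in SIMPLE_PATTERNS):
        return "device"      # known simple pattern, handle locally
    return "cloud"           # default to the more capable tier
```

Ordering matters here: connectivity and privacy are hard constraints, so they are checked before capability and cost considerations.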

Implementation

  • Edge models: quantized, distilled versions of cloud models (1-10GB)
  • Sync strategy: download updated models weekly or monthly
  • Fallback: cloud handling when the device model is uncertain
  • Seamless UX: the user doesn't know which tier processed the request
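The fallback rule can be sketched as a confidence gate: accept the on-device answer only when the edge model is confident, otherwise escalate to the cloud. The threshold and model callables are assumptions to tune per deployment.

```python
# Sketch of confidence-gated fallback from edge model to cloud.
# The threshold value and model interfaces are illustrative assumptions:
# device_model returns (answer, confidence); cloud_model returns an answer.
CONFIDENCE_THRESHOLD = 0.8

def answer(query, device_model, cloud_model):
    text, confidence = device_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text              # confident edge answer, no network needed
    return cloud_model(query)    # escalate; the user never sees which tier ran
```

Because both branches return a plain answer, the caller (and the user) cannot tell which tier handled the request, which is the "seamless UX" property above.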