Building Production-Ready Voice Agents
Production vs Prototype Requirements
Production voice AI deployments require comprehensive testing, real-time monitoring, and scalable infrastructure well beyond what a prototype needs. Design test suites of simulated voice agents to surface hallucination risks before launch; continuously monitor latency (sub-500ms target), accuracy, and cost per minute; and architect for 99.9% uptime with the ability to scale to millions of calls. Vapi also supports A/B testing, letting you experiment with different prompts, voices, and workflows to optimize agent performance.
Infrastructure standard: Sub-500ms latency, 99.9% uptime, and millions of concurrent calls supported through a managed platform.
Testing Strategies Before Production
Simulated Conversation Testing
- Purpose: Validate agent behavior across expected conversation paths without exposing the agent to real users
- Method: Generate hundreds of automated test conversations covering common scenarios and edge cases
- Tool: Vapi supports designing test suites with simulated voice agents
- Coverage: Expected happy paths, error conditions, edge cases, adversarial inputs
Example test scenarios (see the test-suite sketch after this list):
- Standard appointment booking (happy path)
- User provides invalid date/time (error handling)
- User interrupts mid-sentence (interruption handling)
- User asks off-topic question (boundary testing)
- User provides ambiguous information (clarification flow)
- Multiple corrections in single conversation (context persistence)
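The scenarios above can be encoded as plain data so the same suite runs before every deploy. A minimal sketch, assuming a caller-supplied `run_conversation` hook (hypothetical, not a Vapi API) that drives the agent with scripted user turns and returns a transcript plus an outcome label:

```python
from dataclasses import dataclass, field

@dataclass
class TestScenario:
    name: str
    user_turns: list[str]              # scripted user utterances fed to the simulated caller
    expected_outcome: str              # e.g. "appointment_booked", "clarification_requested"
    must_not_contain: list[str] = field(default_factory=list)  # phrases that signal a failure

SCENARIOS = [
    TestScenario(
        name="happy_path_booking",
        user_turns=["I'd like to book a cleaning for next Tuesday at 2pm."],
        expected_outcome="appointment_booked",
    ),
    TestScenario(
        name="invalid_datetime",
        user_turns=["Book me for February 30th."],
        expected_outcome="clarification_requested",
        must_not_contain=["booked for February 30"],  # agent must never confirm an impossible date
    ),
    TestScenario(
        name="off_topic_question",
        user_turns=["What do you think about the election?"],
        expected_outcome="redirected_to_scope",
    ),
]

def run_suite(run_conversation, scenarios=SCENARIOS):
    """Drive each scenario against the agent and collect pass/fail results.

    run_conversation is a hypothetical hook: it takes the scripted user turns
    and returns (transcript, outcome_label) for one simulated conversation.
    """
    results = {}
    for s in scenarios:
        transcript, outcome = run_conversation(s.user_turns)
        passed = outcome == s.expected_outcome and not any(
            phrase.lower() in transcript.lower() for phrase in s.must_not_contain
        )
        results[s.name] = passed
    return results
```

Keeping scenarios as data rather than code makes it easy to grow the suite into hundreds of conversations without touching the runner.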
Edge Case Identification
- Silent periods: User doesn't respond for 10+ seconds
- Loud background noise: Transcription confidence drops below threshold
- Rapid speaker interruption: User cuts agent off repeatedly
- Nonsense inputs: Gibberish or deliberately confusing speech
- Boundary violations: Attempts to make agent do unauthorized actions
- PII exposure: User shares sensitive data that should be redacted
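The PII case in particular lends itself to an automated post-hoc check: scan stored transcripts for sensitive data that should have been redacted. A rough sketch using illustrative regexes only; a production check would use a proper PII detector:

```python
import re

# Illustrative patterns only; real redaction checks need far more robust detection.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_unredacted_pii(transcript: str) -> list[str]:
    """Return the names of PII patterns that still appear in a stored transcript."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(transcript)]

# Example: flag any stored transcript that should have been redacted before storage.
assert find_unredacted_pii("My card is 4111 1111 1111 1111") == ["credit_card"]
```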
Regression Testing
- Purpose: Ensure prompt changes don't break existing functionality
- Process: Maintain test suite of conversations that passed previously
- Frequency: Run before deploying any prompt or configuration change
- Failure threshold: >5% accuracy degradation requires investigation
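A sketch of the failure-threshold gate, assuming per-scenario pass/fail results like those produced by the suite sketch earlier; the 5% figure mirrors the threshold above:

```python
def check_regression(baseline: dict[str, bool], current: dict[str, bool],
                     max_degradation: float = 0.05) -> bool:
    """Return True if the current run is acceptable relative to the baseline.

    Accuracy is the fraction of scenarios that pass; a drop of more than
    max_degradation (5% by default) should block the deploy and trigger review.
    """
    baseline_accuracy = sum(baseline.values()) / len(baseline)
    current_accuracy = sum(current.values()) / len(current)
    return (baseline_accuracy - current_accuracy) <= max_degradation

# Example: 2 of 20 previously passing scenarios now failing is a 10% drop, so the deploy is blocked.
```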
Hallucination Risk Assessment
- Method: Run conversations specifically probing for hallucinations
- Test areas: Product information, pricing, policies, availability, timelines
- Validation: Compare agent responses to ground truth from business systems
- Mitigation: Identify topics requiring structured responses vs LLM generation
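One way to operationalize the ground-truth comparison is to probe the agent with factual questions and check each answer against the system of record. A rough sketch, where `ask_agent` and `lookup_ground_truth` are hypothetical hooks standing in for integrations not specified here:

```python
PROBE_QUESTIONS = [
    ("pricing", "How much does the basic plan cost per month?"),
    ("availability", "Do you have appointments available this Saturday?"),
    ("policies", "What is your cancellation policy?"),
]

def assess_hallucination_risk(ask_agent, lookup_ground_truth, probes=PROBE_QUESTIONS):
    """Flag probes where the agent's answer doesn't reflect the system of record.

    ask_agent and lookup_ground_truth are hypothetical, caller-supplied hooks:
    one sends a single-turn question to the agent, the other queries the business system.
    """
    mismatches = []
    for topic, question in probes:
        answer = ask_agent(question)
        truth = lookup_ground_truth(topic, question)
        if truth.lower() not in answer.lower():  # crude containment check, for illustration only
            mismatches.append({"topic": topic, "question": question,
                               "answer": answer, "expected": truth})
    return mismatches  # topics that recur here are candidates for structured responses
```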
Key Metrics to Monitor in Production
Latency Metrics
Voice-to-voice latency: Total time from user silence to agent speech
- P50 (median): Target <500ms
- P95: Target <700ms
- P99: Target <1000ms (outliers acceptable but monitored)
Component latency breakdown:
- STT latency: 200-400ms depending on provider
- LLM latency: 200-600ms depending on model and prompt
- TTS latency: 150-300ms depending on voice
Why percentiles matter: Averages hide outliers. A 500ms average can coexist with a 2000ms P99, which means 1% of users get a terrible experience.
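A sketch of computing these percentiles from raw voice-to-voice latency samples (in milliseconds), using only the Python standard library:

```python
import statistics

def latency_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarize voice-to-voice latency samples against the percentile targets above."""
    # quantiles(n=100) returns the 1st-99th percentiles; index 49 is P50, 94 is P95, 98 is P99.
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98], "mean": statistics.fmean(samples_ms)}

report = latency_report([420, 450, 480, 510, 530, 560, 610, 700, 950, 2000])
# The median lands near target while P99 approaches 2000ms, a tail the mean alone would hide.
```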
Accuracy Metrics
Intent accuracy: Percentage of conversations where agent understands user goal
- Target: >90% intent accuracy
- Measurement: Sample conversations, human review of transcription and response
Task completion rate: Percentage of conversations achieving stated goal
- Target: >70% completion without human transfer
- Measurement: Conversation outcome tracking (appointment booked, question answered, etc.)
Transcription accuracy (WER): Word error rate in STT output
- Target: <10% WER for standard accents
- Measurement: Human review of transcribed vs actual speech
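Word error rate is the word-level edit distance between the reference (what was actually said) and the hypothesis (the STT output), divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance, computed over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substituted word in a ten-word utterance gives 10% WER, right at the target.
```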
Response relevance: Whether agent responses address user query
- Target: >95% relevant responses
- Measurement: Human evaluation of conversation quality
Cost Metrics
Cost per conversation minute: Combined STT + LLM + TTS + infrastructure costs
- Typical range: $0.05-0.15 per minute
- Monitoring: Track by provider configuration to identify optimization opportunities
Cost per completed conversation: Total cost divided by the number of successfully completed conversations
- More meaningful than per-minute cost
- Lower completion rates increase effective cost per resolution
Token usage: LLM input/output tokens consumed per conversation
- Correlates with prompt length and conversation duration
- Optimize by reducing system prompt verbosity
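A sketch of how these three cost views relate. The unit prices below are illustrative assumptions, not quoted rates; real numbers depend on the STT, LLM, and TTS providers chosen:

```python
from dataclasses import dataclass

@dataclass
class ConversationUsage:
    minutes: float
    input_tokens: int
    output_tokens: int
    completed: bool

# Illustrative unit prices only (assumptions, not provider quotes).
STT_TTS_INFRA_PER_MINUTE = 0.06
LLM_PRICE_PER_1K_INPUT = 0.0005
LLM_PRICE_PER_1K_OUTPUT = 0.0015

def conversation_cost(u: ConversationUsage) -> float:
    return (u.minutes * STT_TTS_INFRA_PER_MINUTE
            + u.input_tokens / 1000 * LLM_PRICE_PER_1K_INPUT
            + u.output_tokens / 1000 * LLM_PRICE_PER_1K_OUTPUT)

def cost_report(usages: list[ConversationUsage]) -> dict[str, float]:
    total = sum(conversation_cost(u) for u in usages)
    total_minutes = sum(u.minutes for u in usages)
    completions = sum(u.completed for u in usages)
    return {
        "cost_per_minute": total / total_minutes,
        # Usually the number worth optimizing: low completion rates inflate it quickly.
        "cost_per_completed_conversation": total / completions if completions else float("inf"),
    }
```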
User Satisfaction Metrics
Conversation completion rate: Percentage of callers who complete the conversation rather than hanging up
- Target: >80% completion for transactional use cases
- Low completion indicates friction or poor experience
Transfer to human rate: Percentage escalating to human agent
- Target: <30% transfer rate for routine scenarios
- High transfer suggests agent capability gaps
User sentiment: Positive, neutral, negative classification
- Measurement: Analyze conversation tone and explicit feedback
- Target: >70% positive or neutral sentiment
Post-conversation surveys: Optional feedback collection
- "How would you rate this conversation?" (1-5 stars)
- "Did the agent resolve your issue?" (yes/no)