Building Production-Ready Voice Agents

Production vs Prototype Requirements

Production voice AI deployments require comprehensive testing, real-time monitoring, and scalable infrastructure beyond what a prototype needs. Design test suites of simulated voice agents to identify hallucination risks before launch; continuously monitor latency (sub-500ms target), accuracy, and cost per minute; and architect for 99.9% uptime with the ability to scale to millions of calls. Vapi also supports A/B testing, so you can experiment with different prompts, voices, and workflows to optimize agent performance.

Infrastructure standard: Sub-500ms latency, 99.9% uptime, and millions of concurrent calls supported through a managed platform.

Testing Strategies Before Production

Simulated Conversation Testing

Purpose: Validate agent behavior across expected conversation paths without exposing the agent to real users
Method: Generate hundreds of automated test conversations covering common scenarios and edge cases
Tool: Vapi enables design of test suites with simulated voice agents
Coverage: Expected happy paths, error conditions, edge cases, adversarial inputs

Example test scenarios:

  • Standard appointment booking (happy path)
  • User provides invalid date/time (error handling)
  • User interrupts mid-sentence (interruption handling)
  • User asks off-topic question (boundary testing)
  • User provides ambiguous information (clarification flow)
  • Multiple corrections in single conversation (context persistence)
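
One way to organize such a suite is as a table of scenario definitions that are replayed against the agent and checked for expected phrases and outcomes. The sketch below is illustrative only: `run_simulated_call` is a hypothetical stand-in for whatever simulation interface you use (for example, Vapi's test suites or a custom harness), and the scenario fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    user_turns: list[str]        # scripted caller utterances
    expect_in_reply: list[str]   # phrases the agent reply should contain
    expect_outcome: str          # e.g. "appointment_booked", "clarification_asked"

SCENARIOS = [
    Scenario("happy_path_booking",
             ["I'd like to book an appointment for Tuesday at 3pm"],
             ["Tuesday", "3"], "appointment_booked"),
    Scenario("invalid_datetime",
             ["Book me for February 30th"],
             ["valid date"], "clarification_asked"),
    Scenario("off_topic",
             ["What do you think about the election?"],
             ["help you with"], "redirected"),
]

def run_simulated_call(scenario: Scenario) -> dict:
    """Hypothetical stand-in: drive one simulated conversation via your
    simulation backend and return its agent turns and outcome."""
    raise NotImplementedError("wire this to your simulation backend")

def run_suite() -> None:
    failures = []
    for s in SCENARIOS:
        result = run_simulated_call(s)
        reply = " ".join(result["agent_turns"]).lower()
        if result["outcome"] != s.expect_outcome or not all(
            phrase.lower() in reply for phrase in s.expect_in_reply
        ):
            failures.append(s.name)
    print(f"{len(SCENARIOS) - len(failures)}/{len(SCENARIOS)} scenarios passed")
    if failures:
        print("Failed:", ", ".join(failures))
```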

Edge Case Identification

Silent periods: User doesn't respond for 10+ seconds
Loud background noise: Transcription confidence drops below threshold
Rapid speaker interruption: User cuts the agent off repeatedly
Nonsense inputs: Gibberish or deliberately confusing speech
Boundary violations: Attempts to make the agent perform unauthorized actions
PII exposure: User shares sensitive data that should be redacted
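
As a minimal sketch of how two of these cases might be handled, the snippet below applies a silence timeout and a transcription-confidence floor to a single turn. The threshold values and the `reprompt`/`ask_repeat` action names are assumptions, not platform APIs.

```python
SILENCE_TIMEOUT_S = 10.0    # assumed threshold: no user speech for 10+ seconds
MIN_STT_CONFIDENCE = 0.6    # assumed threshold: below this, don't trust the transcript

def handle_turn(transcript: str | None, stt_confidence: float, silence_seconds: float) -> str:
    """Decide how to react to one user turn given edge-case signals."""
    if silence_seconds >= SILENCE_TIMEOUT_S:
        return "reprompt"       # "Are you still there?"
    if transcript is None or stt_confidence < MIN_STT_CONFIDENCE:
        return "ask_repeat"     # noisy audio: ask the caller to repeat
    return "proceed"

# Example: loud background noise drops confidence below the threshold.
assert handle_turn("uh the thing on the", 0.41, 2.0) == "ask_repeat"
assert handle_turn(None, 0.0, 12.5) == "reprompt"
```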

Regression Testing

Purpose: Ensure prompt changes don't break existing functionality
Process: Maintain a test suite of conversations that passed previously
Frequency: Run before deploying any prompt or configuration change
Failure threshold: >5% accuracy degradation requires investigation
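
A minimal sketch of that deployment gate, assuming the suite has already been run and the baseline accuracy is stored in a JSON file (both the file layout and the path are assumptions):

```python
import json

DEGRADATION_THRESHOLD = 0.05   # the >5% failure threshold above

def check_regression(baseline_path: str, current_accuracy: float) -> bool:
    """Compare current suite accuracy against the stored baseline; return False to block deploy."""
    with open(baseline_path) as f:
        baseline_accuracy = json.load(f)["accuracy"]   # assumed file format
    degradation = baseline_accuracy - current_accuracy
    if degradation > DEGRADATION_THRESHOLD:
        print(f"FAIL: accuracy dropped {degradation:.1%} "
              f"(baseline {baseline_accuracy:.1%}, current {current_accuracy:.1%})")
        return False
    print(f"OK: accuracy {current_accuracy:.1%} vs baseline {baseline_accuracy:.1%}")
    return True
```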

Hallucination Risk Assessment

Method: Run conversations specifically probing for hallucinations
Test areas: Product information, pricing, policies, availability, timelines
Validation: Compare agent responses to ground truth from business systems
Mitigation: Identify topics requiring structured responses vs LLM generation
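
One way to automate the ground-truth comparison is a deliberately strict string check against facts pulled from business systems. This is a sketch under assumptions: `ask_agent` is a hypothetical callable that runs one probe question through the agent, and the ground-truth values would come from your systems rather than being hard-coded.

```python
GROUND_TRUTH = {   # assumed: sourced from business systems, hard-coded here for illustration
    "price_basic_plan": "$29/month",
    "return_window": "30 days",
}

HALLUCINATION_PROBES = [
    ("How much is the basic plan?", "price_basic_plan"),
    ("How long do I have to return an item?", "return_window"),
]

def score_hallucinations(ask_agent) -> float:
    """ask_agent(question) -> agent's spoken answer. Returns the fraction of answers
    containing the ground-truth fact verbatim (a deliberately strict check)."""
    hits = 0
    for question, key in HALLUCINATION_PROBES:
        answer = ask_agent(question)
        if GROUND_TRUTH[key].lower() in answer.lower():
            hits += 1
    return hits / len(HALLUCINATION_PROBES)
```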

Key Metrics to Monitor in Production

Latency Metrics

Voice-to-voice latency: Total time from user silence to agent speech

  • P50 (median): Target <500ms
  • P95: Target <700ms
  • P99: Target <1000ms (outliers acceptable but monitored)

Component latency breakdown:

  • STT latency: 200-400ms depending on provider
  • LLM latency: 200-600ms depending on model and prompt
  • TTS latency: 150-300ms depending on voice

Why percentiles matter: Average latency hides outliers. A 500ms average can coexist with a 2000ms P99, meaning 1% of users get an unacceptably slow experience.
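
A quick sketch of the percentile computation, using simulated latencies in place of real call logs:

```python
import random
import statistics

# Simulated voice-to-voice latencies in ms; in production these come from call logs.
latencies_ms = [max(100.0, random.gauss(450, 120)) for _ in range(10_000)] + [2000.0] * 100

cuts = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f}ms  "
      f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The mean looks acceptable here, but P99 surfaces the 2000ms outliers that hurt 1% of callers.
```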

Accuracy Metrics

Intent accuracy: Percentage of conversations where agent understands user goal

  • Target: >90% intent accuracy
  • Measurement: Sample conversations, human review of transcription and response

Task completion rate: Percentage of conversations achieving stated goal

  • Target: >70% completion without human transfer
  • Measurement: Conversation outcome tracking (appointment booked, question answered, etc.)

Transcription accuracy (WER): Word error rate in STT output

  • Target: <10% WER for standard accents
  • Measurement: Human review of transcribed vs actual speech
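
For reference, WER is computed as word-level edit distance divided by the reference word count. In practice an existing evaluation library is the usual choice, but a hand-rolled version (sketched below) makes the formula explicit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# "two" misheard as "too": 1 error over 7 reference words, about 14.3% WER.
print(f"{word_error_rate('book a table for two at seven', 'book a table for too at seven'):.1%}")
```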

Response relevance: Whether agent responses address user query

  • Target: >95% relevant responses
  • Measurement: Human evaluation of conversation quality

Cost Metrics

Cost per conversation minute: Combined STT + LLM + TTS + infrastructure costs

  • Typical range: $0.05-0.15 per minute
  • Monitoring: Track by provider configuration to identify optimization opportunities

Cost per completed conversation: Total cost divided by successful completion rate

  • More meaningful than per-minute cost
  • Lower completion rates increase effective cost per resolution
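
A worked sketch of the arithmetic, with assumed figures ($0.10/minute, 4-minute average call):

```python
def cost_per_resolution(cost_per_minute: float, avg_minutes: float, completion_rate: float) -> float:
    """Effective cost of one successful outcome: the cost of failed or transferred
    calls is spread across the calls that complete."""
    cost_per_call = cost_per_minute * avg_minutes
    return cost_per_call / completion_rate

# At 80% completion a resolution costs $0.50; at 60% the same call profile costs $0.67.
print(round(cost_per_resolution(0.10, 4.0, 0.80), 2))  # 0.5
print(round(cost_per_resolution(0.10, 4.0, 0.60), 2))  # 0.67
```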

Token usage: LLM input/output tokens consumed per conversation

  • Correlates with prompt length and conversation duration
  • Optimize by reducing system prompt verbosity
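
A rough sketch of why prompt length dominates: the system prompt is resent on every LLM call, so its token cost multiplies across turns. The characters-per-token ratio below is a rule of thumb, not a tokenizer-exact count, and the per-turn accounting is simplified.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; real counts vary by tokenizer

def estimate_conversation_tokens(system_prompt: str, turns: list[str]) -> int:
    """Rough input-token estimate for one conversation (ignores accumulated history)."""
    prompt_tokens = len(system_prompt) // CHARS_PER_TOKEN
    turn_tokens = sum(len(t) // CHARS_PER_TOKEN for t in turns)
    llm_calls = len(turns)   # simplified: one completion per user turn
    return prompt_tokens * llm_calls + turn_tokens

# A 6,000-character system prompt over a 12-turn call costs roughly 18,000 prompt
# tokens before any conversation text; trimming the prompt pays off on every turn.
print(estimate_conversation_tokens("x" * 6000, ["hello"] * 12))
```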

User Satisfaction Metrics

Conversation completion rate: Percentage of callers who complete the conversation rather than hanging up

  • Target: >80% completion for transactional use cases
  • Low completion indicates friction or poor experience

Transfer to human rate: Percentage escalating to human agent

  • Target: <30% transfer rate for routine scenarios
  • High transfer suggests agent capability gaps

User sentiment: Positive, neutral, negative classification

  • Measurement: Analyze conversation tone and explicit feedback
  • Target: >70% positive or neutral sentiment

Post-conversation surveys: Optional feedback collection

  • "How would you rate this conversation?" (1-5 stars)
  • "Did the agent resolve your issue?" (yes/no)