Why AI Voice Agents Need Purpose-Built Telephony Infrastructure
Standard CPaaS wasn't built for AI. Here's what changes when your customer is an LLM, not a human.
The problem with standard CPaaS for AI
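When Twilio was built, the end-to-end latency of a phone call barely mattered as long as it stayed under a second. Two humans talking to each other can tolerate 400ms of extra latency. A caller talking to an LLM cannot, because that delay stacks on top of the time the model needs to produce a response.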
AI voice agents operate under a much tighter latency budget. If your voice AI takes 600ms to generate a response, and your telephony stack adds another 400ms of media latency, the caller sits through a full second of silence before the agent responds. That doesn't sound natural. That sounds broken.
Standard CPaaS platforms weren't designed for this. They were optimized for reliability and throughput — not for sub-100ms media delivery.
What "low latency" actually means in telephony
There are several latency components in a voice AI call:
1. Media latency: time to deliver audio packets from the PSTN to your server
2. ASR latency: speech-to-text recognition time
3. LLM inference latency: time to generate a response
4. TTS latency: text-to-speech synthesis time
5. Playback latency: time to play audio back to the caller
If you're using hosted models, you can't control ASR, LLM inference, or TTS latency. You can control the two telephony-side components: media latency and playback latency. CloudVNO targets under 50ms for media delivery (measured at p99), compared to 150-400ms on standard CPaaS routes.
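To make the budget concrete, here is a rough sketch that adds up the five components for a single conversational turn. The per-component values are illustrative assumptions, not measurements: the ASR, LLM, and TTS figures are placeholders chosen to sum to the 600ms response time used earlier, and the telephony figures assume the delay is split evenly between the inbound media leg and the playback leg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnBudget:
    """Latency components (in ms) for one caller-to-agent turn."""
    media_ms: int      # PSTN -> your server
    asr_ms: int        # speech-to-text
    llm_ms: int        # response generation
    tts_ms: int        # text-to-speech
    playback_ms: int   # your server -> caller

    def model_ms(self) -> int:
        # The part you cannot tune when using hosted models.
        return self.asr_ms + self.llm_ms + self.tts_ms

    def telephony_ms(self) -> int:
        # The part determined by the telephony layer.
        return self.media_ms + self.playback_ms

    def total_ms(self) -> int:
        return self.model_ms() + self.telephony_ms()


# Hypothetical hosted-model timings that sum to 600ms.
HOSTED_MODELS = dict(asr_ms=150, llm_ms=350, tts_ms=100)

# Assumed even split of telephony delay across the two legs.
standard = TurnBudget(media_ms=200, playback_ms=200, **HOSTED_MODELS)
low_latency = TurnBudget(media_ms=50, playback_ms=50, **HOSTED_MODELS)

for name, budget in (("standard", standard), ("low-latency", low_latency)):
    print(
        f"{name}: model {budget.model_ms()}ms "
        f"+ telephony {budget.telephony_ms()}ms "
        f"= {budget.total_ms()}ms of silence"
    )
```

With these placeholder values the totals come to 1000ms and 700ms. The model portion is identical in both cases, so the entire difference comes from the telephony layer.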
The provisioning problem
Building a voice AI product at scale requires provisioning phone numbers dynamically — sometimes thousands per day. Standard number provisioning on legacy CPaaS can take 5-30 seconds per number.
CloudVNO completes provisioning in under 2 seconds. For AI voice products that need to spin up dedicated numbers per customer or per conversation, this matters.
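A quick back-of-the-envelope calculation shows how these per-number times add up. The sketch below assumes a hypothetical batch of 1,000 numbers provisioned one after another; real systems may issue requests in parallel, so treat the results as a worst-case illustration rather than a benchmark.

```python
def sequential_provisioning_minutes(numbers: int, seconds_per_number: float) -> float:
    """Total wall-clock minutes to provision `numbers` one at a time."""
    return numbers * seconds_per_number / 60


BATCH_SIZE = 1000  # hypothetical daily batch, for illustration only

scenarios = (
    ("legacy CPaaS, best case (5s)", 5),
    ("legacy CPaaS, worst case (30s)", 30),
    ("CloudVNO upper bound (2s)", 2),
)

for label, seconds in scenarios:
    minutes = sequential_provisioning_minutes(BATCH_SIZE, seconds)
    print(f"{label}: {minutes:.0f} minutes for {BATCH_SIZE} numbers")
```

Under these assumptions the worst legacy case takes 500 minutes, more than eight hours, while the 2-second bound keeps the same batch to about 33 minutes. For per-conversation numbers, the per-number figure is also the delay a caller or customer waits before the first call can start.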
STIR/SHAKEN and answer rates
Legitimate AI voice agents need their calls to actually connect. With spam labeling at an all-time high, proper STIR/SHAKEN attestation can be the difference between a 40% answer rate and a 70% answer rate.
CloudVNO provides Full Attestation (Level A) for numbers you own outright. This significantly reduces the risk of "Spam Likely" labels from carriers.
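For context, the attestation level travels with each call as the `attest` claim of a SHAKEN PASSporT, a JWT-style token carried in the SIP Identity header. The sketch below shows one way to read that claim for debugging. It is a minimal illustration, not a verifier: it skips signature and certificate validation entirely, and the sample token, phone numbers, certificate URL, and the `read_attestation` helper are all made up for the example.

```python
import base64
import json

# SHAKEN attestation levels carried in the PASSporT "attest" claim.
ATTESTATION_LEVELS = {"A": "Full", "B": "Partial", "C": "Gateway"}


def b64url_decode(segment: str) -> bytes:
    """Decode base64url text that may be missing its '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data: bytes) -> str:
    """Encode bytes as unpadded base64url, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def read_attestation(passport: str) -> str:
    """Return the attestation label from a PASSporT WITHOUT verifying it."""
    _header_b64, payload_b64, _signature = passport.split(".")
    payload = json.loads(b64url_decode(payload_b64))
    return ATTESTATION_LEVELS[payload["attest"]]


# Build a fake, unsigned sample token purely for demonstration.
header = {
    "alg": "ES256",
    "ppt": "shaken",
    "typ": "passport",
    "x5u": "https://cert.example.org/passport.cer",  # placeholder URL
}
payload = {
    "attest": "A",
    "dest": {"tn": ["12155550113"]},
    "iat": 1443208345,
    "orig": {"tn": "12155550112"},
    "origid": "123e4567-e89b-12d3-a456-426655440000",
}
sample_passport = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    "placeholder-signature",  # not a real signature
])

print(read_attestation(sample_passport))
```

In production, the terminating carrier verifies the signature against the certificate referenced by `x5u` before trusting the claim. Reading it yourself is only useful for confirming what your provider is actually signing.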
Conclusion
The telephony layer for AI voice is a solved problem — if you use infrastructure built for it. The combination of sub-50ms media latency, instant number provisioning, and full STIR/SHAKEN attestation is what separates a great-sounding AI agent from one that frustrates callers.