
How CloudVNO Achieves 99.99% Uptime: Our Infrastructure Story

A look at the redundancy, failover, and carrier diversity architecture behind CloudVNO's 99.99% uptime SLA.

What 99.99% uptime actually means

99.99% uptime allows roughly 52.6 minutes of downtime per year (0.01% of the 525,600 minutes in a year). Measured monthly, the budget is about 4.3 minutes in a 30-day month. For a communications platform, that's the difference between "one brief incident in 12 months" and "customers can't send messages today."
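The downtime budget follows directly from the uptime percentage. Here is a minimal sketch of that arithmetic; the function name and the 30-day month are illustrative choices, not part of any CloudVNO tooling.

```python
def downtime_budget_minutes(uptime_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at the given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


yearly = downtime_budget_minutes(99.99, 365)   # 365-day year
monthly = downtime_budget_minutes(99.99, 30)   # assumed 30-day month

print(f"{yearly:.1f} min/year, {monthly:.1f} min per 30-day month")
# prints: 52.6 min/year, 4.3 min per 30-day month
```

The monthly figure is the one that matters when uptime is measured per month: a single five-minute incident would already exceed it.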

Achieving this requires redundancy at every layer: carrier, network, application, and data.

Carrier diversity

The single biggest risk for a CPaaS platform is carrier dependency. If you route all your US SMS through a single carrier and that carrier has an outage, your delivery drops to zero.

CloudVNO maintains direct connections to multiple tier-1 carriers per country. Our routing engine:

  1. Monitors delivery rates per carrier in real time
  2. Automatically shifts traffic away from degraded carriers
  3. Completes failover in under 30 seconds
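The first two steps above can be sketched as a sliding-window delivery rate per carrier plus a weight calculation that drops degraded carriers. This is an illustration, not CloudVNO's routing engine: the window size, the 95% threshold, the carrier names, and every function name are assumptions, and the failover timing is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW = 200              # assumed: recent messages tracked per carrier
MIN_DELIVERY_RATE = 0.95  # assumed: below this a carrier counts as degraded


@dataclass
class CarrierStats:
    name: str
    outcomes: deque = field(default_factory=lambda: deque(maxlen=WINDOW))

    def record(self, delivered: bool) -> None:
        """Store one delivery outcome; the oldest falls out of the window."""
        self.outcomes.append(delivered)

    def delivery_rate(self) -> float:
        # A carrier with no data yet is treated as healthy.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)


def traffic_weights(carriers: list[CarrierStats]) -> dict[str, float]:
    """Share of traffic per carrier; degraded carriers get 0."""
    rates = {c.name: c.delivery_rate() for c in carriers}
    healthy = {n: r for n, r in rates.items() if r >= MIN_DELIVERY_RATE}
    # If every carrier is degraded, keep routing instead of dropping to zero.
    pool = healthy or rates
    total = sum(pool.values())
    if total == 0:
        return {n: 1 / len(rates) for n in rates}
    return {n: pool.get(n, 0.0) / total for n in rates}


# Hypothetical carriers: one delivering normally, one in a full outage.
carriers = [CarrierStats("carrier-a"), CarrierStats("carrier-b")]
for _ in range(100):
    carriers[0].record(True)
    carriers[1].record(False)

weights = traffic_weights(carriers)
print(weights)  # {'carrier-a': 1.0, 'carrier-b': 0.0}
```

In a design like this, a smaller window reacts to an outage faster but is more sensitive to noise.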

We don't publish our carrier matrix because it's a competitive advantage, but our US infrastructure connects to four independent carrier paths.

Geographic redundancy

Our API infrastructure runs across multiple EU data centers in an active-active configuration. There's no primary or secondary: every region can serve 100% of traffic if another goes down.

Database replication is synchronous within a region and asynchronous across regions, with automatic failover.
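To make the active-active claim concrete, here is a small sketch of how traffic shares could be recomputed when one region drops out. The region names and the function are hypothetical, and real traffic steering (DNS, load balancers, health checks) is outside its scope.

```python
def shares_after_failure(shares: dict[str, float], failed: str) -> dict[str, float]:
    """Redistribute a failed region's traffic across the surviving regions,
    keeping their relative proportions."""
    survivors = {region: s for region, s in shares.items() if region != failed}
    total = sum(survivors.values())
    if not survivors or total <= 0:
        raise RuntimeError("no surviving region can take traffic")
    return {region: s / total for region, s in survivors.items()}


# Two hypothetical regions splitting traffic evenly.
normal = {"eu-region-1": 0.5, "eu-region-2": 0.5}
after = shares_after_failure(normal, "eu-region-1")
print(after)  # {'eu-region-2': 1.0}
```

With two regions, the survivor ends up with the full load, which is why each region has to be provisioned for 100% of traffic rather than its normal share.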

The health monitoring stack

We run continuous synthetic monitoring, making real API calls every 60 seconds from each region:

  • POST /v1/messages (send SMS, verify delivery)
  • POST /v1/calls (make a call, verify connection)
  • POST /v1/verify/send (send OTP, verify receipt)

If any synthetic check fails twice consecutively, PagerDuty wakes someone up.
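The "twice consecutively" rule can be sketched as a failure streak tracked per check and region. This is a minimal illustration: the class name, the region name, and the `page` callback (a stand-in for whatever actually notifies the on-call engineer) are assumptions, and paging only once per streak is a design choice of this sketch.

```python
from collections import defaultdict
from typing import Callable

FAILURES_BEFORE_PAGE = 2  # from the rule above: two consecutive failures


class ConsecutiveFailureAlerter:
    def __init__(self, page: Callable[[str], None]) -> None:
        self._page = page                # hypothetical paging hook
        self._streak = defaultdict(int)  # (check, region) -> failures in a row

    def record(self, check: str, region: str, ok: bool) -> None:
        """Record one synthetic check result and page when the streak hits the limit."""
        key = (check, region)
        if ok:
            self._streak[key] = 0  # any success resets the streak
            return
        self._streak[key] += 1
        # Page exactly when the threshold is reached, not on every later failure.
        if self._streak[key] == FAILURES_BEFORE_PAGE:
            self._page(f"{check} failing from {region}")


pages: list[str] = []
alerter = ConsecutiveFailureAlerter(pages.append)

# One result per 60-second run: an isolated failure, then three in a row.
for ok in [False, True, False, False, False]:
    alerter.record("POST /v1/messages", "eu-region-1", ok)

print(pages)  # ['POST /v1/messages failing from eu-region-1']
```

Requiring two failures filters out one-off blips, at the cost of roughly one extra check interval before anyone is paged.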

What we learned from incidents

Our two most significant incidents both had the same root cause: a change that worked perfectly in staging but behaved differently under production traffic patterns. Both were resolved in under 30 minutes.

The lesson: use staged rollouts and canary deployments, even for "trivial" changes. Every production change now starts at 1% of traffic before full rollout.
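One common way to send a fixed share of traffic to a new version is deterministic hash bucketing. The sketch below shows that technique in general terms; it is not a description of CloudVNO's deployment tooling, and the request ID key and function name are assumptions.

```python
import hashlib


def in_canary(request_id: str, percent: float) -> bool:
    """Return True if this request falls in the canary slice.

    The same ID always maps to the same bucket, so a given request key
    does not flip between old and new versions while the percentage holds.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1% -> buckets 0..99


# Hypothetical request IDs, checked against a 1% canary.
ids = [f"req-{i}" for i in range(100_000)]
hits = sum(in_canary(r, 1) for r in ids)
print(f"{hits / len(ids):.2%} of requests routed to the canary")
```

Raising `percent` only ever adds buckets to the canary slice, so requests already on the new version stay there as the rollout widens.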

The SLA commitment

Our SLA commits to 99.99% monthly uptime for the SMS and Voice APIs. If we miss it, you get service credits automatically; you don't need to ask.
