by Kresimir Bojcic
Jun 29th 2018
Scale for Speed and Availability
Tags: AWS, Devops, Scaling


First let's go over some concepts: region and availability zone.

Amazon Availability Zones are distinct physical locations inside the same region. They have low-latency network connectivity between them and are engineered to be insulated from failures that afflict other AZs.

Each Availability Zone runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable, with independent power, cooling, network and security.


Common points of failure like generators and cooling equipment are not shared across Availability Zones. Additionally, the zones are physically separate, so that even extremely uncommon disasters like fires, tornadoes or flooding would only affect a single Availability Zone.[^1]

If your platform serves mostly one area of the world, it makes sense to put your servers in the region closest to it. That region will have multiple "Availability Zones".

This means that you can put redundant servers in different zones within the same region and, as a result, get better availability.

The important twist here is that network latency within the same region is minimal, so we have separate facilities with good interconnectedness. Here is an image of the available regions along with the number of available zones on AWS. The two green circles are new regions that are opening soon (in Paris and Ningxia).

Scale Up

Things are simple here: you have one machine that serves all your traffic. When you notice that the server can't handle the traffic, you simply shut down the machine, upgrade the CPU, RAM and storage, and start it again. This approach is the cheapest and is ideal for the MVP stage.

Don't be fooled, though: it can still get you very far, and I would most definitely always begin with this approach.

Pros:

  • simplest
  • cheapest

Cons:

  • single point of failure (all eggs in one basket)
  • downtime when upgrading
  • you can't adjust dynamically (for spike traffic)

Single Availability Zone

This is similar to having a single machine in the sense that if our availability zone goes down, production goes down too. Our servers live in a single region, inside a single availability zone. We are merely adding an Elastic Load Balancer that distributes traffic to multiple servers within the same availability zone.
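To make the idea of distributing traffic across several servers concrete, here is a minimal sketch of a round-robin balancer. Round-robin is an assumption chosen for illustration, not a statement of how Elastic Load Balancer picks targets internally, and the class name and server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() repeats the server list endlessly in the same order.
        self._rotation = cycle(servers)

    def pick(self):
        # Return the server that should handle the next request.
        return next(self._rotation)


if __name__ == "__main__":
    # Hypothetical server names, all in the same availability zone.
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    for request_id in range(5):
        print(request_id, balancer.pick())
```

With more than one server behind the balancer, a single server can be taken out of the rotation for an upgrade while the others keep serving traffic.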

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)

Cons:

  • single point of failure

It is much better to use approach #3 with multiple zones. A single zone can serve as a stepping stone in the right direction when the load is so low that it requires only one server (which then has to live in one availability zone).

Multiple Availability Zones

Amazon EC2/RDS instances have an uptime guarantee of 99.95% on a monthly basis. The maximum permissible downtime roughly equates to 22 minutes per month (assuming 30 days per month).[^2] Combining multiple availability zones makes an outage very unlikely. Elastic Load Balancer can detect problems in each zone and redirect traffic to healthy instances.
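The 22-minute figure follows directly from the 99.95% guarantee, and the same arithmetic shows why adding zones helps. The sketch below is a back-of-the-envelope calculation. The second function assumes that zones fail independently and that the same availability figure applies to each zone, which is a simplification for illustration and not something the SLA states.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime that a given uptime percentage permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def combined_availability(zone_availability, zones):
    """Probability that at least one of `zones` zones is up.

    Assumes zone failures are independent, which is an idealization.
    """
    return 1 - (1 - zone_availability) ** zones


if __name__ == "__main__":
    # 30 days * 24 h * 60 min = 43,200 minutes; 0.05% of that is 21.6 minutes.
    print(round(allowed_downtime_minutes(99.95), 1))
    # Two independent zones at 99.95% each.
    print(combined_availability(0.9995, 2))
```

Under the independence assumption, the chance that two zones are down at the same time is the product of the individual failure probabilities, which is why the combined figure is so much closer to 100%.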


Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • possible to survive one or more availability zones going down

Cons:

  • affected by whole region going down

This combination is a sweet spot for reasonable reliability and cost.
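Surviving a zone outage depends on the load balancer sending traffic only to instances that pass their health checks. The following is a minimal sketch of that behavior, not the actual Elastic Load Balancer algorithm. The `Instance` type, the instance names and the zone names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True


def healthy_targets(instances):
    """Return the instances that currently pass their health check."""
    return [i for i in instances if i.healthy]


def route(instances, request_number):
    """Pick a healthy instance for a request, rotating by request number."""
    targets = healthy_targets(instances)
    if not targets:
        raise RuntimeError("no healthy instances in any zone")
    return targets[request_number % len(targets)]


if __name__ == "__main__":
    fleet = [
        Instance("web-1", "zone-a"),
        Instance("web-2", "zone-a"),
        Instance("web-3", "zone-b"),
    ]
    # Simulate an outage of zone-a: its instances start failing health checks.
    for instance in fleet:
        if instance.zone == "zone-a":
            instance.healthy = False
    for n in range(3):
        print(n, route(fleet, n).name)
```

As long as at least one zone still has a healthy instance, requests keep being served. Only when every zone is unhealthy does routing fail.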

Multiple Regions - Active/Passive Failover

Although it is very rare for an entire AWS region to go down, it does happen. Many enterprises want to replicate their databases across regions, so that when a catastrophe does occur and the primary region goes down, infrastructure can be quickly set up in another region.[^3] Such a setup requires the database to be synced across regions. Total time from endpoint failure to DNS failover is about 3 minutes, so a backup server can take over quickly and prevent a long outage.
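The failover decision itself is simple: answer with the primary endpoint while it is healthy, otherwise with the secondary. The sketch below illustrates that rule together with a rough estimate of how long a failover takes. The function names, the endpoint names and the timing values are hypothetical, chosen for illustration rather than taken from any AWS configuration.

```python
def resolve(primary, secondary, is_healthy):
    """Return the endpoint DNS should answer with under active/passive failover."""
    if is_healthy(primary):
        return primary
    return secondary


def estimated_failover_seconds(check_interval, failure_threshold, dns_ttl):
    """Rough estimate: time to detect the failure plus time for cached DNS answers to expire."""
    return check_interval * failure_threshold + dns_ttl


if __name__ == "__main__":
    # Hypothetical endpoints; the primary is considered down in this example.
    down = {"primary.example.com"}
    answer = resolve(
        "primary.example.com",
        "secondary.example.com",
        is_healthy=lambda endpoint: endpoint not in down,
    )
    print(answer)

    # Assumed values: a check every 30 s, 3 failed checks to declare failure, 60 s TTL.
    print(estimated_failover_seconds(30, 3, 60))  # 150
```

With these assumed values the estimate lands in the same range as the roughly 3 minutes quoted above. Shorter check intervals or TTLs would reduce it, at the cost of more health-check and DNS traffic.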


One way to cut down cost is to use the passive setup as a staging area for testing prior to production rollout.

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • possible to survive whole region going down with little to no downtime

Cons:

  • partially affected by whole region going down
  • we need read replicas in a different region for havoc scenarios

Multiple Regions - Active/Active Failover

When your servers handle lots of customers across multiple regions, it makes sense to keep both regions active. In normal circumstances you might use Amazon Route 53 Latency Based Routing (LBR) or Weighted Round Robin (WRR) to distribute load. In case of emergency, when an entire region goes down, you transfer the traffic over to a working region.

This means you get slower responses, but it certainly beats suffering complete downtime.

The configuration is exactly the same as #4 Active/Passive Failover, but we use both regions and distribute the load between them at all times, not just when one region goes down.
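The two routing strategies mentioned in this section can be sketched in a few lines. This is an illustration of the general ideas behind weighted and latency-based routing, not Route 53's implementation. The function names, the weights and the latency numbers are hypothetical.

```python
import random


def pick_weighted(weights, rng=random):
    """Pick a region at random, in proportion to its weight."""
    regions = list(weights)
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]


def pick_lowest_latency(latency_ms, healthy):
    """Pick the healthy region with the lowest measured latency for this client."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Weighted: send roughly 70% of traffic to one region and 30% to the other.
    print(pick_weighted({"eu-west-1": 70, "us-east-1": 30}))

    # Latency-based: hypothetical measurements for one client.
    latency = {"eu-west-1": 40, "us-east-1": 120}
    print(pick_lowest_latency(latency, healthy={"eu-west-1", "us-east-1"}))

    # Emergency: the closest region is down, so the client is sent to the slower one.
    print(pick_lowest_latency(latency, healthy={"us-east-1"}))
```

The last call shows the trade-off described above: when the nearest region is unavailable, clients still get an answer, only with higher latency.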

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • should survive whole region going down without major issues
  • allows region by region rollout to test new production

Cons:

  • we need read replicas in different regions
  • we probably need a database master in each region

Common Concerns

For a big system, a major problem is always the database. So in a sense you do everything you can to remove the burden from it:

  • Read Replicas
  • Caching of static and dynamic content
  • Splitting data based on regions (multiple masters depending on region)

Another good tip is protecting web servers from being burdened by using a CDN for static content delivery or streaming. DDoS protection is another valid concern.
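Two of the techniques for taking load off the database, read replicas and caching, can be combined as in the minimal sketch below. Plain dictionaries stand in for the primary database, a read replica and a cache. The class and method names are hypothetical, and real replication is asynchronous and handled by the database itself rather than by application code.

```python
class CachedReplicaStore:
    """Serve reads from a cache or a replica; send every write to the primary."""

    def __init__(self, primary, replica):
        self.primary = primary  # dict standing in for the primary database
        self.replica = replica  # dict standing in for a read replica
        self.cache = {}         # dict standing in for an in-memory cache

    def read(self, key):
        # Try the cache first so that repeated reads never reach a database.
        if key in self.cache:
            return self.cache[key]
        value = self.replica.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.primary[key] = value
        # Drop the cached copy so the next read does not return the old value.
        self.cache.pop(key, None)

    def replicate(self):
        # Stand-in for replication from the primary to the replica.
        self.replica.update(self.primary)


if __name__ == "__main__":
    store = CachedReplicaStore(primary={}, replica={})
    store.write("user:1", "Ana")
    print(store.read("user:1"))  # replica has not caught up yet
    store.replicate()
    print(store.read("user:1"))
```

The first read in the usage example returns nothing because the replica lags behind the primary. That replication lag is the price of moving reads off the primary, and it matters most for data that must be visible immediately after a write.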


Congratulations on making it all the way here. If you just jumped here, shame on you, otherwise I hope you found this useful :)

If you are in search of an awesome RoR/Vue.js/Nuxt team, or you need help with setting up your project, feel free to contact Kodius.


  1. Exploring Amazon Availability Zones
  2. Does Amazon EC2 have an uptime guarantee?
  3. New AWS Feature: Amazon RDS now supports cross-region replication
  4. Active-Active for Multi-Regional Resiliency
  5. A Beginner's Guide To Scaling To 11 Million+ Users On Amazon's AWS
  6. Amazon RDS for MySQL – Promote Read Replica
  7. Overview of Amazon Web Services
  8. Calculator S3
  9. Creating a Billing Alarm to Monitor Your Estimated AWS Charges
  10. Using regions availability zones