***UPDATE – November 22nd 2016***
Microsoft just announced a single-instance SLA of 99.9%. To get this SLA, you need to deploy Premium Storage VMs. This is a very nice addition to Azure, since we can now get a financially backed SLA for applications that do not run in a multi-instance configuration.
Links: https://azure.microsoft.com/da-dk/blog/announcing-4-tb-for-sap-hana-single-instance-sla-and-hybrid-use-benefit-images/ and https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_3/
I’ve been asked quite a few times now about single-instance SLAs in Azure. Apparently people found this blog post and now think they can get an SLA for single instances in Azure, even after the author updated the post to say Microsoft had announced no such thing. Some people have also referred to this tweet.
Let me be clear: To get an SLA for your VMs in Azure, you have to run at least two instances within an Availability Set. There is no way to get an SLA for a single instance!
Is Microsoft working on making this experience better? You bet! They always are; it’s their business, and they do listen to feedback. One of the questions I hear most is about SLAs and why Microsoft doesn’t offer a single-instance SLA. Often people even compare Azure to AWS and Google and say that they offer it. Guess what: they also require multiple instances!
From AWS SLA: https://aws.amazon.com/ec2/sla/
“Monthly Uptime Percentage” is calculated by subtracting from 100% the percentage of minutes during the month in which Amazon EC2 or Amazon EBS, as applicable, was in the state of “Region Unavailable.” Monthly Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below).
“Region Unavailable” and “Region Unavailability” mean that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you.
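To make the AWS wording concrete, here is a minimal Python sketch of the two definitions quoted above. The function names and the zone-status dictionary are my own illustration, not anything from AWS, and the SLA exclusions are ignored.

```python
def region_unavailable(zone_unavailable: dict) -> bool:
    """True when more than one of the zones you run instances in is unavailable.

    zone_unavailable maps a zone name (hypothetical) to True if that zone
    is currently unavailable to you.
    """
    return sum(1 for down in zone_unavailable.values() if down) > 1


def monthly_uptime_percentage(minutes_in_month: int, region_unavailable_minutes: int) -> float:
    """100% minus the percentage of minutes spent in the Region Unavailable state."""
    if minutes_in_month <= 0:
        raise ValueError("minutes_in_month must be positive")
    if not 0 <= region_unavailable_minutes <= minutes_in_month:
        raise ValueError("region_unavailable_minutes must be within the month")
    return 100.0 - (region_unavailable_minutes / minutes_in_month) * 100.0


# A deployment in a single zone can never satisfy the "more than one zone" test.
print(region_unavailable({"zone-a": True}))                  # single zone, down
print(region_unavailable({"zone-a": True, "zone-b": True}))  # two zones, both down
print(monthly_uptime_percentage(30 * 24 * 60, 43))
```

Note how the first call can never report Region Unavailable: with only one zone in play, the definition simply does not apply, which is the point of the comparison.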
From Google Cloud SLA: https://cloud.google.com/compute/sla
For Instances: Loss of external connectivity and/or persistent disk access for all running Instances that are hosted across two or more zones combined with the inability to launch replacement Instances in any zone.
“Downtime Period” means a period of five consecutive minutes of Downtime. Intermittent Downtime for a period of less than five minutes will not be counted towards any Downtime Periods.
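The five-minute rule in the Google definition can be sketched like this. It is only an illustration under my own assumptions: the input is a hypothetical per-minute list of booleans (True = down), and the function just reports the runs that reach the threshold; how Google tallies those runs into credits is not covered here.

```python
from itertools import groupby


def qualifying_downtime_runs(down_per_minute, threshold=5):
    """Return the lengths of consecutive downtime runs of at least `threshold` minutes.

    Shorter, intermittent runs are dropped, mirroring the quoted rule that
    downtime of less than five minutes is not counted.
    """
    runs = []
    for is_down, group in groupby(down_per_minute):
        length = sum(1 for _ in group)
        if is_down and length >= threshold:
            runs.append(length)
    return runs


# 3 minutes down (ignored), 2 up, 7 down (counted), 1 up, 5 down (counted)
timeline = [True] * 3 + [False] * 2 + [True] * 7 + [False] + [True] * 5
print(qualifying_downtime_runs(timeline))
```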
You can find the Azure VM SLA here: https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_1/
“But don’t they have live migration in Hyper-V?!” – Yes, but this is Azure. Even though it runs on Hyper-V, it operates at an incredibly large scale, where live migration would require huge amounts of bandwidth and make the service more expensive. And to be honest, I would prefer 2 small instances over 1 large instance any day. That way you also stay up when updating systems internally or experiencing failures.
“What about my applications that don’t support multiple instances?” – The best I can say is: either you don’t move them to Azure, or you live without a financially backed SLA. Just because there is no official SLA for single-instance VMs doesn’t mean Microsoft doesn’t care about uptime. In my experience, single-instance VMs have very high uptime (>99.9%).
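To put a figure like 99.9% in perspective, this small sketch converts an uptime percentage into the downtime it allows per month. The 30-day month is my assumption; a real SLA defines its own measurement period.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still meet the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    minutes_in_month = days_in_month * 24 * 60
    return (100.0 - uptime_percent) / 100.0 * minutes_in_month


print(round(allowed_downtime_minutes(99.9), 1))
```

At 99.9% that works out to roughly 43 minutes in a 30-day month, so a couple of host reboots can already use up a large part of the budget.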
One important thing to remember about the SLA is that you need to configure your VMs to use an Availability Set. This ensures that your VMs are placed in different Fault and Update Domains (basically redundant racks), so a power outage (or a network or other failure) in a single rack won’t take down your service, since you still have a VM running in a second rack. Don’t worry that the VMs will end up in different corners of the datacenter and suffer from latency. There is a very complex placement system that makes sure your VMs are close to each other, yet not too close (i.e. in different racks). When using Azure Resource Manager you can have 3 Fault Domains and 20 Update Domains. So if you have 20 VMs, they can be in 20 different Update Domains and 3 different Fault Domains.
To visualize this, imagine 2 VMs in an Availability Set. They will be running in 2 different racks:
If a failure hits the rack where VM2 is running, VM2 will fail over to another rack, while VM1 continues running = no downtime on your service.
Many companies start out by deploying a single VM just to see if it works, and then over time they will probably get a second instance. Now, because you can only configure Availability Sets at VM creation, some people will create that first VM with an Availability Set, because “I might get a second instance, and don’t want to redeploy the first one”. Do not ever do that, unless you are absolutely sure that you will get that second instance very soon! Why? Let me explain a bit about maintenance windows in Azure:
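As a rough mental model of how 20 VMs can span 20 Update Domains and 3 Fault Domains, here is a sketch that hands out domains round-robin. The modulo assignment is purely my assumption for illustration; the real Azure placement logic is far more complex and not described here.

```python
def place_vms(vm_count: int, fault_domains: int = 3, update_domains: int = 20):
    """Assign each VM a (name, fault_domain, update_domain) tuple, round-robin.

    Round-robin is a simplification, not Azure's actual placement algorithm.
    """
    if fault_domains < 1 or update_domains < 1:
        raise ValueError("domain counts must be at least 1")
    return [
        (f"vm{i}", i % fault_domains, i % update_domains)
        for i in range(vm_count)
    ]


placement = place_vms(20)
distinct_fault_domains = {fd for _, fd, _ in placement}
distinct_update_domains = {ud for _, _, ud in placement}
print(len(distinct_fault_domains), len(distinct_update_domains))
```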
When Microsoft patches Azure, the hosts on which your VMs run are rebooted onto a new, updated image. Since the host is rebooted and Azure doesn’t have live migration, your VMs will also shut down, and then start again when the host is back online. It’s usually a quick operation with 5-15 minutes of downtime. If you have multiple instances configured with an Availability Set, Azure will only take down 1 Update Domain at a time, meaning the rest of the Update Domains (and the VMs in them) in this Availability Set stay online.
Now, why does this affect single-instance VMs then? Remember when I said Microsoft cares about your uptime for single-instance VMs too? Maintenance Windows are done in multiple waves. This is both to ensure that everything isn’t updated at once in case of errors, and to respect single instances. Usually Maintenance Windows are done during the weekends (far from every weekend!), but sometimes they also occur on weekdays. If your VMs are not in an Availability Set, the Maintenance Window will not occur on weekdays, but only after hours on weekends!
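The effect of taking down one Update Domain at a time can be sketched as follows. This is a toy model under my own assumptions: each VM is represented only by its Update Domain number, and every VM in the domain being patched is offline for that step.

```python
from collections import Counter


def min_online_during_rollout(vm_update_domains):
    """Smallest number of VMs still online while Update Domains are patched one by one."""
    if not vm_update_domains:
        return 0
    total = len(vm_update_domains)
    vms_per_domain = Counter(vm_update_domains)
    # While a domain is being patched, all of its VMs are down; the rest stay up.
    return min(total - count for count in vms_per_domain.values())


print(min_online_during_rollout([0]))     # single VM
print(min_online_during_rollout([0, 1]))  # two VMs in separate Update Domains
```

A single VM drops to zero online instances during its reboot, while two VMs in separate Update Domains always leave one running.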
To sum up:
If you run single-instance VMs, do not put them in Availability Sets, because VMs in an Availability Set can be affected by Maintenance Windows on weekdays.
If you run multi-instance VMs, ensure that they are in Availability Sets. That way Azure will ensure they don’t fail or reboot simultaneously.
Will all this change over time? Probably, but for now, these are the guidelines.