Prepare for a wall of text – but don’t worry, I’ll explain the technical parts too. I’ve wanted to write this post for a while now, but haven’t gotten around to it. It’s important though, and it seems to grow more important as more companies migrate from Azure Service Management (Classic) to Azure Resource Manager (ARM). I’ve worked with multiple customers over the past few months, and most have one thing in common: the network security is… well, lacking.
What does this have to do with Azure Classic, you might ask? To be honest, I believe network management in Azure Classic was horrible. Just like the rest of the IaaS experience, it was clearly designed for something else, and difficult to fix when Microsoft wanted to enter the IaaS competition. A few examples:
- Every single VM had a Public IP assigned (through a load balancer), and you could *not* remove it
- Every single VM had 2 ports NAT’ed to it by default: 3389 (RDP) and 5986 (PS Remoting)
- Network Security Groups and Access Control Lists (for the Public IPs) were quite a mess to manage, often impossible through the portal
After ARM came along (funny thing, I just found some of the early presentations I saw about ARM, A LOT has happened since!), we got a much better way of managing our networks. We could use the UI, PowerShell, etc., and today we are in a pretty good place network-wise.
The problem is that many environments have been migrated from Classic to ARM, bringing over all the bad configurations that were made in the past. Don’t get me wrong, these configs were just as bad in Classic. Technically, nothing changes by doing a migration. But now we see them, and it’s good horror movie content!
1st example scenario:
You deployed 20 VMs, evenly distributed across 10 different Cloud Services. After migrating to ARM, you now have:
10 Public IPs
10 Load Balancers
4 NAT rules on each Load Balancer (2x 3389, 2x 5986)
This adds up to 40 ports open, directly to your servers, from anywhere in the world. Again, you had this before migrating too, but it’s easily visible now.
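The arithmetic is simple enough to spell out (the numbers below are just the ones from the scenario above):

```python
# Hypothetical numbers from the scenario above: 20 VMs spread across
# 10 Cloud Services, each migrated to its own Load Balancer.
load_balancers = 10
nat_rules_per_lb = 4  # 2 VMs x 2 default ports (3389 RDP, 5986 PS Remoting)

open_ports = load_balancers * nat_rules_per_lb
print(open_ports)  # 40 internet-facing NAT rules
```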
This is why the very first thing I do when I enter a new customer environment in Azure is to check the network configuration. There are so many servers accessible on critical ports, from anywhere in the world.
Now, I don’t want to start a fight, but… one other thing most of these environments have in common is that they were deployed by developers, if not in full, then at least in part. Someone said “DevOps”, and developers started deploying virtual machines to run their code on. Awesome, I love DevOps. But you can’t have DevOps without Ops. There is a reason developers don’t do networking infrastructure on-premises, and this is one of them.
My point here is: if you’re doing DevOps, even today, in ARM, you’d better include Operations, Security, and Network people to build upon a great foundation. Otherwise you might have to redo everything, or worse: restore everything after being infected.
Okay, back to the topic. We need to fix this. My approach is most often to go through the VMs one by one, with the customer, and ask “does this VM need a Public IP?”. If the answer is yes, I ask why. If they have a good argument, I ask which ports. Maybe it would be a much better idea to use an Application Gateway (layer 7) to control access.
On the other hand, if there is no argument for having a Public IP, I write down the IP address and remove it. Why write it down? Well, in case someone complains about something not working anymore, we have the IP to trace back to.
Now, I promised you an explanation..
SHOW ME THE TECH STUFF!
Alrighty! Azure Network Security 101 incoming:
You have a virtual network, with 2 web servers, both sitting in your “WEB” tier subnet:
Then you add 2 SQL Servers, because your web servers need a database:
Your website is not worth much though, because right now you can’t connect to the servers from the internet. Let’s add a Load Balancer for ports 80 and 443. With a Load Balancer you also need a Public IP (unless you’re doing internal load balancing, but that’s out of scope here), so add one of those too.
Awesome, everything is working, and you can go to sleep… Except you can’t. In the above configuration, if someone breaks through your web application and gets access to the web servers, then from a network perspective they will also have full access to your SQL VMs. Contrary to what many believe, subnets in a virtual network are not isolated from each other.
Enter Network Security Groups!
With Network Security Groups (NSGs) you can isolate subnets, or even single VMs, from the rest of the network. Start by creating 2 NSGs, one for each of your subnets, and associate them with the subnets. After that, take a look at these NSGs:
See those inbound default rules? Rule 65000 allows everything from the virtual network to enter the subnet or VM. Rule 65001 allows traffic from Azure Load Balancers. Personally, I like to create a rule that overrules these two default rules:
Azure processes the rules in priority order, from the lowest number up to 65500. When traffic entering the subnet matches one of the rules, processing stops: if the rule allows the traffic, it gets through; if the rule denies it, it’s blocked. So even though the default rule 65000 allows inbound traffic from the virtual network, my rule (priority 4096) will block it. It will also block load balancer traffic, and everything else.
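The processing order can be sketched as a first-match scan over rules sorted by priority. A minimal Python sketch of the idea (the rule tuples and helper below are my own illustration, not an Azure API):

```python
# Minimal sketch of NSG inbound rule processing: rules are evaluated
# in ascending priority order, and the first matching rule wins.
# Rule fields and names here are illustrative, not an Azure API.

RULES = [
    # (priority, name, source, dest_port, access)
    (4096,  "DenyAllInbound-Custom",         "*",                 "*", "Deny"),
    (65000, "AllowVnetInBound",              "VirtualNetwork",    "*", "Allow"),
    (65001, "AllowAzureLoadBalancerInBound", "AzureLoadBalancer", "*", "Allow"),
    (65500, "DenyAllInBound",                "*",                 "*", "Deny"),
]

def evaluate(rules, source, dest_port):
    """Return (access, rule_name) of the first matching rule."""
    for priority, name, rule_source, rule_port, access in sorted(rules):
        if rule_source in ("*", source) and rule_port in ("*", dest_port):
            return access, name
    return "Deny", "implicit"

# My custom rule at priority 4096 matches before the defaults at 65000+:
print(evaluate(RULES, "VirtualNetwork", 1433))
# -> ('Deny', 'DenyAllInbound-Custom')
```

Without the custom rule at priority 4096, the same traffic would fall through to rule 65000 and be allowed, which is exactly the behavior we want to override.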
You might wonder why I want to block load balancer traffic in my web server NSG. It’s simply a matter of trust: I don’t trust that whoever manages the load balancer knows what they’re doing. It might be someone with little to no knowledge about networking. By using the NSG, I can control exactly which ports are allowed, and I can restrict access to the NSGs so only people who know the consequences are allowed to edit them. So let’s allow traffic from the load balancer:
It could look something like the above. You can also create individual rules for each destination IP, or for each type of traffic (HTTP & HTTPS). This is just an example.
Let’s do the same for our SQL NSG. Here I will allow traffic from the 2 web servers to my SQL servers, on port 1433:
See how I allow traffic from my 2 web servers (10.0.1.4 and 10.0.1.5) to my 2 SQL servers (10.0.2.4 and 10.0.2.5)? This could be done less strictly by not specifying the IP addresses and instead using the VirtualNetwork tag. That would, however, allow traffic on port 1433 from ANY IP on my network. It’s up to you to decide how much you want to manage this.
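Spelled out as a sketch, the SQL-tier rule set behaves like this (the rule list and matching helper are illustrative, not an Azure API; the IPs are the ones from the example above):

```python
# Illustrative SQL-subnet NSG: allow 1433 only from the two web
# servers, deny everything else. First match by priority wins.

SQL_RULES = [
    # (priority, name, allowed_sources, dest_port, access)
    (100,  "Allow-Web-To-SQL", {"10.0.1.4", "10.0.1.5"}, 1433, "Allow"),
    (4096, "DenyAll",          "*",                      "*",  "Deny"),
]

def check(rules, source_ip, dest_port):
    for priority, name, sources, port, access in sorted(rules):
        source_ok = sources == "*" or source_ip in sources
        port_ok = port == "*" or port == dest_port
        if source_ok and port_ok:
            return access
    return "Deny"

print(check(SQL_RULES, "10.0.1.4", 1433))  # Allow (a web server)
print(check(SQL_RULES, "10.0.1.9", 1433))  # Deny  (some other VM in the vnet)
print(check(SQL_RULES, "10.0.1.4", 3389))  # Deny  (wrong port)
```

Swapping the explicit IP set for the VirtualNetwork tag would make the second call return Allow too, which is the trade-off described above.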
UH, That’s not so difficult?
Maybe not… But notice that I picked the specific ports I want traffic on from the load balancer: ports 80 and 443. If you don’t do this, and instead allow everything through from the load balancers, people might start to add port 3389 and such on the load balancer. You know, because they needed an easy way to connect to RDP on the VM. This bypasses any other rule you create to block traffic, because the traffic matches your Load Balancer rule which allows everything (*). And since you can’t specify ACLs on the Load Balancer to limit the IP addresses that can connect to it, this opens up for traffic from everywhere in the world. Let me try to illustrate it…
Here we have a correctly configured network, with NSGs and rules defined (everything that is not listed, is blocked):
On the other hand, if you create a wildcard rule for the AzureLoadBalancer tag, you can create tons of rules on the Load Balancer which will allow traffic through, like RDP, SSH, FTP, etc.:
One thing I see quite often is a configuration like this:
See how they actually tried to lock down RDP traffic, so only a specific IP address is allowed to connect? Well, their load balancer had 3389 NAT’ed through to this server, so that rule was bypassed and never used.
Are you scared now? Good. I won’t go into much more detail here. This should cover the basics, so now you can go ahead and start locking down your networks!