The Importance of Network Watcher
“Broad network access” is one of NIST’s essential characteristics of cloud computing for good reason. Without network connectivity, resources in the cloud are useless. What use is a virtual machine if you cannot access the services that it hosts or integrate it with other systems?
This is why Azure’s Network Watcher is a critical troubleshooting tool. Network Watcher includes a number of tools that can be used in several scenarios. In this post, I will show you how you can figure out the root cause of a communications failure between virtual machines.
The Demo Lab
I have created a small demo lab in a resource group called rg-petri. There are two virtual machines in a simple flat network:
- vm-petri-01: This is a “bastion host” or “jump box” machine. I am only allowing remote desktop connections into this machine from outside of the virtual network. To access application servers, one must first log into vm-petri-01 and then jump from there to the required machine.
- vm-petri-02: This is my demo application server that a security consultant has recently hardened.
Administrators have just reported that they can no longer sign into the application server (vm-petri-02). They need that access because users are reporting that the services hosted on this machine are broken.
The connectivity check tool in Network Watcher, in Preview at the time of writing this article, performs an end-to-end test. Using a Network Watcher extension that is installed in the source virtual machine, 100 probes (packet tests) are sent to the destination machine from a defined source port to a defined destination port. This tells you more than a ping does, because ping only performs an “is it responding in a reasonable time?” test using ICMP.
You enter the following information about the source machine when using this tool:
- Choose a subscription: Select the subscription from your tenant.
- Resource Group: Pick the resource group with the source virtual machine.
- Virtual Machine: Select the source virtual machine.
- Port: Enter the source port that is used to send the traffic. You can enter 0 if the source port is dynamic, as it is for most protocols.
Enter the following information about the destination machine:
- Resource Group: Pick the resource group with the destination virtual machine.
- Virtual Machine: Select the destination virtual machine.
- Port: Enter the port that should be listening for the traffic.
This test can take a while because it really will send 100 packet tests from the source virtual machine to the destination virtual machine. I have run a test where vm-petri-01 is trying to send RDP traffic (destination port 3389) to vm-petri-02. All 100 probes failed. This indicated that there was a configuration issue of some kind, probably at the destination.
I like this tool because it allows me to run detailed network tests, just like I would from inside a virtual machine, but without needing to sign in. That matters because I might not have the necessary credentials to sign into the virtual machine.
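To make the difference between a port-level probe and a ping concrete, here is a minimal sketch of the idea in Python. It is not how the Network Watcher extension is implemented; it only illustrates sending a fixed number of TCP connection attempts to a destination port and counting the failures. The function name `run_tcp_probes` and its parameters are my own for illustration.

```python
import socket


def run_tcp_probes(host: str, port: int, count: int = 100, timeout: float = 1.0) -> dict:
    """Attempt `count` TCP connections to host:port and tally the outcomes.

    Illustrative only: a real end-to-end check also reports latency and
    the hops along the path, which this sketch does not attempt.
    """
    succeeded = 0
    for _ in range(count):
        try:
            # A successful three-way handshake proves that something is
            # listening on the port and that nothing on the path blocked it.
            with socket.create_connection((host, port), timeout=timeout):
                succeeded += 1
        except OSError:
            # Refused, timed out, or unreachable: count it as a failed probe.
            pass
    return {"sent": count, "succeeded": succeeded, "failed": count - succeeded}
```

Calling `run_tcp_probes("10.0.0.5", 3389)` from the jump box (the IP address here is a placeholder) would return a summary such as 100 sent and 100 failed if RDP traffic is blocked somewhere between the two machines.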
Next Hop is a tool that allows you to test the routing of a virtual machine. For example, what is the next hop if I send a packet to X? This is useful for troubleshooting scenarios such as, but not limited to:
- User-defined routing
- VPN/ExpressRoute with tons of network addresses
- VNet peering with gateway transit
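The question “what is the next hop if I send a packet to X?” comes down to picking the most specific route that contains the destination address. The sketch below shows that longest-prefix-match idea using only the Python standard library. The route table, the `Route` class, and the addresses are made up for illustration, and the sketch deliberately ignores tie-breaking between user-defined, BGP, and system routes with the same prefix.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str                        # address prefix in CIDR notation
    next_hop_type: str                 # e.g. "VnetLocal", "Internet", "VirtualAppliance"
    next_hop_ip: Optional[str] = None  # only set for appliance-style hops


def next_hop(routes: List[Route], destination: str) -> Optional[Route]:
    """Return the route with the longest prefix that contains `destination`."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: ipaddress.ip_network(r.prefix).prefixlen)


# Hypothetical route table: two system routes plus one user-defined route
# that forces traffic for one subnet through a firewall appliance.
ROUTES = [
    Route("10.0.0.0/16", "VnetLocal"),
    Route("0.0.0.0/0", "Internet"),
    Route("10.0.2.0/24", "VirtualAppliance", "10.0.1.4"),
]
```

With this table, a packet to 10.0.2.7 matches both the /16 and the /24, and the /24 wins, so the next hop is the appliance at 10.0.1.4. That is the kind of surprise that user-defined routing can introduce.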
IP Flow Verify
This tool is a simple test to validate that a virtual machine’s NIC is able to do one of the following:
- Receive data from a source
- Send data to a destination
IP Flow Verify does not check whether the packets will actually arrive; it is not an end-to-end test and packets are not actually sent. However, it does check things like how network security groups are affecting a single virtual machine. I suspect that I have a firewall issue because all of my packets are being blocked.
When you run IP Flow Verify, the following fields must be completed:
- Subscription: In case you have more than one subscription in your tenant
- Resource Group: Select the resource group that contains the virtual machine
- Virtual Machine: Enter the name of the virtual machine that you are testing
- Network Interface: In case the virtual machine has more than one
- Protocol: TCP (default) or UDP
- Direction: Inbound or outbound, depending on whether you are checking that traffic can get into this machine or out of it
- Local IP Address and Local Port: The network settings of the local machine
- Remote IP Address and Remote Port: The network settings of the remote machine
I first run a test from vm-petri-01. I want to see if it is allowed to send packets onto the subnet that are destined for TCP 3389 on vm-petri-02. The results come in and the test passes, so I suspect that all is well with vm-petri-01.
I then run a test to see if TCP 3389 traffic can get into vm-petri-02 from vm-petri-01. That test fails and indicates that there is an issue with a rule in a network security group. Somehow, traffic into vm-petri-02 is being blocked.
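Because IP Flow Verify evaluates rules rather than sending packets, its logic can be sketched offline. The example below walks a list of rules in priority order (lowest number first) and returns the first one that matches the flow, along with its name. It is a simplified model under my own assumptions: the `Rule` class, the rule named allow-rdp-from-subnet, and the address range are hypothetical, and real NSG rules support more fields (port ranges, service tags, destination prefixes) than this sketch does.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int       # lower number is evaluated first
    protocol: str       # "TCP", "UDP", or "*" for any
    source_prefix: str  # CIDR range, or "*" for any source
    dest_port: str      # a single port, or "*" for any port
    access: str         # "Allow" or "Deny"


def matches(rule: Rule, protocol: str, source_ip: str, dest_port: int) -> bool:
    """Check whether one rule applies to the given inbound flow."""
    if rule.protocol not in ("*", protocol):
        return False
    if rule.dest_port not in ("*", str(dest_port)):
        return False
    if rule.source_prefix == "*":
        return True
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source_prefix)


def verify_flow(rules, protocol: str, source_ip: str, dest_port: int):
    """Return (access, rule_name) for the first rule that matches the flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, protocol, source_ip, dest_port):
            return rule.access, rule.name
    return "Deny", None  # nothing matched at all; treat as blocked


# Hypothetical inbound rules: one custom allow rule plus the default deny.
INBOUND_RULES = [
    Rule("allow-rdp-from-subnet", 100, "TCP", "10.0.0.0/24", "3389", "Allow"),
    Rule("DenyAllInBound", 65500, "*", "*", "*", "Deny"),
]
```

With these rules, `verify_flow(INBOUND_RULES, "TCP", "10.0.0.4", 3389)` is allowed by allow-rdp-from-subnet, while the same request from 203.0.113.9 falls through to DenyAllInBound. Getting the name of the deciding rule back is what makes this kind of check useful for troubleshooting.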
Security Group View
There are two ways that a network security group can be deployed or associated:
- Virtual Machine NIC: Outbound rules can block traffic before it reaches the subnet, and inbound rules can prevent traffic from reaching the NIC.
- Virtual Network Subnet: Inbound and outbound rules can prevent traffic from getting onto or off the subnet.
If we look at the topology that I generated earlier (at the top of the post) you will see that there are two network security groups (NSGs). A rule in one of these NSGs is the culprit:
- nsg-petri-sn01 is associated with the subnet.
- vm-petri-02-nsg is associated with the vm-petri-02 virtual machine’s NIC.
Network Watcher has a tool called Security Group View. Instead of bouncing around the Azure Portal, we can see all NSG rules that are in effect on a virtual machine broken up into inbound and outbound.
You need to enter the following to configure the tool:
- Subscription: The name of the subscription containing the virtual machine
- Resource Group: Specify the resource group that the virtual machine is in
- Virtual Machine: Pick the virtual machine
- Network Interface: Select the affected NIC of the virtual machine
The results can be complex but you can download them for easier analysis. In the results, you can see the overall effective rules, the rules associated with the subnet, and the rules associated with the NIC.
Unfortunately, the effective rules view is a little confusing but you can make sense of it. Read the inbound rules from bottom to top. That is the path that an inbound packet will pass through. First, it hits the subnet’s NSG. You can tell when the rules switch from subnet to NIC because the last rules are always the default rules. Once you hit a set of default rules in the middle, as in this case, you are at the end of the NSG and about to move into the next NSG.
Let’s pretend to be an RDP packet trying to reach vm-petri-02. You come in and hit the subnet NSG (marked in blue). The user-defined rule, allow-rdp, is allowing TCP 3389 traffic, which is good. We have hit a green light in this NSG, which allows us to attempt to get into the NIC of vm-petri-02.
Now we hit the NSG associated with the NIC, marked in green. The first rule we hit, marked in red, is a user-defined rule called securityexpert-blockingall. That rule is denying all traffic going to all ports. This is where our beleaguered RDP packet dies and might explain why users cannot access the services on this machine anymore.
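The journey of that RDP packet can be replayed as a small model: inbound traffic is checked against the subnet’s NSG first and then against the NIC’s NSG, and it must be allowed by both. The NSG and rule names below come from this lab, but the priorities, the `Rule` class, and the simplification of matching on destination port only are my assumptions for the sake of a short example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int  # lower number is evaluated first
    port: str      # a single destination port, or "*" for any port
    access: str    # "Allow" or "Deny"


# Every NSG ends with a default inbound deny; the other defaults are omitted.
DEFAULT_DENY = Rule("DenyAllInBound", 65500, "*", "Deny")


def first_match(rules: List[Rule], port: int) -> Rule:
    """Return the highest-priority rule that applies to the destination port."""
    ordered = sorted(list(rules) + [DEFAULT_DENY], key=lambda r: r.priority)
    return next(r for r in ordered if r.port in ("*", str(port)))


def trace_inbound(nsgs: List[Tuple[str, List[Rule]]], port: int):
    """Walk the NSGs in order (subnet first, then NIC) and record each verdict."""
    trace = []
    for nsg_name, rules in nsgs:
        rule = first_match(rules, port)
        trace.append((nsg_name, rule.name, rule.access))
        if rule.access == "Deny":
            break  # the packet dies here and never reaches the next NSG
    return trace


# Priorities are hypothetical; names match the lab described above.
LAB_NSGS = [
    ("nsg-petri-sn01", [Rule("allow-rdp", 100, "3389", "Allow")]),
    ("vm-petri-02-nsg", [Rule("securityexpert-blockingall", 100, "*", "Deny")]),
]
```

Running `trace_inbound(LAB_NSGS, 3389)` yields an Allow from allow-rdp in nsg-petri-sn01 followed by a Deny from securityexpert-blockingall in vm-petri-02-nsg, which mirrors what Security Group View shows for this machine.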
If you cannot find anything wrong with the NSGs, then it is time to start looking inside the guest OS. Some things to consider might be:
- Windows Firewall: TCP 1433 is not open on Azure machines with template installations of SQL Server.
- The service: Is the service that is supposedly listening, really working at all?
- Third-party security software: Many Microsoft support articles recommend turning off third-party security software for a reason!
It is a good idea to get to know Azure Network Watcher while things are working, because when things go wrong, its set of tools might help you get to the root cause very quickly. This is especially true if you already know how to use them.