Hi folks,
Dan Cuomo here for our final installment in this blog series on synthetic accelerations, covering Windows Server 2019. In Server 2019, we built on what we learned from Dynamic VMQ in Server 2012 R2 and VMMQ in Server 2016 to bring you Dynamic VMMQ (d.VMMQ).
The multi-release journey is designed to achieve one primary goal: improving your (and your tenants’) networking experience in the Software Defined Data Center. This may come in the form of reducing CPU processing for network traffic and/or ensuring a smooth and consistent experience for the virtual machines on your host, which ultimately means happy tenants running more virtual machines (and no midnight calls to troubleshoot the all-too-common “network slow-down”).
Public Service Announcement: Most of what you see below will not apply if you’re using an LBFO team. Microsoft recommends using Switch Embedded Teaming (SET) as the default teaming mechanism whenever possible, particularly when using Hyper-V.
Before we get to the good stuff, here are the pointers to the previous blogs:
As a quick refresher, Virtual Receive Side Scaling (on the host) creates an indirection table which enables packets to be processed by multiple, separate processors. The indirection table is always established by the OS, but the distribution of packets to those processors can either be done in the OS or offloaded to the NIC; when it is offloaded to the NIC, we call this VMMQ.
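To make the indirection table a bit more concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function names are made up for illustration, CRC32 stands in for whatever hash the OS or NIC actually computes, and the round-robin fill is an assumption rather than how Windows builds the table.

```python
import zlib

def build_indirection_table(cpus, entries=16):
    """Spread table entries round-robin over the CPUs allowed for this vNIC (assumed policy)."""
    return [cpus[i % len(cpus)] for i in range(entries)]

def cpu_for_flow(table, flow):
    """Hash the flow tuple, then use the hash to index the table and pick a CPU."""
    key = "|".join(str(field) for field in flow).encode()
    return table[zlib.crc32(key) % len(table)]  # CRC32 is a stand-in hash

table = build_indirection_table(cpus=[2, 4, 6, 8])
flow = ("10.0.0.5", 49152, "10.0.0.9", 445)  # src IP, src port, dst IP, dst port
print(table)                      # which CPU each table entry points at
print(cpu_for_flow(table, flow))  # the CPU that would process this flow
```

The point of the model is that packets from the same flow always land on the same CPU, while different flows can be spread across several CPUs simply by changing what the table entries point at.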
Originally, we enabled the dynamic updating of the indirection table, called Dynamic VMQ, in Windows Server 2012 R2. However, in part due to the rearchitected design in Windows Server 2016 to bring VMMQ, Dynamic VMQ was not available in Windows Server 2016.
Now in Windows Server 2019 we can dynamically remap VMMQ’s placement of packets onto different processors. We had three primary goals:
I’m starting to think those midnight network slow-downs may be a thing of the past!
When network throughput is low, Dynamic VMMQ enables the system to coalesce traffic received on a virtual NIC to as few CPUs as possible; we call this queue packing because we’re packing the queues onto as few CPU cores as is necessary to sustain the workload. Queue packing is more efficient for the host, as the system would otherwise need to manage the distribution of packets across more CPUs; the more CPUs are engaged, the more the system must work to ensure all packets are properly handled.
The picture below shows a virtual NIC receiving a low amount of network traffic. You can see we’re using the performance counter Hyper-V Virtual Switch Processor > Packets from External/sec and there is one bar for each CPU core engaged. Only one CPU core (the green bar) is processing packets destined for a virtual NIC. The system has coalesced or packed all the queues onto one CPU core as was necessary to sustain the workload.
Here’s a video showing the Dynamic Coalescing. Note, the video is sped up to show the process occurring a bit quicker than normal.
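The queue packing idea described above can be sketched with a small Python model. This is not how Windows decides where to place queues; the function name, the per-CPU capacity figure, and the "take the first N CPUs" policy are all assumptions chosen to keep the example short.

```python
import math

def pack_queues(allowed_cpus, load_mbps, per_cpu_mbps, entries=16):
    """Fill the table using only as many CPUs as the current load requires."""
    needed = math.ceil(load_mbps / per_cpu_mbps)
    # Always keep at least one CPU, and never exceed the CPUs we are allowed to use.
    active = allowed_cpus[:max(1, min(needed, len(allowed_cpus)))]
    return [active[i % len(active)] for i in range(entries)]

cpus = [2, 4, 6, 8]
# Low throughput: everything is packed onto a single CPU.
print(sorted(set(pack_queues(cpus, load_mbps=500, per_cpu_mbps=4000))))   # [2]
# Higher throughput: the table spreads out to three CPUs.
print(sorted(set(pack_queues(cpus, load_mbps=9000, per_cpu_mbps=4000))))  # [2, 4, 6]
```

As the load drops again, recomputing the table shrinks the set of engaged CPUs, which is the coalescing behavior shown in the counter above.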
After a hard day’s work, you head home for the day. Little did you know, your CIO is a night-owl and a few hours later begins working right as some backups begin on the file servers hosting the user profile.
I think we all know the story that’s about to unfold. Your CIO calls in the support team after-hours because of the terrible performance. The following day, you’ll be asked to root cause what happened and develop an action plan to ensure the CIO never has this experience again. You think to yourself:
“this would be about the best place in the entire world to work, if it weren’t for all these complainers…” ;)
One of the challenges with VMMQ in Windows Server 2016 (Static VMMQ) is that the indirection table – the assignment of a VMQ to be processed by a specific processor – cannot be updated once established.
If another workload (for example VM B) starts receiving more traffic and one of its queues is mapped to the same processor as a queue from VM A, one of them may suffer. This is what happened to your CIO: the queues for the file server hosting his/her user profile were on the same processors as another workload performing backups.
Note: I’ve seen folks try to avoid this by preventing a NIC from using the same processors used by other NICs (overlapping). In practice, we’ve seen this provide very little value, if any, with SET teams. First, most people misconfigure this. Second, even if you have it configured correctly, you’re forced into constraining your adapters to using fewer processors, which only compounds the original problem. We do not recommend changing the default RSS Processor Array (which governs the indirection table creation) unless directed by Microsoft Support.
With Windows Server 2019 and Dynamic VMMQ, we can now automatically move queues on an overburdened processor to other processors that aren’t doing as much work. Now workloads will have a more consistent and performant experience.
In the following video (sorry, no sound), we show a running network workload. Eventually we start a new process that competes for and consumes the CPU that is processing packets. In Windows Server 2016, the virtual machine would start receiving fewer packets, affecting the throughput into the VM and your sleep patterns as your CIO calls you into the office to troubleshoot.
However, in this video you can see that the system dynamically updates the indirection table and moves the processing of network traffic from CPU3 to an available processor (CPU1) when another workload starts consuming the CPU cycles. This allows the VM to continue receiving the same amount of traffic despite having a competing workload.
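The remapping shown in the video can be modeled in a few lines of Python. Treat this strictly as an illustration: the function name, the 90% threshold, and the "move to the least busy CPU" rule are assumptions, and the model does not re-estimate CPU load after a move the way a real scheduler would have to.

```python
def remap_overloaded(table, cpu_busy_pct, threshold=90):
    """Move table entries off CPUs at or above the threshold to the least busy CPU."""
    target = min(cpu_busy_pct, key=cpu_busy_pct.get)
    return [target if cpu_busy_pct[cpu] >= threshold else cpu for cpu in table]

table = [3] * 16                      # every entry points at CPU3
busy = {0: 35, 1: 5, 2: 40, 3: 97}    # CPU3 is now contended, CPU1 is nearly idle
print(remap_overloaded(table, busy))  # all sixteen entries now point at CPU1
```

With a static table (the Windows Server 2016 behavior), the entries would stay on CPU3 and the VM would have to compete with the new workload for the same core.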
When a virtual NIC is idle, it doesn’t need any receive queues. However, if no queues are allocated (or perhaps only a bare minimum), and a burst of traffic comes in destined for that virtual NIC, it won’t be able to process all the data because we can’t just allocate queues all willy-nilly. Willy-nilly is bad...
To ensure that we can meet an immediate burst of traffic, we pre-allocate queues for an idle workload. We call this queue parking (not to be confused with core parking).
You can see the allocation of queues across a receive processor for a particular virtual NIC using the perfmon counter Hyper-V Virtual Network Adapter VRSS > Instance (per virtual NIC) > Receive Processor.
It’s important to note that there are always 16 entries shown, and if you look closely, you’ll note that there are two bars of the same height. You can control how many receive queues per processor are used for all virtual NICs (although we recommend that you stick with the defaults) by modifying the MaxProcessors value on the physical adapter.
The setting on the physical adapters caps the processors to be used by a virtual NIC.
If you only want to cap certain virtual NICs then instead of setting the value on the physical adapters, just set it on the virtual NIC using Set-VMNetworkAdapter -VRSSMaxQueuePairs <value>
Then review the updates to the vNIC as shown below.
As you can see, the requirements to implement and manage the feature are greatly reduced.
I hope you have enjoyed this series on synthetic accelerations and found it useful. As you can see, we’ve steadily worked towards reducing the setup complexity, improving the stability, and increasing the performance for your virtualized workloads. Previously you had to set up complicated adapter schemes, tune the system, avoid processors, and more… Now you simply install Windows and Hyper-V, test, and monitor.
Please let us know in the comments if you have any questions!
Thanks,
Dan