Wednesday 9 May 2012

WHAT, WHY, HOW……Nexus 1000v


Although the Nexus 1000v has been with us for a couple of years now, it’s still a bit of a mystery as to what it is, why it was created and how it works or gets deployed. So I thought I’d take the time to give you my explanation and thoughts about the product.

So WHAT is it?
The Nexus 1000v is a virtual switch, meaning it is able to provide network communication like a switch but runs at the virtual machine level, completely independent of a hardware device, so a switch without a switch! It works in conjunction with VMware vSphere 4.0 (upwards) and has already been confirmed to work with Hyper-V 3.0 when it launches. Anyone who is familiar with VMware should know that in order to provide switching for virtual machines, each vSphere host runs a local vSwitch within its hypervisor to distribute traffic from multiple VMs out through the shared network cards. The Nexus 1000v forms part of Cisco’s Data Center class Nexus series of switches and runs the NX-OS operating system.

But WHY did Cisco make it?
Well, IT departments can be a little “siloed”, with different teams of people responsible for different areas of the infrastructure. The networking team manage the network, the server team manage the servers and the VMware team manage the virtual environment… simple, yes?
Unfortunately not… and I know this from experience. When a new blade chassis arrives at a customer site, the server team or a server engineer will start the installation. When he gets to the switches in the back of the chassis he asks the network engineer to help with the configuration, and the network engineer simply states “I’m a networking guy, we don’t configure the chassis switches, that’s your job”. So the server guy clicks his way through the options until he’s happy that some communication is working. He then passes the new chassis over to the VMware engineer, who installs ESXi and starts setting up the basic config. When he gets to the local vSwitch configuration, he asks the networking engineer to help and I’m sure you can guess the response (and I am one of the few people who can get away with saying this, as many of you will know me as a server guy, a VMware guy and a network guy). So what we’re left with is three sections of switching, all configured and managed by three separate teams, with limited visibility of the environment as a whole. Cisco have never been too happy about this, mainly because as soon as someone has a performance issue with their application, computer or server, the first thing to be blamed is the network, and as it’s split up into all these different sections it’s very hard to state categorically that it’s not the network without a few days of investigation, by which time the issue has magically disappeared and normal service is resumed.
So Cisco and the networking team want to take back control of all areas of switching and have visibility of all traffic on their network, to reduce the chances of misconfiguration and simplify troubleshooting. At the same time, VMware were aware that the major network vendors weren’t happy with the functionality of their vSwitch, so they created an open API for their Distributed Virtual Switch (DVS) and invited all the vendors to see if they could do better. Cisco were the first to respond to the invitation with the Nexus 1000v, and I have to say they have done a pretty good job with it.

So HOW does it work?
Well first off you need to make sure you can deploy it. The Nexus 1000v utilises the VMware Distributed Switch technology to function and as such requires VMware Enterprise Plus licensing. The Nexus 1000v itself is then purchased per CPU, in line with the licensing model VMware uses. It can be a little expensive if purchased on its own, but Cisco do offer it for free when purchased alongside Cisco UCS. I think because of this, people seem to think the 1000v can only be used with UCS, but this isn’t the case. The 1000v only requires vSphere (or Hyper-V 3.0) and can be deployed on any hardware platform.

It is made up of two parts, the Virtual Supervisor Module (VSM) and the Virtual Ethernet Module (VEM). The VSM is the control plane for the 1000v and integrates with multiple VEMs, which are embedded into each vSphere host. The VSM runs as a virtual machine and can be deployed via an OVA template, and it can be made redundant across a pair of VSM virtual machines configured as an active/standby pair. The VSM also benefits from features such as HA and vMotion within the virtual environment.

Once the OVA is deployed, the configuration and management of the 1000v is all done via the command line (CLI), which is how networking people like to interact with things. The CLI commands are NX-OS based and simple to use. The setup wizard allows for pairing with a vCenter Server, and the 1000v then presents itself as a DVS within vCenter. The network team are then able to create each of the port-groups required by the virtual environment and apply standard networking policies right down to the virtual machine level. Two key features here are NetFlow and QoS, meaning that you can take the existing Quality of Service settings running in the rest of the network and apply them even to the traffic between two VMs on the same host!
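To give a flavour of the CLI side, here is a rough sketch of pairing the VSM with vCenter and creating a port-profile that then appears in vCenter as a port-group. The connection name, IP address, datacenter name, port-profile name and VLAN number are all placeholders for your own environment, and the exact syntax can vary slightly between NX-OS releases:

    ! Point the VSM at vCenter (values are examples only)
    svs connection vcenter
      protocol vmware-vim
      remote ip address 192.168.1.10
      vmware dvs datacenter-name MyDatacenter
      connect

    ! A vethernet port-profile, which shows up in vCenter as a port-group
    port-profile type vethernet VM-Data
      vmware port-group
      switchport mode access
      switchport access vlan 100
      no shutdown
      state enabled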
Once the configuration is done and all the port-groups are available in vCenter, the VMware team (working with the network team this time) can begin to migrate the hosts and VMs over to the 1000v. When a host is added to the 1000v, Update Manager is used to deploy the VEM component into the hypervisor. The VEM itself handles all the switching independently of the VSM, although all configuration is referenced through the VSM. When new port-profiles are created at the VSM level, the VEM in each connected host is updated. Once the port-groups are created they become available for the VMware team to select from the drop-down list of port-groups when configuring a network card for a VM.
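As a quick sanity check after adding a host, each VEM should appear as a module on the VSM (the VSMs themselves take modules 1 and 2, with VEMs from module 3 upwards), and the host itself can confirm the VEM is loaded. These are the sort of commands I’d expect to use, although the output format varies by release:

    ! On the VSM: each host's VEM appears as a module alongside the two VSMs
    show module

    # On the ESXi host's shell: confirm the VEM software is installed and running
    vem status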
So in conclusion the Nexus 1000v passes control of the network back to the network team and provides a variety of enhancements over the traditional vSwitch.
The product can be a little tricky to install and falls in a grey area between Virtualisation and Networking.
There is a lot of development by Cisco and other vendors in this area at the moment, and I aim to follow this article up with more about the Virtual Security Gateway (VSG) and VM-FEX as an alternative way to control virtual machine networking.
For anyone who’s interested in testing the Nexus 1000v out, it is available as a free 60-day trial and can be downloaded from either Cisco’s or VMware’s website.
There are some considerations and best practices for Nexus 1000v, and below are my thoughts that may be of interest to the more technical people reading this.

Considerations:

COPY RUN START!
The Nexus 1000v is a switch and needs to be treated as one. Anyone deploying it for the first time will see that as you enable a new port-profile, the relevant port-group is created in vCenter immediately. However, just because it’s written to vCenter doesn’t mean it’s saved on the Nexus 1000v. Always remember to copy the running configuration to the start-up configuration whenever you make a change or create a new port-profile.
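The command is the same one you’d use on any other NX-OS (or IOS) switch:

    ! Save the running configuration so it survives a VSM reload
    copy running-config startup-config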

SYSTEM VLANs
When setting up the “Ethernet” port-profile for the uplink ports, you are asked to specify the “system VLANs”. The deployment instructions will tell you to make sure that the Control, Packet and Management VLANs for the Nexus 1000v are set as system VLANs. However, the purpose of a system VLAN is not always explained. A system VLAN is a VLAN that is able to become active when a host server first powers on and activates the VEM, before the VEM has communicated with the VSM. So whatever VLAN the VSM sits on must be a system VLAN, but at the same time whatever VLAN the vCenter server is on must also be set as a system VLAN, otherwise communication to the vCenter server may not be established! Also, if you’re using NFS or iSCSI to access datastores, it’s a must to make these VLANs system VLANs as well.
It is also essential that the system VLAN is configured on any “vethernet” port-profile used by one of these VMkernel ports or by the Nexus 1000v and vCenter virtual machines.
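As an illustration, here’s roughly what that looks like on the uplink and on a management vethernet port-profile. The names and VLAN numbers below are placeholders; the important part is that the same VLAN appears in both the switchport configuration and the system vlan line:

    ! Uplink profile: the trunk carries everything, but the control/packet/management
    ! and vCenter VLANs (10 and 20 here, as examples) are marked as system VLANs
    port-profile type ethernet Uplink
      vmware port-group
      switchport mode trunk
      switchport trunk allowed vlan 10,20,100-110
      system vlan 10,20
      channel-group auto mode on mac-pinning
      no shutdown
      state enabled

    ! Management vethernet profile: the same VLAN is a system VLAN here too
    port-profile type vethernet Management
      vmware port-group
      switchport mode access
      switchport access vlan 10
      system vlan 10
      no shutdown
      state enabled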

BACKDOOR
Always think about an alternative method for accessing the local ESXi console in case you make a mistake during the installation of the Nexus 1000v or the migration of the hosts and VMkernel ports. UCS provides KVM console access directly to the host, but other vendors’ hardware may need an iLO or DRAC to be enabled first.

HOW MANY NICs
It used to be that VMware required around 6 to 8 NICs to provide all the different types of communication, and when the Nexus 1000v was first launched this design was the recommended way to go. Recently, however, things have changed, and with 10Gb capabilities and several reviews of QoS designs, the new “best practice” from Cisco for the Nexus 1000v is to utilise only 2x 10Gb NICs and have the Nexus 1000v manage all traffic across all port-groups. In the case of Cisco UCS, it is recommended that the two NICs created in the Service Profile are NOT enabled for hardware failover, as failover convergence handled by the Nexus 1000v will be quicker and more stable.
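If everything shares just two 10Gb uplinks, QoS marking on the 1000v is what stops a traffic class like vMotion from starving management or VM traffic. The snippet below is only a rough sketch of that idea, with placeholder names, an example VLAN and an example CoS value, and the exact QoS command set differs between NX-OS releases:

    ! Mark vMotion traffic with a CoS value so the upstream network can prioritise it
    policy-map type qos Mark-vMotion
      class class-default
        set cos 4

    ! Apply the marking policy to the vMotion vethernet port-profile
    port-profile type vethernet vMotion
      vmware port-group
      switchport mode access
      switchport access vlan 30
      service-policy type qos input Mark-vMotion
      no shutdown
      state enabled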

DON’T RUSH!
So you’re ready to migrate your hosts and VMs over to the Nexus 1000v, and the wizard in vCenter will let you do every single host and VM all at once, which is pretty cool… but don’t ever do that!!! The worst thing that can happen when migrating hosts and VMs is to lose communication mid-way through. So in order to mitigate against this you should follow these steps (a few verification commands are sketched after the list):

1. Disable HA and DRS on the cluster before migration
2. Start with a single host, and not the one hosting vCenter or the VSM
3. Move one of the two NICs first, along with the VMkernel ports for Management
4. Test communication with each VMkernel port
5. Move a single test VM to the Nexus 1000v
6. Test communication
7. Move the second NIC in the first host and any other VMs on the host (then test)
8. Do steps 3 and 4 on a second host
9. Test vMotion between the first two hosts
10. Move the second NIC in the second host along with its VMs
11. Once you’re happy everything is working on these two hosts, you can now move the rest of the hosts, VMkernel ports and virtual machines
12. Leave vCenter till last.
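
At each of the “test” steps, alongside pinging the VMkernel interfaces, it’s worth checking things from the VSM side as well. These are the sort of commands I’d reach for on the VSM (the output format varies by release):

    ! Confirm the VSM still has a healthy connection to vCenter
    show svs connections

    ! Confirm each migrated host's VEM is still online as a module
    show module

    ! List the virtual interfaces, which VM owns each one and which module (host) it sits on
    show interface virtual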

And hopefully that covers everything!