Invasion of the IP Snatchers – A DEVOPs Internal Cloud Horror Story (Pre-SDN)

Let’s face it. There is a “manufactured” belief that SDN (Software Defined Networking) is a magic bullet that will eliminate all the headaches bogging down IT professionals, who struggle to expand their networks fast enough to match the explosion of virtual machines and cloud computing implementations.

Well, I have some bad news for you: I believe challenges are only going to increase in velocity as well as severity, adding complexity to the brewing virtualization pot.

But that’s a whole separate story, which we’ll uncover at another time.

First, let’s look at what awaits you when you simply try to accommodate a huge number of virtual machines, ten times or more what you had in mind when all you had were physical machines or virtualization hosts that could run just a couple of VMs. Let’s assume you are not yet using any SDN type of solution.

Nowadays you can buy an 8-core Dell server with ~400GB memory for less than $13,000. If your average Virtual Machine requires 5GB of memory, you could easily find yourself serving 80-100 VMs off that single server.

Group a few of those servers and you can easily overflow a typical Class C (/24) subnet, which holds only 254 usable addresses.
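To put rough numbers on this, here is a minimal Python sketch of the arithmetic. It assumes every VM takes exactly one IP and it ignores hypervisor memory overhead; the function names are made up for illustration.

```python
import math

def vms_per_host(host_memory_gb, vm_memory_gb):
    """How many VMs fit in a host's memory, with no overcommit."""
    return int(host_memory_gb // vm_memory_gb)

def hosts_to_overflow(usable_ips, vms_on_each_host):
    """Smallest number of hosts whose VMs need at least `usable_ips` addresses."""
    return math.ceil(usable_ips / vms_on_each_host)

# Figures from the post: ~400GB per server, 5GB per average VM.
per_host = vms_per_host(400, 5)

# A /24 (Class C sized) subnet has 254 usable addresses.
hosts_needed = hosts_to_overflow(254, per_host)

print(per_host, "VMs per host;", hosts_needed, "hosts exhaust a /24")
```

Under these assumptions three hosts still fit in a /24 and the fourth one pushes you over. The upper end of the 80-100 range presumably relies on memory overcommit, which this sketch does not model.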

Here are a few lessons learned you may want to look into as you design your DevOps Internal Cloud Network Architecture:

Leadership and Management:

  1. It is paramount that all relevant team members (network team, virtualization admins, sys admins, storage and server suppliers, clients) are aligned with the project’s objectives and feel responsible for its success. That’s because you will stumble into challenges, and people might say “It is too complex!”, “Supernets don’t work well with this device”, or “It takes too much time until ‘they’ do their part in the configuration”. So you want to make sure everyone is involved and continues to steer the boat to the Promised Land…
  2. These days virtualization and sys admins have a bunch of technologies thrown at them as part of the cloud infrastructure. They have to quickly learn how to operate and maintain blade infrastructure, storage units from EMC, NetApp, IBM and others, as well as blade center switches. You will not have all the knowledge in your internal team upfront, so you will need to turn to contractors, suppliers and others to set things up for you. But always make sure you take the time and pay the fee to have them document what they do for one work item (one server, one LUN, one VLAN setup), and then immediately use this procedure to do the rest of the same work items YOURSELF. You can always learn and take courses as you go, but never wait for it. Leveraging and reusing your contractor’s knowledge will save you money (you pay them to set up just 1-2 units rather than all of your units). It will also save you downtime, since you can spread the work over time per your client’s schedule rather than per the contractor’s schedule, which tends to result in massive all-or-nothing downtime and lots of troubleshooting the morning after. Reusing your contractor’s knowledge that way also increases your team’s self-confidence in general.
  3. Make sure you are aware of any gaps in knowledge between network professionals and system professionals. There are fine network professionals who don’t know exactly what a VMware vSwitch is or how VMs get their MAC and IP addresses through a physical server. This can be easily resolved through a quick discussion and a few examples, but you have to notice it and address it, or you’ll get bitten by it at the worst possible time.
  4. When your contractors say “Ah, this should work”, ask them whether they have actually done it in an environment similar to yours, or can get someone who has done it to instruct them. If they can’t confirm it, then pay for the extra time they’ll need to do the homework. This extra spending will pay off in shortened downtime and more assurance of the project’s success.
  5. Define exactly what successful project implementation means: downtime length, what should be up and running, a quick rollback plan and so on.

Network Planning:

  1. Go big. If today’s servers can run 20 VMs, then in a year they will run 80 VMs. Rest assured, someone will need all those VMs, so plan for a big number of IPs.
  2. Most of your internal VMs do not need a public IPv4 address, and you don’t want to jump into IPv6 just to get a lot of IPs, since the universe is still not set for it.
  3. Go for internal (RFC 1918 private) ranges of IPs, such as 10.x.x.x.
  4. If your network is based on Class C sized (/24) subnets, you’d probably need to carve your internal IP ranges into /24 slices as well. However, you’d want subnets with more than 254 IPs. For this you could use “Supernets”: two adjacent slices of 254 IPs (for example 10.x.10.0/24 and 10.x.11.0/24) are merged under a single /23 mask (255.255.254.0) and use the same gateway. This will provide you with 510 usable IPs per “Supernet”.
  5. If you wonder why not create 1000 IP Supernets or more, so your huge clusters could use a single standard subnet, consider that broadcasts across so many hosts could bog down your network. For now 512 IP Supernets seem to be a nice middle ground.
  6. Your local and wide area network team should add the relevant routes to those new subnets, so you don’t lose companywide connectivity.
  7. Consider on which segments you activate DHCP and make sure it can accommodate the new subnets.
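The Supernet arithmetic from items 4 and 5 can be checked with Python’s standard `ipaddress` module. This is only a sketch: the second octet `20` is a placeholder (the post just says 10.x), and the helper names are invented for this example.

```python
import ipaddress

def merge_slices(cidrs):
    """Collapse adjacent networks into the smallest list of covering networks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return list(ipaddress.collapse_addresses(nets))

def usable_hosts(net):
    """Addresses left after reserving the network and broadcast addresses."""
    return net.num_addresses - 2

# Two adjacent /24 slices that start on an even third octet merge into one /23.
aligned = merge_slices(["10.20.10.0/24", "10.20.11.0/24"])
supernet = aligned[0]
print(supernet, supernet.netmask, usable_hosts(supernet))

# A pair that starts on an odd third octet cannot share a /23 mask.
misaligned = merge_slices(["10.20.11.0/24", "10.20.12.0/24"])
print(len(misaligned), "separate networks remain for the odd-aligned pair")

# For comparison, a /22 doubles the broadcast domain again.
bigger = ipaddress.ip_network("10.20.8.0/22")
print(bigger, usable_hosts(bigger))
```

The practical takeaway is the alignment rule: a 512-address Supernet has to start on an even third octet (10, 12, 14 and so on), so it pays to lay out the slices in pairs when you first plan the ranges.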

Server Side and VLANs:

  1. Now let’s take care of the server side. Basically you’d want all your Virtual Machine hosts to be capable of launching VMs on any of those subnets. If you have 4 NICs (network cards) on each server and dedicate a NIC to each subnet, you can only support 2-3 subnets (leaving 1 or 2 NICs for iSCSI or management). This also means your ESX, Xen or Hyper-V servers can’t easily be set to support new subnets without downtime and scheduling delays to sync this operation with the network team(s). That’s why you want to use VLANs.
  2. Using VLANs you could set your physical switches to deliver data for many more subnets to the same 2-3 physical NICs your servers have.
  3. To set your Virtual Host servers with VLAN support, you need to sync the VLAN tags (labels) across your backbone switches, your blade center switches and the virtual switches (in case of VMware ESX) defined in your (ESX) servers.
  4. For each subnet you’d set your ESX server with a new vSwitch (or a port group on an existing vSwitch), labeled to accommodate only a specific VLAN, associated with a 512 IP Supernet. All your (ESX) clusters should use the same virtual switch naming convention and VLAN labeling. So “VM VLAN 1211” means the same 10.x.11 subnet across all of your (ESX) clusters.
  5. Your VMs can now easily “move” to another subnet once their current one is exhausted, by simply re-assigning their network cards to a different vSwitch.
  6. Always aim to set one of your VLANs as the “native” (default) VLAN for servers that you can’t or won’t configure VLAN settings on. Those servers will automatically use the subnet associated with the default VLAN.
  7. Leave a subnet for VMs that require a static IP. That way they will not face fierce competition from the massive DHCP requests of VMs that can make do with a DHCP-assigned IP.
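Keeping the VLAN labels in sync across clusters, as item 4 asks, is easy to get wrong by hand. Below is a small Python sketch of a consistency check. It assumes you have already exported each cluster’s label-to-VLAN-ID mapping into a dictionary by whatever means you have; the cluster names, the second label and the function name are hypothetical.

```python
from typing import Dict, Optional

def find_mismatches(
    clusters: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, Optional[int]]]:
    """Return labels whose VLAN ID differs, or is missing, in some cluster."""
    labels = set()
    for switch_labels in clusters.values():
        labels.update(switch_labels)

    mismatches = {}
    for label in sorted(labels):
        # None marks a cluster where the label is not defined at all.
        seen = {name: cfg.get(label) for name, cfg in clusters.items()}
        if len(set(seen.values())) > 1:
            mismatches[label] = seen
    return mismatches

# Hypothetical export of three clusters.
clusters = {
    "cluster-a": {"VM VLAN 1211": 1211, "VM VLAN 1213": 1213},
    "cluster-b": {"VM VLAN 1211": 1211, "VM VLAN 1213": 1231},  # typo in the ID
    "cluster-c": {"VM VLAN 1211": 1211},                        # label missing
}

mismatches = find_mismatches(clusters)
for label, per_cluster in mismatches.items():
    print(label, per_cluster)
```

Running a check like this before moving a VM between clusters catches the case where the same label quietly lands the VM on a different subnet.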

Leadership and Management Revisited:

  1. Create a cross-device backup process which takes snapshots of your vCenter, ESX servers, blade center and blades, as well as storage units and switch configurations. You will need all those to recover or troubleshoot…
  2. Better yet, try to create configuration templates for all those devices, which you could either restore or use to deploy new units: servers, VLANs, switches, etc.
  3. Monitor your IP allocation use and make sure the network team verifies that your backbone switches are not overloaded by all those new VMs.
  4. When you have finished this new project, show your management team the challenges you tackled to save them time and money and allow future growth. Maybe even write a blog post about your lessons… 🙂
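For the IP allocation monitoring in item 3, even a tiny script beats guessing. The Python sketch below flags subnets that are close to full. The allocation counts, the 80% threshold and the function names are all assumptions for illustration; in practice the counts would come from your DHCP server or IP management tool.

```python
import ipaddress

def utilization(cidr, allocated):
    """Fraction of usable addresses in `cidr` that are already allocated."""
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 2  # minus network and broadcast addresses
    return allocated / usable

def nearly_full(allocations, threshold=0.8):
    """Subnets whose utilization is at or above the threshold."""
    return [
        cidr
        for cidr, used in allocations.items()
        if utilization(cidr, used) >= threshold
    ]

# Hypothetical allocation counts per 512 IP Supernet.
allocations = {
    "10.20.10.0/23": 470,
    "10.20.12.0/23": 120,
}

warnings = nearly_full(allocations)
for cidr in warnings:
    print(cidr, "is at", format(utilization(cidr, allocations[cidr]), ".0%"))
```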

That’s about it for now…I am sure you’d have a lot to add and comment on this post…
