vSphere ESXi, vSwitch0, vmk0 and host profiles

9 03 2011

So I’ve noticed a rather worrying feature of ESXi (I’m on 4.1 U1) in the configuration that I have. I’ve created our ESXi servers to use a vDS for VM traffic but for management, vMotion and FT I am sticking with the local vSwitch0. vSwitch0 has three active NICs and three port-groups, one for each of the above VMkernel tasks (management, vMotion and FT).

These port-groups have the NICs set to a specific order with only 1 active and 2 in standby, ensuring that, although only 1 NIC is actively used for each port-group, the two other NICs are there for failover. A few pics may explain this better…

vSwitch0 Overview

vSwitch0 vSwitch NIC configuration

Management port-group configuration

Management NIC configuration

It’s also worth noting here that, since we have Enterprise Licenses, I decided to make use of Host Profiles to help me set up other ESXi hosts. So I configured one ESXi server the way I wanted it and used it to create a baseline host profile. This profile was then applied to all remaining hosts. Nice, quick deployment of a standard config. Good times!

Server Down!

My problems started when I wanted to change the IP address or VLAN of the Management port-group. Actually, problems arose when I made a change to ANY port-group in vSwitch0. Most recently it was when I wanted to change the VLAN that the vMotion port-group used. So I changed the VLAN and associated IP address for the vMotion PG and suddenly my host went offline!

What seemed to happen was that the new IP and VLAN I applied to the vMotion PG were actually applied to the Management PG!!! Seriously not good. Especially when your cluster has HA enabled, which means that vCenter dutifully starts powering down VMs and bringing them online on other hosts in your cluster! Cue lots of alerts and irate application owners.

After some digging around and finding some VMware Community posts I realised that the problem was due to an unconfirmed bug. It seems to be down to the way that Host Profiles are applied; specifically the order in which port-groups are created on the host.

If you look again at my first pic you’ll notice that the Management PG does not use the first VMkernel NIC, ‘vmk0’. This is actually assigned to the vMotion PG.

To me this is a massive problem and from my point of view, the “fix” is to make sure that the Management PG is using vmk0.

Reassigning vmk0

  1. The first thing you want to do is disable HA on your cluster. Otherwise, when you make your change and temporarily lose connection to your host, all your VMs will be failed over when vCenter detects that the host has gone offline.
  2. Next, you’ll need either physical or iLO access to the DCUI. SSH won’t cut it as we’re going to lose network connection for a while. Once you’re on the machine hit ALT-F1 to go to the local TSM (assuming you have it enabled) and log on as root or an admin user.
  3. Now we want to remove all port-groups apart from the Management PG. We could do this via the VI Client, but let’s do this at the command line being as we’ve made the effort to get here… List your vSwitch0 port-groups (and show vmknics):

    Here, again, you can see that the Management port-group is using vmk3. This is bad, mmmkay? The other odd thing, and something I’m assuming is not good, is that the MAC shown for the Management PG is that of the physical NIC. This shouldn’t be the case. This should be a virtual MAC starting with the VMware-specific range: 00:05
  4. The following commands cleared out my vMotion and Fault Tolerance port-groups. Strictly speaking, esxcfg-vmknic -d destroys the vmknic attached to the named port-group rather than the port-group itself:
    esxcfg-vmknic -d vMotion
    esxcfg-vmknic -d 'Fault Tolerance'

    Now the fun part. We run exactly the same command but on our Management PG. Yes. You will lose network connectivity to the host, but that’s why we’re using iLO, right? :)

    esxcfg-vmknic -d Management
  5. After we’ve blown away our Management PG we can recreate it. When we do, it will automatically be given vmk0:
    esxcfg-vmknic -a -i 10.32.202.11 -n 255.255.255.0 Management


    You’ll also notice our new PG has a correctly assigned virtual MAC (starting 00:05).

  6. At this point you’ll want to go back to the VI Client and configure the new PG the way you want, including adding additional pNICs to the vSwitch and setting failover order for the PGs.
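The command-line part of steps 3 to 5 can be collected into one small script. This is only a sketch under a few assumptions: the run helper and the DRY_RUN switch are made up for illustration, the two listing commands (esxcfg-vswitch -l and esxcfg-vmknic -l) are a guess at what the step 3 listing used, and the port-group names are the ones from this host. By default it only prints the commands, so nothing is touched until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: prints the commands by default (DRY_RUN=1).
# Set DRY_RUN=0 from the host console (not over SSH) to really run them.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper. Echoes the command in dry-run mode,
# executes it otherwise. Note the echo drops quoting, so
# "Fault Tolerance" is printed without its quotes.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rebuild_vmk0 <management_ip> <netmask>
rebuild_vmk0() {
    mgmt_ip="$1"
    netmask="$2"

    # Step 3: see which vmknic sits on which port-group (assumed commands)
    run esxcfg-vswitch -l
    run esxcfg-vmknic -l

    # Step 4: destroy the vmknics, Management last
    run esxcfg-vmknic -d vMotion
    run esxcfg-vmknic -d "Fault Tolerance"
    run esxcfg-vmknic -d Management

    # Step 5: recreate the Management vmknic, then list again to check its name
    run esxcfg-vmknic -a -i "$mgmt_ip" -n "$netmask" Management
    run esxcfg-vmknic -l
}
```

Calling rebuild_vmk0 10.32.202.11 255.255.255.0 with DRY_RUN left at 1 prints the seven commands in order, which is a cheap way to sanity-check the sequence before running it for real over iLO.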

Now I’m able to make changes to my vMotion and FT port-groups without affecting my Management PG.

Let me know if you’ve been affected by this and how you’ve got around it.

My next step is to take another host-profile baseline and ensure that, when reapplying it to the other hosts, the vmknic assignments are consistent.





Change disk persistence mode on the fly in vSphere using PowerCLI

4 11 2010

I’ve been plagued, for some time, by a really annoying problem in my home lab. Being as I run everything from the one VM host (ESXi) it is also home to my Windows 2008 domain controller.

Recently I had installed Veeam Backup and Replication to finally start backing up my VMs (even though it’s my home/test environment I would be a broken man if I lost everything on there!). I installed Veeam on to the Windows 2008 DC – against all common sense of course, but RAM is low in my ML110 and I couldn’t afford to create another Windows VM.

I had configured Veeam to back up my guests using the vStorage API – Virtual Appliance mode. Backups of my other Windows guests ran well but I noticed some problems with the DC VM backup after a short while. It was around this time that I also noticed that the hard disks of this guest had automatically changed to Independent Nonpersistent! This would have been bad enough for any machine (as soon as the guest powers down you lose all changes), but for a DC – as you can imagine – it’s a nightmare!

This post on Experts Exchange confirmed – to some extent – that Veeam was the culprit and there wasn’t some other weird force at work: http://www.experts-exchange.com/Software/VMWare/Q_26310902.html

I disabled the DC backup via Veeam but the problem remained that my VM’s disks were Independent. The vSphere Client does not allow you to change disk modes on the fly so I was stuck – as soon as I shut my VM down to change the disk type I would lose all changes:

Thankfully, PowerCLI came to my rescue!

After connecting to my ESXi host using

Connect-VIServer <esxi_ipaddress>

All that was needed was a simple command:

Get-HardDisk -VM <vm_name> | Set-HardDisk -Persistence "Persistent"

Success!! At least partially. My primary (system) disk had successfully reverted to a persistent disk. However, my second disk had not and PowerCLI had actually thrown an error:

CapacityKB Persistence                                                    Filename
---------- -----------                                                    --------
68157440   Persistent                       [localdisk01] vm_guest/disk1.vmdk
Set-HardDisk : 04/11/2010 00:49:44    Set-HardDisk        Another task is already in progress.
At line:1 char:42

I haven’t yet figured out what task it is that is supposedly in progress, but I certainly can’t find one. It may be worth noting that my primary disk is on local SATA while the second disk is on an NFS share.

Now, at least, when I run Get-HardDisk, it shows that my system drive is Persistent.
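A possible cross-check from the host side: the disk mode is also recorded in the VM’s .vmx file, in lines of the form scsi0:1.mode = "independent-nonpersistent". The sketch below assumes shell access to the ESXi host and prints every disk entry that has an explicit mode line. list_disk_modes is a made-up helper name, and the path to the .vmx file is whatever it happens to be on your datastore. As far as I know, a disk with no mode line at all is on the default (persistent), so it will not be listed.

```shell
#!/bin/sh
# list_disk_modes <path-to-vmx>
# Hypothetical helper: prints "<device> <mode>" for each disk whose mode
# is set explicitly in the .vmx file, e.g. "scsi0:1 independent-nonpersistent".
list_disk_modes() {
    sed -n 's/^\([a-z]*[0-9]*:[0-9]*\)\.mode *= *"\(.*\)".*$/\1 \2/p' "$1"
}
```

Usage would be along the lines of list_disk_modes /vmfs/volumes/localdisk01/vm_guest/<vm_name>.vmx, where the .vmx file name is a placeholder. It only reads the file, so it is safe to run while the VM is powered on.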





RDM mapping of local SATA storage for ESXi

25 10 2010

This post has been sat in my WordPress Drafts folder for some time since I no longer use local storage this way. I decided to post it however as (a) it’s a good learning curve for ESXi work and (b) others may have more luck than me.

I recently acquired three 1TB drives and decided to do something about my lack of storage at home. Always trying to make best use of existing kit (and save money) I decided to stick the drives into my HP ML110 and try something in a VM instead of doing the sensible thing of lobbing them into a dedicated NAS box.

After wasting a few hours I realised that the onboard SATA RAID controller of the ML110 just can’t do RAID5 and to make matters worse, when I gave up and created a RAID1 array with a hot spare, vSphere 4.1 didn’t recognise the array and instead saw the drives as 3 individual drives. I saw this as a chance to try out the WAFL-alike ZFS file system. FreeNAS had been my NAS of choice recently so I chose to try ZFS in that.

I point blank refused to create three 1TB VMDKs (one on each of the three drives) so I set about figuring out how to create Raw Device Mappings (RDMs) of the local SATA drives. There were a couple of posts on the net that got me a little closer, but no guide/article had the whole thing down, so that’s my aim with this blog post.

Step 1

Once you have your drives installed, SSH to your ESXi box (now even easier in vSphere 4.1) and go to the /dev/disks directory. There, if you perform an ls -l, you’ll see your drives listed:

Ignore the entries for your drives that start with vml. (these are alternate identifiers that link back to the same devices). We want to look at the raw devices.
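To cut the listing down to just the whole-disk device names, something like the following sketch may help. list_raw_disks is a made-up helper name, and it assumes that the alias entries start with vml. and that partition entries end in a colon followed by a number.

```shell
#!/bin/sh
# list_raw_disks [directory]
# Hypothetical helper: lists whole-disk device names only, skipping the
# vml.* alias entries and any partition entries (names ending in :<number>).
list_raw_disks() {
    dir="${1:-/dev/disks}"
    ls "$dir" | grep -v '^vml\.' | grep -v ':[0-9][0-9]*$'
}
```

Run with no argument it looks in /dev/disks. The names it prints are the ones to copy into the vmkfstools command in Step 3.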

Step 2

Now move to the /vmfs/volumes folder. Here you can see your existing local datastore(s). If, like me, you had a solitary hard-drive, you’ll just see localdisk01 or whatever you chose to name the local datastore:

Step 3

Now we are going to use the vmkfstools utility to create our RDMs. Remember that an RDM is just another VMDK, but instead of the VMDK pointing to a xxx-flat.vmdk file (which is the actual virtual hard disk), the VMDK points to our physical device. Being as we still need to create this VMDK file we need to save it somewhere. Since we just have the one local datastore, we are going to create the RDM VMDK files in its root.

The following command creates the RDM VMDK for us:

vmkfstools -z /vmfs/devices/disks/<name of RAW device from Step 1> <location to store VMDK>/<RDM name>.vmdk

In my own example, I am creating an RDM called rdm_WD2DWCAVU0477582.vmdk and it is being stored in the location /vmfs/volumes/localdisk01/. I chose the name of the VMDK to match the serial number of the physical drive (as shown in Step 1) to help with troubleshooting in the future when I get an inevitable drive failure. You can call your RDMs whatever you wish.

The name of the RAW device (t10.ATA____WDC_WD10EARS2D00Z5B1__________________________WD2DWCAVU0477582 in my example) you will have noted from Step 1 when you listed all local devices attached to your ESXi host. This is why the tech Gods created Copy n Paste! You will want to copy the full device name as shown in Step 1 into the vmkfstools command.
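With several drives to map, the same vmkfstools -z command can be generated in a loop. The sketch below is only an illustration: run, rdm_name and make_rdms are made-up helper names, the DRY_RUN switch is an addition of this sketch, and the naming rule (rdm_ plus whatever follows the last underscore in the device name) is an assumption that happens to fit device names which end in the drive serial. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the vmkfstools commands by default (DRY_RUN=1).
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper. Echo in dry-run mode, execute otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rdm_name <device>
# Builds rdm_<serial>.vmdk, assuming the serial is the text after the
# last underscore in the device name.
rdm_name() {
    echo "rdm_${1##*_}.vmdk"
}

# make_rdms <datastore_path> <device> [<device> ...]
# One vmkfstools -z call per raw device, storing each RDM pointer file
# in the root of the given datastore.
make_rdms() {
    store="$1"
    shift
    for dev in "$@"; do
        run vmkfstools -z "/vmfs/devices/disks/$dev" "$store/$(rdm_name "$dev")"
    done
}
```

Usage would look like make_rdms /vmfs/volumes/localdisk01 <device 1> <device 2> <device 3>, with the device names pasted from Step 1. Check the printed commands first, then set DRY_RUN=0 to create the files.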

Step 4

Once you have repeated the steps for all of your local SATA drives, you can navigate to where you created the RDMs (in my case /vmfs/volumes/localdisk01) and perform an ls -l *.vmdk to see the new VMDKs you have created:

Don’t panic – the xxx-rdmp.vmdk files will reflect the size of the RAW devices they are mapping to, but rest assured they will be taking no more space than a few bytes on your local disk!

Step 5

You can now add your RDMs to an existing VM. vSphere doesn’t recognise this as a true RDM (to a SAN) so you just browse the local disk datastore for the VMDK files that we created.

Edit the properties of an existing VM and click Add…

Step 6

Select Use an existing virtual disk and click Next >

Step 7

Click Browse. You now need to navigate your local datastore and select the VMDKs that we created in Step 3.

Once complete you will be shown a confirmation window. Repeat Steps 5 through 7 to add additional RDMs to your VM.

Step 8

You should now see your new Hard Disks in your VM and vSphere will correctly identify them as Mapped Raw LUN.

NOTE: One thing I forgot to show in the screenshots is that you should create your RDMs on a new SCSI controller! You do this by simply selecting a new SCSI ID starting with 1:x instead of 0:x. Existing VMDKs should be on SCSI controller 0. Your RDMs should be on SCSI controller 1. Although my screenshot shows 0:3 this should read 1:3.

You can now save your VM configuration. Your VM will now access the RAW SATA drives and be able to use things like SMART to monitor their health.

See below; I am adding my three 1TB drives to FreeNAS to create a new ZFS pool.

Stay tuned for an upcoming blog post on FreeNAS and NexentaStor which may or may not put you off ZFS [in a VM] altogether!





Working with HP MSA 2000 via command line

25 10 2010

This statement is becoming almost cliche in the blogging world, but this post is more for my own benefit to help me remember some of the work I’ve been doing on the MSA lately:

The MSA 2000 series has a pretty dire web interface – when you compare it with other SANs – and to make matters worse, it mixes and swaps terminology with HP’s own EVAs (a vdisk on an MSA refers to a RAID array made up of physical disks. A vdisk on an EVA is a virtual RAID array which sits on top of the underlying RAID 0 structure). The web interface is challenging enough as it is when it works, but recently the web page of the management interface has been timing out.

Thankfully SSH access still works and responds well. I’ve been using…

restart mc a

…to restart the management interface which helped things for a short while but ultimately I found myself falling back to SSH more and more.

The following procedure follows the steps I took to rename and present a volume to a host.

To give a host a friendly name (from its WWN):

set host-wwn-name host <wwn> <friendlyHostName>

To rename a volume:

set volume <volName> name <new_volName>

To view what hosts a volume is currently presented to and what volumes map to what hosts:

show host-maps [<friendlyHostName>]
show volume-maps [<volName>]

To actually present a volume to a host:

map volume <volName> [lun <lunID>] host <friendlyHostName> access rw

In theory the LUN ID can be left blank, but in practice I found that it would not map unless I specified an ID. Don’t forget to map both HBAs for a host.

After presenting the volume to the Windows 2003 host I was dismayed when the Storage MMC could not see the new LUN. Hitting F5, of course, was not enough. Right-click the Disk Management node in the left-hand side of the MMC window and select Rescan Disks.

I’ll update this post if I use, and feel the need to remember, some other command line options for the MSA 2000, but hopefully we’ll be migrating off it to EVA soon!
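Since each HBA shows up as its own host entry, it can be handy to generate the map lines for a volume in one go and paste them into the MSA’s SSH session. The sketch below runs on an ordinary workstation shell, not on the MSA itself. msa_map_cmds is a made-up helper name, and the volume and host names in the usage example are hypothetical; the map syntax is the one shown above, with the LUN ID always given explicitly.

```shell
#!/bin/sh
# msa_map_cmds <volume> <lun_id> <host> [<host> ...]
# Hypothetical helper: prints one "map volume" line per host entry
# (one per HBA), always with an explicit LUN ID.
msa_map_cmds() {
    vol="$1"
    lun="$2"
    shift 2
    for host in "$@"; do
        echo "map volume $vol lun $lun host $host access rw"
    done
}
```

For example, msa_map_cmds vol_data01 5 winhost_hba0 winhost_hba1 prints two map lines, one for each HBA of the same server. Afterwards, show volume-maps on the MSA should confirm both mappings.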





Netbackup reporting script – an update (of sorts)

17 08 2010

Seems the Netbackup reporting script is in pretty high demand at the moment. Obviously lots of you are just as frustrated with NBU reporting as I was! I’m really pleased that so many of you are finding it useful – though I am starting to wonder if letting it out into the open without a license agreement (a free one, don’t worry) is a good idea.

I have placed the scripts in a public DropBox location which is linked to at the bottom of the original Netbackup Report Scripting post. In the meantime, please see the PayPal donation button to the right. If you’ve used my scripts and find them useful – and certainly if you plan to use them in a business environment – please consider donating as much or as little as you can.

I also have some nice new blog posts ready to upload as soon as I can sort my image hosting out! Those of you struggling with EVA performance monitoring and tuning, stay…err….tuned!





Netbackup script to report scratch tapes

4 05 2010

I’ve seen Netbackup used in two completely different ways now. In my last environment all tape rotas, movements and scratch tapes were handled manually. We had a spreadsheet showing tape sets (see below for an example), when these tapes were collected by Iron Mountain, which days of the month we defined as Monthly/Quarterly backups and so on. In this scenario each set is a fixed number of tapes – on some days you may use fewer, on some days you may not have enough loaded.

In the new environment we let Netbackup dictate which tapes are in the scratch pool and free for re-use based on the expiration policy of the jobs written to those tapes. In theory this should mean that you only ever use as many tapes as you need for a given backup job. The problem with this set-up is that although it is, in theory, more efficient, you don’t always know which tapes you will need each night and how many you have free.

Each day guys were running NBU command lines to show which tapes were in the scratch pool, but then had to manually sift through them to see which tapes were currently sat in drives/libraries and, of the ones held off-site, which tapes were of which format (LTO2, 3 or 4).

The following script can be scheduled to email out a list of tapes in the scratch pool, which are in what library and also groups the offsite tapes by format. It was knocked up in a hurry so isn’t as parameterised or dynamic as it should be, but the comments should help you fix it for your needs…






Using CrashPlan to backup a network share

12 03 2010

Some time ago I started using CrashPlan with CrashPlan Central to finally start backing up my family photos and important documents off-site. It’s been working pretty well, despite my frustrations over slow upload speeds. It’s pretty much a fire and forget solution.

Recently I’ve installed FreeNAS on to an old PC to (a) start getting some stability to my storage solution, and (b) play around with ZFS and iSCSI at home. I tried OpenFiler but it pales in comparison to FreeNAS. I’ve gone against my better judgement and I’m using some of the cooler plugins including the Torrent client, Dynamic DNS and I’ve even installed SABNzbd on it. FreeNAS comes with rsync built in and even Unison, which is an interesting solution to cross-platform backup/synchronisation, but I’ve paid for my CrashPlan Central plan so want to make use of it! Only problem now is backing up my data!

CrashPlan doesn’t support backing up from a network share – either mapped drive or UNC path. There is a workaround on the CrashPlan site which works but is a little convoluted, so I wanted to post a nice, quick version here.

Read on after the break…







