Automating VMWare to Hyper-V Migrations using MVMC

There are several tools for migrating VMWare VMs to Hyper-V. The free tool from Microsoft is the Microsoft Virtual Machine Converter, which can convert from VMWare to Hyper-V or Azure, and from physical to virtual, via a GUI or PowerShell cmdlets for automation. Microsoft also has the Migration Automation Toolkit, which can help automate this process. If you have NetApp, definitely check out MAT4SHIFT, which is by far the fastest and easiest method for converting VMWare VMs to Hyper-V. MVMC works fairly well; however, there are a few things the tool doesn’t handle natively when converting from VMWare to Hyper-V.

First, it requires credentials to the guest VM in order to remove the VMWare Tools. In a service provider environment, you may not have access to the guest OS, so this could be an issue. Second, the migration inherently causes a change in hardware, which in turn can cause the guest OS to lose its network configuration. This script accounts for that by pulling the network configuration from the guest registry and restoring it after the migration. Lastly, MVMC may slightly alter other hardware specifications (dynamic memory, MAC address), and this script aims to keep them as close as possible, with the exception of disk configuration due to Gen 1 boot limitations in Hyper-V.

This script relies on several third-party components:

You’ll need to install MVMC, the Hyper-V PowerShell module, and VMWare PowerCLI on your “helper” server – the server where you’ll run this script to perform the conversion. Devcon, the Hyper-V Integration Services (IS) components, VMWare Tools, and NSSM will need to be extracted into the appropriate folders:

[Image: vmware-to-hyper-v folder structure on the helper server]
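Before kicking off a conversion, it can save time to verify the helper server actually has everything in place. Below is a minimal sanity-check sketch – the folder names are assumptions based on the structure above, so adjust them to match your layout:

# Quick prerequisite check for the helper server (folder names are illustrative)
$root = "C:\vmware-to-hyperv-convert"
"devcon", "vmware-tools", "nssm" | ForEach-Object {
    if (-not (Test-Path (Join-Path $root $_))) { Write-Warning "Missing folder: $_" }
}
if (-not (Get-Module -ListAvailable Hyper-V)) { Write-Warning "Hyper-V PowerShell module not found" }
if (-not (Get-Module -ListAvailable VMware.VimAutomation.Core) -and
    -not (Get-PSSnapin -Registered -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue)) {
    Write-Warning "VMware PowerCLI not found"
}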

I’ve included a sample kick-off script (migrate.ps1) that will perform a migration:

# Connection details for the source ESXi host
$esxhost = "192.168.0.10"
$username = "root"
$password = ConvertTo-SecureString "p@ssWord1" -AsPlainText -Force
$cred = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $username, $password
$viserver = @{Server=$esxhost;Credential=$cred}

# Destination storage and Hyper-V host, plus the VM to migrate
$destloc = "\\sofs.contoso.int\vm-storage1"
$vmhost = "HV03"
$vmname = "MYSERVER01"

# Dot-source the conversion script so its functions are available in this session
cd C:\vmware-to-hyperv-convert
. .\vmware-to-hyperv.ps1 -viserver $viserver -VMHost $vmhost -destLoc $destloc -VerboseMode

# Find the source VM on the ESXi host, gather its details, and migrate it
$vms = VMware.VimAutomation.Core\Get-VM -Server $script:viconnection
$vmwarevm = $vms | ?{$_.Name -eq $vmname}
$vm = Get-VMDetails $vmwarevm
Migrate-VMWareVM $vm

Several notes about MVMC and these scripts:

  • This is an offline migration – the VM will be unavailable during the migration. The total amount of downtime depends on the size of the VMDK(s) to be migrated.
  • The script will only migrate a single server. You could wrap it in PowerShell jobs to migrate several servers simultaneously (see the sketch after this list).
  • Hyper-V Gen 1 servers only support booting from IDE. This script will search for the boot disk and attach it to IDE0; all other disks will be attached to a SCSI controller regardless of the source VM disk configuration.
  • Linux VMs were not in scope, as there is no reliable way to gain write access to LVM volumes from Windows. Tests of CentOS 6, Ubuntu 12 and Ubuntu 14 were successful. CentOS 5 required the IS components to be pre-installed and modifications made to the boot configuration. CentOS 7 was unsuccessful due to its disk configuration. The recommended way of migrating Linux VMs is to pre-install IS, remove VMWare Tools, and modify the boot configuration before migrating.
  • These scripts were tested running from a Server 2012 R2 VM migrating Server 2003 and Server 2012 R2 VMs – other versions should work but have not been tested.
  • ESXi 5.5+ requires a connection to a vCenter server, as the storage SDK service is unavailable in the free version.
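As a rough illustration of running several migrations at once, the sketch below wraps the kick-off script in PowerShell jobs. It assumes migrate.ps1 has been parameterized to accept the VM name (the sample above hardcodes it), so treat it as a starting point rather than a drop-in solution:

# Launch one background job per VM (assumes migrate.ps1 takes a -vmname parameter)
$vmnames = "MYSERVER01", "MYSERVER02", "MYSERVER03"
$jobs = foreach ($name in $vmnames) {
    Start-Job -Name "migrate-$name" -ScriptBlock {
        param($vmname)
        & C:\vmware-to-hyperv-convert\migrate.ps1 -vmname $vmname
    } -ArgumentList $name
}
# Wait for all migrations to finish and collect their output
$jobs | Wait-Job | Receive-Job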

How DBPM affects guest VM performance

Dell introduced a feature in their 11G servers called demand-based power management (DBPM). Other platforms refer to this feature as “power management” or “power policy,” whereby the system adjusts the power used by various components like CPU, RAM, and fans. In today’s green-PC world, it’s a nice idea, but the reality in cloud-based environments is that we are already consolidating systems onto fewer physical machines to increase density, and power policies often interfere with the resulting performance.

We recently began seeing higher than normal READY times on our VMs. Ready time refers to the amount of time a process needed CPU time but had to wait because no processors were available. In the case of virtualization, this means a VM had some work to do, but it could not find enough free physical cores to match the number of vCPUs assigned to it. VMWare has a decent guide for troubleshooting VM performance issues, which led to some interesting analysis. Specifically, our overall CPU usage was only around 50%, but some VMs were seeing ready times of more than 20%.
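For reference, the ready counter vSphere exposes is a millisecond summation, so it has to be converted to a percentage of the sample interval. A minimal PowerCLI sketch (assuming an existing Connect-VIServer session and a hypothetical VM name) looks like this:

# Average cpu.ready.summation over recent realtime samples and convert to a percentage
# (realtime samples cover 20-second intervals, so ready % = ready ms / 20000 * 100)
$vm = Get-VM "MYSERVER01"
$stats = Get-Stat -Entity $vm -Stat "cpu.ready.summation" -Realtime -MaxSamples 15
$stats | Group-Object Instance | ForEach-Object {
    $avgMs = ($_.Group | Measure-Object -Property Value -Average).Average
    [pscustomobject]@{
        VM       = $vm.Name
        Instance = $_.Name
        ReadyPct = [math]::Round(($avgMs / 20000) * 100, 2)
    }
}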

This combination of high CPU ready and low CPU utilization can be due to several factors. Most commonly in cloud environments, it suggests the ratio of vCPUs (virtual CPUs) to pCPUs (physical CPUs) is too high, or that you’ve sized your VMs improperly with too many vCPUs. One important thing to understand in virtual environments is that a VM with multiple cores needs to wait for that number of cores to become free across the system. Assuming you have a single host with 4 cores running 4 VMs – 3 VMs with 1 vCPU and 1 VM with 4 vCPUs – the 3 single-vCPU VMs could be scheduled to run concurrently, while the fourth would have to wait for all pCPUs to become idle.

Naturally, the easiest way to fix this is to add more physical CPUs into the fold. We accomplished this by upgrading all of the E5620 processors (4-core) in our ESXi hosts to E5645 processors (6-core), thereby adding 28 additional cores to the platform. However, this did not help with CPU READY times. vSphere DRS was still reporting trouble delivering CPU resources to VMs:

[Image: DRS CPU contention before the DBPM change]

After many hours of troubleshooting, we were finally able to find a solution – disabling DBPM. One of the hosts consistently showed lower CPU ready times even though it had higher density, and we found that this node had a different hardware power management policy than the other nodes. You can read more about what this setting does in the Host Power Management whitepaper from VMWare. By default, this policy is set automatically based on ACPI CPU C-states, Intel SpeedStep, and the hardware’s power management settings.
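If you want to spot a mismatched host quickly, the active power policy is visible through the vSphere API. Here is a small PowerCLI sketch – the property path below comes from the HostCpuPowerManagementInfo object and is my assumption of the simplest way to reach it:

# Compare the active host power management policy across all hosts
Get-VMHost | Select-Object Name,
    @{N="PowerPolicy";    E={$_.ExtensionData.Hardware.CpuPowerManagementInfo.CurrentPolicy}},
    @{N="HardwareSupport";E={$_.ExtensionData.Hardware.CpuPowerManagementInfo.HardwareSupport}}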

On our Dell PowerEdge R610 host systems, the DBPM setting was under Power Management in the BIOS. Once we changed all systems from Active Power Controller to Maximum Performance, CPU ready times dropped to normal levels.

[Image: Dell R610 BIOS power management settings]

Information on the various options can be found in the Power and Cooling wiki from Dell. Before settling on this solution, we tried disabling C-states altogether, and C1E specifically, in the BIOS, but neither had an impact. We found that we could also specify OS Control for this setting to let vSphere set the policy, though we ultimately decided that Maximum Performance was the best setting for our environment. Note that this isn’t specific to vSphere – the power management setting applies equally to all virtualization platforms.
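For completeness, when a host is set to OS Control the policy can also be changed from the vSphere side. The sketch below uses the HostPowerSystem API; the host name is hypothetical and the policy key values vary, so check AvailablePolicy on your own build before configuring anything:

# Inspect and set the vSphere-side power policy on a host running in OS Control mode
$esx = Get-VMHost "esx01.contoso.int"
$powerSys = Get-View $esx.ExtensionData.ConfigManager.PowerSystem
$powerSys.Capability.AvailablePolicy | Select-Object Key, ShortName   # list the valid keys first
$powerSys.ConfigurePowerPolicy(1)   # key 1 is commonly High Performance, but verify above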

“VixDiskLibVim: Not licensed to use this function” message with vSphere 5.5

We recently upgraded our environment to vSphere 5.5. The environment is protected by Veeam Backup & Replication 7.0 R2, which supports vSphere 5.5. Prior to the vSphere 5.5 upgrade, backups were working without issue. Sure enough, after the upgrade, backup jobs started failing. The error message logged in C:\ProgramData\Veeam\Backup\<Job_Name>\Agent.VddkHelper.log was:

VixDiskLibVim: Not licensed to use this function

You may see this same error with other vSphere backup products as well. Most solutions tell you that you either do not have the vStorage APIs licensed for the host, or that the user connecting to vCenter does not have Administrator permissions. The trouble was that this same configuration was working prior to the vSphere 5.5 upgrade. I confirmed that the ESXi hosts did indeed have Enterprise Plus licenses assigned and that “Storage APIs” was listed under licensed features for each host. I also confirmed that the vCenter user account the backup product uses had Administrator permissions assigned at the datacenter level in vCenter – the same as prior to the upgrade.

After opening a support case with Veeam and testing several things, I tried adding the Administrator permission for the vCenter user at the top vCenter level instead of at the datacenter (one level down). Sure enough, backups started working. So it seems that your vCenter user needs this permission directly at the vCenter server level in vSphere 5.5.
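If you prefer to script the change, a hedged PowerCLI sketch of assigning the permission at the root vCenter object might look like the following – the server, account, and role names here are examples, not the actual ones used in this environment:

# Grant the backup service account Administrator at the top-level vCenter object
Connect-VIServer vcenter.contoso.int
$rootFolder = Get-Folder -NoRecursion                 # the root folder of the vCenter inventory
$account = Get-VIAccount -Domain "contoso" | Where-Object { $_.Id -like "*svc-veeam*" }
New-VIPermission -Entity $rootFolder -Principal $account -Role (Get-VIRole "Admin") -Propagate:$true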

VMWare False VM Snapshot Size Alarm

Just finished troubleshooting an issue with a false alarm being triggered in vCenter after upgrading to vCenter 5.1. We have a custom alarm defined that warns if a VM has a snapshot larger than 15GB and alerts if it’s larger than 25GB. After the upgrade, all VMs were triggering that alert even though they did not have snapshots that large – even VMs without any snapshots were triggering it. It turns out this is a known issue with vCenter 5.1.0 (799731). The workaround is to set the warning threshold to something less than 15GB and the alert threshold to something less than 20GB (we used 14GB for warning and 19GB for alert).
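One quick way to confirm an alarm like this is false is to list the actual snapshot sizes with PowerCLI, something along these lines (assuming an existing vCenter connection):

# List all snapshots and their reported sizes, largest first
Get-VM | Get-Snapshot |
    Sort-Object SizeGB -Descending |
    Select-Object VM, Name, Created, @{N="SizeGB"; E={[math]::Round($_.SizeGB, 1)}}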

VMFS Resource Temporarily Unavailable

I was performing some maintenance on a few VMFS LUNs and came across a few files for a VM that I knew were no longer in use. There were a couple of old snapshot files and a VMDK that had been renamed when doing a restore.

After confirming the files were no longer needed or in use, I attempted to remove them using the rm command. ESXi reported back the following error:

rm: cannot remove '.vmdk': Resource temporarily unavailable

After some quick research, I realized a file lock was likely causing the error. VMFS allows access from multiple ESXi servers at the same time by implementing per-file locking, so it was likely that an ESXi host other than the current owner of the VM had a lock on the file. The cluster was small enough that I was able to simply log in to each host and attempt the delete. After three attempts, I found the host holding the lock on the files and was able to successfully delete them from the VMFS store.

I had tried removing the files via the datastore browser in vSphere, hoping that it would be smart enough to know which host had the lock on the file and issue the delete command on that host – but no such luck. There was a way to detect which host had a lock on a file in ESX, but I have not found a similar mechanism in ESXi. Until then, trial and error will suffice.

Partition Alignment

Squeezing every ounce of performance out of your disk array is critical in IO-intensive applications. Most of the time this is simply an afterthought, but doing a little legwork during the implementation phase can go a long way toward increasing the performance of your application. Aligning partitions is a great idea for SQL and virtualized environments – these are the places where you will see the most benefit.

The concept of aligning partitions is actually quite simple and applies to SANs and, really, any disk array. If you are using RAID in any capacity, then aligning disk partitions will help increase performance. It is best illustrated by the following graphics, borrowed from http://www.vmware.com/pdf/esx3_partition_align.pdf (a great read – it is specific to VMWare environments, but the same concepts apply).

Using unaligned partitions in a virtual environment, you can see that a read could ultimately result in 3 disk accesses to the underlying disk subsystem:

By aligning partitions properly, that same read results in just 1 disk access:

While these graphics are virtual machine and VMWare specific, the same is true for Hyper-V and SQL (except remove the middle layer for SQL). In order for partition alignment to work properly, you need to ensure that the lowest level of the disk subsystem has the largest segment size (also referred to as stripe size). Depending upon your RAID controller or SAN, this could default to as low as 4K or as high as 1024K. I won’t cover what differences in segment size mean for performance – that’s an entirely different discussion – but generally speaking, defaults are usually 64K or 128K. The basic idea behind a proper stripe size is that you want to size it so that most of your reads/writes can happen in one operation.

From there, you need to ensure that your block or file allocation unit size is set properly – ideally the same size as or smaller than the segment size, and such that the segment size is an even multiple of it. Lastly, you should set the partition offset to the segment size (or a multiple of it). By default, Windows 2003 will offset by 31.5K, Windows 2008 by 1024K, and VMWare VMFS defaults to 128 sectors (64K).
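Checking existing partitions is straightforward with WMI. The sketch below flags any partition whose starting offset is not an even multiple of the segment size – the 64K value is just an example, so substitute your array's actual stripe size:

# Check whether each partition's starting offset is a multiple of the segment size
$segmentSize = 64KB
Get-WmiObject Win32_DiskPartition |
    Select-Object Name, StartingOffset,
        @{N="Aligned"; E={ ($_.StartingOffset % $segmentSize) -eq 0 }}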

Setting the segment size may or may not be an online operation – whether it can be done on an already-configured array or only during initial configuration depends entirely on your RAID controller or SAN. Changing the offset and/or block size of a partition, however, is NOT an online operation. This means that all data will have to be removed from the partition, the offset configured, and the partition recreated. Prior to Windows 2008, this cannot be done to system partitions, so for Windows 2003 you would have to attach the virtual hard disk to another system, set the offset and format the partition, and then perform the Windows installation.
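On Windows Server 2012 and later, the Storage module can create the partition with the desired alignment directly; on older systems, diskpart's align parameter (in KB) does the same job. A minimal sketch, assuming disk 2 is the new data disk and a 64K allocation unit:

# Create an aligned partition and format it with a 64K allocation unit
# (1MB alignment divides evenly by common 64K/128K segment sizes;
#  on Windows 2003/2008 use: diskpart> create partition primary align=1024)
New-Partition -DiskNumber 2 -UseMaximumSize -Alignment 1MB -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 64KB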

The following links provide detailed information about aligning partitions in both VMWare and Windows. Consult your SAN or RAID controller documentation for setting or finding out the segment size.

Recommendations for Aligning VMFS Partitions

Disk Partition Alignment Best Practices for SQL Server

Calculating disk usage and capacity using Diskmon

While evaluating SAN storage solutions for our VMWare environment, we found ourselves asking the question, “How many systems can we fit on this storage before IOPS and/or throughput become a bottleneck?” Come to find out, the answer is not a simple one. In fact, all of the vendors we posed this question to were only able to give us vague performance numbers based on perfect conditions. We set out on a quest to quantify the capacity of each of the backend storage systems we tested.

Generally speaking, IOPS is inversely proportional to the request size, while throughput is proportional to it. This means that as the request size decreases, the total number of IOPS increases while throughput decreases, and vice versa. So when you see performance numbers that claim very high IOPS, those are based on small requests, and throughput will therefore be very minimal. In addition, disk latency and rotational speed can skew these numbers as well. Sequential operations will produce much higher numbers than random operations. When we add RAID to the equation, we will see a difference in numbers depending upon whether the operation is a read or a write.
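A bit of illustrative arithmetic makes the relationship clearer. The IOPS figures below are made up, but they show how the same device can post headline-grabbing IOPS on small requests while moving relatively few megabytes per second:

# Throughput = IOPS x request size (numbers are hypothetical)
$measurements = @{ 4 = 40000; 32 = 12000; 256 = 2000 }    # request size in KB -> IOPS
foreach ($reqKB in ($measurements.Keys | Sort-Object)) {
    $iops = $measurements[$reqKB]
    $mbps = ($iops * $reqKB) / 1024
    "{0,4}K requests: {1,6:N0} IOPS = {2,6:N0} MB/s" -f $reqKB, $iops, $mbps
}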

What does all this mean? It means that the performance capacity of a disk or storage device is determined by three main factors: request size, random/sequential mix, and read/write mix. There are other factors that can play a role, but focusing on these three will provide a good estimation of the capacity of a disk, array, or storage system. There are differing opinions as to what these numbers are in “real life.” The generally accepted view is that the average request size is 32K, 60% of transactions are random while 40% are sequential, and 65% are reads while 35% are writes. However, these numbers differ depending upon the application. The best way to determine them for your environment is to capture statistics from production systems and average them together.

Fortunately, there is a nice utility for Windows that will give you this information. Diskmon (http://technet.microsoft.com/en-us/sysinternals/bb896646.aspx), available from SysInternals (now part of Microsoft), will log every disk transaction along with the necessary details.

[Screenshot: Diskmon from SysInternals (now Microsoft)]

Diskmon will begin capturing data immediately. To stop Diskmon from capturing data, click the magnifying glass in the toolbar:

[Screenshot: stopping the Diskmon capture]

You can then save the output to a text file by clicking the save button. I recommend capturing data during normal usage over a reasonable period of time. Also, it is best to minimize the Diskmon window to keep CPU usage to a minimum. The next step is to import the text file into Excel. I have provided a sample Excel spreadsheet you can use as a template to perform the necessary calculations: server_diskmon.

[Screenshot: Diskmon output imported into the Excel spreadsheet]

By taking a sampling from various systems on our network and using a weighted average, we calculated the average usage of our systems. In our case, we were using a common storage backend, and we wanted to categorize different systems into low (L), medium (M), and high (H) usage systems. We then assigned a percentage to each. By doing this, we can calculate the disk usage on the storage system if x% of systems are low usage, y% medium usage, and z% high usage.

[Spreadsheet: weighted average of several systems on our network]

We now have a reasonable estimate of the request size, random/sequential percentages, and read/write percentages. If we feed these numbers into IOMeter, we can get a baseline of what the backend storage system can support. Divide that by our weighted average and we can find how many systems our backend can support. If we look at point-in-time numbers, we can figure out the percentage of disk capacity being used:

[Spreadsheet: capacity of the storage backend]
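As a worked example of the capacity math (with made-up numbers – substitute your own IOMeter baseline and per-class averages):

# Weighted per-system average divided into the backend's measured capability
$usageProfile = @{
    L = @{ Pct = 0.50; IOPS = 40  }    # 50% of systems are low usage
    M = @{ Pct = 0.30; IOPS = 120 }    # 30% medium
    H = @{ Pct = 0.20; IOPS = 400 }    # 20% high
}
$weightedIops = ($usageProfile.Values | ForEach-Object { $_.Pct * $_.IOPS } | Measure-Object -Sum).Sum
$backendIops  = 5000                   # hypothetical IOMeter baseline for the backend
"Weighted average per system: {0} IOPS" -f $weightedIops
"Estimated systems supported: {0}" -f [math]::Floor($backendIops / $weightedIops)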

I have put together a sample IOMeter configuration file containing the “real life” specification of 32K requests, 60% Random / 40% Sequential, and 65% Reads / 35% Writes.

Also, there’s a great comparison of SAN backends for VMWare environments here: http://communities.vmware.com/message/584154. Users have run the same real-life test against their backend storage systems, which will allow you to compare your device’s performance with other vendors’.

One side note when using IOMeter: be sure to set your test disk size to something greater than the amount of cache in your backend storage system in order to measure raw disk performance. The configuration file I have provided uses an 8GB test file, which should suffice for most installations.

Clear stale ESX iSCSI targets

During beta testing of our new VMWare environment, we created various volumes and then deleted them. After removing them from the EqualLogic SAN, we found the ESX server was still trying to log in to those targets. After some research, I stumbled across the following files containing the stale entries:

[root@vmware root]# cd /var/lib/iscsi
[root@vmware iscsi]# ls -la
total 16
drwx------    2 root     root         4096 Dec 17 12:11 .
drwxr-xr-x   11 root     root         4096 Oct 23 09:17 ..
-rw-------    1 root     root          830 Dec 17 11:22 vmkbindings
-rw-------    1 root     root          474 Dec 17 12:16 vmkdiscovery
[root@vmware iscsi]#

Put the host into maintenance mode, then edit both the /var/lib/iscsi/vmkbindings and /var/lib/iscsi/vmkdiscovery files to remove the stale entries. Reboot the host and exit maintenance mode. Lastly, rescan the HBA:

esxcfg-swiscsi -s

Repeat for all hosts that are attempting to use the stale entries.