VMware Setup and Updates
This page keeps track of VMware-related setup, updates, and information.
Update to vSphere 5.1
This section will document the details of updating to vSphere 5.1, done in mid-September 2012.
Note that many of the entries below are older (vSphere 4.x); however, some of the concepts still apply.
Fixing the pvscsi driver module after a kernel update (EL5 or earlier)
We have to jump through a few hoops because you can't generate a new kernel initrd with pvscsi when the pvscsi modules don't exist for that kernel yet. We also can't configure VMware Tools for a non-running kernel before EL6 (which has the modules built into the kernel anyway). Follow this procedure using a kernel that is able to boot the machine. If no kernel can boot the machine, switch to the LSI host bus adapter and boot from a boot disk.
- Boot the VM normally on the last kernel that worked.
- Generate a new initrd for the NEW kernel that contains the LSI SAS driver module. Replace the versions in the following command as appropriate:
mkinitrd -v -f /boot/initrd-2.6.18-371.9.1.el5.img 2.6.18-371.9.1.el5 --preload=mptbase --preload=mptscsih --preload=scsi_transport_sas --preload=mptsas --preload=libata --preload=ata_piix --builtin=pvscsi
- Shut down the VM. Change the type of the SCSI adapter to LSI SAS.
- Start the VM in the new kernel.
- Run the script /usr/bin/vmware-config-tools.pl. The defaults are generally fine. This should take care of installing the pvscsi and vmxnet modules in the new kernel.
- Shut down the machine. Change the SCSI adapter type back to PVSCSI.
- Start the machine; it should now be working.
I've ended up doing this so often that I wrote a script to generate the initrd. Its only argument is the kernel version. You can run it from AFS, and it will write a new initrd image to /boot.
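The AFS script itself is not reproduced on this page. As a rough stand-in, here is a minimal POSIX sh sketch that wraps the mkinitrd command from step 2. The function name lsi_initrd and the file name lsi-initrd.sh are hypothetical. The module list simply mirrors that command: the LSI SAS stack is preloaded and pvscsi is declared built in, relying on EL5 mkinitrd's --preload and --builtin options.

```shell
#!/bin/sh
# Hypothetical stand-in for the AFS helper (NOT the original script).
# Rebuilds /boot/initrd-<kver>.img for an EL5 kernel with the LSI SAS driver
# stack preloaded and pvscsi marked as built in, exactly like the manual
# mkinitrd command in step 2. Run as root on the VM itself.

lsi_initrd() {
    if [ "$#" -ne 1 ]; then
        echo "usage: lsi_initrd <kernel-version>" >&2
        return 1
    fi
    kver="$1"
    # -f overwrites an existing image; -v lists the modules being added.
    mkinitrd -v -f "/boot/initrd-${kver}.img" "$kver" \
        --preload=mptbase --preload=mptscsih \
        --preload=scsi_transport_sas --preload=mptsas \
        --preload=libata --preload=ata_piix \
        --builtin=pvscsi
}

# When executed as a script, pass the kernel version straight through,
# e.g.  ./lsi-initrd.sh 2.6.18-371.9.1.el5
if [ "$#" -gt 0 ]; then
    lsi_initrd "$@"
fi
```

Keeping the logic in a function makes the script easy to source and dry-run. Define a stub `mkinitrd` shell function first and it will be called instead of the real binary.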
Installation of New iSCSI System (MD3600i-base)
We purchased a new iSCSI system from Dell consisting of an MD3600i and an MD1200 expansion shelf. The new system (called UMVMSTOR02) replaces the old MD3000i + 2x MD1000 system (UMVMSTOR01), whose warranty expired in spring 2011. The advantages of the new system include:
- Dual controllers each with:
- Dual 10GE RJ45 interfaces
- SAS connector for expansion shelves
- RAID controller (RAID-0/1/5/6/10)
- OOB-Management port (1GE)
- Redundant P/S on both units
- Purchased both the "High-performance Tier" and "Snapshot and Virtual Disk Copy" entitlements
- Disks are SAS 15K 600GB (12 on the MD3600i and 12 on the MD1200)
The details of the setup and configuration are available at setup and configuration of MD3600i
Debugging VMware Network Issues
We have been having network issues with the VMware configuration since we got the new R710 nodes. The following are the types of problems observed:
- vCenter would get alarms for "Network connectivity lost" or "Network uplink redundancy lost"
- Pinging some VMs would result in 1-15% packet loss
- The problem was worst on VMs on ESX host UMVM01, least severe on UMVM03 (typically no packet loss), and in between on UMVM02
- Syslog showed messages about 10GE ports going down or into spanning-tree blocking mode
- The ESX hosts showed varying amounts of 'csum' (checksum) errors; there should be almost no such errors
The original purchase included Broadcom NetXtreme II 57711 dual-port SFP+ 10GE NICs in the R710s. From reading VMware forums and related whitepapers, it seemed that many people were having good experiences with the Intel X520 NICs, while there were quite a few complaints about the Broadcom NetXtreme II. We ended up replacing the Broadcom NICs with Intel NICs on July 26, 2011. Detailed plots of the various perfSONAR measurements spanning the replacement are available. After the replacement we were able to ping formerly problematic VMs with 0% packet loss (over hundreds of packets). As the plots show, we also see much less packet loss in most instances than before the switch. There are still some indications of continuing packet loss in the perfSONAR OWAMP measurements that are worth pursuing, but so far things seem much better than before.
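The ping test described above is easy to repeat after any network change. Below is a minimal sketch that prints the loss reported by ping for each VM. It assumes Linux iputils ping, whose summary line reads "N packets transmitted, M received, X% packet loss, ...". The function names, host names, and packet counts are illustrative placeholders.

```shell
#!/bin/sh
# Sketch: re-run the "ping the problem VMs" check and print per-host loss.
# Assumes iputils ping output, e.g.
#   "100 packets transmitted, 97 received, 3% packet loss, time 99135ms"

loss_pct() {
    # $1 = host, $2 = number of echo requests (default 100).
    # Prints the loss percentage, or nothing if ping gave no summary line.
    ping -q -c "${2:-100}" "$1" 2>/dev/null |
        sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

report_loss() {
    # $1 = packets per host, remaining arguments = hosts to test.
    if [ "$#" -lt 1 ]; then
        echo "usage: report_loss <count> <host>..." >&2
        return 1
    fi
    count="$1"
    shift
    for host in "$@"; do
        pct=$(loss_pct "$host" "$count")
        if [ -n "$pct" ]; then
            printf '%-30s %s%% loss\n' "$host" "$pct"
        else
            printf '%-30s ping failed\n' "$host"
        fi
    done
}
```

For example, `report_loss 200 vm1.example.org vm2.example.org` (hypothetical host names) sends 200 pings to each VM. Anything other than "0% loss" on a quiet network is worth investigating.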
VMware Updating June 20 2011
As part of ongoing maintenance, the AGLT2 VMware servers were updated on June 20, 2011. There were a number of parts to the "upgrade":
- Dell server (R710) firmware/bios updates. This included the following updates
- R710 BIOS 3.0.0 (NOTE: this update did NOT work)
- H700 firmware update
- iDRAC6 update
- Broadcom and Intel NIC firmware updates
- VMware driver and software updates
- New driver for Broadcom 57711 NIC
- Dell OMSA for ESX updated from v6.3 to v6.5
- There were 8 ESX patches/updates to apply on each host
- Server reconfiguration to utilize 10GE interfaces as the primary VM networking interface
- See various whitepapers above
- Each ESX server (UMVM0x) has a dual-port SFP+ Broadcom 57711 NIC with one connection to SW9 and one to SW10
- All network connections (for VM subnets and iSCSI access) were moved to a virtual switch served by these two 10GE ports on each ESX host.
- Three vSwitches were left with service consoles: AGLT2, iSCSIA and iSCSIB
- The remaining 1GE NICs were added as "standby" NICs on the vSwitch served by the 10GE NIC
- The details of the network changes are noted here
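The 2011 reconfiguration was done on ESX 4.x. For reference, on ESXi 5.x hosts the same layout (10GE uplinks active, 1GE uplinks standby, VM and iSCSI port groups on that vSwitch) could be scripted with esxcli's standard-vSwitch commands. This is a sketch only: the function name, vSwitch name, vmnic numbers, and port-group names are placeholders, and VLAN tagging and iSCSI VMkernel port binding are not covered.

```shell
#!/bin/sh
# Hypothetical ESXi 5.x sketch: one standard vSwitch with 10GE uplinks active
# and 1GE uplinks standby, plus the port groups that live on it.
# All names (vSwitch, vmnics, port groups) are placeholders.

setup_vswitch() {
    if [ "$#" -lt 3 ]; then
        echo "usage: setup_vswitch <vswitch> <active,nics> <standby,nics> [portgroup...]" >&2
        return 1
    fi
    vsw="$1"      # e.g. vSwitch1
    active="$2"   # 10GE uplinks, comma-separated, e.g. vmnic4,vmnic5
    standby="$3"  # 1GE uplinks, comma-separated, e.g. vmnic0,vmnic1
    shift 3       # anything left is a port-group name

    esxcli network vswitch standard add --vswitch-name="$vsw" || return 1

    # Attach every uplink (active and standby) to the vSwitch.
    for nic in $(echo "$active,$standby" | tr ',' ' '); do
        esxcli network vswitch standard uplink add \
            --vswitch-name="$vsw" --uplink-name="$nic" || return 1
    done

    # Explicit failover order: 10GE carries traffic, 1GE takes over only on failure.
    esxcli network vswitch standard policy failover set --vswitch-name="$vsw" \
        --active-uplinks="$active" --standby-uplinks="$standby" || return 1

    # Port groups for the VM subnets / iSCSI networks; by default they
    # inherit the vSwitch teaming policy set above.
    for pg in "$@"; do
        esxcli network vswitch standard portgroup add \
            --vswitch-name="$vsw" --portgroup-name="$pg" || return 1
    done
}
```

Usage from the ESXi shell (placeholder names): `setup_vswitch vSwitch1 vmnic4,vmnic5 vmnic0,vmnic1 "VM Network"`.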
Resizing VMware Disks
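I found the following procedure useful for resizing VMware disks. First attach a new virtual disk of the desired size to the VM and reboot into a "rescue" ISO image. Then partition, format, and mount the new disk (the example below mounts it at /mnt/sdb1) and copy the files over: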
4. Run the find | cpio Spell
Now this spell doesn't have a lot to it, but it's funny how you memorize scripts like this over the years after using them and passing them along to friends. First, change to the root level of the partition you want to copy and then execute the command as root. So, to migrate my root partition from single-user mode, I did the following:
cd /
find ./ -xdev -print0 | cpio -pa0V /mnt/sdb1
To migrate from a rescue disk, the command is almost identical, but you first change to the mountpoint of the source partition (/dev/sda1 in my case) instead:
find ./ -xdev -print0 | cpio -pa0V /mnt/sdb1
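The same copy can be wrapped in a small function that does the cd for you. This is a sketch, not part of the quoted article, and the function name copy_fs is hypothetical. It adds cpio's -m flag, because GNU cpio only preserves modification times when -m is given. Drop -m to match the quoted command exactly.

```shell
#!/bin/sh
# Sketch of the find | cpio copy as a reusable function (run as root from
# the rescue environment). Both arguments must be absolute paths, because
# cpio resolves the destination after we have cd'ed into the source.

copy_fs() {
    if [ "$#" -ne 2 ]; then
        echo "usage: copy_fs <src-mountpoint> <dst-mountpoint>" >&2
        return 1
    fi
    for p in "$1" "$2"; do
        case "$p" in
            /*) ;;
            *) echo "copy_fs: '$p' is not an absolute path" >&2; return 1 ;;
        esac
    done
    # find: -xdev stays on this one filesystem; -print0 NUL-terminates names.
    # cpio: -p pass-through copy, -m keep modification times, -a reset access
    #       times on the source, -0 read NUL-terminated names, -V dot per file.
    (cd "$1" && find ./ -xdev -print0 | cpio -pma0V "$2")
}
```

For example, `copy_fs /mnt/sda1 /mnt/sdb1` copies the filesystem mounted at /mnt/sda1 (a hypothetical mountpoint for /dev/sda1) onto the new partition.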
Upgrading VMware vCenter DB from SQL 2005 Express to SQL Server 2008 R2
After re-enabling VMware Data Recovery (which logs a lot of information to the back-end vCenter DB), we again hit the 4GB database size limit of SQL Server 2005 Express on February 3, 2012. The solution was to finally complete the transition to a "real" DB. This is the link I wish I had used to make this transition:
Instead I started here:
Which points to here for the bulk of the useful information:
- Install SQL Server 2008 R2 after complying with the pre-install checks (get the right features in place and set up the account)
- Detach the vCenter and VMware Update Manager DBs from SQL 2005 Express
- Copy DB files from original location to SQL Server 2008 R2 location
- Attach DBs to SQL 2008 R2
- Make sure all the needed SQL Agent files are installed for DB maintenance
- Verify JDBC TCP/IP connections work
This last item was a problem. The original SQL 2005 Express instance must have been using dynamic ports and had settled on port 49163. After the steps above (except for the JDBC part), I was seeing problems on the Performance and Storage Views tabs. This was similar to the VMware problem noted here:
There are guidelines for setting up JDBC networking here: Set up JDBC access
I first made sure the vCenter server's firewall was set up to allow both static and dynamic port access, as described here: http://msdn.microsoft.com/en-us/library/ms175043.aspx
For the static case I opened ports 1433 and 49163.
The smoking gun for the problem was in this log file:
C:\Users\All Users\VMware\VMware VirtualCenter\Logs\stats
The problematic logs showed:
2012-02-03 21:42:09,598 Thread-2 ERROR com.vmware.vim.common.lifecycle.InitializerExecutor] Initialization error; attempt 19 will begin in 60 seconds...
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (The TCP/IP connection to the host localhost, port 49163 has failed. Error: "Connection refused: connect. Verify the connection properties, check that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port, and that no firewall is blocking TCP connections to the port.".)
at com.vmware.vim.common.lifecycle.InitializerExecutor$MonitorCallback.run(Unknown Source)
Caused by: java.lang.IllegalStateException: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (The TCP/IP connection to the host localhost, port 49163 has failed. Error: "Connection refused: connect. Verify the connection properties, check that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port, and that no firewall is blocking TCP connections to the port.".)
at com.vmware.vim.common.lifecycle.InitializerExecutor$MonitorCallback$1.run(Unknown Source)
This happened even though, when I checked, the "dynamic port" in the SQL Server configuration was something other than 49163.
I finally got things to work by running the SQL Server Configuration Manager, selecting "SQL Server Network Configuration" -> "Protocols for VMWARE", right-clicking on TCP/IP, and choosing Properties. At the bottom is the setting for the dynamic port (initially '0'; after a restart SQL Server selects a port). It had chosen 52143, but I changed it to 49163 and then things worked.
Add New MD3820i
We received new iSCSI storage at the beginning of August 2015. The new system is a Dell MD3820i with 24 2.5" 1.2TB 10K SAS disks and a 5-year warranty. Details on its setup and configuration are at SetupDellMD3820i.
- 07 Oct 2008