Updated with Patch! Massive Bug Found in VMware ESX 3.5U2

Updated with the Express Patch information and download links.

This was first reported yesterday evening on both the VMware Community forums and DeployLinux.com. As far as anyone can tell, a bug in the VMware license management code is preventing any system running ESX 3.5U2 from powering on virtual machines as of this morning. VMware is working to identify the cause and release a patch, but the more important question is, "Why wasn't this caught before it shipped?" As Matt Marlowe posted:

OK, while we're all remaining calm....just imagine the implications that bugs like this can occur and get past QA testing....5 years down the road, nearly all server apps worldwide pretty much running in VM's (pretty easy prediction)......some country decides to initiate cyberwarfare and manages to get a backdoor into whatever is the prevailing hypervisor of the day.....boom. All your VM's belong to us. [...]

I'd love to find out what happened here. Don't they do any regression testing on new releases to check for date based bugs? I thought that would be pretty obvious.
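
To illustrate the kind of date-based regression check Matt is asking about, here is a small sketch. It is not VMware's licensing code: `license_ok`, `first_failing_date`, and the YYYYMMDD date format are assumptions made up for the example. The idea is that the check takes "today" as an argument instead of reading the system clock, so a test can sweep simulated future dates and see where a build stops working.

```shell
#!/bin/sh
# Hypothetical license check: "today" and the expiry are passed in as
# YYYYMMDD integers, so tests can simulate any date without touching
# the system clock. An empty expiry means "never expires".
license_ok() {
    today="$1"
    expiry="$2"
    [ -z "$expiry" ] && return 0     # no expiry configured
    [ "$today" -lt "$expiry" ]       # valid only before the expiry date
}

# Sweep a list of simulated dates and print the first one on which the
# check fails, or "none" if every date passes.
first_failing_date() {
    expiry="$1"; shift
    for d in "$@"; do
        if ! license_ok "$d" "$expiry"; then
            echo "$d"
            return 0
        fi
    done
    echo "none"
}
```

Run against a release candidate with dates months or years ahead, a sweep like `first_failing_date "$expiry" 20080811 20080812 20090101` would flag a leftover expiry before customers hit it.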

There have been some updates on this situation since it broke last night:

1. Frank Wegner's suggested workaround:

* Do nothing
* Turn DRS off
* Avoid VMotion
* Avoid powering off VMs

I'd counsel against turning DRS off, as that actually deletes resource pool settings....instead, set sensitivity to 5, which should effectively disable it w/ minimal impact.

2. VMware has stated they will have fixes available in 36hrs at the earliest.

3. Anand Mewalal's suggested workaround:

We used the following workaround to power on the VMs.
Find the host where a VM is located.
Run 'vmware-cmd -l' to list the VMs.
Issue the commands:
service ntpd stop
date -s 08/01/2008
vmware-cmd /vmfs/volumes/
service ntpd start

4. Reportedly, there are no obvious warnings in the logs or in VirtualCenter (VC) before the bug hits. VC will continue to show the hosts as licensed, and no errors will appear in the vmkernel log file until you try to start a new VM, reboot a VM, or reboot the host.
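
A minimal sketch of Anand Mewalal's clock-rollback workaround (item 3 above), wrapped so it can be previewed before touching a host. It defaults to a dry run that only prints each command. The `run` and `power_on_with_rollback` names and the `DRY_RUN` variable are made up for this illustration, and the `start` operation and the .vmx path argument are assumptions, because the original `vmware-cmd /vmfs/volumes/` line is cut off.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints the commands; nothing is executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

power_on_with_rollback() {
    vmx_path="$1"        # hypothetical: full path to the VM's .vmx file
    rollback_date="$2"   # e.g. 08/01/2008, as in the quoted workaround

    run service ntpd stop              # stop NTP so it cannot reset the clock
    run date -s "$rollback_date"       # set the host clock to the given date
    run vmware-cmd "$vmx_path" start   # assumed operation: power on the VM
    run service ntpd start             # restart NTP so time sync resumes
}
```

With `DRY_RUN` left at its default, calling `power_on_with_rollback` with a .vmx path and `08/01/2008` prints the four commands in order. The sketch does not restart ntpd if an earlier step fails, so check the host's time sync afterwards.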

Any more info we get will be added as we find it!

UPDATE 1: According to the new FAQ posted:

Resolution:

VMware Engineering has isolated the root cause and is working to produce an express patch for impacted customers today. The target timeframe is 6pm, August 12, 2008 PST.

That's excellent news for those affected!

UPDATE 2: The Express Patch has been released to fix this issue:

Express Patch Download

Special Notice: Please Read

An issue has been uncovered with ESX/ESXi 3.5 Update 2 that causes the product license to expire on August 12, 2008.

Follow the steps below to correct this issue:

1. Read the following Knowledge Base articles first:
   * For the fix to the virtual machine power-on failure, refer to KB 1006716
   * For VI 3.5, refer to KB 1006721 for deployment considerations and instructions
   * For VI 3.5i, refer to KB 1006670 for deployment considerations and instructions
2. Download and apply the express patch for the product(s) you have:
   * VMware ESXi 3.5 Update 2 Express Patch
   * VMware ESX 3.5 Update 2 Express Patch

News source: All your VM's belong to us
Link: BIG bug in ESX 3.5 Update 2
Link: KB 1006716: Unable to Power On virtual machine with "A General System error occurred: Internal error"
Download: Express Patches for ESX 3.5U2

13 Comments

Hopefully this is the first and last time something this major happens. I wonder if anyone will lose their job over this (at VMware... hopefully other companies would understand that there was little that could have been done).

Just a quick note:

They've posted a manual update (read: no VMware Update Manager deployments) for ESX and ESXi on the Knowledge Base article page.

/off to install

What irks me is that this problem, like many other public 'incidents' with major software vendors, comes down to freaking software licensing schemes. The actual core product works brilliantly and it is just the licensing aspect of it that has let the side down!

How many more times!

Had this problem this morning. Was spawning 100 XP virtual machines on my VDI cluster and came into work this morning and none of them would turn on. Thought it was a corrupted template so I deleted all 100 machines, then I found this.

doh! :confused:

In the same boat. Actually ran the "new" host remediation feature against our VMware farm this weekend. Just in time to get hosed by the bug! Oh well, live and learn. :)

What I don't get is VMware stating the ISOs will be upgraded first and patches released later in the week. How difficult would it be to push out a patch for the licensing service? Heck, with the licensing services being useless with Update 2, I don't see why you couldn't upgrade them with your hosts online.

(MadDog said @ #5)
What I don't get is VMware stating the ISOs will be upgraded first and patches released later in the week. How difficult would it be to push out a patch for the licensing service?

It's far easier and faster from a development standpoint to update the main trunk and re-issue a full build than to build a patch for an existing installation; there's much less regression and dependency testing to be done.

Also keep in mind that the licensing routines are most likely present in a lot of packages in the ESX Server installation, so it's not as simple as reinstalling a single RPM.

Got stung by this as well today in the new VMware ESX infrastructure that we're currently building. We'd deliberately kept the servers at 3.5.0 U1, but as you may see in the main thread about this on VMware's forums, there is ANOTHER bug that silently updated U1 systems to U2 even where admins had explicitly requested NOT to get the U2 update.

We smugly thought that we were ok, but on closer inspection discovered that the U2 update had gone on last night when we did some critical patches.

I'm glad we don't have a major infrastructure on this yet - but the systems we bought (4x DL585s, each with 4x AMD Barcelona quad-core CPUs (so 16 cores per server) and 64GB of RAM) will ultimately scale to hundreds of hosts.

Not funny. Especially right on patch Tuesday!!

It affects VMware ESXi 3.5 Installable Update 2 and VMware ESX 3.5 Update 2; both are now unavailable on the VMware download site.

Update 1 revisions of each are unaffected and still downloadable

been hit with the bug this morning, luckily we're just beginning the platform setup so it's no biggie to go back to 3.5u1 , but this got me confused for a few minutes
even putting the infrastructure in evaluation mode didn't work, the whole licensing core is affected, probably an inherited setting from the beta builds, scheduled to expire on the 11th

to vmware's credit, the guys are scrambling to get a new build out before tomorrow noon. in the meantime, turning NTP off on the ESX Server nodes and setting the date back a few days works just fine, provided the guests don't sync with them through the vmtools and you don't need legal timestamp compliance

I wouldn't call that a bug anyway, rather a little 'oops, forgot to turn that off' with massive consequences

next time, when signing off a build to RTM, they should have their compiler turn the Beta/Expiration flag off automatically
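
The release gate suggested above could look something like this sketch. The `BETA_EXPIRY` key, the key=value config file, and the `check_release_config` function are hypothetical; nothing here reflects how VMware's build system actually works.

```shell
#!/bin/sh
# Hypothetical release gate: refuse to sign off a build whose config
# file still carries a beta expiry. The file layout and the BETA_EXPIRY
# key are illustrative assumptions.
check_release_config() {
    config_file="$1"
    # Match an uncommented BETA_EXPIRY=<non-empty value> line.
    if grep -Eq '^[[:space:]]*BETA_EXPIRY=[^[:space:]]' "$config_file"; then
        echo "FAIL: beta expiry still set in $config_file"
        return 1
    fi
    echo "OK: no beta expiry in $config_file"
    return 0
}
```

Wired into the sign-off step, a non-zero return would block the build, so the flag gets turned off by process rather than by someone remembering to do it.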