VMware - weird networking issues with e1000



I'm having weird issues and have had a ticket open with VMware for a month. Basically it goes like this:

Environment: VMware ESXi 5.5 U2 or 6.0 (happens in both environments)

Guest OS: Windows Server 2012 R2

e1000/e1000e issue: according to the event logs the NIC cannot be found, and the guest randomly reports a 0.0.0.0 address conflict, after which nothing can communicate. The fix is to reboot the guest OS. VMware recognizes this issue and recommends not using these adapters with Windows 2012: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109922  According to that article there is no workaround; use the VMXNET3 driver instead.

VMXNET3 issue: users experience slow connections to services, taking minutes to move between pages. It is so bad that it is unacceptable; it didn't just add seconds between pages, it added minutes.

 

So the choice I am left with is instability vs. completely unusable. Has anyone seen anything like this, or know of a fix?

 

VMware's current stance on e1000/e1000e: not our problem, it is a Microsoft problem; nothing shows up in the logs, call Microsoft and complain.

VMware's current stance on VMXNET3: we performance-tuned that driver before packaging it into vSphere; you should not be having those issues.

 

I have a support person contacting me tomorrow to go through the VMXNET3 issue and see if he can figure out what's wrong with that driver. My expectation is that I'll come out of it with another day gone and the issue still unresolved.

 


I am using VMXNET3 on ESXi 6 with 2012 R2.

"VMXNET3 issue: users experience slow connections to services, taking minutes to move between pages."

Web services? File-sharing services? Can you give me details of the problem so I can try to duplicate it?

Do you have VMware Tools installed? Are you on the current 6.0 build, 2809209? 5.5 U2 is quite old; the current 5.5 build is 2718055.

 


Don't use e1000 NICs with Server 2008+ guests. Use VMXNET3. The e1000 NIC on Server 2008+ can cause all kinds of weird problems: dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host. Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC, and make sure you remove the e1000 device through Device Manager.
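As a sketch of that Device Manager cleanup (Windows-only command fragment, run inside the guest from an elevated command prompt; the environment-variable trick is the standard way to make ghost devices visible):

```shell
:: After the e1000 vNIC has been removed from the VM's hardware settings:

:: Tell Device Manager to show non-present ("ghost") devices.
set devmgr_show_nonpresent_devices=1

:: Launch Device Manager from this same prompt so it inherits the variable.
start devmgmt.msc

:: In Device Manager: View > Show hidden devices, then uninstall the
:: greyed-out Intel/e1000 adapter under Network adapters.
```

The variable only takes effect for Device Manager instances launched from the same session, so launching devmgmt.msc from that prompt matters.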


I am using VMXNET3 on ESXi 6 with 2012 R2.

"VMXNET3 issue: users experience slow connections to services, taking minutes to move between pages."

Web services? File-sharing services? Can you give me details of the problem so I can try to duplicate it?

Do you have VMware Tools installed? Are you on the current 6.0 build, 2809209? 5.5 U2 is quite old; the current 5.5 build is 2718055.

 

Well, it is a cluster; it is our ERP system and it is in beta. So while it is an inconvenience now, it is just an inconvenience. 1 HTML server, 1 SQL DB, 1 logic-and-batch server, 1 file server, 1 reporting server: 5 servers altogether.

Current build: 5.5.0, 2718055

VMware Tools version on the guest OSes: 9.4.12 build 2627939

 

Don't use e1000 NICs with Server 2008+ guests. Use VMXNET3. The e1000 NIC on Server 2008+ can cause all kinds of weird problems: dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host. Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC, and make sure you remove the e1000 device through Device Manager.

Yes, but using the VMXNET3 driver makes the system unusable. I am faced with unusable or crashing, neither of which is an acceptable solution. Do you really think, if it weren't a big deal, that I would get VMware involved, or that I would ask a question here after a month of getting nowhere with VMware?

I will put it to you like this. e1000 provides instant or near-instant logons, database queries, and page loads. With no change other than switching to the VMXNET3 driver (using the same vSwitch and statically assigning the same IP), it slows down to the point of watching Elmer's glue solidify and turn clear before it lets you log on; meaning about 10 minutes before you get past the logon screen, and the same for each subsequent page. It really is unusable. It isn't another 10 seconds, it is another 10 minutes; the keyword is minutes, not seconds. If you would rather see it in seconds, it adds an additional 600 (that is two zeros) seconds to the 1 second or less with the e1000e driver.


So does your cluster have both 5.5 hosts and 6.0 hosts, or are those two different setups?

What exactly is slow? I'm guessing the web server talks to the DB server, etc., so you have a lot of interaction going on. Are they all on the same vSwitch and network? Are you using standard vSwitches or distributed?

What physical NICs are connected to the vSwitches? Do you have them in any sort of load-balancing or failover configuration? Are you doing checksum offloading, etc.?

 


VMXNET3 is used on millions (probably billions) of VMs around the world, so what needs to be determined is why it is an issue in your environment. I would start with your vSphere networking and go from there.


No, two different setups.

I have 4 physical hosts; the DB is on one host and the other servers are spread across the other 3. All standard vSwitches, and each guest has its own dedicated physical NIC, so it is one-to-one. The vSwitch and VMware do not see drops; the 2012 server sees the NIC uninstall itself.


VMXNET3 is used on millions (probably billions) of VMs around the world, so what needs to be determined is why it is an issue in your environment. I would start with your vSphere networking and go from there.

Why does the e1000e driver perform normally while VMXNET3 does not? Something is different between the two. I don't think it is an issue with the vSwitch (it could be), but why would the vSwitch care? And what would I change to optimize speed on the vSwitch? This is all out-of-the-box configuration; nothing was changed beyond naming the vSwitch. I have even switched it down to access instead of trunk, to eliminate any issues VMware could possibly have with trunking; it is a flat VLAN anyway.


As to uninstalling itself: does the vNIC get removed from the VM's settings, or does the VM just think there is no NIC?

I thought the only problem with VMXNET3 was performance? But which part of performance? You have lots of balls in the air here: a web server, I take it, that talks to your DB, and I don't know what your other servers in this system do talking to each other. What do performance and utilization look like on your VMs? Are you seeing drops or retransmits across the physical network between your hosts?

What is the performance like if you put all the VMs on the same host, on the same vSwitch?

There can be issues with offloading. Is it enabled on the host's hardware? Is it enabled in the OS driver for the VMXNET3 NIC?
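One way to check the guest-side driver settings on 2012 R2 is PowerShell's NetAdapter cmdlets; a sketch, where "Ethernet0" is a placeholder adapter name (substitute the real one from Get-NetAdapter):

```shell
# PowerShell inside the Server 2012 R2 guest.

# All advanced driver properties of the vNIC (offloads, RSS, buffers):
Get-NetAdapterAdvancedProperty -Name "Ethernet0"

# Focused views of checksum-offload and RSS state:
Get-NetAdapterChecksumOffload -Name "Ethernet0"
Get-NetAdapterRss -Name "Ethernet0"
```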

Let's see what the tech says about your setup, I guess.

This doesn't spell out 2012, but have you looked at this: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2008925

 


The setup is very vanilla, so whatever the defaults are. Offloading is not enabled. Utilization is on the very low side.

 

With the e1000/e1000e, the event log shows:

event ID 27 + 32,

followed by 4201, 7043, and 7036, then the eventual event ID 4199.

To me it seems like the NIC, or a component of the NIC, is not installed or lacks the proper software support to run correctly, and eventually commits suicide, killing all communications.
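For anyone wanting to pull those same events quickly, a sketch using the built-in wevtutil (Windows-only command fragment, run inside the guest; the event IDs are the ones from this post):

```shell
:: Dump the 10 most recent occurrences of the suspect System-log events,
:: newest first, in readable text.
wevtutil qe System /q:"*[System[(EventID=4201)]]" /c:10 /f:text /rd:true
wevtutil qe System /q:"*[System[(EventID=4199)]]" /c:10 /f:text /rd:true
```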

 

------------------------------------------------------

The VMXNET3 only has slowness, with no event logs. I have never had an issue like this in other setups using either driver. It is a complex setup, but it isn't an uncommon one.


You get this error for 32?

Event ID 32 - Source: Disk - Description: The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur.

That could be a problem if you're seeing corruption in the file system, for sure!

So how many of those 4201s are you seeing? Do you get a MAC address in your 4199 event?


Event ID 32:

The description for Event ID 32 from source e1iexpress cannot be found. Either the component that raises this event is not installed on your local computer, or the installation is corrupted. You can install or repair the component on the local computer. The following information was included with the event:

Intel(R) 82574L Gigabit Network Connection.

------------------------------------

 

4201: roughly 25 a day.

 

-------------------------------

 

The MAC address is that of my switch.


Don't use e1000 NICs with Server 2008+ guests. Use VMXNET3. The e1000 NIC on Server 2008+ can cause all kinds of weird problems: dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host. Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC, and make sure you remove the e1000 device through Device Manager.

Serious question: have you got any evidence to back this up? I know some people who might be interested in this.



Serious question: have you got any evidence to back this up? I know some people who might be interested in this.

I got the Windows version wrong; it looks like it's 2012+, not 2008+. But that's still relevant to this post. It also looks like VMware fixed at least some of the e1000 and e1000e issues in subsequent 5.x and 6.0 releases. However, after personal horrible experiences with e1000 NICs, I always use VMXNET3 and have never had any issues.
http://community.spiceworks.com/topic/640996-esxi5-e1000-server-2012-and-the-purple-screen-of-death-psod
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053

One more... I experienced this problem on ESXi 5.5, though the article only specifies 5.0 and 5.1:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058692


Why does the e1000e driver perform normally while VMXNET3 does not? Something is different between the two. I don't think it is an issue with the vSwitch (it could be), but why would the vSwitch care? And what would I change to optimize speed on the vSwitch? This is all out-of-the-box configuration; nothing was changed beyond naming the vSwitch. I have even switched it down to access instead of trunk, to eliminate any issues VMware could possibly have with trunking; it is a flat VLAN anyway.

Did you go through the process of removing the e1000 NIC from Device Manager? If you didn't, the problems you're seeing with VMXNET3 might be related to a ghost NIC.

Follow these instructions to remove the ghost NIC: http://blogs.technet.com/b/danstolts/archive/2010/09/25/how-to-find-a-lost-missing-hidden-or-removed-network-card-nic-or-other-device-and-even-remove-it.aspx


I'm fully aware of how to remove a ghost NIC, thanks though.

This one is a bit easier to follow, though; the page you linked gives me a headache. I would remove it from your favorites and never refer back to it:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1179

 


Very preliminary, but I think we have a fix for the VMXNET3 driver. Two servers are in a processor over-commitment state. That's not really an issue for the e1000 driver but, according to support, the VMXNET3 driver uses processor cores to take some of the load off the NIC via receive-side scaling (RSS). Upon disabling receive-side scaling, the slowness issue seems to be resolved (I cannot tell yet whether that is temporary). Normally I do not run over-committed VMs; however, this isn't a normal install. Right now I am faced with 3 options:

1. Disable receive-side scaling.

2. Drop the cores down on the guest OSes.

3. Purchase and install new processors.
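For reference, option 1 can be done either per adapter or globally in the guest; a sketch (Windows-only command fragment; "Ethernet0" is a placeholder adapter name):

```shell
# PowerShell inside the guest: disable RSS on one adapter.
Disable-NetAdapterRss -Name "Ethernet0"

# Or globally for the TCP stack via netsh:
netsh int tcp set global rss=disabled

# To revert once the core count is fixed:
# Enable-NetAdapterRss -Name "Ethernet0"
# netsh int tcp set global rss=enabled
```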

 

This is a beta environment (the total concurrent user count is 5, and they were experiencing extreme slowness even with that few users), so I am not overly concerned with disabling receive-side scaling. However, if this is the fix, my choice will be to either drop the cores or purchase new processors. If you guys are interested I will keep you updated; otherwise I will go on with those options.


This is a beta environment (the total concurrent user count is 5, and they were experiencing extreme slowness even with that few users), so I am not overly concerned with disabling receive-side scaling. However, if this is the fix, my choice will be to either drop the cores or purchase new processors. If you guys are interested I will keep you updated; otherwise I will go on with those options.

Very interested, thank you!


Right now we have chosen to leave receive-side scaling at its default setting, which is enabled. I have dropped the vCores on the guest OSes to be equal to or less than the total physical cores in the boxes. With some light user testing, meaning 1-2 users, everything seems to be running OK with the VMXNET3 driver. Monday through Thursday will be the real test. I will post back sometime next week to let you know the outcome, but if you don't hear anything by Friday you can safely assume this has indeed fixed the issue.


Related to this: here is a good article on troubleshooting VM network performance on ESXi with vsish:

http://www.v-front.de/2015/08/troubleshooting-vm-network-performance.html

So did I read that right: you were running more cores in your VMs than you actually had in your boxes? For example, the host has, say, 4 cores, and you were telling your VMs they had 8? I don't think that has ever been a recommended setup.

 


That is correct.

 

Yes, you shouldn't give a single guest more cores than the host physically has. I was going by the configuration the vendor recommended; they had our physical server specifications before they sent us their recommendation. The vendor used a cookie-cutter document without really looking at what is available at the individual client site... not the first time something like that has happened.

 

Regardless, that is what the VMware tech was using when he looked at the NIC/vNIC: no drops, no ring buffering; what he saw on the card was perfect. When he looked further at the setup, that is when he checked the guest configuration and saw the core issue. I didn't know that would have that drastic an effect on communications; I thought it would be treated like other servers on the hardware, where you can over-commit cores across multiple guests (8 physical cores, 10 guests utilizing 4 cores each).
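The distinction the tech drew (aggregate over-commitment across many guests vs. a single guest with more vCPUs than the host has physical cores) can be sketched like this; the numbers are illustrative, not measurements from this thread:

```python
def single_guest_overcommitted(host_cores, guest_vcpus):
    """True if any one guest is given more vCPUs than the host has cores."""
    return any(v > host_cores for v in guest_vcpus)

def total_overcommit_ratio(host_cores, guest_vcpus):
    """Aggregate vCPU:pCPU ratio across all guests on the host."""
    return sum(guest_vcpus) / host_cores

# 8 physical cores, 10 guests with 4 vCPUs each: heavily over-committed in
# aggregate (5:1), but no single guest exceeds the host, so it can still work.
print(single_guest_overcommitted(8, [4] * 10))   # False
print(total_overcommit_ratio(8, [4] * 10))       # 5.0

# A 6-core host handing one guest 8 vCPUs: the problem case in this thread.
print(single_guest_overcommitted(6, [8, 4]))     # True
```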


So far, no issues; things are moving pretty well. We are looking to upgrade the processors in those boxes from 2.0 GHz 6-core to 3.2 GHz 10-core. We looked at new servers, and they are double to triple the price of just buying processors (we have 412 GB of memory in each host).

