
I'm having weird issues and have had a ticket open with VMware for a month. Basically it goes like this:

environment: vmware esxi 5.5 u2 or 6.0 (happens in both environments)

guest os: windows 2012 r2

e1000/e1000e issue: according to the event logs the NIC cannot be found, and it randomly gets a 0.0.0.0 address conflict, after which nothing can communicate. The fix is to reboot the guest OS. VMware recognizes this issue and recommends not using these drivers with Windows 2012: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109922  According to this KB there is no workaround other than using the VMXNET3 driver.

VMXNET3 issue: users experience slow connections to services, taking minutes to get to different pages. It is so bad that it is unacceptable; it didn't just add seconds between pages, it added minutes.

 

So the choice I am facing is instability vs. completely unusable. Has anyone seen anything like this or know of a fix?

 

VMware's current stance on e1000/e1000e: not our problem, it is a Microsoft problem. Nothing shows up in the logs. Call Microsoft and complain.

VMware's current stance on VMXNET3: we performance-tuned that driver before packaging it into VMware, so you should not be having those issues.

 

I have a support person contacting me tomorrow to go through the VMXNET3 issue and see if he can figure out what is wrong with that driver. My expectation is that it will be another day with this issue unresolved.

 

I am using VMXNET3 on ESXi 6 with 2k12r2.

"vmnetx 3 issue:  users experience slow connection to services taking minutes to get to different pages."

Web services, file sharing services?  Can you give me details of the problem so I can try to duplicate it?

Do you have the vmware tools installed?  Are you on the current 6 build? 2809209  5.5 u2 is quite old, current build is 2718055

 

Don't use e1000 NICs with Server 2008+ guests.  Use VMXNET3.  The e1000 NIC on Server 2008+ can cause all kinds of weird problems like dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host.  Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC.  Also make sure you remove the e1000 device through Device Manager.

  On 13/08/2015 at 18:01, BudMan said:

I am using vmx3net on esxi 6 with 2k12r2

"vmnetx 3 issue:  users experience slow connection to services taking minutes to get to different pages."

Web services, file sharing services?  Can you give me details of the problem so can try and duplicate.

Do you have the vmware tools installed?  Are you on the current 6 build? 2809209  5.5 u2 is quite old, current build is 2718055

 

Well it is a cluster, it is our erp system and it is in beta.  So while it is an inconvenience now, it is just an inconvenience.  1 html server, 1 sql db, 1 logic and batch server, 1 file server, 1 reporting server, 5 servers altogether. 

Current build 5.5.0 2718055

vmtools version on the guest oses 9.4.12 build 2627939

 

  On 13/08/2015 at 18:06, njeske said:

Don't use e1000 NICs with Server 2008+ guests.  Use VMXNET3.  The e1000 NIC on Server 2008+ can cause all kinds of weird problems like dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host.  Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC.  Also make sure you remove the e1000 device through Device Manager.

Yes, but using the VMXNET3 driver makes the system unusable.  I am faced with unusable or crashing, neither of which is an acceptable solution.  Do you really think that if it weren't a big deal I would get VMware involved, or that I would ask a question here after a month of getting nowhere with VMware?

I will put it to you like this.  e1000 provides instant or near-instant logon and instant or near-instant database queries and page loads.  With no change other than switching to the VMXNET3 driver (using the same vswitch and statically assigning the same IP), it slows down to the point of watching Elmer's glue solidify and turn clear before it allows you to log on, meaning about 10 minutes before you move past the logon screen and subsequent pages.  It is really unusable.  It isn't another 10 seconds, it is another 10 minutes; the keyword is minutes, not seconds.  If you would rather I put it in seconds, it adds an additional 600 (that is two zeros) seconds to the 1 second or less with the e1000e driver.


So your cluster has 5.5 hosts and 6 hosts?  Or are these 2 different setups?

What exactly is slow?  so guessing the web server talks to the db server, etc..  So you have a lot of interaction going on..  Are they all on the same vswitch and network?  Are you using standard vswitches or distributed?

What are the physical NICs connected to the vswitches? Do you have them in any sort of load balancing or failover?  Are you doing offloading of checksums, etc.?

 

VMxnet3 is used on millions (probably billions) of VM's around the world, so what needs to be determined is why in your environment is this an issue. I would start with your vSphere networking and go from there.

No, 2 different setups.

I have 4 physical hosts; the DB is on one host, and the other servers are spread across the other 3 hosts.  All standard vswitches, and each guest has its own dedicated physical NIC, so it is one to one.  The vswitch and VMware do not see drops; the 2012 server sees the NIC uninstall itself.

  On 13/08/2015 at 19:34, Stokkolm said:

VMxnet3 is used on millions (probably billions) of VM's around the world, so what needs to be determined is why in your environment is this an issue. I would start with your vSphere networking and go from there.

Why does the e1000e driver work normally as far as speed goes when the VMXNET3 does not?  Something is different between these two. I don't think it is an issue with the vswitch (it could be), but why would the vswitch care?  And what would I change to optimize speed on the vswitch?  This is all out-of-the-box config; nothing was really changed apart from naming the vswitch.  I have even switched the port from trunk down to access to eliminate any issues that VMware could possibly have with trunking. It is a flat VLAN anyway.


As to uninstalling itself:  does the vNIC get removed from the VM settings, or does the VM just think there is no NIC?

I thought the only problem with VMXNET3 was performance?  But which part of the performance? You have lots of balls in the air here: you have a web server, I take it, which talks to your DB, and I don't know what the other servers in this system do when talking to each other.  What do performance and utilization look like on your VMs? Are you seeing drops or retransmits across your physical network between your hosts?

What is the performance like if you put all the vms on the same host on the same vswitch?

There can be issues with offloading, is it enabled on your hardware of the host? Is it enabled in the OS driver for vmx3 nic?

Let's see what the tech says about your setup, I guess.

This doesn't spell out 2k12 - but have you looked at this http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2008925

 

the setup is very vanilla so whatever the defaults are.  Offloading is not enabled.  Utilization is on the very low side.

 

with the e1000/e1000e eventlog:

event id 27 + 32,

followed by 4201 7043 7036 then the eventual event id 4199

To me it seems like the NIC or a component of the NIC is not installed, or does not have the proper software support to run correctly, and eventually it shuts itself down, killing all communications.
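To put numbers on a sequence like this, the events can be tallied per day from an export of the System log. A minimal sketch in Python, assuming the log has already been exported to rows of (date, event ID). The sample rows and the `count_events_per_day` helper are hypothetical and made up for illustration; they are not part of any VMware or Microsoft tooling:

```python
from collections import Counter, defaultdict

# Event IDs mentioned in this thread for the e1000/e1000e failure.
WATCHED_IDS = {27, 32, 4199, 4201, 7036, 7043}

def count_events_per_day(events, watched=WATCHED_IDS):
    """Return {date: Counter({event_id: count})} for the watched IDs only."""
    per_day = defaultdict(Counter)
    for day, event_id in events:
        if event_id in watched:
            per_day[day][event_id] += 1
    return dict(per_day)

# Hypothetical sample rows (date string, event ID), not real log data.
events = [
    ("2015-08-12", 27),
    ("2015-08-12", 32),
    ("2015-08-12", 4201),
    ("2015-08-12", 4201),
    ("2015-08-12", 1014),  # unrelated event ID, ignored by the filter
    ("2015-08-13", 4201),
    ("2015-08-13", 4199),
]

per_day = count_events_per_day(events)
for day in sorted(per_day):
    print(day, dict(per_day[day]))
```

A daily count makes it easier to see whether the 4201 events cluster around each 4199 conflict or are spread evenly through the day.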

 

------------------------------------------------------

The VMXNET3 only has slowness, with nothing in the event logs.  I have never had an issue like this in other setups using either driver.   It is a complex setup, but it isn't an uncommon one.

You get event 32, this error?

event ID: 32 - Source: disk - Description: The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur.

That could be a problem if you're seeing corruption in the file system, for sure!

So how many of those 4201 events are you seeing? Do you get the MAC address in your 4199 event?

Event id 32:

the description for event id 32 from source e1iexpress cannot be found.  either the component that raises this event is not installed on your local computer or the installation is corrupted.  you can install or repair the component on the local computer.  the following information was included with the event:

Intel(R) 82574L Gigabit Network Connection.

------------------------------------

 

4201 ~ 25 a day

 

-------------------------------

 

Mac address is that of my switch. 

  On 13/08/2015 at 18:06, njeske said:

Don't use e1000 NICs with Server 2008+ guests.  Use VMXNET3.  The e1000 NIC on Server 2008+ can cause all kinds of weird problems like dropped packets, VLAN tags being incorrectly applied, even a PSOD on the ESXi host.  Before you switch over to VMXNET3, make sure you take the IP address off of the e1000 NIC.  Also make sure you remove the e1000 device through Device Manager.

Serious question: have you got any evidence to back this up? I know some people who might be interested in this.

  On 13/08/2015 at 20:56, John Teacake said:

Serious question, Have you got any Evidence to back this up? I know some people that might be interested in this. 

Here are some kbs for you and you can make a decision based on what you read or send them to your people:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109922

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2058692&sliceId=1&docTypeID=DT_KB_1_1&dialogID=124387093&stateId=0 0 124391023

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805

 

 

 

  On 13/08/2015 at 20:56, John Teacake said:

Serious question, Have you got any Evidence to back this up? I know some people that might be interested in this. 

I got the Windows version wrong, as it looks like it's 2012+, not 2008+.  But that's still relevant to this post.  Also looks like VMware fixed at least some of the e1000 and e1000e issues in subsequent 5.x and 6 releases.  However after personal horrible experiences with e1000 NICs I always use VMXNET3 and have never had any issues.
http://community.spiceworks.com/topic/640996-esxi5-e1000-server-2012-and-the-purple-screen-of-death-psod
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053

One more...  Experienced this problem on ESXi 5.5, though the article only specifies 5.0 and 5.1
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058692

  On 13/08/2015 at 19:41, sc302 said:

why does the e1000e driver work normally as far as speed goes and the vmxnet3 does not.  something is different between these two, I don't think it is an issue with the vswitch (could be) but why would the vswitch care?  And what would I change to optimize speed on the vswitch.  this is all out of the box configs, nothing really changed outside of naming the vswitch.  I have even switched it down to access vs trunk to eliminate any issues that vmware could possibly have with trunking....it is a flat vlan anyway. 

Did you go through the process of removing the E1000 NIC from Device Manager? If you didn't then the problems you're seeing with VMxnet3 might be related to a ghost NIC.

Follow these instructions to remove the ghost NIC: http://blogs.technet.com/b/danstolts/archive/2010/09/25/how-to-find-a-lost-missing-hidden-or-removed-network-card-nic-or-other-device-and-even-remove-it.aspx

Fully aware of how to remove a ghost NIC, thanks though.

The KB below is a bit easier to follow, though; looking at the page you linked to gives me a headache.  I would remove it from your favorites and never refer back to it.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1179

 

Very preliminary, but I think we have a fix for the VMXNET3 driver.  2 servers are in a processor overcommitment state.  That is not really an issue for the e1000 driver but, according to support, the VMXNET3 driver uses processor cores to take some of the load off the NIC by way of receive-side scaling.   After disabling receive-side scaling, the slowness issue seems to be resolved (I cannot tell at this point whether that is temporary or not).  Normally I do not run overcommitted VMs, but this isn't a normal install.  Right now I am faced with 3 options:

1. disable receive side scaling

2. drop the cores down on the guest oses

3. purchase new processors and install.
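Option 2 comes down to finding the guests that are configured with more vCPUs than the host has physical cores. A rough sketch of that check in Python; the host core count, the guest names, and the vCPU numbers are all hypothetical placeholders, since the real figures are not given in this thread:

```python
def oversized_guests(host_physical_cores, guest_vcpus):
    """Return the names of guests with more vCPUs than the host has physical cores."""
    return sorted(
        name
        for name, vcpus in guest_vcpus.items()
        if vcpus > host_physical_cores
    )

# Hypothetical inventory: one host and the vCPU count configured per guest.
host_physical_cores = 12
guest_vcpus = {
    "sql-db": 16,
    "html": 4,
    "logic-batch": 16,
    "file": 2,
}

# Guests listed here would need their vCPU count dropped to 12 or fewer.
print(oversized_guests(host_physical_cores, guest_vcpus))
```

In a real environment the inventory would come from the vSphere client or an export rather than a hand-typed dictionary.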

 

This is a beta environment (the total concurrent user count is currently 5, and they were experiencing extreme slowness even with that few users), so I am not overly concerned about disabling receive-side scaling. However, if this is the fix, my choice will be to either drop the cores or purchase new processors.   If you guys are interested I will keep you updated; otherwise I will go on with those options.

  On 14/08/2015 at 14:54, sc302 said:

This is a beta environment (total current concurrent user count is 5 and they were experiencing extreme slowness with that few user count) so I am not overly concerned with disabling receive side scaling, however if this is a fix my choice will be to either drop the cores or purchase new processors.   If you guys are interested I will keep you updated, otherwise I will go on with those options.

Very interested, thank you!

Right now we have chosen to leave receive-side scaling at its default setting, which is enabled.  I have dropped the vCPU cores on the guest OSes to be equal to or less than the total physical cores on the boxes.  With some light user testing, meaning 1-2 users, everything seems to be running OK with the VMXNET3 driver.  Monday-Thursday will be the test.  I will post back sometime next week to let you know the outcome, but if you don't hear anything by Friday you can safely assume that this has indeed fixed the issue.

So related to this - here is a good article on troubleshooting network issues on esxi with vsish

http://www.v-front.de/2015/08/troubleshooting-vm-network-performance.html

So did I read that right you were running more cores in your VMs than you actually had on your boxes?  So for example host has say 4 cores, you were telling your vms they had say 8?  I don't think that has ever been a recommended setup??

 

That is correct.  

 

Yes, you shouldn't allow a single guest to use more cores than the host has available. I was going by the configuration the vendor recommended; they had our physical server configurations before they sent us their recommended configuration.  The vendor used a cookie-cutter document without really looking at what is available at the individual client site. Not the first time something like that has happened.

 

Regardless, that (vsish) is what the VMware tech was using when he was looking at the NIC/vNIC.  No drops,  no ring buffering;  what he saw on the card was perfect.  It was when he looked further at the setup, at the guest configuration, that he saw the core issue.  I didn't know that it would have that drastic an effect on communications. I thought it would be treated like other servers on the hardware, where you can have an overcommitment of cores across multiple guests (8 physical cores, 10 guests utilizing 4 cores each).
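The distinction in that last example is between aggregate overcommitment across many guests and a single guest that is bigger than the host. A small worked sketch in Python using the numbers from the post (8 physical cores, 10 guests with 4 vCPUs each); the function name is just an illustrative choice:

```python
def vcpu_to_pcore_ratio(physical_cores, vcpus_per_guest):
    """Total vCPUs across all guests divided by the host's physical cores."""
    return sum(vcpus_per_guest) / physical_cores

# The example from the post: 8 physical cores, 10 guests with 4 vCPUs each.
physical_cores = 8
guests = [4] * 10

ratio = vcpu_to_pcore_ratio(physical_cores, guests)
largest = max(guests)

# 40 vCPUs over 8 cores is overcommitted in aggregate...
print(f"aggregate ratio: {ratio:.1f}:1")
# ...but every individual guest still fits within the host's core count.
print(f"largest guest fits on host: {largest <= physical_cores}")
```

Here the host is overcommitted 5:1 in total, yet no single guest asks for more cores than physically exist, which is the case that differed in this setup.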

So far no issues; things are moving pretty well.  We are looking to upgrade the processors in those boxes from 2.0 GHz 6-core to 3.2 GHz 10-core.  We looked at new servers, and they are double to triple the price of buying processors (we have 412GB of memory in each host).
