Sudden broken RAM in a Rackmount Server

servers


#1 +snaphat (Myles Landwehr)

Posted 28 November 2013 - 05:00

We have an Asus RS920-E7/RS8 or RS926-E7/RS8 at my work (purchased through a third-party vendor). Yesterday, after a scheduled reboot, the machine suddenly stopped POSTing. On investigation, it appears that 4 of the 16 RAM modules (8 GB each) have gone bad. Moreover, the sockets holding the failed modules form an interesting pattern: there is one per NUMA node (that is, one per processor), and each appears to be the same socket position on its node, assuming 8 sockets per node.

 

I find it hard to believe that 4 modules that had all been working would fail independently at the same time. Given the circumstances, I suspect that the board itself is faulty and somehow damaged the modules. Based on the layout, my current guess is that these particular modules shared some kind of voltage source on the board.

 

Does this sound plausible? Any other ideas?

 

For reference:

The modules failed in sockets L1, N1, F1, and D1. Here is the manual:

http://www.manualsli...?page=31#manual