Power overload in oss01 #22
Also the flags are not the same with xeon07 (0x0500 vs 0x0100):
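To see exactly where the two flag words differ, they can be XORed. This is only a small sketch: the helper name `differing_bits` is made up for illustration, and the meaning of each bit is not stated in this issue, so the snippet reports positions only.

```python
def differing_bits(a: int, b: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = a ^ b  # a bit is set in diff only where a and b disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Flag values quoted above: oss01 reports 0x0500, xeon07 reports 0x0100.
print(differing_bits(0x0500, 0x0100))  # [10]
```

So the two values differ only in bit 10 (mask 0x0400).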
Here is the FRU:
changed the description
TODO:
https://filecenter.deltaww.com/Products/Download/19/1906/catalogue/DPS-2000AB-13%20A%20datasheet_20181218.pdf
Installed the power supply in xeon01 and it works fine. The problem is the oss01 node itself.
Attempting to boot with the known-good xeon01 power supply in either power supply socket results in an overload error from the BMC.
After removing all the components from the main board, the BMC still complains about the power supply.
No obvious problems with capacitors.
I removed the main board from the chassis and I can see a burnt part.
This area corresponds to the voltage regulator of the CPU:
It looks like it has burned and is causing a short circuit.
The regulator has some pads on the underside, so I will need a hot air soldering station to replace it.
This one is from AliExpress; it is out of stock at Mouser.
Using an SMD hot air gun I removed the suspicious regulator after some effort. The underside is quite difficult to desolder. Unfortunately, I also accidentally removed some of the nearby SMD capacitors. They are quite small, but I collected them and carefully stored them on sticky tape.
With the regulator removed, I reinstalled the heat sink, the RAM and the fans and tried to boot the node again. Surprisingly, after the BMC loaded, it started to load the BIOS.
So I turned the node off and connected a display to see the BIOS messages. This is what I observed after trying again:
It looks like the BIOS passes and attempts to boot from PXE (all the PCI cables and disks are disconnected, so it's unable to boot). So at least the power supply is no longer latched in overload mode.
I will try to order some replacements and solder the pieces back together. We should be able to get it working properly.
Although some parts remain on AliExpress and eBay, the IR3551M is no longer manufactured, but there are alternatives like the SIC634CD.
Apparently, the Vishay part is the only replacement I can find that is still manufactured, after Infineon bought International Rectifier back in 2014.
However, the pad layout designed by Vishay is mirrored as reported by the datasheet:
Compared with the IR3551M layout:
Which makes them incompatible.
The remaining option is to get a Chinese clone of the IR3551M from Aliexpress.
Email sent to Vishay:
No reply for one week. Let's try Aliexpress then.
Opened resource petition.
Petition accepted: https://webapps.bsc.es/resource-petitions/normal-petition/18472
Ordered IR3551M on AliExpress, estimated to arrive on 15th of September.
mentioned in issue #11
https://parcelsapp.com/es/tracking/cnes00627404929
Arrived
So, I soldered the new regulator back in the slot along with some of the SMD components that were lost:
The big capacitor below (C3J11) is twisted a bit, but is properly connected 😅
However, I'm not confident that the pads of the regulator were properly connected. In particular, some of the small pads seem to have disappeared, so I don't think this regulator will work in this socket.
The resistor R3J14 seems to be 0.5 ohms, and I couldn't solder it back, as it is too small. Same with the capacitor C3J7.
There is another empty socket on the right where the regulator could be moved, but resoldering the small SMD parts is very hard and I don't have the proper tools to do it (yet).
Nevertheless, the node boots fine and is capable of using both CPUs. However, the power limit of this CPU is probably lower than usual, so we shouldn't stress it. As this node will be used for storage, it should work okay.
The node is installed back in the cluster and is ready to be configured for Ceph.
I removed two DIMM modules from the bad CPU, so we can donate them to replace the failing module of eudy. Here they are:
mentioned in issue #39
The power overload is fixed, but the fans now spin at a high speed (see #39).
Note that the power supply was replaced too:
Closing this one.