jungle/doc/2025-05-maintenance-purchase.md
Rodrigo Arias Mallo e8d7ae345d Maintenance purchase for 2025-05
List of components we would need to buy.
2025-05-07 14:14:58 +02:00

Maintenance purchase 2025-05

We need to buy some components to replace broken parts or to have spare ones for when they break. We also need some tools to do basic repairs.

Here is the list:

Total: 1324.96 €

Rationale

Below is the search procedure I followed to come up with that list.

Power supplies

They are the first components to fail. We already have some problems with the monitoring of some power supplies. They will soon stop being manufactured, so we should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

hut% sudo ipmitool fru
...
FRU Device Description : Pwr Supply 1 FRU (ID 2)
 Product Manufacturer  : DELTA
 Product Name          : DPS-750XB A
 Product Part Number   : E98791-010
 Product Version       : 05
 Product Serial        : XXXXXXXXXXXXXXXXX

We only have one per node. We should make the power supply redundant so that a node can tolerate a failure without going down.
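
To check how many power supplies each node reports, the FRU output above could be parsed with something like the following. This is only a sketch: the `parse_fru` and `power_supplies` helpers are hypothetical names, and matching on the "Pwr Supply" text is an assumption based on the single sample shown here, so other boards may label the entry differently.

```python
def parse_fru(text):
    """Split `ipmitool fru` output into one dict of fields per FRU device."""
    devices = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip separators such as "..."
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            # A new device block starts here
            current = {"description": value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def power_supplies(devices):
    """Keep the FRU entries whose description mentions a power supply.

    Assumption: the description contains "Pwr Supply", as in the sample.
    """
    return [d for d in devices if "Pwr Supply" in d["description"]]


# Sample taken from the output above (serial number omitted)
sample = """\
FRU Device Description : Pwr Supply 1 FRU (ID 2)
 Product Manufacturer  : DELTA
 Product Name          : DPS-750XB A
 Product Part Number   : E98791-010
 Product Version       : 05
"""

for psu in power_supplies(parse_fru(sample)):
    print(psu["description"], psu["Product Name"], psu["Product Part Number"])
```

A node with a single populated slot should list one such entry, so a count below two would flag the nodes that still lack redundancy.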

They are available on Amazon, but they are very expensive (287.54 €):

https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT

On Aliexpress they are much cheaper (57.69 €):

https://es.aliexpress.com/item/1005004090017186.html

We have 11 nodes plus the login, but I'm not able to figure out which power supply the login is using.

The login uses another one, AXX1100PCRPS, and only has one slot populated. We may also want to buy another one, but I would need to reset the FRU and I don't have access to the login node, so I will leave this for Operations to deal with. We can live without the login if needed.
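
As a rough comparison of the two sellers, the arithmetic can be sketched as below. The quantity is an assumption on my side (one extra DPS-750XB A per Xeon node, login excluded); the unit prices are the ones quoted above, and shipping or taxes are not considered.

```python
from decimal import Decimal

NODES = 11                      # Xeon nodes, login excluded (assumed quantity)
AMAZON = Decimal("287.54")      # EUR per unit, as quoted above
ALIEXPRESS = Decimal("57.69")   # EUR per unit, as quoted above


def total(unit_price, units=NODES):
    """Total cost in EUR of buying `units` power supplies."""
    return unit_price * units


print(total(AMAZON))                      # 3162.94
print(total(ALIEXPRESS))                  # 634.59
print(total(AMAZON) - total(ALIEXPRESS))  # 2528.35
```

Decimal is used instead of float so the totals come out exact to the cent.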

RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some nodes we see non-recoverable errors that are no longer corrected by the ECC. We need to replace the bad modules.
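
One way to see these errors, assuming the nodes expose them through the Linux EDAC sysfs interface (I have not confirmed that this is what our monitoring reads), is to look at the per-controller counters. The sketch below reads `ce_count` (corrected) and `ue_count` (uncorrected) for every memory controller; the function names are hypothetical.

```python
from pathlib import Path

# Standard EDAC location; whether our nodes populate it is an assumption
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} from EDAC counters."""
    root = Path(root)
    if not root.is_dir():
        return {}  # EDAC not available on this machine
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        ce = mc / "ce_count"
        ue = mc / "ue_count"
        if ce.is_file() and ue.is_file():
            counts[mc.name] = (int(ce.read_text()), int(ue.read_text()))
    return counts


def controllers_with_uncorrected(counts):
    """Names of the controllers that reported at least one uncorrected error."""
    return [name for name, (_, ue) in counts.items() if ue > 0]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
    print("needs attention:", controllers_with_uncorrected(counts))
```

Controllers with a non-zero uncorrected count are the ones whose modules should be replaced first.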

Having two spare modules per node would be enough to cover most problems in the future.

16 GB, 2400 MHz RDIMM

The module from dmidecode:

Handle 0x0026, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0020
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 16 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2400 MT/s
    Manufacturer: Micron
    Serial Number: XXXXXXXX
    Asset Tag:
    Part Number: 36ASF2G72PZ-2G3B1
    Rank: 2
    Configured Memory Speed: 2400 MT/s
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Which is this module:

https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI

But they have only one in stock. Here are more details:

16GB PC4-19200 DDR4-2400MHz

The modules must have the following features:

  • 16 GB
  • DDR4
  • Speed at least 2400 MT/s
  • ECC
  • Registered
  • Best if from Micron
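
To filter candidate listings against that list, a small checker like the one below may help. It is a sketch with hypothetical names (`Dimm`, `meets_requirements`, `preferred`); the fields have to be filled in by hand from each seller's description.

```python
from dataclasses import dataclass


@dataclass
class Dimm:
    size_gb: int
    ddr_type: str
    speed_mts: int
    ecc: bool
    registered: bool
    manufacturer: str = ""


def meets_requirements(d):
    """Hard requirements: 16 GB, DDR4, at least 2400 MT/s, ECC, registered."""
    return (
        d.size_gb == 16
        and d.ddr_type == "DDR4"
        and d.speed_mts >= 2400
        and d.ecc
        and d.registered
    )


def preferred(d):
    """Soft preference: best if from Micron."""
    return d.manufacturer.lower() == "micron"


# The module currently installed, as reported by dmidecode above
current = Dimm(16, "DDR4", 2400, ecc=True, registered=True, manufacturer="Micron")
print(meets_requirements(current), preferred(current))
```

A faster module from another vendor would still pass `meets_requirements` and only fail the Micron preference.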

I would say having 8 spare modules would be enough for now, as only a few are currently failing. We could upgrade the modules later, as they are not at much risk of being discontinued, unlike the power supplies.

These may work:

It is cheaper to buy them in pairs, so let's use the last one.

Screwdriver set

To replace machine parts we need a set of screwdrivers. Instead of having to bring my own from home, I want to have one at BSC. These are enough and come in a nice box so I don't lose them:

https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S

Serial port adaptor

In order to debug problems with several components, we need to be able to connect to the serial port of the CPU. As we may deal with different voltages and pinouts, the most versatile option is an adaptor that lets us select the voltage and exposes a pin interface.

This one would do:

https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB

Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase its storage. We only have 270 GB available, so we would benefit from another disk. 2 TiB would be plenty. This one seems enough: