Maintenance purchase 2025-05 #109

Closed
opened 2025-06-02 13:28:31 +02:00 by rarias · 2 comments
Owner

We need to buy some components to replace broken parts or to have spare ones for when they break. We also need some tools to do basic repairs.

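Here is the list:

  • 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical) - 57.69 €/unit, 634.59 € total
    https://es.aliexpress.com/item/1005004090017186.html
  • 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered - 128.85 €/pair, 515.40 € total
    https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF
  • 1 x Set of screwdrivers - 23.99 €
    https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S
  • 1 x UART adaptor - 14.99 €
    https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB
  • 1 x SSD SATA disk of 2 TB - 135.99 €
    https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT

Total: 1324.96 €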
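
The total can be cross-checked from the quantities and unit prices quoted in this issue. This is a minimal sketch: the `ITEMS` table and the function names are only illustrative, and the RAM is counted as 4 pairs because the 8 modules are priced per pair.

```python
from decimal import Decimal

# (description, quantity, unit price in EUR), as quoted in this issue.
# The 8 RAM modules are bought as 4 pairs, so the unit here is one pair.
ITEMS = [
    ("Power supply DELTA DPS-750XB A", 11, Decimal("57.69")),
    ("RAM 2 x 16 GB DDR4-2400 ECC Registered (pair)", 4, Decimal("128.85")),
    ("Set of screwdrivers", 1, Decimal("23.99")),
    ("UART adaptor", 1, Decimal("14.99")),
    ("SSD SATA disk of 2 TB", 1, Decimal("135.99")),
]


def line_total(quantity, unit_price):
    """Cost of one line of the order."""
    return quantity * unit_price


def order_total(items):
    """Sum of all line totals, kept as Decimal to avoid float rounding."""
    return sum((line_total(q, p) for _, q, p in items), Decimal("0"))


if __name__ == "__main__":
    for name, quantity, price in ITEMS:
        print(f"{quantity:>2} x {name}: {line_total(quantity, price)} EUR")
    print(f"Total: {order_total(ITEMS)} EUR")
```

Using `Decimal` instead of floats keeps the cents exact, which matters when the result has to match the petition amount.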

Rationale

Below is the search procedure I followed to come up with that list.

Power supplies

They are the first components to fail. We already have some problems with the monitoring of some power supplies. They will soon stop being manufactured, so we should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

hut% sudo ipmitool fru
...
FRU Device Description : Pwr Supply 1 FRU (ID 2)
 Product Manufacturer  : DELTA
 Product Name          : DPS-750XB A
 Product Part Number   : E98791-010
 Product Version       : 05
 Product Serial        : XXXXXXXXXXXXXXXXX

And we only have one per node. We should make the power supply redundant, so that a node can tolerate the failure of one supply without going down.
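
To repeat this check on every node, the `ipmitool fru` output can be parsed into one record per FRU device. The sketch below is only an illustration: the function names are made up, the sample is the output shown in this issue, and it assumes that power supply entries are described as "Pwr Supply ...", which may differ on other boards.

```python
# Sample taken from the output above (serial number redacted).
SAMPLE = """\
FRU Device Description : Pwr Supply 1 FRU (ID 2)
 Product Manufacturer  : DELTA
 Product Name          : DPS-750XB A
 Product Part Number   : E98791-010
 Product Version       : 05
 Product Serial        : XXXXXXXXXXXXXXXXX
"""


def parse_fru(text):
    """Split `ipmitool fru` output into one dict per FRU device."""
    devices = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "key : value" pair
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            # A new device starts at every description line.
            current = {"description": value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def power_supplies(devices):
    """Keep only the entries that describe a power supply (assumed prefix)."""
    return [d for d in devices if d["description"].startswith("Pwr Supply")]


if __name__ == "__main__":
    for psu in power_supplies(parse_fru(SAMPLE)):
        print(psu["description"], "->", psu.get("Product Name"))
```

Counting the entries returned by `power_supplies` per node would show which nodes have only one supply reported.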

They are available on Amazon, but they are very expensive (287.54 €):

https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT

On Aliexpress they are much cheaper (57.69 €):

https://es.aliexpress.com/item/1005004090017186.html

We have 11 nodes plus the login, but I'm not able to figure out which power supply the login is using.

The login uses another one, AXX1100PCRPS, and only has one slot populated. We may also want to buy another one, but I would need to reset the FRU and I don't have access to the login node. So I will leave this for Operations to deal with. We can live without the login if needed.

RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some nodes we see non-recoverable errors that are no longer corrected by the ECC. We need to replace the bad modules.
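
One way to see these errors is through the Linux EDAC sysfs counters. The sketch below assumes the EDAC driver is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) per memory controller under `/sys/devices/system/edac/mc`; whether our monitoring reads exactly these files is an assumption, and the function names are only illustrative.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Read corrected/uncorrected error counters per memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        ce = mc / "ce_count"
        ue = mc / "ue_count"
        if ce.is_file() and ue.is_file():
            counts[mc.name] = {
                "corrected": int(ce.read_text().strip()),
                "uncorrected": int(ue.read_text().strip()),
            }
    return counts


def controllers_with_uncorrected(counts):
    """Names of the controllers that reported at least one uncorrected error."""
    return [name for name, c in counts.items() if c["uncorrected"] > 0]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, c in counts.items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
    print("Needs attention:", controllers_with_uncorrected(counts))
```

A controller with a non-zero uncorrected count is the kind of case described above, where ECC can no longer fix the errors.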

Having two spare modules per node would be enough to cover most problems in the future.

16 GB, 2400 MHz RDIMM

The module from dmidecode:

Handle 0x0026, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0020
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 16 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2400 MT/s
    Manufacturer: Micron
    Serial Number: XXXXXXXX
    Asset Tag:
    Part Number: 36ASF2G72PZ-2G3B1
    Rank: 2
    Configured Memory Speed: 2400 MT/s
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Which is this module:

https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI

But they have only one in stock. Here are more details:

16GB PC4-19200 DDR4-2400MHz

They must have the following features:

  • 16 GB
  • DDR4
  • Speed at least 2400 MT/s
  • ECC
  • Registered
  • Best if from Micron
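
Most of these requirements can be checked against a `dmidecode` memory device entry. The sketch below is only an illustration with made-up function names: it infers ECC from a total width larger than the data width (72 vs 64 bits), and it does not check "Registered", because the output shown above only reports "Synchronous" in Type Detail.

```python
import re

# Shortened sample of the dmidecode entry shown above.
SAMPLE = """\
Handle 0x0026, DMI type 17, 40 bytes
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 16 GB
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2400 MT/s
    Manufacturer: Micron
    Part Number: 36ASF2G72PZ-2G3B1
"""


def parse_memory_device(text):
    """Turn the "Key: value" lines of one Memory Device entry into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Return the number at the start of a value such as "2400 MT/s"."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


def check_module(fields):
    """Check the requirements that can be read from this entry."""
    total = leading_int(fields.get("Total Width", ""))
    data = leading_int(fields.get("Data Width", ""))
    speed = leading_int(fields.get("Speed", ""))
    return {
        "16 GB": fields.get("Size") == "16 GB",
        "DDR4": fields.get("Type") == "DDR4",
        "speed >= 2400 MT/s": speed is not None and speed >= 2400,
        # Extra width beyond the data bits is taken as a sign of ECC.
        "ECC": total is not None and data is not None and total > data,
    }


if __name__ == "__main__":
    for name, ok in check_module(parse_memory_device(SAMPLE)).items():
        print(f"{name}: {'ok' if ok else 'FAIL'}")
```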

I would say having 8 spare modules would be enough for now, as we only have a few that are currently failing. We could upgrade the modules later, as they are not at much risk of being discontinued, unlike the power supplies.

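These may work:

  • 1 x 16GB, 69.11 €
    https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N
  • 2 x 16GB, 128.85 €
    https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF

It is cheaper to buy them in pairs, so let's use the last one.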
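
The difference for the 8 modules can be worked out from the two prices in this issue (69.11 € for a single module, 128.85 € for a pair). The names below are only illustrative.

```python
from decimal import Decimal

SINGLE_PRICE = Decimal("69.11")  # 1 x 16 GB, price quoted in this issue
PAIR_PRICE = Decimal("128.85")   # 2 x 16 GB, price quoted in this issue


def cost_as_singles(modules):
    """Cost when every module is bought individually."""
    return modules * SINGLE_PRICE


def cost_as_pairs(modules):
    """Cost when buying pairs, plus one single module if the count is odd."""
    pairs, rest = divmod(modules, 2)
    return pairs * PAIR_PRICE + rest * SINGLE_PRICE


if __name__ == "__main__":
    modules = 8
    singles = cost_as_singles(modules)
    pairs = cost_as_pairs(modules)
    print(f"singles: {singles} EUR, pairs: {pairs} EUR, saved: {singles - pairs} EUR")
```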

Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers. Instead of having to bring my own from home, I want to have one at BSC. These are enough and come in a nice box so I don't lose them:

https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S

Serial port adaptor

In order to debug problems with several components, we need to be able to connect to the serial port of the CPU. As we may deal with different voltages and pinouts, the most versatile option is an adaptor that lets us select the voltage and exposes a pin interface.

This one would do:

https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB

Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase its storage. We only have 270 GB available, so we would benefit from another disk. 2 TB would be plenty. This one seems enough:

  • 135.99 €
    https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT

Author
Owner

See https://webapps.bsc.es/resource-petitions/normal-petition/27483

RAM and power supplies cannot be purchased directly, but need a provider.

Approved on 2025-05-12 13:49

Author
Owner

All components arrived in good condition, except the power supplies, which are damaged. I think I can fix the bent pins myself, but one of them has the chassis bent as well. Probably fixable, but what a shitty provider.

They seem to come from Intel servers, so they should be ok. I'm testing them one by one in tent, but it will take a while, as I need to wait for it to switch to the new power supply.

Reference: rarias/jungle#109