Maintenance purchase for 2025-05

List of components we would need to buy.
Rodrigo Arias 2025-05-07 14:14:58 +02:00
parent 82fc3209de
commit e8d7ae345d

@ -0,0 +1,156 @@
# Maintenance purchase 2025-05
We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.
Here is the list:
- 11 x Power supply DELTA DPS-750XB A (750 W) (this is critical)
- 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>
- 8 x RAM 16 GB DDR4 2400MHz PC4-19200 ECC Registered
- 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
- 1 x Set of screwdrivers
- 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
- 1 x UART adaptor
- 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
- 1 x SSD SATA disk of 2 TB
- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>
Total: 1324.96 €
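
The total above can be re-checked with a short script. This is only a sketch: the item names are shortened labels, and the RAM is entered as 4 pairs because the chosen listing is priced per pair.

```python
from decimal import Decimal

# (label, number of purchasable units, price per purchasable unit in EUR)
# The 8 RAM modules are bought as 4 pairs at 128.85 EUR per pair.
ITEMS = [
    ("Power supply DELTA DPS-750XB A", 11, Decimal("57.69")),
    ("RAM DDR4 16 GB (pair)", 4, Decimal("128.85")),
    ("Set of screwdrivers", 1, Decimal("23.99")),
    ("UART adaptor", 1, Decimal("14.99")),
    ("SSD SATA 2 TB", 1, Decimal("135.99")),
]


def line_totals(items):
    """Return (label, quantity * unit price) for every item."""
    return [(name, qty * price) for name, qty, price in items]


def grand_total(items):
    """Sum all line totals using exact decimal arithmetic."""
    return sum((qty * price for _, qty, price in items), Decimal("0"))


if __name__ == "__main__":
    for name, total in line_totals(ITEMS):
        print(f"{total} EUR  {name}")
    print(f"{grand_total(ITEMS)} EUR  total")
```

`Decimal` is used instead of floats so the cents add up exactly.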
# Rationale
Below is the search procedure I followed to come up with that list.
## Power supplies
They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.
Most Xeon nodes use the DELTA DPS-750XB A:

    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
    Product Manufacturer : DELTA
    Product Name : DPS-750XB A
    Product Part Number : E98791-010
    Product Version : 05
    Product Serial : XXXXXXXXXXXXXXXXX

We only have one per node. We should make the power supply redundant so that a
node can tolerate the failure of one unit without going down.
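
To repeat this check on every node, the `ipmitool fru` text can be parsed with a few lines of Python. This is a minimal sketch that assumes the output keeps the `key : value` layout shown above and that power supply entries contain "Pwr Supply" in their description. The function names are hypothetical and `SAMPLE` is the excerpt from this document.

```python
def parse_fru(text):
    """Split `ipmitool fru` text output into one dict per FRU device."""
    devices = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "key : value"
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            # Each device section starts with its description line.
            current = {key: value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def power_supplies(devices):
    """Return (description, manufacturer, product name) for PSU entries."""
    return [
        (d["FRU Device Description"],
         d.get("Product Manufacturer"),
         d.get("Product Name"))
        for d in devices
        if "Pwr Supply" in d["FRU Device Description"]
    ]


# Excerpt copied from the output above.
SAMPLE = """\
FRU Device Description : Pwr Supply 1 FRU (ID 2)
Product Manufacturer : DELTA
Product Name : DPS-750XB A
Product Part Number : E98791-010
Product Version : 05
Product Serial : XXXXXXXXXXXXXXXXX
"""

if __name__ == "__main__":
    for desc, maker, model in power_supplies(parse_fru(SAMPLE)):
        print(f"{desc}: {maker} {model}")
```

A node with a redundant supply would be expected to list two such entries, which makes missing units easy to spot.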
They are available on Amazon, but they are very expensive (287.54 €):
<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>
On Aliexpress they are much cheaper (57.69 €):
<https://es.aliexpress.com/item/1005004090017186.html>
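
As a rough comparison for the 11 units we need, using the two listed prices (shipping and taxes are not considered here, and the names are only illustrative):

```python
from decimal import Decimal

UNITS = 11                      # one spare per Xeon node
AMAZON = Decimal("287.54")      # EUR per unit
ALIEXPRESS = Decimal("57.69")   # EUR per unit


def order_cost(unit_price, units=UNITS):
    """Cost of buying `units` power supplies at `unit_price`."""
    return unit_price * units


def savings(expensive, cheap, units=UNITS):
    """Difference between the two order costs."""
    return order_cost(expensive, units) - order_cost(cheap, units)


if __name__ == "__main__":
    print("Amazon:    ", order_cost(AMAZON), "EUR")
    print("AliExpress:", order_cost(ALIEXPRESS), "EUR")
    print("Difference:", savings(AMAZON, ALIEXPRESS), "EUR")
```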
We have 11 nodes plus the login node. The login node uses a different power
supply, AXX1100PCRPS, and only has one slot populated. We may also want to buy
another one for it, but I would need to reset the FRU and I don't have access
to the login node, so I will leave this for Operations to deal with. We can
live without the login node if needed.
## RAM DIMM
The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.
Having two spare modules per node would be enough to cover most problems in the
future.
> 16 GB, 2400 MHz RDIMM
The module from dmidecode:

    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
    Array Handle: 0x0020
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 16 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2400 MT/s
    Manufacturer: Micron
    Serial Number: XXXXXXXX
    Asset Tag:
    Part Number: 36ASF2G72PZ-2G3B1
    Rank: 2
    Configured Memory Speed: 2400 MT/s
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Which is this module:
<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>
But they have only one in stock. Here are more details:
> 16GB PC4-19200 DDR4-2400MHz
The modules must have the following features:
- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron
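
Most of these requirements can be checked against a dmidecode "Memory Device" record. The sketch below makes two assumptions: the record uses the `key: value` layout shown earlier, and ECC is inferred from the total width (72 bits) being larger than the data width (64 bits). The "Registered" requirement is not checked, because the record above only reports "Synchronous" as type detail. All function names are hypothetical.

```python
import re


def parse_memory_device(text):
    """Turn one dmidecode Memory Device record into a dict of fields."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Return the integer at the start of e.g. '2400 MT/s', or None."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


def check_module(fields):
    """Map each requirement to True/False for the given record."""
    total = leading_int(fields.get("Total Width", ""))
    data = leading_int(fields.get("Data Width", ""))
    speed = leading_int(fields.get("Speed", ""))
    return {
        "16 GB": fields.get("Size") == "16 GB",
        "DDR4": fields.get("Type") == "DDR4",
        "speed >= 2400 MT/s": speed is not None and speed >= 2400,
        # The extra bits beyond the data width carry the ECC check bits.
        "ECC": total is not None and data is not None and total > data,
    }


# Subset of the record shown earlier in this section.
SAMPLE = """\
Total Width: 72 bits
Data Width: 64 bits
Size: 16 GB
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Micron
"""

if __name__ == "__main__":
    for requirement, ok in check_module(parse_memory_device(SAMPLE)).items():
        print(f"{'ok  ' if ok else 'FAIL'} {requirement}")
```

Running the same check on a replacement module after installing it would confirm that it matches the ones already in the nodes.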
I would say having 8 spare modules would be enough for now, as only a few are
currently failing. We could buy more modules later, as they are less likely
than the power supplies to go out of production.
These may work:
- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>
- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
It is cheaper to buy them in pairs, so let's go with the second option.
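
The difference for the 8 modules can be worked out as follows. This is a small sketch using the two listed prices; the helper names are only illustrative.

```python
from decimal import Decimal

SINGLE = Decimal("69.11")   # EUR for 1 x 16 GB
PAIR = Decimal("128.85")    # EUR for 2 x 16 GB


def cost_in_singles(modules):
    """Cost when every module is bought individually."""
    return modules * SINGLE


def cost_in_pairs(modules):
    """Cost when buying pairs, plus one single module if the count is odd."""
    pairs, odd = divmod(modules, 2)
    return pairs * PAIR + odd * SINGLE


if __name__ == "__main__":
    n = 8
    print("Singles:", cost_in_singles(n), "EUR")
    print("Pairs:  ", cost_in_pairs(n), "EUR")
    print("Saved:  ", cost_in_singles(n) - cost_in_pairs(n), "EUR")
```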
## Screwdriver set
In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:
<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
## Serial port adaptor
In order to debug problems with several components, we need to be able to
connect to the serial port of the CPU. As we may deal with different voltages
and pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.
This one would do:
<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
## Storage for raccoon
Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we would benefit from another
disk. A 2 TB disk would be plenty. This one seems enough:
- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>