diff --git a/doc/2025-05-maintenance-purchase.md b/doc/2025-05-maintenance-purchase.md
new file mode 100644
index 00000000..e9d61894
--- /dev/null
+++ b/doc/2025-05-maintenance-purchase.md
@@ -0,0 +1,156 @@
+# Maintenance purchase 2025-05
+
+We need to buy some components to replace broken parts or to have spare ones for
+when they break. We also need some tools to do basic repairs.
+
+Here is the list:
+
+- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
+  - 57.69€/unit, 634.59€ total
+
+- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
+  - 128.85€/pair, 515.40€ total
+
+- 1 x Set of screwdrivers
+  - 23.99€
+
+- 1 x UART adaptor
+  - 14.99€
+
+- 1 x SSD SATA disk of 2 TB
+  - 135.99€
+
+Total: 1324.96 €
+
+# Rationale
+
+Below is the search procedure I followed to come up with that list.
+
+## Power supplies
+
+They are the first components to fail. We already have some problems with the
+monitoring of some power supplies. They will soon stop being manufactured, so we
+should increase our stock.
+
+Most Xeon nodes use the DELTA DPS-750XB A:
+
+    hut% sudo ipmitool fru
+    ...
+    FRU Device Description : Pwr Supply 1 FRU (ID 2)
+    Product Manufacturer : DELTA
+    Product Name : DPS-750XB A
+    Product Part Number : E98791-010
+    Product Version : 05
+    Product Serial : XXXXXXXXXXXXXXXXX
+
+And we only have one per node. We should make the power supply redundant so we
+can tolerate a failure without bringing down the node.
+
+They are available on Amazon, but they are very expensive (287.54 €):
+
+
+
+On Aliexpress they are much cheaper (57.69 €):
+
+
+
+We have 11 nodes plus the login, but I'm not able to figure out which power
+supply the login is using.
+
+The login uses another one, AXX1100PCRPS, and only has one slot populated. We
+may also want to buy another one, but I would need to reset the FRU and I don't
+have access to the login node. So I will leave this for Operations to deal with.
+We can live without the login if needed.
+
+## RAM DIMM
+
+The DIMM modules also experience errors, which are monitored by Linux. In some
+nodes we see non-recoverable errors that are no longer corrected by the ECC. We
+need to replace the bad modules.
+
+Having two spare modules per node would be enough to cover most problems in the
+future.
+
+
+> 16 GB, 2400 MHz RDIMM
+
+The module from dmidecode:
+
+    Handle 0x0026, DMI type 17, 40 bytes
+    Memory Device
+        Array Handle: 0x0020
+        Error Information Handle: Not Provided
+        Total Width: 72 bits
+        Data Width: 64 bits
+        Size: 16 GB
+        Form Factor: DIMM
+        Set: None
+        Locator: DIMM_B1
+        Bank Locator: NODE 1
+        Type: DDR4
+        Type Detail: Synchronous
+        Speed: 2400 MT/s
+        Manufacturer: Micron
+        Serial Number: XXXXXXXX
+        Asset Tag:
+        Part Number: 36ASF2G72PZ-2G3B1
+        Rank: 2
+        Configured Memory Speed: 2400 MT/s
+        Minimum Voltage: Unknown
+        Maximum Voltage: Unknown
+        Configured Voltage: Unknown
+
+Which is this module:
+
+
+
+But they have only one in stock. Here are more details:
+
+
+> 16GB PC4-19200 DDR4-2400MHz
+
+They must have the following features:
+
+- 16 GB
+- DDR4
+- Speed at least 2400 MT/s
+- ECC
+- Registered
+- Best if from Micron
+
+I would say having 8 spare modules would be enough for now, as we only have a
+few that are currently failing. We could upgrade the modules later, as they
+are unlikely to be discontinued soon, unlike the power supplies.
+
+These may work:
+
+- 1 x 16GB, 69.11€
+
+- 2 x 16GB, 128.85€
+
+It is cheaper to buy them in pairs, so let's go with the last one.
+
+## Screwdriver set
+
+To replace machine parts, we need a set of screwdrivers. Instead of having to
+bring my own from home, I want to have one at BSC. This set is enough and comes
+in a nice box so I don't lose them:
+
+
+
+## Serial port adaptor
+
+In order to debug problems with several components, we need to be able to
+connect to the serial port of the CPU.
+As we may deal with different voltages and pinouts, the most versatile option
+is an adaptor that lets us select the voltage and exposes a pin interface.
+
+This one would do:
+
+
+
+## Storage for raccoon
+
+Given that we are currently using raccoon for builds too, we need to
+increase its storage. We only have 270 GB available, so we can benefit
+from another disk. 2 TB would be plenty. This one seems enough:
+
+- 135.99€
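
As a cross-check of the figures in the purchase list at the top, here is a minimal Python sketch that recomputes the per-item subtotals and the grand total from the unit prices quoted in this document. The names `ITEMS`, `line_totals` and `grand_total` are only illustrative, and the sketch assumes the RAM is bought as 4 pairs (8 modules) at the per-pair price. `Decimal` is used so that the cents add up exactly.

```python
from decimal import Decimal

# (description, number of purchase units, price per purchase unit in EUR).
# Assumption: RAM is bought in pairs, so 8 modules are 4 purchase units.
ITEMS = [
    ("Power supply DELTA DPS-750XB A", 11, Decimal("57.69")),
    ("RAM DDR4 16 GB ECC Registered (pair)", 4, Decimal("128.85")),
    ("Set of screwdrivers", 1, Decimal("23.99")),
    ("UART adaptor", 1, Decimal("14.99")),
    ("SSD SATA disk of 2 TB", 1, Decimal("135.99")),
]


def line_totals(items):
    """Return a list of (description, subtotal) pairs."""
    return [(name, qty * price) for name, qty, price in items]


def grand_total(items):
    """Return the sum of all subtotals as a Decimal."""
    return sum((qty * price for _, qty, price in items), Decimal("0"))


if __name__ == "__main__":
    for name, subtotal in line_totals(ITEMS):
        print(f"{subtotal:>8} EUR  {name}")
    print(f"{grand_total(ITEMS):>8} EUR  total")
```

If a price or quantity changes, editing the corresponding entry in `ITEMS` and rerunning the script gives the updated total.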