# Maintenance purchase 2025-05

We need to buy some components to replace broken parts and to keep spares for when others break. We also need some tools to do basic repairs. Here is the list:

- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical) - 57.69€/unit, 634.59€ total
- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered - 128.85€/pair, 515.40€ total
- 1 x Set of screwdrivers - 23.99€
- 1 x UART adaptor - 14.99€
- 1 x SSD SATA disk of 2 TB - 135.99€

Total: 1324.96 €

# Rationale

Below is the search procedure I followed to come up with that list.

## Power supplies

They are the first components to fail. We already have some problems with the monitoring of some power supplies. They will soon stop being manufactured, so we should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
     Product Manufacturer  : DELTA
     Product Name          : DPS-750XB A
     Product Part Number   : E98791-010
     Product Version       : 05
     Product Serial        : XXXXXXXXXXXXXXXXX

And we only have one per node. We should make the power supply redundant, so that a node can tolerate one failing without going down.

They are available on Amazon, but they are very expensive (287.54 €). On Aliexpress they are much cheaper (57.69 €).

We have 11 nodes plus the login node. The login node uses a different model, AXX1100PCRPS, and only has one slot populated. We may also want to get another one for it, but I would need to reset the FRU and I don't have access to the login node, so I will leave this for Operations to deal with. We can live without the login node if needed.

## RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some nodes we see uncorrectable errors, which ECC can no longer fix. We need to replace the bad modules. Having two spare modules per node would be enough to cover most problems in the future.
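As a rough illustration of how those error counts can be read, here is a minimal sketch. It assumes the nodes expose the kernel EDAC counters under `/sys/devices/system/edac/mc/mc*/` (`ce_count` for corrected and `ue_count` for uncorrected errors), which requires the platform's EDAC driver to be loaded. The function names are made up for this example, and this is not necessarily how our monitoring collects them.

```python
from pathlib import Path

# Default location of the EDAC memory-controller counters (assumption:
# the EDAC driver for the platform is loaded on the node).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} read from sysfs."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # No EDAC support visible on this machine.
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # Skip controllers without readable counters.
        counts[mc.name] = (ce, ue)
    return counts


def needs_replacement(counts):
    """List controllers that reported at least one uncorrected error."""
    return [name for name, (_, ue) in counts.items() if ue > 0]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
    print("uncorrected errors on:", needs_replacement(counts) or "none")
```

The counters are per memory controller, so a non-zero `ue_count` only narrows the search; the failing slot still has to be identified before swapping a module.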
> 16 GB, 2400 MHz RDIMM

The module from dmidecode:

    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
        Array Handle: 0x0020
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_B1
        Bank Locator: NODE 1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: XXXXXXXX
        Asset Tag:
        Part Number: 36ASF2G72PZ-2G3B1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Which is this module:

But they have only one in stock. Here are more details:

> 16GB PC4-19200 DDR4-2400MHz

They must have the following features:

- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron

I would say having 8 spare modules would be enough for now, as only a few are currently failing. We could upgrade the modules later, as they are at much less risk of being discontinued than the power supplies.

These may work:

- 1 x 16GB, 69.11€
- 2 x 16GB, 128.85€

It is cheaper to buy them in pairs, so let's go with the second option.

## Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers. Instead of having to bring my own from home, I want to have one at BSC.

These are enough and come in a nice box so I don't lose them:

## Serial port adaptor

In order to debug problems with several components, we need to be able to connect to the serial port of the CPU. As we may deal with different voltages and pinouts, the most versatile option is an adaptor that lets us select the voltage and exposes a pin interface.

This one would do:

## Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase its storage. We only have 270 GB available, so we can benefit from another disk. 2 TB would be plenty.

This one seems enough:

- 135.99€
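The available space quoted above can be re-checked at any time with a few lines of Python. This is only a sketch: the `/` path is a placeholder for wherever the builds are actually stored on raccoon, and the helper names are made up for this example. The second helper only adds raw capacities, ignoring filesystem overhead and how the new disk ends up being mounted.

```python
import shutil

GB = 10**9  # Decimal gigabyte, the unit disk vendors use.


def free_space_gb(path="/"):
    """Return (free, total) space in GB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / GB, usage.total / GB


def free_after_adding(free_gb, new_disk_tb):
    """Raw free space in GB once an empty disk of `new_disk_tb` TB is added."""
    return free_gb + new_disk_tb * 1000


if __name__ == "__main__":
    # "/" is a placeholder; point it at the filesystem used for builds.
    free, total = free_space_gb("/")
    print(f"free: {free:.0f} GB of {total:.0f} GB")
    print(f"with a 2 TB disk: about {free_after_adding(free, 2):.0f} GB free")
```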