Compare commits
1 Commits
maintenanc
...
monitor-ni
| Author | SHA1 | Date | |
|---|---|---|---|
| 13e084f34f |
@@ -1,156 +0,0 @@
# Maintenance purchase 2025-05

We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.

Here is the list:

- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
  - 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>

- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
  - 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

- 1 x Set of screwdrivers
  - 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

- 1 x UART adaptor
  - 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

- 1 x SSD SATA disk of 2 TB
  - 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

Total: 1324.96 €
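
As a quick sanity check of the total, the line items can be summed in integer euro cents, which avoids floating-point rounding. This is only a sketch: the variable names are made up for illustration, and the prices are the ones listed above.

```sh
#!/bin/sh
# Prices in euro cents, taken from the list above.
psu=$((11 * 5769))      # 11 power supplies at 57.69 EUR each
ram=$((4 * 12885))      # 8 DIMMs bought as 4 pairs at 128.85 EUR per pair
tools=$((2399 + 1499))  # screwdriver set + UART adaptor
ssd=13599               # 2 TB SATA SSD

total=$((psu + ram + tools + ssd))
printf 'Total: %d.%02d EUR\n' $((total / 100)) $((total % 100))
```

The sum comes out at 132496 cents, which matches the 1324.96 € stated above.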

# Rationale

Below is the search procedure I followed to come up with that list.

## Power supplies

They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
    Product Manufacturer : DELTA
    Product Name : DPS-750XB A
    Product Part Number : E98791-010
    Product Version : 05
    Product Serial : XXXXXXXXXXXXXXXXX

And we only have one per node. We should make the power supply redundant so we
can tolerate a failure without bringing down the node.

They are available on Amazon, but they are very expensive (287.54 €):

<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>

On AliExpress they are much cheaper (57.69 €):

<https://es.aliexpress.com/item/1005004090017186.html>

We have 11 nodes plus the login, but I'm not able to figure out which power
supply the login is using.

The login uses another one, AXX1100PCRPS, and only has one slot populated. We
may also want to buy another one, but I would need to reset the FRU and I don't
have access to the login node. So I will leave this for Operations to deal with.
We can live without the login if needed.

## RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.

Having two spare modules per node would be enough to cover most problems in the
future.

> 16 GB, 2400 MHz RDIMM

The module from dmidecode:

    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
        Array Handle: 0x0020
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_B1
        Bank Locator: NODE 1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: XXXXXXXX
        Asset Tag:
        Part Number: 36ASF2G72PZ-2G3B1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

This corresponds to the following module:

<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>

But they have only one in stock. Here are more details:

> 16GB PC4-19200 DDR4-2400MHz

They must have the following features:

- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron
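
The checklist above can be partly automated against `dmidecode` output. The sketch below is an illustration only: `check_dimms` is a hypothetical helper, it reads `dmidecode --type 17` style text on stdin, and it treats a total width 8 bits larger than the data width (72 vs 64 in the dump above) as a sign of ECC, which is an assumption. It does not check the "Registered" requirement, because the dump above only reports `Type Detail: Synchronous`.

```sh
#!/bin/sh
# check_dimms: hypothetical helper, not part of dmidecode or any other tool.
# Reads `dmidecode --type 17` style text on stdin and prints one line per
# memory device: the locator followed by "ok" or "mismatch".
check_dimms() {
    awk -F': *' '
        function report() {
            if (loc == "") return
            # Assumption: 8 extra bits of total width indicate ECC.
            ok = (size == "16 GB" && type == "DDR4" && speed >= 2400 && tw - dw == 8)
            print loc, (ok ? "ok" : "mismatch")
            loc = ""; size = ""; type = ""; speed = 0; tw = 0; dw = 0
        }
        /^[ \t]*Memory Device/ { report() }        # a new record starts
        /^[ \t]*Total Width:/  { tw = $2 + 0 }
        /^[ \t]*Data Width:/   { dw = $2 + 0 }
        /^[ \t]*Size:/         { size = $2 }
        /^[ \t]*Locator:/      { loc = $2 }        # "Bank Locator:" does not match
        /^[ \t]*Type:/         { type = $2 }       # "Type Detail:" does not match
        /^[ \t]*Speed:/        { speed = $2 + 0 }  # "2400 MT/s" becomes 2400
        END { report() }                           # flush the last record
    '
}
```

On a node it could be fed with `sudo dmidecode --type 17 | check_dimms`. Empty slots would be reported as "mismatch", since they have no size or type.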

I would say having 8 spare modules would be enough for now, as we only have a
few that are currently failing. We could upgrade the modules later as, unlike
the power supplies, they are at little risk of being discontinued.

These may work:

- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>

- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

It is cheaper to buy them in pairs, so let's go with the second option.

## Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:

<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

## Serial port adaptor

In order to debug problems with several components, we need to be able to
connect to the serial port of the CPU. As we may deal with different voltages
and pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.

This one would do:

<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

## Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we would benefit from another
disk. 2 TB would be plenty. This one seems sufficient:

- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>
@@ -4,6 +4,7 @@
   imports = [
     ../module/slurm-exporter.nix
     ./gpfs-probe.nix
+    ./nix-daemon-exporter.nix
   ];

   age.secrets.grafanaJungleRobotPassword = {
@@ -108,6 +109,7 @@
           "127.0.0.1:${toString config.services.prometheus.exporters.smartctl.port}"
           "127.0.0.1:9341" # Slurm exporter
           "127.0.0.1:9966" # GPFS custom exporter
+          "127.0.0.1:9999" # Nix-daemon custom exporter
           "127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}"
         ];
       }];
26
m/hut/nix-daemon-builds.sh
Executable file
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+# Locate nix daemon pid
+nd=$(pgrep -o nix-daemon)
+
+# Locate children of nix-daemon
+pids1=$(tr ' ' '\n' < "/proc/$nd/task/$nd/children")
+
+# For each child, locate 2nd level children
+pids2=$(echo "$pids1" | xargs -I @ /bin/sh -c 'cat /proc/@/task/*/children' | tr ' ' '\n')
+
+cat <<EOF
+HTTP/1.1 200 OK
+Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=values
+
+# HELP nix_daemon_build Nix daemon derivation build state.
+# TYPE nix_daemon_build gauge
+EOF
+
+for pid in $pids2; do
+  name=$(cat /proc/$pid/environ 2>/dev/null | tr '\0' '\n' | rg "^name=(.+)" - --replace '$1')
+  user=$(ps -o uname= -p "$pid")
+  if [ -n "$name" -a -n "$user" ]; then
+    printf 'nix_daemon_build{user="%s",name="%s"} 1\n' "$user" "$name"
+  fi
+done
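
The `name` lookup in the loop above relies on `/proc/<pid>/environ` being a NUL-separated list of `KEY=value` entries. As an illustration of that step in isolation, the same extraction can be sketched with only `tr` and `sed`. This is an alternative shown for clarity, not what the commit uses, and `env_name` is a hypothetical function name.

```sh
#!/bin/sh
# env_name FILE: print the non-empty value of the `name` variable found in a
# NUL-separated environment dump such as /proc/<pid>/environ.
# Prints nothing if the file is unreadable or has no such variable.
env_name() {
    [ -r "$1" ] || return 0
    # Turn NUL separators into newlines, then keep only the value of name=...
    tr '\0' '\n' < "$1" | sed -n 's/^name=\(..*\)$/\1/p'
}
```

Like the ripgrep pattern `^name=(.+)`, the `sed` expression only matches when the value is non-empty, so a bare `name=` entry yields no output.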
23
m/hut/nix-daemon-exporter.nix
Normal file
@@ -0,0 +1,23 @@
+{ pkgs, config, lib, ... }:
+let
+  script = pkgs.runCommand "nix-daemon-exporter.sh" { }
+  ''
+    cp ${./nix-daemon-builds.sh} $out;
+    chmod +x $out
+  ''
+  ;
+in
+{
+  systemd.services.nix-daemon-exporter = {
+    description = "Daemon to export nix-daemon metrics";
+    path = [ pkgs.procps pkgs.ripgrep ];
+    wantedBy = [ "default.target" ];
+    serviceConfig = {
+      Type = "simple";
+      ExecStart = "${pkgs.socat}/bin/socat TCP4-LISTEN:9999,fork EXEC:${script}";
+      # Needs root to read the environment, potentially unsafe
+      User = "root";
+      Group = "root";
+    };
+  };
+}
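
With this service in place, each in-progress build is exported as one `nix_daemon_build{user="...",name="..."} 1` sample on port 9999. As a rough sketch of how that text could be consumed outside Prometheus, the function below counts builds per user from exposition text on stdin. `builds_per_user` is a hypothetical name, and it assumes user names contain no double quotes.

```sh
#!/bin/sh
# builds_per_user: hypothetical helper. Reads the exporter's text output on
# stdin and prints "<user> <count>" per user, sorted by user name.
builds_per_user() {
    awk -F'"' '
        # Only sample lines start with the metric name followed by "{";
        # HTTP headers and "# HELP" / "# TYPE" comments are skipped.
        index($0, "nix_daemon_build{") == 1 { n[$2]++ }  # $2 is the user label
        END { for (u in n) print u, n[u] }
    ' | sort
}
```

The input could come from any HTTP client pointed at `127.0.0.1:9999`, the address added to the scrape targets above.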