Compare commits

1 commit: 7a2f37aaa2 "maintenanc..." (author e8d7ae345d)

doc/2025-05-maintenance-purchase.md (new file, 156 lines)

@@ -0,0 +1,156 @@
# Maintenance purchase 2025-05

We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.

Here is the list:

- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
  - 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>
- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
  - 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
- 1 x Set of screwdrivers
  - 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
- 1 x UART adaptor
  - 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
- 1 x SSD SATA disk of 2 TB
  - 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

Total: 1324.96 €
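The line items above should sum to the stated total; a quick sketch to double-check the arithmetic (item labels are mine, prices copied from the list):

```python
# Cross-check the purchase list against the stated total of 1324.96 €.
items = {
    "power supplies": 11 * 57.69,  # 11 units at 57.69 €
    "RAM (4 pairs)": 4 * 128.85,   # 8 modules bought as 4 pairs
    "screwdrivers": 23.99,
    "UART adaptor": 14.99,
    "SSD 2 TB": 135.99,
}
total = sum(items.values())
print(round(total, 2))  # 1324.96
```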
# Rationale

Below is the search procedure I followed to come up with that list.

## Power supplies

They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:
    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
     Product Manufacturer  : DELTA
     Product Name          : DPS-750XB A
     Product Part Number   : E98791-010
     Product Version       : 05
     Product Serial        : XXXXXXXXXXXXXXXXX
And we only have one per node. We should make the power supply redundant so we
can tolerate a failure without bringing down the node.

They are available on Amazon, but they are very expensive (287.54 €):

<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>

On Aliexpress they are much cheaper (57.69 €):

<https://es.aliexpress.com/item/1005004090017186.html>

We have 11 nodes plus the login. The login uses another model, the AXX1100PCRPS,
and only has one slot populated. We may want to also buy another one, but I
would need to reset the FRU and I don't have access to the login node, so I will
leave this for Operations to deal with. We can live without the login if needed.

## RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.

Having two spare modules per node would be enough to cover most problems in the
future.

> 16 GB, 2400 MHz RDIMM

The module from dmidecode:
    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0020
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16 GB
            Form Factor: DIMM
            Set: None
            Locator: DIMM_B1
            Bank Locator: NODE 1
            Type: DDR4
            Type Detail: Synchronous
            Speed: 2400 MT/s
            Manufacturer: Micron
            Serial Number: XXXXXXXX
            Asset Tag:
            Part Number: 36ASF2G72PZ-2G3B1
            Rank: 2
            Configured Memory Speed: 2400 MT/s
            Minimum Voltage: Unknown
            Maximum Voltage: Unknown
            Configured Voltage: Unknown

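The fields above can also be pulled out programmatically when checking many nodes; a minimal sketch (assuming `dmidecode -t 17`-style `Key: Value` lines; the helper name is mine):

```python
def parse_dmi_device(text):
    """Parse one 'Memory Device' block of dmidecode -t 17 output
    into a dict of 'Key: Value' fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields

# Shortened sample block for illustration.
sample = """\
Memory Device
\tSize: 16 GB
\tType: DDR4
\tSpeed: 2400 MT/s
\tPart Number: 36ASF2G72PZ-2G3B1
"""
info = parse_dmi_device(sample)
print(info["Part Number"])  # 36ASF2G72PZ-2G3B1
```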
Which is this module:

<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>

But they have only one in stock. Here are more details:

> 16GB PC4-19200 DDR4-2400MHz

They must have the following features:

- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron

I would say having 8 spare modules would be enough for now, as we only have a
few that are currently failing. We could upgrade the modules later, as they
don't run the same risk of being discontinued as the power supplies.

These may work:

- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>
- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

It is cheaper to buy them by pairs, so let's use the last one.

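A quick check of the per-module prices behind that choice (a small sketch; values copied from the two listings above):

```python
single = 69.11  # price of the 1 x 16GB listing, per module
pair = 128.85   # price of the 2 x 16GB listing, for two modules

per_module_in_pair = pair / 2     # cheaper per module than 69.11 €
cost_pairs = round(4 * pair, 2)   # 8 modules bought as 4 pairs
cost_single = round(8 * single, 2)

print(cost_pairs, cost_single)  # 515.4 552.88
```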
## Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:

<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

## Serial port adaptor

In order to debug problems with several components, we need to be able to plug
into the serial port of the CPU. As we may deal with different voltages and
pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.

This one would do:

<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

## Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we can benefit from another
disk. 2 TB would be plenty. This one seems enough:

- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

@@ -3,10 +3,7 @@

    {
      imports = [
        ../module/slurm-exporter.nix
        ../module/meteocat-exporter.nix
        ../module/upc-qaire-exporter.nix
        ./gpfs-probe.nix
        ./nix-daemon-exporter.nix
      ];

      age.secrets.grafanaJungleRobotPassword = {

@@ -111,9 +108,6 @@

        "127.0.0.1:${toString config.services.prometheus.exporters.smartctl.port}"
        "127.0.0.1:9341" # Slurm exporter
        "127.0.0.1:9966" # GPFS custom exporter
        "127.0.0.1:9999" # Nix-daemon custom exporter
        "127.0.0.1:9929" # Meteocat custom exporter
        "127.0.0.1:9928" # UPC Qaire custom exporter
        "127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}"
      ];
    }];

@@ -1,26 +0,0 @@

    #!/bin/sh

    # Locate nix daemon pid
    nd=$(pgrep -o nix-daemon)

    # Locate children of nix-daemon
    pids1=$(tr ' ' '\n' < "/proc/$nd/task/$nd/children")

    # For each child, locate 2nd level children
    pids2=$(echo "$pids1" | xargs -I @ /bin/sh -c 'cat /proc/@/task/*/children' | tr ' ' '\n')

    cat <<EOF
    HTTP/1.1 200 OK
    Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=values

    # HELP nix_daemon_build Nix daemon derivation build state.
    # TYPE nix_daemon_build gauge
    EOF

    for pid in $pids2; do
        name=$(cat "/proc/$pid/environ" 2>/dev/null | tr '\0' '\n' | rg "^name=(.+)" - --replace '$1' | tr -dc ' [:alnum:]_\-\.')
        user=$(ps -o uname= -p "$pid")
        if [ -n "$name" ] && [ -n "$user" ]; then
            printf 'nix_daemon_build{user="%s",name="%s"} 1\n' "$user" "$name"
        fi
    done

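The script prints one Prometheus text-format sample per running build; a small sketch of reading such a line back (the sample values `alice` and `hello-2.12` are made up for illustration):

```python
import re

# Hypothetical sample in the format the script's printf emits.
line = 'nix_daemon_build{user="alice",name="hello-2.12"} 1'

m = re.match(r'(\w+)\{user="([^"]*)",name="([^"]*)"\} (\d+)$', line)
metric, user, name = m.group(1), m.group(2), m.group(3)
value = int(m.group(4))
print(metric, user, name, value)  # nix_daemon_build alice hello-2.12 1
```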
@@ -1,23 +0,0 @@

    { pkgs, config, lib, ... }:

    let
      script = pkgs.runCommand "nix-daemon-exporter.sh" { }
        ''
          cp ${./nix-daemon-builds.sh} $out;
          chmod +x $out
        '';
    in
    {
      systemd.services.nix-daemon-exporter = {
        description = "Daemon to export nix-daemon metrics";
        path = [ pkgs.procps pkgs.ripgrep ];
        wantedBy = [ "default.target" ];
        serviceConfig = {
          Type = "simple";
          ExecStart = "${pkgs.socat}/bin/socat TCP4-LISTEN:9999,fork EXEC:${script}";
          # Needed root to read the environment, potentially unsafe
          User = "root";
          Group = "root";
        };
      };
    }

@@ -1,17 +0,0 @@

    { config, lib, pkgs, ... }:

    with lib;

    {
      systemd.services."prometheus-meteocat-exporter" = {
        wantedBy = [ "multi-user.target" ];
        after = [ "network.target" ];
        serviceConfig = {
          Restart = mkDefault "always";
          PrivateTmp = mkDefault true;
          WorkingDirectory = mkDefault "/tmp";
          DynamicUser = mkDefault true;
          ExecStart = "${pkgs.meteocat-exporter}/bin/meteocat-exporter";
        };
      };
    }

@@ -1,17 +0,0 @@

    { config, lib, pkgs, ... }:

    with lib;

    {
      systemd.services."prometheus-upc-qaire-exporter" = {
        wantedBy = [ "multi-user.target" ];
        after = [ "network.target" ];
        serviceConfig = {
          Restart = mkDefault "always";
          PrivateTmp = mkDefault true;
          WorkingDirectory = mkDefault "/tmp";
          DynamicUser = mkDefault true;
          ExecStart = "${pkgs.upc-qaire-exporter}/bin/upc-qaire-exporter";
        };
      };
    }

@@ -1,25 +0,0 @@

    { python3Packages, lib }:

    python3Packages.buildPythonApplication rec {
      pname = "meteocat-exporter";
      version = "1.0";

      src = ./.;

      doCheck = false;

      build-system = with python3Packages; [
        setuptools
      ];

      dependencies = with python3Packages; [
        beautifulsoup4
        lxml
        prometheus-client
      ];

      meta = with lib; {
        description = "MeteoCat Prometheus Exporter";
        platforms = platforms.linux;
      };
    }

@@ -1,54 +0,0 @@

    #!/usr/bin/env python3

    import time
    from prometheus_client import start_http_server, Gauge
    from bs4 import BeautifulSoup
    from urllib import request

    # Configuration -------------------------------------------
    meteo_station = "X8"  # Barcelona - Zona Universitària
    listening_port = 9929
    update_period = 60 * 5  # Each 5 min
    # ---------------------------------------------------------

    metric_tmin = Gauge('meteocat_temp_min', 'Min temperature')
    metric_tmax = Gauge('meteocat_temp_max', 'Max temperature')
    metric_tavg = Gauge('meteocat_temp_avg', 'Average temperature')
    metric_srad = Gauge('meteocat_solar_radiation', 'Solar radiation')

    def update(st):
        url = 'https://www.meteo.cat/observacions/xema/dades?codi=' + st
        response = request.urlopen(url)
        data = response.read()
        soup = BeautifulSoup(data, 'lxml')
        table = soup.find("table", {"class": "tblperiode"})
        rows = table.find_all('tr')
        row = rows[-1]  # Take the last row
        row_data = []
        header = row.find('th')
        header_text = header.text.strip()
        row_data.append(header_text)
        for col in row.find_all('td'):
            row_data.append(col.text)
        try:
            # Sometimes it will return '(s/d)' and fail to parse
            metric_tavg.set(float(row_data[1]))
            metric_tmax.set(float(row_data[2]))
            metric_tmin.set(float(row_data[3]))
            metric_srad.set(float(row_data[10]))
            #print("ok: temp_avg={}".format(float(row_data[1])))
        except (ValueError, IndexError):
            print("cannot parse row: {}".format(row))
            metric_tavg.set(float("nan"))
            metric_tmax.set(float("nan"))
            metric_tmin.set(float("nan"))
            metric_srad.set(float("nan"))

    if __name__ == '__main__':
        start_http_server(port=listening_port, addr="localhost")
        while True:
            try:
                update(meteo_station)
            except Exception:
                print("update failed")
            time.sleep(update_period)

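The '(s/d)' fallback in the script can be isolated in a small helper to see its behaviour; a minimal sketch (the `parse_value` name is mine, not from the script):

```python
import math

def parse_value(s):
    # Meteocat cells sometimes hold '(s/d)' (no data); fall back to
    # NaN instead of crashing, mirroring the exporter's behaviour.
    try:
        return float(s)
    except ValueError:
        return float("nan")

print(parse_value("21.4"))               # 21.4
print(math.isnan(parse_value("(s/d)")))  # True
```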
@@ -1,11 +0,0 @@

    #!/usr/bin/env python

    from setuptools import setup, find_packages

    setup(name='meteocat-exporter',
          version='1.0',
          # Modules to import from other scripts:
          packages=find_packages(),
          # Executables
          scripts=["meteocat-exporter"],
          )

@@ -54,6 +54,4 @@ final: prev:

      });

      prometheus-slurm-exporter = prev.callPackage ./slurm-exporter.nix { };
      meteocat-exporter = prev.callPackage ./meteocat-exporter/default.nix { };
      upc-qaire-exporter = prev.callPackage ./upc-qaire-exporter/default.nix { };
    }

@@ -1,24 +0,0 @@

    { python3Packages, lib }:

    python3Packages.buildPythonApplication rec {
      pname = "upc-qaire-exporter";
      version = "1.0";

      src = ./.;

      doCheck = false;

      build-system = with python3Packages; [
        setuptools
      ];

      dependencies = with python3Packages; [
        prometheus-client
        requests
      ];

      meta = with lib; {
        description = "UPC Qaire Prometheus Exporter";
        platforms = platforms.linux;
      };
    }

@@ -1,11 +0,0 @@

    #!/usr/bin/env python

    from setuptools import setup, find_packages

    setup(name='upc-qaire-exporter',
          version='1.0',
          # Modules to import from other scripts:
          packages=find_packages(),
          # Executables
          scripts=["upc-qaire-exporter"],
          )

@@ -1,74 +0,0 @@

    #!/usr/bin/env python3

    import time
    from prometheus_client import start_http_server, Gauge
    import requests, json
    from datetime import datetime, timedelta

    # Configuration -------------------------------------------
    listening_port = 9928
    update_period = 60 * 5  # Each 5 min
    # ---------------------------------------------------------

    metric_temp = Gauge('upc_c6_s302_temp', 'UPC C6 S302 temperature sensor')

    def genparams():
        d = {}
        d['topic'] = 'TEMPERATURE'
        d['shift_dates_to'] = ''
        d['datapoints'] = 301
        d['devicesAndColors'] = '1148418@@@#40ACB6'

        now = datetime.now()

        d['fromDate'] = now.strftime('%d/%m/%Y')
        d['toDate'] = now.strftime('%d/%m/%Y')
        d['serviceFrequency'] = 'NONE'

        # The endpoint requires the full weekly schedule to be spelled out
        for i in range(7):
            for j in range(48):
                key = 'week.days[{}].hours[{}].value'.format(i, j)
                d[key] = 'OPEN'

        return d

    def measure():
        # First we need to load a session
        s = requests.Session()
        r = s.get("https://upc.edu/sirena")
        if r.status_code != 200:
            print("bad HTTP status code on new session: {}".format(r.status_code))
            return None

        if s.cookies.get("JSESSIONID") is None:
            print("cannot get JSESSIONID")
            return None

        # Now we can pull the data
        url = "https://upcsirena.app.dexma.com/l_12535/analysis/by_datapoints/data.json"
        r = s.post(url, data=genparams())

        if r.status_code != 200:
            print("bad HTTP status code on data: {}".format(r.status_code))
            return None

        #print(r.text)
        j = json.loads(r.content)

        # Just take the last one
        last = j['data']['chartElementList'][-1]
        temp = last['values']['1148418-Temperatura']

        return temp

    if __name__ == '__main__':
        start_http_server(port=listening_port, addr="localhost")
        while True:
            try:
                metric_temp.set(measure())
            except Exception:
                print("measure failed")
                metric_temp.set(float("nan"))

            time.sleep(update_period)

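The nested loop in `genparams` produces one form field per half-hour slot of the week; a quick sketch of just that part, to see how many fields the request carries:

```python
# Rebuild only the weekly-schedule part of genparams().
d = {}
for i in range(7):
    for j in range(48):
        d['week.days[{}].hours[{}].value'.format(i, j)] = 'OPEN'

print(len(d))  # 336 form fields: 7 days x 48 half-hour slots
```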