Compare commits
5 Commits
maintenanc...70a6d2e644
| Author | SHA1 | Date |
|---|---|---|
| | 70a6d2e644 | |
| | 31d03ee3ac | |
| | 4e92b14384 | |
| | b90209b4bf | |
| | 785f7cfee8 | |
@@ -1,156 +0,0 @@
# Maintenance purchase 2025-05

We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.

Here is the list:

- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
  - 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>

- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
  - 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

- 1 x Set of screwdrivers
  - 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

- 1 x UART adaptor
  - 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

- 1 x SSD SATA disk of 2 TB
  - 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

Total: 1324.96 €
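As a quick sanity check of the total, here is a minimal Python sketch that recomputes it from the quantities and unit prices in the list. The `ITEMS` table and the `order_total` helper are just illustrative names, and the RAM is counted as 4 pairs because the listing is priced per pair.

```python
from decimal import Decimal

# (description, quantity, unit price in EUR), copied from the list above.
# The RAM listing is sold per pair, so 8 modules are counted as 4 pairs.
ITEMS = [
    ("Power supply DELTA DPS-750XB A", 11, Decimal("57.69")),
    ("RAM DDR4 16 GB (pair)", 4, Decimal("128.85")),
    ("Set of screwdrivers", 1, Decimal("23.99")),
    ("UART adaptor", 1, Decimal("14.99")),
    ("SSD SATA disk of 2 TB", 1, Decimal("135.99")),
]


def order_total(items):
    """Sum quantity * unit price using exact decimal arithmetic."""
    return sum((qty * price for _, qty, price in items), Decimal("0"))


for name, qty, price in ITEMS:
    print(f"{qty:>2} x {name}: {qty * price} EUR")
print("Total:", order_total(ITEMS), "EUR")
```

Using `Decimal` instead of floats avoids rounding noise in the cents.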

# Rationale

Below is the search procedure I followed to come up with that list.

## Power supplies

They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
    Product Manufacturer : DELTA
    Product Name : DPS-750XB A
    Product Part Number : E98791-010
    Product Version : 05
    Product Serial : XXXXXXXXXXXXXXXXX

And we only have one per node. We should make the power supply redundant so we
can tolerate a failure without bringing down the node.

They are available on Amazon, but they are very expensive (287.54 €):

<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>

On AliExpress they are much cheaper (57.69 €):

<https://es.aliexpress.com/item/1005004090017186.html>

We have 11 nodes plus the login, but I'm not able to figure out which power
supply the login is using.

The login uses another one, AXX1100PCRPS, and only has one slot populated. We
may also want to get another one, but I would need to reset the FRU and I don't
have access to the login node. So I will leave this for Operations to deal with.
We can live without the login if needed.
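To repeat the power supply check on every node without reading the FRU listing by hand, something like the following Python sketch could parse the text printed by `ipmitool fru`. The `parse_fru` and `power_supplies` helpers are hypothetical names, and the sketch assumes that each device starts with a `FRU Device Description` line and that power supplies are described as `Pwr Supply ...`, as in the output shown earlier in this section.

```python
def parse_fru(text):
    """Split `ipmitool fru` text into one dict per FRU device."""
    devices = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip prompts, "..." and blank lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "FRU Device Description":
            # A new device block starts here
            current = {key: value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def power_supplies(devices):
    """Keep only the devices that look like power supplies (assumed prefix)."""
    return [d for d in devices
            if d["FRU Device Description"].startswith("Pwr Supply")]


# Sample copied from the listing above (serial number omitted)
SAMPLE = """\
FRU Device Description : Pwr Supply 1 FRU (ID 2)
Product Manufacturer : DELTA
Product Name : DPS-750XB A
Product Part Number : E98791-010
Product Version : 05
"""

for psu in power_supplies(parse_fru(SAMPLE)):
    print(psu["FRU Device Description"], "->", psu.get("Product Name"))
```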

## RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.

Having two spare modules per node would be enough to cover most problems in the
future.

> 16 GB, 2400 MHz RDIMM

The module from dmidecode:

    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
        Array Handle: 0x0020
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_B1
        Bank Locator: NODE 1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: XXXXXXXX
        Asset Tag:
        Part Number: 36ASF2G72PZ-2G3B1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Which is this module:

<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>

But they have only one in stock. Here are more details:

> 16GB PC4-19200 DDR4-2400MHz

The modules must have the following features:

- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron

I would say having 8 spare modules would be enough for now, as we only have a
few that are currently failing. We could upgrade the modules later as, unlike
the power supplies, they are not at much risk of being discontinued.

These may work:

- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>

- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

It is cheaper to buy them in pairs, so let's use the last one.
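To put a number on the saving, this small Python sketch compares the cost of 8 modules using the two listings above. The constants are the prices quoted in this section; the helper names are only for illustration.

```python
from decimal import Decimal

SINGLE_PRICE = Decimal("69.11")   # listing with 1 x 16 GB
PAIR_PRICE = Decimal("128.85")    # listing with 2 x 16 GB
MODULES_NEEDED = 8


def cost_with_singles(modules):
    """Cost when every module is bought from the single-module listing."""
    return modules * SINGLE_PRICE


def cost_with_pairs(modules):
    """Cost when buying pairs, rounding up if the count is odd."""
    pairs = (modules + 1) // 2
    return pairs * PAIR_PRICE


singles = cost_with_singles(MODULES_NEEDED)
pairs = cost_with_pairs(MODULES_NEEDED)
print(f"singles: {singles} EUR, pairs: {pairs} EUR, saving: {singles - pairs} EUR")
```

The pair listing works out to about 64.43 € per module instead of 69.11 €.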

## Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:

<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

## Serial port adaptor

In order to debug problems with several components, we need to be able to plug
into the serial port of the CPU. As we may deal with different voltages and
pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.

This one would do:

<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

## Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we would benefit from another
disk. 2 TB would be plenty. This one seems enough:

- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>
@@ -23,7 +23,6 @@
   trusted-users = [ "@wheel" ];
   flake-registry = pkgs.writeText "global-registry.json"
     ''{"flakes":[],"version":2}'';
-  keep-outputs = true;
 };

 gc = {
@@ -10,7 +10,7 @@ in

 # Connect to intranet git hosts via proxy
 programs.ssh.extraConfig = ''
-  Host bscpm02.bsc.es bscpm03.bsc.es bscpm04.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
+  Host bscpm02.bsc.es bscpm03.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
     User git
     ProxyCommand nc -X connect -x hut:23080 %h %p

@@ -22,7 +22,6 @@ in
 programs.ssh.knownHosts = hostsKeys // {
   "gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
   "bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
-  "bscpm04.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT";
   "glogin1.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
   "glogin2.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
 };
@@ -11,7 +11,7 @@

 proxy = {
   default = "http://hut:23080/";
-  noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40,hut";
+  noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40";
   # Don't set all_proxy as go complains and breaks the gitlab runner, see:
   # https://github.com/golang/go/issues/16715
   allProxy = null;
@@ -56,11 +56,6 @@
       iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
       iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
     '';
-    # Flush all rules and chains on stop so it won't break on start
-    extraStopCommands = ''
-      iptables -F
-      iptables -X
-    '';
   };
 };

@@ -97,13 +97,12 @@
   };
 };

-# DOCKER* chains are useless, override at FORWARD and nixos-fw
+# DOCKER* chains are useless, override at FORWARD
 networking.firewall.extraCommands = ''
-  # Don't forward any traffic from docker
-  iptables -I FORWARD 1 -p all -i docker0 -j nixos-fw-log-refuse
-
-  # Allow incoming traffic from docker to 23080
-  iptables -A nixos-fw -p tcp -i docker0 -d hut --dport 23080 -j ACCEPT
+  # Allow docker to use our proxy
+  iptables -I FORWARD 1 -p tcp -i docker0 -d hut --dport 23080 -j nixos-fw-accept
+  # Block anything else coming from docker
+  iptables -I FORWARD 2 -p all -i docker0 -j nixos-fw-log-refuse
 '';

 #systemd.services.gitlab-runner.serviceConfig.Shell = "${pkgs.bash}/bin/bash";
@@ -46,7 +46,7 @@
 services.prometheus = {
   enable = true;
   port = 9001;
-  retentionTime = "5y";
+  retentionTime = "1y";
   listenAddress = "127.0.0.1";
 };

@@ -250,14 +250,6 @@
       module = [ "raccoon" ];
     };
   }
-  {
-    job_name = "raccoon";
-    static_configs = [
-      {
-        targets = [ "127.0.0.1:19002" ]; # Node exporter
-      }
-    ];
-  }
   {
     job_name = "ipmi-fox";
     metrics_path = "/ipmi";
@@ -17,14 +17,13 @@ let
   };
 in
 {
-  networking.firewall.allowedTCPPorts = [ 80 ];
   services.nginx = {
     enable = true;
     virtualHosts."jungle.bsc.es" = {
       root = "${website}";
       listen = [
         {
-          addr = "0.0.0.0";
+          addr = "127.0.0.1";
           port = 80;
         }
       ];
@@ -4,7 +4,7 @@
   # Don't add hut as a cache to itself
   assert config.networking.hostName != "hut";
   {
-    substituters = [ "http://hut/cache" ];
+    substituters = [ "https://jungle.bsc.es/cache" ];
     trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
   };
 }
@@ -3,6 +3,7 @@
 {
   imports = [
     ../common/base.nix
+    ../module/hut-substituter.nix
   ];

   # Don't install Grub on the disk yet
@@ -25,11 +26,6 @@
   } ];
 };

-nix.settings = {
-  substituters = [ "https://jungle.bsc.es/cache" ];
-  trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
-};
-
 # Configure Nvidia driver to use with CUDA
 hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
 hardware.graphics.enable = true;
@@ -57,7 +57,7 @@ Note: you'll have to be a trusted user.

 ### Nix configuration file (non-nixos)

-If using nix outside of NixOS, you'll have to update `/etc/nix/nix.conf`
+If using nix outside of NixOS, you'll have to update `nix.conf`

 ```
 # echo "substituters = https://jungle.bsc.es/cache" >> /etc/nix/nix.conf