Compare commits
28 Commits
add-fox-ma
...
maintenanc
| Author | SHA1 | Date |
|---|---|---|
| | e8d7ae345d | |
| | 82fc3209de | |
| | abeab18270 | |
| | 1985b58619 | |
| | 44bd061823 | |
| | e8c309f584 | |
| | 71ae7fb585 | |
| | 8834d561d2 | |
| | 29daa3c364 | |
| | 9c503fbefb | |
| | 51b6a8b612 | |
| | 52213d388d | |
| | edf744db8d | |
| | b82894eaec | |
| | 1c47199891 | |
| | 8738bd4eeb | |
| | 7699783aac | |
| | fee1d4da7e | |
| | b77ce7fb56 | |
| | b4a12625c5 | |
| | 302106ea9a | |
| | 96877de8d9 | |
| | 8878985be6 | |
| | 737578db34 | |
| | 88555e3f8c | |
| | feb2060be7 | |
| | 00999434c2 | |
| | 29d58cc62d | |
156
doc/2025-05-maintenance-purchase.md
Normal file
@@ -0,0 +1,156 @@
# Maintenance purchase 2025-05

We need to buy some components to replace broken parts and to keep spares for
when others break. We also need some tools to do basic repairs.

Here is the list:

- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
  - 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>

- 8 x RAM 16 GB DDR4 2400MHz PC4-19200 ECC Registered
  - 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

- 1 x Set of screwdrivers
  - 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

- 1 x UART adaptor
  - 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

- 1 x SSD SATA disk of 2 TB
  - 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

Total: 1324.96 €
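
As a sanity check, the total can be recomputed from the quantities and prices in the list. This is only a sketch: the function and variable names are illustrative, and it assumes the 8 RAM modules are bought as 4 pairs, which is how the per-pair price is read here.

```sh
#!/bin/sh
# Recompute the purchase total from the list above.
# Quantities and prices are copied from the list; 8 RAM modules = 4 pairs (assumption).
purchase_total() {
    awk 'BEGIN {
        psu   = 11 * 57.69      # power supplies, price per unit
        ram   =  4 * 128.85     # RAM, price per pair
        tools = 23.99 + 14.99   # screwdriver set + UART adaptor
        ssd   = 135.99          # 2 TB SATA SSD
        printf "%.2f\n", psu + ram + tools + ssd
    }'
}

purchase_total
```

Running it should print the same figure as the total line, so a price change in the list only needs to be updated in one place to re-check the sum.
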

# Rationale

Below is the search procedure I followed to come up with that list.

## Power supplies

They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.

Most Xeon nodes use the DELTA DPS-750XB A:

    hut% sudo ipmitool fru
    ...
    FRU Device Description : Pwr Supply 1 FRU (ID 2)
     Product Manufacturer  : DELTA
     Product Name          : DPS-750XB A
     Product Part Number   : E98791-010
     Product Version       : 05
     Product Serial        : XXXXXXXXXXXXXXXXX

And we only have one per node. We should make the power supply redundant so we
can tolerate a failure without bringing down the node.

They are available on Amazon, but they are very expensive (287.54 €):

<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>

On Aliexpress they are much cheaper (57.69 €):

<https://es.aliexpress.com/item/1005004090017186.html>

We have 11 nodes plus the login, but I'm not able to figure out which power
supply the login is using.

The login uses another one, AXX1100PCRPS, and only has one slot populated. We
may also want to buy another one, but I would need to reset the FRU and I don't
have access to the login node. So I will leave this for Operations to deal with.
We can live without the login if needed.

## RAM DIMM

The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that the ECC can no longer correct. We
need to replace the bad modules.

Having two spare modules per node would be enough to cover most problems in the
future.

> 16 GB, 2400 MHz RDIMM

The module from dmidecode:

    Handle 0x0026, DMI type 17, 40 bytes
    Memory Device
        Array Handle: 0x0020
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_B1
        Bank Locator: NODE 1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: XXXXXXXX
        Asset Tag:
        Part Number: 36ASF2G72PZ-2G3B1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Which is this module:

<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>

But they have only one in stock. Here are more details:

> 16GB PC4-19200 DDR4-2400MHz

The modules must have the following features:

- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron
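
To compare candidate modules against what is already installed, the relevant fields can be pulled out of the dmidecode output. The sketch below is not part of the original procedure: `dimm_summary` is a hypothetical helper that reads `dmidecode --type 17` output on stdin and prints one semicolon-separated line per populated slot. It does not check ECC or registered status, which these fields do not state directly.

```sh
#!/bin/sh
# Summarise memory modules from `dmidecode --type 17` output read on stdin.
# Output per populated slot: locator;size;type;speed;part number
dimm_summary() {
    awk -F': *' '
        # Print the record collected so far (if any) and reset the fields.
        function emit() {
            if (loc != "" && size != "" && size !~ /^No Module/)
                printf "%s;%s;%s;%s;%s\n", loc, size, type, speed, part
            loc = size = type = speed = part = ""
        }
        /^Handle / { emit() }                 # a new DMI record starts
        { sub(/^[ \t]+/, "", $1) }            # strip the field indentation
        $1 == "Locator"     { loc = $2 }      # exact match skips "Bank Locator"
        $1 == "Size"        { size = $2 }
        $1 == "Type"        { type = $2 }     # exact match skips "Type Detail"
        $1 == "Speed"       { speed = $2 }    # exact match skips "Configured Memory Speed"
        $1 == "Part Number" { part = $2 }
        END { emit() }
    '
}
```

On a node it would be used as `sudo dmidecode --type 17 | dimm_summary`, giving a compact list to check against the size, type and speed requirements above.
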

I would say having 8 spare modules would be enough for now, as only a few are
currently failing. We could upgrade the modules later, since, unlike the power
supplies, there is little risk of them being discontinued.

These may work:

- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>

- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>

It is cheaper to buy them in pairs, so let's use the last one.

## Screwdriver set

In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:

<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>

## Serial port adaptor

In order to debug problems with several components, we need to be able to
connect to the serial port of the CPU. As we may deal with different voltages
and pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.

This one would do:

<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>

## Storage for raccoon

Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we can benefit from another
disk. 2 TB would be plenty. This one seems enough:

- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>
@@ -23,6 +23,7 @@
|
|||||||
trusted-users = [ "@wheel" ];
|
trusted-users = [ "@wheel" ];
|
||||||
flake-registry = pkgs.writeText "global-registry.json"
|
flake-registry = pkgs.writeText "global-registry.json"
|
||||||
''{"flakes":[],"version":2}'';
|
''{"flakes":[],"version":2}'';
|
||||||
|
keep-outputs = true;
|
||||||
};
|
};
|
||||||
|
|
||||||
gc = {
|
gc = {
|
||||||
|
|||||||
@@ -10,7 +10,7 @@ in
|
|||||||
|
|
||||||
# Connect to intranet git hosts via proxy
|
# Connect to intranet git hosts via proxy
|
||||||
programs.ssh.extraConfig = ''
|
programs.ssh.extraConfig = ''
|
||||||
Host bscpm02.bsc.es bscpm03.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
|
Host bscpm02.bsc.es bscpm03.bsc.es bscpm04.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
|
||||||
User git
|
User git
|
||||||
ProxyCommand nc -X connect -x hut:23080 %h %p
|
ProxyCommand nc -X connect -x hut:23080 %h %p
|
||||||
|
|
||||||
@@ -22,6 +22,7 @@ in
|
|||||||
programs.ssh.knownHosts = hostsKeys // {
|
programs.ssh.knownHosts = hostsKeys // {
|
||||||
"gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
|
"gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
|
||||||
"bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
|
"bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
|
||||||
|
"bscpm04.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT";
|
||||||
"glogin1.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
|
"glogin1.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
|
||||||
"glogin2.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
|
"glogin2.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
|
||||||
};
|
};
|
||||||
|
|||||||
@@ -81,7 +81,7 @@
|
|||||||
home = "/home/Computational/abonerib";
|
home = "/home/Computational/abonerib";
|
||||||
description = "Aleix Boné";
|
description = "Aleix Boné";
|
||||||
group = "Computational";
|
group = "Computational";
|
||||||
hosts = [ "owl1" "owl2" "hut" "raccoon" ];
|
hosts = [ "owl1" "owl2" "hut" "raccoon" "fox" ];
|
||||||
hashedPassword = "$6$V1EQWJr474whv7XJ$OfJ0wueM2l.dgiJiiah0Tip9ITcJ7S7qDvtSycsiQ43QBFyP4lU0e0HaXWps85nqB4TypttYR4hNLoz3bz662/";
|
hashedPassword = "$6$V1EQWJr474whv7XJ$OfJ0wueM2l.dgiJiiah0Tip9ITcJ7S7qDvtSycsiQ43QBFyP4lU0e0HaXWps85nqB4TypttYR4hNLoz3bz662/";
|
||||||
openssh.authorizedKeys.keys = [
|
openssh.authorizedKeys.keys = [
|
||||||
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIIFiqXqt88VuUfyANkZyLJNiuroIITaGlOOTMhVDKjf abonerib@bsc"
|
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIIFiqXqt88VuUfyANkZyLJNiuroIITaGlOOTMhVDKjf abonerib@bsc"
|
||||||
@@ -126,6 +126,19 @@
|
|||||||
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGEfy6F4rF80r4Cpo2H5xaWqhuUZzUsVsILSKGJzt5jF dalvare1@ssfhead"
|
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGEfy6F4rF80r4Cpo2H5xaWqhuUZzUsVsILSKGJzt5jF dalvare1@ssfhead"
|
||||||
];
|
];
|
||||||
};
|
};
|
||||||
|
|
||||||
|
varcila = {
|
||||||
|
uid = 5650;
|
||||||
|
isNormalUser = true;
|
||||||
|
home = "/home/Computational/varcila";
|
||||||
|
description = "Vincent Arcila";
|
||||||
|
group = "Computational";
|
||||||
|
hosts = [ "hut" "fox" ];
|
||||||
|
hashedPassword = "$6$oB0Tcn99DcM4Ch$Vn1A0ulLTn/8B2oFPi9wWl/NOsJzaFAWjqekwcuC9sMC7cgxEVb.Nk5XSzQ2xzYcNe5MLtmzkVYnRS1CqP39Y0";
|
||||||
|
openssh.authorizedKeys.keys = [
|
||||||
|
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKGt0ESYxekBiHJQowmKpfdouw0hVm3N7tUMtAaeLejK vincent@varch"
|
||||||
|
];
|
||||||
|
};
|
||||||
};
|
};
|
||||||
|
|
||||||
groups = {
|
groups = {
|
||||||
|
|||||||
@@ -11,7 +11,7 @@
|
|||||||
|
|
||||||
proxy = {
|
proxy = {
|
||||||
default = "http://hut:23080/";
|
default = "http://hut:23080/";
|
||||||
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40";
|
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40,hut";
|
||||||
# Don't set all_proxy as go complains and breaks the gitlab runner, see:
|
# Don't set all_proxy as go complains and breaks the gitlab runner, see:
|
||||||
# https://github.com/golang/go/issues/16715
|
# https://github.com/golang/go/issues/16715
|
||||||
allProxy = null;
|
allProxy = null;
|
||||||
|
|||||||
@@ -56,6 +56,11 @@
|
|||||||
iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
|
iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
|
||||||
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
|
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
|
||||||
'';
|
'';
|
||||||
|
# Flush all rules and chains on stop so it won't break on start
|
||||||
|
extraStopCommands = ''
|
||||||
|
iptables -F
|
||||||
|
iptables -X
|
||||||
|
'';
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|||||||
@@ -22,8 +22,8 @@
|
|||||||
"--docker-network-mode host"
|
"--docker-network-mode host"
|
||||||
];
|
];
|
||||||
environmentVariables = {
|
environmentVariables = {
|
||||||
https_proxy = "http://localhost:23080";
|
https_proxy = "http://hut:23080";
|
||||||
http_proxy = "http://localhost:23080";
|
http_proxy = "http://hut:23080";
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
in {
|
in {
|
||||||
@@ -38,14 +38,13 @@
|
|||||||
gitlab-bsc-docker = {
|
gitlab-bsc-docker = {
|
||||||
# gitlab.bsc.es still uses the old token mechanism
|
# gitlab.bsc.es still uses the old token mechanism
|
||||||
registrationConfigFile = config.age.secrets.gitlab-bsc-docker.path;
|
registrationConfigFile = config.age.secrets.gitlab-bsc-docker.path;
|
||||||
|
tagList = [ "docker" "hut" ];
|
||||||
environmentVariables = {
|
environmentVariables = {
|
||||||
https_proxy = "http://localhost:23080";
|
# We cannot access the hut local interface from docker, so we connect
|
||||||
http_proxy = "http://localhost:23080";
|
# to hut directly via the ethernet one.
|
||||||
|
https_proxy = "http://hut:23080";
|
||||||
|
http_proxy = "http://hut:23080";
|
||||||
};
|
};
|
||||||
# FIXME
|
|
||||||
registrationFlags = [
|
|
||||||
"--docker-network-mode host"
|
|
||||||
];
|
|
||||||
executor = "docker";
|
executor = "docker";
|
||||||
dockerImage = "alpine";
|
dockerImage = "alpine";
|
||||||
dockerVolumes = [
|
dockerVolumes = [
|
||||||
@@ -53,7 +52,15 @@
|
|||||||
"/nix/var/nix/db:/nix/var/nix/db:ro"
|
"/nix/var/nix/db:/nix/var/nix/db:ro"
|
||||||
"/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket:ro"
|
"/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket:ro"
|
||||||
];
|
];
|
||||||
|
dockerExtraHosts = [
|
||||||
|
# Required to pass the proxy via hut
|
||||||
|
"hut:10.0.40.7"
|
||||||
|
];
|
||||||
dockerDisableCache = true;
|
dockerDisableCache = true;
|
||||||
|
registrationFlags = [
|
||||||
|
# Increase build log length to 64 MiB
|
||||||
|
"--output-limit 65536"
|
||||||
|
];
|
||||||
preBuildScript = pkgs.writeScript "setup-container" ''
|
preBuildScript = pkgs.writeScript "setup-container" ''
|
||||||
mkdir -p -m 0755 /nix/var/log/nix/drvs
|
mkdir -p -m 0755 /nix/var/log/nix/drvs
|
||||||
mkdir -p -m 0755 /nix/var/nix/gcroots
|
mkdir -p -m 0755 /nix/var/nix/gcroots
|
||||||
@@ -66,32 +73,39 @@
|
|||||||
mkdir -p -m 0700 "$HOME/.nix-defexpr"
|
mkdir -p -m 0700 "$HOME/.nix-defexpr"
|
||||||
mkdir -p -m 0700 "$HOME/.ssh"
|
mkdir -p -m 0700 "$HOME/.ssh"
|
||||||
cat > "$HOME/.ssh/config" << EOF
|
cat > "$HOME/.ssh/config" << EOF
|
||||||
Host bscpm03.bsc.es gitlab-internal.bsc.es
|
Host bscpm04.bsc.es gitlab-internal.bsc.es
|
||||||
User git
|
User git
|
||||||
ProxyCommand nc -X connect -x hut:23080 %h %p
|
ProxyCommand nc -X connect -x hut:23080 %h %p
|
||||||
Host amdlogin1.bsc.es armlogin1.bsc.es hualogin1.bsc.es glogin1.bsc.es glogin2.bsc.es fpgalogin1.bsc.es
|
Host amdlogin1.bsc.es armlogin1.bsc.es hualogin1.bsc.es glogin1.bsc.es glogin2.bsc.es fpgalogin1.bsc.es
|
||||||
ProxyCommand nc -X connect -x hut:23080 %h %p
|
ProxyCommand nc -X connect -x hut:23080 %h %p
|
||||||
EOF
|
EOF
|
||||||
cat >> "$HOME/.ssh/known_hosts" << EOF
|
cat >> "$HOME/.ssh/known_hosts" << EOF
|
||||||
bscpm03.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS
|
bscpm04.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT
|
||||||
gitlab-internal.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3
|
gitlab-internal.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3
|
||||||
EOF
|
EOF
|
||||||
. ${pkgs.nix}/etc/profile.d/nix-daemon.sh
|
. ${pkgs.nix}/etc/profile.d/nix-daemon.sh
|
||||||
${pkgs.nix}/bin/nix-channel --add https://nixos.org/channels/nixos-24.11 nixpkgs
|
# Required to load SSL certificate paths
|
||||||
${pkgs.nix}/bin/nix-channel --update nixpkgs
|
. ${pkgs.cacert}/nix-support/setup-hook
|
||||||
${pkgs.nix}/bin/nix-env -i ${lib.concatStringsSep " " (with pkgs; [ nix cacert git openssh netcat curl ])}
|
|
||||||
'';
|
'';
|
||||||
environmentVariables = {
|
environmentVariables = {
|
||||||
ENV = "/etc/profile";
|
ENV = "/etc/profile";
|
||||||
USER = "root";
|
USER = "root";
|
||||||
NIX_REMOTE = "daemon";
|
NIX_REMOTE = "daemon";
|
||||||
PATH = "/nix/var/nix/profiles/default/bin:/nix/var/nix/profiles/default/sbin:/bin:/sbin:/usr/bin:/usr/sbin";
|
PATH = "${config.system.path}/bin:/bin:/sbin:/usr/bin:/usr/sbin";
|
||||||
NIX_SSL_CERT_FILE = "/nix/var/nix/profiles/default/etc/ssl/certs/ca-bundle.crt";
|
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
};
|
};
|
||||||
|
|
||||||
|
# DOCKER* chains are useless, override at FORWARD and nixos-fw
|
||||||
|
networking.firewall.extraCommands = ''
|
||||||
|
# Don't forward any traffic from docker
|
||||||
|
iptables -I FORWARD 1 -p all -i docker0 -j nixos-fw-log-refuse
|
||||||
|
|
||||||
|
# Allow incoming traffic from docker to 23080
|
||||||
|
iptables -A nixos-fw -p tcp -i docker0 -d hut --dport 23080 -j ACCEPT
|
||||||
|
'';
|
||||||
|
|
||||||
#systemd.services.gitlab-runner.serviceConfig.Shell = "${pkgs.bash}/bin/bash";
|
#systemd.services.gitlab-runner.serviceConfig.Shell = "${pkgs.bash}/bin/bash";
|
||||||
systemd.services.gitlab-runner.serviceConfig.DynamicUser = lib.mkForce false;
|
systemd.services.gitlab-runner.serviceConfig.DynamicUser = lib.mkForce false;
|
||||||
systemd.services.gitlab-runner.serviceConfig.User = "gitlab-runner";
|
systemd.services.gitlab-runner.serviceConfig.User = "gitlab-runner";
|
||||||
|
|||||||
@@ -46,7 +46,7 @@
|
|||||||
services.prometheus = {
|
services.prometheus = {
|
||||||
enable = true;
|
enable = true;
|
||||||
port = 9001;
|
port = 9001;
|
||||||
retentionTime = "1y";
|
retentionTime = "5y";
|
||||||
listenAddress = "127.0.0.1";
|
listenAddress = "127.0.0.1";
|
||||||
};
|
};
|
||||||
|
|
||||||
@@ -76,7 +76,7 @@
|
|||||||
group = "root";
|
group = "root";
|
||||||
user = "root";
|
user = "root";
|
||||||
configFile = config.age.secrets.ipmiYml.path;
|
configFile = config.age.secrets.ipmiYml.path;
|
||||||
extraFlags = [ "--log.level=debug" ];
|
# extraFlags = [ "--log.level=debug" ];
|
||||||
listenAddress = "127.0.0.1";
|
listenAddress = "127.0.0.1";
|
||||||
};
|
};
|
||||||
node = {
|
node = {
|
||||||
@@ -250,6 +250,14 @@
|
|||||||
module = [ "raccoon" ];
|
module = [ "raccoon" ];
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
{
|
||||||
|
job_name = "raccoon";
|
||||||
|
static_configs = [
|
||||||
|
{
|
||||||
|
targets = [ "127.0.0.1:19002" ]; # Node exporter
|
||||||
|
}
|
||||||
|
];
|
||||||
|
}
|
||||||
{
|
{
|
||||||
job_name = "ipmi-fox";
|
job_name = "ipmi-fox";
|
||||||
metrics_path = "/ipmi";
|
metrics_path = "/ipmi";
|
||||||
|
|||||||
@@ -12,16 +12,19 @@ let
|
|||||||
installPhase = ''
|
installPhase = ''
|
||||||
cp -r public $out
|
cp -r public $out
|
||||||
'';
|
'';
|
||||||
|
# Don't mess doc/
|
||||||
|
dontFixup = true;
|
||||||
};
|
};
|
||||||
in
|
in
|
||||||
{
|
{
|
||||||
|
networking.firewall.allowedTCPPorts = [ 80 ];
|
||||||
services.nginx = {
|
services.nginx = {
|
||||||
enable = true;
|
enable = true;
|
||||||
virtualHosts."jungle.bsc.es" = {
|
virtualHosts."jungle.bsc.es" = {
|
||||||
root = "${website}";
|
root = "${website}";
|
||||||
listen = [
|
listen = [
|
||||||
{
|
{
|
||||||
addr = "127.0.0.1";
|
addr = "0.0.0.0";
|
||||||
port = 80;
|
port = 80;
|
||||||
}
|
}
|
||||||
];
|
];
|
||||||
@@ -38,7 +41,7 @@ in
|
|||||||
proxy_redirect http:// $scheme://;
|
proxy_redirect http:// $scheme://;
|
||||||
}
|
}
|
||||||
location /cache {
|
location /cache {
|
||||||
rewrite ^/cache(.*) /$1 break;
|
rewrite ^/cache/(.*) /$1 break;
|
||||||
proxy_pass http://127.0.0.1:5000;
|
proxy_pass http://127.0.0.1:5000;
|
||||||
proxy_redirect http:// $scheme://;
|
proxy_redirect http:// $scheme://;
|
||||||
}
|
}
|
||||||
|
|||||||
10
m/module/hut-substituter.nix
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
{ config, ... }:
|
||||||
|
{
|
||||||
|
nix.settings =
|
||||||
|
# Don't add hut as a cache to itself
|
||||||
|
assert config.networking.hostName != "hut";
|
||||||
|
{
|
||||||
|
substituters = [ "http://hut/cache" ];
|
||||||
|
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
|
||||||
|
};
|
||||||
|
}
|
||||||
@@ -27,22 +27,6 @@ let
|
|||||||
done
|
done
|
||||||
'';
|
'';
|
||||||
|
|
||||||
prolog = pkgs.writeScript "prolog.sh" ''
|
|
||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
echo "hello from the prolog"
|
|
||||||
|
|
||||||
exit 0
|
|
||||||
'';
|
|
||||||
|
|
||||||
epilog = pkgs.writeScript "epilog.sh" ''
|
|
||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
echo "hello from the epilog"
|
|
||||||
|
|
||||||
exit 0
|
|
||||||
'';
|
|
||||||
|
|
||||||
in {
|
in {
|
||||||
systemd.services.slurmd.serviceConfig = {
|
systemd.services.slurmd.serviceConfig = {
|
||||||
# Kill all processes in the control group on stop/restart. This will kill
|
# Kill all processes in the control group on stop/restart. This will kill
|
||||||
@@ -59,14 +43,13 @@ in {
|
|||||||
clusterName = "jungle";
|
clusterName = "jungle";
|
||||||
nodeName = [
|
nodeName = [
|
||||||
"owl[1,2] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 Feature=owl"
|
"owl[1,2] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 Feature=owl"
|
||||||
"fox Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 Feature=fox"
|
"fox Sockets=2 CoresPerSocket=96 ThreadsPerCore=1 Feature=fox"
|
||||||
"hut Sockets=2 CoresPerSocket=14 ThreadsPerCore=2"
|
"hut Sockets=2 CoresPerSocket=14 ThreadsPerCore=2"
|
||||||
];
|
];
|
||||||
|
|
||||||
partitionName = [
|
partitionName = [
|
||||||
"owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
|
"owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
|
||||||
"fox Nodes=fox Default=NO DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
|
"fox Nodes=fox Default=NO DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
|
||||||
"all Nodes=owl[1-2],hut Default=NO DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
|
|
||||||
];
|
];
|
||||||
|
|
||||||
# See slurm.conf(5) for more details about these options.
|
# See slurm.conf(5) for more details about these options.
|
||||||
|
|||||||
@@ -8,6 +8,7 @@
|
|||||||
../module/slurm-client.nix
|
../module/slurm-client.nix
|
||||||
../module/slurm-firewall.nix
|
../module/slurm-firewall.nix
|
||||||
../module/debuginfod.nix
|
../module/debuginfod.nix
|
||||||
|
../module/hut-substituter.nix
|
||||||
];
|
];
|
||||||
|
|
||||||
# Select the this using the ID to avoid mismatches
|
# Select the this using the ID to avoid mismatches
|
||||||
|
|||||||
@@ -8,6 +8,7 @@
|
|||||||
../module/slurm-client.nix
|
../module/slurm-client.nix
|
||||||
../module/slurm-firewall.nix
|
../module/slurm-firewall.nix
|
||||||
../module/debuginfod.nix
|
../module/debuginfod.nix
|
||||||
|
../module/hut-substituter.nix
|
||||||
];
|
];
|
||||||
|
|
||||||
# Select the this using the ID to avoid mismatches
|
# Select the this using the ID to avoid mismatches
|
||||||
|
|||||||
@@ -25,6 +25,11 @@
|
|||||||
} ];
|
} ];
|
||||||
};
|
};
|
||||||
|
|
||||||
|
nix.settings = {
|
||||||
|
substituters = [ "https://jungle.bsc.es/cache" ];
|
||||||
|
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
|
||||||
|
};
|
||||||
|
|
||||||
# Configure Nvidia driver to use with CUDA
|
# Configure Nvidia driver to use with CUDA
|
||||||
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
|
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
|
||||||
hardware.graphics.enable = true;
|
hardware.graphics.enable = true;
|
||||||
|
|||||||
@@ -1,9 +1,11 @@
|
|||||||
age-encryption.org/v1
|
age-encryption.org/v1
|
||||||
-> ssh-ed25519 HY2yRg 4Xns3jybBuv8flzd+h3DArVBa/AlKjt1J9jAyJsasCE
|
-> ssh-ed25519 HY2yRg WSdjyQPzBJ4JbzQpGeq1AAYpWKoXmLI1ZtmNmM5QOzs
|
||||||
uyVjJxh5i8aGgAgCpPl6zTYeIkf9mIwURof51IKWvwE
|
qGDlDT31DQF1DdHen0+5+52DdsQlabJdA2pOB5O1I6g
|
||||||
-> ssh-ed25519 CAWG4Q T2r6r1tyNgq1XlYXVtLJFfOfUnm6pSVlPwUqC1pkyRo
|
-> ssh-ed25519 CAWG4Q wioWMDxQjN+d4JdIbCwZg0DLQu1OH2mV6gukRprjuAs
|
||||||
9yDoKU0EC34QMUXYnsJvhPCLm6oD9w7NlTi2sheoBqQ
|
670fE61hidOEh20hHiQAhP0+CjDF0WMBNzgwkGT8Yqg
|
||||||
-> ssh-ed25519 MSF3dg Bh9DekFTq+QMUEAonwcaIAJX4Js1O7cHjDniCD0gtm8
|
-> ssh-ed25519 MSF3dg DN19uvAEtqq4708P6HpuX9i/o/qAvHX6dj69dCF2H1o
|
||||||
t/Ro0URLeDUWcvb7rlkG2s03PZ+9Rr3N4TIX03tXpVc
|
4Lu9GnjiFLMeXJ2C7aVPJsCHCQVlhylNWJi896Av92s
|
||||||
--- E5+/D4aK2ihKRR4YC5XOTmUbKgOqBR0Nk0gYvFOzXOI
|
--- 7cKBwOYNOUZ2h3/kAY09aSMASZSxX7hZIT4kvlIiT6w
|
||||||
(binary age-encrypted payload, not shown)
@@ -13,6 +13,115 @@ which is available at `hut` or `xeon07`. It runs the following services:
- Grafana: to plot the data in the web browser.
- Slurmctld: to manage the SLURM nodes.
- Gitlab runner: to run CI jobs from Gitlab.
- Nix binary cache: to serve cached nix builds.

This node is prone to interruptions from all the services it runs, so it is not
a good candidate for low noise executions.

# Binary cache

We provide a binary cache in `hut`, with the aim of avoiding unnecessary
recompilation of packages.

The cache should contain common packages from bscpkgs, but we don't provide
any guarantee of what will be available in the cache, or for how long.
We recommend following the latest version of the `jungle` flake to avoid cache
misses.

## Usage

### From NixOS

In NixOS, we can add the cache through the `nix.settings` option, which will
enable it for all builds in the system.

```nix
{ ... }: {
  nix.settings = {
    substituters = [ "https://jungle.bsc.es/cache" ];
    trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
  };
}
```

### Interactively

The cache can also be specified on a per-command basis through the flags
`--substituters` and `--trusted-public-keys`:

```sh
nix build --substituters "https://jungle.bsc.es/cache" --trusted-public-keys "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" <...>
```

Note: you'll have to be a trusted user.

### Nix configuration file (non-NixOS)

If using nix outside of NixOS, you'll have to update `/etc/nix/nix.conf`:

```
# echo "substituters = https://jungle.bsc.es/cache" >> /etc/nix/nix.conf
# echo "trusted-public-keys = jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" >> /etc/nix/nix.conf
```
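
Running the `echo` commands twice leaves duplicate lines in the file. A minimal sketch of an idempotent alternative follows; `add_conf_line` is a hypothetical helper, not something shipped with nix. It appends a line only when the exact same line is not already present.

```sh
#!/bin/sh
# Append a setting line to a nix.conf-style file only if it is not there yet.
# Usage: add_conf_line FILE LINE
add_conf_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

As root, it could be called as `add_conf_line /etc/nix/nix.conf "extra-substituters = https://jungle.bsc.es/cache"`, and likewise for the key. Using the `extra-` forms here is a suggestion rather than what the commands above do: they should add to the existing substituters and keys instead of replacing them, in the same way as the flake hint in this document.
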

### Hint in flakes

By adding the configuration below to a `flake.nix`, when someone uses the flake,
`nix` will interactively ask to trust and use the provided binary cache:

```nix
{
  nixConfig = {
    extra-substituters = [
      "https://jungle.bsc.es/cache"
    ];
    extra-trusted-public-keys = [
      "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0="
    ];
  };
  outputs = { ... }: {
    ...
  };
}
```

### Querying the cache

Check if the cache is available:
```sh
$ curl https://jungle.bsc.es/cache/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
```

Prevent nix from building locally:
```bash
nix build --max-jobs 0 <...>
```

Check if a package is in cache:
```bash
# Do a raw eval on the <package>.outPath (this should not build the package)
$ nix eval --raw jungle#openmp.outPath
/nix/store/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-openmp-2024.11
# Take the hash (everything from / to - in the basename) and curl <hash>.narinfo
# if it exists in the cache, it will return HTTP 200 and some information
# if not, it will return 404
$ curl https://jungle.bsc.es/cache/dwnn4dgm1m4184l4xbi0qfrprji9wjmi.narinfo
StorePath: /nix/store/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-openmp-2024.11
URL: nar/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-17imkdfqzmnb013d14dx234bx17bnvws8baf3ii1xra5qi2y1wiz.nar
Compression: none
NarHash: sha256:17imkdfqzmnb013d14dx234bx17bnvws8baf3ii1xra5qi2y1wiz
NarSize: 1519328
References: 4gk773fqcsv4fh2rfkhs9bgfih86fdq8-gcc-13.3.0-lib nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36
Deriver: vcn0x8hikc4mvxdkvrdxp61bwa5r7lr6-openmp-2024.11.drv
Sig: jungle.bsc.es:GDTOUEs1jl91wpLbb+gcKsAZjpKdARO9j5IQqb3micBeqzX2M/NDtKvgCS1YyiudOUdcjwa3j+hyzV2njokcCA==
# As a one-liner:
$ curl "https://jungle.bsc.es/cache/$(nix eval --raw jungle#<package>.outPath | cut -d '/' -f4 | cut -d '-' -f1).narinfo"
```
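
The hash extraction and the status check can also be wrapped in two small shell functions. This is only a sketch: `store_hash` and `narinfo_status` are hypothetical names, and the second one needs network access to the cache.

```sh
#!/bin/sh
# Extract the hash part of a store path:
# /nix/store/<hash>-<name>  ->  <hash>
store_hash() {
    base=${1##*/}                  # drop everything up to the last "/"
    printf '%s\n' "${base%%-*}"    # keep what precedes the first "-"
}

# Print the HTTP status of the narinfo for a store path.
# Following the notes above: 200 if the path is cached, 404 if it is not.
narinfo_status() {
    curl -s -o /dev/null -w '%{http_code}\n' \
        "https://jungle.bsc.es/cache/$(store_hash "$1").narinfo"
}
```

For example, `narinfo_status "$(nix eval --raw jungle#openmp.outPath)"` prints only the status code, which is easier to use in scripts than the full narinfo.
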

#### References

- https://nix.dev/guides/recipes/add-binary-cache.html
- https://nixos.wiki/wiki/Binary_Cache