28 Commits

Author SHA1 Message Date
e8d7ae345d Maintenance purchase for 2025-05
List of components we would need to buy.
2025-05-07 14:14:58 +02:00
82fc3209de Set keep-outputs to true in all machines
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:

If true, the garbage collector will keep the outputs of non-garbage
derivations. If false (default), outputs will be deleted unless they are
GC roots themselves (or reachable from other roots).

In general, outputs must be registered as roots separately. However,
even if the output of a derivation is registered as a root, the
collector will still delete store paths that are used only at build time
(e.g., the C compiler, or source tarballs downloaded from the network).
To prevent it from doing so, set this option to true.

See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-04-22 17:27:37 +02:00
abeab18270 Add raccoon node exporter monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 14:50:08 +02:00
1985b58619 Increase data retention to 5 years
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:

	# du -sh /var/lib/prometheus2
	13G     /var/lib/prometheus2

So we can expect it to increase to about 65 GiB. In the future we may
want to reduce the acquisition frequency of some metrics.
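
The estimate is a plain linear projection. A minimal sketch, assuming
usage keeps growing at the rate observed during the first year
(`estimate_gib` is a hypothetical helper, not part of the repository):

```shell
# Rough projection: assumes disk usage grows linearly with retention time.
# Usage: estimate_gib <GiB used per year> <retention in years>
estimate_gib() {
    echo $(( $1 * $2 ))
}

estimate_gib 13 5   # prints 65
```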

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 14:50:03 +02:00
44bd061823 Don't forward any docker traffic
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:16:15 +02:00
e8c309f584 Allow traffic from docker to enter port 23080
Before:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out

After:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  10.0.40.7 (10.0.40.7:23080) open

Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:16:10 +02:00
71ae7fb585 Add bscpm04.bsc.es SSH host and public key
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:15:45 +02:00
8834d561d2 Add nix cache documentation section
Include usage from NixOS and non-NixOS hosts and a test with curl to
ensure it can be reached.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:22 +02:00
29daa3c364 Use hut nix cache in owl1, owl2 and raccoon
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:17 +02:00
9c503fbefb Clean all iptables rules on stop
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:14 +02:00
51b6a8b612 Make nginx listen on all interfaces
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:07 +02:00
52213d388d Fix nginx /cache regex
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```

This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
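
The effect of the old and new patterns can be reproduced offline. This is
only a sketch: it emulates the nginx `rewrite` with `sed -E`, assuming
both patterns behave the same under POSIX extended regular expressions as
under the PCRE engine used by nginx (the function names are made up):

```shell
# Emulate the nginx rewrite rules with sed. The captured group is
# appended after a single "/" in both cases, as in the nginx config.
old_rewrite() { printf '%s\n' "$1" | sed -E 's|^/cache(.*)|/\1|'; }
new_rewrite() { printf '%s\n' "$1" | sed -E 's|^/cache/(.*)|/\1|'; }

old_rewrite /cache/nix-cache-info   # //nix-cache-info (rejected by nix-serve)
old_rewrite /cachenix-cache-info    # /nix-cache-info  (worked by accident)
new_rewrite /cache/nix-cache-info   # /nix-cache-info
```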

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:04 +02:00
edf744db8d Add new GitLab runner for gitlab.bsc.es
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:41:18 +02:00
b82894eaec Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:27 +02:00
1c47199891 Add varcila user to hut and fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:25 +02:00
8738bd4eeb Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:23 +02:00
7699783aac Add abonerib user to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:21 +02:00
fee1d4da7e Don't move doc in web output
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:19 +02:00
b77ce7fb56 Add quickstart guide
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:17 +02:00
b4a12625c5 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:15 +02:00
302106ea9a Add users to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:13 +02:00
96877de8d9 Add dalvare1 user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:11 +02:00
8878985be6 Add fox page in jungle website
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:08 +02:00
737578db34 Mount NVME disks in /nvme{0,1}
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:06 +02:00
88555e3f8c Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:04 +02:00
feb2060be7 Use IPMI host names instead of IP addresses
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:01 +02:00
00999434c2 Add fox IPMI monitoring
Use agenix to store the credentials safely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:59 +02:00
29d58cc62d Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:42 +02:00
15 changed files with 391 additions and 11 deletions

View File

@@ -0,0 +1,156 @@
# Maintenance purchase 2025-05
We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.
Here is the list:
- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
  - 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>
- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
  - 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
- 1 x Set of screwdrivers
  - 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
- 1 x UART adaptor
  - 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
- 1 x SSD SATA disk of 2 TB
  - 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

Total: 1324.96 €
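
The total can be recomputed from the unit prices in the list. A small
sketch, assuming `awk` is available (`purchase_total` is a hypothetical
helper used only for this check):

```shell
# Recompute the order total from the unit prices in the list.
# RAM is priced per pair, so 8 modules are 4 pairs.
purchase_total() {
    awk 'BEGIN {
        psu   = 11 * 57.69     # power supplies
        ram   =  4 * 128.85    # 8 DIMMs, sold in pairs
        tools = 23.99 + 14.99  # screwdriver set + UART adaptor
        ssd   = 135.99         # 2 TB SATA SSD
        printf "%.2f\n", psu + ram + tools + ssd
    }'
}

purchase_total   # prints 1324.96
```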
# Rationale
Below is the search procedure I followed to come up with that list.
## Power supplies
They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase our stock.
Most Xeon nodes use the DELTA DPS-750XB A:
hut% sudo ipmitool fru
...
FRU Device Description : Pwr Supply 1 FRU (ID 2)
Product Manufacturer : DELTA
Product Name : DPS-750XB A
Product Part Number : E98791-010
Product Version : 05
Product Serial : XXXXXXXXXXXXXXXXX
We only have one per node. We should make the power supplies redundant so that
one can fail without bringing down the node.
They are available on Amazon, but they are very expensive (287.54 €):
<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>
On Aliexpress they are much cheaper (57.69 €):
<https://es.aliexpress.com/item/1005004090017186.html>
We have 11 nodes plus the login, but I'm not able to figure out which power
supply the login is using.
The login uses another one, AXX1100PCRPS, and only has one slot populated. We
may also want to buy another one, but I would need to reset the FRU and I don't
have access to the login node. So I will leave this for Operations to deal with.
We can live without the login if needed.
## RAM DIMM
The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.
Having two spare modules per node would be enough to cover most problems in the
future.
> 16 GB, 2400 MHz RDIMM
The module from dmidecode:
Handle 0x0026, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0020
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM_B1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Micron
Serial Number: XXXXXXXX
Asset Tag:
Part Number: 36ASF2G72PZ-2G3B1
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Which is this module:
<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>
But they have only one in stock. Here are more details:
> 16GB PC4-19200 DDR4-2400MHz
They must have the following features:
- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron
I would say having 8 spare modules would be enough for now, as only a few are
currently failing. We could buy more modules later as, unlike the power
supplies, they are at little risk of being discontinued.
These may work:
- 1 x 16GB, 69.11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>
- 2 x 16GB, 128.85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
It is cheaper to buy them in pairs, so let's go with the last one.
## Screwdriver set
In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:
<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
## Serial port adaptor
In order to debug problems with several components, we need to be able to
connect to the serial port of the CPU. As we may deal with different voltages
and pinouts, the most versatile option is an adaptor that lets us select the
voltage and exposes a pin interface.
This one would do:
<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
## Storage for raccoon
Given that we are currently using raccoon for builds too, we need to increase
its storage. We only have 270 GB available, so we can benefit from another
disk. A 2 TB disk would be plenty. This one seems enough:
- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

View File

@@ -23,6 +23,7 @@
trusted-users = [ "@wheel" ];
flake-registry = pkgs.writeText "global-registry.json"
''{"flakes":[],"version":2}'';
keep-outputs = true;
};
gc = {

View File

@@ -10,7 +10,7 @@ in
# Connect to intranet git hosts via proxy
programs.ssh.extraConfig = ''
Host bscpm02.bsc.es bscpm03.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
Host bscpm02.bsc.es bscpm03.bsc.es bscpm04.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
User git
ProxyCommand nc -X connect -x hut:23080 %h %p
@@ -22,6 +22,7 @@ in
programs.ssh.knownHosts = hostsKeys // {
"gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
"bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
"bscpm04.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT";
"glogin1.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
"glogin2.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
};

View File

@@ -11,7 +11,7 @@
proxy = {
default = "http://hut:23080/";
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40";
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40,hut";
# Don't set all_proxy as go complains and breaks the gitlab runner, see:
# https://github.com/golang/go/issues/16715
allProxy = null;

View File

@@ -56,6 +56,11 @@
iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
'';
# Flush all rules and chains on stop so it won't break on start
extraStopCommands = ''
iptables -F
iptables -X
'';
};
};

View File

@@ -1,8 +1,9 @@
{ pkgs, lib, config, ... }:
{
age.secrets.gitlabRunnerShellToken.file = ../../secrets/gitlab-runner-shell-token.age;
age.secrets.gitlabRunnerDockerToken.file = ../../secrets/gitlab-runner-docker-token.age;
age.secrets.gitlab-pm-shell.file = ../../secrets/gitlab-runner-shell-token.age;
age.secrets.gitlab-pm-docker.file = ../../secrets/gitlab-runner-docker-token.age;
age.secrets.gitlab-bsc-docker.file = ../../secrets/gitlab-bsc-docker-token.age;
services.gitlab-runner = {
enable = true;
@@ -21,21 +22,90 @@
"--docker-network-mode host"
];
environmentVariables = {
https_proxy = "http://localhost:23080";
http_proxy = "http://localhost:23080";
https_proxy = "http://hut:23080";
http_proxy = "http://hut:23080";
};
};
in {
# For pm.bsc.es/gitlab
gitlab-pm-shell = common-shell // {
authenticationTokenConfigFile = config.age.secrets.gitlabRunnerShellToken.path;
authenticationTokenConfigFile = config.age.secrets.gitlab-pm-shell.path;
};
gitlab-pm-docker = common-docker // {
authenticationTokenConfigFile = config.age.secrets.gitlabRunnerDockerToken.path;
authenticationTokenConfigFile = config.age.secrets.gitlab-pm-docker.path;
};
gitlab-bsc-docker = {
# gitlab.bsc.es still uses the old token mechanism
registrationConfigFile = config.age.secrets.gitlab-bsc-docker.path;
tagList = [ "docker" "hut" ];
environmentVariables = {
# We cannot access the hut local interface from docker, so we connect
# to hut directly via the ethernet one.
https_proxy = "http://hut:23080";
http_proxy = "http://hut:23080";
};
executor = "docker";
dockerImage = "alpine";
dockerVolumes = [
"/nix/store:/nix/store:ro"
"/nix/var/nix/db:/nix/var/nix/db:ro"
"/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket:ro"
];
dockerExtraHosts = [
# Required to pass the proxy via hut
"hut:10.0.40.7"
];
dockerDisableCache = true;
registrationFlags = [
# Increase build log length to 64 MiB
"--output-limit 65536"
];
preBuildScript = pkgs.writeScript "setup-container" ''
mkdir -p -m 0755 /nix/var/log/nix/drvs
mkdir -p -m 0755 /nix/var/nix/gcroots
mkdir -p -m 0755 /nix/var/nix/profiles
mkdir -p -m 0755 /nix/var/nix/temproots
mkdir -p -m 0755 /nix/var/nix/userpool
mkdir -p -m 1777 /nix/var/nix/gcroots/per-user
mkdir -p -m 1777 /nix/var/nix/profiles/per-user
mkdir -p -m 0755 /nix/var/nix/profiles/per-user/root
mkdir -p -m 0700 "$HOME/.nix-defexpr"
mkdir -p -m 0700 "$HOME/.ssh"
cat > "$HOME/.ssh/config" << EOF
Host bscpm04.bsc.es gitlab-internal.bsc.es
User git
ProxyCommand nc -X connect -x hut:23080 %h %p
Host amdlogin1.bsc.es armlogin1.bsc.es hualogin1.bsc.es glogin1.bsc.es glogin2.bsc.es fpgalogin1.bsc.es
ProxyCommand nc -X connect -x hut:23080 %h %p
EOF
cat >> "$HOME/.ssh/known_hosts" << EOF
bscpm04.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT
gitlab-internal.bsc.es ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3
EOF
. ${pkgs.nix}/etc/profile.d/nix-daemon.sh
# Required to load SSL certificate paths
. ${pkgs.cacert}/nix-support/setup-hook
'';
environmentVariables = {
ENV = "/etc/profile";
USER = "root";
NIX_REMOTE = "daemon";
PATH = "${config.system.path}/bin:/bin:/sbin:/usr/bin:/usr/sbin";
};
};
};
};
# DOCKER* chains are useless, override at FORWARD and nixos-fw
networking.firewall.extraCommands = ''
# Don't forward any traffic from docker
iptables -I FORWARD 1 -p all -i docker0 -j nixos-fw-log-refuse
# Allow incoming traffic from docker to 23080
iptables -A nixos-fw -p tcp -i docker0 -d hut --dport 23080 -j ACCEPT
'';
#systemd.services.gitlab-runner.serviceConfig.Shell = "${pkgs.bash}/bin/bash";
systemd.services.gitlab-runner.serviceConfig.DynamicUser = lib.mkForce false;
systemd.services.gitlab-runner.serviceConfig.User = "gitlab-runner";

View File

@@ -46,7 +46,7 @@
services.prometheus = {
enable = true;
port = 9001;
retentionTime = "1y";
retentionTime = "5y";
listenAddress = "127.0.0.1";
};
@@ -250,6 +250,14 @@
module = [ "raccoon" ];
};
}
{
job_name = "raccoon";
static_configs = [
{
targets = [ "127.0.0.1:19002" ]; # Node exporter
}
];
}
{
job_name = "ipmi-fox";
metrics_path = "/ipmi";

View File

@@ -17,13 +17,14 @@ let
};
in
{
networking.firewall.allowedTCPPorts = [ 80 ];
services.nginx = {
enable = true;
virtualHosts."jungle.bsc.es" = {
root = "${website}";
listen = [
{
addr = "127.0.0.1";
addr = "0.0.0.0";
port = 80;
}
];
@@ -40,7 +41,7 @@ in
proxy_redirect http:// $scheme://;
}
location /cache {
rewrite ^/cache(.*) /$1 break;
rewrite ^/cache/(.*) /$1 break;
proxy_pass http://127.0.0.1:5000;
proxy_redirect http:// $scheme://;
}

View File

@@ -0,0 +1,10 @@
{ config, ... }:
{
nix.settings =
# Don't add hut as a cache to itself
assert config.networking.hostName != "hut";
{
substituters = [ "http://hut/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
}

View File

@@ -8,6 +8,7 @@
../module/slurm-client.nix
../module/slurm-firewall.nix
../module/debuginfod.nix
../module/hut-substituter.nix
];
# Select the this using the ID to avoid mismatches

View File

@@ -8,6 +8,7 @@
../module/slurm-client.nix
../module/slurm-firewall.nix
../module/debuginfod.nix
../module/hut-substituter.nix
];
# Select the this using the ID to avoid mismatches

View File

@@ -25,6 +25,11 @@
} ];
};
nix.settings = {
substituters = [ "https://jungle.bsc.es/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
# Configure Nvidia driver to use with CUDA
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
hardware.graphics.enable = true;

View File

@@ -0,0 +1,11 @@
age-encryption.org/v1
-> ssh-ed25519 HY2yRg WSdjyQPzBJ4JbzQpGeq1AAYpWKoXmLI1ZtmNmM5QOzs
qGDlDT31DQF1DdHen0+5+52DdsQlabJdA2pOB5O1I6g
-> ssh-ed25519 CAWG4Q wioWMDxQjN+d4JdIbCwZg0DLQu1OH2mV6gukRprjuAs
670fE61hidOEh20hHiQAhP0+CjDF0WMBNzgwkGT8Yqg
-> ssh-ed25519 MSF3dg DN19uvAEtqq4708P6HpuX9i/o/qAvHX6dj69dCF2H1o
4Lu9GnjiFLMeXJ2C7aVPJsCHCQVlhylNWJi896Av92s
--- 7cKBwOYNOUZ2h3/kAY09aSMASZSxX7hZIT4kvlIiT6w
[binary ciphertext omitted]

View File

@@ -9,6 +9,7 @@ in
"gitea-runner-token.age".publicKeys = hut;
"gitlab-runner-docker-token.age".publicKeys = hut;
"gitlab-runner-shell-token.age".publicKeys = hut;
"gitlab-bsc-docker-token.age".publicKeys = hut;
"nix-serve.age".publicKeys = hut;
"jungle-robot-password.age".publicKeys = hut;
"ipmi.yml.age".publicKeys = hut;

View File

@@ -13,6 +13,115 @@ which is available at `hut` or `xeon07`. It runs the following services:
- Grafana: to plot the data in the web browser.
- Slurmctld: to manage the SLURM nodes.
- Gitlab runner: to run CI jobs from Gitlab.
- Nix binary cache: to serve cached nix builds.
This node is prone to interruptions from all the services it runs, so it is not
a good candidate for low noise executions.
# Binary cache
We provide a binary cache in `hut`, with the aim of avoiding unnecessary
recompilation of packages.
The cache should contain common packages from bscpkgs, but we don't provide
any guarantee of what will be available in the cache, or for how long.
We recommend following the latest version of the `jungle` flake to avoid cache
misses.
## Usage
### From NixOS
In NixOS, we can add the cache through the `nix.settings` option, which will
enable it for all builds in the system.
```nix
{ ... }: {
nix.settings = {
substituters = [ "https://jungle.bsc.es/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
}
```
### Interactively
The cache can also be specified on a per-command basis through the flags
`--substituters` and `--trusted-public-keys`:
```sh
nix build --substituters "https://jungle.bsc.es/cache" --trusted-public-keys "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" <...>
```
Note: you'll have to be a trusted user.
### Nix configuration file (non-NixOS)
If using nix outside of NixOS, you'll have to update `/etc/nix/nix.conf`:
```
# echo "substituters = https://jungle.bsc.es/cache" >> /etc/nix/nix.conf
# echo "trusted-public-keys = jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" >> /etc/nix/nix.conf
```
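A plain `substituters =` line sets the whole list, so any cache configured
earlier in the file may stop being used. Nix also accepts the `extra-`
prefixed variants of these settings, which append to the current value. The
sketch below assumes that behaviour. The helper name `add_jungle_cache` is
hypothetical, and the path to the configuration file is a parameter so it can
be tried on a copy first:

```shell
# Hypothetical helper: append the jungle cache to a nix.conf without
# replacing the substituters that are already configured.
# Run as root when targeting /etc/nix/nix.conf.
add_jungle_cache() {
    conf="${1:-/etc/nix/nix.conf}"
    # Skip if the cache is already listed, so repeated runs are harmless
    if grep -qs 'jungle.bsc.es/cache' "$conf"; then
        return 0
    fi
    {
        echo "extra-substituters = https://jungle.bsc.es/cache"
        echo "extra-trusted-public-keys = jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0="
    } >> "$conf"
}
```

For example, `add_jungle_cache ./nix.conf.test` writes to a scratch file, and
calling it with no arguments targets `/etc/nix/nix.conf`.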
### Hint in flakes
By adding the configuration below to a `flake.nix`, when someone uses the flake,
`nix` will interactively ask to trust and use the provided binary cache:
```nix
{
nixConfig = {
extra-substituters = [
"https://jungle.bsc.es/cache"
];
extra-trusted-public-keys = [
"jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0="
];
};
outputs = { ... }: {
...
};
}
```
### Querying the cache
Check if the cache is available:
```sh
$ curl https://jungle.bsc.es/cache/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
```
Prevent nix from building locally:
```bash
nix build --max-jobs 0 <...>
```
Check if a package is in cache:
```bash
# Do a raw eval on the <package>.outPath (this should not build the package)
$ nix eval --raw jungle#openmp.outPath
/nix/store/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-openmp-2024.11
# Take the hash (everything from / to - in the basename) and curl <hash>.narinfo
# if it exists in the cache, it will return HTTP 200 and some information
# if not, it will return 404
$ curl https://jungle.bsc.es/cache/dwnn4dgm1m4184l4xbi0qfrprji9wjmi.narinfo
StorePath: /nix/store/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-openmp-2024.11
URL: nar/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-17imkdfqzmnb013d14dx234bx17bnvws8baf3ii1xra5qi2y1wiz.nar
Compression: none
NarHash: sha256:17imkdfqzmnb013d14dx234bx17bnvws8baf3ii1xra5qi2y1wiz
NarSize: 1519328
References: 4gk773fqcsv4fh2rfkhs9bgfih86fdq8-gcc-13.3.0-lib nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36
Deriver: vcn0x8hikc4mvxdkvrdxp61bwa5r7lr6-openmp-2024.11.drv
Sig: jungle.bsc.es:GDTOUEs1jl91wpLbb+gcKsAZjpKdARO9j5IQqb3micBeqzX2M/NDtKvgCS1YyiudOUdcjwa3j+hyzV2njokcCA==
# As a one-liner:
$ curl "https://jungle.bsc.es/cache/$(nix eval --raw jungle#<package>.outPath | cut -d '/' -f4 | cut -d '-' -f1).narinfo"
```
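The same check can be wrapped in two small functions for use in scripts. This
is a sketch, not part of the repository: `store_hash` and `in_cache` are
hypothetical names, and `in_cache` assumes `curl` is installed and the cache
is reachable:

```shell
# Extract the hash part of a store path: the text between the last "/"
# and the first "-" of the basename.
store_hash() {
    base="${1##*/}"
    echo "${base%%-*}"
}

# Succeeds (exit status 0) if the cache serves a narinfo for the store path.
# The base URL defaults to the public endpoint and can be overridden.
in_cache() {
    curl -fsS -o /dev/null "${2:-https://jungle.bsc.es/cache}/$(store_hash "$1").narinfo"
}

store_hash /nix/store/dwnn4dgm1m4184l4xbi0qfrprji9wjmi-openmp-2024.11
# prints dwnn4dgm1m4184l4xbi0qfrprji9wjmi
```

From a machine on the LAN, `in_cache "$path" http://hut/cache` would query
hut directly instead of going through the proxy.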
#### References
- https://nix.dev/guides/recipes/add-binary-cache.html
- https://nixos.wiki/wiki/Binary_Cache