6 Commits

Author SHA1 Message Date
f4eb5f27d3 make hut cache a trusted substituter 2025-04-08 18:30:20 +02:00
3590879c17 Add documentation section with commands to query the cache 2025-04-08 17:48:33 +02:00
31d03ee3ac Add instructions on how to use the cache in the web 2025-04-08 17:48:33 +02:00
4e92b14384 Use hut-substituter module in owl{1,2} and raccoon 2025-04-08 17:48:33 +02:00
b90209b4bf Add module to use hut as a binary substituter 2025-04-08 17:48:33 +02:00
785f7cfee8 Fix nginx /cache regex
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```

This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
2025-04-08 17:48:33 +02:00
11 changed files with 14 additions and 190 deletions

View File

@@ -1,156 +0,0 @@
# Maintenance purchase 2025-05
We need to buy some components to replace broken parts or to have spare ones for
when they break. We also need some tools to do basic repairs.
Here is the list:
- 11 x Power supply DELTA DPS-750XB A (700 W) (this is critical)
- 57.69€/unit, 634.59€ total <https://es.aliexpress.com/item/1005004090017186.html>
- 8 x RAM DDR4 2400MHz PC4-19200 ECC Registered
- 128.85€/pair, 515.40€ total <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
- 1 x Set of screwdrivers
- 23.99€ <https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
- 1 x UART adaptor
- 14.99€ <https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
- 1 x SSD SATA disk of 2 TB
- 135.99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>
Total: 1324.96 €
# Rationale
Below is the search procedure I followed to come up with that list.
## Power supplies
They are the first components to fail. We already have some problems with the
monitoring of some power supplies. They will soon stop being manufactured, so we
should increase out stack.
Most Xeon nodes use the DELTA DPS-750XB A:
hut% sudo ipmitool fru
...
FRU Device Description : Pwr Supply 1 FRU (ID 2)
Product Manufacturer : DELTA
Product Name : DPS-750XB A
Product Part Number : E98791-010
Product Version : 05
Product Serial : XXXXXXXXXXXXXXXXX
And we only have one per node. We should make the power supply redundant so we
can tolerate it to fail without bringing down the node.
They are available on Amazon, but they are very expensive (287.54 €):
<https://www.amazon.es/DPS-750XB-E98791-010-alimentaci%C3%B3n-conmutada-Platinum/dp/B0DB65G4VT>
On Aliexpress they are much cheaper (57.69 €):
<https://es.aliexpress.com/item/1005004090017186.html>
We have 11 nodes plus the login, but I'm not able to figure out which power
supply the login is using.
The login uses another one, AXX1100PCRPS, and only has one slot populated. We
may want to also we another one, but I would need to reset the FRU and I don't
have access to the login node. So I will leave this for Operations to deal with.
We can live without the login if needed.
## RAM DIMM
The DIMM modules also experience errors, which are monitored by Linux. In some
nodes we see non-recoverable errors that are no longer corrected by the ECC. We
need to replace the bad modules.
Having two spare modules per node would be enough to cover most problems in the
future.
> 16 GB, 2400 MHz RDIMM
The module from dmidecode:
Handle 0x0026, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0020
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM_B1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Micron
Serial Number: XXXXXXXX
Asset Tag:
Part Number: 36ASF2G72PZ-2G3B1
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Which is this module:
<https://www.amazon.com/Micron-PC4-19200-DDR4-2400MHz-Registered-MTA36ASF2G72PZ-2G3B1/dp/B01KBCNEGI>
But they have only one in stock. Here is more details:
> 16GB PC4-19200 DDR4-2400MHz
The must have the following features:
- 16 GB
- DDR4
- Speed at least 2400 MT/s
- ECC
- Registered
- Best if from Micron
I would say having 8 spare modules would be enough for now, as we only have a
few that are currently failing. We could upgrade the modules later, as they
don't have much risk of stopping being manufactured like the power supplies.
These may work:
- 1 x 16GB, 69,11€ <https://www.amazon.es/PC4-19200-REGISTRADO-SERVIDORES-Estaciones-CHIPKILL/dp/B06X42HC9N>
- 2 x 16GB, 128,85€ <https://www.amazon.es/PC4-19200-REGISTERED-MEMORY-WORKSTATIONS-MOTHERBOARDS/dp/B06W9P3RKF>
It is cheaper to buy them by pairs, so let's use the last one.
## Screwdriver set
In order to change and replace the machine parts we need a set of screwdrivers.
Instead of having to bring my own from home, I want to have one at BSC. These
are enough and come in a nice box so I don't lose them:
<https://www.amazon.es/BLOSTM-Juego-Destornilladores-Profesionales-Destornillador/dp/B09W9R8J3S>
## Serial port adaptor
In order to debug problems with several components, we need to be able to plug
to the serial port of the CPU. As we may deal with different voltages and
pinouts, the most versatile option is to just be able to select the voltage and
expose a pin interface.
This one would do:
<https://www.amazon.es/DSD-TECH-SH-U09C5-convertidor-Soporte/dp/B07WX2DSVB>
## Storage for raccoon
Given that we are currently using raccoon for builds too, we would need to
increase its current storage. We only have available 270 GB, so we can benefit
from another disk. Using 2 TiB would be plenty. This one seems enough:
- 135,99€ <https://www.amazon.es/Crucial-BX500-pulgadas-interno-CT2000BX500SSD101/dp/B0CCN9QWKT>

View File

@@ -23,7 +23,6 @@
trusted-users = [ "@wheel" ];
flake-registry = pkgs.writeText "global-registry.json"
''{"flakes":[],"version":2}'';
keep-outputs = true;
};
gc = {

View File

@@ -10,7 +10,7 @@ in
# Connect to intranet git hosts via proxy
programs.ssh.extraConfig = ''
Host bscpm02.bsc.es bscpm03.bsc.es bscpm04.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
Host bscpm02.bsc.es bscpm03.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
User git
ProxyCommand nc -X connect -x hut:23080 %h %p
@@ -22,7 +22,6 @@ in
programs.ssh.knownHosts = hostsKeys // {
"gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
"bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
"bscpm04.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPx4mC0etyyjYUT2Ztc/bs4ZXSbVMrogs1ZTP924PDgT";
"glogin1.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
"glogin2.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFsHsZGCrzpd4QDVn5xoDOtrNBkb0ylxKGlyBt6l9qCz";
};

View File

@@ -11,7 +11,7 @@
proxy = {
default = "http://hut:23080/";
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40,hut";
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40";
# Don't set all_proxy as go complains and breaks the gitlab runner, see:
# https://github.com/golang/go/issues/16715
allProxy = null;

View File

@@ -56,11 +56,6 @@
iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
'';
# Flush all rules and chains on stop so it won't break on start
extraStopCommands = ''
iptables -F
iptables -X
'';
};
};

View File

@@ -97,13 +97,12 @@
};
};
# DOCKER* chains are useless, override at FORWARD and nixos-fw
# DOCKER* chains are useless, override at FORWARD
networking.firewall.extraCommands = ''
# Don't forward any traffic from docker
iptables -I FORWARD 1 -p all -i docker0 -j nixos-fw-log-refuse
# Allow incoming traffic from docker to 23080
iptables -A nixos-fw -p tcp -i docker0 -d hut --dport 23080 -j ACCEPT
# Allow docker to use our proxy
iptables -I FORWARD 1 -p tcp -i docker0 -d hut --dport 23080 -j nixos-fw-accept
# Block anything else coming from docker
iptables -I FORWARD 2 -p all -i docker0 -j nixos-fw-log-refuse
'';
#systemd.services.gitlab-runner.serviceConfig.Shell = "${pkgs.bash}/bin/bash";

View File

@@ -46,7 +46,7 @@
services.prometheus = {
enable = true;
port = 9001;
retentionTime = "5y";
retentionTime = "1y";
listenAddress = "127.0.0.1";
};
@@ -250,14 +250,6 @@
module = [ "raccoon" ];
};
}
{
job_name = "raccoon";
static_configs = [
{
targets = [ "127.0.0.1:19002" ]; # Node exporter
}
];
}
{
job_name = "ipmi-fox";
metrics_path = "/ipmi";

View File

@@ -17,14 +17,13 @@ let
};
in
{
networking.firewall.allowedTCPPorts = [ 80 ];
services.nginx = {
enable = true;
virtualHosts."jungle.bsc.es" = {
root = "${website}";
listen = [
{
addr = "0.0.0.0";
addr = "127.0.0.1";
port = 80;
}
];

View File

@@ -4,7 +4,7 @@
# Don't add hut as a cache to itself
assert config.networking.hostName != "hut";
{
substituters = [ "http://hut/cache" ];
trusted-substituters = [ "https://jungle.bsc.es/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
}

View File

@@ -3,6 +3,7 @@
{
imports = [
../common/base.nix
../module/hut-substituter.nix
];
# Don't install Grub on the disk yet
@@ -25,11 +26,6 @@
} ];
};
nix.settings = {
substituters = [ "https://jungle.bsc.es/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
# Configure Nvidia driver to use with CUDA
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
hardware.graphics.enable = true;

View File

@@ -39,6 +39,7 @@ enable it for all builds in the system.
{ ... }: {
nix.settings = {
substituters = [ "https://jungle.bsc.es/cache" ];
trusted-substituters = [ "https://jungle.bsc.es/cache" ];
trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];
};
}
@@ -57,10 +58,10 @@ Note: you'll have to be a trusted user.
### Nix configuration file (non-nixos)
If using nix outside of NixOS, you'll have to update `/etc/nix/nix.conf`
If using nix outside of NixOS, you'll have to update `nix.conf`
```
# echo "substituters = https://jungle.bsc.es/cache" >> /etc/nix/nix.conf
# echo "trusted-substituters = https://jungle.bsc.es/cache" >> /etc/nix/nix.conf
# echo "trusted-public-keys = jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" >> /etc/nix/nix.conf
```