ee1b1a7679
Add acinca user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-30 18:26:33 +02:00
ef914953d4
Restart slurmd on failure
...
A failure to reach the control node can cause slurmd to fail, and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
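For illustration only, the effect of Restart=on-failure with a 30 second
delay can be approximated by a plain shell loop. This is a sketch of the
semantics, not what systemd actually runs; retry_on_failure and the example
invocation are hypothetical and do not appear in the configuration.

```shell
# Hypothetical helper: rerun a command after DELAY seconds whenever it
# exits with a non-zero status, and stop once it exits cleanly.
retry_on_failure() {
    delay=$1
    shift
    until "$@"; do
        sleep "$delay"
    done
}

# Example (assumed invocation, not taken from the unit file):
#   retry_on_failure 30 slurmd -D
```

A process killed by a signal also returns a non-zero status to the shell, so
the loop retries in that case too, much like the SEGV test above.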
Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-29 19:17:33 +02:00
98abb3edf2
Lower connect timeout when using hut substituter
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-09-29 09:41:34 +02:00
0cbcdcbe38
Use hut substituter in all nodes
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-09-25 17:10:10 +02:00
fce7cb795c
Remove machine access for user csiringo
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-09-29 17:30:02 +02:00
bf69d242d0
Mount apex /home via NFS in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 13:48:50 +02:00
e4c0f95906
Remove extra SSH jump configuration
...
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-25 15:15:43 +02:00
57f6f7bb10
Add raccoon peer to wireguard
...
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-25 15:01:33 +02:00
9c39ce006a
Add raccoon host key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 13:26:56 +02:00
405a7a7415
Restrict fox peer to a single IP
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 13:20:54 +02:00
04b094a627
Use lowercase peer hostnames
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 13:18:12 +02:00
f2c38f9316
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-17 13:08:48 +02:00
3d344a5a4d
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Tested-by: Vincent Arcila <vincent.arcila@bsc.es >
2025-09-18 13:15:44 +02:00
e50fb05df7
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-18 11:44:49 +02:00
66068bc412
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-16 15:57:04 +02:00
ff5db631f7
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-16 15:53:28 +02:00
e8a3d6d647
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-05 13:01:11 +02:00
6c544f79c4
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-04 12:20:42 +02:00
3b7cf58aad
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-04 12:00:33 +02:00
87bae5b9df
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-20 15:51:46 +02:00
6f958c14cd
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-20 14:55:43 +02:00
dcffeed542
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 13:24:06 +02:00
a22d0d4135
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 13:16:27 +02:00
7d4ebd8495
Use 10.106.0.0/24 subnet to avoid collisions
...
The 106 byte is the code for 'j' (jungle) in ASCII:
% printf j | od -t d
0000000 106
0000001
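As a sketch, the same value can be obtained with printf alone, which treats
an argument starting with a quote as the numeric code of the following
character. The ascii_code helper and the subnet variable are hypothetical
names used only for this example.

```shell
# Hypothetical helper: print the decimal code of a single character.
# POSIX printf interprets a leading quote as "numeric value of the
# next character".
ascii_code() {
    printf '%d\n' "'$1"
}

# Build the subnet string from the code of 'j'.
subnet="10.$(ascii_code j).0.0/24"
echo "$subnet"
```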
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 11:12:25 +02:00
3a917f75c7
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 64a52801ed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-02 17:12:56 +02:00
7657b860a8
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-01 11:25:29 +02:00
50ae3ab4f0
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:55:53 +02:00
02e2470c1a
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:46:24 +02:00
3f67bc4a2e
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:40:43 +02:00
71a23ec68b
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:39:28 +02:00
11f52da199
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:35:51 +02:00
f1a98190b5
Make apex host specific to each machine
...
Allows direct contact via the VPN when accessing from fox, but uses the
Internet when accessing from the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:29:14 +02:00
2fbf3ee8b6
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 14:11:19 +02:00
dd4ad901df
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 13:52:05 +02:00
c9669408c5
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-29 13:38:47 +02:00
ddfb26be5a
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:02:12 +02:00
1b21a398a8
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-27 12:59:21 +02:00
4d16e794cd
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-27 12:43:12 +02:00
38a45f20b4
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-27 12:37:21 +02:00
0cc76fc98d
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-27 12:36:52 +02:00
70da186d15
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-27 11:56:20 +02:00
d71831016e
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-08-27 17:21:23 +02:00
0fb3cec09c
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-16 16:20:40 +02:00
5ccfc2411f
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-08-27 12:42:08 +02:00
dbb7e1fe36
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-27 15:20:34 +02:00
d1f58a62f5
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
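A minimal sketch of the placement, assuming a hand-written unit file rather
than the one generated by NixOS. The ExecStart line is a placeholder and
section_of is a hypothetical helper that reports which section a key is in.

```shell
# Write a demo unit file; ExecStart is a placeholder, not the real
# command of the power-policy service.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=10

[Service]
ExecStart=/bin/true
EOF

# Hypothetical helper: print the section header under which a key is set.
section_of() {
    awk -v key="$2" '
        /^\[.*\]$/ { section = $0 }
        index($0, key "=") == 1 { print section }
    ' "$1"
}

section_of "$unit" StartLimitBurst
```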
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 12:21:05 +02:00
0642df0bbd
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 15:25:47 +02:00
3d7e8b8a07
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 14:07:06 +02:00
2e429bf09e
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 13:42:57 +02:00
9e22760628
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 13:40:33 +02:00
8bb09dd061
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-22 15:24:55 +02:00
f686797234
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-21 17:19:25 +02:00
6411a94f77
Enable cuda systemFeature in raccoon and fox
...
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-18 11:34:28 +02:00
7b61cfbe54
Move shared nvidia settings to a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-18 11:31:59 +02:00
4e1fd7b0e0
Replace xeon07 by hut in ssh config
...
The xeon07 machine has been renamed to hut.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-18 10:59:39 +02:00
4e24135d35
Enable automatic Nix GC in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 13:43:58 +02:00
7131d82ba2
Select proprietary NVIDIA driver in raccoon
...
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for GPUs older than Turing:
> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 13:00:03 +02:00
e8cd0d9f58
Enable open source NVidia driver in fox
...
It is recommended for newer GPUs.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-17 11:32:35 +02:00
a9ba65cdca
Remove option allowUnfree from fox and raccoon
...
It is already set to true for all machines.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-17 11:26:27 +02:00
94f398e661
Ban another scanner trying to connect via SSH
...
It is constantly spamming our logs:
apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-16 16:59:29 +02:00
387e1cada7
Update weasel IPMI hostname for monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 18:48:08 +02:00
c6cc2a7638
Remove merged MPICH patch
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-15 17:57:22 +02:00
29071a6020
Remove package ix as it is gone
...
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-15 17:50:12 +02:00
f59218c898
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
→ 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
→ 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
→ 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f ' (2024-11-29)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405 ' (2025-06-11)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
→ 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-15 17:46:48 +02:00
871515a736
Upgrade nixpkgs to nixos 25.05
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-15 17:45:40 +02:00
ef65a49ed1
Silently ban OpenVAS BSC scanner from apex
...
It is spamming our logs with refused connection lines:
apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 17:30:20 +02:00
061bd24453
Rotate anavarro password and SSH key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 17:15:59 +02:00
0a876e7a83
Add weasel machine configuration
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 15:07:52 +02:00
ba425f6647
Remove extra flush commands on firewall stop
...
They are not needed as they are already flushed when the firewall
starts or stops.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 16:13:35 +02:00
5a4e7d2bdf
Prevent accidental use of nftables
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 16:12:44 +02:00
998a7f839d
Add proxy configuration for internal hosts
...
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.
This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 12:29:52 +02:00
cdad30dd55
Remove unused blackbox configuration modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 11:34:08 +02:00
bffa8d94a9
Use IPv4 in blackbox probes
...
Otherwise they simply fail as IPv6 doesn't work.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 11:33:04 +02:00
8e80ed7034
Make NFS mount async to improve latency
...
Don't wait to flush writes, as we don't care about consistency on a
crash:
> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 11:10:07 +02:00
71e1562a0b
Disable root_squash from NFS
...
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 10:35:38 +02:00
8623e7c2bc
Remove SSH proxy to access BSC clusters
...
We now have a direct connection to them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-11 10:22:04 +02:00
b10504cb59
Add users to apex machine
...
They need to be able to log in to apex to access any other machine from
the SSF rack.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-09 11:59:36 +02:00
ba66cb0b71
Remove proxy from hut HTTP probes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-09 11:26:22 +02:00
bb779a9630
Remove proxy configuration from environment
...
All machines now have a direct connection to the outside world.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-09 11:24:22 +02:00
76ce684be4
Add storcli utility to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-09 11:11:22 +02:00
eebcf2f239
Add new configuration for apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-09 11:02:11 +02:00
69b7be9026
Add pmartin1 user with access to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-03 10:26:44 +02:00
a1e45941cc
Add access to fox for rpenacob user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:20:51 +02:00
9c5c26e94d
Revert "Only allow Vincent to access fox for now"
...
This reverts commit efac36b186.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:20:05 +02:00
df2f25873f
Add all terminfo files in environment
...
Fixes problems with the kitty terminal when opening vim or kakoune.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-01 14:59:39 +02:00
7304c60a98
Monitor Fox BMC with ICMP probes too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-20 16:06:50 +02:00
904bb5f2ba
Restrict DAC VPN to fox-ipmi machine only
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-20 14:47:55 +02:00
55b2860b67
Monitor fox via VPN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-17 16:41:25 +02:00
23310cbfa9
Add OpenVPN service to connect to fox BMC
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-17 14:29:15 +02:00
fd49be6033
Add ac.upc.edu as name search server
...
Allows referring to fox.ac.upc.edu directly as fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-18 16:36:34 +02:00
b9ca4fcca3
Disable kptr_restrict in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-18 11:07:19 +02:00
0baec02de3
Disable NUMA balancing in fox
...
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-17 14:04:46 +02:00
39f6455d8c
Load amd_uncore module in fox
...
Needed for L3 events in perf.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-13 13:14:47 +02:00
ce5228f696
Enable SSH X11 forwarding
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-13 10:26:59 +02:00
b097cbfe2f
Disable registration in Gitea
...
Get rid of all the spam accounts they are trying to register.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-16 15:55:53 +02:00
926d443e24
Enable msmtp configuration in tent
...
Allows gitea to send notifications via email.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-16 15:40:06 +02:00
9f0deec40a
Add GitLab runner with debian docker for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 15:52:31 +02:00
415d09600a
Monitor nix-daemon in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 15:11:24 +02:00
02da9f1847
Move nix-daemon exporter to modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 15:09:54 +02:00
996602845c
Add p service for pastes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 12:53:58 +02:00
3cc2ed1d18
Enable public-inbox service in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 11:52:10 +02:00
54c595fa62
Enable gitea in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 11:10:39 +02:00
7a7b847cb9
Add bsc.es to resolve domain names
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-13 09:40:17 +02:00
dec3ab49a7
Monitor AXLE machine too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 16:47:40 +02:00
72e475edbb
Use IPv4 for blackbox exporter
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 16:38:40 +02:00
2f9eb39fac
Add public html files to tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 15:24:31 +02:00
377cc66d16
Add docker GitLab runner for BSC GitLab
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 13:49:51 +02:00
f711a26778
Add GitLab shell runner in tent for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-05 11:11:13 +02:00
67c991fc6f
Enable jungle robot emails for Grafana in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 13:25:43 +02:00
a7b1334dd7
Add tent key for nix-serve
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 13:20:29 +02:00
f5ac62577e
Remove jungle nix cache from tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 13:18:01 +02:00
6bbadc5246
Enable nix cache
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 13:17:26 +02:00
5026f0257e
Serve Grafana from subpath
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 12:57:34 +02:00
cdbdef9bb1
Add nginx server in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 12:47:43 +02:00
a5b5765d57
Add monitoring in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-12 10:32:31 +02:00
a208cfbc6f
Disable nix garbage collector in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-06-07 17:51:40 +02:00
9d8234024d
Rekey secrets with tent keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-05 11:09:15 +02:00
a20e8844c6
Add tent host key and admin keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-05 11:07:00 +02:00
c89f9d79a0
Create directories in /vault/home for tent users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-03 19:07:43 +02:00
39a070852f
Add software RAID in tent using 3 disks
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-03 18:27:56 +02:00
6f5dacbcd3
Add access to tent to all hut users too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-03 17:24:40 +02:00
70eecd1e39
Add hut SSH configuration from outside SSF LAN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-03 17:17:29 +02:00
5f59a22705
Don't use proxy in base preset
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-03 12:52:10 +02:00
3734a9210c
Add tent machine from xeon04
...
We moved the tent machine to the server room in the BSC building and it is
now directly connected to the raccoon via NAT.
Fixes: #106
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 09:07:32 +02:00
c9b6edb6a9
Create specific SSF rack configuration
...
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 12:22:41 +02:00
10693417a3
Only allow Vincent to access fox for now
...
Needed to run benchmarks without interference.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-10 14:38:02 +02:00
c441d4aad7
Use performance governor in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-10 14:37:39 +02:00
729b781cdd
Add hut as nix cache in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-10 18:23:20 +02:00
08953f64fb
Use extra- for substituters and trusted-public-keys
...
From the nix manual:
> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-06-03 17:59:17 +02:00
0c9f31ffe1
Use DHCP for Ethernet in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-06 15:11:12 +02:00
59d6742e77
Use UPC time servers as others are blocked
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-06 14:44:47 +02:00
075dd928ad
Create tracing group and add arocanon in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-05-15 12:24:49 +02:00
007418a52c
Extend perf support in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-05-15 12:21:26 +02:00
87e5fc8af6
Enable nixdebuginfod in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-05-06 14:39:48 +02:00
1089dd10b7
Make raccoon use performance governor
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-05-05 10:50:43 +02:00
6f07c93b5a
Enable binfmt emulation in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-03-21 17:51:41 +01:00
34d55ea815
Disable nix garbage collector in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-03-18 16:48:47 +01:00
78d7b522bf
Add dbautist user to raccoon machine
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-03-03 13:55:23 +01:00
1b6c948325
Add node exporter monitoring in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-02-25 17:11:09 +01:00
8d01909666
Allow X11 forwarding via SSH
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-02-18 16:19:04 +01:00
57a0d58691
Enable linger for user rarias
...
Allows services to run without a login session.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-10-14 19:12:25 +02:00
f79debb7a1
Only proxy SSH git remotes via hut in xeon
...
Other machines like raccoon have direct access.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-09-10 15:03:03 +02:00
23d3cc5f18
Add machine map file
...
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unknown.
Using the Nix language to encode the machines location and properties
allows us to later use that information in the configuration of the
machines themselves.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:12:30 +02:00
e31c80c6c5
Remove fox monitoring via IPMI
...
We will need to set up a VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 07:55:11 +02:00
69e1cde614
Monitor fox, gateway and UPC anella via ICMP
...
Fox should reply once the machine is connected to the UPC network.
Monitoring also the gateway and UPC anella allows us to estimate if the
whole network is down or just fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-28 13:03:01 +02:00
3264788343
Update configuration for UPC network
...
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 14:17:06 +02:00
97e5e5d04b
Disable home via NFS in fox
...
It won't be accessible anymore as we won't be in the same LAN.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 13:41:36 +02:00
490977cdc1
Rekey all secrets
...
Fox is no longer able to use munge or ceph, so we remove the key and
rekey them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 12:30:03 +02:00
0b1feca6ac
Rotate fox SSH host key
...
Prevent decrypting old secrets by reading the git history.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 12:27:57 +02:00
7d9340e8cb
Distrust fox SSH key
...
We will no longer share secrets with fox until we can regain our trust.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 12:00:21 +02:00
653a197bf4
Remove Ceph module from fox
...
It will no longer be accessible from the UPC.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 11:50:57 +02:00
b386d30380
Remove fox from SLURM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 11:43:16 +02:00
dd8d3c508b
Remove pam_slurm_adopt from fox
...
We will no longer be able to use SLURM from jungle.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 11:40:07 +02:00
bbf09ab960
Add UPC temperature sensor monitoring
...
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-26 11:24:12 +02:00
3b5781ba63
Add meteocat exporter
...
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-23 15:40:09 +02:00
7d10816c98
Add custom nix-daemon exporter
...
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.
See: https://discourse.nixos.org/t/query-ongoing-builds/23486
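The lookup described above can be sketched in shell as follows. This is not
the exporter itself, only an illustration of the idea; derivation_name and
list_builds are hypothetical helper names, and the sketch assumes that no
environment value contains a newline.

```shell
# Hypothetical helper: print the value of the "name" variable from a
# NUL-separated environ file (the format used by /proc/PID/environ).
derivation_name() {
    tr '\0' '\n' < "$1" | sed -n 's/^name=//p'
}

# Hypothetical helper: print the derivation names found in the direct
# children of the given parent PID. Needs root to read the environ
# files owned by the nixbld* users.
list_builds() {
    for pid in $(pgrep -P "$1"); do
        derivation_name "/proc/$pid/environ" 2>/dev/null
    done
}
```

On a live machine, list_builds would be called with the PID of each
nix-daemon process.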
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-24 23:51:06 +02:00
1b77b70074
Set keep-outputs to true in all machines
...
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:
If true, the garbage collector will keep the outputs of non-garbage
derivations. If false (default), outputs will be deleted unless they are
GC roots themselves (or reachable from other roots).
In general, outputs must be registered as roots separately. However,
even if the output of a derivation is registered as a root, the
collector will still delete store paths that are used only at build time
(e.g., the C compiler, or source tarballs downloaded from the network).
To prevent it from doing so, set this option to true.
See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-04-22 16:16:42 +02:00
b5d0b34179
Add raccoon node exporter monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-22 11:41:43 +02:00
7ad4af686b
Increase data retention to 5 years
...
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:
# du -sh /var/lib/prometheus2
13G /var/lib/prometheus2
So we can expect it to increase to about 65 GiB. In the future we may
want to reduce the acquisition frequency of some metrics.
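The estimate above is a linear extrapolation, which can be written out as a short Python sketch. It assumes the ingest rate stays constant over the whole retention period; the function name and the rate_factor parameter are hypothetical.

```python
def projected_size_gib(size_per_year_gib: float, retention_years: int,
                       rate_factor: float = 1.0) -> float:
    """Linear estimate of the on-disk size at the end of the retention period.

    rate_factor scales the ingest rate, e.g. 0.5 if metrics were
    sampled half as often.
    """
    return size_per_year_gib * retention_years * rate_factor


# 13 GiB measured for one year, kept for 5 years
print(projected_size_gib(13, 5))
```

Under the same assumption, halving the sampling frequency of every metric (rate_factor=0.5) would roughly halve the projected size.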
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-22 11:20:57 +02:00
614fcfe596
Don't forward any docker traffic
...
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-15 12:46:08 +02:00
7c901742e0
Allow traffic from docker to enter port 23080
...
Before:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out
After:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
10.0.40.7 (10.0.40.7:23080) open
Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-15 12:17:00 +02:00
a492e06327
Add bscpm04.bsc.es SSH host and public key
...
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-11 12:15:33 +02:00
4ed53d4384
Use hut nix cache in owl1, owl2 and raccoon
...
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-02-26 16:03:26 +01:00
ad26c63fa2
Clean all iptables rules on stop
...
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-11 10:23:26 +02:00
563dc575fd
Make nginx listen on all interfaces
...
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-11 10:03:05 +02:00
097c7bc31f
Fix nginx /cache regex
...
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```
This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
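The symptom can be reproduced with a small Python sketch. The two patterns below are hypothetical stand-ins for the nginx location regex before and after the fix, since the commit message does not show the actual configuration. The assumed failure mode is that the slash after /cache ends up inside the captured group and is then appended to an upstream URL that already ends in a slash.

```python
import re
from typing import Optional

UPSTREAM = "http://127.0.0.1:5000"

OLD = r"^/cache(.*)$"   # hypothetical: the slash lands in the capture
NEW = r"^/cache/(.*)$"  # hypothetical fix: the slash is consumed by the pattern


def upstream_url(pattern: str, request_path: str) -> Optional[str]:
    """Map a request path to the upstream URL, or None if it does not match."""
    match = re.match(pattern, request_path)
    if match is None:
        return None
    return f"{UPSTREAM}/{match.group(1)}"


for path in ("/cache/nix-cache-info", "/cachenix-cache-info"):
    print(path, upstream_url(OLD, path), upstream_url(NEW, path))
```

With the old pattern the first path is forwarded with a doubled slash, which `nix-serve` rejects, while the malformed second path happens to produce a valid upstream URL.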
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-02-26 15:31:05 +01:00
17e42b3872
Add new GitLab runner for gitlab.bsc.es
...
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-24 13:00:54 +01:00
db04825a11
Remove SLURM partition all
...
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-07 16:17:32 +02:00
7f395ba2d9
Add varcila user to hut and fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-03-28 11:53:33 +01:00
5683fe5be1
Adjust fox slurm config after disabling SMT
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-03-28 11:04:19 +01:00
b44bdfb10f
Add abonerib user to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-25 14:33:11 +01:00
b1adbed3de
Don't move doc in web output
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-14 16:36:57 +01:00
8ff54219f6
Reject SSH connections without SLURM allocation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-13 14:47:38 +01:00
580bfad9ec
Add users to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 16:46:56 +01:00
afe7ae445b
Add dalvare1 user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 16:39:51 +01:00
9dea4e2379
Mount NVME disks in /nvme{0,1}
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 15:49:55 +01:00
b046baee48
Exclude fox from being suspended by slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 15:02:18 +01:00
8766fd8439
Use IPMI host names instead of IP addresses
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 12:14:40 +01:00
b70d99f479
Add fox IPMI monitoring
...
Use agenix to store the credentials safely.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-12 11:36:53 +01:00
a0eae1feea
Add new fox machine
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-02-11 12:56:30 +01:00
e9740c471d
Update PM GitLab tokens to new URL
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-15 14:38:57 +01:00
9b183c4202
Fix MPICH build by fetching upstream patches too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-15 13:16:10 +01:00
90036b8ea2
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
→ 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709 ' (2024-04-24)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f ' (2024-11-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
→ 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-15 12:44:51 +01:00
bb4e42e149
Set nixpkgs to track nixos-24.11
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-15 12:43:45 +01:00
23aa682816
Add script to monitor GPFS
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-14 12:01:00 +01:00
3e26c69f69
Add BSC machines to ssh config
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-01-14 15:51:34 +01:00
aa977ee62a
Collect statistics from logged users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-11-14 12:21:13 +01:00
7b9d805d12
Add custom GPFS exporter for MN5
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-11-12 16:30:24 +01:00
4aa011ff85
Remove exception to fetch task endpoint
...
It causes the request to go to the website rather than the Gitea
service.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-10-22 16:13:01 +02:00
4b41b67d25
Use SSD for boot, then switch to NVME
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-10-21 14:28:17 +02:00
e3f6e67348
Use NVME as root
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-10-17 14:39:31 +02:00
129fa52e9b
Keep host header for Grafana requests
...
This was breaking requests due to CSRF check.
See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-10-17 13:35:45 +02:00
0e1ea5d504
Ignore logging requests from the gitea runner
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-20 15:44:22 +02:00
95eef3b0c5
Log the client IP not the proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-20 15:24:38 +02:00
7d25055f98
Ignore misc directory
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-20 15:25:06 +02:00
b978f12d19
Create paste directories in /ceph/p
...
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-20 11:19:30 +02:00
c1617266b6
Add p command to paste files
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-16 16:33:42 +02:00
83830dbfed
Use nginx to serve website and other services
...
Instead of using multiple tunnels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not possible
otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-09-16 16:33:34 +02:00
0bcac3bca4
Mount the NVME disk in /nvme
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-23 16:15:26 +02:00
f41771d55f
Delay nix-gc until /home is mounted
...
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.
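To illustrate why an unmounted /home is harmful, here is a Python sketch of a stale-root check. It is a simplified model and not the Nix collector's code: a root is treated as stale when its symlink target does not exist. The directory names and the demo helper are hypothetical.

```python
import os
import tempfile
from typing import List, Tuple


def stale_roots(gcroots_dir: str) -> List[str]:
    """Return the symlinks in gcroots_dir whose target does not exist."""
    stale = []
    for entry in sorted(os.listdir(gcroots_dir)):
        link = os.path.join(gcroots_dir, entry)
        # os.path.exists follows the link, so it is False for a broken one
        if os.path.islink(link) and not os.path.exists(link):
            stale.append(entry)
    return stale


def demo() -> Tuple[List[str], List[str]]:
    """Compare the stale roots with and without the 'home' tree present."""
    with tempfile.TemporaryDirectory() as tmp:
        home = os.path.join(tmp, "home")
        roots = os.path.join(tmp, "gcroots")
        os.makedirs(os.path.join(home, "user"))
        os.makedirs(roots)
        target = os.path.join(home, "user", "result")
        open(target, "w").close()
        os.symlink(target, os.path.join(roots, "abc123"))
        mounted = stale_roots(roots)
        os.rename(home, home + ".unmounted")  # simulate the missing mount
        unmounted = stale_roots(roots)
        return mounted, unmounted
```

In the demo, the same root is valid while the target tree is present and becomes stale as soon as the tree disappears, which mirrors what happens if the collector runs before the remote file systems are mounted.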
See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-09-18 11:04:44 +02:00
1e90c038a1
Add dbautist user with access to hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-09-18 15:21:01 +02:00
439f40240f
Set the serial console to ttyS1 in raccoon
...
Apparently the ttyS0 console doesn't exist but ttyS1 does:
raccoon% sudo stty -F /dev/ttyS0
stty: /dev/ttyS0: Input/output error
raccoon% sudo stty -F /dev/ttyS1
speed 9600 baud; line = 0;
-brkint -imaxbel
The dmesg line agrees:
00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-22 13:34:19 +02:00
e5feebbd8f
Remove setLdLibraryPath and driSupport options
...
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".
See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-22 12:36:20 +02:00
38f0fb7f78
Add documentation section about GRUB chain loading
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-06-07 10:40:37 +02:00
bb566b7eeb
Add 10 min shutdown jitter to avoid spikes
...
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.
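The effect of the jitter can be sketched in Python as follows. This only models the idea; the names are hypothetical, and the commit message does not say which mechanism is used (systemd timers provide RandomizedDelaySec= for this purpose, but that it is used here is an assumption).

```python
import random
from datetime import datetime, timedelta

JITTER = timedelta(minutes=10)


def jittered_shutdown(scheduled: datetime, rng: random.Random) -> datetime:
    """Delay the scheduled time by a uniform random amount in [0, JITTER]."""
    delay = timedelta(seconds=rng.uniform(0, JITTER.total_seconds()))
    return scheduled + delay


# Each node draws its own delay, so the shutdowns spread over 10 minutes.
base = datetime(2024, 1, 1, 12, 0)  # hypothetical scheduled time
for node in ("owl1", "owl2", "hut"):
    print(node, jittered_shutdown(base, random.Random(node)))
```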
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-22 11:20:02 +02:00
f7d60c4bbe
Don't mount the nix store in owl nodes
...
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-22 11:02:32 +02:00
3c1be2d4b4
Emulate other architectures in owl nodes too
...
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-19 17:53:10 +02:00
b04a064583
Program shutdown for August 2nd for all machines
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-18 18:01:45 +02:00
e78021c319
Enable debuginfod daemon in owl nodes
...
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-18 16:12:16 +02:00
2cba78cee1
Set gitea and grafana log level to warn
...
Prevents filling the journal logs with information messages.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-18 13:39:16 +02:00
be802804d1
Set default SLURM job time limit to one hour
...
Prevents endless jobs from being left running forever, while still
allowing users to request a larger time limit.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-18 11:44:01 +02:00
e1967ccda6
Allow other jobs to run in unused cores
...
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.
As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
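A toy Python model of the behavior described above. It is not SLURM's selection algorithm, and all names are hypothetical. In SLURM terms the change corresponds roughly to treating only cores as a consumable resource (for example `CR_Core`) instead of cores and memory (`CR_Core_Memory`); which exact values the configuration uses is not stated in the message.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cores: int
    mem_mib: int
    used_cores: int = 0
    used_mem_mib: int = 0


def allocate(node: Node, cores: int, mem_mib: int, track_memory: bool) -> bool:
    """Try to place a job on the node; return True if it fits."""
    if node.used_cores + cores > node.cores:
        return False
    if track_memory and node.used_mem_mib + mem_mib > node.mem_mib:
        return False
    node.used_cores += cores
    node.used_mem_mib += mem_mib
    return True


def second_job_fits(track_memory: bool) -> bool:
    """Place two 1-core, 1 MiB jobs on a node that declares only 1 MiB."""
    node = Node(cores=8, mem_mib=1)  # hypothetical core count
    allocate(node, cores=1, mem_mib=1, track_memory=track_memory)
    return allocate(node, cores=1, mem_mib=1, track_memory=track_memory)
```

With memory tracked, the first job consumes the single declared MiB and the second job is rejected even though seven cores are idle; with only cores tracked, both jobs run.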
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-18 11:19:03 +02:00
cd9032bca9
Use authentication tokens for PM GitLab runner
...
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead. Older tokens and
runners are also removed, as they are no longer used.
With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.
See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-16 14:58:58 +02:00
30f1ab9144
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
→ 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
→ 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-16 14:12:06 +02:00
b57bb47aa6
Allow ptrace to any process of the same user
...
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-17 13:10:59 +02:00
555879f04e
Add abonerib user to hut, raccoon, owl1 and owl2
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-16 18:16:05 +02:00
af38221cfa
Grant rpenacob access to owl1 and owl2 nodes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-16 18:04:16 +02:00
57158b5257
Access private repositories via hut SSH proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-17 12:47:53 +02:00
e12d99fd46
Set the default proxy to point to hut
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-17 12:59:02 +02:00
9c686a846f
Allow incoming traffic to hut proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2024-07-17 12:56:59 +02:00
22a7de03a0
eudy: koro: fcs: Fix fcs unprotected cpuid all
...
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2024-07-16 17:36:21 +02:00
b0cc9c959e
Add support for armv7 emulation in hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-21 13:52:08 +02:00
c781a2262f
Monitor raccoon machine via IPMI
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-07 13:46:33 +02:00
7b5e4f3978
Move vlopez user to jungleUsers for koro host
...
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-07 10:06:58 +02:00
b14b4fab1f
Add raccoon motd file
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-06 19:36:53 +02:00
cd3284d1b2
Split xeon specific configuration from base
...
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-03 09:20:11 +02:00
91a42375e3
Control user access to each machine
...
The users.jungleUsers configuration option behaves like the users.users
option, but adds the list attribute `hosts` to each user, so that each
user can only access the hosts listed there.
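The filtering can be sketched in Python as follows. The real option is a NixOS module, so this only models the selection logic; the user names and the function name are hypothetical.

```python
from typing import Dict


def users_for_host(jungle_users: Dict[str, dict], hostname: str) -> Dict[str, dict]:
    """Keep the users whose `hosts` list contains hostname, dropping `hosts`."""
    return {
        name: {key: value for key, value in cfg.items() if key != "hosts"}
        for name, cfg in jungle_users.items()
        if hostname in cfg.get("hosts", [])
    }


# Hypothetical users; only the `hosts` attribute comes from the description.
jungle_users = {
    "alice": {"uid": 2001, "hosts": ["hut", "fox"]},
    "bob": {"uid": 2002, "hosts": ["owl1", "owl2"]},
    "carol": {"uid": 2003},  # no hosts listed: no access anywhere
}

print(users_for_host(jungle_users, "hut"))
```

Treating a missing `hosts` attribute as "no access" is a design assumption of this sketch; the commit message does not state the default.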
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-06-06 14:06:33 +02:00
2f6673cb3e
Add PostgreSQL DB for performance test results
...
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-05-30 13:35:58 +02:00
584fe927b6
Enable Grafana email alerts
...
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-05-31 13:54:06 +02:00
2abc1e8fca
Enable mail notification in Gitea
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-05-02 18:54:38 +02:00
38255dfa0f
Add msmtp to send notifications via email
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-05-02 17:54:09 +02:00
8033531246
Allow Ceph traffic to lake2
2024-04-30 13:04:45 +02:00
17fc1b0c9a
Collect Gitea metrics in Prometheus
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-04-29 11:22:45 +02:00
249d3e472f
Add Gitea service
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-04-26 16:52:52 +02:00
3ae2938cad
Add firewall rules for Ceph and monitoring
...
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-04-24 16:55:06 +02:00
d93fea8288
Add workaround for MPICH 4.2.0
...
See: https://github.com/pmodels/mpich/issues/6946
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-03-15 21:39:43 +01:00
5f69d51134
Fix SLURM bug in rank integer sign expansion
...
See: https://bugs.schedmd.com/show_bug.cgi?id=19324
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-03-15 13:12:46 +01:00
a2ec4546df
Merge pmix outputs for MPICH
...
MPICH expects headers and libraries to be present in the same directory.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-03-14 16:59:11 +01:00
b5da1c6521
Remove nixseparatedebuginfod input
...
It has been integrated in nixpkgs, so is no longer required.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-03-14 16:44:21 +01:00
082221f2c3
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
→ 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
→ 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
→ 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b ' (2023-11-22)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709 ' (2024-04-24)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
→ 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
→ 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-03-14 16:41:30 +01:00
67bcf7b2a0
Use google.com probe instead of bsc.es
...
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-02-29 09:57:18 +01:00
bd56c2340d
Add another HTTPS probe for bsc.es
...
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.
Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2024-02-13 11:50:38 +01:00
df5a5e1668
Move slurm client in a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2024-02-09 11:14:34 +01:00
d982b45c26
Enable public-inbox at jungle.bsc.es/lists
...
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.
The service is available at https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.
We also edited the CSS to improve the readability and have larger fonts
by default.
The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-12-07 11:08:15 +01:00
171f26e192
Monitor https://pm.bsc.es/gitlab/ too
...
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.
Cc: Víctor López <victor.lopez@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-12-01 12:17:50 +01:00
1c6e5d8f82
Enable nixseparatedebuginfod module
...
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2023-12-01 19:57:04 +01:00
f78f1a3ce6
Use tmpfs in /tmp
...
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-21 23:56:55 +01:00
8c7d37859b
Enable runners for pm.bsc.es/gitlab too
...
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we rebooted the node. With
this change we enable the runners for both PM and gitlab.bsc.es.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-23 12:39:43 +01:00
4d833d2088
Remove complete ceph package from hut
...
Only the ceph-client is needed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-20 12:57:31 +01:00
3d67c17cac
Fix warning in slurm exporter using vendorHash
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-20 12:40:24 +01:00
ea2eeff5f9
Remove old Ceph package overlay
...
The Ceph package is now integrated in upstream nixpkgs.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-07 00:02:26 +01:00
e58ffd9652
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
→ 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538 ' (2023-10-31)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b ' (2023-11-22)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
→ 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-20 12:37:50 +01:00
2acfd589d4
BSC packages are no longer in bsc attribute
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-06 23:03:56 +01:00
838b2d73e9
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80 ' (2023-09-14)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538 ' (2023-10-31)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-06 17:54:14 +01:00
0e1ada08cf
Switch bscpkgs URL to sourcehut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-11-06 17:50:38 +01:00
c307fc9bb3
Monitor anella instead of gw.bsc.es
...
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-10-26 12:36:06 +02:00
6f5f234480
Add ICMP probes
...
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.
In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-10-24 11:49:42 +02:00
1e9bc4086f
Enable proxy for Grafana too
...
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-10-20 16:04:15 +02:00
734f52e87f
Make blackbox exporter use the proxy
...
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2023-10-20 15:34:06 +02:00
18908c3019
Don't log SLURM connection attempts from ssfhead
2023-10-04 08:19:09 +02:00
72658ee5e6
Add docker runner too
2023-10-04 07:55:26 +02:00
cfa3e08e4b
Monitor gitlab.bsc.es too
2023-10-03 09:45:13 +02:00
10101c631d
Monitor PM webpage via blackbox
2023-10-03 08:58:07 +02:00
4d865d7a7e
Temporarily disable pm runners
2023-09-28 14:14:41 +02:00
d9511dab22
Add runner for gitlab.bsc.es
2023-09-28 14:11:30 +02:00
c3ecba513d
Allow anonymous access to grafana
2023-09-22 10:50:14 +02:00
24c05e5ebf
Remove user/group when using DynamicUsers
2023-09-22 10:13:06 +02:00
7aef154dd4
Set the SLURM_CONF variable
2023-09-21 22:18:30 +02:00
4ca4e0fae9
Enable slurm-exporter service
2023-09-21 21:38:34 +02:00
7b686d0ea4
Add prometheus-slurm-exporter package
2023-09-21 21:34:18 +02:00
d4c803dbfb
Mount the hut nix store for SLURM jobs
2023-09-20 18:26:48 +02:00
94ead9b759
Enable direnv integration
2023-09-17 22:27:51 +02:00
e0b3dd961c
Remove bscpkgs from the registry and nixPath
...
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 11:58:47 +02:00
656de00d65
Add bscpkgs and nixpkgs top level attributes
...
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 11:58:10 +02:00
fefdbe9c55
Use hut packages as the default package set
...
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-14 18:28:09 +02:00
c73a337471
Don't fetch registry flakes from the net
2023-09-15 09:13:24 +02:00
dbd57ed57f
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906 ' (2023-09-07)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80 ' (2023-09-14)
2023-09-14 18:09:05 +02:00
010491618e
Revert "Update slurm to 23.02.05.1"
...
This reverts commit aaefddc44a.
2023-09-14 15:46:18 +02:00
722c0b0eaa
Open ports in firewall of compute nodes
2023-09-14 15:45:43 +02:00
772e0f00fb
Update slurm to 23.02.05.1
2023-09-13 17:44:24 +02:00
de3a28b7df
Monitor storage nodes via IPMI too
2023-09-13 15:57:13 +02:00
a05d87d4b9
Enable fstrim service
2023-09-12 16:39:45 +02:00
826d6263fd
Serve the nix store from hut
2023-09-12 12:19:43 +02:00
b0b04e8fb1
Add encrypted munge key with agenix
2023-09-08 19:01:57 +02:00
a5e81fea95
Remove unused large port hole in firewall
2023-09-08 18:22:48 +02:00
dd616a7fb1
Make exporters listen in localhost only
2023-09-08 18:13:04 +02:00
e41404f619
Allow only some ports for srun
2023-09-08 17:51:37 +02:00
1c7ce3fc51
Block ssfhead from reaching our slurm daemon
2023-09-08 17:20:32 +02:00
bdd03dac60
Poweroff idle slurm nodes after 1 hour
2023-09-08 13:31:23 +02:00
21b38de26d
Add IB and IPMI node host names
2023-09-08 13:21:37 +02:00
52d3794b14
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f ' (2023-09-01)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906 ' (2023-09-07)
2023-09-07 11:13:45 +02:00
d91c9b7473
Unlock ovni gitlab runners
2023-09-05 16:24:27 +02:00
6b526f9827
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27 ' (2023-08-31)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f ' (2023-09-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
→ 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
ae4ad95902
Add agenix to all nodes
2023-09-04 22:09:40 +02:00
3cc7b33c5a
Add agenix module to ceph
2023-09-04 22:06:20 +02:00
8fc87885da
Remove old secrets
2023-09-04 22:04:32 +02:00
1ea8912d6c
Mount /ceph in owl1 and owl2
2023-09-04 22:00:36 +02:00
7d9e7e4e83
Warn about the owl2 omnipath device
2023-09-04 22:00:17 +02:00
779b591d40
Clean owl2 configuration
2023-09-04 21:59:56 +02:00
c13022596a
Move the ceph client config to an external module
2023-09-04 21:59:04 +02:00
875622ad0f
Reorganize secrets and ssh keys
...
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
a7eddecf80
Add anavarro user
2023-09-04 16:00:01 +02:00
fcddbdb72b
Set zsh inc_append_history option
2023-09-03 16:57:53 +02:00
bfb5363d94
Set zsh shell for rarias
2023-09-03 16:46:27 +02:00
44c1d958f4
Enable zsh and fix key bindings
2023-09-03 11:51:53 +02:00
e334891c41
Keep a log over time with the config commits
2023-09-02 23:49:41 +02:00
ea73a72b79
Configure bscpkgs.nixpkgs to follow nixpkgs
2023-09-02 23:37:59 +02:00
13b2379d97
Store nixos config in /etc/nixos/config.rev
2023-09-02 23:37:11 +02:00
48727d3a88
Enable binary emulation for other architectures
2023-08-31 17:22:36 +02:00
b9598df864
Enable watchdog
2023-08-29 22:26:12 +02:00
a0e447301e
Enable all osd on boot in lake2
2023-08-29 18:47:25 +02:00
4495cbf380
Scrape lake2 too
2023-08-29 12:33:26 +02:00
042d85ba61
Also enable monitoring in lake2
2023-08-29 12:29:41 +02:00
c47c190c79
Scrape metrics from bay
2023-08-29 11:58:00 +02:00
a1271f007f
Add monitoring in the bay node
2023-08-29 11:53:32 +02:00
042e56b5b2
Add fio tool
2023-08-29 11:27:50 +02:00
a510a41eed
Add ceph tools in hut too
2023-08-28 17:58:21 +02:00
a68909f96c
Switch ceph logs to journal
2023-08-28 17:58:08 +02:00
3c523572cb
Update ceph to 18.2.0 in overlay
2023-08-25 18:12:46 +02:00
7cd15b9732
Move pkgs overlay to overlay.nix
2023-08-25 18:12:00 +02:00
7ae2403db8
Enable ceph osd daemons in lake2
2023-08-25 14:44:53 +02:00
e8824bf72e
Add the lake2 hostname to the hosts
2023-08-25 14:44:35 +02:00
e46ded9843
Use the sda for lake2
2023-08-25 13:40:10 +02:00
d6d3624617
Remove netboot module
2023-08-25 13:39:01 +02:00
300690df4c
Disable pixiecore in hut for now
2023-08-25 13:21:00 +02:00
9d15c13a44
Add PXE helper
2023-08-25 12:03:30 +02:00
3c030307f1
Enable netboot again for PXE
2023-08-24 19:08:23 +02:00
d30399d31b
Specify the disk by path
2023-08-24 15:27:37 +02:00
9ac05ed4c0
Prepare lake2 config after bootstrap
...
The disk ID is different under NixOS.
2023-08-24 13:54:22 +02:00
43c63f45d7
Add lake2 bootstrap config
2023-08-24 12:30:46 +02:00
35580a83a0
Add section to enable serial console
2023-08-24 12:29:44 +02:00
591a4c774e
Add agenix to PATH in hut
2023-08-23 17:42:50 +02:00
e8d5eeb5cf
Store ceph secret key in age
...
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:18:17 +02:00
2516559fac
Add rarias key for secrets
2023-08-23 17:15:26 +02:00
bb8bf86051
Add ceph metrics to prometheus
2023-08-22 16:33:55 +02:00
2416ec7806
Mount the ceph filesystem in hut
2023-08-22 15:57:49 +02:00
34ebe09f66
Add ceph config in bay
2023-08-22 15:57:25 +02:00
1f270d070d
Add the bay host name
2023-08-22 15:56:09 +02:00
817bea45a5
Remove netboot and fixes
2023-07-28 20:31:44 +02:00
490cdf7b95
Add bay node
2023-07-28 19:49:48 +02:00
335c77593d
Update flake
2023-08-22 10:28:26 +02:00
199358a5e3
Monitor power from other nodes via LAN
2023-08-17 18:55:40 +02:00
776a582c10
Increase prometheus retention time to one year
2023-07-28 16:19:59 +02:00
b526531f20
Don't set all_proxy
2023-08-17 12:37:58 +02:00
ad78e41c8b
Update nixpkgs to fix docker problem
2023-07-28 14:24:51 +02:00
b978839406
Allow access to devices for node_exporter
2023-07-28 13:48:30 +02:00
b698b9da12
GRUB version no longer needed
2023-07-27 17:22:20 +02:00
92f5c1ee19
Upgrade flake: nixpkgs, bscpkgs and agenix
2023-07-27 17:19:17 +02:00
c8ff31ec08
Kill slurmd remaining processes on upgrade
2023-07-27 14:24:21 +02:00
b408af0092
koro: Add vlopez user
2023-07-21 10:34:37 +02:00
4878b6fd8b
Add koro node
2023-07-21 10:34:19 +02:00
b5d3d08706
eudy: Add fcsv3 and intermediate versions for testing
2023-07-12 13:22:42 +02:00
72497a88d4
eudy: Enable memory overcommit
2023-06-30 12:49:44 +02:00
cb90c9c73f
eudy: disable all cpu mitigations
2023-06-29 09:14:39 +02:00
246226b3d3
Enable NTP using the BSC time server
2023-06-30 14:02:15 +02:00
aaa082390e
Add the ssfhead node as gateway
2023-06-30 14:01:35 +02:00
cc2160f134
Use our host names first by default
2023-06-23 16:22:18 +02:00
01e7a9b8a4
Add DNS tools to resolve hosts
2023-06-23 16:12:25 +02:00
a66a4d9a43
Lower perf_event_paranoid to -1
2023-06-23 16:01:27 +02:00
31eace8400
Set perf paranoid to 0 by default
2023-06-21 16:23:16 +02:00
4997191f30
Add perf to packages
2023-06-21 15:41:06 +02:00
3ea8bdcdf1
Allow srun to specify the cpu binding
...
The task/affinity plugin needs to be selected.
2023-06-21 13:16:23 +02:00
7db6671ce5
Move authorized keys to users.nix
2023-06-20 14:08:34 +02:00
952541ff4a
Add rpenacob user
2023-06-20 12:48:00 +02:00
d200e4b172
Add osumb to the system packages
2023-06-16 19:22:41 +02:00
cced1c2e08
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2023-06-16 18:33:54 +02:00
197c93a2be
Set mpi to mpich by default in bscpkgs
2023-06-16 16:05:17 +02:00
d9002dd028
Add missing parameter to extend
2023-06-16 16:04:36 +02:00
60ee744a54
Use explicit order in overlays
2023-06-16 16:02:25 +02:00
cd1fde4760
Replace mpi inside bsc attribute
2023-06-16 15:54:55 +02:00
3985e66fa4
Add mpich overlay
2023-06-16 14:16:51 +02:00
5010746e9c
Add comments in slurm config
2023-06-16 14:16:14 +02:00
6df4924b00
Add eudy host key to known hosts
2023-06-16 17:29:48 +02:00
59a29e1af6
Rename xeon08 to eudy
...
From Eudyptula, a little penguin.
2023-06-16 17:16:05 +02:00
a4141301ad
Update rebuild script for all nodes
2023-06-16 12:13:07 +02:00
3a07842480
Add ssh host keys
2023-06-16 12:01:12 +02:00
e2aa26a8b3
Set the name of the slurm cluster to jungle
2023-06-16 12:00:54 +02:00
ebf45be2b5
Change owl hostnames
2023-06-16 11:42:39 +02:00
e0ab4e1408
Add owl and all partition
2023-06-16 11:34:00 +02:00
3cb263ea71
Simplify flake and expose host pkgs
...
The configuration of the machines is now moved to m/
2023-06-14 17:28:00 +02:00
2c73c4a7c3
Rename xeon07 to hut
2023-06-14 11:15:00 +02:00
27a6bc1736
Remove profiles older than 30 days with gc
2023-06-14 13:55:19 +02:00
22372b19f0
Add ncdu to system packages
2023-06-14 12:05:15 +02:00
1d0f42a93c
Move arocanon user from xeon08 to common
2023-06-14 16:16:46 +02:00
bf5fffb8ca
xeon08: Add config for kernel non-voluntary preemption
2023-06-12 17:16:01 +02:00
d7f21b39c0
xeon08: Add perf
2023-06-09 10:58:11 +02:00
c8e0d87d42
xeon08: Enable lttng lockdep tracepoints
2023-06-09 08:04:30 +02:00
6f4b356b73
xeon08: Add lttng module and tools
2023-06-07 19:52:24 +02:00
2517f2d0da
Serve grafana in https://jungle.bsc.es/grafana
2023-05-31 17:23:08 +02:00
14650a1e0d
Add tree command
2023-05-31 17:06:09 +02:00
736866afa0
Add file to system packages
2023-05-22 18:56:01 +02:00
e7625328b6
Add gnumake to system packages
2023-05-22 18:31:48 +02:00
2d84a08d38
Add cmake to system packages
2023-05-22 18:28:49 +02:00
e410506722
Add ix to common packages
2023-05-22 13:50:34 +02:00
327837481d
Improve documentation
2023-05-10 10:58:27 +02:00
be2702ebf1
Add gitignore
2023-05-10 17:38:11 +02:00
755290a032
Set intel_pstate=passive and disable frequency boost
2023-05-11 17:25:48 +02:00
eb5bb85cc7
Add xeon08 basic config
2023-05-05 20:18:01 +02:00
cd1129894a
Add nixos-config.nix to easily enable nix repl
2023-05-08 16:45:40 +02:00
08666ddb5c
Automatically resume restarted nodes in SLURM
2023-05-18 12:48:04 +02:00
e288e3c121
Allow public dashboards in grafana
2023-05-09 18:53:31 +02:00
c436a93bfc
Add hal ssh key
2023-05-09 18:37:38 +02:00
201d1c6a22
Increase the number of CPUs to 56 for nOS-V docker
2023-05-02 17:47:57 +02:00
95fec816d2
Allow 5 concurrent builds in the gitlab-runner
2023-05-02 17:38:10 +02:00
f8f94f2604
Simplify bash prompt
2023-04-28 18:12:10 +02:00
4a2f0ff881
Roll back to bash as default shell
...
Zsh doesn't behave properly, it needs further configuration.
2023-04-28 17:59:19 +02:00
4f76bd9ee5
Use pmix by default in slurm
2023-04-28 17:07:48 +02:00
b5ae691d4b
Increase locked memory to 1 GiB
2023-04-28 12:34:51 +02:00
3938218c74
Use the latest kernel
2023-04-28 11:50:43 +02:00
6df8e03a8c
Disable osnoise and hwlat tracer for now
...
Reuse nix cache to avoid rebuilding the kernel.
2023-04-28 11:19:47 +02:00
4dbc9a4021
Update nixpkgs to nixos-unstable
2023-04-28 11:18:37 +02:00
224fb2402f
Update nixpkgs
2023-04-28 11:13:46 +02:00
b0f0e0e134
Update ib interface name in xeon02
...
It seems to be plugged into another PCI port.
2023-04-27 18:29:32 +02:00
3e701b22c2
Add steps in install documentation
2023-04-27 16:36:48 +02:00
26344d45af
Add minimal netboot module to build kexec image
2023-04-27 16:36:15 +02:00
67eb58a8f7
Add xeon02 configuration
2023-04-27 16:28:12 +02:00
83f80b2cfd
Refactor slurm configuration into compute/control
2023-04-27 16:27:04 +02:00
578f1e04be
Lock flakes and add inputs
2023-04-26 17:36:36 +02:00
5bc5b3fe35
Test flakes
2023-04-26 14:26:39 +02:00
53521010e9
Enable slurm in xeon01
2023-04-26 13:35:06 +02:00
84ea8ba1cb
Use xeon07 as control machine
2023-04-26 13:29:28 +02:00
4dc04ecbff
Remove xeon07 overlay to load upstream slurm
2023-04-26 13:28:04 +02:00
307af602ca
Add script to rebuild configuration
2023-04-26 14:09:23 +02:00
c95e5aa689
Add configuration for xeon01
2023-04-18 18:56:31 +02:00
f848cd3aca
Load overlays from /config
2023-04-18 18:55:07 +02:00
d5c00f204a
Move net.nix to common
2023-04-18 18:50:44 +02:00
9d7d47e43b
Remove host specific network options from net.nix
2023-04-18 18:49:54 +02:00
7aa9c486c3
Move ssh.nix to common
2023-04-18 18:46:53 +02:00
a42dc88f16
Move overlays.nix to common
2023-04-18 18:46:01 +02:00
e51240da50
Move users.nix to common
2023-04-18 18:45:10 +02:00
b26ddbaca8
Move common options from configuration.nix
2023-04-18 18:43:23 +02:00
80f838fa19
Move the remaining hw config to common
2023-04-18 18:38:08 +02:00
65b6ee624a
Move boot config to common/boot.nix
2023-04-18 18:37:01 +02:00
d91efda035
Move filesystems config to common/fs.nix
2023-04-18 18:35:58 +02:00
2208781c7d
Use partition labels for / and swap
2023-04-18 18:34:27 +02:00
6f821011b9
Move fs.nix to common
2023-04-18 18:31:35 +02:00
25c9adf5bb
Move boot.nix to common
2023-04-18 18:30:02 +02:00
52eb8e818f
Move disk selection to configuration.nix
2023-04-18 18:28:37 +02:00
f394c29cca
Add common directory
2023-04-18 18:27:08 +02:00
960e8eeb5a
Move xeon07 configuration to a directory
2023-04-18 16:09:23 +02:00
d29c33eb66
Add smartctl monitoring
2023-04-18 16:03:46 +02:00
56e785ea24
Allow wheel users to build derivations
2023-04-14 10:14:17 +02:00
91e670500c
Use bscpkgs master
2023-04-11 21:22:00 +02:00
d99af26c48
Run the garbage collector once a week
2023-04-11 21:21:22 +02:00
d2cb42ec80
Set EDITOR and add nix-diff
2023-04-11 20:36:54 +02:00
cc32ad0740
Add nos-v gitlab runner
2023-04-11 12:59:21 +02:00
e8133f9dc0
Disable debug from gitlab runner
2023-04-11 12:58:24 +02:00
ccae9e96c7
Add gitlab-runner secrets using agenix
2023-04-11 12:47:52 +02:00
2258c77aac
Disable ethernet specific useDHCP
...
It is already configured by default for all interfaces.
2023-04-06 13:58:55 +02:00
cdc1cf387b
Enable IPoIB and set the infiniband IP
2023-04-06 13:58:24 +02:00
8dff45903f
Export nix store over nfs
2023-04-06 13:57:32 +02:00
93b416ff19
Enable gitlab runner monitoring
2023-04-06 13:56:52 +02:00
2437e223e0
Add agenix tool
2023-04-05 17:04:42 +02:00
da9b350691
Add monitoring services
2023-04-05 17:00:01 +02:00
a5b4a1b8fb
Add some tools and use relaxed for build sandbox
2023-04-05 16:59:09 +02:00
ec13892ae8
Remove commented docker settings
2023-04-05 16:56:27 +02:00
e31a72eeac
Add mio key
2023-04-05 16:56:05 +02:00
907de95e01
Setup slurm and gitlab-runner
2023-04-03 12:51:44 +02:00