Commit Graph

356 Commits

Author SHA1 Message Date
958dcd4774 Add emonteir user to apex and fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-02-04 15:08:08 +01:00
7a6e4232de Add nom and nixfmt-tree to system packages
All checks were successful
CI / build:all (pull_request) Successful in 55m38s
CI / build:all (push) Successful in 27m13s
CI / build:cross (push) Successful in 55m5s
CI / build:cross (pull_request) Successful in 8s
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-02-03 15:17:30 +01:00
3b56e905e5 Add standalone home-manager to system packages
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-02-03 15:17:29 +01:00
2d41309466 Format and sort default package list
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-02-03 15:17:24 +01:00
deb0cd1488 Allow USB access to TC1 from Gitlab Runner
All checks were successful
CI / build:cross (pull_request) Successful in 8s
CI / build:all (pull_request) Successful in 16s
CI / build:all (push) Successful in 4s
CI / build:cross (push) Successful in 8s
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-23 17:56:16 +01:00
cd1f502ecc Allow user USB access to FTDI device in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-23 17:56:11 +01:00
dda6a66782 Fix gitea user to allow sending email
All checks were successful
CI / build:cross (pull_request) Successful in 8s
CI / build:all (pull_request) Successful in 16s
CI / build:all (push) Successful in 4s
CI / build:cross (push) Successful in 8s
In order to send email, the gitea user needs to be in the mail-robot
group.

Fixes: #220
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-20 12:18:52 +01:00
22420e6ac8 Remove unneeded perf package from eudy
It is already included in the base list of packages, which is now only
"perf" and doesn't depend on the kernel version.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-20 12:18:49 +01:00
a71cd78b4c Fix infiniband interface names
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-20 12:18:46 +01:00
933c78a80b Fix moved package linuxPackages.perf is now perf
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-01-20 12:15:10 +01:00
150969be9b Fix replaced nixseparatedebuginfod
nixseparatedebuginfod has been replaced by nixseparatedebuginfod2

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-01-20 12:15:06 +01:00
779449f1db Fix renamed option watchdog.runtimeTime
The option 'systemd.watchdog.runtimeTime' has been renamed to
'systemd.settings.Manager.RuntimeWatchdogSec'.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-01-20 12:14:59 +01:00
859eebda98 Change varcila shell to zsh
All checks were successful
CI / build:all (push) Successful in 59m37s
CI / build:cross (push) Successful in 1h27m33s
CI / build:cross (pull_request) Successful in 1h29m20s
CI / build:all (pull_request) Successful in 1h29m22s
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2026-01-07 13:22:17 +01:00
c2a201b085 Increase fail2ban ban time on each attempt
Some checks failed
CI / build:all (push) Has been cancelled
CI / build:cross (push) Has been cancelled
CI / build:all (pull_request) Successful in 1h38m5s
CI / build:cross (pull_request) Successful in 1h38m3s
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-07 13:14:34 +01:00
f921f0a4bd Disable password login via SSH in apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-07 13:14:30 +01:00
aa16bfc0bc Enable fail2ban in apex login node
We are seeing a lot of failed attempts from the same IPs:

    apex% sudo journalctl -u sshd -b0 | grep 'Failed password' | wc -l
    2441

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2026-01-07 13:14:22 +01:00
fc69ef3217 Enable pam_slurm_adopt in all compute nodes
All checks were successful
CI / build:cross (pull_request) Successful in 5s
CI / build:all (pull_request) Successful in 15s
CI / build:all (push) Successful in 3s
CI / build:cross (push) Successful in 6s
Prevents access to owl1 and owl2 too if the user doesn't have any jobs
running there.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-31 11:41:50 +01:00
1d025f7a38 Don't suspend owl compute nodes
Currently the owl nodes are located on top of the rack and turning them
off causes a high temperature increase at that region, which accumulates
heat from the whole rack. To maximize airflow we will leave them on at
all times. This also makes allocations immediate at the extra cost of
around 200 W.

In the future, if we include more nodes in SLURM we can configure those
to turn off if needed.

Fixes: #156
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-31 11:41:44 +01:00
5ff1b1343b Add nixgen to all machines
All checks were successful
CI / build:cross (pull_request) Successful in 5s
CI / build:all (pull_request) Successful in 16s
CI / build:all (push) Successful in 3s
CI / build:cross (push) Successful in 5s
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-29 16:28:05 +01:00
019826d09e Add OmpSs-2 release timers and services
All checks were successful
CI / build:cross (pull_request) Successful in 6s
CI / build:all (pull_request) Successful in 15s
CI / build:all (push) Successful in 4s
CI / build:cross (push) Successful in 6s
Send a reminder email to the STAR group to mark the release cycle dates.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-28 12:38:37 +01:00
a294daf7e3 Use specific mail-robot group to send mail
Allows any user to be able to send mail from the robot account as long
as it is added to the mail-robot group.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-28 12:38:17 +01:00
e3d1785285 Run a shell in the allocated node with salloc
By default, salloc will open a new shell in the *current* node instead
of in the allocated node. This often causes users to leave the extra
shell running once the allocation ends. Repeating this process several
times causes chains of shells.

By running the shell in the remote node, once the allocation ends the
shell finishes as well.

Fixes: #174
See: https://slurm.schedmd.com/faq.html#prompt
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-28 11:44:14 +01:00
14f2393d30 Update website
All checks were successful
CI / build:cross (pull_request) Successful in 6s
CI / build:all (pull_request) Successful in 15s
CI / build:all (push) Successful in 4s
CI / build:cross (push) Successful in 6s
Add apex page and replace bscpkgs references for jungle after the merge.

See: rarias/jungle-website#1
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-22 15:48:13 +02:00
f115d611e7 Add aaguirre user
All checks were successful
CI / build:cross (pull_request) Successful in 6s
CI / build:all (pull_request) Successful in 15s
CI / build:all (push) Successful in 3s
CI / build:cross (push) Successful in 6s
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-22 15:28:29 +02:00
4261d327c6 Include agenix module and package directly
All checks were successful
CI / build:cross (pull_request) Successful in 6s
CI / build:all (pull_request) Successful in 15s
CI / build:all (push) Successful in 4s
CI / build:cross (push) Successful in 6s
Avoids adding an extra flake input only to fetch a single module and
package.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-14 09:37:47 +02:00
98d17b19d3 Enable custom sys-devices system feature
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-09 11:40:44 +02:00
188ba6df0a Remove bscpkgs input
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:26 +02:00
e42058f08b Allow access to hut from fox
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-02 17:03:21 +02:00
f3bfe89f27 Fetch website from its own git repository
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-02 15:45:21 +02:00
b040bebd1d Add acinca user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 12:27:43 +02:00
f69629d2da Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-30 17:20:39 +02:00
0668f0db74 Lower connect timeout when using hut substituter
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:44:48 +02:00
5fcd57a061 Use hut substituter in all nodes
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:44:38 +02:00
ad1544759f Remove machine access for user csiringo
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:23:24 +02:00
e1c950a530 Mount apex /home via NFS in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:53 +02:00
f9632c37f8 Remove extra SSH jump configuration
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:51 +02:00
1f0cb4ae76 Add raccoon peer to wireguard
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:48 +02:00
e98fdb89ab Restrict fox peer to a single IP
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:43 +02:00
6afe05b5fd Use lowercase peer hostnames
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:25 +02:00
7d5aebf882 Share a public folder for documents
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:59:40 +02:00
4da7780472 Add amd_hsmp module in fox for AMD uProf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:24 +02:00
d6126501ba Disable NMI watchdog in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:17 +02:00
e6e4846529 Add AMD uProf module and enable it in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:05 +02:00
ff0fc18d0a Mount home via NFS from apex in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 15:34:02 +02:00
19c7e32678 Allow access to NFS via wireguard subnet
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 15:33:47 +02:00
017c19e7d0 Use 10.106.0.0/24 subnet to avoid collisions
The 106 byte is the code for 'j' (jungle) in ASCII:

	% printf j | od -t d
	0000000         106
	0000001

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:13 +02:00
a36eff8749 Revert "Remove pam_slurm_adopt from fox"
This reverts commit 1eac0fcad8.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:06 +02:00
df17b11458 Enable fail2ban in fox
Protect fox against ssh bruteforce attacks:

fox% sudo lastb | head
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:02 +02:00
0dc7b7eb3d Accept connections from apex to fox slurmd
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:00 +02:00
dff6eaf587 Accept fox connection to slurm controller
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:59 +02:00