426 Commits

Author SHA1 Message Date
dd4ad901df Enable wireguard in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:52:05 +02:00
c9669408c5 Add wireguard server in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:38:47 +02:00
ddfb26be5a Use writeShellScript for suspend.sh and resume.sh
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:02:12 +02:00
1b21a398a8 Add firewall rules to slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:59:21 +02:00
4d16e794cd Remove hut from slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:43:12 +02:00
38a45f20b4 Only configure apex as slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:37:21 +02:00
0cc76fc98d Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:36:52 +02:00
70da186d15 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 11:56:20 +02:00
d71831016e Fix typo in csiringo ssh key
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 17:21:23 +02:00
0fb3cec09c Enable nix-ld in weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-16 16:20:40 +02:00
5ccfc2411f Add csiringo user with access to apex and weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 12:42:08 +02:00
dbb7e1fe36 Access gitlab via raccoon in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-27 15:20:34 +02:00
d1f58a62f5 Move StartLimit* options to unit section
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:

> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.

When using [Unit], the limits are properly set:

  apex% systemctl show power-policy.service | grep StartLimit
  StartLimitIntervalUSec=10min
  StartLimitBurst=10
  StartLimitAction=none

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 12:21:05 +02:00
0642df0bbd Set power policy to always turn on
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 15:25:47 +02:00
3d7e8b8a07 Add NixOS module to control power policy
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 14:07:06 +02:00
2e429bf09e Move August shutdown to 3rd at 22h
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:42:57 +02:00
9e22760628 Disable automatic August shutdown for Fox
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:40:33 +02:00
8bb09dd061 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-22 15:24:55 +02:00
f686797234 Add missing symlink in cuda sandbox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-21 17:19:25 +02:00
6411a94f77 Enable cuda systemFeature in raccoon and fox
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:34:28 +02:00
7b61cfbe54 Move shared nvidia settings to a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:31:59 +02:00
4e1fd7b0e0 Replace xeon07 by hut in ssh config
The xeon07 machine has been renamed to hut.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 10:59:39 +02:00
4e24135d35 Enable automatic Nix GC in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 13:43:58 +02:00
7131d82ba2 Select proprietary NVIDIA driver in raccoon
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:

> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 13:00:03 +02:00
e8cd0d9f58 Enable open source NVidia driver in fox
It is recommended for newer versions.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-17 11:32:35 +02:00
a9ba65cdca Remove option allowUnfree from fox and raccoon
It is already set to true for all machines.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-17 11:26:27 +02:00
94f398e661 Ban another scanner trying to connect via SSH
It is constantly spamming out logs:

apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-16 16:59:29 +02:00
387e1cada7 Update weasel IPMI hostname for monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 18:48:08 +02:00
c6cc2a7638 Remove merged MPICH patch
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:57:22 +02:00
29071a6020 Remove package ix as it is gone
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:50:12 +02:00
f59218c898 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
  → 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
  → 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
  → 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
  → 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:46:48 +02:00
871515a736 Upgrade nixpkgs to nixos 25.05
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:45:40 +02:00
ef65a49ed1 Silently ban OpenVAS BSC scanner from apex
It is spamming our logs with refused connection lines:

apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:30:20 +02:00
061bd24453 Rotate anavarro password and SSH key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:15:59 +02:00
0a876e7a83 Add weasel machine configuration
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 15:07:52 +02:00
ba425f6647 Remove extra flush commands on firewall stop
They are not needed as they are already flushed when the firewall
starts or stops.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 16:13:35 +02:00
5a4e7d2bdf Prevent accidental use of nftables
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 16:12:44 +02:00
998a7f839d Add proxy configuration for internal hosts
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.

This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 12:29:52 +02:00
cdad30dd55 Remove unused blackbox configuration modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:34:08 +02:00
bffa8d94a9 Use IPv4 in blackbox probes
Otherwise they simply fail as IPv6 doesn't work.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:33:04 +02:00
8e80ed7034 Make NFS mount async to improve latency
Don't wait to flush writes, as we don't care about consistency on a
crash:

> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:10:07 +02:00
71e1562a0b Disable root_squash from NFS
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 10:35:38 +02:00
8623e7c2bc Remove SSH proxy to access BSC clusters
We now have direct connection to them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 10:22:04 +02:00
b10504cb59 Add users to apex machine
They need to be able to login to apex to access any other machine from
the SSF rack.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:59:36 +02:00
ba66cb0b71 Remove proxy from hut HTTP probes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:26:22 +02:00
bb779a9630 Remove proxy configuration from environment
All machines have now direct connection with the outside world.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:24:22 +02:00
76ce684be4 Add storcli utility to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:11:22 +02:00
eebcf2f239 Add new configuration for apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:02:11 +02:00
69b7be9026 Add pmartin1 user with access to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-03 10:26:44 +02:00
a1e45941cc Add access to fox for rpenacob user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:20:51 +02:00