WIP: Remove blobs and split website in another repository #186

Closed
rarias wants to merge 458 commits from remove-website into old-master

Author SHA1 Message Date
ee1b1a7679 Add acinca user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-30 18:26:33 +02:00
ef914953d4 Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail, and the
unit then remains in the failed state until it is manually restarted.
Instead, try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-29 19:17:33 +02:00
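The check shown in ef914953d4 can be scripted. This is a minimal sketch, assuming the `systemctl show` output has the same `Key=Value` form as in the transcript above; the function name `has_restart_policy` is made up for illustration.

```shell
# Succeeds when the "systemctl show"-style text on stdin reports the
# restart policy described in the commit message (on-failure, 30s).
has_restart_policy() {
    awk -F= '
        $1 == "Restart"     && $2 == "on-failure" { policy = 1 }
        $1 == "RestartUSec" && $2 == "30s"        { delay = 1 }
        END { exit !(policy && delay) }
    '
}

# Illustrative use on a live node (not run here):
#   systemctl show slurmd | has_restart_policy && echo "restart policy ok"
```

The function only inspects text, so it can be tried against saved output without touching the service.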
98abb3edf2 Lower connect timeout when using hut substituter
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 09:41:34 +02:00
0cbcdcbe38 Use hut substituter in all nodes
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-25 17:10:10 +02:00
fce7cb795c Remove machine access for user csiringo
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 17:30:02 +02:00
bf69d242d0 Mount apex /home via NFS in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:48:50 +02:00
e4c0f95906 Remove extra SSH jump configuration
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-25 15:15:43 +02:00
57f6f7bb10 Add raccoon peer to wireguard
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-25 15:01:33 +02:00
9c39ce006a Add raccoon host key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:26:56 +02:00
405a7a7415 Restrict fox peer to a single IP
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:20:54 +02:00
04b094a627 Use lowercase peer hostnames
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:18:12 +02:00
f2c38f9316 Share a public folder for documents
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-17 13:08:48 +02:00
3d344a5a4d Fix AMDuProfPcm so it finds libnuma.so
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
2025-09-18 13:15:44 +02:00
e50fb05df7 Add amd_hsmp module in fox for AMD uProf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-18 11:44:49 +02:00
66068bc412 Fix hidden dependencies for AMDuProfSys
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-16 15:57:04 +02:00
ff5db631f7 Disable NMI watchdog in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-16 15:53:28 +02:00
e8a3d6d647 Fix amd-uprof dependencies with patchelf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-05 13:01:11 +02:00
6c544f79c4 Fix hrtimer new interface
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.

See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-04 12:20:42 +02:00
3b7cf58aad Use CFLAGS_MODULE instead of EXTRA_CFLAGS
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-04 12:00:33 +02:00
87bae5b9df Add AMD uProf module and enable it in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 15:51:46 +02:00
6f958c14cd Add AMD uProf package and driver
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 14:55:43 +02:00
dcffeed542 Mount home via NFS from apex in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 13:24:06 +02:00
a22d0d4135 Allow access to NFS via wireguard subnet
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 13:16:27 +02:00
7d4ebd8495 Use 10.106.0.0/24 subnet to avoid collisions
The 106 byte is the code for 'j' (jungle) in ASCII:

	% printf j | od -t d
	0000000         106
	0000001

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 11:12:25 +02:00
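The mapping from letter to subnet in 7d4ebd8495 can be reproduced as follows. This is a small sketch; the helper names `ascii_code` and `subnet_for_letter` are hypothetical and only the 10.x.0.0/24 layout comes from the commit.

```shell
# Print the decimal ASCII code of the first character of $1.
ascii_code() {
    printf '%s' "$1" | od -An -t u1 | awk 'NR == 1 { print $1 }'
}

# Build a /24 subnet whose second octet is the code of a letter.
subnet_for_letter() {
    printf '10.%s.0.0/24\n' "$(ascii_code "$1")"
}

subnet_for_letter j   # prints 10.106.0.0/24
```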
3a917f75c7 Revert "Remove pam_slurm_adopt from fox"
This reverts commit 64a52801ed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-02 17:12:56 +02:00
7657b860a8 Enable fail2ban in fox
Protect fox against ssh bruteforce attacks:

fox% sudo lastb | head
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-01 11:25:29 +02:00
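A listing like the one in 7657b860a8 is easier to read when grouped by source address. The following is a sketch, assuming `lastb` prints the user, the tty and the address as the first three columns, as in the output above; the function name is made up.

```shell
# Count failed SSH login attempts per source address from lastb-style
# lines on stdin, most frequent address first. Prints "address count".
failed_logins_by_address() {
    awk '$2 ~ /^ssh:/ { print $3 }' | sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}

# Illustrative use (not run here):
#   sudo lastb | failed_logins_by_address | head
```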
50ae3ab4f0 Accept connections from apex to fox slurmd
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:55:53 +02:00
02e2470c1a Accept fox connection to slurm controller
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:46:24 +02:00
3f67bc4a2e Add fox machine to SLURM
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:40:43 +02:00
71a23ec68b Rekey secrets with trusted fox key
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:39:28 +02:00
11f52da199 Trust fox for compute node secrets
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:35:51 +02:00
f1a98190b5 Make apex host specific to each machine
Allows direct contact via the VPN when accessing from fox, but uses the
Internet from the rest of the machines.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:29:14 +02:00
2fbf3ee8b6 Add local host fox in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:11:19 +02:00
dd4ad901df Enable wireguard in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:52:05 +02:00
c9669408c5 Add wireguard server in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:38:47 +02:00
ddfb26be5a Use writeShellScript for suspend.sh and resume.sh
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:02:12 +02:00
1b21a398a8 Add firewall rules to slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:59:21 +02:00
4d16e794cd Remove hut from slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:43:12 +02:00
38a45f20b4 Only configure apex as slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:37:21 +02:00
0cc76fc98d Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:36:52 +02:00
70da186d15 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 11:56:20 +02:00
d71831016e Fix typo in csiringo ssh key
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 17:21:23 +02:00
0fb3cec09c Enable nix-ld in weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-16 16:20:40 +02:00
5ccfc2411f Add csiringo user with access to apex and weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 12:42:08 +02:00
dbb7e1fe36 Access gitlab via raccoon in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-27 15:20:34 +02:00
d1f58a62f5 Move StartLimit* options to unit section
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:

> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.

When using [Unit], the limits are properly set:

  apex% systemctl show power-policy.service | grep StartLimit
  StartLimitIntervalUSec=10min
  StartLimitBurst=10
  StartLimitAction=none

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 12:21:05 +02:00
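The mistake fixed in d1f58a62f5 can be spotted before systemd logs the warning. This is a minimal sketch that scans a unit file for StartLimit* keys placed outside [Unit]; the function name is hypothetical, and it assumes a plain INI-style unit file without drop-ins.

```shell
# Print every StartLimit* key of a unit file (stdin) that sits outside
# the [Unit] section, prefixed by the section where it was found.
misplaced_start_limits() {
    awk '
        /^\[.*\]$/ { section = $0; next }
        /^StartLimit[A-Za-z]*=/ && section != "[Unit]" {
            sub(/=.*/, "")
            print section ": " $0
        }
    '
}

# Illustrative use (not run here):
#   systemctl cat power-policy.service | misplaced_start_limits
```

An empty output means every StartLimit* key found is already in the [Unit] section.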
0642df0bbd Set power policy to always turn on
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 15:25:47 +02:00
3d7e8b8a07 Add NixOS module to control power policy
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 14:07:06 +02:00
2e429bf09e Move August shutdown to 3rd at 22h
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:42:57 +02:00
9e22760628 Disable automatic August shutdown for Fox
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:40:33 +02:00
8bb09dd061 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-22 15:24:55 +02:00
f686797234 Add missing symlink in cuda sandbox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-21 17:19:25 +02:00
6411a94f77 Enable cuda systemFeature in raccoon and fox
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:34:28 +02:00
7b61cfbe54 Move shared nvidia settings to a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:31:59 +02:00
4e1fd7b0e0 Replace xeon07 by hut in ssh config
The xeon07 machine has been renamed to hut.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 10:59:39 +02:00
4e24135d35 Enable automatic Nix GC in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 13:43:58 +02:00
7131d82ba2 Select proprietary NVIDIA driver in raccoon
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:

> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 13:00:03 +02:00
e8cd0d9f58 Enable open source NVidia driver in fox
It is recommended for newer versions.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-17 11:32:35 +02:00
a9ba65cdca Remove option allowUnfree from fox and raccoon
It is already set to true for all machines.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-17 11:26:27 +02:00
94f398e661 Ban another scanner trying to connect via SSH
It is constantly spamming our logs:

apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-16 16:59:29 +02:00
387e1cada7 Update weasel IPMI hostname for monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 18:48:08 +02:00
c6cc2a7638 Remove merged MPICH patch
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:57:22 +02:00
29071a6020 Remove package ix as it is gone
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:50:12 +02:00
f59218c898 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
  → 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
  → 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
  → 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
  → 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:46:48 +02:00
871515a736 Upgrade nixpkgs to nixos 25.05
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:45:40 +02:00
ef65a49ed1 Silently ban OpenVAS BSC scanner from apex
It is spamming our logs with refused connection lines:

apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:30:20 +02:00
061bd24453 Rotate anavarro password and SSH key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:15:59 +02:00
0a876e7a83 Add weasel machine configuration
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 15:07:52 +02:00
ba425f6647 Remove extra flush commands on firewall stop
They are not needed as they are already flushed when the firewall
starts or stops.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 16:13:35 +02:00
5a4e7d2bdf Prevent accidental use of nftables
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 16:12:44 +02:00
998a7f839d Add proxy configuration for internal hosts
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.

This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 12:29:52 +02:00
cdad30dd55 Remove unused blackbox configuration modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:34:08 +02:00
bffa8d94a9 Use IPv4 in blackbox probes
Otherwise they simply fail as IPv6 doesn't work.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:33:04 +02:00
8e80ed7034 Make NFS mount async to improve latency
Don't wait to flush writes, as we don't care about consistency on a
crash:

> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 11:10:07 +02:00
71e1562a0b Disable root_squash from NFS
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 10:35:38 +02:00
8623e7c2bc Remove SSH proxy to access BSC clusters
We now have a direct connection to them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-11 10:22:04 +02:00
b10504cb59 Add users to apex machine
They need to be able to login to apex to access any other machine from
the SSF rack.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:59:36 +02:00
ba66cb0b71 Remove proxy from hut HTTP probes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:26:22 +02:00
bb779a9630 Remove proxy configuration from environment
All machines now have a direct connection to the outside world.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:24:22 +02:00
76ce684be4 Add storcli utility to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:11:22 +02:00
eebcf2f239 Add new configuration for apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-09 11:02:11 +02:00
69b7be9026 Add pmartin1 user with access to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-03 10:26:44 +02:00
a1e45941cc Add access to fox for rpenacob user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:20:51 +02:00
9c5c26e94d Revert "Only allow Vincent to access fox for now"
This reverts commit efac36b186.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:20:05 +02:00
df2f25873f Add all terminfo files in environment
Fixes problems with the kitty terminal when opening vim or kakoune.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-01 14:59:39 +02:00
7304c60a98 Monitor Fox BMC with ICMP probes too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 16:06:50 +02:00
904bb5f2ba Restrict DAC VPN to fox-ipmi machine only
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 14:47:55 +02:00
55b2860b67 Monitor fox via VPN
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-17 16:41:25 +02:00
23310cbfa9 Add OpenVPN service to connect to fox BMC
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-17 14:29:15 +02:00
fd49be6033 Add ac.upc.edu as name search server
Allows referring to fox.ac.upc.edu directly as fox.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-18 16:36:34 +02:00
b9ca4fcca3 Disable kptr_restrict in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-18 11:07:19 +02:00
0baec02de3 Disable NUMA balancing in fox
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-17 14:04:46 +02:00
39f6455d8c Load amd_uncore module in fox
Needed for L3 events in perf.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-13 13:14:47 +02:00
ce5228f696 Enable SSH X11 forwarding
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-13 10:26:59 +02:00
b097cbfe2f Disable registration in Gitea
Get rid of all the spam accounts they are trying to register.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-16 15:55:53 +02:00
926d443e24 Enable msmtp configuration in tent
Allows gitea to send notifications via email.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-16 15:40:06 +02:00
9f0deec40a Add GitLab runner with debian docker for PM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 15:52:31 +02:00
415d09600a Monitor nix-daemon in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 15:11:24 +02:00
02da9f1847 Move nix-daemon exporter to modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 15:09:54 +02:00
996602845c Add p service for pastes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 12:53:58 +02:00
3cc2ed1d18 Enable public-inbox service in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 11:52:10 +02:00
54c595fa62 Enable gitea in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 11:10:39 +02:00
7a7b847cb9 Add bsc.es to resolve domain names
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 09:40:17 +02:00
dec3ab49a7 Monitor AXLE machine too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 16:47:40 +02:00
72e475edbb Use IPv4 for blackbox exporter
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 16:38:40 +02:00
2f9eb39fac Add public html files to tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 15:24:31 +02:00
377cc66d16 Add docker GitLab runner for BSC GitLab
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 13:49:51 +02:00
f711a26778 Add GitLab shell runner in tent for PM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-05 11:11:13 +02:00
67c991fc6f Enable jungle robot emails for Grafana in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 13:25:43 +02:00
a7b1334dd7 Add tent key for nix-serve
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 13:20:29 +02:00
f5ac62577e Remove jungle nix cache from tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 13:18:01 +02:00
6bbadc5246 Enable nix cache
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 13:17:26 +02:00
5026f0257e Serve Grafana from subpath
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 12:57:34 +02:00
cdbdef9bb1 Add nginx server in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 12:47:43 +02:00
a5b5765d57 Add monitoring in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-12 10:32:31 +02:00
a208cfbc6f Disable nix garbage collector in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-06-07 17:51:40 +02:00
9d8234024d Rekey secrets with tent keys
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-05 11:09:15 +02:00
a20e8844c6 Add tent host key and admin keys
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-05 11:07:00 +02:00
c89f9d79a0 Create directories in /vault/home for tent users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 19:07:43 +02:00
39a070852f Add software RAID in tent using 3 disks
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 18:27:56 +02:00
6f5dacbcd3 Add access to tent to all hut users too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 17:24:40 +02:00
70eecd1e39 Add hut SSH configuration from outside SSF LAN
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 17:17:29 +02:00
5f59a22705 Don't use proxy in base preset
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 12:52:10 +02:00
3734a9210c Add tent machine from xeon04
We moved the tent machine to the server room in the BSC building and it
is now directly connected to raccoon via NAT.

Fixes: #106
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 09:07:32 +02:00
c9b6edb6a9 Create specific SSF rack configuration
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 12:22:41 +02:00
10693417a3 Only allow Vincent to access fox for now
Needed to run benchmarks without interference.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-10 14:38:02 +02:00
c441d4aad7 Use performance governor in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-10 14:37:39 +02:00
729b781cdd Add hut as nix cache in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-10 18:23:20 +02:00
08953f64fb Use extra- for substituters and trusted-public-keys
From the nix manual:

> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-06-03 17:59:17 +02:00
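The rule quoted in 08953f64fb can be illustrated with a toy resolver. This is not how Nix parses its configuration, only a sketch of the override-versus-append behaviour under the assumption of one `key = values` setting per line; the function name `resolve_setting` and the sample URLs in the usage note are illustrative.

```shell
# Resolve one list setting from nix.conf-style lines on stdin:
# "key = ..." replaces the value, "extra-key = ..." appends to it.
resolve_setting() {
    awk -v key="$1" '
        function append(from,   i) {
            for (i = from; i <= NF; i++)
                value = (value == "" ? $i : value " " $i)
        }
        $2 == "=" && $1 == key            { value = ""; append(3) }
        $2 == "=" && $1 == ("extra-" key) { append(3) }
        END { print value }
    '
}
```

Feeding it `substituters = https://cache.nixos.org` followed by `extra-substituters = https://jungle.bsc.es/cache` yields both URLs, while a second plain `substituters` line would discard the first.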
0c9f31ffe1 Use DHCP for Ethernet in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-06 15:11:12 +02:00
59d6742e77 Use UPC time servers as others are blocked
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-06 14:44:47 +02:00
075dd928ad Create tracing group and add arocanon in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-05-15 12:24:49 +02:00
007418a52c Extend perf support in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-05-15 12:21:26 +02:00
87e5fc8af6 Enable nixdebuginfod in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-05-06 14:39:48 +02:00
1089dd10b7 Make raccoon use performance governor
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-05-05 10:50:43 +02:00
6f07c93b5a Enable binfmt emulation in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-03-21 17:51:41 +01:00
34d55ea815 Disable nix garbage collector in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-03-18 16:48:47 +01:00
78d7b522bf Add dbautist user to raccoon machine
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-03-03 13:55:23 +01:00
1b6c948325 Add node exporter monitoring in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-02-25 17:11:09 +01:00
8d01909666 Allow X11 forwarding via SSH
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-02-18 16:19:04 +01:00
57a0d58691 Enable linger for user rarias
Allows services to run without a login session.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-10-14 19:12:25 +02:00
f79debb7a1 Only proxy SSH git remotes via hut in xeon
Other machines like raccoon have direct access.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-10 15:03:03 +02:00
23d3cc5f18 Add machine map file
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unknown.

Using the Nix language to encode the machines location and properties
allows us to later use that information in the configuration of the
machines themselves.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:12:30 +02:00
e31c80c6c5 Remove fox monitoring via IPMI
We will need to set up a VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 07:55:11 +02:00
69e1cde614 Monitor fox, gateway and UPC anella via ICMP
Fox should reply once the machine is connected to the UPC network.
Monitoring also the gateway and UPC anella allows us to estimate if the
whole network is down or just fox.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-28 13:03:01 +02:00
3264788343 Update configuration for UPC network
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 14:17:06 +02:00
97e5e5d04b Disable home via NFS in fox
It won't be accessible anymore as we won't be in the same LAN.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 13:41:36 +02:00
490977cdc1 Rekey all secrets
Fox is no longer able to use munge or ceph, so we remove the key and
rekey them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 12:30:03 +02:00
0b1feca6ac Rotate fox SSH host key
Prevent decrypting old secrets by reading the git history.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 12:27:57 +02:00
7d9340e8cb Distrust fox SSH key
We will no longer share secrets with fox until we can trust it again.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 12:00:21 +02:00
653a197bf4 Remove Ceph module from fox
It will no longer be accessible from the UPC.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:50:57 +02:00
b386d30380 Remove fox from SLURM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:43:16 +02:00
dd8d3c508b Remove pam_slurm_adopt from fox
We no longer will be able to use SLURM from jungle.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:40:07 +02:00
bbf09ab960 Add UPC temperature sensor monitoring
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:24:12 +02:00
3b5781ba63 Add meteocat exporter
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-23 15:40:09 +02:00
7d10816c98 Add custom nix-daemon exporter
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.

See: https://discourse.nixos.org/t/query-ongoing-builds/23486
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-24 23:51:06 +02:00
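The lookup described in 7d10816c98 can be sketched in a few lines. This is not the exporter from the commit, only an illustration of the same idea; the function names are hypothetical, and it assumes `pgrep` is available and that the `name` value contains no newline.

```shell
# Print the value of the "name" variable stored in a NUL-separated
# environ file such as /proc/<pid>/environ.
derivation_name() {
    tr '\0' '\n' < "$1" | sed -n 's/^name=//p' | head -n 1
}

# List the derivations being built by inspecting the children of every
# nix-daemon process. Needs root, as the commit message explains.
current_builds() {
    for daemon in $(pgrep -x nix-daemon); do
        for child in $(pgrep -P "$daemon"); do
            derivation_name "/proc/$child/environ" 2>/dev/null
        done
    done
}
```

Running `sudo sh -c '. ./builds.sh; current_builds'` (with the functions saved in a hypothetical `builds.sh`) would print one derivation name per active build.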
1b77b70074 Set keep-outputs to true in all machines
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:

> If true, the garbage collector will keep the outputs of non-garbage
> derivations. If false (default), outputs will be deleted unless they are
> GC roots themselves (or reachable from other roots).
>
> In general, outputs must be registered as roots separately. However,
> even if the output of a derivation is registered as a root, the
> collector will still delete store paths that are used only at build time
> (e.g., the C compiler, or source tarballs downloaded from the network).
> To prevent it from doing so, set this option to true.

See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-04-22 16:16:42 +02:00
b5d0b34179 Add raccoon node exporter monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 11:41:43 +02:00
7ad4af686b Increase data retention to 5 years
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:

	# du -sh /var/lib/prometheus2
	13G     /var/lib/prometheus2

So we can expect it to increase to about 65 GiB. In the future we may
want to reduce the acquisition frequency of some metrics.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 11:20:57 +02:00
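The estimate in 7ad4af686b is a linear projection: the size measured for one year multiplied by the number of years kept. A short sketch follows, assuming the ingestion rate stays constant; the function name is made up.

```shell
# Project the Prometheus data size for a given retention period.
#   $1: size measured for one year, in GiB
#   $2: retention period, in years
projected_size_gib() {
    per_year_gib=$1
    years=$2
    echo $(( per_year_gib * years ))
}

projected_size_gib 13 5   # prints 65
```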
614fcfe596 Don't forward any docker traffic
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 12:46:08 +02:00
7c901742e0 Allow traffic from docker to enter port 23080
Before:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out

After:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  10.0.40.7 (10.0.40.7:23080) open

Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 12:17:00 +02:00
a492e06327 Add bscpm04.bsc.es SSH host and public key
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-11 12:15:33 +02:00
4ed53d4384 Use hut nix cache in owl1, owl2 and raccoon
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-02-26 16:03:26 +01:00
ad26c63fa2 Clean all iptables rules on stop
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-11 10:23:26 +02:00
563dc575fd Make nginx listen on all interfaces
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-11 10:03:05 +02:00
097c7bc31f Fix nginx /cache regex
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```

This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-02-26 15:31:05 +01:00
17e42b3872 Add new GitLab runner for gitlab.bsc.es
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-24 13:00:54 +01:00
db04825a11 Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-07 16:17:32 +02:00
7f395ba2d9 Add varcila user to hut and fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-03-28 11:53:33 +01:00
5683fe5be1 Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-03-28 11:04:19 +01:00
b44bdfb10f Add abonerib user to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-25 14:33:11 +01:00
b1adbed3de Don't move doc in web output
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-14 16:36:57 +01:00
8ff54219f6 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-13 14:47:38 +01:00
580bfad9ec Add users to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 16:46:56 +01:00
afe7ae445b Add dalvare1 user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 16:39:51 +01:00
9dea4e2379 Mount NVME disks in /nvme{0,1}
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 15:49:55 +01:00
b046baee48 Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 15:02:18 +01:00
8766fd8439 Use IPMI host names instead of IP addresses
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 12:14:40 +01:00
b70d99f479 Add fox IPMI monitoring
Use agenix to store the credentials safely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 11:36:53 +01:00
a0eae1feea Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-11 12:56:30 +01:00
e9740c471d Update PM GitLab tokens to new URL
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-15 14:38:57 +01:00
9b183c4202 Fix MPICH build by fetching upstream patches too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-15 13:16:10 +01:00
90036b8ea2 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
  → 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
  → 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-15 12:44:51 +01:00
bb4e42e149 Set nixpkgs to track nixos-24.11
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-15 12:43:45 +01:00
23aa682816 Add script to monitor GPFS
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-14 12:01:00 +01:00
3e26c69f69 Add BSC machines to ssh config
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-14 15:51:34 +01:00
aa977ee62a Collect statistics from logged users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-11-14 12:21:13 +01:00
7b9d805d12 Add custom GPFS exporter for MN5
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-11-12 16:30:24 +01:00
4aa011ff85 Remove exception to fetch task endpoint
It causes the request to go to the website rather than the Gitea
service.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-22 16:13:01 +02:00
4b41b67d25 Use SSD for boot, then switch to NVME
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-21 14:28:17 +02:00
e3f6e67348 Use NVME as root
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-17 14:39:31 +02:00
129fa52e9b Keep host header for Grafana requests
This was breaking requests due to CSRF check.
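
A sketch of what keeping the header could look like in the nginx
module. The location path and the upstream port are assumptions for
illustration, not taken from this commit:

```nix
{ ... }:
{
  services.nginx.virtualHosts."jungle.bsc.es".locations."/grafana/" = {
    # Assumed upstream: Grafana listening on its default port.
    proxyPass = "http://127.0.0.1:3000/";
    extraConfig = ''
      # Forward the original Host header so the CSRF origin check passes.
      proxy_set_header Host $host;
    '';
  };
}
```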

See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-17 13:35:45 +02:00
0e1ea5d504 Ignore logging requests from the gitea runner
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 15:44:22 +02:00
95eef3b0c5 Log the client IP not the proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 15:24:38 +02:00
7d25055f98 Ignore misc directory
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 15:25:06 +02:00
b978f12d19 Create paste directories in /ceph/p
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 11:19:30 +02:00
c1617266b6 Add p command to paste files
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-16 16:33:42 +02:00
83830dbfed Use nginx to serve website and other services
Instead of using multiple tunnels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not possible
otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-16 16:33:34 +02:00
0bcac3bca4 Mount the NVME disk in /nvme
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-23 16:15:26 +02:00
f41771d55f Delay nix-gc until /home is mounted
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.
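
One possible way to express the dependency, shown only as a sketch.
It assumes the stock `nix-gc` service and uses the systemd
`RequiresMountsFor=` setting; the commit may instead order the unit
after remote-fs.target:

```nix
{ ... }:
{
  # Do not start the garbage collector until /home is mounted,
  # otherwise the gcroots that point into /home look stale.
  systemd.services.nix-gc.unitConfig.RequiresMountsFor = "/home";
}
```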

See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-18 11:04:44 +02:00
1e90c038a1 Add dbautist user with access to hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-18 15:21:01 +02:00
439f40240f Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-22 13:34:19 +02:00
e5feebbd8f Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-22 12:36:20 +02:00
38f0fb7f78 Add documentation section about GRUB chain loading
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-06-07 10:40:37 +02:00
bb566b7eeb Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.
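
A minimal sketch of such a timer, assuming a systemd timer is used.
The timer name and the time of day are hypothetical; only the
`RandomizedDelaySec` setting illustrates the jitter:

```nix
{ ... }:
{
  # Hypothetical timer name and firing time.
  systemd.timers.scheduled-shutdown = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "2024-08-02 11:00:00";
      # Each node adds a random delay of up to 10 minutes.
      RandomizedDelaySec = "10min";
      Unit = "poweroff.target";
    };
  };
}
```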

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-22 11:20:02 +02:00
f7d60c4bbe Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-22 11:02:32 +02:00
3c1be2d4b4 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.
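
As a sketch, assuming the NixOS binfmt module is used and that RISC-V
is the only architecture of interest here (the real list may be
longer):

```nix
{ ... }:
{
  # Register QEMU user emulation so RISC-V binaries can run on the host.
  boot.binfmt.emulatedSystems = [ "riscv64-linux" ];
}
```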

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-19 17:53:10 +02:00
b04a064583 Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 18:01:45 +02:00
e78021c319 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 16:12:16 +02:00
2cba78cee1 Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 13:39:16 +02:00
be802804d1 Set default SLURM job time limit to one hour
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.
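
A sketch of the partition setting, assuming the NixOS slurm module.
The partition name and node list are hypothetical; `DefaultTime` is
the relevant part:

```nix
{ ... }:
{
  services.slurm.partitionName = [
    # Jobs that don't request a time limit get one hour.
    "owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 State=UP"
  ];
}
```

Users can still ask for more with `--time`, up to the partition
maximum.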

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:44:01 +02:00
e1967ccda6 Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
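
Roughly, and assuming the cons_tres select plugin, the change amounts
to dropping memory from the consumable resources:

```nix
{ ... }:
{
  services.slurm.extraConfig = ''
    SelectType=select/cons_tres
    # Only cores are consumable; memory is not tracked.
    SelectTypeParameters=CR_Core
  '';
}
```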

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:19:03 +02:00
cd9032bca9 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-16 14:58:58 +02:00
30f1ab9144 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
  → 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
  → 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-16 14:12:06 +02:00
b57bb47aa6 Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.
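
A minimal sketch, assuming the restriction comes from the Yama
`ptrace_scope` sysctl and that this fragment is only imported by the
compute nodes:

```nix
{ ... }:
{
  # 0 = classic ptrace permissions: a process may attach to any other
  # process running under the same uid.
  boot.kernel.sysctl."kernel.yama.ptrace_scope" = 0;
}
```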

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-17 13:10:59 +02:00
555879f04e Add abonerib user to hut, raccoon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-16 18:16:05 +02:00
af38221cfa Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-16 18:04:16 +02:00
57158b5257 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-17 12:47:53 +02:00
e12d99fd46 Set the default proxy to point to hut
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-17 12:59:02 +02:00
9c686a846f Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-17 12:56:59 +02:00
22a7de03a0 eudy: koro: fcs: Fix fcs unprotected cpuid all
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-07-16 17:36:21 +02:00
b0cc9c959e Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-21 13:52:08 +02:00
c781a2262f Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-07 13:46:33 +02:00
7b5e4f3978 Move vlopez user to jungleUsers for koro host
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-07 10:06:58 +02:00
b14b4fab1f Add raccoon motd file
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-06 19:36:53 +02:00
cd3284d1b2 Split xeon specific configuration from base
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-03 09:20:11 +02:00
91a42375e3 Control user access to each machine
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
restricts each user so that they can only access those hosts.
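
A usage sketch, assuming the option accepts the same attributes as
users.users. The user name is hypothetical, and the fragment only
evaluates together with the module that defines users.jungleUsers:

```nix
{ ... }:
{
  # "alice" is a hypothetical user.
  users.jungleUsers.alice = {
    isNormalUser = true;
    # The account is only created on the machines listed here.
    hosts = [ "hut" "owl1" ];
  };
}
```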

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-06 14:06:33 +02:00
2f6673cb3e Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-30 13:35:58 +02:00
584fe927b6 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 13:54:06 +02:00
2abc1e8fca Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 18:54:38 +02:00
38255dfa0f Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:54:09 +02:00
8033531246 Allow Ceph traffic to lake2 2024-04-30 13:04:45 +02:00
17fc1b0c9a Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-29 11:22:45 +02:00
249d3e472f Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-26 16:52:52 +02:00
3ae2938cad Add firewall rules for Ceph and monitoring
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-24 16:55:06 +02:00
d93fea8288 Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-15 21:39:43 +01:00
5f69d51134 Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-15 13:12:46 +01:00
a2ec4546df Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-14 16:59:11 +01:00
b5da1c6521 Remove nixseparatedebuginfod input
It has been integrated in nixpkgs, so it is no longer required.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-14 16:44:21 +01:00
082221f2c3 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
  → 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
  → 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
  → 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
    'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
  → 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
    'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
  → 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-14 16:41:30 +01:00
67bcf7b2a0 Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-29 09:57:18 +01:00
bd56c2340d Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 11:50:38 +01:00
df5a5e1668 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-09 11:14:34 +01:00
d982b45c26 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-07 11:08:15 +01:00
171f26e192 Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-01 12:17:50 +01:00
1c6e5d8f82 Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-01 19:57:04 +01:00
f78f1a3ce6 Use tmpfs in /tmp
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
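
As a sketch, assuming the NixOS `boot.tmp` options available since
23.05 (older releases used a different option name):

```nix
{ ... }:
{
  # Mount /tmp as tmpfs so builds run in memory and are wiped on reboot.
  boot.tmp.useTmpfs = true;
}
```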

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-21 23:56:55 +01:00
8c7d37859b Enable runners for pm.bsc.es/gitlab too
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we rebooted the node. With
this change we enable the runners for both PM and gitlab.bsc.es.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-23 12:39:43 +01:00
4d833d2088 Remove complete ceph package from hut
Only the ceph-client is needed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-20 12:57:31 +01:00
3d67c17cac Fix warning in slurm exporter using vendorHash
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-20 12:40:24 +01:00
ea2eeff5f9 Remove old Ceph package overlay
The Ceph package is now integrated in upstream nixpkgs.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-07 00:02:26 +01:00
e58ffd9652 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
  → 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
  → 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-20 12:37:50 +01:00
2acfd589d4 BSC packages are no longer in bsc attribute
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-06 23:03:56 +01:00
838b2d73e9 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-06 17:54:14 +01:00
0e1ada08cf Switch bscpkgs URL to sourcehut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-06 17:50:38 +01:00
c307fc9bb3 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-26 12:36:06 +02:00
6f5f234480 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-24 11:49:42 +02:00
1e9bc4086f Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-20 16:04:15 +02:00
734f52e87f Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again, so it is better not to rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-20 15:34:06 +02:00
18908c3019 Don't log SLURM connection attempts from ssfhead 2023-10-04 08:19:09 +02:00
72658ee5e6 Add docker runner too 2023-10-04 07:55:26 +02:00
cfa3e08e4b Monitor gitlab.bsc.es too 2023-10-03 09:45:13 +02:00
10101c631d Monitor PM webpage via blackbox 2023-10-03 08:58:07 +02:00
4d865d7a7e Temporarily disable pm runners 2023-09-28 14:14:41 +02:00
d9511dab22 Add runner for gitlab.bsc.es 2023-09-28 14:11:30 +02:00
c3ecba513d Allow anonymous access to grafana 2023-09-22 10:50:14 +02:00
24c05e5ebf Remove user/group when using DynamicUsers 2023-09-22 10:13:06 +02:00
7aef154dd4 Set the SLURM_CONF variable 2023-09-21 22:18:30 +02:00
4ca4e0fae9 Enable slurm-exporter service 2023-09-21 21:38:34 +02:00
7b686d0ea4 Add prometheus-slurm-exporter package 2023-09-21 21:34:18 +02:00
d4c803dbfb Mount the hut nix store for SLURM jobs 2023-09-20 18:26:48 +02:00
94ead9b759 Enable direnv integration 2023-09-17 22:27:51 +02:00
e0b3dd961c Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 11:58:47 +02:00
656de00d65 Add bscpkgs and nixpkgs top level attributes
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 11:58:10 +02:00
fefdbe9c55 Use hut packages as the default package set
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-14 18:28:09 +02:00
c73a337471 Don't fetch registry flakes from the net 2023-09-15 09:13:24 +02:00
dbd57ed57f flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2023-09-14 18:09:05 +02:00
010491618e Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a.
2023-09-14 15:46:18 +02:00
722c0b0eaa Open ports in firewall of compute nodes 2023-09-14 15:45:43 +02:00
772e0f00fb Update slurm to 23.02.05.1 2023-09-13 17:44:24 +02:00
de3a28b7df Monitor storage nodes via IPMI too 2023-09-13 15:57:13 +02:00
a05d87d4b9 Enable fstrim service 2023-09-12 16:39:45 +02:00
826d6263fd Serve the nix store from hut 2023-09-12 12:19:43 +02:00
b0b04e8fb1 Add encrypted munge key with agenix 2023-09-08 19:01:57 +02:00
a5e81fea95 Remove unused large port hole in firewall 2023-09-08 18:22:48 +02:00
dd616a7fb1 Make exporters listen in localhost only 2023-09-08 18:13:04 +02:00
e41404f619 Allow only some ports for srun 2023-09-08 17:51:37 +02:00
1c7ce3fc51 Block ssfhead from reaching our slurm daemon 2023-09-08 17:20:32 +02:00
bdd03dac60 Poweroff idle slurm nodes after 1 hour 2023-09-08 13:31:23 +02:00
21b38de26d Add IB and IPMI node host names 2023-09-08 13:21:37 +02:00
52d3794b14 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2023-09-07 11:13:45 +02:00
d91c9b7473 Unlock ovni gitlab runners 2023-09-05 16:24:27 +02:00
6b526f9827 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
  → 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
ae4ad95902 Add agenix to all nodes 2023-09-04 22:09:40 +02:00
3cc7b33c5a Add agenix module to ceph 2023-09-04 22:06:20 +02:00
8fc87885da Remove old secrets 2023-09-04 22:04:32 +02:00
1ea8912d6c Mount /ceph in owl1 and owl2 2023-09-04 22:00:36 +02:00
7d9e7e4e83 Warn about the owl2 omnipath device 2023-09-04 22:00:17 +02:00
779b591d40 Clean owl2 configuration 2023-09-04 21:59:56 +02:00
c13022596a Move the ceph client config to an external module 2023-09-04 21:59:04 +02:00
875622ad0f Reorganize secrets and ssh keys
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
a7eddecf80 Add anavarro user 2023-09-04 16:00:01 +02:00
fcddbdb72b Set zsh inc_append_history option 2023-09-03 16:57:53 +02:00
bfb5363d94 Set zsh shell for rarias 2023-09-03 16:46:27 +02:00
44c1d958f4 Enable zsh and fix key bindings 2023-09-03 11:51:53 +02:00
e334891c41 Keep a log over time with the config commits 2023-09-02 23:49:41 +02:00
ea73a72b79 Configure bscpkgs.nixpkgs to follow nixpkgs 2023-09-02 23:37:59 +02:00
13b2379d97 Store nixos config in /etc/nixos/config.rev 2023-09-02 23:37:11 +02:00
48727d3a88 Enable binary emulation for other architectures 2023-08-31 17:22:36 +02:00
b9598df864 Enable watchdog 2023-08-29 22:26:12 +02:00
a0e447301e Enable all osd on boot in lake2 2023-08-29 18:47:25 +02:00
4495cbf380 Scrape lake2 too 2023-08-29 12:33:26 +02:00
042d85ba61 Also enable monitoring in lake2 2023-08-29 12:29:41 +02:00
c47c190c79 Scrape metrics from bay 2023-08-29 11:58:00 +02:00
a1271f007f Add monitoring in the bay node 2023-08-29 11:53:32 +02:00
042e56b5b2 Add fio tool 2023-08-29 11:27:50 +02:00
a510a41eed Add ceph tools in hut too 2023-08-28 17:58:21 +02:00
a68909f96c Switch ceph logs to journal 2023-08-28 17:58:08 +02:00
3c523572cb Update ceph to 18.2.0 in overlay 2023-08-25 18:12:46 +02:00
7cd15b9732 Move pkgs overlay to overlay.nix 2023-08-25 18:12:00 +02:00
7ae2403db8 Enable ceph osd daemons in lake2 2023-08-25 14:44:53 +02:00
e8824bf72e Add the lake2 hostname to the hosts 2023-08-25 14:44:35 +02:00
e46ded9843 Use the sda for lake2 2023-08-25 13:40:10 +02:00
d6d3624617 Remove netboot module 2023-08-25 13:39:01 +02:00
300690df4c Disable pixiecore in hut for now 2023-08-25 13:21:00 +02:00
9d15c13a44 Add PXE helper 2023-08-25 12:03:30 +02:00
3c030307f1 Enable netboot again for PXE 2023-08-24 19:08:23 +02:00
d30399d31b Specify the disk by path 2023-08-24 15:27:37 +02:00
9ac05ed4c0 Prepare lake2 config after bootstrap
The disk ID is different under NixOS.
2023-08-24 13:54:22 +02:00
43c63f45d7 Add lake2 bootstrap config 2023-08-24 12:30:46 +02:00
35580a83a0 Add section to enable serial console 2023-08-24 12:29:44 +02:00
591a4c774e Add agenix to PATH in hut 2023-08-23 17:42:50 +02:00
e8d5eeb5cf Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:18:17 +02:00
2516559fac Add rarias key for secrets 2023-08-23 17:15:26 +02:00
bb8bf86051 Add ceph metrics to prometheus 2023-08-22 16:33:55 +02:00
2416ec7806 Mount the ceph filesystem in hut 2023-08-22 15:57:49 +02:00
34ebe09f66 Add ceph config in bay 2023-08-22 15:57:25 +02:00
1f270d070d Add the bay host name 2023-08-22 15:56:09 +02:00
817bea45a5 Remove netboot and fixes 2023-07-28 20:31:44 +02:00
490cdf7b95 Add bay node 2023-07-28 19:49:48 +02:00
335c77593d Update flake 2023-08-22 10:28:26 +02:00
199358a5e3 Monitor power from other nodes via LAN 2023-08-17 18:55:40 +02:00
776a582c10 Increase prometheus retention time to one year 2023-07-28 16:19:59 +02:00
b526531f20 Don't set all_proxy 2023-08-17 12:37:58 +02:00
ad78e41c8b Update nixpkgs to fix docker problem 2023-07-28 14:24:51 +02:00
b978839406 Allow access to devices for node_exporter 2023-07-28 13:48:30 +02:00
b698b9da12 GRUB version no longer needed 2023-07-27 17:22:20 +02:00
92f5c1ee19 Upgrade flake: nixpkgs, bscpkgs and agenix 2023-07-27 17:19:17 +02:00
c8ff31ec08 Kill slurmd remaining processes on upgrade 2023-07-27 14:24:21 +02:00
b408af0092 koro: Add vlopez user 2023-07-21 10:34:37 +02:00
4878b6fd8b Add koro node 2023-07-21 10:34:19 +02:00
b5d3d08706 eudy: Add fcsv3 and intermediate versions for testing 2023-07-12 13:22:42 +02:00
72497a88d4 eudy: Enable memory overcommit 2023-06-30 12:49:44 +02:00
cb90c9c73f eudy: disable all cpu mitigations 2023-06-29 09:14:39 +02:00
246226b3d3 Enable NTP using the BSC time server 2023-06-30 14:02:15 +02:00
aaa082390e Add the ssfhead node as gateway 2023-06-30 14:01:35 +02:00
cc2160f134 Use our host names first by default 2023-06-23 16:22:18 +02:00
01e7a9b8a4 Add DNS tools to resolve hosts 2023-06-23 16:12:25 +02:00
a66a4d9a43 Lower perf_event_paranoid to -1 2023-06-23 16:01:27 +02:00
31eace8400 Set perf paranoid to 0 by default 2023-06-21 16:23:16 +02:00
4997191f30 Add perf to packages 2023-06-21 15:41:06 +02:00
3ea8bdcdf1 Allow srun to specify the cpu binding
The task/affinity plugin needs to be selected.
2023-06-21 13:16:23 +02:00
7db6671ce5 Move authorized keys to users.nix 2023-06-20 14:08:34 +02:00
952541ff4a Add rpenacob user 2023-06-20 12:48:00 +02:00
d200e4b172 Add osumb to the system packages 2023-06-16 19:22:41 +02:00
cced1c2e08 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2023-06-16 18:33:54 +02:00
197c93a2be Set mpi to mpich by default in bscpkgs 2023-06-16 16:05:17 +02:00
d9002dd028 Add missing parameter to extend 2023-06-16 16:04:36 +02:00
60ee744a54 Use explicit order in overlays 2023-06-16 16:02:25 +02:00
cd1fde4760 Replace mpi inside bsc attribute 2023-06-16 15:54:55 +02:00
3985e66fa4 Add mpich overlay 2023-06-16 14:16:51 +02:00
5010746e9c Add comments in slurm config 2023-06-16 14:16:14 +02:00
6df4924b00 Add eudy host key to known hosts 2023-06-16 17:29:48 +02:00
59a29e1af6 Rename xeon08 to eudy
From Eudyptula, a little penguin.
2023-06-16 17:16:05 +02:00
a4141301ad Update rebuild script for all nodes 2023-06-16 12:13:07 +02:00
3a07842480 Add ssh host keys 2023-06-16 12:01:12 +02:00
e2aa26a8b3 Set the name of the slurm cluster to jungle 2023-06-16 12:00:54 +02:00
ebf45be2b5 Change owl hostnames 2023-06-16 11:42:39 +02:00
e0ab4e1408 Add owl and all partition 2023-06-16 11:34:00 +02:00
3cb263ea71 Simplify flake and expose host pkgs
The configuration of the machines is now moved to m/
2023-06-14 17:28:00 +02:00
2c73c4a7c3 Rename xeon07 to hut 2023-06-14 11:15:00 +02:00
27a6bc1736 Remove profiles older than 30 days with gc 2023-06-14 13:55:19 +02:00
22372b19f0 Add ncdu to system packages 2023-06-14 12:05:15 +02:00
1d0f42a93c Move arocanon user from xeon08 to common 2023-06-14 16:16:46 +02:00
bf5fffb8ca xeon08: Add config for kernel non-voluntary preemption 2023-06-12 17:16:01 +02:00
d7f21b39c0 xeon08: Add perf 2023-06-09 10:58:11 +02:00
c8e0d87d42 xeon08: Enable lttng lockdep tracepoints 2023-06-09 08:04:30 +02:00
6f4b356b73 xeon08: Add lttng module and tools 2023-06-07 19:52:24 +02:00
2517f2d0da Serve grafana in https://jungle.bsc.es/grafana 2023-05-31 17:23:08 +02:00
14650a1e0d Add tree command 2023-05-31 17:06:09 +02:00
736866afa0 Add file to system packages 2023-05-22 18:56:01 +02:00
e7625328b6 Add gnumake to system packages 2023-05-22 18:31:48 +02:00
2d84a08d38 Add cmake to system packages 2023-05-22 18:28:49 +02:00
e410506722 Add ix to common packages 2023-05-22 13:50:34 +02:00
327837481d Improve documentation 2023-05-10 10:58:27 +02:00
be2702ebf1 Add gitignore 2023-05-10 17:38:11 +02:00
755290a032 Set intel_pstate=passive and disable frequency boost 2023-05-11 17:25:48 +02:00
eb5bb85cc7 Add xeon08 basic config 2023-05-05 20:18:01 +02:00
cd1129894a Add nixos-config.nix to easily enable nix repl 2023-05-08 16:45:40 +02:00
08666ddb5c Automatically resume restarted nodes in SLURM 2023-05-18 12:48:04 +02:00
e288e3c121 Allow public dashboards in grafana 2023-05-09 18:53:31 +02:00
c436a93bfc Add hal ssh key 2023-05-09 18:37:38 +02:00
201d1c6a22 Increase the number of CPUs to 56 for nOS-V docker 2023-05-02 17:47:57 +02:00
95fec816d2 Allow 5 concurrent builds in the gitlab-runner 2023-05-02 17:38:10 +02:00
f8f94f2604 Simplify bash prompt 2023-04-28 18:12:10 +02:00
4a2f0ff881 Roll back to bash as default shell
Zsh doesn't behave properly, it needs further configuration.
2023-04-28 17:59:19 +02:00
4f76bd9ee5 Use pmix by default in slurm 2023-04-28 17:07:48 +02:00
b5ae691d4b Increase locked memory to 1 GiB 2023-04-28 12:34:51 +02:00
3938218c74 Use the latest kernel 2023-04-28 11:50:43 +02:00
6df8e03a8c Disable osnoise and hwlat tracer for now
Reuse nix cache to avoid rebuilding the kernel.
2023-04-28 11:19:47 +02:00
4dbc9a4021 Update nixpkgs to nixos-unstable 2023-04-28 11:18:37 +02:00
224fb2402f Update nixpkgs 2023-04-28 11:13:46 +02:00
b0f0e0e134 Update ib interface name in xeon02
It seems to be plugged into another PCI port.
2023-04-27 18:29:32 +02:00
3e701b22c2 Add steps in install documentation 2023-04-27 16:36:48 +02:00
26344d45af Add minimal netboot module to build kexec image 2023-04-27 16:36:15 +02:00
67eb58a8f7 Add xeon02 configuration 2023-04-27 16:28:12 +02:00
83f80b2cfd Refactor slurm configuration into compute/control 2023-04-27 16:27:04 +02:00
578f1e04be Lock flakes and add inputs 2023-04-26 17:36:36 +02:00
5bc5b3fe35 Test flakes 2023-04-26 14:26:39 +02:00
53521010e9 Enable slurm in xeon01 2023-04-26 13:35:06 +02:00
84ea8ba1cb Use xeon07 as control machine 2023-04-26 13:29:28 +02:00
4dc04ecbff Remove xeon07 overlay to load upstream slurm 2023-04-26 13:28:04 +02:00
307af602ca Add script to rebuild configuration 2023-04-26 14:09:23 +02:00
c95e5aa689 Add configuration for xeon01 2023-04-18 18:56:31 +02:00
f848cd3aca Load overlays from /config 2023-04-18 18:55:07 +02:00
d5c00f204a Move net.nix to common 2023-04-18 18:50:44 +02:00
9d7d47e43b Remove host specific network options from net.nix 2023-04-18 18:49:54 +02:00
7aa9c486c3 Move ssh.nix to common 2023-04-18 18:46:53 +02:00
a42dc88f16 Move overlays.nix to common 2023-04-18 18:46:01 +02:00
e51240da50 Move users.nix to common 2023-04-18 18:45:10 +02:00
b26ddbaca8 Move common options from configuration.nix 2023-04-18 18:43:23 +02:00
80f838fa19 Move the remaining hw config to common 2023-04-18 18:38:08 +02:00
65b6ee624a Move boot config to common/boot.nix 2023-04-18 18:37:01 +02:00
d91efda035 Move filesystems config to common/fs.nix 2023-04-18 18:35:58 +02:00
2208781c7d Use partition labels for / and swap 2023-04-18 18:34:27 +02:00
6f821011b9 Move fs.nix to common 2023-04-18 18:31:35 +02:00
25c9adf5bb Move boot.nix to common 2023-04-18 18:30:02 +02:00
52eb8e818f Move disk selection to configuration.nix 2023-04-18 18:28:37 +02:00
f394c29cca Add common directory 2023-04-18 18:27:08 +02:00
960e8eeb5a Move xeon07 configuration to a directory 2023-04-18 16:09:23 +02:00
d29c33eb66 Add smartctl monitoring 2023-04-18 16:03:46 +02:00
56e785ea24 Allow wheel users to build derivations 2023-04-14 10:14:17 +02:00
91e670500c Use bscpkgs master 2023-04-11 21:22:00 +02:00
d99af26c48 Run the garbage collector once a week 2023-04-11 21:21:22 +02:00
d2cb42ec80 Set EDITOR and add nix-diff 2023-04-11 20:36:54 +02:00
cc32ad0740 Add nos-v gitlab runner 2023-04-11 12:59:21 +02:00
e8133f9dc0 Disable debug from gitlab runner 2023-04-11 12:58:24 +02:00
ccae9e96c7 Add gitlab-runner secrets using agenix 2023-04-11 12:47:52 +02:00
2258c77aac Disable ethernet specific useDHCP
It is already configured by default for all interfaces.
2023-04-06 13:58:55 +02:00
cdc1cf387b Enable IPoIB and set the infiniband IP 2023-04-06 13:58:24 +02:00
8dff45903f Export nix store over nfs 2023-04-06 13:57:32 +02:00
93b416ff19 Enable gitlab runner monitoring 2023-04-06 13:56:52 +02:00
2437e223e0 Add agenix tool 2023-04-05 17:04:42 +02:00
da9b350691 Add monitoring services 2023-04-05 17:00:01 +02:00
a5b4a1b8fb Add some tools and use relaxed for build sandbox 2023-04-05 16:59:09 +02:00
ec13892ae8 Remove commented docker settings 2023-04-05 16:56:27 +02:00
e31a72eeac Add mio key 2023-04-05 16:56:05 +02:00
907de95e01 Setup slurm and gitlab-runner 2023-04-03 12:51:44 +02:00