43a1b25c23
Set strictDeps=true on our top level packages
2025-10-07 16:27:46 +02:00
44cc60fcd8
Update license year range to 2025
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:32 +02:00
ca48ce556c
Update gitlab CI after merge
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:30 +02:00
e8ac9dfb64
Upgrade README after bscpkgs merge
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:28 +02:00
188ba6df0a
Remove bscpkgs input
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:26 +02:00
b1a37ae1fe
Enable unfree packages in nixpkgs config
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:24 +02:00
63822bb054
Move the rest of packages to main overlay
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:23 +02:00
b94a1493d5
Merge flake.nix with bscpkgs
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:21 +02:00
826d6a28ef
Move slurm to pkgs/
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:19 +02:00
ae6b0ae161
Move MPICH to pkgs/mpich and set as default
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-07 16:07:01 +02:00
01986c376b
Merge remote-tracking branch 'bscpkgs/master' into merge-bscpkgs
2025-10-03 13:47:04 +02:00
e42058f08b
Allow access to hut from fox
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-02 17:03:21 +02:00
f3bfe89f27
Fetch website from its own git repository
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-02 15:45:21 +02:00
ee6f981006
Add script to trim the repository
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-02 15:44:56 +02:00
b040bebd1d
Add acinca user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 12:27:43 +02:00
f69629d2da
Restart slurmd on failure
...
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-30 17:20:39 +02:00
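A minimal sketch of how the restart policy from "Restart slurmd on failure" could be written as a NixOS module. The unit name slurmd and the two values come from the systemctl output above; the surrounding module layout is an assumption and the actual change may look different.

```nix
# Sketch only: merges a restart policy into the existing slurmd unit.
{ ... }:
{
  systemd.services.slurmd.serviceConfig = {
    # Restart when the daemon exits abnormally (non-zero exit code or signal).
    Restart = "on-failure";
    # Wait 30 seconds between attempts (systemctl reports it as RestartUSec=30s).
    RestartSec = 30;
  };
}
```

With a 30 second delay between attempts the systemd start rate limit should not be reached under its default settings, which is what lets the retries continue indefinitely.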
0668f0db74
Lower connect timeout when using hut substituter
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:44:48 +02:00
5fcd57a061
Use hut substituter in all nodes
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:44:38 +02:00
ad1544759f
Remove machine access for user csiringo
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 18:23:24 +02:00
e1c950a530
Mount apex /home via NFS in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:53 +02:00
f9632c37f8
Remove extra SSH jump configuration
...
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:51 +02:00
1f0cb4ae76
Add raccoon peer to wireguard
...
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:48 +02:00
d49d078bed
Add raccoon host key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:46 +02:00
e98fdb89ab
Restrict fox peer to a single IP
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:43 +02:00
6afe05b5fd
Use lowercase peer hostnames
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-26 12:28:25 +02:00
7d5aebf882
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:59:40 +02:00
94cbfd38a6
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
2025-09-19 10:54:36 +02:00
4da7780472
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:24 +02:00
a6dfc267fd
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:19 +02:00
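A hedged sketch of the runpath approach from "Fix hidden dependencies for AMDuProfSys", written as an overlay. The attribute name amd-uprof, the binary location under $out/bin and the choice of libxcrypt-legacy as the provider of libcrypt.so.1 are assumptions for illustration, not taken from the repository.

```nix
# Sketch only: `amd-uprof` and the binary path are hypothetical names.
final: prev: {
  amd-uprof = prev.amd-uprof.overrideAttrs (old: {
    nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ prev.patchelf ];
    postFixup = (old.postFixup or "") + ''
      # Append library directories to the runpath so that the dlopen() calls
      # for libcrypt.so.1 and libstdc++.so.6 can be resolved at run time.
      patchelf --add-rpath "${prev.lib.makeLibraryPath [
        prev.stdenv.cc.cc.lib # provides libstdc++.so.6
        prev.libxcrypt-legacy # assumed provider of libcrypt.so.1
      ]}" "$out/bin/AMDuProfSys"
    '';
  });
}
```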
d6126501ba
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:17 +02:00
ac0deb47b6
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:15 +02:00
f7d676de77
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:09 +02:00
cf1db201b2
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:07 +02:00
e6e4846529
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:54:05 +02:00
084d556c56
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:53:49 +02:00
ff0fc18d0a
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 15:34:02 +02:00
19c7e32678
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 15:33:47 +02:00
017c19e7d0
Use 10.106.0.0/24 subnet to avoid collisions
...
The 106 byte is the code for 'j' (jungle) in ASCII:
% printf j | od -t d
0000000 106
0000001
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:13 +02:00
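A minimal sketch of a WireGuard interface placed in the 10.106.0.0/24 subnet. Only the subnet comes from the commit above; the interface name, host addresses, port, key path and public key are placeholders.

```nix
# Sketch only: everything except the 10.106.0.0/24 subnet is a placeholder.
{ ... }:
{
  networking.wireguard.interfaces.wg0 = {
    # Address of this host inside the tunnel subnet.
    ips = [ "10.106.0.1/24" ];
    listenPort = 51820;
    privateKeyFile = "/run/secrets/wg-private-key";
    peers = [
      {
        publicKey = "<peer public key>";
        # A /32 restricts the peer to a single address of the subnet.
        allowedIPs = [ "10.106.0.2/32" ];
      }
    ];
  };
}
```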
a36eff8749
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 1eac0fcad8211195499bc566e6c70312b31af700.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:06 +02:00
df17b11458
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:02 +02:00
0dc7b7eb3d
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:03:00 +02:00
dff6eaf587
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:59 +02:00
4b6b67b587
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:57 +02:00
20e7d244d1
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:55 +02:00
c5d3b8e7f0
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:52 +02:00
6bbfb0d124
Make apex host specific to each machine
...
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:49 +02:00
46d03d5ca7
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:46 +02:00
e366e6ce87
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:43 +02:00
e415f70bbb
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 12:02:38 +02:00
200c727bbf
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:28 +02:00
7413021440
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:26 +02:00
20b4805335
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:24 +02:00
f7dff9deab
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:22 +02:00
f569933732
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:20 +02:00
ee895d2e4f
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:16 +02:00
5ee8623af2
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 17:44:20 +02:00
a0e4b209b0
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 16:19:34 +02:00
ce25867421
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 16:02:26 +02:00
f89bba35a6
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-27 15:27:38 +02:00
d591721a61
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 14:32:46 +02:00
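A minimal sketch of where the limits from "Move StartLimit* options to unit section" would go in a NixOS module. The service name and both values are taken from the systemctl output above; it assumes the rest of power-policy.service is defined elsewhere.

```nix
# Sketch only: assumes the power-policy service itself is declared elsewhere.
{ ... }:
{
  systemd.services.power-policy = {
    # unitConfig is rendered into the [Unit] section of the generated unit,
    # which is where systemd expects the StartLimit* keys.
    unitConfig = {
      StartLimitBurst = 10;
      StartLimitIntervalSec = "10min";
    };
    # Keys placed under serviceConfig end up in [Service], where the
    # StartLimit* options are reported as unknown and ignored.
  };
}
```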
343b4f155e
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 11:22:38 +02:00
39a211a846
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 11:22:36 +02:00
142985c505
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 11:22:33 +02:00
3f3dc2d037
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 11:22:10 +02:00
3269d763aa
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as a standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 11:52:09 +02:00
f2d8ee8552
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 11:51:47 +02:00
8d984a0672
Enable cuda systemFeature in raccoon and fox
...
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-22 17:07:13 +02:00
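A minimal sketch of a derivation that opts in to the cuda system feature, as described in "Enable cuda systemFeature in raccoon and fox". The derivation name and the build step are placeholders; only the requiredSystemFeatures line comes from the commit message.

```nix
# Sketch only: the name and the build step are placeholders.
{ pkgs ? import <nixpkgs> { } }:
pkgs.runCommand "cuda-feature-demo"
  {
    # Only builders that advertise the "cuda" system feature accept this build.
    requiredSystemFeatures = [ "cuda" ];
  }
  ''
    # A real check would call into the CUDA runtime here.
    echo "built on a cuda-capable builder" > $out
  ''
```

On a machine without the feature Nix should refuse to build the derivation locally rather than run it in a sandbox without GPU access.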
f3733418b2
Move shared nvidia settings to a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-22 17:06:45 +02:00
ce8b05b142
Replace xeon07 by hut in ssh config
...
The xeon07 machine has been renamed to hut.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-21 18:10:08 +02:00
4a5787e0c6
Enable automatic Nix GC in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-21 17:58:26 +02:00
6c11093033
Select proprietary NVIDIA driver in raccoon
...
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for GPUs older than Turing:
> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-21 17:58:21 +02:00
750504744f
Enable open source NVidia driver in fox
...
It is recommended for newer versions.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 09:57:38 +02:00
c26ec1b6f1
Remove option allowUnfree from fox and raccoon
...
It is already set to true for all machines.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 09:57:21 +02:00
2ef32f773c
Ban another scanner trying to connect via SSH
...
It is constantly spamming our logs:
apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 09:51:49 +02:00
fc9fcd602a
Update weasel IPMI hostname for monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-18 09:51:21 +02:00
0e37ab5fe1
Remove merged MPICH patch
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-16 13:07:12 +02:00
a1b387e454
Remove package ix as it is gone
...
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-16 13:07:06 +02:00
380abe9957
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
→ 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
→ 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
→ 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
→ 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-16 13:07:01 +02:00
37c12783bb
Upgrade nixpkgs to nixos 25.05
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-16 13:06:40 +02:00
7379e84e79
Silently ban OpenVAS BSC scanner from apex
...
It is spamming our logs with refused connection lines:
apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:40:41 +02:00
b802f88df9
Rotate anavarro password and SSH key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:24:41 +02:00
bd94c4ad00
Add weasel machine configuration
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 17:24:38 +02:00
570c6e175d
Remove extra flush commands on firewall stop
...
They are not needed as they are already flushed when the firewall
starts or stops.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:45 +02:00
96661dd0d4
Prevent accidental use of nftables
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:42 +02:00
28db7799ea
Add proxy configuration for internal hosts
...
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.
This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:36 +02:00
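A hedged sketch of the tunnel described in "Add proxy configuration for internal hosts": SSH to apex first, then let netcat ask the HTTP proxy to open the connection. The internal host name and the proxy address are hypothetical, and the nc flags assume the OpenBSD variant of netcat.

```nix
# Sketch only: host name and proxy address are hypothetical.
{ ... }:
{
  programs.ssh.extraConfig = ''
    Host gitlab-internal.example.org
      # Run netcat on apex and have the HTTP proxy CONNECT to the target.
      ProxyCommand ssh apex nc -X connect -x proxy.example.org:3128 %h %p
  '';
}
```

Because the hop through apex is part of the system-wide SSH configuration, users on the compute nodes would not need their own credentials on the proxy side.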
508059c99e
Remove unused blackbox configuration modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:30 +02:00
b9f9cc7d7a
Use IPv4 in blackbox probes
...
Otherwise they simply fail as IPv6 doesn't work.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:26 +02:00
eae0c7cb59
Make NFS mount async to improve latency
...
Don't wait to flush writes, as we don't care about consistency on a
crash:
> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:20 +02:00
2280635cd6
Disable root_squash from NFS
...
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:16 +02:00
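A minimal sketch combining the two NFS export changes above ("Make NFS mount async to improve latency" and "Disable root_squash from NFS"). The /home path appears in the commit messages; the client subnet is a placeholder and the real export line may carry other options.

```nix
# Sketch only: the client specification is a placeholder.
{ ... }:
{
  services.nfs.server = {
    enable = true;
    exports = ''
      # async: reply before data reaches stable storage (faster, unsafe on crash).
      # no_root_squash: let root on the clients read the export as root.
      /home 10.106.0.0/24(rw,async,no_root_squash)
    '';
  };
}
```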
16ada09600
Remove SSH proxy to access BSC clusters
...
We now have direct connection to them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:13 +02:00
0d291d715c
Add users to apex machine
...
They need to be able to login to apex to access any other machine from
the SSF rack.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:09 +02:00
66001f76f7
Remove proxy from hut HTTP probes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:04 +02:00
1e3b85067d
Remove proxy configuration from environment
...
All machines now have a direct connection to the outside world.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:18:00 +02:00
36ee1f3adc
Add storcli utility to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:17:57 +02:00
25e9c071b0
Add new configuration for apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-15 11:17:43 +02:00
80cee2dbd0
Add pmartin1 user with access to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-03 11:16:43 +02:00
ee92934c74
Add access to fox for rpenacob user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 16:58:53 +02:00
db0f3fed91
Revert "Only allow Vincent to access fox for now"
...
This reverts commit e9e3704b677baed1649583f25e4e1bc050a9534e.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 16:58:49 +02:00
adeaa0484d
Add all terminfo files in environment
...
Fixes problems with the kitty terminal when opening vim or kakoune.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-02 16:02:45 +02:00
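One way to get the effect of "Add all terminfo files in environment" on NixOS is the option below. Whether the commit uses this exact option is an assumption.

```nix
# Sketch only: assumes the stock NixOS option is what the commit relies on.
{ ... }:
{
  # Install the terminfo entries of all known terminal emulators, so that
  # programs such as vim or kakoune recognise TERM values like xterm-kitty.
  environment.enableAllTerminfo = true;
}
```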
815810830e
Monitor Fox BMC with ICMP probes too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:51:22 +02:00
7a52e1907c
Restrict DAC VPN to fox-ipmi machine only
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:51:19 +02:00
22a2e1b9e8
Monitor fox via VPN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:51:16 +02:00
f29461ae32
Add OpenVPN service to connect to fox BMC
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:51:13 +02:00
208197f099
Add ac.upc.edu as name search server
...
Allows referring to fox.ac.upc.edu directly as fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:51:09 +02:00
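A minimal sketch of the search domain from "Add ac.upc.edu as name search server", assuming it is set through the standard NixOS networking option.

```nix
# Sketch only: assumes the standard networking.search option is used.
{ ... }:
{
  # Appended to short host names by the resolver, so that "fox" is also
  # tried as "fox.ac.upc.edu".
  networking.search = [ "ac.upc.edu" ];
}
```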
479ca1b671
Disable kptr_restrict in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:08:42 +02:00
40529fbdcb
Disable NUMA balancing in fox
...
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:08:02 +02:00
9b0d3fb21e
Load amd_uncore module in fox
...
Needed for L3 events in perf.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:07:58 +02:00
d8444131d8
Enable SSH X11 forwarding
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-02 15:07:54 +02:00
af540456a6
Disable registration in Gitea
...
Get rid of all the spam accounts they are trying to register.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:18 +02:00
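A minimal sketch of "Disable registration in Gitea", assuming the NixOS gitea module is used and the setting is passed through its settings attribute set.

```nix
# Sketch only: assumes the stock NixOS gitea module.
{ ... }:
{
  # Rendered into the [service] section of Gitea's app.ini.
  services.gitea.settings.service.DISABLE_REGISTRATION = true;
}
```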
42d6734da8
Enable msmtp configuration in tent
...
Allows gitea to send notifications via email.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:15 +02:00
071a8084a0
Add GitLab runner with debian docker for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:13 +02:00
24a0c58592
Monitor nix-daemon in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:11 +02:00
810a6dfcec
Move nix-daemon exporter to modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:09 +02:00
47ad89dee1
Add p service for pastes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:07 +02:00
8af1b259f5
Enable public-inbox service in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:06 +02:00
560003d4fd
Enable gitea in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:04 +02:00
68ff45075c
Add bsc.es to resolve domain names
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:02 +02:00
fc68d16197
Monitor AXLE machine too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:36:00 +02:00
f6ec1293f4
Use IPv4 for blackbox exporter
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:59 +02:00
4feeff978c
Add public html files to tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:57 +02:00
7b19292912
Add docker GitLab runner for BSC GitLab
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:55 +02:00
0627db0eb9
Add GitLab shell runner in tent for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:54 +02:00
ae2f6dde41
Enable jungle robot emails for Grafana in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:52 +02:00
3bf70656dc
Add tent key for nix-serve
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:50 +02:00
1cf989d727
Remove jungle nix cache from tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:48 +02:00
19f734e622
Enable nix cache
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:47 +02:00
d6e3d9626c
Serve Grafana from subpath
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:45 +02:00
9c32e42dcc
Add nginx server in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:43 +02:00
61e6d3232b
Add monitoring in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-18 15:35:00 +02:00
d0fd8cde46
Disable nix garbage collector in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-06-11 16:05:05 +02:00
5223ea53f6
Rekey secrets with tent keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:20 +02:00
253426ce00
Add tent host key and admin keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:16 +02:00
df67b6cd26
Create directories in /vault/home for tent users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:12 +02:00
766da21097
Add software RAID in tent using 3 disks
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:10 +02:00
18461c0d59
Add access to tent to all hut users too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:06 +02:00
028b151c78
Add hut SSH configuration from outside SSF LAN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:04 +02:00
7176b066bb
Don't use proxy in base preset
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:04:00 +02:00
c3c3614f63
Add tent machine from xeon04
...
We moved the tent machine to the server room in the BSC building and it is
now directly connected to raccoon via NAT.
Fixes: rarias/jungle#106
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:03:54 +02:00
e13288fc29
Create specific SSF rack configuration
...
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 16:03:49 +02:00
e9e3704b67
Only allow Vincent to access fox for now
...
Needed to run benchmarks without interference.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 12:08:57 +02:00
7d3c7342ae
Use performance governor in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 12:08:55 +02:00
8f80ed2cce
Add hut as nix cache in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 12:08:51 +02:00
d00f996f59
Use extra- for substituters and trusted-public-keys
...
From the nix manual:
> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-06-11 11:27:37 +02:00
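A minimal sketch of the extra- prefix from "Use extra- for substituters and trusted-public-keys". The cache URL and the public key are placeholders.

```nix
# Sketch only: the URL and the key are placeholders.
{ ... }:
{
  nix.settings = {
    # Appended to the substituters already configured, instead of replacing them.
    extra-substituters = [ "https://cache.example.org" ];
    # Likewise appended to the existing list of trusted keys.
    extra-trusted-public-keys = [ "cache.example.org-1:<base64 public key>" ];
  };
}
```

With the plain substituters and trusted-public-keys names, a later definition would override the earlier list, which is the behaviour the quoted manual passage warns about.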
e40fd24f26
Use DHCP for Ethernet in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 10:24:53 +02:00
83efd6c876
Use UPC time servers as others are blocked
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-11 10:24:47 +02:00
f0c4206ab8
Create tracing group and add arocanon in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 11:09:41 +02:00
8b43a6ffb6
Extend perf support in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 11:09:30 +02:00
2bca10b0e4
Enable nixdebuginfod in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:50:01 +02:00
eec3e27d66
Make raccoon use performance governor
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:35 +02:00
e51ef52721
Enable binfmt emulation in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:33 +02:00
9dc67d402f
Disable nix garbage collector in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:31 +02:00
62ec4e014a
Add dbautist user to raccoon machine
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:28 +02:00
4d03842f7c
Add node exporter monitoring in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:26 +02:00
8fedc5518e
Allow X11 forwarding via SSH
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:23 +02:00
43dc336638
Enable linger for user rarias
...
Allows services to run without a login session.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:45:19 +02:00
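A minimal sketch of "Enable linger for user rarias", assuming the per-user NixOS option is used.

```nix
# Sketch only: assumes the users.users.<name>.linger option is used.
{ ... }:
{
  # Keep the user's systemd instance running without an open login
  # session, so that user services survive logout.
  users.users.rarias.linger = true;
}
```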
2b08fcd21a
Only proxy SSH git remotes via hut in xeon
...
Other machines like raccoon have direct access.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-03 10:44:31 +02:00
557618d43f
Add machine map file
...
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unknown.
Using the Nix language to encode the machines' location and properties
allows us to later use that information in the configuration of the
machines themselves.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 14:55:58 +02:00
e8ac6cf0f3
Remove fox monitoring via IPMI
...
We will need to set up a VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:53 +02:00
f8fc391cae
Monitor fox, gateway and UPC anella via ICMP
...
Fox should reply once the machine is connected to the UPC network.
Monitoring also the gateway and UPC anella allows us to estimate if the
whole network is down or just fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:51 +02:00
6c1afa3fd8
Update configuration for UPC network
...
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:48 +02:00
008584b465
Disable home via NFS in fox
...
It won't be accessible anymore as we won't be in the same LAN.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:46 +02:00
a22c862192
Rekey all secrets
...
Fox is no longer able to use munge or ceph, so we remove the key and
rekey them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:44 +02:00
cd0c070439
Rotate fox SSH host key
...
Prevent decrypting old secrets by reading the git history.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:42 +02:00
201ff64b25
Distrust fox SSH key
...
We will no longer share secrets with fox until we can regain our trust.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:38 +02:00
9bee145e25
Remove Ceph module from fox
...
It will no longer be accessible from the UPC.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:36 +02:00
4528b7c2a6
Remove fox from SLURM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:20 +02:00
1eac0fcad8
Remove pam_slurm_adopt from fox
...
We no longer will be able to use SLURM from jungle.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:02 +02:00
dd15f9c943
Add UPC temperature sensor monitoring
...
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-29 13:01:37 +02:00
4048b3327a
Add meteocat exporter
...
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-29 13:01:29 +02:00
f4229e34f6
Add custom nix-daemon exporter
...
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.
See: https://discourse.nixos.org/t/query-ongoing-builds/23486
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-29 12:57:07 +02:00
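The lookup described in f4229e34f6 can be sketched roughly as follows. This is a minimal illustration, not the exporter from the repository: it assumes a Linux /proc layout, and the helper names `parse_environ` and `derivation_name` are hypothetical. Reading the environ file of a build process owned by a nixbld* user still requires root, as the commit notes.

```python
import os


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE records of a /proc/<pid>/environ file."""
    env = {}
    for record in raw.split(b"\0"):
        # Skip empty trailing records and entries without a '=' separator.
        if not record or b"=" not in record:
            continue
        key, _, value = record.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def derivation_name(pid: int, proc_root: str = "/proc"):
    """Return the $name variable of process `pid`, or None if it cannot be read."""
    path = os.path.join(proc_root, str(pid), "environ")
    try:
        with open(path, "rb") as f:
            return parse_environ(f.read()).get("name")
    except OSError:
        # Process already exited, or we lack permission to read its environment.
        return None
```

A caller would apply `derivation_name` to each child PID of nix-daemon and export the names that are not None.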
5208a3483b
Set keep-outputs to true in all machines
...
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:
If true, the garbage collector will keep the outputs of non-garbage
derivations. If false (default), outputs will be deleted unless they are
GC roots themselves (or reachable from other roots).
In general, outputs must be registered as roots separately. However,
even if the output of a derivation is registered as a root, the
collector will still delete store paths that are used only at build time
(e.g., the C compiler, or source tarballs downloaded from the network).
To prevent it from doing so, set this option to true.
See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-04-22 17:27:37 +02:00
92eacfad20
Add raccoon node exporter monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 14:50:08 +02:00
80309d107b
Increase data retention to 5 years
...
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:
# du -sh /var/lib/prometheus2
13G /var/lib/prometheus2
So we can expect it to increase to about 65 GiB. In the future we may
want to reduce the acquisition frequency of some metrics.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-22 14:50:03 +02:00
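The estimate in 80309d107b is a linear extrapolation of the measured yearly size. A small sketch of that arithmetic, assuming the size grows linearly with both the retention time and the number of samples collected (the `sample_scale` parameter is a hypothetical knob for the reduced acquisition frequency mentioned in the commit):

```python
def projected_size_gib(size_per_year_gib: float,
                       retention_years: int,
                       sample_scale: float = 1.0) -> float:
    """Extrapolate the on-disk size of the metrics database.

    sample_scale is the fraction of samples kept relative to today,
    for example 0.5 if the scrape interval were doubled.
    """
    return size_per_year_gib * retention_years * sample_scale


# 13 GiB measured for one year, kept for 5 years.
five_years = projected_size_gib(13, 5)
```

Under these assumptions, halving the number of samples would also halve the projected size.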
d0f151595f
Don't forward any docker traffic
...
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:16:15 +02:00
93f8d3aa89
Allow traffic from docker to enter port 23080
...
Before:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out
After:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
10.0.40.7 (10.0.40.7:23080) open
Fixes: rarias/jungle#94
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:16:10 +02:00
d84645f3e1
Add bscpm04.bsc.es SSH host and public key
...
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:15:45 +02:00
55b71d6901
Use hut nix cache in owl1, owl2 and raccoon
...
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:17 +02:00
89c65ea578
Clean all iptables rules on stop
...
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:14 +02:00
129273e8d8
Make nginx listen on all interfaces
...
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:07 +02:00
fdac196c6c
Fix nginx /cache regex
...
`nix-serve` does not handle duplicated slashes in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```
This meant that the cache was not accessible via
`curl https://jungle.bsc.es/cache/nix-cache-info`, but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:04 +02:00
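The symptoms reported in fdac196c6c are consistent with a location regex that captures the leading slash of the remainder of the path. The actual nginx rule is not shown in the commit message, so the two patterns below are hypothetical and only reproduce the described behaviour with Python's `re` module:

```python
import re

# Upstream nix-serve address taken from the commit message.
UPSTREAM = "http://127.0.0.1:5000/"

# Hypothetical patterns: the first also captures the slash after /cache,
# the second requires the slash and leaves it out of the capture.
BROKEN = r"^/cache(.*)$"
FIXED = r"^/cache/(.*)$"


def upstream_url(pattern: str, request_path: str):
    """Return the URL that would be requested upstream, or None if no match."""
    match = re.match(pattern, request_path)
    if match is None:
        return None
    return UPSTREAM + match.group(1)
```

With `BROKEN`, the path `/cache/nix-cache-info` is forwarded with a double slash, which `nix-serve` rejects, while `/cachenix-cache-info` happens to produce a valid upstream path. With `FIXED`, only paths under `/cache/` match and the upstream path has a single slash.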
3f4b4fb810
Add new GitLab runner for gitlab.bsc.es
...
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:41:18 +02:00
2c7211ffa3
Remove SLURM partition all
...
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:27 +02:00
18f25307ab
Add varcila user to hut and fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:25 +02:00
7c55d10ceb
Adjust fox slurm config after disabling SMT
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:23 +02:00
5c549faaa8
Add abonerib user to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:21 +02:00
9fd35a9ce4
Don't move doc in web output
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:19 +02:00
5487a93972
Reject SSH connections without SLURM allocation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:15 +02:00
fe16ea373f
Add users to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:13 +02:00
163434af09
Add dalvare1 user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:11 +02:00
71164400d4
Mount NVME disks in /nvme{0,1}
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:06 +02:00
f887dacdea
Exclude fox from being suspended by slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:04 +02:00
4f5c8dbbaf
Use IPMI host names instead of IP addresses
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:01 +02:00
14b192b1d9
Add fox IPMI monitoring
...
Use agenix to store the credentials safely.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:59 +02:00
2b04812320
Add new fox machine
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:42 +02:00
2f6f6ba703
Update PM GitLab tokens to new URL
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
371b0c7e76
Fix MPICH build by fetching upstream patches too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
ae34eacf4a
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
→ 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
→ 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
dab6f08d89
Set nixpkgs to track nixos-24.11
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
8190523c30
Add script to monitor GPFS
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:07 +01:00
d335d69ba6
Add BSC machines to ssh config
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:51 +01:00
cec49eb5fc
Collect statistics from logged users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:48 +01:00
22db38c98f
Add custom GPFS exporter for MN5
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:46 +01:00
0d4eebbb59
Remove exception to fetch task endpoint
...
It causes the request to go to the website rather than the Gitea
service.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:43 +01:00
025f6a0c0c
Use SSD for boot, then switch to NVME
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:40 +01:00
abc74c5445
Use NVME as root
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:37 +01:00
6942f09f69
Keep host header for Grafana requests
...
This was breaking requests due to CSRF check.
See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:32 +01:00
56f6855af7
Ignore logging requests from the gitea runner
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:28 +01:00
81c822e68e
Log the client IP not the proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:22 +01:00
53e80b1f19
Ignore misc directory
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:19 +01:00
21feb01e7b
Create paste directories in /ceph/p
...
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:16 +01:00
9ea7b2b475
Add p command to paste files
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:10 +01:00
fce4d89e1d
Use nginx to serve website and other services
...
Instead of using multiple tunnels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not possible
otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:07 +01:00
6b282375f8
Mount the NVME disk in /nvme
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:22:58 +01:00
260986b9f2
Delay nix-gc until /home is mounted
...
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.
See: rarias/jungle#79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:45:30 +02:00
15afbe94bd
Add dbautist user with access to hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:42:02 +02:00
efd35a9cd1
Set the serial console to ttyS1 in raccoon
...
Apparently the ttyS0 console doesn't exist but ttyS1 does:
raccoon% sudo stty -F /dev/ttyS0
stty: /dev/ttyS0: Input/output error
raccoon% sudo stty -F /dev/ttyS1
speed 9600 baud; line = 0;
-brkint -imaxbel
The dmesg line agrees:
00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:56 +02:00
50ad1d637c
Remove setLdLibraryPath and driSupport options
...
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".
See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:53 +02:00
c299d53146
Add documentation section about GRUB chain loading
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:47 +02:00
152b71e718
Add 10 min shutdown jitter to avoid spikes
...
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:44 +02:00
0911d5b92a
Don't mount the nix store in owl nodes
...
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:42 +02:00
5ddae068af
Emulate other architectures in owl nodes too
...
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:39 +02:00
d17be714ec
Program shutdown for August 2nd for all machines
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:36 +02:00
28ce15d74d
Enable debuginfod daemon in owl nodes
...
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:30 +02:00
504f9bb570
Set gitea and grafana log level to warn
...
Prevents filling the journal logs with information messages.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:27 +02:00
f158cb63e8
Set default SLURM job time limit to one hour
...
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
8860f76cad
Allow other jobs to run in unused cores
...
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.
As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
b86798cd69
Use authentication tokens for PM GitLab runner
...
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead. Older tokens and
runners are also removed, as they are no longer used.
With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.
See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:16 +02:00
7ed74931cf
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
→ 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
→ 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:13 +02:00
6e9d33b483
Allow ptrace to any process of the same user
...
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:09 +02:00
58abaefbc4
Add abonerib user to hut, raccoon, owl1 and owl2
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:07 +02:00
5ea7827a8a
Grant rpenacob access to owl1 and owl2 nodes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:05 +02:00
b17e4a13f9
Access private repositories via hut SSH proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:03 +02:00
9c4e60c2c2
Set the default proxy to point to hut
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:56 +02:00
e7376917bd
Allow incoming traffic to hut proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:23 +02:00
130e191d37
eudy: koro: fcs: Fix fcs unprotected cpuid all
...
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-07-17 11:40:20 +02:00
349f69e30a
Add support for armv7 emulation in hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:48 +02:00
59ab6405c5
Monitor raccoon machine via IPMI
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:32 +02:00
a0dab66aa5
Move vlopez user to jungleUsers for koro host
...
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:39 +02:00
525cad4117
Add raccoon motd file
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:38 +02:00
24ee74d614
Split xeon specific configuration from base
...
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:37 +02:00
15b4b28d2c
Control user access to each machine
...
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that each user can only access those hosts.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:34 +02:00
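The filtering described in 15b4b28d2c can be pictured with a short sketch. The real option is implemented in the NixOS module system; the Python function below is only a model of the behaviour, and the user names, attributes and the `users_for_host` helper are hypothetical:

```python
def users_for_host(jungle_users: dict, hostname: str) -> dict:
    """Keep only the users whose `hosts` list contains `hostname`.

    The `hosts` attribute is dropped from the result, since it is only
    used for filtering and is not part of the regular user definition.
    """
    selected = {}
    for name, attrs in jungle_users.items():
        if hostname in attrs.get("hosts", []):
            selected[name] = {k: v for k, v in attrs.items() if k != "hosts"}
    return selected


# Hypothetical user definitions, for illustration only.
example_users = {
    "alice": {"uid": 1001, "hosts": ["hut", "owl1"]},
    "bob": {"uid": 1002, "hosts": ["koro"]},
}
```

Each machine would then evaluate the filter with its own host name, so a user is granted access to another machine by adding that machine to the `hosts` list.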
b1ce302e4b
Add PostgreSQL DB for performance test results
...
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:24 +02:00
b8b85f55cd
Enable Grafana email alerts
...
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 15:57:38 +02:00
1189626a6f
Enable mail notification in Gitea
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:49 +02:00
dbd95dd7b8
Add msmtp to send notifications via email
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:20 +02:00
81b680a7d2
Allow Ceph traffic to lake2
2024-05-02 17:43:48 +02:00
ba60e121df
Collect Gitea metrics in Prometheus
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:25 +02:00
432e6c8521
Add Gitea service
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:31:51 +02:00
c8160122b3
Add firewall rules for Ceph and monitoring
...
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:11 +02:00
3863fc25a5
Add workaround for MPICH 4.2.0
...
See: https://github.com/pmodels/mpich/issues/6946
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:08 +02:00
2b26cd2f46
Fix SLURM bug in rank integer sign expansion
...
See: https://bugs.schedmd.com/show_bug.cgi?id=19324
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:05 +02:00
30f2079f0b
Merge pmix outputs for MPICH
...
MPICH expects headers and libraries to be present in the same directory.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:03 +02:00
366436b6d3
Remove nixseparatedebuginfod input
...
It has been integrated in nixpkgs, so is no longer required.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:58 +02:00
9f1cd02144
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
→ 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
→ 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
→ 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
→ 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
→ 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:29 +02:00
82ccae1315
Use google.com probe instead of bsc.es
...
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-05 16:52:21 +01:00
1df80460d2
Add another HTTPS probe for bsc.es
...
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot tell whether a problem lies in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.
Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 12:26:56 +01:00
7f17fe8874
Move slurm client in a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00
5880a6e5f6
Enable public-inbox at jungle.bsc.es/lists
...
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.
The service is available at https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.
We also edited the CSS to improve the readability and have larger fonts
by default.
The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-15 11:18:08 +01:00
ecbb45d6ac
Monitor https://pm.bsc.es/gitlab/ too
...
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.
Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-05 09:56:28 +01:00
c564d945d4
Enable nixseparatedebuginfod module
...
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-04 11:04:52 +01:00
ed887b0412
Use tmpfs in /tmp
...
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-28 12:25:50 +01:00
fe1d3fbb80
Enable runners for pm.bsc.es/gitlab too
...
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we rebooted the node. With
this change we enable the runners for both PM and gitlab.bsc.es.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 14:45:23 +01:00
5234ca32fd
Remove complete ceph package from hut
...
Only the ceph-client is needed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:54 +01:00
cfe0c0e6e6
Fix warning in slurm exporter using vendorHash
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:50 +01:00
7afe7344ac
Remove old Ceph package overlay
...
The Ceph package is now integrated in upstream nixpkgs.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:47 +01:00
bd83ca53ab
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
→ 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
→ 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:57:44 +01:00
0d9c99a24e
BSC packages are no longer in bsc attribute
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
db98b1f698
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
84c4b6b81c
Switch bscpkgs URL to sourcehut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
19e195b894
Monitor anella instead of gw.bsc.es
...
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
54c2bd119f
Add ICMP probes
...
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.
In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
e5d85c1b38
Enable proxy for Grafana too
...
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
f1486b84c1
Make blackbox exporter use the proxy
...
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
472f4b0334
Don't log SLURM connection attempts from ssfhead
2023-10-06 15:22:04 +02:00
425dca3e00
Add docker runner too
2023-10-06 15:17:07 +02:00
e4080cf931
Monitor gitlab.bsc.es too
2023-10-06 15:17:07 +02:00
fc9285f89d
Monitor PM webpage via blackbox
2023-10-06 15:17:07 +02:00
fbe238f5b6
Temporarily disable pm runners
2023-10-06 15:17:07 +02:00
9874da566d
Add runner for gitlab.bsc.es
2023-10-06 15:17:07 +02:00
ebc5c4d84f
Allow anonymous access to grafana
2023-09-22 10:51:30 +02:00
8634a9e133
Remove user/group when using DynamicUsers
2023-09-22 10:13:06 +02:00
0ce79ed79e
Set the SLURM_CONF variable
2023-09-21 22:22:00 +02:00
5f492ee1d7
Enable slurm-exporter service
2023-09-21 21:40:02 +02:00
9071a4de8b
Add prometheus-slurm-exporter package
2023-09-21 21:34:18 +02:00
3040a803b2
Mount the hut nix store for SLURM jobs
2023-09-20 19:38:43 +02:00
70a9e855cf
Enable direnv integration
2023-09-20 09:32:58 +02:00
aa64e9ef24
Remove bscpkgs from the registry and nixPath
...
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 12:00:33 +02:00
ba2b74fd5a
Add bscpkgs and nixpkgs top level attributes
...
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 12:00:33 +02:00
1ae5d9e25e
Use hut packages as the default package set
...
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-15 12:00:28 +02:00
ff98ba47c4
Don't fetch registry flakes from the net
2023-09-15 12:00:28 +02:00
599b23ef52
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2023-09-15 11:50:47 +02:00
8dbee06d1d
Revert "Update slurm to 23.02.05.1"
...
This reverts commit 7bfd786c01c36131cd00b90fc6a9503fd1226578.
2023-09-14 15:46:18 +02:00
d522113cb9
Open ports in firewall of compute nodes
2023-09-14 15:45:43 +02:00
7bfd786c01
Update slurm to 23.02.05.1
2023-09-13 17:44:24 +02:00
5a5f4672cd
Monitor storage nodes via IPMI too
2023-09-13 15:57:13 +02:00
2646ad4b70
Enable fstrim service
2023-09-12 16:39:45 +02:00
b120a7ca85
Serve the nix store from hut
2023-09-12 12:19:43 +02:00
2a0254b684
Add encrypted munge key with agenix
2023-09-08 19:05:45 +02:00
e3e6e7662d
Remove unused large port hole in firewall
2023-09-08 18:22:48 +02:00
868f825e26
Make exporters listen in localhost only
2023-09-08 18:13:04 +02:00
f231dc81f1
Allow only some ports for srun
2023-09-08 17:51:37 +02:00
a758eef354
Block ssfhead from reaching our slurm daemon
2023-09-08 17:36:28 +02:00
9c9c41fb57
Poweroff idle slurm nodes after 1 hour
2023-09-08 16:49:53 +02:00
1a1708f16f
Add IB and IPMI node host names
2023-09-08 13:21:37 +02:00
efe1b7e399
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2023-09-07 11:13:45 +02:00
eb9876aff6
Unlock ovni gitlab runners
2023-09-05 16:59:45 +02:00
8d31c552f5
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
→ 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
68f4d54dd1
Add agenix to all nodes
2023-09-04 22:10:43 +02:00
2042d58b72
Add agenix module to ceph
2023-09-04 22:07:07 +02:00
2c8c90e6e4
Remove old secrets
2023-09-04 22:04:32 +02:00
208dcb7dde
Mount /ceph in owl1 and owl2
2023-09-04 22:00:36 +02:00
e2f82a6383
Warn about the owl2 omnipath device
2023-09-04 22:00:17 +02:00
d704816de9
Clean owl2 configuration
2023-09-04 21:59:56 +02:00
74ec4eb22a
Move the ceph client config to an external module
2023-09-04 21:59:04 +02:00
0a5f9b55f5
Reorganize secrets and ssh keys
...
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
900de39e2f
Add anavarro user
2023-09-04 16:00:01 +02:00
1e466d07df
Set zsh inc_append_history option
2023-09-03 16:57:53 +02:00
13807c5e8f
Set zsh shell for rarias
2023-09-03 16:46:27 +02:00
d8d6d6d421
Enable zsh and fix key bindings
2023-09-03 16:42:04 +02:00
a242ddd39c
Keep a log over time with the config commits
2023-09-03 00:02:14 +02:00
a2c5fe1f5e
Configure bscpkgs.nixpkgs to follow nixpkgs
2023-09-02 23:37:59 +02:00
2c52ef9ff0
Store nixos config in /etc/nixos/config.rev
2023-09-02 23:37:11 +02:00
acb91695ac
Enable binary emulation for other architectures
2023-08-31 17:27:08 +02:00
9d93760e6f
Enable watchdog
2023-08-30 16:32:17 +02:00
aad67b9d99
Enable all osd on boot in lake2
2023-08-30 16:32:17 +02:00
e1d406023d
Scrape lake2 too
2023-08-29 12:33:26 +02:00
db6bb90af8
Also enable monitoring in lake2
2023-08-29 12:29:41 +02:00
1266c8f04e
Scrape metrics from bay
2023-08-29 11:58:00 +02:00
2b7823788c
Add monitoring in the bay node
2023-08-29 11:53:32 +02:00
86eacdd3e5
Add fio tool
2023-08-29 11:27:50 +02:00
4fa074f893
Add ceph tools in hut too
2023-08-28 17:58:21 +02:00
a260a1bc1b
Switch ceph logs to journal
2023-08-28 17:58:08 +02:00
8912d2b9bc
Update ceph to 18.2.0 in overlay
2023-08-25 18:20:21 +02:00
b4015ded86
Move pkgs overlay to overlay.nix
2023-08-25 18:12:00 +02:00
0f54d63a46
Enable ceph osd daemons in lake2
2023-08-25 14:54:51 +02:00
6c656182f1
Add the lake2 hostname to the hosts
2023-08-25 14:44:35 +02:00
be4187de3c
Use the sda for lake2
2023-08-25 13:40:10 +02:00
0b22a1b8a4
Remove netboot module
2023-08-25 13:39:01 +02:00
f18f1937ae
Disable pixiecore in hut for now
2023-08-25 13:21:00 +02:00
4b78ec9134
Add PXE helper
2023-08-25 12:05:33 +02:00
6c0c26b3aa
Enable netboot again for PXE
2023-08-24 19:08:23 +02:00
fb1744306d
Specify the disk by path
2023-08-24 15:27:37 +02:00
394c7ecd7b
Prepare lake2 config after bootstrap
...
The disk ID is different under NixOS.
2023-08-24 13:54:53 +02:00
3276f54e86
Add lake2 bootstrap config
2023-08-24 12:30:46 +02:00
4c806b8ae9
Add section to enable serial console
2023-08-24 12:29:44 +02:00
832866cbfa
Add agenix to PATH in hut
2023-08-23 17:42:50 +02:00
9fc393bb6a
Store ceph secret key in age
...
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:26:44 +02:00
d81d9d58e1
Add rarias key for secrets
2023-08-23 17:15:26 +02:00
d54dcc8d8f
Add ceph metrics to prometheus
2023-08-22 16:33:55 +02:00
a5fae4a289
Mount the ceph filesystem in hut
2023-08-22 16:15:46 +02:00
a355926cf0
Add ceph config in bay
2023-08-22 15:58:48 +02:00
d7a4420205
Add the bay host name
2023-08-22 15:56:09 +02:00
0b55ce3d02
Remove netboot and fixes
2023-08-22 12:12:15 +02:00
0ce574800e
Add bay node
2023-08-22 12:12:15 +02:00
a7e09e55df
Update flake
2023-08-22 11:28:54 +02:00
1622b3e7fc
Monitor power from other nodes via LAN
2023-08-22 11:28:54 +02:00
3424cac761
Increase prometheus retention time to one year
2023-08-22 11:28:54 +02:00
f98af9aeef
Don't set all_proxy
2023-08-22 11:28:54 +02:00
8c14b75e44
Update nixpkgs to fix docker problem
2023-07-28 14:24:51 +02:00
e497e1b88b
Allow access to devices for node_exporter
2023-07-28 13:55:35 +02:00
07411beb49
GRUB version no longer needed
2023-07-27 17:22:20 +02:00
e8bab9928d
Upgrade flake: nixpkgs, bscpkgs and agenix
2023-07-27 17:19:17 +02:00
544d5a3d69
Kill slurmd remaining processes on upgrade
2023-07-27 14:49:20 +02:00
312f2cb368
koro: Add vlopez user
2023-07-21 13:00:43 +02:00
45ac6e95e9
Add koro node
2023-07-21 13:00:08 +02:00
e6bb6e735d
eudy: Add fcsv3 and intermediate versions for testing
2023-07-21 11:27:51 +02:00
cfbfcdbe8c
eudy: Enable memory overcommit
2023-07-21 11:27:51 +02:00
c31bfd6b4d
eudy: disable all cpu mitigations
2023-07-21 11:27:51 +02:00
d20fa359d9
Enable NTP using the BSC time server
2023-06-30 14:02:15 +02:00
9be15fdad2
Add the ssfhead node as gateway
2023-06-30 14:01:35 +02:00
13e365002c
Use our host names first by default
2023-06-23 16:22:18 +02:00
a38072762f
Add DNS tools to resolve hosts
2023-06-23 16:15:45 +02:00
adf1ff29a7
Lower perf_event_paranoid to -1
2023-06-23 16:01:27 +02:00
1ec8d7a625
Set perf paranoid to 0 by default
2023-06-21 16:24:19 +02:00
f78f4f5822
Add perf to packages
2023-06-21 15:41:06 +02:00
67a57cb3e5
Allow srun to specify the cpu binding
...
The task/affinity plugin needs to be selected.
2023-06-21 13:16:23 +02:00
85896f8546
Move authorized keys to users.nix
2023-06-20 14:08:34 +02:00
5e728773c3
Add rpenacob user
2023-06-20 12:54:26 +02:00
0a06cf564b
Add osumb to the system packages
2023-06-16 19:22:41 +02:00
db26b2ae37
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2023-06-16 18:33:54 +02:00
f7d00dec25
Set mpi to mpich by default in bscpkgs
2023-06-16 18:26:51 +02:00
2053ec82b7
Add missing parameter to extend
2023-06-16 18:26:51 +02:00
f2434a17c2
Use explicit order in overlays
2023-06-16 18:26:51 +02:00
1f7045fcfe
Replace mpi inside bsc attribute
2023-06-16 18:26:51 +02:00
0c4a1efa27
Add mpich overlay
2023-06-16 18:26:51 +02:00
530958496b
Add comments in slurm config
2023-06-16 18:26:50 +02:00
df378a2933
Add eudy host key to known hosts
2023-06-16 17:29:48 +02:00
2a0fe5a137
Rename xeon08 to eudy
...
From Eudyptula, a little penguin.
2023-06-16 17:16:05 +02:00
dfbeafa2b2
Update rebuild script for all nodes
2023-06-16 12:13:07 +02:00
7d4281a5c1
Add ssh host keys
2023-06-16 12:01:12 +02:00
dfea0be2d9
Set the name of the slurm cluster to jungle
2023-06-16 12:00:54 +02:00
df91da8c34
Change owl hostnames
2023-06-16 11:42:39 +02:00
30c21155af
Add owl and all partition
2023-06-16 11:34:00 +02:00
a43016ebee
Simplify flake and expose host pkgs
...
The configuration of the machines is now moved to m/
2023-06-16 11:31:31 +02:00
801bb4ba3c
Rename xeon07 to hut
2023-06-14 17:28:40 +02:00
a9d740e95a
Remove profiles older than 30 days with gc
2023-06-14 17:28:39 +02:00
08eaf312f2
Add ncdu to system packages
2023-06-14 17:28:39 +02:00
0b57bbc6e3
Move arocanon user from xeon08 to common
2023-06-14 16:22:43 +02:00
6558a6ab77
xeon08: Add config for kernel non-voluntary preemption
2023-06-14 16:17:33 +02:00
0d196af473
xeon08: Add perf
2023-06-14 15:42:20 +02:00
d35becb663
xeon08: Enable lttng lockdep tracepoints
2023-06-14 15:42:20 +02:00
5421eab09a
xeon08: Add lttng module and tools
2023-06-14 15:42:20 +02:00
1c7de2f7c9
Serve grafana in https://jungle.bsc.es/grafana
2023-05-31 18:12:14 +02:00
c7692995f4
Add tree command
2023-05-31 18:11:34 +02:00
0af185afd8
Add file to system packages
2023-05-31 18:11:34 +02:00
470b3d2512
Add gnumake to system packages
2023-05-31 18:11:34 +02:00
1bf6747b3a
Add cmake to system packages
2023-05-31 18:11:34 +02:00
59bf51dfde
Add ix to common packages
2023-05-31 18:11:34 +02:00
b72d9936a2
Improve documentation
2023-05-26 11:38:27 +02:00
5ebb57deff
Add gitignore
2023-05-26 11:38:27 +02:00
5b82a72647
Set intel_pstate=passive and disable frequency boost
2023-05-26 11:38:26 +02:00
a5c7205481
Add xeon08 basic config
2023-05-26 11:38:26 +02:00
fd1b467a60
Add nixos-config.nix to easily enable nix repl
2023-05-26 11:29:59 +02:00
882161b21e
Automatically resume restarted nodes in SLURM
2023-05-18 12:48:04 +02:00
5e8ff50c98
Allow public dashboards in grafana
2023-05-09 18:53:31 +02:00
cdb0688ec1
Add hal ssh key
2023-05-09 18:37:38 +02:00
ebb5e94416
Increase the number of CPUs to 56 for nOS-V docker
2023-05-02 17:47:57 +02:00
89049d0b1f
Allow 5 concurrent builds in the gitlab-runner
2023-05-02 17:38:10 +02:00
6d16772d07
Simplify bash prompt
2023-04-28 18:15:04 +02:00
e37f9e2b0f
Roll back to bash as default shell
...
Zsh doesn't behave properly, it needs further configuration.
2023-04-28 17:59:19 +02:00
9767238c76
Use pmix by default in slurm
2023-04-28 17:07:48 +02:00
a5a0fd9b6f
Increase locked memory to 1 GiB
2023-04-28 12:34:51 +02:00
be69070f61
Use the latest kernel
2023-04-28 11:51:38 +02:00
53f6dcec8d
Disable osnoise and hwlat tracer for now
...
Reuse nix cache to avoid rebuilding the kernel.
2023-04-28 11:19:47 +02:00
87c4521de3
Update nixpkgs to nixos-unstable
2023-04-28 11:18:37 +02:00
461d6d2f34
Update nixpkgs
2023-04-28 11:13:46 +02:00
ef2ffa61c3
Update ib interface name in xeon02
...
It seems to be plugged into another PCI port.
2023-04-27 18:29:32 +02:00
c0b23ad450
Add steps in install documentation
2023-04-27 17:30:53 +02:00
f12ba9f8b0
Add minimal netboot module to build kexec image
2023-04-27 16:36:15 +02:00
a211e9ebee
Add xeon02 configuration
2023-04-27 16:28:12 +02:00
5dbbb27c43
Refactor slurm configuration into compute/control
2023-04-27 16:27:04 +02:00
69bb2128db
Lock flakes and add inputs
2023-04-27 13:52:59 +02:00
de7cae6208
Test flakes
2023-04-26 14:27:02 +02:00
de4ac8cbd6
Enable slurm in xeon01
2023-04-26 14:10:36 +02:00
e1dcad50d0
Use xeon07 as control machine
2023-04-26 14:10:36 +02:00
0120be66fb
Remove xeon07 overlay to load upstream slurm
2023-04-26 14:10:36 +02:00
6cb079a44e
Add script to rebuild configuration
2023-04-26 14:09:23 +02:00
a5449067a7
Add configuration for xeon01
2023-04-26 11:44:00 +00:00
1009736d81
Load overlays from /config
2023-04-26 11:44:00 +00:00
a94765e8ae
Move net.nix to common
2023-04-26 11:44:00 +00:00
9630b23ce2
Remove host specific network options from net.nix
2023-04-26 11:44:00 +00:00
ed158ee87f
Move ssh.nix to common
2023-04-26 11:44:00 +00:00
480dd95d9b
Move overlays.nix to common
2023-04-26 11:44:00 +00:00
f7b18098b1
Move users.nix to common
2023-04-26 11:44:00 +00:00
c580254dde
Move common options from configuration.nix
2023-04-26 11:44:00 +00:00
7e6c395ff8
Move the remaining hw config to common
2023-04-26 11:44:00 +00:00
6978677cb5
Move boot config to common/boot.nix
2023-04-26 11:44:00 +00:00
f5b4580dae
Move filesystems config to common/fs.nix
2023-04-26 11:44:00 +00:00
035becd018
Use partition labels for / and swap
2023-04-26 11:44:00 +00:00
a7fb69ab92
Move fs.nix to common
2023-04-26 11:44:00 +00:00
733eb93f23
Move boot.nix to common
2023-04-26 11:44:00 +00:00
b60e821eaa
Move disk selection to configuration.nix
2023-04-26 11:44:00 +00:00
f43d549294
Add common directory
2023-04-26 11:44:00 +00:00
848efdcb2d
Move xeon07 configuration to a directory
2023-04-18 16:09:23 +02:00
0f7a0c3ac2
Add smartctl monitoring
2023-04-18 16:03:46 +02:00
40d0a16736
Allow wheel users to build derivations
2023-04-14 10:14:17 +02:00
59b8ba0e76
Use bscpkgs master
2023-04-11 21:22:00 +02:00
b5153009ea
Run the garbage collector once a week
2023-04-11 21:21:22 +02:00
93a37b8353
Set EDITOR and add nix-diff
2023-04-11 20:36:54 +02:00
0ca649b715
Add nos-v gitlab runner
2023-04-11 12:59:21 +02:00
1b5e227095
Disable debug from gitlab runner
2023-04-11 12:58:24 +02:00
9310a7b0b9
Add gitlab-runner secrets using agenix
2023-04-11 12:47:52 +02:00
40b9beb86b
Disable ethernet specific useDHCP
...
It is already configured by default for all interfaces.
2023-04-06 13:58:55 +02:00
72f9659430
Enable IPoIB and set the infiniband IP
2023-04-06 13:58:24 +02:00
8fe301203c
Export nix store over nfs
2023-04-06 13:57:32 +02:00
a813ea6561
Enable gitlab runner monitoring
2023-04-06 13:56:52 +02:00
5d8b4e96b2
Add agenix tool
2023-04-05 17:04:42 +02:00
60ff89b7cc
Add monitoring services
2023-04-05 17:00:01 +02:00
e6c35604bb
Add some tools and use relaxed for build sandbox
2023-04-05 16:59:09 +02:00
d0dfba5c03
Remove commented docker settings
2023-04-05 16:56:27 +02:00
ccee2339a3
Add mio key
2023-04-05 16:56:05 +02:00
df371c950f
Setup slurm and gitlab-runner
2023-04-03 12:51:44 +02:00
52eed708f0
Add initial configuration
2023-03-31 18:27:25 +02:00