519b6eeeea
weasel: enable hydra tcp port in firewall
2025-09-19 11:20:56 +02:00
a870e4b5fa
Enable hydra on weasel
2025-09-19 11:20:56 +02:00
2491face8f
weasel: use tent cache
2025-09-19 11:20:56 +02:00
cb6882dbf1
Add nixfmt-rfc-style to common packages
2025-09-19 11:20:56 +02:00
35785edfa5
Add packages to user abonerib
2025-09-19 11:20:56 +02:00
a1decb3b47
Add nix-output-monitor to default packages
2025-09-19 11:20:56 +02:00
db41cb29cb
Set fish shell for user abonerib
2025-09-19 11:20:56 +02:00
2922ce9239
weasel: create user folders in /var/lib/podman-users
...
/home is an NFS mount, which does not support the extra filesystem arguments
needed to run podman, so we need a local home.
2025-09-19 11:20:55 +02:00
29b1ea7a76
weasel: add podman
2025-09-19 11:20:55 +02:00
3387cbcc25
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:59:40 +02:00
017e0d82f7
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Tested-by: Vincent Arcila <vincent.arcila@bsc.es >
2025-09-19 10:54:36 +02:00
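A minimal sketch of how the libnuma directory could be added to the runpath, assuming the AMD uProf derivation is available as `amd-uprof` and the binary lives at `$out/bin/AMDuProfPcm` (both names are assumptions, not taken from the commit). dlopen with a bare soname searches the DT_RUNPATH of the calling object, so adding the numactl library directory is enough, provided numactl ships the unversioned libnuma.so there:

```nix
{ lib, amd-uprof, numactl }:
# Hypothetical override; the real package may patch the binary differently
amd-uprof.overrideAttrs (old: {
  postFixup = (old.postFixup or "") + ''
    # Let dlopen("libnuma.so") resolve through the binary's DT_RUNPATH
    patchelf --add-rpath ${lib.makeLibraryPath [ numactl ]} $out/bin/AMDuProfPcm
  '';
})
```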
ac5f4e4dca
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:24 +02:00
8835dbd764
Add AMD uProf section to fox documentation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:22 +02:00
84830c66e6
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:19 +02:00
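The same runpath technique applies to the hidden dependencies of AMDuProfSys. This sketch assumes the package is called `amd-uprof` and the binary path is `$out/bin/AMDuProfSys` (both hypothetical); libxcrypt-legacy is the nixpkgs package that provides the old libcrypt.so.1 ABI, and the GCC runtime provides libstdc++.so.6:

```nix
{ lib, stdenv, amd-uprof, libxcrypt-legacy }:
# Hypothetical override; names and paths are assumptions
amd-uprof.overrideAttrs (old: {
  postFixup = (old.postFixup or "") + ''
    # libcrypt.so.1 from libxcrypt-legacy, libstdc++.so.6 from the GCC lib output
    patchelf --add-rpath ${lib.makeLibraryPath [ libxcrypt-legacy stdenv.cc.cc ]} \
      $out/bin/AMDuProfSys
  '';
})
```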
cad88f92a8
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:17 +02:00
40372cd0d9
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:15 +02:00
4e0e96f6fe
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:09 +02:00
b021789a6e
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:07 +02:00
3ab0e13960
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:54:05 +02:00
0166686b6a
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-09-19 10:53:49 +02:00
d3b355f651
Add /nfs/home to fox documentation
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 15:34:05 +02:00
2ed881cd89
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 15:34:02 +02:00
2a07df1d30
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 15:33:47 +02:00
52380eae59
Use 10.106.0.0/24 subnet to avoid collisions
...
The byte 106 is the ASCII code for 'j' (jungle):
% printf j | od -t d
0000000 106
0000001
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:03:13 +02:00
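The same value can be derived in Nix itself, as a small illustration (the file name subnet.nix is hypothetical). `lib.strings.charToInt` maps a single character to its ASCII code, so evaluating this with `nix-instantiate --eval subnet.nix` should print "10.106.0.0/24":

```nix
let
  lib = import <nixpkgs/lib>;
  # ASCII code of 'j' (jungle), i.e. 106
  octet = lib.strings.charToInt "j";
in
"10.${toString octet}.0.0/24"
```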
2fe84c4cbc
Update fox documentation for SLURM and FS
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:03:09 +02:00
3b16b41be3
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 64a52801ed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:03:06 +02:00
ee481deffb
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:03:02 +02:00
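A minimal NixOS sketch of enabling fail2ban; the retry and ban values below are illustrative assumptions, not the ones used in fox. As far as we know, the NixOS module also configures an sshd jail when OpenSSH is enabled:

```nix
{
  services.fail2ban = {
    enable = true;
    maxretry = 5;   # assumption: failed attempts before banning an address
    bantime = "1h"; # assumption: how long the address stays banned
  };
}
```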
b1bad25008
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:03:00 +02:00
85f38e17a2
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:59 +02:00
08ab01b89c
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:57 +02:00
194a6fb7f6
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:55 +02:00
365576778b
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:52 +02:00
e7490858c6
Make apex host specific to each machine
...
Allows reaching apex directly via the VPN from fox, while the rest of
the machines reach it over the Internet.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:49 +02:00
7606030135
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:46 +02:00
e55590f59e
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:43 +02:00
c3da39c392
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-09-03 12:02:38 +02:00
d3889b3339
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:28 +02:00
28540d8cf3
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:26 +02:00
f847621ceb
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:24 +02:00
12fe43f95f
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:22 +02:00
0e8329eef3
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:20 +02:00
df3b21b570
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-08-29 12:35:16 +02:00
78df61d24a
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-08-27 17:44:20 +02:00
8e7da73151
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-08-27 16:19:34 +02:00
a7e17e40dc
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-08-27 16:02:26 +02:00
0e8bd22347
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-08-27 15:27:38 +02:00
d948f8b752
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 14:32:46 +02:00
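In NixOS the [Unit] keys go in `unitConfig` rather than `serviceConfig`. A sketch reproducing the values shown by systemctl above, assuming the unit is named power-policy as in that output:

```nix
{
  systemd.services.power-policy.unitConfig = {
    # [Unit] section keys; systemd ignores them when placed in [Service]
    StartLimitIntervalSec = "10min";
    StartLimitBurst = 10;
  };
}
```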
8f7787e217
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage to the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 11:22:38 +02:00
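One way to apply the "always on" policy is to ask the local BMC through ipmitool, whose `chassis policy always-on` subcommand sets the power restore policy. This is a hypothetical oneshot unit, not necessarily what the power policy module does; the IPMI kernel modules listed are an assumption needed for local /dev/ipmi0 access:

```nix
{ pkgs, ... }:
{
  # Assumption: expose the local BMC as /dev/ipmi0
  boot.kernelModules = [ "ipmi_devintf" "ipmi_si" ];

  systemd.services.power-policy = {
    description = "Set the power restore policy to always-on";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.ipmitool}/bin/ipmitool chassis policy always-on";
    };
  };
}
```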
30b9b23112
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 11:22:36 +02:00
9a056737de
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 11:22:33 +02:00
ac700d34a5
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-24 11:22:10 +02:00
9b681ab7ce
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as a standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to check that it also works there. It uses the autoAddDriverRunpath hook to
inject into the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 11:52:09 +02:00
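A rough sketch of how such a package could be built, assuming a single hypothetical source file cudainfo.c in the current directory that includes cuda_runtime.h. The autoAddDriverRunpath hook rewrites the runpath during fixup so the driver libraries under /run/opengl-driver can be found at run time:

```nix
{ stdenv, cudaPackages, autoAddDriverRunpath }:
# Hypothetical derivation; the real cudainfo package may differ
stdenv.mkDerivation {
  pname = "cudainfo";
  version = "0.1";
  src = ./.;
  # Adds the driver library directory to the runpath of installed ELF files
  nativeBuildInputs = [ autoAddDriverRunpath ];
  buildInputs = [ cudaPackages.cuda_cudart ];
  buildPhase = ''
    $CC cudainfo.c -lcudart -o cudainfo
  '';
  installPhase = ''
    install -Dm755 cudainfo $out/bin/cudainfo
  '';
}
```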
9ce394bffd
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-23 11:51:47 +02:00
8cd7b713ca
Enable cuda systemFeature in raccoon and fox
...
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-22 17:07:13 +02:00
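On the derivation side only `requiredSystemFeatures` is needed; the builder must also advertise "cuda" in its system-features and expose the GPU devices to the sandbox, which is not shown here. A sketch of a check derivation, assuming a `cudainfo` package is in scope (the derivation name is hypothetical):

```nix
{ runCommand, cudainfo }:
# Only builders that list "cuda" in system-features will accept this build
runCommand "cudainfo-gpu-check" { requiredSystemFeatures = [ "cuda" ]; } ''
  ${cudainfo}/bin/cudainfo
  touch $out
''
```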
8eed90d2bd
Move shared nvidia settings to a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-22 17:06:45 +02:00
aee54ef39f
Replace xeon07 by hut in ssh config
...
The xeon07 machine has been renamed to hut.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-21 18:10:08 +02:00
69f7ab701b
Enable automatic Nix GC in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-21 17:58:26 +02:00
4c9bcebcdc
Select proprietary NVIDIA driver in raccoon
...
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:
> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-21 17:58:21 +02:00
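A minimal sketch of selecting the closed source kernel modules for a pre-Turing GPU such as the Maxwell card in raccoon. The NVIDIA module is activated by listing "nvidia" in the video drivers, and `hardware.nvidia.open` chooses between the open and the proprietary kernel modules (fox, with a newer GPU, would set it to true instead):

```nix
{
  services.xserver.videoDrivers = [ "nvidia" ];
  # Maxwell predates Turing, so use the proprietary kernel modules
  hardware.nvidia.open = false;
}
```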
86e7c72b9b
Enable open source NVidia driver in fox
...
It is recommended for newer GPUs.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 09:57:38 +02:00
a7dffc33b5
Remove option allowUnfree from fox and raccoon
...
It is already set to true for all machines.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 09:57:21 +02:00
6765dba3e4
Ban another scanner trying to connect via SSH
...
It is constantly spamming our logs:
apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 09:51:49 +02:00
0acfb7a8e0
Update weasel IPMI hostname for monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-18 09:51:21 +02:00
dfbb21a5bd
Remove merged MPICH patch
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-16 13:07:12 +02:00
2bb3b2fc4a
Remove package ix as it is gone
...
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-16 13:07:06 +02:00
3270fe50a2
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
→ 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
→ 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
→ 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
→ 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-16 13:07:01 +02:00
499112cdad
Upgrade nixpkgs to nixos 25.05
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-07-16 13:06:40 +02:00
a6698e6a6b
Silently ban OpenVAS BSC scanner from apex
...
It is spamming our logs with refused connection lines:
apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 17:40:41 +02:00
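With the NixOS iptables firewall, refused packets are logged with the "refused connection" prefix by the log-refuse chain reached at the end of nixos-fw. A sketch of dropping a source silently is to insert a DROP rule at the top of nixos-fw, so the packet never reaches the logging rule; the chain is flushed when the firewall stops, so no stop command is needed:

```nix
{
  networking.firewall.extraCommands = ''
    # Drop the scanner before it reaches the logging refuse rule
    iptables -I nixos-fw 1 -s 192.168.8.16 -j DROP
  '';
}
```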
b394c5a8f4
Rotate anavarro password and SSH key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 17:24:41 +02:00
3d5b845057
Add weasel machine configuration
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 17:24:38 +02:00
9e83565977
Remove extra flush commands on firewall stop
...
They are not needed as they are already flushed when the firewall
starts or stops.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:45 +02:00
ce2cda1c41
Prevent accidental use of nftables
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:42 +02:00
e6aef2cbd0
Add proxy configuration for internal hosts
...
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.
This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:36 +02:00
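A sketch of the ProxyCommand chain described above, with placeholder host and proxy names (the real ones are not given here). The local ssh opens a session to apex and runs OpenBSD-style netcat there, which issues an HTTP CONNECT to the proxy for the target host and port:

```nix
{
  # Placeholder names: gitlab.internal.example and proxy.internal.example:3128
  programs.ssh.extraConfig = ''
    Host gitlab.internal.example
      ProxyCommand ssh apex nc -X connect -x proxy.internal.example:3128 %h %p
  '';
}
```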
b7603053fa
Remove unused blackbox configuration modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:30 +02:00
3ca55acfdf
Use IPv4 in blackbox probes
...
Otherwise they simply fail as IPv6 doesn't work.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:26 +02:00
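The blackbox exporter prefers IPv6 by default; each prober accepts `preferred_ip_protocol` and `ip_protocol_fallback`. A sketch forcing IPv4 for HTTP and ICMP probes (the module names are assumptions); the file is generated as JSON, which is valid YAML:

```nix
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    configFile = pkgs.writeText "blackbox.yml" (builtins.toJSON {
      modules = {
        http_2xx = {
          prober = "http";
          http = { preferred_ip_protocol = "ip4"; ip_protocol_fallback = false; };
        };
        icmp = {
          prober = "icmp";
          icmp = { preferred_ip_protocol = "ip4"; ip_protocol_fallback = false; };
        };
      };
    });
  };
}
```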
e505a952af
Make NFS mount async to improve latency
...
Don't wait to flush writes, as we don't care about consistency on a
crash:
> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:20 +02:00
3ad9452637
Disable root_squash from NFS
...
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:16 +02:00
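The two NFS commits above boil down to export options. A sketch of the resulting export, assuming /home is exported to a hypothetical 10.0.40.0/24 client subnet; `async` replies before data reaches stable storage and `no_root_squash` keeps root as root on the clients:

```nix
{
  services.nfs.server = {
    enable = true;
    # Hypothetical client subnet
    exports = ''
      /home 10.0.40.0/24(rw,async,no_root_squash,no_subtree_check)
    '';
  };
}
```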
fdd21d0dd0
Remove SSH proxy to access BSC clusters
...
We now have direct connection to them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:13 +02:00
c40871bbfe
Add users to apex machine
...
They need to be able to log in to apex to access any other machine in
the SSF rack.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:09 +02:00
e8f5ce735e
Remove proxy from hut HTTP probes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:04 +02:00
4a25056897
Remove proxy configuration from environment
...
All machines now have a direct connection to the outside world.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:18:00 +02:00
89e0c0df28
Add storcli utility to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:17:57 +02:00
1b731a756a
Add new configuration for apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-15 11:17:43 +02:00
3d97fada6d
Add pmartin1 user with access to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-03 11:16:43 +02:00
d1a2bfc90e
Add access to fox for rpenacob user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 16:58:53 +02:00
44e76ce630
Revert "Only allow Vincent to access fox for now"
...
This reverts commit efac36b186.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 16:58:49 +02:00
adb7e0ef35
Add all terminfo files in environment
...
Fixes problems with the kitty terminal when opening vim or kakoune.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-07-02 16:02:45 +02:00
b0875816f2
Monitor Fox BMC with ICMP probes too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:51:22 +02:00
592da155a9
Restrict DAC VPN to fox-ipmi machine only
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:51:19 +02:00
5376613ec4
Monitor fox via VPN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:51:16 +02:00
74891f0784
Add OpenVPN service to connect to fox BMC
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:51:13 +02:00
d66f9f21dd
Add ac.upc.edu as name search server
...
Allows referring to fox.ac.upc.edu directly as fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:51:09 +02:00
cb05482b4f
Update access instructions
...
We no longer need to submit a request through BSC, as we will be in
charge of the logins. Also remove the link to the old repository and
only offer email.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:24:51 +02:00
e660268661
Disable kptr_restrict in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:08:42 +02:00
d45b7ea717
Disable NUMA balancing in fox
...
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:08:02 +02:00
c205fa4e34
Load amd_uncore module in fox
...
Needed for L3 events in perf.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:07:58 +02:00
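The three fox kernel tweaks above map to a module load and two sysctls. A minimal sketch:

```nix
{
  # Uncore PMU driver, needed for L3 events in perf
  boot.kernelModules = [ "amd_uncore" ];
  boot.kernel.sysctl = {
    "kernel.numa_balancing" = 0; # disable automatic NUMA balancing
    "kernel.kptr_restrict" = 0;  # do not hide kernel pointers
  };
}
```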
5f055388a5
Enable SSH X11 forwarding
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-07-02 15:07:54 +02:00
0bc69789d9
Disable registration in Gitea
...
Get rid of all the spam accounts they are trying to register.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:18 +02:00
09bc9d9c25
Enable msmtp configuration in tent
...
Allows gitea to send notifications via email.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:15 +02:00
6b53ab4413
Add GitLab runner with debian docker for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:13 +02:00
4618a149b3
Monitor nix-daemon in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:11 +02:00
448d85ef9d
Move nix-daemon exporter to modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:09 +02:00
956b99f02a
Add p service for pastes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:07 +02:00
ec2eb8c3ed
Enable public-inbox service in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:06 +02:00
09a5bdfbe4
Enable gitea in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:04 +02:00
c49dd15303
Add bsc.es to resolve domain names
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:02 +02:00
38fd0eefa3
Monitor AXLE machine too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:36:00 +02:00
e386a320ff
Use IPv4 for blackbox exporter
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:59 +02:00
5ea8d6a6dd
Add public html files to tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:57 +02:00
7b108431dc
Add docker GitLab runner for BSC GitLab
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:55 +02:00
e80b4d7c31
Add GitLab shell runner in tent for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:54 +02:00
e4c22e91b2
Enable jungle robot emails for Grafana in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:52 +02:00
27d4f4f272
Add tent key for nix-serve
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:50 +02:00
978087e53a
Remove jungle nix cache from tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:48 +02:00
ad9a5bc906
Enable nix cache
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:47 +02:00
7aeb78426e
Serve Grafana from subpath
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:45 +02:00
a0d1b31bb6
Add nginx server in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:43 +02:00
a7775f9a8d
Add monitoring in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-18 15:35:00 +02:00
7bb11611a8
Disable nix garbage collector in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-06-11 16:05:05 +02:00
cf9bcc27e0
Rekey secrets with tent keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:20 +02:00
81073540b0
Add tent host key and admin keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:16 +02:00
a43f856b53
Create directories in /vault/home for tent users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:12 +02:00
be231b6d2d
Add software RAID in tent using 3 disks
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:10 +02:00
2f2381ad0f
Add access to tent to all hut users too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:06 +02:00
19e90a1ef7
Add hut SSH configuration from outside SSF LAN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:04 +02:00
090100f180
Don't use proxy in base preset
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:04:00 +02:00
3d48d224c9
Add tent machine from xeon04
...
We moved the tent machine to the server room in the BSC building, and it
is now directly connected to raccoon via NAT.
Fixes: #106
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:03:54 +02:00
0317f42613
Create specific SSF rack configuration
...
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 16:03:49 +02:00
efac36b186
Only allow Vincent to access fox for now
...
Needed to run benchmarks without interference.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 12:08:57 +02:00
d2385ac639
Use performance governor in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 12:08:55 +02:00
d28ed0ab69
Add hut as nix cache in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 12:08:51 +02:00
1ef6f9a2bb
Use extra- for substituters and trusted-public-keys
...
From the nix manual:
> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-06-11 11:27:37 +02:00
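A sketch of the difference: setting `substituters` replaces the list, while the extra- variants append to it, so the default cache stays enabled. The URL and key below are placeholders, not the real hut cache values:

```nix
{
  nix.settings = {
    # Appended to the default substituters instead of replacing them
    extra-substituters = [ "http://hut/cache" ];             # placeholder URL
    extra-trusted-public-keys = [ "hut-cache:<public-key>" ]; # placeholder key
  };
}
```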
86b7032bbb
Use DHCP for Ethernet in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 10:24:53 +02:00
8c5f4defd7
Use UPC time servers as others are blocked
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-11 10:24:47 +02:00
b802a59868
Create tracing group and add arocanon in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 11:09:41 +02:00
7247f7e665
Extend perf support in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 11:09:30 +02:00
1d555871a5
Enable nixdebuginfod in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:50:01 +02:00
a2535c996d
Make raccoon use performance governor
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:35 +02:00
37e60afb54
Enable binfmt emulation in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:33 +02:00
3fe138a418
Disable nix garbage collector in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:31 +02:00
4e7a9f7ce4
Add dbautist user to raccoon machine
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:28 +02:00
a6a1af673a
Add node exporter monitoring in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:26 +02:00
2a3a7b2fb2
Allow X11 forwarding via SSH
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:23 +02:00
b4ab1c836a
Enable linger for user rarias
...
Allows services to run without a login session.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:45:19 +02:00
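NixOS exposes lingering as a per-user option, so a minimal sketch is a single line:

```nix
{
  # systemd user services of rarias keep running without a login session
  users.users.rarias.linger = true;
}
```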
fb8b4defa7
Only proxy SSH git remotes via hut in xeon
...
Other machines like raccoon have direct access.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-06-03 10:44:31 +02:00
1bcfbf8cd6
Add machine map file
...
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unknown.
Using the Nix language to encode the machines location and properties
allows us to later use that information in the configuration of the
machines themselves.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 14:55:58 +02:00
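A hypothetical shape for such a map file; the attribute names and values are illustrative only, with null marking unknown data:

```nix
# Hypothetical machines.nix: one attribute set per machine
{
  fox = {
    location = "UPC"; # placeholder location
    board = null;     # unknown
    serial = null;    # unknown
  };
}
```

A machine configuration could then read its entry with an expression like `(import ./machines.nix).fox.location`.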
9f43a0e13b
Remove fox monitoring via IPMI
...
We will need to set up a VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:53 +02:00
3a3c3050ef
Monitor fox, gateway and UPC anella via ICMP
...
Fox should reply once the machine is connected to the UPC network.
Also monitoring the gateway and the UPC anella lets us tell whether the
whole network is down or just fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:51 +02:00
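A sketch of the usual blackbox scrape pattern for ICMP targets, assuming the exporter listens on 127.0.0.1:9115 and has an "icmp" module. fox.ac.upc.edu is the fox name used elsewhere in this log; the gateway and anella addresses are placeholders:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "icmp";
      metrics_path = "/probe";
      params.module = [ "icmp" ];
      static_configs = [
        { targets = [ "fox.ac.upc.edu" "gateway.example" "anella.example" ]; }
      ];
      relabel_configs = [
        # Pass the target as ?target= and keep it as the instance label
        { source_labels = [ "__address__" ]; target_label = "__param_target"; }
        { source_labels = [ "__param_target" ]; target_label = "instance"; }
        # Scrape the blackbox exporter itself
        { target_label = "__address__"; replacement = "127.0.0.1:9115"; }
      ];
    }
  ];
}
```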
4419f68948
Update configuration for UPC network
...
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:48 +02:00
e51fc9ffa5
Disable home via NFS in fox
...
It won't be accessible anymore as we won't be in the same LAN.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:46 +02:00
2ae9e9b635
Rekey all secrets
...
Fox is no longer able to use munge or ceph, so we remove its key and
rekey the secrets.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:44 +02:00
be77f6a5f5
Rotate fox SSH host key
...
Prevent decrypting old secrets by reading the git history.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:42 +02:00
6316a12a67
Distrust fox SSH key
...
We will no longer share secrets with fox until we can trust it again.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:38 +02:00
db663913d8
Remove Ceph module from fox
...
It will no longer be accessible from the UPC.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:36 +02:00
b4846b0f6c
Remove fox from SLURM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:20 +02:00
64a52801ed
Remove pam_slurm_adopt from fox
...
We will no longer be able to use SLURM from jungle.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-06-02 11:26:02 +02:00
7a2f37aaa2
Add UPC temperature sensor monitoring
...
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-29 13:01:37 +02:00
aae6585f66
Add meteocat exporter
...
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and the outside.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-29 13:01:29 +02:00
1c15e77c83
Add custom nix-daemon exporter
...
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.
See: https://discourse.nixos.org/t/query-ongoing-builds/23486
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-05-29 12:57:07 +02:00
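A rough sketch of the same idea as a shell script packaged with Nix; the script name and metric name are hypothetical. It walks every nix-daemon process, looks at its direct children, and extracts the name= entry from their NUL-separated environ files, which requires root:

```nix
{ lib, writeShellScript, procps, gnugrep, coreutils }:
writeShellScript "nix-daemon-builds" ''
  export PATH=${lib.makeBinPath [ procps gnugrep coreutils ]}

  # Main daemon plus one nix-daemon process per client connection
  for daemon in $(pgrep -x nix-daemon); do
    # Direct children are usually the builders (bash)
    for child in $(pgrep -P "$daemon"); do
      # environ is NUL-separated and owned by nixbld*, so run as root
      drv=$(grep -z '^name=' "/proc/$child/environ" 2>/dev/null | tr -d '\0' | cut -d= -f2-)
      if [ -n "$drv" ]; then
        echo "nix_daemon_build{name=\"$drv\"} 1"
      fi
    done
  done
''
```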
82fc3209de
Set keep-outputs to true in all machines
...
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:
> If true, the garbage collector will keep the outputs of non-garbage
> derivations. If false (default), outputs will be deleted unless they are
> GC roots themselves (or reachable from other roots).
> In general, outputs must be registered as roots separately. However,
> even if the output of a derivation is registered as a root, the
> collector will still delete store paths that are used only at build time
> (e.g., the C compiler, or source tarballs downloaded from the network).
> To prevent it from doing so, set this option to true.
See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-04-22 17:27:37 +02:00
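In a NixOS configuration this is a single setting; keep-derivations already defaults to true, so the .drv files that reference those build time dependencies are kept as well:

```nix
{
  # Keep build time dependencies of live derivations during GC
  nix.settings.keep-outputs = true;
}
```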
abeab18270
Add raccoon node exporter monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-22 14:50:08 +02:00
1985b58619
Increase data retention to 5 years
...
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:
# du -sh /var/lib/prometheus2
13G /var/lib/prometheus2
So we can expect it to grow to about 65 GiB. In the future we may
want to reduce the acquisition frequency of some metrics.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-22 14:50:03 +02:00
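With the NixOS Prometheus module the retention is one option; Prometheus durations accept the y unit:

```nix
{
  # About 13 GiB per year, so roughly 65 GiB for 5 years
  services.prometheus.retentionTime = "5y";
}
```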
44bd061823
Don't forward any docker traffic
...
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-04-15 14:16:15 +02:00
e8c309f584
Allow traffic from docker to enter port 23080
...
Before:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out
After:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
10.0.40.7 (10.0.40.7:23080) open
Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:16:10 +02:00
71ae7fb585
Add bscpm04.bsc.es SSH host and public key
...
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:15:45 +02:00
8834d561d2
Add nix cache documentation section
...
Include usage from NixOS and non-NixOS hosts and a test with curl to
ensure it can be reached.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:22 +02:00
29daa3c364
Use hut nix cache in owl1, owl2 and raccoon
...
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.
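The client side on owl1 and owl2 could look like the sketch below. The `http://hut/cache` URL and the key are placeholders, not values taken from the repository.

```nix
{
  nix.settings = {
    # Hypothetical URL: plain HTTP over the LAN to hut. raccoon would point
    # to https://jungle.bsc.es/cache instead.
    substituters = [ "http://hut/cache" ];
    # Placeholder key: downloaded paths must be signed by it, which is why
    # plain HTTP does not allow tampering.
    trusted-public-keys = [ "jungle.bsc.es:<public-key>" ];
  };
}
```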
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:17 +02:00
9c503fbefb
Clean all iptables rules on stop
...
Prevents the "iptables: Chain already exists." error by making sure that
no chain is left behind on stop. The ideal solution is to use
iptables-restore instead, which handles this correctly, but that would
require changing the NixOS firewall entirely.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:14 +02:00
51b6a8b612
Make nginx listen on all interfaces
...
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.
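Both changes can be sketched as follows. The virtual host name is hypothetical.

```nix
{
  # Hypothetical virtual host: listening on 0.0.0.0 instead of 127.0.0.1
  # makes the cache reachable from other hosts on the LAN.
  services.nginx.virtualHosts."hut".listen = [
    { addr = "0.0.0.0"; port = 80; }
  ];
  # Accept incoming HTTP traffic in the NixOS firewall.
  networking.firewall.allowedTCPPorts = [ 80 ];
}
```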
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-15 14:08:07 +02:00
52213d388d
Fix nginx /cache regex
...
`nix-serve` does not handle duplicate slashes in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```
This meant that the cache was not accessible via
`curl https://jungle.bsc.es/cache/nix-cache-info`, but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
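One way to avoid the doubled slash is sketched below. It uses a plain prefix location rather than the regex this commit actually fixes, and it assumes nix-serve listens on 127.0.0.1:5000 as in the transcript above. When `proxy_pass` carries a URI, nginx replaces the matched location prefix with it, so `/cache/nix-cache-info` is forwarded as `/nix-cache-info`.

```nix
{
  services.nginx.virtualHosts."jungle.bsc.es".locations."/cache/" = {
    # Trailing slashes on both sides: "/cache/" is replaced by "/", so
    # nix-serve never receives a "//" in the path.
    proxyPass = "http://127.0.0.1:5000/";
  };
}
```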
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-04-15 14:08:04 +02:00
edf744db8d
Add new GitLab runner for gitlab.bsc.es
...
It uses Docker containers based on Alpine with the host nix store, so we
can perform builds while keeping them isolated from the system.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:41:18 +02:00
b82894eaec
Remove SLURM partition all
...
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:27 +02:00
1c47199891
Add varcila user to hut and fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:25 +02:00
8738bd4eeb
Adjust fox slurm config after disabling SMT
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:23 +02:00
7699783aac
Add abonerib user to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:21 +02:00
fee1d4da7e
Don't move doc in web output
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:19 +02:00
b77ce7fb56
Add quickstart guide
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:17 +02:00
b4a12625c5
Reject SSH connections without SLURM allocation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:15 +02:00
302106ea9a
Add users to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:13 +02:00
96877de8d9
Add dalvare1 user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:11 +02:00
8878985be6
Add fox page in jungle website
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:08 +02:00
737578db34
Mount NVME disks in /nvme{0,1}
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:06 +02:00
88555e3f8c
Exclude fox from being suspended by slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:04 +02:00
feb2060be7
Use IPMI host names instead of IP addresses
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:01 +02:00
00999434c2
Add fox IPMI monitoring
...
Use agenix to store the credentials safely.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:59 +02:00
29d58cc62d
Add new fox machine
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:42 +02:00
587caf262e
Update PM GitLab tokens to new URL
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
2730404ca5
Fix MPICH build by fetching upstream patches too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
84db5e6fd6
Fix papermod theme in website for new hugo
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
f4f34a3159
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
→ 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
→ 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
91b8b4a3c5
Set nixpkgs to track nixos-24.11
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
6cad205269
Add script to monitor GPFS
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:07 +01:00
c57bf76969
Add BSC machines to ssh config
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:51 +01:00
ad4b615211
Collect statistics from logged users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:48 +01:00
b4518b59cf
Add custom GPFS exporter for MN5
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:46 +01:00
45dc4124a3
Remove exception to fetch task endpoint
...
It causes the request to go to the website rather than the Gitea
service.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:43 +01:00
bdfe9a48fd
Use SSD for boot, then switch to NVME
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:40 +01:00
1b337d31f8
Use NVME as root
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:37 +01:00
717cd5a21e
Keep host header for Grafana requests
...
This was breaking requests due to CSRF check.
See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:32 +01:00
def5955614
Ignore logging requests from the gitea runner
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:28 +01:00
0e3c975cb5
Log the client IP not the proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:22 +01:00
93189a575e
Ignore misc directory
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:19 +01:00
36592c44eb
Create paste directories in /ceph/p
...
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.
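A minimal sketch of such a service follows. The unit name and the user list are hypothetical; the real module presumably derives the users from the configuration.

```nix
{
  systemd.services.paste-dirs = {
    description = "Create per-user paste directories in /ceph/p";
    wantedBy = [ "multi-user.target" ];
    # Only run once the remote filesystems (including /ceph) are mounted.
    wants = [ "remote-fs.target" ];
    after = [ "remote-fs.target" ];
    serviceConfig.Type = "oneshot";
    script = ''
      # Hypothetical user list.
      for user in rarias abonerib; do
        mkdir -p "/ceph/p/$user"
        # "user:" also sets the group to the user's login group.
        chown "$user:" "/ceph/p/$user"
      done
    '';
  };
}
```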
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:16 +01:00
a34e3752a2
Add paste documentation in jungle website
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:13 +01:00
0d2dea94fb
Add p command to paste files
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:10 +01:00
7f539d7e06
Use nginx to serve website and other services
...
Instead of using multiple tunnels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths, which is not possible
otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:07 +01:00
f8ec090836
Mount the NVME disk in /nvme
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:22:58 +01:00
9a9161fc55
Delay nix-gc until /home is mounted
...
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.
See: #79
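The ordering can be sketched as follows, assuming `nix.gc.automatic` is enabled, which creates the `nix-gc` unit. Adding `RequiresMountsFor` on top of the target ordering is our assumption, not necessarily what this commit does.

```nix
{
  systemd.services.nix-gc = {
    # Order the collector after the remote filesystems...
    after = [ "remote-fs.target" ];
    # ...and pull in the mount for /home before starting, so the gcroots
    # that point into /home are not considered stale.
    unitConfig.RequiresMountsFor = "/home";
  };
}
```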
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:45:30 +02:00
1a0cf96fc4
Add dbautist user with access to hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:42:02 +02:00
4bd1648074
Set the serial console to ttyS1 in raccoon
...
Apparently the ttyS0 console doesn't exist but ttyS1 does:
raccoon% sudo stty -F /dev/ttyS0
stty: /dev/ttyS0: Input/output error
raccoon% sudo stty -F /dev/ttyS1
speed 9600 baud; line = 0;
-brkint -imaxbel
The dmesg line agrees:
00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:56 +02:00
15b114ffd6
Remove setLdLibraryPath and driSupport options
...
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".
See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:53 +02:00
dd6d8c9735
Add documentation section about GRUB chain loading
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:47 +02:00
e15a3867d4
Add 10 min shutdown jitter to avoid spikes
...
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.
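The jitter corresponds to systemd's `RandomizedDelaySec`. A minimal sketch is below; the unit name and the calendar time are hypothetical, since the commit does not show them.

```nix
{
  # Hypothetical unit that powers the node off when triggered.
  systemd.services.power-off = {
    description = "Scheduled power off";
    serviceConfig.Type = "oneshot";
    # systemctl is on the default PATH of NixOS services.
    script = "systemctl poweroff";
  };
  systemd.timers.power-off = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "2024-08-02 10:00:00"; # hypothetical date and time
      # Each node waits a random delay of up to 10 minutes.
      RandomizedDelaySec = "10min";
    };
  };
}
```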
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:44 +02:00
5cad208de6
Don't mount the nix store in owl nodes
...
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:42 +02:00
c8687f7e45
Emulate other architectures in owl nodes too
...
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.
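In NixOS this kind of emulation is usually enabled with `boot.binfmt.emulatedSystems`. The exact list of systems used on the owl nodes is an assumption.

```nix
{
  # Registers QEMU user-mode emulators via binfmt_misc, so RISC-V programs
  # run during a cross build can execute on the x86_64 host.
  boot.binfmt.emulatedSystems = [ "riscv64-linux" ];
}
```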
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:39 +02:00
d988ef2eff
Program shutdown for August 2nd for all machines
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:36 +02:00
b07929eab3
Enable debuginfod daemon in owl nodes
...
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:30 +02:00
b3e397eb4c
Set gitea and grafana log level to warn
...
Prevents filling the journal logs with information messages.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:27 +02:00
5ad2c683ed
Set default SLURM job time limit to one hour
...
Prevents endless jobs from running forever, while still allowing users
to request a larger time limit.
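A sketch of how a default limit can be set per partition via the NixOS SLURM module follows. The partition line itself is hypothetical.

```nix
{
  services.slurm.partitionName = [
    # Hypothetical partition: jobs submitted without --time get 1 hour,
    # but users may request more, up to MaxTime.
    "owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
  ];
}
```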
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
1f06f0fa0c
Allow other jobs to run in unused cores
...
The current select mechanism was also using memory as a consumable
resource, and by default only 1 MiB is configured per node. As each job
already requests 1 MiB, no other job can run on the same node.
As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
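In slurm.conf terms this means selecting on cores only. A minimal sketch is below, assuming the NixOS module does not already set these keys.

```nix
{
  services.slurm.extraConfig = ''
    SelectType=select/cons_tres
    # Only cores are consumable (no CR_Core_Memory), so several jobs can
    # share a node as long as there are free cores.
    SelectTypeParameters=CR_Core
  '';
}
```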
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
8ca1d84844
Use authentication tokens for PM GitLab runner
...
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead. Older tokens and
runners are also removed, as they are no longer used.
With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.
See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:16 +02:00
998f599be3
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
→ 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
→ 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:13 +02:00
fcfc6ac149
Allow ptrace to any process of the same user
...
Allows users to attach GDB to their own processes, without having to
start the program under GDB. It is only available in compute nodes; the
storage nodes keep the restricted settings.
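The behaviour described corresponds to Yama's `ptrace_scope` 0. Here is a minimal sketch, which would only be included in the compute node configuration.

```nix
{
  # 0 = classic ptrace permissions: a process may trace any other process
  # running under the same UID.
  boot.kernel.sysctl."kernel.yama.ptrace_scope" = 0;
}
```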
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:09 +02:00
6e87130166
Add abonerib user to hut, raccoon, owl1 and owl2
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:07 +02:00
06f9e6ac6b
Grant rpenacob access to owl1 and owl2 nodes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:05 +02:00
da07aedce2
Access private repositories via hut SSH proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:03 +02:00
61427a8bf9
Set the default proxy to point to hut
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:56 +02:00
958ad1f025
Allow incoming traffic to hut proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:23 +02:00
1c5f3a856f
eudy: koro: fcs: Fix fcs unprotected cpuid all
...
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-07-17 11:40:20 +02:00
4e2b80defd
Add support for armv7 emulation in hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:48 +02:00
1c8efd0877
Monitor raccoon machine via IPMI
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:32 +02:00
4c5e85031b
Move vlopez user to jungleUsers for koro host
...
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:39 +02:00
5688823fcc
Add raccoon motd file
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:38 +02:00
72faf8365b
Split xeon specific configuration from base
...
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:37 +02:00
0e22d6def8
Control user access to each machine
...
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
restricts each user so they can only access the listed hosts.
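Here is a minimal sketch of how such an option could work. It is not the repository's actual module: the loose typing and the filtering logic are assumptions. With it, `users.jungleUsers.alice = { isNormalUser = true; hosts = [ "hut" "fox" ]; };` would only create `alice` on hut and fox.

```nix
{ config, lib, ... }:
let
  hostName = config.networking.hostName;
  # Keep only the users whose `hosts` list contains this machine.
  allowed = lib.filterAttrs (_: u: lib.elem hostName u.hosts) config.users.jungleUsers;
in
{
  # Hypothetical, loosely typed option: each value is a users.users entry
  # plus a `hosts` list.
  options.users.jungleUsers = lib.mkOption {
    type = lib.types.attrsOf lib.types.attrs;
    default = { };
  };

  # Drop the extra `hosts` attribute before handing the users to users.users.
  config.users.users = lib.mapAttrs (_: u: removeAttrs u [ "hosts" ]) allowed;
}
```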
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:34 +02:00
22cc1d33f7
Add PostgreSQL DB for performance test results
...
The database will hold the performance results of the execution of the
benchmarks. For now, we follow the same setup as on knights3.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:24 +02:00
15085c8a05
Enable Grafana email alerts
...
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 15:57:38 +02:00
06748dac1d
Enable mail notification in Gitea
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:49 +02:00
63851306ac
Add msmtp to send notifications via email
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:20 +02:00
2bdc793c8c
Allow Ceph traffic to lake2
2024-05-02 17:43:48 +02:00
85d1c5e34c
Fix meta in posts entries
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:37 +02:00
e6b7af5272
Fix bogus separator
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:34 +02:00
c0ae8770bc
Manually add links to the menu
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:32 +02:00
5b51e8947f
Add link to Gitea in the website
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:28 +02:00
db2c6f7e45
Collect Gitea metrics in Prometheus
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:25 +02:00
8e8f9e7adb
Add Gitea service
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:31:51 +02:00
d2adc3a6d3
Add firewall rules for Ceph and monitoring
...
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.
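A rule of this kind can be sketched with `networking.firewall.extraCommands`. The port (9100, the node exporter default) and the source address are illustrative assumptions, not the rules actually added.

```nix
{
  # Hypothetical rule: accept node exporter scrapes on port 9100 only from
  # one monitoring host, instead of opening the port to everyone.
  networking.firewall.extraCommands = ''
    iptables -A nixos-fw -p tcp -s 10.0.40.7 --dport 9100 -j nixos-fw-accept
  '';
}
```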
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:11 +02:00
76cd9ea47f
Add workaround for MPICH 4.2.0
...
See: https://github.com/pmodels/mpich/issues/6946
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:08 +02:00
2f851bc216
Fix SLURM bug in rank integer sign expansion
...
See: https://bugs.schedmd.com/show_bug.cgi?id=19324
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:05 +02:00
834d3187e5
Merge pmix outputs for MPICH
...
MPICH expects headers and libraries to be present in the same directory.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:03 +02:00
49be0f208c
Remove nixseparatedebuginfod input
...
It has been integrated in nixpkgs, so it is no longer required.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:58 +02:00
fb23b41dae
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
→ 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
→ 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
→ 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
→ 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
→ 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:29 +02:00
005a67deaf
Use google.com probe instead of bsc.es
...
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-05 16:52:21 +01:00
f8097cb5cb
Add another HTTPS probe for bsc.es
...
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.
Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 12:26:56 +01:00
ff792f5f48
Move slurm client in a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00
5c48b43ae0
Enable public-inbox at jungle.bsc.es/lists
...
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.
The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.
We also edited the CSS to improve the readability and have larger fonts
by default.
The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-15 11:18:08 +01:00
b299ead00b
Monitor https://pm.bsc.es/gitlab/ too
...
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.
Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-05 09:56:28 +01:00
a92432cf5a
Enable nixseparatedebuginfod module
...
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-04 11:04:52 +01:00
82f5d828c2
Use tmpfs in /tmp
...
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
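On recent NixOS releases this is a single option; older releases called it `boot.tmpOnTmpfs`. A minimal sketch:

```nix
{
  # Mount a tmpfs on /tmp, so builds run in RAM and /tmp starts empty on
  # every boot.
  boot.tmp.useTmpfs = true;
}
```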
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-28 12:25:50 +01:00
35a94a9b02
Enable runners for pm.bsc.es/gitlab too
...
The old runners for the PM GitLab were disabled in the configuration
during the last outage, but they kept working until the node was
rebooted. With this change we enable the runners for both PM and
gitlab.bsc.es.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 14:45:23 +01:00
b6bd31e159
Remove complete ceph package from hut
...
Only the ceph-client is needed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:54 +01:00
1d4badda5b
Fix warning in slurm exporter using vendorHash
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:50 +01:00
bd5214a3b9
Remove old Ceph package overlay
...
The Ceph package is now integrated in upstream nixpkgs.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:47 +01:00
c32f6dea97
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
→ 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
→ 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:57:44 +01:00
dd341902fc
BSC packages are no longer in bsc attribute
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
190e273112
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
268807d1d0
Switch bscpkgs URL to sourcehut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
2953080fb8
Monitor anella instead of gw.bsc.es
...
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
9871517be2
Add ICMP probes
...
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.
In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
736eacaac5
Enable proxy for Grafana too
...
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.
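A minimal sketch of per-service proxy variables follows. The proxy URL is a placeholder, not the one used in jungle.

```nix
{
  # Placeholder proxy URL; the variables only affect the grafana unit, not
  # the rest of the system.
  systemd.services.grafana.environment = {
    http_proxy = "http://proxy.example:3128";
    https_proxy = "http://proxy.example:3128";
  };
}
```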
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
0e66aad099
Make blackbox exporter use the proxy
...
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, forwarding has not been
re-enabled in the login node, so it is better not to rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
67a4905a0a
Don't log SLURM connection attempts from ssfhead
2023-10-06 15:22:04 +02:00
d52d22e0db
Add docker runner too
2023-10-06 15:17:07 +02:00
42920c2521
Monitor gitlab.bsc.es too
2023-10-06 15:17:07 +02:00
4acd35e036
Monitor PM webpage via blackbox
2023-10-06 15:17:07 +02:00
621d20db3a
Temporarily disable pm runners
2023-10-06 15:17:07 +02:00
0926f6ec1f
Add runner for gitlab.bsc.es
2023-10-06 15:17:07 +02:00
61646cb3bd
Allow anonymous access to grafana
2023-09-22 10:51:30 +02:00
c0066c4744
Remove user/group when using DynamicUsers
2023-09-22 10:13:06 +02:00
ffd0593f51
Set the SLURM_CONF variable
2023-09-21 22:22:00 +02:00
f49ae0773e
Enable slurm-exporter service
2023-09-21 21:40:02 +02:00
8fa3fccecb
Add prometheus-slurm-exporter package
2023-09-21 21:34:18 +02:00
9ee7111453
Document the hut shared nix store for SLURM
2023-09-21 13:51:42 +02:00
8de3d2b149
Mount the hut nix store for SLURM jobs
2023-09-20 19:38:43 +02:00
bc62e28ca3
Enable direnv integration
2023-09-20 09:32:58 +02:00
d612a5453c
Add System Integration Service Guide document
2023-09-19 15:12:59 +02:00
653d411b9e
Remove bscpkgs from the registry and nixPath
...
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead, use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 12:00:33 +02:00
51c57dbc41
Add bscpkgs and nixpkgs top level attributes
...
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 12:00:33 +02:00
33cd40160e
Use hut packages as the default package set
...
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-15 12:00:28 +02:00
a1e8cfea47
Don't fetch registry flakes from the net
2023-09-15 12:00:28 +02:00
5d72ee3da3
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2023-09-15 11:50:47 +02:00
fdc6445d47
Revert "Update slurm to 23.02.05.1"
...
This reverts commit aaefddc44a.
2023-09-14 15:46:18 +02:00
e88805947e
Open ports in firewall of compute nodes
2023-09-14 15:45:43 +02:00
aaefddc44a
Update slurm to 23.02.05.1
2023-09-13 17:44:24 +02:00
d9d249411d
Monitor storage nodes via IPMI too
2023-09-13 15:57:13 +02:00
c07f75c6bb
Specify the space available in /ceph
2023-09-13 14:19:59 +02:00
8d449ba20c
Add update post to website
2023-09-12 18:13:38 +02:00
10ca572aec
Enable fstrim service
2023-09-12 16:39:45 +02:00
75b0f48715
Serve the nix store from hut
2023-09-12 12:19:43 +02:00
19a451db77
Add encrypted munge key with agenix
2023-09-08 19:05:45 +02:00
ec9be9bb62
Remove unused large port hole in firewall
2023-09-08 18:22:48 +02:00
7ddd1977f3
Make exporters listen in localhost only
2023-09-08 18:13:04 +02:00
7050c505b5
Allow only some ports for srun
2023-09-08 17:51:37 +02:00
033a1fe97b
Block ssfhead from reaching our slurm daemon
2023-09-08 17:36:28 +02:00
77cb3c494e
Poweroff idle slurm nodes after 1 hour
2023-09-08 16:49:53 +02:00
6db5772ac4
Add IB and IPMI node host names
2023-09-08 13:21:37 +02:00
3e347e673c
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2023-09-07 11:13:45 +02:00
dca274d020
Unlock ovni gitlab runners
2023-09-05 16:59:45 +02:00
c33909f32f
Update email contact to jungle mail list
2023-09-05 16:10:58 +02:00
64e856e8b9
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
→ 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
02f40a8217
Add agenix to all nodes
2023-09-04 22:10:43 +02:00
77d43b6da9
Add agenix module to ceph
2023-09-04 22:07:07 +02:00
ab55aac5ff
Remove old secrets
2023-09-04 22:04:32 +02:00
9b5bfbb7a3
Mount /ceph in owl1 and owl2
2023-09-04 22:00:36 +02:00
a69a71d1b0
Warn about the owl2 omnipath device
2023-09-04 22:00:17 +02:00
98374bd303
Clean owl2 configuration
2023-09-04 21:59:56 +02:00
3b6be8a2fc
Move the ceph client config to an external module
2023-09-04 21:59:04 +02:00
2bb366b9ac
Reorganize secrets and ssh keys
...
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
2d16709648
Add anavarro user
2023-09-04 16:00:01 +02:00
9344daa31c
Set zsh inc_append_history option
2023-09-03 16:57:53 +02:00
80c98041b5
Set zsh shell for rarias
2023-09-03 16:46:27 +02:00
3418e57907
Enable zsh and fix key bindings
2023-09-03 16:42:04 +02:00
6848b58e39
Keep a log over time with the config commits
2023-09-03 00:02:14 +02:00
13a70411aa
Configure bscpkgs.nixpkgs to follow nixpkgs
2023-09-02 23:37:59 +02:00
f9c77b433a
Store nixos config in /etc/nixos/config.rev
2023-09-02 23:37:11 +02:00
9d487845f6
Enable binary emulation for other architectures
2023-08-31 17:27:08 +02:00
3c99c2a662
Enable watchdog
2023-08-30 16:32:17 +02:00
7d09108c9f
Enable all osd on boot in lake2
2023-08-30 16:32:17 +02:00
0f0a861896
Scrape lake2 too
2023-08-29 12:33:26 +02:00
beb0d5940e
Also enable monitoring in lake2
2023-08-29 12:29:41 +02:00
70321ce237
Scrape metrics from bay
2023-08-29 11:58:00 +02:00
5bd1d67333
Add monitoring in the bay node
2023-08-29 11:53:32 +02:00
fad9df61e1
Add fio tool
2023-08-29 11:27:50 +02:00
d2a80c8c18
Add ceph tools in hut too
2023-08-28 17:58:21 +02:00
599613d139
Switch ceph logs to journal
2023-08-28 17:58:08 +02:00
ac4fa9abd4
Update ceph to 18.2.0 in overlay
2023-08-25 18:20:21 +02:00
cb3a7b19f7
Move pkgs overlay to overlay.nix
2023-08-25 18:12:00 +02:00
f5d6bf627b
Enable ceph osd daemons in lake2
2023-08-25 14:54:51 +02:00
f1ce815edd
Add the lake2 hostname to the hosts
2023-08-25 14:44:35 +02:00
a2075cfd65
Use the sda for lake2
2023-08-25 13:40:10 +02:00
8f1f6f92a8
Remove netboot module
2023-08-25 13:39:01 +02:00
3416416864
Disable pixiecore in hut for now
2023-08-25 13:21:00 +02:00
815888fb07
Add PXE helper
2023-08-25 12:05:33 +02:00
029d9cb1db
Enable netboot again for PXE
2023-08-24 19:08:23 +02:00
95fa67ede1
Specify the disk by path
2023-08-24 15:27:37 +02:00
a19347161f
Prepare lake2 config after bootstrap
...
The disk ID is different under NixOS.
2023-08-24 13:54:53 +02:00
58c1cc1f7c
Add lake2 bootstrap config
2023-08-24 12:30:46 +02:00
b06399dc70
Add section to enable serial console
2023-08-24 12:29:44 +02:00
077eece6b9
Add agenix to PATH in hut
2023-08-23 17:42:50 +02:00
b3ef53de51
Store ceph secret key in age
...
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:26:44 +02:00
e0852ee89b
Add rarias key for secrets
2023-08-23 17:15:26 +02:00
dfffc0bdce
Add ceph metrics to prometheus
2023-08-22 16:33:55 +02:00
8257c245b1
Mount the ceph filesystem in hut
2023-08-22 16:15:46 +02:00
cd5853cf53
Add ceph config in bay
2023-08-22 15:58:48 +02:00
b677b827d4
Add the bay host name
2023-08-22 15:56:09 +02:00
b1d5185cca
Remove netboot and fixes
2023-08-22 12:12:15 +02:00
a7e66e2246
Add bay node
2023-08-22 12:12:15 +02:00
480c97e952
Update flake
2023-08-22 11:28:54 +02:00
f8fb5fa4ff
Monitor power from other nodes via LAN
2023-08-22 11:28:54 +02:00
acf9b71f04
Increase prometheus retention time to one year
2023-08-22 11:28:54 +02:00
bf692e6e4e
Don't set all_proxy
2023-08-22 11:28:54 +02:00
c242b65e47
Update nixpkgs to fix docker problem
2023-07-28 14:24:51 +02:00
55d6c17776
Allow access to devices for node_exporter
2023-07-28 13:55:35 +02:00
14b173f67e
GRUB version no longer needed
2023-07-27 17:22:20 +02:00
b9001cdf7d
Upgrade flake: nixpkgs, bscpkgs and agenix
2023-07-27 17:19:17 +02:00
f892d43b47
Kill slurmd remaining processes on upgrade
2023-07-27 14:49:20 +02:00
d9e9ee6e3a
Add details to request access in the web
2023-07-25 16:07:22 +02:00
79adbe76a8
koro: Add vlopez user
2023-07-21 13:00:43 +02:00
66fb848ba8
Add koro node
2023-07-21 13:00:08 +02:00
40b1a8f0df
eudy: Add fcsv3 and intermediate versions for testing
2023-07-21 11:27:51 +02:00
a0b9d10b14
eudy: Enable memory overcommit
2023-07-21 11:27:51 +02:00
4c309dea2f
eudy: disable all cpu mitigations
2023-07-21 11:27:51 +02:00
b3a397eee4
Add jungle.bsc.es hugo website
2023-07-21 10:52:23 +02:00
7c1fe1455b
Enable NTP using the BSC time server
2023-06-30 14:02:15 +02:00
2d4b178895
Add the ssfhead node as gateway
2023-06-30 14:01:35 +02:00
4dd25f2f89
Use our host names first by default
2023-06-23 16:22:18 +02:00
6dcd9d8144
Add DNS tools to resolve hosts
2023-06-23 16:15:45 +02:00
31be81d2b1
Lower perf_event_paranoid to -1
2023-06-23 16:01:27 +02:00
826cfdf43f
Set perf paranoid to 0 by default
2023-06-21 16:24:19 +02:00
a1f258c5ce
Add perf to packages
2023-06-21 15:41:06 +02:00
1c1d3f3231
Allow srun to specify the cpu binding
...
The task/affinity plugin needs to be selected.
2023-06-21 13:16:23 +02:00
623d46c03f
Move authorized keys to users.nix
2023-06-20 14:08:34 +02:00
518a4d6af3
Add rpenacob user
2023-06-20 12:54:26 +02:00
60077948d6
Add osumb to the system packages
2023-06-16 19:22:41 +02:00
c76bfa7f86
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2023-06-16 18:33:54 +02:00
6c10933e80
Set mpi to mpich by default in bscpkgs
2023-06-16 18:26:51 +02:00
6402605b1f
Add missing parameter to extend
2023-06-16 18:26:51 +02:00
1724535495
Use explicit order in overlays
2023-06-16 18:26:51 +02:00
5b41670f36
Replace mpi inside bsc attribute
2023-06-16 18:26:51 +02:00
ab04855382
Add mpich overlay
2023-06-16 18:26:51 +02:00
684d5e41c5
Add comments in slurm config
2023-06-16 18:26:50 +02:00
316ea18e24
Add eudy host key to known hosts
2023-06-16 17:29:48 +02:00
c916157fcc
Rename xeon08 to eudy
...
From Eudyptula, a little penguin.
2023-06-16 17:16:05 +02:00
4e9409db10
Update rebuild script for all nodes
2023-06-16 12:13:07 +02:00
94320d9256
Add ssh host keys
2023-06-16 12:01:12 +02:00
9f5941c2be
Set the name of the slurm cluster to jungle
2023-06-16 12:00:54 +02:00
fba0f7b739
Change owl hostnames
2023-06-16 11:42:39 +02:00
2e95281af5
Add owl and all partition
2023-06-16 11:34:00 +02:00
f4ac9f3186
Simplify flake and expose host pkgs
...
The configuration of the machines is now moved to m/
2023-06-16 11:31:31 +02:00
f787343f29
Rename xeon07 to hut
2023-06-14 17:28:40 +02:00
70304d26ff
Remove profiles older than 30 days with gc
2023-06-14 17:28:39 +02:00
76c10ec22e
Add ncdu to system packages
2023-06-14 17:28:39 +02:00
011e8c2bf8
Move arocanon user from xeon08 to common
2023-06-14 16:22:43 +02:00
c1f138a9c1
xeon08: Add config for kernel non-voluntary preemption
2023-06-14 16:17:33 +02:00
1552eeca12
xeon08: Add perf
2023-06-14 15:42:20 +02:00
8769f3d418
xeon08: Enable lttng lockdep tracepoints
2023-06-14 15:42:20 +02:00
a4c254fcd6
xeon08: Add lttng module and tools
2023-06-14 15:42:20 +02:00
24fb1846d2
Serve grafana in https://jungle.bsc.es/grafana
2023-05-31 18:12:14 +02:00
5e77d0b86c
Add tree command
2023-05-31 18:11:34 +02:00
494fda126c
Add file to system packages
2023-05-31 18:11:34 +02:00
5cfa2f9611
Add gnumake to system packages
2023-05-31 18:11:34 +02:00
9539a24bdb
Add cmake to system packages
2023-05-31 18:11:34 +02:00
98c4d924dd
Add ix to common packages
2023-05-31 18:11:34 +02:00
7aae967c65
Improve documentation
2023-05-26 11:38:27 +02:00
49f7edddac
Add gitignore
2023-05-26 11:38:27 +02:00
2f055d9fc5
Set intel_pstate=passive and disable frequency boost
2023-05-26 11:38:26 +02:00
108abffd2a
Add xeon08 basic config
2023-05-26 11:38:26 +02:00
4c19ad66e3
Add nixos-config.nix to easily enable nix repl
2023-05-26 11:29:59 +02:00
19c01aeb1d
Automatically resume restarted nodes in SLURM
2023-05-18 12:48:04 +02:00
fc90b40310
Allow public dashboards in grafana
2023-05-09 18:53:31 +02:00
81de0effb1
Add hal ssh key
2023-05-09 18:37:38 +02:00
5ce93ff85a
Increase the number of CPUs to 56 for nOS-V docker
2023-05-02 17:47:57 +02:00
c020b9f5d6
Allow 5 concurrent builds in the gitlab-runner
2023-05-02 17:38:10 +02:00
f47734b524
Simplify bash prompt
2023-04-28 18:15:04 +02:00
ca3a7d98f5
Roll back to bash as default shell
...
Zsh doesn't behave properly; it needs further configuration.
2023-04-28 17:59:19 +02:00
0d5609ecc2
Use pmix by default in slurm
2023-04-28 17:07:48 +02:00
818edccb34
Increase locked memory to 1 GiB
2023-04-28 12:34:51 +02:00
2815f5bcfd
Use the latest kernel
2023-04-28 11:51:38 +02:00
c1bbbd7793
Disable osnoise and hwlat tracer for now
...
Reuse nix cache to avoid rebuilding the kernel.
2023-04-28 11:19:47 +02:00
aa1dd14b62
Update nixpkgs to nixos-unstable
2023-04-28 11:18:37 +02:00
399103a9b4
Update nixpkgs
2023-04-28 11:13:46 +02:00
74639d3ece
Update ib interface name in xeon02
...
It seems to be plugged into another PCI port.
2023-04-27 18:29:32 +02:00
613a76ac29
Add steps in install documentation
2023-04-27 17:30:53 +02:00
c3ea8864bb
Add minimal netboot module to build kexec image
2023-04-27 16:36:15 +02:00
919f211536
Add xeon02 configuration
2023-04-27 16:28:12 +02:00
141d77e2b6
Refactor slurm configuration into compute/control
2023-04-27 16:27:04 +02:00
44fcb97ec7
Lock flakes and add inputs
2023-04-27 13:52:59 +02:00
543983e9f3
Test flakes
2023-04-26 14:27:02 +02:00
95bbeeb646
Enable slurm in xeon01
2023-04-26 14:10:36 +02:00
de2af79810
Use xeon07 as control machine
2023-04-26 14:10:36 +02:00
b9aff1dba5
Remove xeon07 overlay to load upstream slurm
2023-04-26 14:10:36 +02:00
7da979bed2
Add script to rebuild configuration
2023-04-26 14:09:23 +02:00
cfe37640ea
Add configuration for xeon01
2023-04-26 11:44:00 +00:00
096e407571
Load overlays from /config
2023-04-26 11:44:00 +00:00
ae31b546e7
Move net.nix to common
2023-04-26 11:44:00 +00:00
c3a2766bb7
Remove host specific network options from net.nix
2023-04-26 11:44:00 +00:00
b568bb36d4
Move ssh.nix to common
2023-04-26 11:44:00 +00:00
55f784e6b7
Move overlays.nix to common
2023-04-26 11:44:00 +00:00
dfab84b0ba
Move users.nix to common
2023-04-26 11:44:00 +00:00
8f66ba824a
Move common options from configuration.nix
2023-04-26 11:44:00 +00:00
79bd4398f3
Move the remaining hw config to common
2023-04-26 11:44:00 +00:00
b44afdaaa1
Move boot config to common/boot.nix
2023-04-26 11:44:00 +00:00
9528fab3ef
Move filesystems config to common/fs.nix
2023-04-26 11:44:00 +00:00
7e82885d84
Use partition labels for / and swap
2023-04-26 11:44:00 +00:00
57ed0cf319
Move fs.nix to common
2023-04-26 11:44:00 +00:00
b043ee3b1d
Move boot.nix to common
2023-04-26 11:44:00 +00:00
9e3bdaabb6
Move disk selection to configuration.nix
2023-04-26 11:44:00 +00:00
77f72ac939
Add common directory
2023-04-26 11:44:00 +00:00
fa25a68571
Add server board documentation
2023-04-24 10:10:08 +02:00
Rodrigo Arias
ea0f406849
Add BSC SSF slides
2023-04-24 09:47:11 +02:00
Rodrigo Arias
9df6be1b6b
Add SEL troubleshooting guide
2023-04-21 13:31:11 +02:00