176 Commits

Author SHA1 Message Date
aef34f8ba0 Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
cf722cbf0d Add abonerib user to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d27312fdac Don't move doc in web output
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d25b9a3c51 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
9533cb008c Add users to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
07f8a1763f Add dalvare1 user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
c57802ec54 Mount NVME disks in /nvme{0,1}
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
ea1d93c058 Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
2e5379b9bf Use IPMI host names instead of IP addresses
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
a808dd8a84 Add fox IPMI monitoring
Use agenix to store the credentials safely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
364f05e7af Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
7fd7781479 Add script to monitor GPFS
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
539a72aad6 Add BSC machines to ssh config
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
3988339d84 Collect statistics from logged users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
5a1744c28d Add custom GPFS exporter for MN5
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
de8716deaa Remove exception to fetch task endpoint
It causes the request to go to the website rather than the Gitea
service.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
80fb0fe3e5 Use SSD for boot, then switch to NVME
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
6ed625fe62 Use NVME as root
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
c2e88b0573 Keep host header for Grafana requests
This was breaking requests due to CSRF check.

See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
a194dbdd5a Ignore logging requests from the gitea runner
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
348582cdf0 Log the client IP not the proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
e6dc87ad82 Create paste directories in /ceph/p
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
fb506460ba Add p command to paste files
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
0824de661d Use nginx to serve website and other services
Instead of using multiple tunels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not posible
otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
fb38a8cbef Mount the NVME disk in /nvme
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
574f2bdc4f Delay nix-gc until /home is mounted
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.

See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
935553431e Add dbautist user with access to hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
4489eac910 Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
3f195b6e74 Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
2182d117a8 Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
60af22125b Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
fed20a536b Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
123598cd03 Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
368a0a7637 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
1e67cc4d12 Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
0b6f2e21b6 Set default SLURM job time limit to one hour
Prevents enless jobs from being left forever, while allow users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
1187f40f30 Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
6527afb1b2 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
706b9049fa Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
aa330aaf63 Add abonerib user to hut, raccon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
857196efb7 Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
cf0a41cbc4 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
e81f66d0c4 Set the default proxy to point to hut
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
dfa27a9cf2 Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
b25d378960 eudy: koro: fcs: Fix fcs unprotected cpuid all
smp_processor_id() was called in a preepmtible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:16 +02:00
75b11e5373 Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
383941ada9 Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
d93914f91d Move vlopez user to jungleUsers for koro host
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
c80433c2a7 Add raccoon motd file
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
f74ec2bce1 Split xeon specific configuration from base
To accomodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00