Commit Graph

184 Commits

Author SHA1 Message Date
d88ea6a515 Add raccoon node monitoring 2025-02-25 17:18:13 +01:00
1e10498f97 Add abonerib user to fox 2025-02-25 14:33:11 +01:00
ea49d762d1 MERGEME: Only expose proxy to docker 2025-02-17 16:12:46 +01:00
ab82757b42 MERGEME: Increase log 2025-02-17 15:18:59 +01:00
58ab131553 MERGEME: load cacert setup hook 2025-02-17 15:04:06 +01:00
e9cc635b8a MERGEME: Use system packages 2025-02-17 14:46:52 +01:00
bc8b79566b MERGEME: Add docker,hut tags 2025-02-17 14:34:29 +01:00
3f4e282113 MERGEME: Use docker extra hosts 2025-02-17 14:28:03 +01:00
c9f1712986 MERGEME: Disable debug in ipmi monitoring 2025-02-17 14:15:47 +01:00
6004a3351d Don't move doc in web output 2025-02-17 14:15:47 +01:00
93dc0aed33 Reject SSH connections without SLURM allocation 2025-02-17 14:15:47 +01:00
0cc1ae9fa7 Add users to fox 2025-02-17 14:15:47 +01:00
5ac9c70dd7 Add dalvare1 user 2025-02-17 14:15:47 +01:00
1fe44faa6d Mount NVME disks in /nvme{0,1} 2025-02-17 14:15:46 +01:00
b1fcd1e128 Exclude fox from being suspended by slurm 2025-02-17 14:15:46 +01:00
edf1f3d239 Use IPMI host names instead of IP addresses 2025-02-17 14:15:46 +01:00
878ee61734 Add fox IPMI monitoring
Use agenix to store the credentials safely.
2025-02-17 14:15:46 +01:00
f09556a26f Add new fox machine 2025-02-17 14:15:46 +01:00
98cac8b086 Add new GitLab runner for gitlab.bsc.es
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.
2025-02-17 14:15:37 +01:00
6cad205269 Add script to monitor GPFS
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:07 +01:00
c57bf76969 Add BSC machines to ssh config
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:51 +01:00
ad4b615211 Collect statistics from logged users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:48 +01:00
b4518b59cf Add custom GPFS exporter for MN5
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:46 +01:00
45dc4124a3 Remove exception to fetch task endpoint
It causes the request to go to the website rather than the Gitea
service.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:43 +01:00
bdfe9a48fd Use SSD for boot, then switch to NVME
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:40 +01:00
1b337d31f8 Use NVME as root
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:37 +01:00
717cd5a21e Keep host header for Grafana requests
This was breaking requests due to CSRF check.

See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:32 +01:00
def5955614 Ignore logging requests from the gitea runner
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:28 +01:00
0e3c975cb5 Log the client IP not the proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:22 +01:00
36592c44eb Create paste directories in /ceph/p
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:16 +01:00
0d2dea94fb Add p command to paste files
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:10 +01:00
7f539d7e06 Use nginx to serve website and other services
Instead of using multiple tunels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not posible
otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:23:07 +01:00
f8ec090836 Mount the NVME disk in /nvme
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 14:22:58 +01:00
9a9161fc55 Delay nix-gc until /home is mounted
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.

See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:45:30 +02:00
1a0cf96fc4 Add dbautist user with access to hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:42:02 +02:00
4bd1648074 Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:56 +02:00
15b114ffd6 Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:53 +02:00
e15a3867d4 Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:44 +02:00
5cad208de6 Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:42 +02:00
c8687f7e45 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:39 +02:00
d988ef2eff Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:36 +02:00
b07929eab3 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:30 +02:00
b3e397eb4c Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:27 +02:00
5ad2c683ed Set default SLURM job time limit to one hour
Prevents enless jobs from being left forever, while allow users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
1f06f0fa0c Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
8ca1d84844 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:16 +02:00
fcfc6ac149 Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:09 +02:00
6e87130166 Add abonerib user to hut, raccon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:07 +02:00
06f9e6ac6b Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:05 +02:00
da07aedce2 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:03 +02:00