Commit Graph

274 Commits

Author SHA1 Message Date
9a9161fc55 Delay nix-gc until /home is mounted
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.

See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:45:30 +02:00
1a0cf96fc4 Add dbautist user with access to hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-09-20 09:42:02 +02:00
4bd1648074 Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:56 +02:00
15b114ffd6 Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:53 +02:00
dd6d8c9735 Add documentation section about GRUB chain loading
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:47 +02:00
e15a3867d4 Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:44 +02:00
5cad208de6 Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:42 +02:00
c8687f7e45 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:39 +02:00
d988ef2eff Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:36 +02:00
b07929eab3 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:30 +02:00
b3e397eb4c Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:27 +02:00
5ad2c683ed Set default SLURM job time limit to one hour
Prevents enless jobs from being left forever, while allow users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
1f06f0fa0c Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
8ca1d84844 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:16 +02:00
998f599be3 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
  → 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
  → 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:13 +02:00
fcfc6ac149 Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:09 +02:00
6e87130166 Add abonerib user to hut, raccon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:07 +02:00
06f9e6ac6b Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:05 +02:00
da07aedce2 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:03 +02:00
61427a8bf9 Set the default proxy to point to hut
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:56 +02:00
958ad1f025 Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:23 +02:00
1c5f3a856f eudy: koro: fcs: Fix fcs unprotected cpuid all
smp_processor_id() was called in a preepmtible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-07-17 11:40:20 +02:00
4e2b80defd Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:48 +02:00
1c8efd0877 Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:32 +02:00
4c5e85031b Move vlopez user to jungleUsers for koro host
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:39 +02:00
5688823fcc Add raccoon motd file
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:38 +02:00
72faf8365b Split xeon specific configuration from base
To accomodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:37 +02:00
0e22d6def8 Control user access to each machine
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that only the user can only access those hosts.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:34 +02:00
22cc1d33f7 Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:24 +02:00
15085c8a05 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a reduntant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 15:57:38 +02:00
06748dac1d Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:49 +02:00
63851306ac Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:20 +02:00
2bdc793c8c Allow Ceph traffic to lake2 2024-05-02 17:43:48 +02:00
85d1c5e34c Fix meta in posts entries
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:37 +02:00
e6b7af5272 Fix bogus separator
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:34 +02:00
c0ae8770bc Manually add links to the menu
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:32 +02:00
5b51e8947f Add link to Gitea in the website
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:28 +02:00
db2c6f7e45 Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:25 +02:00
8e8f9e7adb Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:31:51 +02:00
d2adc3a6d3 Add firewall rules for Ceph and monitoring
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:11 +02:00
76cd9ea47f Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:08 +02:00
2f851bc216 Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:05 +02:00
834d3187e5 Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:03 +02:00
49be0f208c Remove nixseparatedebuginfod input
It has been integrated in nixpkgs, so is no longer required.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:58 +02:00
fb23b41dae flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
  → 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
  → 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
  → 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
    'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
  → 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
    'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
  → 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:29 +02:00
005a67deaf Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-05 16:52:21 +01:00
f8097cb5cb Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 12:26:56 +01:00
ff792f5f48 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00
5c48b43ae0 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing list.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-15 11:18:08 +01:00
b299ead00b Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-05 09:56:28 +01:00