77 Commits

Author SHA1 Message Date
ad26c63fa2 Clean all iptables rules on stop
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-11 10:23:26 +02:00
563dc575fd Make nginx listen on all interfaces
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-11 10:03:05 +02:00
097c7bc31f Fix nginx /cache regex
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```

This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-02-26 15:31:05 +01:00
17e42b3872 Add new GitLab runner for gitlab.bsc.es
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-24 13:00:54 +01:00
b1adbed3de Don't move doc in web output
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-14 16:36:57 +01:00
8766fd8439 Use IPMI host names instead of IP addresses
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 12:14:40 +01:00
b70d99f479 Add fox IPMI monitoring
Use agenix to store the credentials safely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 11:36:53 +01:00
23aa682816 Add script to monitor GPFS
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-14 12:01:00 +01:00
aa977ee62a Collect statistics from logged users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-11-14 12:21:13 +01:00
7b9d805d12 Add custom GPFS exporter for MN5
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-11-12 16:30:24 +01:00
4aa011ff85 Remove exception to fetch task endpoint
It causes the request to go to the website rather than the Gitea
service.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-22 16:13:01 +02:00
4b41b67d25 Use SSD for boot, then switch to NVME
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-21 14:28:17 +02:00
e3f6e67348 Use NVME as root
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-17 14:39:31 +02:00
129fa52e9b Keep host header for Grafana requests
This was breaking requests due to CSRF check.

See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-10-17 13:35:45 +02:00
0e1ea5d504 Ignore logging requests from the gitea runner
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 15:44:22 +02:00
95eef3b0c5 Log the client IP not the proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 15:24:38 +02:00
b978f12d19 Create paste directories in /ceph/p
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-20 11:19:30 +02:00
c1617266b6 Add p command to paste files
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-16 16:33:42 +02:00
83830dbfed Use nginx to serve website and other services
Instead of using multiple tunels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not posible
otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-16 16:33:34 +02:00
0bcac3bca4 Mount the NVME disk in /nvme
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-23 16:15:26 +02:00
3c1be2d4b4 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-19 17:53:10 +02:00
2cba78cee1 Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 13:39:16 +02:00
cd9032bca9 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-16 14:58:58 +02:00
9c686a846f Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-17 12:56:59 +02:00
b0cc9c959e Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-21 13:52:08 +02:00
c781a2262f Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-07 13:46:33 +02:00
cd3284d1b2 Split xeon specific configuration from base
To accomodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-03 09:20:11 +02:00
2f6673cb3e Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-30 13:35:58 +02:00
584fe927b6 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a reduntant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 13:54:06 +02:00
2abc1e8fca Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 18:54:38 +02:00
38255dfa0f Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:54:09 +02:00
17fc1b0c9a Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-29 11:22:45 +02:00
249d3e472f Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-26 16:52:52 +02:00
67bcf7b2a0 Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-29 09:57:18 +01:00
bd56c2340d Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 11:50:38 +01:00
df5a5e1668 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-09 11:14:34 +01:00
d982b45c26 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing list.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-07 11:08:15 +01:00
171f26e192 Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-01 12:17:50 +01:00
1c6e5d8f82 Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-01 19:57:04 +01:00
8c7d37859b Enable runners for pm.bsc.es/gitlab too
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we reboot the node. With
this change we enable the runners for both PM and gitlab.bsc.es.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-23 12:39:43 +01:00
c307fc9bb3 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-26 12:36:06 +02:00
6f5f234480 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-24 11:49:42 +02:00
1e9bc4086f Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-20 16:04:15 +02:00
734f52e87f Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-20 15:34:06 +02:00
72658ee5e6 Add docker runner too 2023-10-04 07:55:26 +02:00
cfa3e08e4b Monitor gitlab.bsc.es too 2023-10-03 09:45:13 +02:00
10101c631d Monitor PM webpage via blackbox 2023-10-03 08:58:07 +02:00
4d865d7a7e Temporarily disable pm runners 2023-09-28 14:14:41 +02:00
d9511dab22 Add runner for gitlab.bsc.es 2023-09-28 14:11:30 +02:00
c3ecba513d Allow anonymous access to grafana 2023-09-22 10:50:14 +02:00