Commit Graph

51 Commits

Author SHA1 Message Date
72faf8365b Split xeon specific configuration from base
To accomodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:37 +02:00
22cc1d33f7 Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:24 +02:00
15085c8a05 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a reduntant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 15:57:38 +02:00
06748dac1d Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:49 +02:00
63851306ac Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:20 +02:00
db2c6f7e45 Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:25 +02:00
8e8f9e7adb Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:31:51 +02:00
005a67deaf Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-05 16:52:21 +01:00
f8097cb5cb Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 12:26:56 +01:00
ff792f5f48 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00
5c48b43ae0 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing list.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-15 11:18:08 +01:00
b299ead00b Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-05 09:56:28 +01:00
a92432cf5a Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-04 11:04:52 +01:00
35a94a9b02 Enable runners for pm.bsc.es/gitlab too
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we reboot the node. With
this change we enable the runners for both PM and gitlab.bsc.es.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 14:45:23 +01:00
2953080fb8 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
9871517be2 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
736eacaac5 Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
0e66aad099 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
d52d22e0db Add docker runner too 2023-10-06 15:17:07 +02:00
42920c2521 Monitor gitlab.bsc.es too 2023-10-06 15:17:07 +02:00
4acd35e036 Monitor PM webpage via blackbox 2023-10-06 15:17:07 +02:00
621d20db3a Temporarily disable pm runners 2023-10-06 15:17:07 +02:00
0926f6ec1f Add runner for gitlab.bsc.es 2023-10-06 15:17:07 +02:00
61646cb3bd Allow anonymous access to grafana 2023-09-22 10:51:30 +02:00
f49ae0773e Enable slurm-exporter service 2023-09-21 21:40:02 +02:00
d9d249411d Monitor storage nodes via IPMI too 2023-09-13 15:57:13 +02:00
75b0f48715 Serve the nix store from hut 2023-09-12 12:19:43 +02:00
7ddd1977f3 Make exporters listen in localhost only 2023-09-08 18:13:04 +02:00
77cb3c494e Poweroff idle slurm nodes after 1 hour 2023-09-08 16:49:53 +02:00
dca274d020 Unlock ovni gitlab runners 2023-09-05 16:59:45 +02:00
02f40a8217 Add agenix to all nodes 2023-09-04 22:10:43 +02:00
ab55aac5ff Remove old secrets 2023-09-04 22:04:32 +02:00
3b6be8a2fc Move the ceph client config to an external module 2023-09-04 21:59:04 +02:00
2bb366b9ac Reorganize secrets and ssh keys
The agenix tools needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
9d487845f6 Enable binary emulation for other architectures 2023-08-31 17:27:08 +02:00
0f0a861896 Scrape lake2 too 2023-08-29 12:33:26 +02:00
70321ce237 Scrape metrics from bay 2023-08-29 11:58:00 +02:00
fad9df61e1 Add fio tool 2023-08-29 11:27:50 +02:00
d2a80c8c18 Add ceph tools in hut too 2023-08-28 17:58:21 +02:00
3416416864 Disable pixiecore in hut for now 2023-08-25 13:21:00 +02:00
815888fb07 Add PXE helper 2023-08-25 12:05:33 +02:00
077eece6b9 Add agenix to PATH in hut 2023-08-23 17:42:50 +02:00
b3ef53de51 Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:26:44 +02:00
e0852ee89b Add rarias key for secrets 2023-08-23 17:15:26 +02:00
dfffc0bdce Add ceph metrics to prometheus 2023-08-22 16:33:55 +02:00
8257c245b1 Mount the ceph filesystem in hut 2023-08-22 16:15:46 +02:00
f8fb5fa4ff Monitor power from other nodes via LAN 2023-08-22 11:28:54 +02:00
acf9b71f04 Increase prometheus retention time to one year 2023-08-22 11:28:54 +02:00
55d6c17776 Allow access to devices for node_exporter 2023-07-28 13:55:35 +02:00
2e95281af5 Add owl and all partition 2023-06-16 11:34:00 +02:00