jungle

Author	SHA1	Message	Date
Rodrigo Arias Mallo	5e6cf2b563	Set default SLURM job time limit to one hour Prevents endless jobs from being left forever, while allowing users to request a larger time limit.	2024-07-18 11:44:01 +02:00
Rodrigo Arias Mallo	29110d2d54	Allow other jobs to run in unused cores The current select mechanism was using the memory too as a consumable resource, which by default only sets 1 MiB per node. As each job already requests 1 MiB, it prevents other jobs from running. As we are not really concerned with memory usage, we only use the unused cores in the select criteria.	2024-07-18 11:19:03 +02:00
Rodrigo Arias Mallo	32c919d1fc	Use authentication tokens for PM GitLab runner Starting with GitLab 16, there is a new mechanism to authenticate the runners via authentication tokens, so use it instead. Older tokens and runners are also removed, as they are no longer used. With the new way of managing tokens, both the tags and the locked state are managed from the GitLab web page. See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html	2024-07-17 17:41:02 +02:00
Rodrigo Arias Mallo	e3985b28a0	Allow ptrace to any process of the same user Allows users to attach GDB to their own processes, without requiring running the program with GDB from the start.	2024-07-17 13:23:45 +02:00
Rodrigo Arias Mallo	9fe29b864a	Add abonerib user to hut, raccoon, owl1 and owl2	2024-07-17 13:23:45 +02:00
Rodrigo Arias Mallo	3ea7edf950	Grant rpenacob access to owl1 and owl2 nodes	2024-07-17 13:23:45 +02:00
Rodrigo Arias Mallo	53c200fbc5	Access private repositories via hut SSH proxy	2024-07-17 13:23:45 +02:00
Rodrigo Arias Mallo	f5ebf43019	Set the default proxy to point to hut	2024-07-17 13:23:29 +02:00
Rodrigo Arias Mallo	43e61a8da3	Allow incoming traffic to hut proxy	2024-07-17 12:56:59 +02:00
Aleix Roca Nonell	1c5f3a856f	eudy: koro: fcs: Fix fcs unprotected cpuid call smp_processor_id() was called in a preemptible context, which could invalidate the returned value. However, this was not harmful, because fcs threads in nosv are pinned. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-07-17 11:40:20 +02:00
Rodrigo Arias Mallo	4e2b80defd	Add support for armv7 emulation in hut Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:48 +02:00
Rodrigo Arias Mallo	1c8efd0877	Monitor raccoon machine via IPMI Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:32 +02:00
Rodrigo Arias Mallo	4c5e85031b	Move vlopez user to jungleUsers for koro host Access to other machines can be easily added into the "hosts" attribute without the need to replicate the configuration. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:39 +02:00
Rodrigo Arias Mallo	5688823fcc	Add raccoon motd file Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:38 +02:00
Rodrigo Arias Mallo	72faf8365b	Split xeon specific configuration from base To accommodate the raccoon knights workstation, some of the configuration pulled by m/common/main.nix has to be removed. To solve it, the xeon specific parts are placed into m/common/xeon.nix and only the common configuration is at m/common/base.nix. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:37 +02:00
Rodrigo Arias Mallo	0e22d6def8	Control user access to each machine The users.jungleUsers configuration option behaves like the users.users option, but defines the list attribute `hosts` for each user, which restricts each user so that they can only access those hosts. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:34 +02:00
Rodrigo Arias Mallo	22cc1d33f7	Add PostgreSQL DB for performance test results The database will hold the performance results of the execution of the benchmarks. We follow the same setup on knights3 for now. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:24 +02:00
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a redundant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	06748dac1d	Enable mail notification in Gitea Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:49 +02:00
Rodrigo Arias Mallo	63851306ac	Add msmtp to send notifications via email Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:20 +02:00
Rodrigo Arias Mallo	2bdc793c8c	Allow Ceph traffic to lake2	2024-05-02 17:43:48 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	8e8f9e7adb	Add Gitea service Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:31:51 +02:00
Rodrigo Arias Mallo	d2adc3a6d3	Add firewall rules for Ceph and monitoring The firewall was blocking the monitoring traffic from hut and the Ceph traffic among OSDs. The rules only allow connecting from the specific host that they are supposed to be coming from. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:25:11 +02:00
Rodrigo Arias Mallo	49be0f208c	Remove nixseparatedebuginfod input It has been integrated in nixpkgs, so is no longer required. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:24:58 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Aleix Roca Nonell	ff792f5f48	Move slurm client in a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-02-13 11:11:17 +01:00
Rodrigo Arias Mallo	5c48b43ae0	Enable public-inbox at jungle.bsc.es/lists The public-inbox service fetches emails from the sourcehut mailing lists and displays them on the web. The idea is to reduce the dependency on external services and add a secondary storage for the mailing lists in case sourcehut goes down or changes the current free plans. The service is available at https://jungle.bsc.es/lists/ and is open to the public. It currently mirrors the bscpkgs and jungle mailing lists. We also edited the CSS to improve the readability and have larger fonts by default. The service for public-inbox produced by NixOS is not well configured to fetch emails from an IMAP mail server, so we also manually edit the service file to enable the network. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-15 11:18:08 +01:00
Rodrigo Arias Mallo	b299ead00b	Monitor https://pm.bsc.es/gitlab/ too The GitLab instance is in the /gitlab endpoint and may fail independently of https://pm.bsc.es/. Cc: Víctor López <victor.lopez@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-05 09:56:28 +01:00
Aleix Roca Nonell	a92432cf5a	Enable nixseparatedebuginfod module The module is only enabled on Hut and Eudy because we noticed activity on the debuginfod service even if no debug session was active. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2023-12-04 11:04:52 +01:00
Rodrigo Arias Mallo	82f5d828c2	Use tmpfs in /tmp The /tmp directory was using the SSD disk which is not erased across boots. Nix will use /tmp to perform the builds, so we want it to be as fast as possible. In general, all the machines have enough space to handle large builds like LLVM. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-28 12:25:50 +01:00
Rodrigo Arias Mallo	35a94a9b02	Enable runners for pm.bsc.es/gitlab too The old runners for the PM gitlab were disabled in configuration in the last outage, but they remained working until we reboot the node. With this change we enable the runners for both PM and gitlab.bsc.es. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 14:45:23 +01:00
Rodrigo Arias Mallo	b6bd31e159	Remove complete ceph package from hut Only the ceph-client is needed. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 12:58:54 +01:00
Rodrigo Arias Mallo	dd341902fc	BSC packages are no longer in bsc attribute Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-09 13:40:48 +01:00
Rodrigo Arias Mallo	2953080fb8	Monitor anella instead of gw.bsc.es The target gw.bsc.es doesn't reply to our ICMP probes from hut. However, the anella hop in the tracepath is a good candidate to identify cuts between the login and the provider and between the provider and external hosts like Google or Cloudflare DNS. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-27 12:46:08 +02:00
Rodrigo Arias Mallo	9871517be2	Add ICMP probes These probes check if we can reach several targets via ICMP, which is not proxied, so they can be used to see if ICMP forwarding is working in the login node. In particular, we test if we can reach the Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping only from the intranet and the login node (ssfhead). Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 17:13:03 +02:00
Rodrigo Arias Mallo	736eacaac5	Enable proxy for Grafana too The alerts need to contact the slack endpoint, so we add the proxy environment variables to the grafana systemd service. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:56 +02:00
Rodrigo Arias Mallo	0e66aad099	Make blackbox exporter use the proxy By default it was trying to reach the targets using the default gateway, but since the electrical cut of 2023-10-20, the login node has not enabled forwarding again. So better if we don't rely on it. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:24 +02:00
Rodrigo Arias Mallo	67a4905a0a	Don't log SLURM connection attempts from ssfhead	2023-10-06 15:22:04 +02:00
Rodrigo Arias Mallo	d52d22e0db	Add docker runner too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	42920c2521	Monitor gitlab.bsc.es too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	4acd35e036	Monitor PM webpage via blackbox	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	621d20db3a	Temporarily disable pm runners	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	0926f6ec1f	Add runner for gitlab.bsc.es	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	61646cb3bd	Allow anonymous access to grafana	2023-09-22 10:51:30 +02:00
Rodrigo Arias Mallo	c0066c4744	Remove user/group when using DynamicUsers	2023-09-22 10:13:06 +02:00
Rodrigo Arias Mallo	ffd0593f51	Set the SLURM_CONF variable	2023-09-21 22:22:00 +02:00
Rodrigo Arias Mallo	f49ae0773e	Enable slurm-exporter service	2023-09-21 21:40:02 +02:00
Rodrigo Arias Mallo	8de3d2b149	Mount the hut nix store for SLURM jobs	2023-09-20 19:38:43 +02:00

141 Commits