jungle

Author	SHA1	Message	Date
Rodrigo Arias Mallo	9b1391a9f6	Add p command to paste files	2024-09-16 16:33:42 +02:00
Rodrigo Arias Mallo	c8ca5adf84	Enable nginx	2024-09-16 16:33:34 +02:00
Rodrigo Arias Mallo	43e4c60dd5	Mount the NVME disk in /nvme	2024-09-12 09:54:55 +02:00
Rodrigo Arias Mallo	f5d6f32ca8	Rename ceph mount points Use /ceph for cached ceph and /ceph-slow for uncached ceph.	2024-09-12 09:54:55 +02:00
Rodrigo Arias Mallo	8fccb40a7a	Add cached ceph FS mount point in /cache	2024-09-12 09:54:55 +02:00
Rodrigo Arias Mallo	4bd1648074	Set the serial console to ttyS1 in raccoon Apparently the ttyS0 console doesn't exist but ttyS1 does: raccoon% sudo stty -F /dev/ttyS0 stty: /dev/ttyS0: Input/output error raccoon% sudo stty -F /dev/ttyS1 speed 9600 baud; line = 0; -brkint -imaxbel The dmesg line agrees: 00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A The console configuration is then moved from base to xeon to allow changing it for the raccoon machine. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:56 +02:00
Rodrigo Arias Mallo	15b114ffd6	Remove setLdLibraryPath and driSupport options They have been removed from NixOS. The "hardware.opengl" group is now renamed to "hardware.graphics". See: `98cef4c273` Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:53 +02:00
Rodrigo Arias Mallo	e15a3867d4	Add 10 min shutdown jitter to avoid spikes The shutdown timer will fire at slightly different times for the different nodes, so we slowly decrease the power consumption. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:44 +02:00
Rodrigo Arias Mallo	5cad208de6	Don't mount the nix store in owl nodes Initially we planned to run jobs in those nodes by sharing the same nix store from hut. However, these nodes are now used to build packages which are not available in hut. Users also ssh to the nodes, which doesn't mount the hut store, so it doesn't make much sense to keep mounting it. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:42 +02:00
Rodrigo Arias Mallo	c8687f7e45	Emulate other architectures in owl nodes too Allows cross-compilation of packages for RISC-V that are known to try to run RISC-V programs in the host. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:39 +02:00
Rodrigo Arias Mallo	d988ef2eff	Program shutdown for August 2nd for all machines Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:36 +02:00
Rodrigo Arias Mallo	b07929eab3	Enable debuginfod daemon in owl nodes WARNING: This will introduce noise, as the daemon wakes up from time to time to check for new packages. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:30 +02:00
Rodrigo Arias Mallo	b3e397eb4c	Set gitea and grafana log level to warn Prevents filling the journal logs with information messages. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:27 +02:00
Rodrigo Arias Mallo	5ad2c683ed	Set default SLURM job time limit to one hour Prevents endless jobs from being left forever, while allowing users to request a larger time limit. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:24 +02:00
Rodrigo Arias Mallo	1f06f0fa0c	Allow other jobs to run in unused cores The current select mechanism was using the memory too as a consumable resource, which by default only sets 1 MiB per node. As each job already requests 1 MiB, it prevents other jobs from running. As we are not really concerned with memory usage, we only use the unused cores in the select criteria. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:22 +02:00
Rodrigo Arias Mallo	8ca1d84844	Use authentication tokens for PM GitLab runner Starting with GitLab 16, there is a new mechanism to authenticate the runners via authentication tokens, so use it instead. Older tokens and runners are also removed, as they are no longer used. With the new way of managing tokens, both the tags and the locked state are managed from the GitLab web page. See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:16 +02:00
Rodrigo Arias Mallo	fcfc6ac149	Allow ptrace to any process of the same user Allows users to attach GDB to their own processes, without requiring running the program with GDB from the start. It is only available in compute nodes, the storage nodes continue with the restricted settings. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:09 +02:00
Rodrigo Arias Mallo	6e87130166	Add abonerib user to hut, raccoon, owl1 and owl2 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:07 +02:00
Rodrigo Arias Mallo	06f9e6ac6b	Grant rpenacob access to owl1 and owl2 nodes Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:05 +02:00
Rodrigo Arias Mallo	da07aedce2	Access private repositories via hut SSH proxy Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:03 +02:00
Rodrigo Arias Mallo	61427a8bf9	Set the default proxy to point to hut Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:35:56 +02:00
Rodrigo Arias Mallo	958ad1f025	Allow incoming traffic to hut proxy Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:35:23 +02:00
Aleix Roca Nonell	1c5f3a856f	eudy: koro: fcs: Fix fcs unprotected cpuid call smp_processor_id() was called in a preemptible context, which could invalidate the returned value. However, this was not harmful, because fcs threads in nosv are pinned. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-07-17 11:40:20 +02:00
Rodrigo Arias Mallo	4e2b80defd	Add support for armv7 emulation in hut Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:48 +02:00
Rodrigo Arias Mallo	1c8efd0877	Monitor raccoon machine via IPMI Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:32 +02:00
Rodrigo Arias Mallo	4c5e85031b	Move vlopez user to jungleUsers for koro host Access to other machines can be easily added into the "hosts" attribute without the need to replicate the configuration. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:39 +02:00
Rodrigo Arias Mallo	5688823fcc	Add raccoon motd file Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:38 +02:00
Rodrigo Arias Mallo	72faf8365b	Split xeon specific configuration from base To accommodate the raccoon knights workstation, some of the configuration pulled by m/common/main.nix has to be removed. To solve it, the xeon specific parts are placed into m/common/xeon.nix and only the common configuration is at m/common/base.nix. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:37 +02:00
Rodrigo Arias Mallo	0e22d6def8	Control user access to each machine The users.jungleUsers configuration option behaves like the users.users option, but defines the list attribute `hosts` for each user, which filters users so that each user can only access those hosts. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:34 +02:00
Rodrigo Arias Mallo	22cc1d33f7	Add PostgreSQL DB for performance test results The database will hold the performance results of the execution of the benchmarks. We follow the same setup on knights3 for now. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:24 +02:00
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a redundant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	06748dac1d	Enable mail notification in Gitea Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:49 +02:00
Rodrigo Arias Mallo	63851306ac	Add msmtp to send notifications via email Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:20 +02:00
Rodrigo Arias Mallo	2bdc793c8c	Allow Ceph traffic to lake2	2024-05-02 17:43:48 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	8e8f9e7adb	Add Gitea service Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:31:51 +02:00
Rodrigo Arias Mallo	d2adc3a6d3	Add firewall rules for Ceph and monitoring The firewall was blocking the monitoring traffic from hut and the Ceph traffic among OSDs. The rules only allow connecting from the specific host that they are supposed to be coming from. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:25:11 +02:00
Rodrigo Arias Mallo	49be0f208c	Remove nixseparatedebuginfod input It has been integrated in nixpkgs, so is no longer required. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:24:58 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Aleix Roca Nonell	ff792f5f48	Move slurm client in a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-02-13 11:11:17 +01:00
Rodrigo Arias Mallo	5c48b43ae0	Enable public-inbox at jungle.bsc.es/lists The public-inbox service fetches emails from the sourcehut mailing lists and displays them on the web. The idea is to reduce the dependency on external services and add a secondary storage for the mailing lists in case sourcehut goes down or changes the current free plans. The service is available in https://jungle.bsc.es/lists/ and is open to the public. It currently mirrors the bscpkgs and jungle mailing list. We also edited the CSS to improve the readability and have larger fonts by default. The service for public-inbox produced by NixOS is not well configured to fetch emails from an IMAP mail server, so we also manually edit the service file to enable the network. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-15 11:18:08 +01:00
Rodrigo Arias Mallo	b299ead00b	Monitor https://pm.bsc.es/gitlab/ too The GitLab instance is in the /gitlab endpoint and may fail independently of https://pm.bsc.es/. Cc: Víctor López <victor.lopez@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-05 09:56:28 +01:00
Aleix Roca Nonell	a92432cf5a	Enable nixseparatedebuginfod module The module is only enabled on Hut and Eudy because we noticed activity on the debuginfod service even if no debug session was active. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2023-12-04 11:04:52 +01:00
Rodrigo Arias Mallo	82f5d828c2	Use tmpfs in /tmp The /tmp directory was using the SSD disk which is not erased across boots. Nix will use /tmp to perform the builds, so we want it to be as fast as possible. In general, all the machines have enough space to handle large builds like LLVM. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-28 12:25:50 +01:00
Rodrigo Arias Mallo	35a94a9b02	Enable runners for pm.bsc.es/gitlab too The old runners for the PM gitlab were disabled in configuration in the last outage, but they remained working until we rebooted the node. With this change we enable the runners for both PM and gitlab.bsc.es. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 14:45:23 +01:00
Rodrigo Arias Mallo	b6bd31e159	Remove complete ceph package from hut Only the ceph-client is needed. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 12:58:54 +01:00
Rodrigo Arias Mallo	dd341902fc	BSC packages are no longer in bsc attribute Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-09 13:40:48 +01:00
Rodrigo Arias Mallo	2953080fb8	Monitor anella instead of gw.bsc.es The target gw.bsc.es doesn't reply to our ICMP probes from hut. However, the anella hop in the tracepath is a good candidate to identify cuts between the login and the provider and between the provider and external hosts like Google or Cloudflare DNS. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-27 12:46:08 +02:00
Rodrigo Arias Mallo	9871517be2	Add ICMP probes These probes check if we can reach several targets via ICMP, which is not proxied, so they can be used to see if ICMP forwarding is working in the login node. In particular, we test if we can reach the Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping only from the intranet and the login node (ssfhead). Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 17:13:03 +02:00

154 Commits