The select mechanism was also treating memory as a consumable resource,
which by default is set to only 1 MiB per node. As each job already
requests 1 MiB, the first job exhausts the memory of the node and
prevents other jobs from running there.
As we are not really concerned with memory usage, we now only consider
the unused cores in the select criteria.
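
A minimal sketch of the resulting selection settings, assuming Slurm is
configured through the NixOS services.slurm module and that the
select/cons_tres plugin is in use (both are assumptions, not taken from
this commit):

```nix
{
  # Assumption: extra slurm.conf lines are injected via services.slurm.
  services.slurm.extraConfig = ''
    # Only cores are consumable; memory is no longer tracked.
    # CR_Core_Memory would track both cores and memory.
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
  '';
}
```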
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead. Older tokens and
runners are also removed, as they are no longer used.
With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.
See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
To accommodate the raccoon knights workstation, some of the configuration
pulled in by m/common/main.nix has to be removed. To solve it, the
xeon-specific parts are moved into m/common/xeon.nix and only the common
configuration remains in m/common/base.nix.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
restricts the user so that they can only access those hosts.
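
The filtering could be sketched roughly as follows. This is an
illustration under assumptions, not the module from this commit: the
option type, the use of networking.hostName to identify the current
host, and the example user "alice" are all hypothetical.

```nix
{ config, lib, ... }:

let
  # Assumption: each machine is identified by its NixOS hostname.
  host = config.networking.hostName;

  # Keep only the users whose `hosts` list contains this machine.
  allowed = lib.filterAttrs (_: u: builtins.elem host u.hosts)
    config.users.jungleUsers;
in
{
  options.users.jungleUsers = lib.mkOption {
    type = lib.types.attrsOf lib.types.anything;
    default = { };
    description = "Like users.users, plus a `hosts` list per user.";
  };

  # Drop the extra `hosts` attribute before passing the users to NixOS.
  config.users.users =
    lib.mapAttrs (_: u: builtins.removeAttrs u [ "hosts" ]) allowed;

  # Hypothetical usage, placed in the shared user list:
  #   users.jungleUsers.alice = {
  #     isNormalUser = true;
  #     hosts = [ "hut" ];
  #   };
}
```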
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The new rules only accept connections from the
specific hosts that the traffic is expected to come from.
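
A source-restricted rule might look like the sketch below. The address
192.0.2.10 (a documentation-only range) and port 9100 are placeholders
chosen for illustration; the real hosts and ports are not stated in this
message.

```nix
{
  networking.firewall.extraCommands = ''
    # Hypothetical: accept scrapes on port 9100 only from one host.
    iptables -A nixos-fw -p tcp -s 192.0.2.10 --dport 9100 -j nixos-fw-accept
  '';
}
```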
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used, which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot tell whether a failure is in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.
Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).
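
As an illustration, a probe of this kind could be declared with the
usual blackbox exporter relabeling pattern. The job name, the module
name "http_2xx" and the exporter address 127.0.0.1:9115 are assumptions
for the sketch, not values taken from this commit:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "blackbox-http-direct"; # hypothetical name
      metrics_path = "/probe";
      # Assumption: a blackbox module that does not use the ops proxy.
      params.module = [ "http_2xx" ];
      static_configs = [
        { targets = [ "https://www.bsc.es/robots.txt" ]; }
      ];
      relabel_configs = [
        # Pass the target URL to the exporter as ?target=...
        { source_labels = [ "__address__" ]; target_label = "__param_target"; }
        # Keep the probed URL as the instance label.
        { source_labels = [ "__param_target" ]; target_label = "instance"; }
        # Scrape the exporter itself (assumed to listen locally).
        { target_label = "__address__"; replacement = "127.0.0.1:9115"; }
      ];
    }
  ];
}
```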
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.
The service is available at https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.
We also edited the CSS to improve the readability and have larger fonts
by default.
The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.
Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
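
One way to get a fast /tmp that is cleared on every boot is to mount it
as a tmpfs. The sketch below assumes that approach and a NixOS release
that provides the boot.tmp options (older releases used
boot.tmpOnTmpfs instead):

```nix
{
  # Assumption: /tmp is backed by RAM, so it starts empty on each boot.
  boot.tmp.useTmpfs = true;
}
```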
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The old runners for the PM GitLab were disabled in the configuration
during the last outage, but they kept running until the node was
rebooted. With this change we enable the runners for both PM and
gitlab.bsc.es.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.
In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway (which responds to
ping only from the intranet) and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.
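
A minimal sketch of how this could be expressed, assuming the proxy is
already declared through the NixOS networking.proxy options:

```nix
{ config, ... }:

{
  # Reuse the system-wide proxy variables (http_proxy, https_proxy, ...)
  # so the Grafana service can reach the Slack endpoint.
  systemd.services.grafana.environment = config.networking.proxy.envVars;
}
```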
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again, so it is better not to rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>