Compare commits

257 Commits

Author SHA1 Message Date
9b1391a9f6 Add p command to paste files 2024-09-16 16:33:42 +02:00
c8ca5adf84 Enable nginx 2024-09-16 16:33:34 +02:00
43e4c60dd5 Mount the NVME disk in /nvme 2024-09-12 09:54:55 +02:00
f5d6f32ca8 Rename ceph mount points
Use /ceph for cached ceph and /ceph-slow for uncached ceph.
2024-09-12 09:54:55 +02:00
8fccb40a7a Add cached ceph FS mount point in /cache 2024-09-12 09:54:55 +02:00
4bd1648074 Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:56 +02:00
15b114ffd6 Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:53 +02:00
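As a rough illustration of the rename described in this commit, the following is a minimal sketch assuming a NixOS release that already ships the `hardware.graphics` group. Whether this repository enables graphics support at all is not stated in the commit message.

```nix
{ ... }:
{
  # Previously spelled hardware.opengl.enable. The driSupport and
  # setLdLibraryPath sub-options no longer exist, so they are simply dropped.
  hardware.graphics.enable = true;
}
```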
dd6d8c9735 Add documentation section about GRUB chain loading
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:47 +02:00
e15a3867d4 Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:44 +02:00
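One way to obtain this kind of jitter is the systemd `RandomizedDelaySec` timer setting. The sketch below is an assumption about how it could look, not the commit's actual code: the timer name and the calendar time are hypothetical.

```nix
{ ... }:
{
  # Hypothetical timer name and date; only the jitter mechanism matters here.
  systemd.timers.scheduled-shutdown = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "2024-08-02 11:00:00";
      # Each node adds its own random delay of up to 10 minutes,
      # so the nodes do not all power off at the same instant.
      RandomizedDelaySec = "10min";
      Unit = "poweroff.target";
    };
  };
}
```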
5cad208de6 Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:42 +02:00
c8687f7e45 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:39 +02:00
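In NixOS, user-mode emulation of foreign binaries is typically enabled through `boot.binfmt.emulatedSystems`. A minimal sketch, assuming RISC-V 64-bit is the only extra architecture needed (the commit may enable more):

```nix
{ ... }:
{
  # Registers binfmt_misc handlers so RISC-V binaries run under QEMU
  # user-mode emulation when a build tries to execute them on the host.
  boot.binfmt.emulatedSystems = [ "riscv64-linux" ];
}
```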
d988ef2eff Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:36 +02:00
b07929eab3 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:30 +02:00
b3e397eb4c Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:27 +02:00
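A sketch of what such a change could look like with the standard NixOS modules for both services. The option paths are the upstream ones; the exact form used in this repository is an assumption.

```nix
{ ... }:
{
  # Gitea expects capitalized level names in its [log] section.
  services.gitea.settings.log.LEVEL = "Warn";

  # Grafana uses lowercase level names in its [log] section.
  services.grafana.settings.log.level = "warn";
}
```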
5ad2c683ed Set default SLURM job time limit to one hour
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
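In SLURM the default limit is set per partition with `DefaultTime`. The following is a sketch under assumptions: the partition line (name, node list and other flags) is hypothetical, and only `DefaultTime` reflects the commit.

```nix
{ ... }:
{
  services.slurm.partitionName = [
    # Hypothetical partition definition. Jobs submitted without --time
    # get one hour; users can still request more with --time.
    "owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 State=UP"
  ];
}
```

A user who needs a longer run would then pass, for example, `--time=04:00:00` to `srun` or `sbatch`.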
1f06f0fa0c Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
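The difference can be sketched with the `SelectTypeParameters` setting in slurm.conf. This assumes the previous value tracked memory as well (for example `CR_Core_Memory`); the commit message does not quote the actual lines.

```nix
{ ... }:
{
  services.slurm.extraConfig = ''
    # Treat only cores as a consumable resource, so jobs can share a
    # node as long as free cores remain; memory is not accounted for.
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
  '';
}
```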
8ca1d84844 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:16 +02:00
998f599be3 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
  → 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
  → 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:13 +02:00
fcfc6ac149 Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:09 +02:00
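This behaviour is controlled by the Yama `ptrace_scope` sysctl. A minimal sketch, assuming it is set through `boot.kernel.sysctl` in the compute node configuration:

```nix
{ ... }:
{
  # 0 = classic ptrace permissions: a process may attach to any other
  # process running under the same user, so `gdb -p PID` works.
  # 1 (restricted) only allows attaching to descendants.
  boot.kernel.sysctl."kernel.yama.ptrace_scope" = 0;
}
```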
6e87130166 Add abonerib user to hut, raccoon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:07 +02:00
06f9e6ac6b Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:05 +02:00
da07aedce2 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:03 +02:00
61427a8bf9 Set the default proxy to point to hut
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:56 +02:00
958ad1f025 Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:35:23 +02:00
1c5f3a856f eudy: koro: fcs: Fix fcs unprotected cpuid all
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-07-17 11:40:20 +02:00
4e2b80defd Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:48 +02:00
1c8efd0877 Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-17 11:12:32 +02:00
4c5e85031b Move vlopez user to jungleUsers for koro host
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:39 +02:00
5688823fcc Add raccoon motd file
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:38 +02:00
72faf8365b Split xeon specific configuration from base
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:37 +02:00
0e22d6def8 Control user access to each machine
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that each user can only access the hosts listed there.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:34 +02:00
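The filtering idea can be sketched as a small NixOS module. This is not the repository's implementation: the loose option type, the example user `alice` and the host names chosen for her are assumptions made for illustration.

```nix
{ config, lib, ... }:
let
  host = config.networking.hostName;
  # Keep only the users whose `hosts` list names this machine.
  allowed = lib.filterAttrs
    (_: user: lib.elem host (user.hosts or [ ]))
    config.users.jungleUsers;
in
{
  options.users.jungleUsers = lib.mkOption {
    # Loosely typed for brevity; a real module would likely reuse
    # the users.users submodule type.
    type = lib.types.attrsOf lib.types.attrs;
    default = { };
    description = "Like users.users, plus a `hosts` list per user.";
  };

  config = {
    # Hypothetical user: created on hut and owl1, absent elsewhere.
    users.jungleUsers.alice = {
      isNormalUser = true;
      hosts = [ "hut" "owl1" ];
    };

    # Strip the extra attribute before handing the set to users.users.
    users.users = lib.mapAttrs (_: user: removeAttrs user [ "hosts" ]) allowed;
  };
}
```

Granting access to another machine then only requires appending its host name to the `hosts` list.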
22cc1d33f7 Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-07-16 12:35:24 +02:00
15085c8a05 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 15:57:38 +02:00
06748dac1d Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:49 +02:00
63851306ac Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-31 10:56:20 +02:00
2bdc793c8c Allow Ceph traffic to lake2 2024-05-02 17:43:48 +02:00
85d1c5e34c Fix meta in posts entries
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:37 +02:00
e6b7af5272 Fix bogus separator
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:34 +02:00
c0ae8770bc Manually add links to the menu
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:32 +02:00
5b51e8947f Add link to Gitea in the website
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:28 +02:00
db2c6f7e45 Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:32:25 +02:00
8e8f9e7adb Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-05-02 17:31:51 +02:00
d2adc3a6d3 Add firewall rules for Ceph and monitoring
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:11 +02:00
76cd9ea47f Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:08 +02:00
2f851bc216 Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:05 +02:00
834d3187e5 Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:03 +02:00
49be0f208c Remove nixseparatedebuginfod input
It has been integrated in nixpkgs, so it is no longer required.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:58 +02:00
fb23b41dae flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
  → 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
  → 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
  → 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
    'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
  → 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
    'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
  → 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:24:29 +02:00
005a67deaf Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-05 16:52:21 +01:00
f8097cb5cb Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-02-13 12:26:56 +01:00
ff792f5f48 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00
5c48b43ae0 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing list.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-15 11:18:08 +01:00
b299ead00b Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-12-05 09:56:28 +01:00
a92432cf5a Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-04 11:04:52 +01:00
82f5d828c2 Use tmpfs in /tmp
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-28 12:25:50 +01:00
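On recent NixOS releases this is a single option. A minimal sketch, assuming the default tmpfs size is acceptable:

```nix
{ ... }:
{
  # Mount /tmp as a tmpfs: builds run in memory and the directory
  # starts empty after every boot.
  boot.tmp.useTmpfs = true;
}
```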
35a94a9b02 Enable runners for pm.bsc.es/gitlab too
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we reboot the node. With
this change we enable the runners for both PM and gitlab.bsc.es.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 14:45:23 +01:00
b6bd31e159 Remove complete ceph package from hut
Only the ceph-client is needed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:54 +01:00
1d4badda5b Fix warning in slurm exporter using vendorHash
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:50 +01:00
bd5214a3b9 Remove old Ceph package overlay
The Ceph package is now integrated in upstream nixpkgs.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:47 +01:00
c32f6dea97 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
  → 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
  → 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:57:44 +01:00
dd341902fc BSC packages are no longer in bsc attribute
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
190e273112 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
268807d1d0 Switch bscpkgs URL to sourcehut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
2953080fb8 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
9871517be2 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
736eacaac5 Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
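A sketch of how the proxy variables could be passed to the service, assuming the proxy is already declared through `networking.proxy.default` so that NixOS exposes the matching environment variables:

```nix
{ config, ... }:
{
  # Reuse the system-wide proxy variables (http_proxy, https_proxy, ...)
  # so Grafana can reach the Slack endpoint when sending alerts.
  systemd.services.grafana.environment = config.networking.proxy.envVars;
}
```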
0e66aad099 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
67a4905a0a Don't log SLURM connection attempts from ssfhead 2023-10-06 15:22:04 +02:00
d52d22e0db Add docker runner too 2023-10-06 15:17:07 +02:00
42920c2521 Monitor gitlab.bsc.es too 2023-10-06 15:17:07 +02:00
4acd35e036 Monitor PM webpage via blackbox 2023-10-06 15:17:07 +02:00
621d20db3a Temporarily disable pm runners 2023-10-06 15:17:07 +02:00
0926f6ec1f Add runner for gitlab.bsc.es 2023-10-06 15:17:07 +02:00
61646cb3bd Allow anonymous access to grafana 2023-09-22 10:51:30 +02:00
c0066c4744 Remove user/group when using DynamicUsers 2023-09-22 10:13:06 +02:00
ffd0593f51 Set the SLURM_CONF variable 2023-09-21 22:22:00 +02:00
f49ae0773e Enable slurm-exporter service 2023-09-21 21:40:02 +02:00
8fa3fccecb Add prometheus-slurm-exporter package 2023-09-21 21:34:18 +02:00
9ee7111453 Document the hut shared nix store for SLURM 2023-09-21 13:51:42 +02:00
8de3d2b149 Mount the hut nix store for SLURM jobs 2023-09-20 19:38:43 +02:00
bc62e28ca3 Enable direnv integration 2023-09-20 09:32:58 +02:00
d612a5453c Add System Integration Service Guide document 2023-09-19 15:12:59 +02:00
653d411b9e Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 12:00:33 +02:00
51c57dbc41 Add bscpkgs and nixpkgs top level attributes
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 12:00:33 +02:00
33cd40160e Use hut packages as the default package set
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-15 12:00:28 +02:00
a1e8cfea47 Don't fetch registry flakes from the net 2023-09-15 12:00:28 +02:00
5d72ee3da3 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2023-09-15 11:50:47 +02:00
fdc6445d47 Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a9073166ac52b8bd56ac96258d3b053.
2023-09-14 15:46:18 +02:00
e88805947e Open ports in firewall of compute nodes 2023-09-14 15:45:43 +02:00
aaefddc44a Update slurm to 23.02.05.1 2023-09-13 17:44:24 +02:00
d9d249411d Monitor storage nodes via IPMI too 2023-09-13 15:57:13 +02:00
c07f75c6bb Specify the space available in /ceph 2023-09-13 14:19:59 +02:00
8d449ba20c Add update post to website 2023-09-12 18:13:38 +02:00
10ca572aec Enable fstrim service 2023-09-12 16:39:45 +02:00
75b0f48715 Serve the nix store from hut 2023-09-12 12:19:43 +02:00
19a451db77 Add encrypted munge key with agenix 2023-09-08 19:05:45 +02:00
ec9be9bb62 Remove unused large port hole in firewall 2023-09-08 18:22:48 +02:00
7ddd1977f3 Make exporters listen in localhost only 2023-09-08 18:13:04 +02:00
7050c505b5 Allow only some ports for srun 2023-09-08 17:51:37 +02:00
033a1fe97b Block ssfhead from reaching our slurm daemon 2023-09-08 17:36:28 +02:00
77cb3c494e Poweroff idle slurm nodes after 1 hour 2023-09-08 16:49:53 +02:00
6db5772ac4 Add IB and IPMI node host names 2023-09-08 13:21:37 +02:00
3e347e673c flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2023-09-07 11:13:45 +02:00
dca274d020 Unlock ovni gitlab runners 2023-09-05 16:59:45 +02:00
c33909f32f Update email contact to jungle mail list 2023-09-05 16:10:58 +02:00
64e856e8b9 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
  → 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
02f40a8217 Add agenix to all nodes 2023-09-04 22:10:43 +02:00
77d43b6da9 Add agenix module to ceph 2023-09-04 22:07:07 +02:00
ab55aac5ff Remove old secrets 2023-09-04 22:04:32 +02:00
9b5bfbb7a3 Mount /ceph in owl1 and owl2 2023-09-04 22:00:36 +02:00
a69a71d1b0 Warn about the owl2 omnipath device 2023-09-04 22:00:17 +02:00
98374bd303 Clean owl2 configuration 2023-09-04 21:59:56 +02:00
3b6be8a2fc Move the ceph client config to an external module 2023-09-04 21:59:04 +02:00
2bb366b9ac Reorganize secrets and ssh keys
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
2d16709648 Add anavarro user 2023-09-04 16:00:01 +02:00
9344daa31c Set zsh inc_append_history option 2023-09-03 16:57:53 +02:00
80c98041b5 Set zsh shell for rarias 2023-09-03 16:46:27 +02:00
3418e57907 Enable zsh and fix key bindings 2023-09-03 16:42:04 +02:00
6848b58e39 Keep a log over time with the config commits 2023-09-03 00:02:14 +02:00
13a70411aa Configure bscpkgs.nixpkgs to follow nixpkgs 2023-09-02 23:37:59 +02:00
f9c77b433a Store nixos config in /etc/nixos/config.rev 2023-09-02 23:37:11 +02:00
9d487845f6 Enable binary emulation for other architectures 2023-08-31 17:27:08 +02:00
3c99c2a662 Enable watchdog 2023-08-30 16:32:17 +02:00
7d09108c9f Enable all osd on boot in lake2 2023-08-30 16:32:17 +02:00
0f0a861896 Scrape lake2 too 2023-08-29 12:33:26 +02:00
beb0d5940e Also enable monitoring in lake2 2023-08-29 12:29:41 +02:00
70321ce237 Scrape metrics from bay 2023-08-29 11:58:00 +02:00
5bd1d67333 Add monitoring in the bay node 2023-08-29 11:53:32 +02:00
fad9df61e1 Add fio tool 2023-08-29 11:27:50 +02:00
d2a80c8c18 Add ceph tools in hut too 2023-08-28 17:58:21 +02:00
599613d139 Switch ceph logs to journal 2023-08-28 17:58:08 +02:00
ac4fa9abd4 Update ceph to 18.2.0 in overlay 2023-08-25 18:20:21 +02:00
cb3a7b19f7 Move pkgs overlay to overlay.nix 2023-08-25 18:12:00 +02:00
f5d6bf627b Enable ceph osd daemons in lake2 2023-08-25 14:54:51 +02:00
f1ce815edd Add the lake2 hostname to the hosts 2023-08-25 14:44:35 +02:00
a2075cfd65 Use the sda for lake2 2023-08-25 13:40:10 +02:00
8f1f6f92a8 Remove netboot module 2023-08-25 13:39:01 +02:00
3416416864 Disable pixiecore in hut for now 2023-08-25 13:21:00 +02:00
815888fb07 Add PXE helper 2023-08-25 12:05:33 +02:00
029d9cb1db Enable netboot again for PXE 2023-08-24 19:08:23 +02:00
95fa67ede1 Specify the disk by path 2023-08-24 15:27:37 +02:00
a19347161f Prepare lake2 config after bootstrap
The disk ID is different under NixOS.
2023-08-24 13:54:53 +02:00
58c1cc1f7c Add lake2 bootstrap config 2023-08-24 12:30:46 +02:00
b06399dc70 Add section to enable serial console 2023-08-24 12:29:44 +02:00
077eece6b9 Add agenix to PATH in hut 2023-08-23 17:42:50 +02:00
b3ef53de51 Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2023-08-23 17:26:44 +02:00
e0852ee89b Add rarias key for secrets 2023-08-23 17:15:26 +02:00
dfffc0bdce Add ceph metrics to prometheus 2023-08-22 16:33:55 +02:00
8257c245b1 Mount the ceph filesystem in hut 2023-08-22 16:15:46 +02:00
cd5853cf53 Add ceph config in bay 2023-08-22 15:58:48 +02:00
b677b827d4 Add the bay host name 2023-08-22 15:56:09 +02:00
b1d5185cca Remove netboot and fixes 2023-08-22 12:12:15 +02:00
a7e66e2246 Add bay node 2023-08-22 12:12:15 +02:00
480c97e952 Update flake 2023-08-22 11:28:54 +02:00
f8fb5fa4ff Monitor power from other nodes via LAN 2023-08-22 11:28:54 +02:00
acf9b71f04 Increase prometheus retention time to one year 2023-08-22 11:28:54 +02:00
bf692e6e4e Don't set all_proxy 2023-08-22 11:28:54 +02:00
c242b65e47 Update nixpkgs to fix docker problem 2023-07-28 14:24:51 +02:00
55d6c17776 Allow access to devices for node_exporter 2023-07-28 13:55:35 +02:00
14b173f67e GRUB version no longer needed 2023-07-27 17:22:20 +02:00
b9001cdf7d Upgrade flake: nixpkgs, bscpkgs and agenix 2023-07-27 17:19:17 +02:00
f892d43b47 Kill slurmd remaining processes on upgrade 2023-07-27 14:49:20 +02:00
d9e9ee6e3a Add details to request access in the web 2023-07-25 16:07:22 +02:00
79adbe76a8 koro: Add vlopez user 2023-07-21 13:00:43 +02:00
66fb848ba8 Add koro node 2023-07-21 13:00:08 +02:00
40b1a8f0df eudy: Add fcsv3 and intermediate versions for testing 2023-07-21 11:27:51 +02:00
a0b9d10b14 eudy: Enable memory overcommit 2023-07-21 11:27:51 +02:00
4c309dea2f eudy: disable all cpu mitigations 2023-07-21 11:27:51 +02:00
b3a397eee4 Add jungle.bsc.es hugo website 2023-07-21 10:52:23 +02:00
7c1fe1455b Enable NTP using the BSC time server 2023-06-30 14:02:15 +02:00
2d4b178895 Add the ssfhead node as gateway 2023-06-30 14:01:35 +02:00
4dd25f2f89 Use our host names first by default 2023-06-23 16:22:18 +02:00
6dcd9d8144 Add DNS tools to resolve hosts 2023-06-23 16:15:45 +02:00
31be81d2b1 Lower perf_event_paranoid to -1 2023-06-23 16:01:27 +02:00
826cfdf43f Set perf paranoid to 0 by default 2023-06-21 16:24:19 +02:00
a1f258c5ce Add perf to packages 2023-06-21 15:41:06 +02:00
1c1d3f3231 Allow srun to specify the cpu binding
The task/affinity plugin needs to be selected.
2023-06-21 13:16:23 +02:00
623d46c03f Move authorized keys to users.nix 2023-06-20 14:08:34 +02:00
518a4d6af3 Add rpenacob user 2023-06-20 12:54:26 +02:00
60077948d6 Add osumb to the system packages 2023-06-16 19:22:41 +02:00
c76bfa7f86 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2023-06-16 18:33:54 +02:00
6c10933e80 Set mpi to mpich by default in bscpkgs 2023-06-16 18:26:51 +02:00
6402605b1f Add missing parameter to extend 2023-06-16 18:26:51 +02:00
1724535495 Use explicit order in overlays 2023-06-16 18:26:51 +02:00
5b41670f36 Replace mpi inside bsc attribute 2023-06-16 18:26:51 +02:00
ab04855382 Add mpich overlay 2023-06-16 18:26:51 +02:00
684d5e41c5 Add comments in slurm config 2023-06-16 18:26:50 +02:00
316ea18e24 Add eudy host key to known hosts 2023-06-16 17:29:48 +02:00
c916157fcc Rename xeon08 to eudy
From Eudyptula, a little penguin.
2023-06-16 17:16:05 +02:00
4e9409db10 Update rebuild script for all nodes 2023-06-16 12:13:07 +02:00
94320d9256 Add ssh host keys 2023-06-16 12:01:12 +02:00
9f5941c2be Set the name of the slurm cluster to jungle 2023-06-16 12:00:54 +02:00
fba0f7b739 Change owl hostnames 2023-06-16 11:42:39 +02:00
2e95281af5 Add owl and all partition 2023-06-16 11:34:00 +02:00
f4ac9f3186 Simplify flake and expose host pkgs
The configuration of the machines is now moved to m/
2023-06-16 11:31:31 +02:00
f787343f29 Rename xeon07 to hut 2023-06-14 17:28:40 +02:00
70304d26ff Remove profiles older than 30 days with gc 2023-06-14 17:28:39 +02:00
76c10ec22e Add ncdu to system packages 2023-06-14 17:28:39 +02:00
011e8c2bf8 Move arocanon user from xeon08 to common 2023-06-14 16:22:43 +02:00
c1f138a9c1 xeon08: Add config for kernel non-voluntary preemption 2023-06-14 16:17:33 +02:00
1552eeca12 xeon08: Add perf 2023-06-14 15:42:20 +02:00
8769f3d418 xeon08: Enable lttng lockdep tracepoints 2023-06-14 15:42:20 +02:00
a4c254fcd6 xeon08: Add lttng module and tools 2023-06-14 15:42:20 +02:00
24fb1846d2 Serve grafana in https://jungle.bsc.es/grafana 2023-05-31 18:12:14 +02:00
5e77d0b86c Add tree command 2023-05-31 18:11:34 +02:00
494fda126c Add file to system packages 2023-05-31 18:11:34 +02:00
5cfa2f9611 Add gnumake to system packages 2023-05-31 18:11:34 +02:00
9539a24bdb Add cmake to system packages 2023-05-31 18:11:34 +02:00
98c4d924dd Add ix to common packages 2023-05-31 18:11:34 +02:00
7aae967c65 Improve documentation 2023-05-26 11:38:27 +02:00
49f7edddac Add gitignore 2023-05-26 11:38:27 +02:00
2f055d9fc5 Set intel_pstate=passive and disable frequency boost 2023-05-26 11:38:26 +02:00
108abffd2a Add xeon08 basic config 2023-05-26 11:38:26 +02:00
4c19ad66e3 Add nixos-config.nix to easily enable nix repl 2023-05-26 11:29:59 +02:00
19c01aeb1d Automatically resume restarted nodes in SLURM 2023-05-18 12:48:04 +02:00
fc90b40310 Allow public dashboards in grafana 2023-05-09 18:53:31 +02:00
81de0effb1 Add hal ssh key 2023-05-09 18:37:38 +02:00
5ce93ff85a Increase the number of CPUs to 56 for nOS-V docker 2023-05-02 17:47:57 +02:00
c020b9f5d6 Allow 5 concurrent builds in the gitlab-runner 2023-05-02 17:38:10 +02:00
f47734b524 Simplify bash prompt 2023-04-28 18:15:04 +02:00
ca3a7d98f5 Roll back to bash as default shell
Zsh doesn't behave properly, it needs further configuration.
2023-04-28 17:59:19 +02:00
0d5609ecc2 Use pmix by default in slurm 2023-04-28 17:07:48 +02:00
818edccb34 Increase locked memory to 1 GiB 2023-04-28 12:34:51 +02:00
2815f5bcfd Use the latest kernel 2023-04-28 11:51:38 +02:00
c1bbbd7793 Disable osnoise and hwlat tracer for now
Reuse nix cache to avoid rebuilding the kernel.
2023-04-28 11:19:47 +02:00
aa1dd14b62 Update nixpkgs to nixos-unstable 2023-04-28 11:18:37 +02:00
399103a9b4 Update nixpkgs 2023-04-28 11:13:46 +02:00
74639d3ece Update ib interface name in xeon02
It seems to be plugged in another PCI port
2023-04-27 18:29:32 +02:00
613a76ac29 Add steps in install documentation 2023-04-27 17:30:53 +02:00
c3ea8864bb Add minimal netboot module to build kexec image 2023-04-27 16:36:15 +02:00
919f211536 Add xeon02 configuration 2023-04-27 16:28:12 +02:00
141d77e2b6 Refacto slurm configuration into compute/control 2023-04-27 16:27:04 +02:00
44fcb97ec7 Lock flakes and add inputs 2023-04-27 13:52:59 +02:00
543983e9f3 Test flakes 2023-04-26 14:27:02 +02:00
95bbeeb646 Enable slurm in xeon01 2023-04-26 14:10:36 +02:00
de2af79810 Use xeon07 as control machine 2023-04-26 14:10:36 +02:00
b9aff1dba5 Remove xeon07 overlay to load upstream slurm 2023-04-26 14:10:36 +02:00
7da979bed2 Add script to rebuild configuration 2023-04-26 14:09:23 +02:00
cfe37640ea Add configuration for xeon01 2023-04-26 11:44:00 +00:00
096e407571 Load overlays from /config 2023-04-26 11:44:00 +00:00
ae31b546e7 Move net.nix to common 2023-04-26 11:44:00 +00:00
c3a2766bb7 Remove host specific network options from net.nix 2023-04-26 11:44:00 +00:00
b568bb36d4 Move ssh.nix to common 2023-04-26 11:44:00 +00:00
55f784e6b7 Move overlays.nix to common 2023-04-26 11:44:00 +00:00
dfab84b0ba Move users.nix to common 2023-04-26 11:44:00 +00:00
8f66ba824a Move common options from configuration.nix 2023-04-26 11:44:00 +00:00
79bd4398f3 Move the remaining hw config to common 2023-04-26 11:44:00 +00:00
b44afdaaa1 Move boot config to common/boot.nix 2023-04-26 11:44:00 +00:00
9528fab3ef Move filesystems config to common/fs.nix 2023-04-26 11:44:00 +00:00
7e82885d84 Use partition labels for / and swap 2023-04-26 11:44:00 +00:00
57ed0cf319 Move fs.nix to common 2023-04-26 11:44:00 +00:00
b043ee3b1d Move boot.nix to common 2023-04-26 11:44:00 +00:00
9e3bdaabb6 Move disk selection to configuration.nix 2023-04-26 11:44:00 +00:00
77f72ac939 Add common directory 2023-04-26 11:44:00 +00:00
fa25a68571 Add server board documentation 2023-04-24 10:10:08 +02:00
Rodrigo Arias
ea0f406849 Add BSC SSF slides 2023-04-24 09:47:11 +02:00
Rodrigo Arias
9df6be1b6b Add SEL troubleshooting guide 2023-04-21 13:31:11 +02:00
230 changed files with 28874 additions and 431 deletions

2
.gitignore vendored Normal file

@ -0,0 +1,2 @@
*.swp
/result

Binary file not shown.

Binary file not shown.

BIN
doc/bsc-ssf.pdf Normal file

Binary file not shown.

176
doc/install.md Normal file

@ -0,0 +1,176 @@
# Installing NixOS in a new node
This article shows the steps to install NixOS in a node following the
configuration of the repo.
## Enable the serial console
By default, the nodes have the serial console disabled in GRUB and also boot
Linux without the serial console enabled.
To enable the serial console in GRUB, set the following lines in
/etc/default/grub:
```
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
```
To boot Linux with the serial console enabled, so you can see the boot log and
log in via serial, set:
```
GRUB_CMDLINE_LINUX="console=ttyS0,115200n8 console=tty0"
```
Then update the grub config:
```
# grub2-mkconfig -o /boot/grub2/grub.cfg
```
And reboot.
## Prepare the disk
Create a main partition and label it `nixos` following [the manual][1].
[1]: https://nixos.org/manual/nixos/stable/index.html#sec-installation-manual-partitioning
```
# disk=/dev/sdX
# parted $disk -- mklabel msdos
# parted $disk -- mkpart primary 1MB -8GB
# parted $disk -- mkpart primary linux-swap -8GB 100%
# parted $disk -- set 1 boot on
```
Then create an ext4 filesystem labeled `nixos`, where the system will be
installed. **Ensure that no other partition has the same label.**
```
# mkfs.ext4 -L nixos "${disk}1"
# mkswap -L swap "${disk}2"
# mount ${disk}1 /mnt
# lsblk -f $disk
NAME FSTYPE LABEL UUID MOUNTPOINT
sdX
`-sdX1 ext4 nixos 10d73b75-809c-4fa3-b99d-4fab2f0d0d8e /mnt
```
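Since the root filesystem is later located by its label, a duplicate `nixos`
label on another disk could cause the wrong partition to be picked. A minimal
sketch of a check, assuming `lsblk` from util-linux is available; the
`count_label` helper is a made-up name and not part of the repo:

```shell
# Count how many entries in a list of filesystem labels (one per line on
# stdin) match the given label exactly. Prints the count.
count_label() {
    # grep -c exits with status 1 when the count is 0, so mask that status.
    grep -cxF -- "$1" || true
}

# On the node, feed it the labels of all block devices:
#   lsblk -rno LABEL | count_label nixos
# Anything other than 1 means the label is missing or duplicated.
```

Reading the labels from stdin keeps the helper independent of the machine, so
it can be tried with a hand-written list before running it on a node.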
## Prepare nix and nixos-install
Mount the nix store from the hut node read-only in /nix.
```
# mkdir /nix
# mount -o ro hut:/nix /nix
```
Get the nix binary and nixos-install tool from hut:
```
# ssh hut 'readlink -f $(which nix)'
/nix/store/0sxbaj71c4c4n43qhdxm31f56gjalksw-nix-2.13.3/bin/nix
# ssh hut 'readlink -f $(which nixos-install)'
/nix/store/9yq8ps06ysr2pfiwiij39ny56yk3pdcs-nixos-install/bin/nixos-install
```
And add them to the PATH:
```
# export PATH=$PATH:/nix/store/0sxbaj71c4c4n43qhdxm31f56gjalksw-nix-2.13.3/bin
# export PATH=$PATH:/nix/store/9yq8ps06ysr2pfiwiij39ny56yk3pdcs-nixos-install/bin/
# nix --version
nix (Nix) 2.13.3
```
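The store paths shown above are specific to the versions installed on hut, so
copying them by hand is error-prone. One possible way to script the two steps,
assuming ssh to hut works as in the commands above; `tool_dir` and
`add_remote_tool` are hypothetical helper names:

```shell
# Print the directory that contains a resolved binary path.
tool_dir() {
    dirname "$1"
}

# Resolve a tool on a remote host and append its directory to PATH.
# The host and tool names are passed in; nothing here is specific to the repo.
add_remote_tool() {
    host=$1
    tool=$2
    # The \$ keeps the command substitution from running locally.
    resolved=$(ssh "$host" "readlink -f \$(which $tool)") || return 1
    PATH=$PATH:$(tool_dir "$resolved")
    export PATH
}

# Usage, mirroring the manual steps:
#   add_remote_tool hut nix
#   add_remote_tool hut nixos-install
```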
## Adapt owl configuration
Clone owl repo:
```
$ git clone git@bscpm03.bsc.es:rarias/owl.git
$ cd owl
```
Edit the configuration to your needs.
## Install from another Linux OS
Install NixOS into the storage drive.
```
# nixos-install --root /mnt --flake .#xeon0X
```
At this point, the NixOS GRUB has been installed into the nixos device, which
is not the default boot device. To keep both the old Linux and NixOS GRUBs, add
an entry into the old Linux GRUB to jump into the new one.
```
# echo "
menuentry 'NixOS' {
insmod chain
search --no-floppy --label nixos --set root
configfile /boot/grub/grub.cfg
} " >> /etc/grub.d/40_custom
```
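The same entry can also be produced by a small function, which avoids the
nested quoting of the `echo` above. This is only a sketch: `nixos_menuentry` is
a hypothetical name, and the label defaults to `nixos` as used in this guide:

```shell
# Print a GRUB menu entry that chain-loads the grub.cfg found on the
# partition with the given filesystem label (default: nixos).
nixos_menuentry() {
    label=${1:-nixos}
    cat <<EOF
menuentry 'NixOS' {
    insmod chain
    search --no-floppy --label $label --set root
    configfile /boot/grub/grub.cfg
}
EOF
}

# Usage (as root, on the old system):
#   nixos_menuentry >> /etc/grub.d/40_custom
```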
Rebuild grub config.
```
# grub2-mkconfig -o /boot/grub/grub.cfg
```
To boot into NixOS manually, reboot and select the NixOS entry in the GRUB
menu.
To temporarily boot into NixOS only on the next reboot run:
```
# grub2-reboot 'NixOS'
```
To permanently boot into NixOS as the default boot OS, edit `/etc/default/grub`:
```
GRUB_DEFAULT='NixOS'
```
And update grub.
```
# grub2-mkconfig -o /boot/grub/grub.cfg
```
## Build the nixos kexec image
```
# nix build .#nixosConfigurations.xeon02.config.system.build.kexecTree -v
```
## Chain NixOS in same disk with other systems
To install NixOS on a partition alongside another system that controls GRUB,
first set the GRUB device to "nodev", so NixOS does not install GRUB on the disk
(only the /boot files will be generated):
```
boot.loader.grub.device = "nodev";
```
Then add the following entry to the old GRUB configuration:
```
menuentry 'NixOS' {
insmod chain
search --no-floppy --label nixos --set root
configfile /boot/grub/grub.cfg
}
```
The partition with NixOS must have the label "nixos" for it to be found. New
system configuration entries will be stored in the GRUB configuration managed
by NixOS, so there is no need to change the old GRUB settings.

130
flake.lock generated Normal file

@ -0,0 +1,130 @@
{
"nodes": {
"agenix": {
"inputs": {
"darwin": "darwin",
"home-manager": "home-manager",
"nixpkgs": [
"nixpkgs"
],
"systems": "systems"
},
"locked": {
"lastModified": 1720546205,
"narHash": "sha256-boCXsjYVxDviyzoEyAk624600f3ZBo/DKtUdvMTpbGY=",
"owner": "ryantm",
"repo": "agenix",
"rev": "de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6",
"type": "github"
},
"original": {
"owner": "ryantm",
"repo": "agenix",
"type": "github"
}
},
"bscpkgs": {
"inputs": {
"nixpkgs": [
"nixpkgs"
]
},
"locked": {
"lastModified": 1713974364,
"narHash": "sha256-ilZTVWSaNP1ibhQIIRXE+q9Lj2XOH+F9W3Co4QyY1eU=",
"ref": "refs/heads/master",
"rev": "de89197a4a7b162db7df9d41c9d07759d87c5709",
"revCount": 937,
"type": "git",
"url": "https://git.sr.ht/~rodarima/bscpkgs"
},
"original": {
"type": "git",
"url": "https://git.sr.ht/~rodarima/bscpkgs"
}
},
"darwin": {
"inputs": {
"nixpkgs": [
"agenix",
"nixpkgs"
]
},
"locked": {
"lastModified": 1700795494,
"narHash": "sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0=",
"owner": "lnl7",
"repo": "nix-darwin",
"rev": "4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d",
"type": "github"
},
"original": {
"owner": "lnl7",
"ref": "master",
"repo": "nix-darwin",
"type": "github"
}
},
"home-manager": {
"inputs": {
"nixpkgs": [
"agenix",
"nixpkgs"
]
},
"locked": {
"lastModified": 1703113217,
"narHash": "sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE=",
"owner": "nix-community",
"repo": "home-manager",
"rev": "3bfaacf46133c037bb356193bd2f1765d9dc82c1",
"type": "github"
},
"original": {
"owner": "nix-community",
"repo": "home-manager",
"type": "github"
}
},
"nixpkgs": {
"locked": {
"lastModified": 1720957393,
"narHash": "sha256-oedh2RwpjEa+TNxhg5Je9Ch6d3W1NKi7DbRO1ziHemA=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "693bc46d169f5af9c992095736e82c3488bf7dbb",
"type": "github"
},
"original": {
"owner": "NixOS",
"ref": "nixos-unstable",
"repo": "nixpkgs",
"type": "github"
}
},
"root": {
"inputs": {
"agenix": "agenix",
"bscpkgs": "bscpkgs",
"nixpkgs": "nixpkgs"
}
},
"systems": {
"locked": {
"lastModified": 1681028828,
"narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
"owner": "nix-systems",
"repo": "default",
"rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
"type": "github"
},
"original": {
"owner": "nix-systems",
"repo": "default",
"type": "github"
}
}
},
"root": "root",
"version": 7
}

35
flake.nix Normal file

@ -0,0 +1,35 @@
{
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
agenix.url = "github:ryantm/agenix";
agenix.inputs.nixpkgs.follows = "nixpkgs";
bscpkgs.url = "git+https://git.sr.ht/~rodarima/bscpkgs";
bscpkgs.inputs.nixpkgs.follows = "nixpkgs";
};
outputs = { self, nixpkgs, agenix, bscpkgs, ... }:
let
mkConf = name: nixpkgs.lib.nixosSystem {
system = "x86_64-linux";
specialArgs = { inherit nixpkgs bscpkgs agenix; theFlake = self; };
modules = [ "${self.outPath}/m/${name}/configuration.nix" ];
};
in
{
nixosConfigurations = {
hut = mkConf "hut";
owl1 = mkConf "owl1";
owl2 = mkConf "owl2";
eudy = mkConf "eudy";
koro = mkConf "koro";
bay = mkConf "bay";
lake2 = mkConf "lake2";
raccoon = mkConf "raccoon";
};
packages.x86_64-linux = self.nixosConfigurations.hut.pkgs // {
bscpkgs = bscpkgs.packages.x86_64-linux;
nixpkgs = nixpkgs.legacyPackages.x86_64-linux;
};
};
}

29
keys.nix Normal file

@ -0,0 +1,29 @@
# As agenix needs to parse the secrets from a standalone .nix file, we describe
# here all the public keys
rec {
hosts = {
hut = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICO7jIp6JRnRWTMDsTB/aiaICJCl4x8qmKMPSs4lCqP1 hut";
owl1 = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMqMEXO0ApVsBA6yjmb0xP2kWyoPDIWxBB0Q3+QbHVhv owl1";
owl2 = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHurEYpQzNHqWYF6B9Pd7W8UPgF3BxEg0BvSbsA7BAdK owl2";
eudy = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIL+WYPRRvZupqLAG0USKmd/juEPmisyyJaP8hAgYwXsG eudy";
koro = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIImiTFDbxyUYPumvm8C4mEnHfuvtBY1H8undtd6oDd67 koro";
bay = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICvGBzpRQKuQYHdlUQeAk6jmdbkrhmdLwTBqf3el7IgU bay";
lake2 = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINo66//S1yatpQHE/BuYD/Gfq64TY7ZN5XOGXmNchiO0 lake2";
};
hostGroup = with hosts; rec {
compute = [ owl1 owl2 ];
playground = [ eudy koro ];
storage = [ bay lake2 ];
monitor = [ hut ];
system = storage ++ monitor;
safe = system ++ compute;
all = safe ++ playground;
};
admins = {
rarias = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIE1oZTPtlEXdGt0Ak+upeCIiBdaDQtcmuWoTUCVuSVIR rarias@hut";
root = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIII/1TNArcwA6D47mgW4TArwlxQRpwmIGiZDysah40Gb root@hut";
};
}

107
m/bay/configuration.nix Normal file

@ -0,0 +1,107 @@
{ config, pkgs, lib, ... }:
{
imports = [
../common/xeon.nix
../module/monitoring.nix
];
# Select the disk using its ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d53562d";
boot.kernel.sysctl = {
"kernel.yama.ptrace_scope" = lib.mkForce "1";
};
environment.systemPackages = with pkgs; [
ceph
];
networking = {
hostName = "bay";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.40";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.40";
prefixLength = 24;
} ];
firewall = {
extraCommands = ''
# Accept all incoming TCP traffic from lake2
iptables -A nixos-fw -p tcp -s lake2 -j nixos-fw-accept
# Accept monitoring requests from hut
iptables -A nixos-fw -p tcp -s hut -m multiport --dport 9283,9002 -j nixos-fw-accept
# Accept all Ceph traffic from the local network
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 -m multiport --dport 3300,6789,6800:7568 -j nixos-fw-accept
'';
};
};
services.ceph = {
enable = true;
global = {
fsid = "9c8d06e0-485f-4aaf-b16b-06d6daf1232b";
monHost = "10.0.40.40";
monInitialMembers = "bay";
clusterNetwork = "10.0.40.40/24"; # Use Ethernet only
};
extraConfig = {
# Only log to stderr so it appears in the journal
"log_file" = "/dev/null";
"mon_cluster_log_file" = "/dev/null";
"log_to_stderr" = "true";
"err_to_stderr" = "true";
"log_to_file" = "false";
};
mds = {
enable = true;
daemons = [ "mds0" "mds1" ];
extraConfig = {
"host" = "bay";
};
};
mgr = {
enable = true;
daemons = [ "bay" ];
};
mon = {
enable = true;
daemons = [ "bay" ];
};
osd = {
enable = true;
# One daemon per NVME disk
daemons = [ "0" "1" "2" "3" ];
extraConfig = {
"osd crush chooseleaf type" = "0";
"osd journal size" = "10000";
"osd pool default min size" = "2";
"osd pool default pg num" = "200";
"osd pool default pgp num" = "200";
"osd pool default size" = "3";
};
};
};
# Missing service for volumes, see:
# https://www.reddit.com/r/ceph/comments/14otjyo/comment/jrd69vt/
systemd.services.ceph-volume = {
enable = true;
description = "Ceph Volume activation";
unitConfig = {
Type = "oneshot";
After = "local-fs.target";
Wants = "local-fs.target";
};
path = [ pkgs.ceph pkgs.util-linux pkgs.lvm2 pkgs.cryptsetup ];
serviceConfig = {
KillMode = "none";
Environment = "CEPH_VOLUME_TIMEOUT=10000";
ExecStart = "/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT ${pkgs.ceph}/bin/ceph-volume lvm activate --all --no-systemd'";
TimeoutSec = "0";
};
wantedBy = [ "multi-user.target" ];
};
}

20
m/common/base.nix Normal file

@ -0,0 +1,20 @@
{
# All machines should include this profile.
# Includes the basic configuration for an Intel server.
imports = [
./base/agenix.nix
./base/august-shutdown.nix
./base/boot.nix
./base/env.nix
./base/fs.nix
./base/hw.nix
./base/net.nix
./base/nix.nix
./base/ntp.nix
./base/rev.nix
./base/ssh.nix
./base/users.nix
./base/watchdog.nix
./base/zsh.nix
];
}

9
m/common/base/agenix.nix Normal file

@ -0,0 +1,9 @@
{ agenix, ... }:
{
imports = [ agenix.nixosModules.default ];
environment.systemPackages = [
agenix.packages.x86_64-linux.default
];
}


@ -0,0 +1,14 @@
{
# Shutdown all machines on August 2nd at 11:00 AM, so we can protect the
# hardware from spurious electrical peaks on the yearly electrical cut for
# maintenance that starts on August 4th.
systemd.timers.august-shutdown = {
description = "Shutdown on August 2nd for maintenance";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-08-02 11:00:00";
RandomizedDelaySec = "10min";
Unit = "systemd-poweroff.service";
};
};
}

37
m/common/base/boot.nix Normal file

@ -0,0 +1,37 @@
{ lib, pkgs, ... }:
{
# Use the GRUB 2 boot loader.
boot.loader.grub.enable = true;
# Enable GRUB2 serial console
boot.loader.grub.extraConfig = ''
serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
terminal_input --append serial
terminal_output --append serial
'';
boot.kernel.sysctl = {
"kernel.perf_event_paranoid" = lib.mkDefault "-1";
# Allow ptracing (i.e. attach with GDB) any process of the same user, see:
# https://www.kernel.org/doc/Documentation/security/Yama.txt
"kernel.yama.ptrace_scope" = "0";
};
boot.kernelPackages = pkgs.linuxPackages_latest;
#boot.kernelPatches = lib.singleton {
# name = "osnoise-tracer";
# patch = null;
# extraStructuredConfig = with lib.kernel; {
# OSNOISE_TRACER = yes;
# HWLAT_TRACER = yes;
# };
#};
boot.initrd.availableKernelModules = [ "ahci" "xhci_pci" "ehci_pci" "nvme" "usbhid" "sd_mod" ];
boot.initrd.kernelModules = [ ];
boot.kernelModules = [ "kvm-intel" ];
boot.extraModulePackages = [ ];
}

35
m/common/base/env.nix Normal file

@ -0,0 +1,35 @@
{ pkgs, config, ... }:
{
environment.systemPackages = with pkgs; [
vim wget git htop tmux pciutils tcpdump ripgrep nix-index nixos-option
nix-diff ipmitool freeipmi ethtool lm_sensors ix cmake gnumake file tree
ncdu config.boot.kernelPackages.perf ldns
# From bscpkgs overlay
osumb
];
programs.direnv.enable = true;
# Increase limits
security.pam.loginLimits = [
{
domain = "*";
type = "-";
item = "memlock";
value = "1048576"; # 1 GiB of mem locked
}
];
environment.variables = {
EDITOR = "vim";
VISUAL = "vim";
};
programs.bash.promptInit = ''
PS1="\h\\$ "
'';
time.timeZone = "Europe/Madrid";
i18n.defaultLocale = "en_DK.UTF-8";
}

24
m/common/base/fs.nix Normal file

@ -0,0 +1,24 @@
{ ... }:
{
fileSystems."/" =
{ device = "/dev/disk/by-label/nixos";
fsType = "ext4";
};
# Trim unused blocks weekly
services.fstrim.enable = true;
swapDevices =
[ { device = "/dev/disk/by-label/swap"; }
];
# Tracing
fileSystems."/sys/kernel/tracing" = {
device = "none";
fsType = "tracefs";
};
# Mount a tmpfs into /tmp
boot.tmp.useTmpfs = true;
}

14
m/common/base/hw.nix Normal file

@ -0,0 +1,14 @@
# Do not modify this file! It was generated by nixos-generate-config
# and may be overwritten by future invocations. Please make changes
# to /etc/nixos/configuration.nix instead.
{ config, lib, pkgs, modulesPath, ... }:
{
imports =
[ (modulesPath + "/installer/scan/not-detected.nix")
];
nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
powerManagement.cpuFreqGovernor = lib.mkDefault "powersave";
hardware.cpu.intel.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}

19
m/common/base/net.nix Normal file

@ -0,0 +1,19 @@
{ pkgs, ... }:
{
networking = {
enableIPv6 = false;
useDHCP = false;
firewall = {
enable = true;
allowedTCPPorts = [ 22 ];
};
hosts = {
"84.88.53.236" = [ "ssfhead.bsc.es" "ssfhead" ];
"84.88.51.152" = [ "raccoon" ];
"84.88.51.142" = [ "raccoon-ipmi" ];
};
};
}

42
m/common/base/nix.nix Normal file

@ -0,0 +1,42 @@
{ pkgs, nixpkgs, bscpkgs, theFlake, ... }:
{
nixpkgs.overlays = [
bscpkgs.bscOverlay
(import ../../../pkgs/overlay.nix)
];
nix = {
nixPath = [
"nixpkgs=${nixpkgs}"
"jungle=${theFlake.outPath}"
];
registry = {
nixpkgs.flake = nixpkgs;
jungle.flake = theFlake;
};
settings = {
experimental-features = [ "nix-command" "flakes" ];
sandbox = "relaxed";
trusted-users = [ "@wheel" ];
flake-registry = pkgs.writeText "global-registry.json"
''{"flakes":[],"version":2}'';
};
gc = {
automatic = true;
dates = "weekly";
options = "--delete-older-than 30d";
};
};
# This value determines the NixOS release from which the default
# settings for stateful data, like file locations and database versions
# on your system were taken. It's perfectly fine and recommended to leave
# this value at the release version of the first install of this system.
# Before changing this value read the documentation for this option
# (e.g. man configuration.nix or on https://nixos.org/nixos/options.html).
system.stateVersion = "22.11"; # Did you read the comment?
}

9
m/common/base/ntp.nix Normal file

@ -0,0 +1,9 @@
{ pkgs, ... }:
{
services.ntp.enable = true;
# Use the NTP server at BSC, as we don't have direct access
# to the outside world
networking.timeServers = [ "84.88.52.36" ];
}

21
m/common/base/rev.nix Normal file

@ -0,0 +1,21 @@
{ theFlake, ... }:
let
# Prevent building a configuration without revision
rev = if theFlake ? rev then theFlake.rev
else throw ("Refusing to build from a dirty Git tree!");
in {
# Save the commit of the config in /etc/configrev
environment.etc.configrev.text = rev + "\n";
# Keep a log with the config over time
system.activationScripts.configRevLog.text = ''
BOOTED=$(cat /run/booted-system/etc/configrev 2>/dev/null || echo unknown)
CURRENT=$(cat /run/current-system/etc/configrev 2>/dev/null || echo unknown)
NEXT=${rev}
DATENOW=$(date --iso-8601=seconds)
echo "$DATENOW booted=$BOOTED current=$CURRENT next=$NEXT" >> /var/configrev.log
'';
system.configurationRevision = rev;
}

22
m/common/base/ssh.nix Normal file

@ -0,0 +1,22 @@
{ lib, ... }:
let
keys = import ../../../keys.nix;
hostsKeys = lib.mapAttrs (name: value: { publicKey = value; }) keys.hosts;
in
{
# Enable the OpenSSH daemon.
services.openssh.enable = true;
# Connect to intranet git hosts via proxy
programs.ssh.extraConfig = ''
Host bscpm02.bsc.es bscpm03.bsc.es gitlab-internal.bsc.es alya.gitlab.bsc.es
User git
ProxyCommand nc -X connect -x hut:23080 %h %p
'';
programs.ssh.knownHosts = hostsKeys // {
"gitlab-internal.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF9arsAOSRB06hdy71oTvJHG2Mg8zfebADxpvc37lZo3";
"bscpm03.bsc.es".publicKey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM2NuSUPsEhqz1j5b4Gqd+MWFnRqyqY57+xMvBUqHYUS";
};
}

109
m/common/base/users.nix Normal file

@ -0,0 +1,109 @@
{ pkgs, ... }:
{
imports = [
../../module/jungle-users.nix
];
users = {
mutableUsers = false;
users = {
# Generate hashedPassword with `mkpasswd -m sha-512`
root.openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKBOf4r4lzQfyO0bx5BaREePREw8Zw5+xYgZhXwOZoBO ram@hop"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINa0tvnNgwkc5xOwd6xTtaIdFi5jv0j2FrE7jl5MTLoE ram@mio"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF3zeB5KSimMBAjvzsp1GCkepVaquVZGPYwRIzyzaCba aleix@bsc"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIII/1TNArcwA6D47mgW4TArwlxQRpwmIGiZDysah40Gb root@hut"
];
rarias = {
uid = 1880;
isNormalUser = true;
home = "/home/Computational/rarias";
description = "Rodrigo Arias";
group = "Computational";
extraGroups = [ "wheel" ];
hashedPassword = "$6$u06tkCy13enReBsb$xiI.twRvvTfH4jdS3s68NZ7U9PSbGKs5.LXU/UgoawSwNWhZo2hRAjNL5qG0/lAckzcho2LjD0r3NfVPvthY6/";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKBOf4r4lzQfyO0bx5BaREePREw8Zw5+xYgZhXwOZoBO ram@hop"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINa0tvnNgwkc5xOwd6xTtaIdFi5jv0j2FrE7jl5MTLoE ram@mio"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGYcXIxe0poOEGLpk8NjiRozls7fMRX0N3j3Ar94U+Gl rarias@hal"
];
shell = pkgs.zsh;
};
arocanon = {
uid = 1042;
isNormalUser = true;
home = "/home/Computational/arocanon";
description = "Aleix Roca";
group = "Computational";
extraGroups = [ "wheel" ];
hashedPassword = "$6$hliZiW4tULC/tH7p$pqZarwJkNZ7vS0G5llWQKx08UFG9DxDYgad7jplMD8WkZh5k58i4dfPoWtnEShfjTO6JHiIin05ny5lmSXzGM/";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF3zeB5KSimMBAjvzsp1GCkepVaquVZGPYwRIzyzaCba aleix@bsc"
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGdphWxLAEekicZ/WBrvP7phMyxKSSuLAZBovNX+hZXQ aleix@kerneland"
];
};
};
jungleUsers = {
rpenacob = {
uid = 2761;
isNormalUser = true;
home = "/home/Computational/rpenacob";
description = "Raúl Peñacoba";
group = "Computational";
hosts = [ "owl1" "owl2" "hut" ];
hashedPassword = "$6$TZm3bDIFyPrMhj1E$uEDXoYYd1z2Wd5mMPfh3DZAjP7ztVjJ4ezIcn82C0ImqafPA.AnTmcVftHEzLB3tbe2O4SxDyPSDEQgJ4GOtj/";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFYfXg37mauGeurqsLpedgA2XQ9d4Nm0ZGo/hI1f7wwH rpenacob@bsc"
];
};
anavarro = {
uid = 1037;
isNormalUser = true;
home = "/home/Computational/anavarro";
description = "Antoni Navarro";
group = "Computational";
hosts = [ "hut" "raccoon" ];
hashedPassword = "$6$QdNDsuLehoZTYZlb$CDhCouYDPrhoiB7/seu7RF.Gqg4zMQz0n5sA4U1KDgHaZOxy2as9pbIGeF8tOHJKRoZajk5GiaZv0rZMn7Oq31";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILWjRSlKgzBPZQhIeEtk6Lvws2XNcYwHcwPv4osSgst5 anavarro@ssfhead"
];
};
abonerib = {
uid = 4541;
isNormalUser = true;
home = "/home/Computational/abonerib";
description = "Aleix Boné";
group = "Computational";
hosts = [ "owl1" "owl2" "hut" "raccoon" ];
hashedPassword = "$6$V1EQWJr474whv7XJ$OfJ0wueM2l.dgiJiiah0Tip9ITcJ7S7qDvtSycsiQ43QBFyP4lU0e0HaXWps85nqB4TypttYR4hNLoz3bz662/";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIIFiqXqt88VuUfyANkZyLJNiuroIITaGlOOTMhVDKjf abonerib@bsc"
];
};
vlopez = {
uid = 4334;
isNormalUser = true;
home = "/home/Computational/vlopez";
description = "Victor López";
group = "Computational";
hosts = [ "koro" ];
hashedPassword = "$6$0ZBkgIYE/renVqtt$1uWlJsb0FEezRVNoETTzZMx4X2SvWiOsKvi0ppWCRqI66S6TqMBXBdP4fcQyvRRBt0e4Z7opZIvvITBsEtO0f0";
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGMwlUZRf9jfG666Qa5Sb+KtEhXqkiMlBV2su3x/dXHq victor@arch"
];
};
};
groups = {
Computational = { gid = 564; };
};
};
}


@ -0,0 +1,9 @@
{ ... }:
{
# The boards have a BMC watchdog controlled by IPMI
boot.kernelModules = [ "ipmi_watchdog" ];
# Enable systemd watchdog with 30 s interval
systemd.watchdog.runtimeTime = "30s";
}

91
m/common/base/zsh.nix Normal file

@ -0,0 +1,91 @@
{ pkgs, ... }:
{
environment.systemPackages = with pkgs; [
zsh-completions
nix-zsh-completions
];
programs.zsh = {
enable = true;
histSize = 1000000;
shellInit = ''
# Disable new user prompt
if [ ! -e ~/.zshrc ]; then
touch ~/.zshrc
fi
'';
promptInit = ''
# Note that to manually override this in ~/.zshrc you should run `prompt off`
# before setting your PS1 etc. Otherwise this is likely to interact with
# your ~/.zshrc configuration in unexpected ways as the default prompt sets
# a lot of different prompt variables.
autoload -U promptinit && promptinit && prompt default && setopt prompt_sp
'';
# Taken from Ulli Kehrle config:
# https://git.hrnz.li/Ulli/nixos/src/commit/2e203b8d8d671f4e3ced0f1744a51d5c6ee19846/profiles/shell.nix#L199-L205
interactiveShellInit = ''
source "${pkgs.zsh-history-substring-search}/share/zsh-history-substring-search/zsh-history-substring-search.zsh"
# Save history immediately, but only load it when the shell starts
setopt inc_append_history
# dircolors doesn't support alacritty:
# https://lists.gnu.org/archive/html/bug-coreutils/2019-05/msg00029.html
export LS_COLORS='rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=00:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.avif=01;35:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:*~=00;90:*#=00;90:*.bak=00;90:*.old=00;90:*.orig=00;90:*.part=00;90:*.rej=00;90:*.swp=00;90:*.tmp=00;90:*.dpkg-dist=00;90:*.dpkg-old=00;90:*.ucf-dist=00;90:*.ucf-new=00;90:*.ucf-old=00;90:*.rpmnew=00;90:*.rpmorig=00;90:*.rpmsave=00;90:';
# From Arch Linux and GRML
bindkey "^R" history-incremental-pattern-search-backward
bindkey "^S" history-incremental-pattern-search-forward
# Auto rehash for new binaries
zstyle ':completion:*' rehash true
# show a nice menu with the matches
zstyle ':completion:*' menu yes select
bindkey '^[OA' history-substring-search-up # Up
bindkey '^[[A' history-substring-search-up # Up
bindkey '^[OB' history-substring-search-down # Down
bindkey '^[[B' history-substring-search-down # Down
bindkey '\e[1~' beginning-of-line # Home
bindkey '\e[7~' beginning-of-line # Home
bindkey '\e[H' beginning-of-line # Home
bindkey '\eOH' beginning-of-line # Home
bindkey '\e[4~' end-of-line # End
bindkey '\e[8~' end-of-line # End
bindkey '\e[F' end-of-line # End
bindkey '\eOF' end-of-line # End
bindkey '^?' backward-delete-char # Backspace
bindkey '\e[3~' delete-char # Del
# bindkey '\e[3;5~' delete-char # sometimes Del, sometimes C-Del
bindkey '\e[2~' overwrite-mode # Ins
bindkey '^H' backward-kill-word # C-Backspace
bindkey '5~' kill-word # C-Del
bindkey '^[[3;5~' kill-word # C-Del
bindkey '^[[3^' kill-word # C-Del
bindkey "^[[1;5H" backward-kill-line # C-Home
bindkey "^[[7^" backward-kill-line # C-Home
bindkey "^[[1;5F" kill-line # C-End
bindkey "^[[8^" kill-line # C-End
bindkey '^[[1;5C' forward-word # C-Right
bindkey '^[0c' forward-word # C-Right
bindkey '^[[5C' forward-word # C-Right
bindkey '^[[1;5D' backward-word # C-Left
bindkey '^[0d' backward-word # C-Left
bindkey '^[[5D' backward-word # C-Left
'';
};
}

9
m/common/xeon.nix Normal file

@ -0,0 +1,9 @@
{
# Provides the base system for a xeon node.
imports = [
./base.nix
./xeon/fs.nix
./xeon/console.nix
./xeon/net.nix
];
}

14
m/common/xeon/console.nix Normal file

@ -0,0 +1,14 @@
{
# Restart the serial console
systemd.services."serial-getty@ttyS0" = {
enable = true;
wantedBy = [ "getty.target" ];
serviceConfig.Restart = "always";
};
# Enable serial console
boot.kernelParams = [
"console=tty1"
"console=ttyS0,115200"
];
}


@ -1,5 +1,3 @@
{ ... }:
{
# Mount the home via NFS
fileSystems."/home" = {
@ -7,10 +5,4 @@
fsType = "nfs";
options = [ "nfsvers=3" "rsize=1024" "wsize=1024" "cto" "nofail" ];
};
# Tracing
fileSystems."/sys/kernel/tracing" = {
device = "none";
fsType = "tracefs";
};
}

90
m/common/xeon/net.nix Normal file

@ -0,0 +1,90 @@
{ pkgs, ... }:
{
# Infiniband (IPoIB)
environment.systemPackages = [ pkgs.rdma-core ];
boot.kernelModules = [ "ib_umad" "ib_ipoib" ];
networking = {
defaultGateway = "10.0.40.30";
nameservers = ["8.8.8.8"];
proxy = {
default = "http://hut:23080/";
noProxy = "127.0.0.1,localhost,internal.domain,10.0.40.40";
# Don't set all_proxy as go complains and breaks the gitlab runner, see:
# https://github.com/golang/go/issues/16715
allProxy = null;
};
firewall = {
extraCommands = ''
# Prevent ssfhead from contacting our slurmd daemon
iptables -A nixos-fw -p tcp -s ssfhead --dport 6817:6819 -j nixos-fw-refuse
# But accept traffic to slurm ports from any other node in the subnet
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 6817:6819 -j nixos-fw-accept
# We also need to open the srun port range
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 60000:61000 -j nixos-fw-accept
'';
};
extraHosts = ''
10.0.40.30 ssfhead
# Node Entry for node: mds01 (ID=72)
10.0.40.40 bay mds01 mds01-eth0
10.0.42.40 bay-ib mds01-ib0
10.0.40.141 bay-ipmi mds01-ipmi0
# Node Entry for node: oss01 (ID=73)
10.0.40.41 oss01 oss01-eth0
10.0.42.41 oss01-ib0
10.0.40.142 oss01-ipmi0
# Node Entry for node: oss02 (ID=74)
10.0.40.42 lake2 oss02 oss02-eth0
10.0.42.42 lake2-ib oss02-ib0
10.0.40.143 lake2-ipmi oss02-ipmi0
# Node Entry for node: xeon01 (ID=15)
10.0.40.1 owl1 xeon01 xeon01-eth0
10.0.42.1 owl1-ib xeon01-ib0
10.0.40.101 owl1-ipmi xeon01-ipmi0
# Node Entry for node: xeon02 (ID=16)
10.0.40.2 owl2 xeon02 xeon02-eth0
10.0.42.2 owl2-ib xeon02-ib0
10.0.40.102 owl2-ipmi xeon02-ipmi0
# Node Entry for node: xeon03 (ID=17)
10.0.40.3 xeon03 xeon03-eth0
10.0.42.3 xeon03-ib0
10.0.40.103 xeon03-ipmi0
# Node Entry for node: xeon04 (ID=18)
10.0.40.4 xeon04 xeon04-eth0
10.0.42.4 xeon04-ib0
10.0.40.104 xeon04-ipmi0
# Node Entry for node: xeon05 (ID=19)
10.0.40.5 koro xeon05 xeon05-eth0
10.0.42.5 koro-ib xeon05-ib0
10.0.40.105 koro-ipmi xeon05-ipmi0
# Node Entry for node: xeon06 (ID=20)
10.0.40.6 xeon06 xeon06-eth0
10.0.42.6 xeon06-ib0
10.0.40.106 xeon06-ipmi0
# Node Entry for node: xeon07 (ID=21)
10.0.40.7 hut xeon07 xeon07-eth0
10.0.42.7 hut-ib xeon07-ib0
10.0.40.107 hut-ipmi xeon07-ipmi0
# Node Entry for node: xeon08 (ID=22)
10.0.40.8 eudy xeon08 xeon08-eth0
10.0.42.8 eudy-ib xeon08-ib0
10.0.40.108 eudy-ipmi xeon08-ipmi0
'';
};
}

37
m/eudy/configuration.nix Normal file

@ -0,0 +1,37 @@
{ config, pkgs, lib, modulesPath, ... }:
{
imports = [
../common/xeon.nix
#(modulesPath + "/installer/netboot/netboot-minimal.nix")
./kernel/kernel.nix
./cpufreq.nix
./fs.nix
./users.nix
../module/debuginfod.nix
];
# Select this using the ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d53564b";
# disable automatic garbage collector
nix.gc.automatic = lib.mkForce false;
# members of the tracing group can use the lttng-provided kernel events
# without root permissions
users.groups.tracing.members = [ "arocanon" ];
# set up both ethernet and infiniband ips
networking = {
hostName = "eudy";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.8";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.8";
prefixLength = 24;
} ];
};
}

40
m/eudy/cpufreq.nix Normal file

@ -0,0 +1,40 @@
{ lib, ... }:
{
# Disable frequency boost by default. Use the intel_pstate driver instead of
# the acpi_cpufreq driver because acpi_cpufreq does not read the complete
# range of P-states [1]. Use the intel_pstate passive mode [2] to disable HWP,
# which allows a core to "select P-states by itself". This also disables the
# Intel governors which, confusingly, have the same names as the generic ones
# but behave differently [3].
# Essentially, we use the generic governors, but use the Intel driver to read
# the P-state list.
# [1] - https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html#intel-pstate-vs-acpi-cpufreq
# [2] - https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html#passive-mode
# [3] - https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html#active-mode
# https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
# set intel_pstate to passive mode
boot.kernelParams = [
"intel_pstate=passive"
];
# Disable frequency boost
system.activationScripts = {
disableFrequencyBoost.text = ''
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
'';
};
## disable intel_pstate
#boot.kernelParams = [
# "intel_pstate=disable"
#];
## Disable frequency boost
#system.activationScripts = {
# disableFrequencyBoost.text = ''
# echo 0 > /sys/devices/system/cpu/cpufreq/boost
# '';
#};
}
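
With intel_pstate in passive mode the generic cpufreq governors apply, so a governor could be pinned from the same kind of module. The following is only a minimal sketch: it assumes the NixOS option `powerManagement.cpuFreqGovernor` is the mechanism used and that `performance` is the wanted governor, neither of which appears in the original file.

```nix
{ lib, ... }:
{
  # Same kernel parameter as in cpufreq.nix: intel_pstate in passive mode
  boot.kernelParams = [ "intel_pstate=passive" ];

  # Assumption: pin the generic "performance" governor. mkDefault keeps it
  # overridable from a host-specific configuration.
  powerManagement.cpuFreqGovernor = lib.mkDefault "performance";
}
```

Once booted, the active governor can be checked under /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.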

13
m/eudy/fs.nix Normal file

@ -0,0 +1,13 @@
{ ... }:
{
fileSystems."/nix" = {
device = "/dev/disk/by-label/optane";
fsType = "ext4";
neededForBoot = true;
};
fileSystems."/mnt/data" = {
device = "/dev/disk/by-label/data";
fsType = "ext4";
};
}

File diff suppressed because it is too large

10333
m/eudy/kernel/configs/lockdep Normal file

File diff suppressed because it is too large

70
m/eudy/kernel/kernel.nix Normal file

@ -0,0 +1,70 @@
{ pkgs, lib, ... }:
let
#fcs-devel = pkgs.linuxPackages_custom {
# version = "6.2.8";
# src = /mnt/data/kernel/fcs/kernel/src;
# configfile = /mnt/data/kernel/fcs/kernel/configs/defconfig;
#};
#fcsv1 = fcs-kernel "bc11660676d3d68ce2459b9fb5d5e654e3f413be" false;
#fcsv2 = fcs-kernel "db0f2eca0cd57a58bf456d7d2c7d5d8fdb25dfb1" false;
#fcsv1-lockdep = fcs-kernel "bc11660676d3d68ce2459b9fb5d5e654e3f413be" true;
#fcsv2-lockdep = fcs-kernel "db0f2eca0cd57a58bf456d7d2c7d5d8fdb25dfb1" true;
#fcs-kernel = gitCommit: lockdep: pkgs.linuxPackages_custom {
# version = "6.2.8";
# src = builtins.fetchGit {
# url = "git@bscpm03.bsc.es:ompss-kernel/linux.git";
# rev = gitCommit;
# ref = "fcs";
# };
# configfile = if lockdep then ./configs/lockdep else ./configs/defconfig;
#};
kernel = nixos-fcs;
nixos-fcs-kernel = lib.makeOverridable ({gitCommit, lockStat ? false, preempt ? false, branch ? "fcs"}: pkgs.linuxPackagesFor (pkgs.buildLinux rec {
version = "6.2.8";
src = builtins.fetchGit {
url = "git@bscpm03.bsc.es:ompss-kernel/linux.git";
rev = gitCommit;
ref = branch;
};
structuredExtraConfig = with lib.kernel; {
# add general custom kernel options here
} // lib.optionalAttrs lockStat {
LOCK_STAT = yes;
} // lib.optionalAttrs preempt {
PREEMPT = lib.mkForce yes;
PREEMPT_VOLUNTARY = lib.mkForce no;
};
kernelPatches = [];
extraMeta.branch = lib.versions.majorMinor version;
}));
nixos-fcs = nixos-fcs-kernel {gitCommit = "8a09822dfcc8f0626b209d6d2aec8b5da459dfee";};
nixos-fcs-lockstat = nixos-fcs.override {
lockStat = true;
};
nixos-fcs-lockstat-preempt = nixos-fcs.override {
lockStat = true;
preempt = true;
};
latest = pkgs.linuxPackages_latest;
in {
imports = [
./lttng.nix
./perf.nix
];
boot.kernelPackages = lib.mkForce kernel;
# disable all cpu mitigations
boot.kernelParams = [
"mitigations=off"
];
# enable memory overcommit, needed to build a taglibc system using nix after
# increasing the openblas memory footprint
boot.kernel.sysctl."vm.overcommit_memory" = 1;
}
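
The kernel variants above rely on `lib.makeOverridable`, which attaches an `override` function to the result so that single arguments can be changed later. A minimal sketch of that mechanism, assuming `<nixpkgs>` is available on the Nix path; `mkKernel` is a hypothetical stand-in that returns a plain attribute set instead of building a kernel:

```nix
# Evaluate with: nix-instantiate --eval --strict example.nix
let
  lib = (import <nixpkgs> { }).lib;

  # Hypothetical stand-in for nixos-fcs-kernel: same arguments, but it only
  # records them instead of calling pkgs.buildLinux.
  mkKernel = lib.makeOverridable
    ({ gitCommit, lockStat ? false, preempt ? false, branch ? "fcs" }: {
      inherit gitCommit lockStat preempt branch;
    });

  base = mkKernel { gitCommit = "8a09822dfcc8f0626b209d6d2aec8b5da459dfee"; };

  # Only lockStat changes; gitCommit, preempt and branch carry over from base.
  lockstat = base.override { lockStat = true; };
in {
  inherit (lockstat) gitCommit lockStat preempt branch;
}
```

In the module above, switching the booted variant would then amount to pointing `kernel` at `nixos-fcs-lockstat` or `nixos-fcs-lockstat-preempt` instead of `nixos-fcs`.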

43
m/eudy/kernel/lttng.nix Normal file

@ -0,0 +1,43 @@
{ config, pkgs, lib, ... }:
let
# The lttng btrfs probe crashes at compile time because of an undefined
# function. This disables the btrfs tracepoints to avoid the issue.
# Also enable the lockdep tracepoints. They are disabled by default because,
# as I was told on the mailing list, they do not work well on architectures
# other than x86_64 (ARM, I think).
lttng-modules-fixed = config.boot.kernelPackages.lttng-modules.overrideAttrs (finalAttrs: previousAttrs: {
patchPhase = (lib.optionalString (previousAttrs ? patchPhase) previousAttrs.patchPhase) + ''
# disable btrfs
substituteInPlace src/probes/Kbuild \
--replace " obj-\$(CONFIG_LTTNG) += lttng-probe-btrfs.o" " #obj-\$(CONFIG_LTTNG) += lttng-probe-btrfs.o"
# enable lockdep tracepoints
substituteInPlace src/probes/Kbuild \
--replace "#ifneq (\$(CONFIG_LOCKDEP),)" "ifneq (\$(CONFIG_LOCKDEP),)" \
--replace "# obj-\$(CONFIG_LTTNG) += lttng-probe-lock.o" " obj-\$(CONFIG_LTTNG) += lttng-probe-lock.o" \
--replace "#endif # CONFIG_LOCKDEP" "endif # CONFIG_LOCKDEP"
'';
});
in {
# add the lttng tools and modules to the system environment
boot.extraModulePackages = [ lttng-modules-fixed ];
environment.systemPackages = with pkgs; [
lttng-tools lttng-ust babeltrace
];
# start the lttng root daemon to manage kernel events
systemd.services.lttng-sessiond = {
wantedBy = [ "multi-user.target" ];
description = "LTTng session daemon for the root user";
serviceConfig = {
User = "root";
ExecStart = ''
${pkgs.lttng-tools}/bin/lttng-sessiond
'';
};
};
}

22
m/eudy/kernel/perf.nix Normal file

@ -0,0 +1,22 @@
{ config, pkgs, lib, ... }:
{
# add the perf tool
environment.systemPackages = with pkgs; [
config.boot.kernelPackages.perf
];
# allow non-root users to read tracing data from the kernel
boot.kernel.sysctl."kernel.perf_event_paranoid" = -2;
boot.kernel.sysctl."kernel.kptr_restrict" = 0;
# specify additional options to the tracefs directory to allow members of the
# tracing group to access tracefs.
fileSystems."/sys/kernel/tracing" = {
options = [
"mode=755"
"gid=tracing"
];
};
}

11
m/eudy/users.nix Normal file

@ -0,0 +1,11 @@
{ ... }:
{
security.sudo.extraRules = [{
users = [ "arocanon" ];
commands = [{
command = "ALL";
options = [ "NOPASSWD" ]; # Adding "SETENV" here could also be a good idea
}];
}];
}
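
The comment in the rule mentions `SETENV` as a possible addition. A minimal sketch of what that variant could look like, assuming the intent is to let the user keep or set environment variables when calling sudo (for example with `sudo -E`); this is not part of the commit:

```nix
{ ... }:
{
  security.sudo.extraRules = [{
    users = [ "arocanon" ];
    commands = [{
      command = "ALL";
      # NOPASSWD: no password prompt.
      # SETENV (assumed addition): the user may preserve or set environment
      # variables for the command, which env_reset would otherwise clear.
      options = [ "NOPASSWD" "SETENV" ];
    }];
  }];
}
```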

162
m/hut/blackbox.yml Normal file

@ -0,0 +1,162 @@
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      proxy_url: "http://127.0.0.1:23080"
      skip_resolve_phase_with_proxy: true
      follow_redirects: true
      valid_status_codes: [] # Defaults to 2xx
      method: GET
  http_with_proxy:
    prober: http
    http:
      proxy_url: "http://127.0.0.1:3128"
      skip_resolve_phase_with_proxy: true
  http_with_proxy_and_headers:
    prober: http
    http:
      proxy_url: "http://127.0.0.1:3128"
      proxy_connect_header:
        Proxy-Authorization:
          - Bearer token
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
  http_post_body_file:
    prober: http
    timeout: 5s
    http:
      method: POST
      body_file: "/files/body.txt"
  http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  http_2xx_oauth_client_credentials:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      valid_status_codes:
        - 200
        - 201
      oauth2:
        client_id: "client_id"
        client_secret: "client_secret"
        token_url: "https://api.example.com/token"
        endpoint_params:
          grant_type: "client_credentials"
  http_custom_ca_example:
    prober: http
    http:
      method: GET
      tls_config:
        ca_file: "/certs/my_cert.crt"
  http_gzip:
    prober: http
    http:
      method: GET
      compression: gzip
  http_gzip_with_accept_encoding:
    prober: http
    http:
      method: GET
      compression: gzip
      headers:
        Accept-Encoding: gzip
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect_example:
    prober: tcp
    timeout: 5s
  imap_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "OK.*STARTTLS"
        - send: ". STARTTLS"
        - expect: "OK"
        - starttls: true
        - send: ". capability"
        - expect: "CAPABILITY IMAP4rev1"
  smtp_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "^220 ([^ ]+) ESMTP (.+)$"
        - send: "EHLO prober\r"
        - expect: "^250-STARTTLS"
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - send: "EHLO prober\r"
        - expect: "^250-AUTH"
        - send: "QUIT\r"
  irc_banner_example:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  dns_udp_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "www.prometheus.io"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
        fail_if_all_match_regexp:
          - ".*127.0.0.1"
        fail_if_not_matches_regexp:
          - "www.prometheus.io.\t300\tIN\tA\t127.0.0.1"
        fail_if_none_matches_regexp:
          - "127.0.0.1"
      validate_authority_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
      validate_additional_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
  dns_soa:
    prober: dns
    dns:
      query_name: "prometheus.io"
      query_type: "SOA"
  dns_tcp_example:
    prober: dns
    dns:
      transport_protocol: "tcp" # defaults to "udp"
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      query_name: "www.prometheus.io"

54
m/hut/configuration.nix Normal file

@ -0,0 +1,54 @@
{ config, pkgs, ... }:
{
imports = [
../common/xeon.nix
../module/ceph.nix
../module/debuginfod.nix
../module/emulation.nix
../module/slurm-client.nix
./gitlab-runner.nix
./monitoring.nix
./nfs.nix
./slurm-server.nix
./nix-serve.nix
./public-inbox.nix
./gitea.nix
./msmtp.nix
./postgresql.nix
./nginx.nix
./p.nix
#./pxe.nix
];
# Select this using the ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/ata-INTEL_SSDSC2BB240G7_PHDV6462004Y240AGN";
fileSystems."/nvme" = {
fsType = "ext4";
device = "/dev/disk/by-label/nvme";
};
networking = {
hostName = "hut";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.7";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.7";
prefixLength = 24;
} ];
firewall = {
extraCommands = ''
# Accept all proxy traffic from compute nodes but not from the login node
iptables -A nixos-fw -p tcp -s 10.0.40.30 --dport 23080 -j nixos-fw-log-refuse
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 23080 -j nixos-fw-accept
'';
};
};
# Allow proxy to bind to the ethernet interface
services.openssh.settings.GatewayPorts = "clientspecified";
}

63
m/hut/gitea.nix Normal file

@ -0,0 +1,63 @@
{ config, lib, ... }:
{
age.secrets.giteaRunnerToken.file = ../../secrets/gitea-runner-token.age;
services.gitea = {
enable = true;
appName = "Gitea in the jungle";
settings = {
server = {
ROOT_URL = "https://jungle.bsc.es/git/";
LOCAL_ROOT_URL = "https://jungle.bsc.es/git/";
LANDING_PAGE = "explore";
};
metrics.ENABLED = true;
service = {
REGISTER_MANUAL_CONFIRM = true;
ENABLE_NOTIFY_MAIL = true;
};
log.LEVEL = "Warn";
mailer = {
ENABLED = true;
FROM = "jungle-robot@bsc.es";
PROTOCOL = "sendmail";
SENDMAIL_PATH = "/run/wrappers/bin/sendmail";
SENDMAIL_ARGS = "--";
};
};
};
services.gitea-actions-runner.instances = {
runrun = {
enable = true;
name = "runrun";
url = "https://jungle.bsc.es/git/";
tokenFile = config.age.secrets.giteaRunnerToken.path;
labels = [ "native:host" ];
settings.runner.capacity = 8;
};
};
systemd.services.gitea-runner-runrun = {
path = [ "/run/current-system/sw" ];
serviceConfig = {
# DynamicUser doesn't work well with SSH
DynamicUser = lib.mkForce false;
User = "gitea-runner";
Group = "gitea-runner";
};
};
users.users.gitea-runner = {
isSystemUser = true;
home = "/var/lib/gitea-runner";
description = "Gitea Runner";
group = "gitea-runner";
extraGroups = [ "docker" ];
createHome = true;
};
users.groups.gitea-runner = {};
}


@ -1,39 +1,37 @@
{ pkgs, lib, config, ... }:
{
age.secrets."secrets/ovni-token".file = ./secrets/ovni-token.age;
age.secrets."secrets/nosv-token".file = ./secrets/nosv-token.age;
age.secrets.gitlabRunnerShellToken.file = ../../secrets/gitlab-runner-shell-token.age;
age.secrets.gitlabRunnerDockerToken.file = ../../secrets/gitlab-runner-docker-token.age;
services.gitlab-runner = {
enable = true;
services = {
ovni-shell = {
registrationConfigFile = config.age.secrets."secrets/ovni-token".path;
settings.concurrent = 5;
services = let
common-shell = {
executor = "shell";
tagList = [ "nix" "xeon" ];
environmentVariables = {
SHELL = "${pkgs.bash}/bin/bash";
};
};
ovni-docker = {
registrationConfigFile = config.age.secrets."secrets/ovni-token".path;
common-docker = {
executor = "docker";
dockerImage = "debian:stable";
tagList = [ "docker" "xeon" ];
registrationFlags = [ "--docker-network-mode host" ];
registrationFlags = [
"--docker-network-mode host"
];
environmentVariables = {
https_proxy = "http://localhost:23080";
http_proxy = "http://localhost:23080";
};
};
nosv-docker = {
registrationConfigFile = config.age.secrets."secrets/nosv-token".path;
dockerImage = "debian:stable";
tagList = [ "docker" "xeon" ];
registrationFlags = [ "--docker-network-mode host" ];
environmentVariables = {
https_proxy = "http://localhost:23080";
http_proxy = "http://localhost:23080";
};
in {
# For pm.bsc.es/gitlab
gitlab-pm-shell = common-shell // {
authenticationTokenConfigFile = config.age.secrets.gitlabRunnerShellToken.path;
};
gitlab-pm-docker = common-docker // {
authenticationTokenConfigFile = config.age.secrets.gitlabRunnerDockerToken.path;
};
};
};

13
m/hut/ipmi.yml Normal file

@ -0,0 +1,13 @@
modules:
  default:
    collectors:
    - bmc
    - ipmi
    - chassis
  lan:
    collectors:
    - ipmi
    - chassis
    user: ""
    pass: ""

249
m/hut/monitoring.nix Normal file

@ -0,0 +1,249 @@
{ config, lib, ... }:
{
imports = [ ../module/slurm-exporter.nix ];
age.secrets.grafanaJungleRobotPassword = {
file = ../../secrets/jungle-robot-password.age;
owner = "grafana";
mode = "400";
};
services.grafana = {
enable = true;
settings = {
server = {
domain = "jungle.bsc.es";
root_url = "%(protocol)s://%(domain)s/grafana";
serve_from_sub_path = true;
http_port = 2342;
http_addr = "127.0.0.1";
};
smtp = {
enabled = true;
from_address = "jungle-robot@bsc.es";
user = "jungle-robot";
# Read the password from a file, which is only readable by grafana user
# https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#file-provider
password = "$__file{${config.age.secrets.grafanaJungleRobotPassword.path}}";
host = "mail.bsc.es:465";
startTLS_policy = "NoStartTLS";
};
feature_toggles.publicDashboards = true;
"auth.anonymous".enabled = true;
log.level = "warn";
};
};
# Make grafana alerts also use the proxy
systemd.services.grafana.environment = config.networking.proxy.envVars;
services.prometheus = {
enable = true;
port = 9001;
retentionTime = "1y";
listenAddress = "127.0.0.1";
};
systemd.services.prometheus-ipmi-exporter.serviceConfig.DynamicUser = lib.mkForce false;
systemd.services.prometheus-ipmi-exporter.serviceConfig.PrivateDevices = lib.mkForce false;
# We need access to the devices to monitor the disk space
systemd.services.prometheus-node-exporter.serviceConfig.PrivateDevices = lib.mkForce false;
systemd.services.prometheus-node-exporter.serviceConfig.ProtectHome = lib.mkForce "read-only";
virtualisation.docker.daemon.settings = {
metrics-addr = "127.0.0.1:9323";
};
# Required to allow the smartctl exporter to read the nvme0 character device,
# see the commit message on:
# https://github.com/NixOS/nixpkgs/commit/12c26aca1fd55ab99f831bedc865a626eee39f80
services.udev.extraRules = ''
SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", GROUP="disk"
'';
services.prometheus = {
exporters = {
ipmi = {
enable = true;
group = "root";
user = "root";
configFile = ./ipmi.yml;
#extraFlags = [ "--log.level=debug" ];
listenAddress = "127.0.0.1";
};
node = {
enable = true;
enabledCollectors = [ "systemd" ];
port = 9002;
listenAddress = "127.0.0.1";
};
smartctl = {
enable = true;
listenAddress = "127.0.0.1";
};
blackbox = {
enable = true;
listenAddress = "127.0.0.1";
configFile = ./blackbox.yml;
};
};
scrapeConfigs = [
{
job_name = "xeon07";
static_configs = [{
targets = [
"127.0.0.1:${toString config.services.prometheus.exporters.node.port}"
"127.0.0.1:${toString config.services.prometheus.exporters.ipmi.port}"
"127.0.0.1:9323"
"127.0.0.1:9252"
"127.0.0.1:${toString config.services.prometheus.exporters.smartctl.port}"
"127.0.0.1:9341" # Slurm exporter
"127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}"
];
}];
}
{
job_name = "ceph";
static_configs = [{
targets = [
"10.0.40.40:9283" # Ceph statistics
"10.0.40.40:9002" # Node exporter
"10.0.40.42:9002" # Node exporter
];
}];
}
{
job_name = "blackbox-http";
metrics_path = "/probe";
params = { module = [ "http_2xx" ]; };
static_configs = [{
targets = [
"https://www.google.com/robots.txt"
"https://pm.bsc.es/"
"https://pm.bsc.es/gitlab/"
"https://jungle.bsc.es/"
"https://gitlab.bsc.es/"
];
}];
relabel_configs = [
{
# Takes the address and sets it in the "target=<xyz>" URL parameter
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
# Sets the "instance" label with the remote host we are querying
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
# Sends the actual scrape request to the local blackbox exporter
target_label = "__address__";
replacement = "127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}";
}
];
}
{
job_name = "blackbox-icmp";
metrics_path = "/probe";
params = { module = [ "icmp" ]; };
static_configs = [{
targets = [
"1.1.1.1"
"8.8.8.8"
"ssfhead"
"anella-bsc.cesca.cat"
];
}];
relabel_configs = [
{
# Takes the address and sets it in the "target=<xyz>" URL parameter
source_labels = [ "__address__" ];
target_label = "__param_target";
}
{
# Sets the "instance" label with the remote host we are querying
source_labels = [ "__param_target" ];
target_label = "instance";
}
{
# Sends the actual scrape request to the local blackbox exporter
target_label = "__address__";
replacement = "127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}";
}
];
}
{
job_name = "gitea";
static_configs = [{ targets = [ "127.0.0.1:3000" ]; }];
}
{
# Scrape the IPMI info of the hosts remotely via LAN
job_name = "ipmi-lan";
scrape_interval = "1m";
scrape_timeout = "30s";
metrics_path = "/ipmi";
scheme = "http";
relabel_configs = [
{
# Takes the address and sets it in the "target=<xyz>" URL parameter
source_labels = [ "__address__" ];
separator = ";";
regex = "(.*)(:80)?";
target_label = "__param_target";
replacement = "\${1}";
action = "replace";
}
{
# Sets the "instance" label with the remote host we are querying
source_labels = [ "__param_target" ];
separator = ";";
regex = "(.*)";
target_label = "instance";
replacement = "\${1}";
action = "replace";
}
{
# Sets the fixed "module=lan" URL param
separator = ";";
regex = "(.*)";
target_label = "__param_module";
replacement = "lan";
action = "replace";
}
{
# Sets the target to query as the localhost IPMI exporter
separator = ";";
regex = ".*";
target_label = "__address__";
replacement = "127.0.0.1:9290";
action = "replace";
}
];
# Load the list of targets from another file
file_sd_configs = [
{
files = [ "${./targets.yml}" ];
refresh_interval = "30s";
}
];
}
{
job_name = "ipmi-raccoon";
metrics_path = "/ipmi";
static_configs = [
{ targets = [ "127.0.0.1:9291" ]; }
];
params = {
target = [ "84.88.51.142" ];
module = [ "raccoon" ];
};
}
];
};
}
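
The two blackbox jobs above repeat the same three relabel steps: copy the target into the `target` URL parameter, use it as the `instance` label, and point the scrape at the local exporter. A minimal sketch of how that pattern could be factored out; `mkBlackboxJob` and `blackboxAddr` are hypothetical names, and the job shown is meant to replace the hand-written one rather than be added next to it:

```nix
{ config, ... }:
let
  # Address of the local blackbox exporter, as used in the jobs above
  blackboxAddr = "127.0.0.1:${toString config.services.prometheus.exporters.blackbox.port}";

  # Hypothetical helper: builds one scrape job for a given blackbox module
  mkBlackboxJob = name: module: targets: {
    job_name = name;
    metrics_path = "/probe";
    params = { module = [ module ]; };
    static_configs = [{ inherit targets; }];
    relabel_configs = [
      # Takes the address and sets it in the "target=<xyz>" URL parameter
      { source_labels = [ "__address__" ]; target_label = "__param_target"; }
      # Sets the "instance" label with the remote host we are querying
      { source_labels = [ "__param_target" ]; target_label = "instance"; }
      # Sends the scrape request itself to the blackbox exporter
      { target_label = "__address__"; replacement = blackboxAddr; }
    ];
  };
in {
  services.prometheus.scrapeConfigs = [
    (mkBlackboxJob "blackbox-icmp" "icmp" [ "1.1.1.1" "8.8.8.8" ])
  ];
}
```

With this relabeling, Prometheus requests `/probe?module=icmp&target=1.1.1.1` from the exporter, and the resulting series carry `instance="1.1.1.1"`.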

24
m/hut/msmtp.nix Normal file

@ -0,0 +1,24 @@
{ config, lib, ... }:
{
age.secrets.jungleRobotPassword = {
file = ../../secrets/jungle-robot-password.age;
group = "gitea";
mode = "440";
};
programs.msmtp = {
enable = true;
accounts = {
default = {
auth = true;
tls = true;
tls_starttls = false;
port = 465;
host = "mail.bsc.es";
user = "jungle-robot";
passwordeval = "cat ${config.age.secrets.jungleRobotPassword.path}";
from = "jungle-robot@bsc.es";
};
};
};
}

14
m/hut/nginx.nix Normal file

@ -0,0 +1,14 @@
{
services.nginx = {
enable = true;
virtualHosts."jungle.bsc.es" = {
listen = [
{
addr = "127.0.0.1";
port = 8123;
}
];
locations."/p/".alias = "/ceph/p/";
};
};
}

16
m/hut/nix-serve.nix Normal file

@ -0,0 +1,16 @@
{ config, ... }:
{
age.secrets.nixServe.file = ../../secrets/nix-serve.age;
services.nix-serve = {
enable = true;
# Only listen locally, as we serve it via ssh
bindAddress = "127.0.0.1";
port = 5000;
secretKeyFile = config.age.secrets.nixServe.path;
# Public key:
# jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=
};
}
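
The public key in the comment above is what a client needs in order to trust store paths signed by this cache. A minimal sketch of a client-side configuration, assuming the client reaches port 5000 through some forwarding (for instance an SSH tunnel) at `http://localhost:5000`; that URL is an assumption and does not come from the commit:

```nix
{ ... }:
{
  nix.settings = {
    # Assumed endpoint: a local forward to nix-serve on hut
    substituters = [ "http://localhost:5000" ];
    # Public key copied from the comment in nix-serve.nix
    trusted-public-keys = [
      "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0="
    ];
  };
}
```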

22
m/hut/p.nix Normal file

@ -0,0 +1,22 @@
{ pkgs, ... }:
let
p = pkgs.writeShellScriptBin "p" ''
set -e
cd /ceph
pastedir="p/$USER"
mkdir -p "$pastedir"
if [ -n "$1" ]; then
out="$pastedir/$1"
else
out=$(mktemp "$pastedir/XXXXXXXX.txt")
fi
cat > "$out"
chmod go+r "$out"
echo "https://jungle.bsc.es/$out"
'';
in
{
environment.systemPackages = with pkgs; [ p ];
}

19
m/hut/postgresql.nix Normal file

@ -0,0 +1,19 @@
{ lib, ... }:
{
services.postgresql = {
enable = true;
ensureDatabases = [ "perftestsdb" ];
ensureUsers = [
{ name = "anavarro"; ensureClauses.superuser = true; }
{ name = "rarias"; ensureClauses.superuser = true; }
{ name = "grafana"; }
];
authentication = ''
#type database DBuser auth-method
local perftestsdb rarias trust
local perftestsdb anavarro trust
local perftestsdb grafana trust
'';
};
}

79
m/hut/public-inbox.css Normal file

@ -0,0 +1,79 @@
/*
* CC0-1.0 <https://creativecommons.org/publicdomain/zero/1.0/legalcode>
* Dark color scheme using 216 web-safe colors, inspired
* somewhat by the default color scheme in mutt.
* It reduces eyestrain for me, and energy usage for all:
* https://en.wikipedia.org/wiki/Light-on-dark_color_scheme
*/
* {
font-size: 14px;
font-family: monospace;
}
pre {
white-space: pre-wrap;
padding: 10px;
background: #f5f5f5;
}
hr {
margin: 30px 0;
}
body {
max-width: 120ex; /* 120 columns wide */
margin: 50px auto;
}
/*
* Underlined links add visual noise, which makes them hard to read.
* Use colors to make them stand out, instead.
*/
a:link {
color: #007;
text-decoration: none;
}
a:visited {
color:#504;
}
a:hover {
text-decoration: underline;
}
/* quoted text in emails gets a different color */
*.q { color:gray }
/*
* these may be used with cgit <https://git.zx2c4.com/cgit/>, too.
* (cgit uses <div>, public-inbox uses <span>)
*/
*.add { color:darkgreen } /* diff post-image lines */
*.del { color:darkred } /* diff pre-image lines */
*.head { color:black } /* diff header (metainformation) */
*.hunk { color:gray } /* diff hunk-header */
/*
* highlight 3.x colors (tested 3.18) for displaying blobs.
* This doesn't use most of the colors available, as I find too
* many colors overwhelming, so the default is commented out.
*/
.hl.num { color:#f30 } /* number */
.hl.esc { color:#f0f } /* escape character */
.hl.str { color:#f30 } /* string */
.hl.ppc { color:#f0f } /* preprocessor */
.hl.pps { color:#f30 } /* preprocessor string */
.hl.slc { color:#09f } /* single-line comment */
.hl.com { color:#09f } /* multi-line comment */
/* .hl.opt { color:#ccc } */ /* operator */
/* .hl.ipl { color:#ccc } */ /* interpolation */
/* keyword groups kw[a-z] */
.hl.kwa { color:#ff0 }
.hl.kwb { color:#0f0 }
.hl.kwc { color:#ff0 }
/* .hl.kwd { color:#ccc } */
/* line-number (unused by public-inbox) */
/* .hl.lin { color:#ccc } */

47
m/hut/public-inbox.nix Normal file

@ -0,0 +1,47 @@
{ lib, ... }:
{
services.public-inbox = {
enable = true;
http = {
enable = true;
port = 8081;
mounts = [ "/lists" ];
};
settings.publicinbox = {
css = [ "${./public-inbox.css}" ];
wwwlisting = "all";
};
inboxes = {
bscpkgs = {
url = "https://jungle.bsc.es/lists/bscpkgs";
address = [ "~rodarima/bscpkgs@lists.sr.ht" ];
watch = [ "imaps://jungle-robot%40gmx.com@imap.gmx.com/INBOX" ];
description = "Patches for bscpkgs";
listid = "~rodarima/bscpkgs.lists.sr.ht";
};
jungle = {
url = "https://jungle.bsc.es/lists/jungle";
address = [ "~rodarima/jungle@lists.sr.ht" ];
watch = [ "imaps://jungle-robot%40gmx.com@imap.gmx.com/INBOX" ];
description = "Patches for jungle";
listid = "~rodarima/jungle.lists.sr.ht";
};
};
};
# We need access to the network for the watch service, as we will fetch the
# emails directly from the IMAP server.
systemd.services.public-inbox-watch.serviceConfig = {
PrivateNetwork = lib.mkForce false;
RestrictAddressFamilies = lib.mkForce [ "AF_UNIX" "AF_INET" "AF_INET6" ];
KillSignal = "SIGKILL"; # Avoid slow shutdown
# Required for chmod(..., 02750) on directories by git, from
# systemd.exec(8):
# > Note that this restricts marking of any type of file system object with
# > these bits, including both regular files and directories (where the SGID
# > is a different meaning than for files, see documentation).
RestrictSUIDSGID = lib.mkForce false;
};
}

35
m/hut/pxe.nix Normal file

@ -0,0 +1,35 @@
{ theFlake, pkgs, ... }:
# This module describes a script that can launch the pixiecore daemon to serve a
# NixOS image via PXE to a node to directly boot from there, without requiring a
# working disk.
let
# The host config must have the netboot-minimal.nix module too
host = theFlake.nixosConfigurations.lake2;
sys = host.config.system;
build = sys.build;
kernel = "${build.kernel}/bzImage";
initrd = "${build.netbootRamdisk}/initrd";
init = "${build.toplevel}/init";
script = pkgs.writeShellScriptBin "pixiecore-helper" ''
# writeShellScriptBin already adds the shebang, so enable tracing here
set -x
${pkgs.pixiecore}/bin/pixiecore \
boot ${kernel} ${initrd} --cmdline "init=${init} loglevel=4" \
--debug --dhcp-no-bind --port 64172 --status-port 64172 "$@"
'';
in
{
## We need a DHCP server to provide the IP
#services.dnsmasq = {
# enable = true;
# settings = {
# domain-needed = true;
# dhcp-range = [ "192.168.0.2,192.168.0.254" ];
# };
#};
environment.systemPackages = [ script ];
}

7
m/hut/slurm-server.nix Normal file

@ -0,0 +1,7 @@
{ ... }:
{
services.slurm = {
server.enable = true;
};
}

15
m/hut/targets.yml Normal file

@ -0,0 +1,15 @@
- targets:
  - 10.0.40.101
  - 10.0.40.102
  - 10.0.40.103
  - 10.0.40.104
  - 10.0.40.105
  - 10.0.40.106
  - 10.0.40.107
  - 10.0.40.108
  # Storage
  - 10.0.40.141
  - 10.0.40.142
  - 10.0.40.143
  labels:
    job: ipmi-lan

35
m/koro/configuration.nix Normal file

@ -0,0 +1,35 @@
{ config, pkgs, lib, modulesPath, ... }:
{
imports = [
../common/xeon.nix
#(modulesPath + "/installer/netboot/netboot-minimal.nix")
../eudy/cpufreq.nix
../eudy/users.nix
./kernel.nix
];
# Select this using the ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d5376d2";
# disable automatic garbage collector
nix.gc.automatic = lib.mkForce false;
# members of the tracing group can use the lttng-provided kernel events
# without root permissions
users.groups.tracing.members = [ "arocanon" "vlopez" ];
# set up both ethernet and infiniband ips
networking = {
hostName = "koro";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.5";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.5";
prefixLength = 24;
} ];
};
}

70
m/koro/kernel.nix Normal file

@ -0,0 +1,70 @@
{ pkgs, lib, ... }:
let
#fcs-devel = pkgs.linuxPackages_custom {
# version = "6.2.8";
# src = /mnt/data/kernel/fcs/kernel/src;
# configfile = /mnt/data/kernel/fcs/kernel/configs/defconfig;
#};
#fcsv1 = fcs-kernel "bc11660676d3d68ce2459b9fb5d5e654e3f413be" false;
#fcsv2 = fcs-kernel "db0f2eca0cd57a58bf456d7d2c7d5d8fdb25dfb1" false;
#fcsv1-lockdep = fcs-kernel "bc11660676d3d68ce2459b9fb5d5e654e3f413be" true;
#fcsv2-lockdep = fcs-kernel "db0f2eca0cd57a58bf456d7d2c7d5d8fdb25dfb1" true;
#fcs-kernel = gitCommit: lockdep: pkgs.linuxPackages_custom {
# version = "6.2.8";
# src = builtins.fetchGit {
# url = "git@bscpm03.bsc.es:ompss-kernel/linux.git";
# rev = gitCommit;
# ref = "fcs";
# };
# configfile = if lockdep then ./configs/lockdep else ./configs/defconfig;
#};
kernel = nixos-fcs;
nixos-fcs-kernel = lib.makeOverridable ({gitCommit, lockStat ? false, preempt ? false, branch ? "fcs"}: pkgs.linuxPackagesFor (pkgs.buildLinux rec {
version = "6.2.8";
src = builtins.fetchGit {
url = "git@bscpm03.bsc.es:ompss-kernel/linux.git";
rev = gitCommit;
ref = branch;
};
structuredExtraConfig = with lib.kernel; {
# add general custom kernel options here
} // lib.optionalAttrs lockStat {
LOCK_STAT = yes;
} // lib.optionalAttrs preempt {
PREEMPT = lib.mkForce yes;
PREEMPT_VOLUNTARY = lib.mkForce no;
};
kernelPatches = [];
extraMeta.branch = lib.versions.majorMinor version;
}));
nixos-fcs = nixos-fcs-kernel {gitCommit = "8a09822dfcc8f0626b209d6d2aec8b5da459dfee";};
nixos-fcs-lockstat = nixos-fcs.override {
lockStat = true;
};
nixos-fcs-lockstat-preempt = nixos-fcs.override {
lockStat = true;
preempt = true;
};
latest = pkgs.linuxPackages_latest;
in {
imports = [
../eudy/kernel/lttng.nix
../eudy/kernel/perf.nix
];
boot.kernelPackages = lib.mkForce kernel;
# disable all cpu mitigations
boot.kernelParams = [
"mitigations=off"
];
# enable memory overcommit, needed to build a taglibc system using nix after
# increasing the openblas memory footprint
boot.kernel.sysctl."vm.overcommit_memory" = 1;
}

83
m/lake2/configuration.nix Normal file

@ -0,0 +1,83 @@
{ config, pkgs, lib, modulesPath, ... }:
{
imports = [
../common/xeon.nix
../module/monitoring.nix
];
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d53563a";
boot.kernel.sysctl = {
"kernel.yama.ptrace_scope" = lib.mkForce "1";
};
environment.systemPackages = with pkgs; [
ceph
];
services.ceph = {
enable = true;
global = {
fsid = "9c8d06e0-485f-4aaf-b16b-06d6daf1232b";
monHost = "10.0.40.40";
monInitialMembers = "bay";
clusterNetwork = "10.0.40.40/24"; # Use Ethernet only
};
osd = {
enable = true;
# One daemon per NVME disk
daemons = [ "4" "5" "6" "7" ];
extraConfig = {
"osd crush chooseleaf type" = "0";
"osd journal size" = "10000";
"osd pool default min size" = "2";
"osd pool default pg num" = "200";
"osd pool default pgp num" = "200";
"osd pool default size" = "3";
};
};
};
networking = {
hostName = "lake2";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.42";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.42";
prefixLength = 24;
} ];
firewall = {
extraCommands = ''
# Accept all incoming TCP traffic from bay
iptables -A nixos-fw -p tcp -s bay -j nixos-fw-accept
# Accept monitoring requests from hut
iptables -A nixos-fw -p tcp -s hut --dport 9002 -j nixos-fw-accept
# Accept all Ceph traffic from the local network
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 -m multiport --dport 3300,6789,6800:7568 -j nixos-fw-accept
'';
};
};
# Missing service for volumes, see:
# https://www.reddit.com/r/ceph/comments/14otjyo/comment/jrd69vt/
systemd.services.ceph-volume = {
enable = true;
description = "Ceph Volume activation";
unitConfig = {
After = "local-fs.target";
Wants = "local-fs.target";
};
path = [ pkgs.ceph pkgs.util-linux pkgs.lvm2 pkgs.cryptsetup ];
serviceConfig = {
# Type is a [Service] option, so it belongs here and not in unitConfig
Type = "oneshot";
KillMode = "none";
Environment = "CEPH_VOLUME_TIMEOUT=10000";
ExecStart = "/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT ${pkgs.ceph}/bin/ceph-volume lvm activate --all --no-systemd'";
TimeoutSec = "0";
};
wantedBy = [ "multi-user.target" ];
};
}

36
m/module/ceph.nix Normal file

@ -0,0 +1,36 @@
{ config, pkgs, ... }:
# Mounts the /ceph filesystem at boot
{
environment.systemPackages = with pkgs; [
ceph-client
fio # For benchmarks
];
# We need the ceph module loaded as the mount.ceph binary fails to run the
# modprobe command.
boot.kernelModules = [ "ceph" ];
age.secrets.cephUser.file = ../../secrets/ceph-user.age;
fileSystems."/ceph-slow" = {
fsType = "ceph";
device = "user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/";
options = [
"mon_addr=10.0.40.40"
"secretfile=${config.age.secrets.cephUser.path}"
];
};
services.cachefilesd.enable = true;
fileSystems."/ceph" = {
fsType = "ceph";
device = "user@9c8d06e0-485f-4aaf-b16b-06d6daf1232b.cephfs=/";
options = [
"fsc"
"mon_addr=10.0.40.40"
"secretfile=${config.age.secrets.cephUser.path}"
];
};
}

3
m/module/debuginfod.nix Normal file
View File

@ -0,0 +1,3 @@
{
services.nixseparatedebuginfod.enable = true;
}

3
m/module/emulation.nix Normal file
View File

@ -0,0 +1,3 @@
{
boot.binfmt.emulatedSystems = [ "armv7l-linux" "aarch64-linux" "powerpc64le-linux" "riscv64-linux" ];
}

24
m/module/jungle-users.nix Normal file
View File

@ -0,0 +1,24 @@
{ config, lib, ... }:
with lib;
{
options = {
users.jungleUsers = mkOption {
type = types.attrsOf (types.anything // { check = (x: x ? "hosts"); });
description = ''
Same as users.users but with the extra `hosts` attribute, which controls
access to the nodes by `networking.hostName`.
'';
};
};
config = let
allowedUser = host: userConf: builtins.elem host userConf.hosts;
filterUsers = host: users: filterAttrs (n: v: allowedUser host v) users;
removeHosts = users: mapAttrs (n: v: builtins.removeAttrs v [ "hosts" ]) users;
currentHost = config.networking.hostName;
in {
users.users = removeHosts (filterUsers currentHost config.users.jungleUsers);
};
}

25
m/module/monitoring.nix Normal file
View File

@ -0,0 +1,25 @@
{ config, lib, ... }:
{
# We need access to the devices to monitor the disk space
systemd.services.prometheus-node-exporter.serviceConfig.PrivateDevices = lib.mkForce false;
systemd.services.prometheus-node-exporter.serviceConfig.ProtectHome = lib.mkForce "read-only";
# Required to allow the smartctl exporter to read the nvme0 character device,
# see the commit message on:
# https://github.com/NixOS/nixpkgs/commit/12c26aca1fd55ab99f831bedc865a626eee39f80
services.udev.extraRules = ''
SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", GROUP="disk"
'';
services.prometheus = {
exporters = {
node = {
enable = true;
enabledCollectors = [ "systemd" ];
port = 9002;
};
smartctl.enable = true;
};
};
}

107
m/module/slurm-client.nix Normal file
View File

@ -0,0 +1,107 @@
{ config, pkgs, lib, ... }:
let
suspendProgram = pkgs.writeScript "suspend.sh" ''
#!/usr/bin/env bash
exec 1>>/var/log/power_save.log 2>>/var/log/power_save.log
set -x
export "PATH=/run/current-system/sw/bin:$PATH"
echo "$(date) Suspend invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames $1)
for host in $hosts; do
echo Shutting down host: $host
ipmitool -I lanplus -H ''${host}-ipmi -P "" -U "" chassis power off
done
'';
resumeProgram = pkgs.writeScript "resume.sh" ''
#!/usr/bin/env bash
exec 1>>/var/log/power_save.log 2>>/var/log/power_save.log
set -x
export "PATH=/run/current-system/sw/bin:$PATH"
echo "$(date) Resume invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames $1)
for host in $hosts; do
echo Starting host: $host
ipmitool -I lanplus -H ''${host}-ipmi -P "" -U "" chassis power on
done
'';
in {
systemd.services.slurmd.serviceConfig = {
# Kill all processes in the control group on stop/restart. This will kill
# all the jobs running, so ensure that we only upgrade when the nodes are
# not in use. See:
# https://github.com/NixOS/nixpkgs/commit/ae93ed0f0d4e7be0a286d1fca86446318c0c6ffb
# https://bugs.schedmd.com/show_bug.cgi?id=2095#c24
KillMode = lib.mkForce "control-group";
};
services.slurm = {
client.enable = true;
controlMachine = "hut";
clusterName = "jungle";
nodeName = [
"owl[1,2] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 Feature=owl"
"hut Sockets=2 CoresPerSocket=14 ThreadsPerCore=2"
];
partitionName = [
"owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
"all Nodes=owl[1-2],hut Default=NO DefaultTime=01:00:00 MaxTime=INFINITE State=UP"
];
# See slurm.conf(5) for more details about these options.
extraConfig = ''
# Use PMIx for MPI by default. It works okay with MPICH and OpenMPI, but
# not with Intel MPI. For that use the compatibility shim libpmi.so
# setting I_MPI_PMI_LIBRARY=$pmix/lib/libpmi.so while maintaining the PMIx
# library in SLURM (--mpi=pmix). See more details here:
# https://pm.bsc.es/gitlab/rarias/jungle/-/issues/16
MpiDefault=pmix
# When a node reboots return that node to the slurm queue as soon as it
# becomes operative again.
ReturnToService=2
# Track all processes by using a cgroup
ProctrackType=proctrack/cgroup
# Enable task/affinity to allow the jobs to run in a specified subset of
# the resources. Use the task/cgroup plugin to enable process containment.
TaskPlugin=task/affinity,task/cgroup
# Power off unused nodes until they are requested
SuspendProgram=${suspendProgram}
SuspendTimeout=60
ResumeProgram=${resumeProgram}
ResumeTimeout=300
SuspendExcNodes=hut
# Turn the nodes off after 1 hour of inactivity
SuspendTime=3600
# Reduce port range so we can allow only this range in the firewall
SrunPortRange=60000-61000
# Use cores as consumable resources. In SLURM terms, a core may have
# multiple hardware threads (or CPUs).
SelectType=select/cons_tres
# Ignore memory constraints and only use unused cores to share a node with
# other jobs.
SelectTypeParameters=CR_Core
'';
};
age.secrets.mungeKey = {
file = ../../secrets/munge-key.age;
owner = "munge";
group = "munge";
};
services.munge = {
enable = true;
password = config.age.secrets.mungeKey.path;
};
}

View File

@ -0,0 +1,28 @@
{ config, lib, pkgs, ... }:
# See also: https://github.com/NixOS/nixpkgs/pull/112010
# And: https://github.com/NixOS/nixpkgs/pull/115839
with lib;
{
systemd.services."prometheus-slurm-exporter" = {
wantedBy = [ "multi-user.target" ];
after = [ "network.target" ];
serviceConfig = {
Restart = mkDefault "always";
PrivateTmp = mkDefault true;
WorkingDirectory = mkDefault "/tmp";
DynamicUser = mkDefault true;
ExecStart = ''
${pkgs.prometheus-slurm-exporter}/bin/prometheus-slurm-exporter --listen-address "127.0.0.1:9341"
'';
Environment = [
"PATH=${pkgs.slurm}/bin"
# We need to specify the slurm config to be able to talk to the slurmd
# daemon.
"SLURM_CONF=${config.services.slurm.etcSlurm}/slurm.conf"
];
};
};
}

View File

@ -0,0 +1,8 @@
{ ... }:
{
networking.firewall = {
# Required for PMIx in SLURM, we should find a better way
allowedTCPPortRanges = [ { from=1024; to=65535; } ];
};
}

View File

@ -0,0 +1,19 @@
{ ... }:
{
# Mount the hut nix store via NFS
fileSystems."/mnt/hut-nix-store" = {
device = "hut:/nix/store";
fsType = "nfs";
options = [ "ro" ];
};
systemd.services.slurmd.serviceConfig = {
# When running a job, bind the hut store in /nix/store so the paths are
# available too.
# FIXME: This doesn't keep the programs in /run/current-system/sw/bin
# available in the store. Ideally they should be merged but the overlay FS
# doesn't work when the underlying directories change.
BindReadOnlyPaths = "/mnt/hut-nix-store:/nix/store";
};
}

27
m/owl1/configuration.nix Normal file
View File

@ -0,0 +1,27 @@
{ config, pkgs, ... }:
{
imports = [
../common/xeon.nix
../module/ceph.nix
../module/emulation.nix
../module/slurm-client.nix
../module/slurm-firewall.nix
../module/debuginfod.nix
];
# Select the disk using the ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d53566c";
networking = {
hostName = "owl1";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.1";
prefixLength = 24;
} ];
interfaces.ibp5s0.ipv4.addresses = [ {
address = "10.0.42.1";
prefixLength = 24;
} ];
};
}

28
m/owl2/configuration.nix Normal file
View File

@ -0,0 +1,28 @@
{ config, pkgs, ... }:
{
imports = [
../common/xeon.nix
../module/ceph.nix
../module/emulation.nix
../module/slurm-client.nix
../module/slurm-firewall.nix
../module/debuginfod.nix
];
# Select the disk using the ID to avoid mismatches
boot.loader.grub.device = "/dev/disk/by-id/wwn-0x55cd2e414d535629";
networking = {
hostName = "owl2";
interfaces.eno1.ipv4.addresses = [ {
address = "10.0.40.2";
prefixLength = 24;
} ];
# Watch out! The OmniPath device is not in the same place here:
interfaces.ibp129s0.ipv4.addresses = [ {
address = "10.0.42.2";
prefixLength = 24;
} ];
};
}

View File

@ -0,0 +1,64 @@
{ config, pkgs, lib, modulesPath, ... }:
{
imports = [
../common/base.nix
];
# Don't install Grub on the disk yet
boot.loader.grub.device = "nodev";
# Enable serial console
boot.kernelParams = [
"console=tty1"
"console=ttyS1,115200"
];
networking = {
hostName = "raccoon";
# Only BSC DNSs seem to be reachable from the office VLAN
nameservers = [ "84.88.52.35" "84.88.52.36" ];
defaultGateway = "84.88.51.129";
interfaces.eno0.ipv4.addresses = [ {
address = "84.88.51.152";
prefixLength = 25;
} ];
};
# Configure Nvidia driver to use with CUDA
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
hardware.graphics.enable = true;
nixpkgs.config.allowUnfree = true;
nixpkgs.config.nvidia.acceptLicense = true;
services.xserver.videoDrivers = [ "nvidia" ];
users.motd = ''
DO YOU BRING FEEDS?
'';
}

1
nixos-config.nix Normal file
View File

@ -0,0 +1 @@
(builtins.getFlake (toString ./.)).nixosConfigurations

View File

@ -1,8 +0,0 @@
self: super:
with super.lib;
let
# Load the system config and get the `nixpkgs.overlays` option
overlays = (import <nixpkgs/nixos> { }).config.nixpkgs.overlays;
in
# Apply all overlays to the input of the current "main" overlay
foldl' (flip extends) (_: super) overlays self

View File

@ -0,0 +1,36 @@
diff --git a/src/util/mpir_hwtopo.c b/src/util/mpir_hwtopo.c
index 33e88bc..ee3641c 100644
--- a/src/util/mpir_hwtopo.c
+++ b/src/util/mpir_hwtopo.c
@@ -200,18 +200,6 @@ int MPII_hwtopo_init(void)
#ifdef HAVE_HWLOC
bindset = hwloc_bitmap_alloc();
hwloc_topology_init(&hwloc_topology);
- char *xmlfile = MPIR_pmi_get_jobattr("PMI_hwloc_xmlfile");
- if (xmlfile != NULL) {
- int rc;
- rc = hwloc_topology_set_xml(hwloc_topology, xmlfile);
- if (rc == 0) {
- /* To have hwloc still actually call OS-specific hooks, the
- * HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM has to be set to assert that the loaded
- * file is really the underlying system. */
- hwloc_topology_set_flags(hwloc_topology, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
- }
- MPL_free(xmlfile);
- }
hwloc_topology_set_io_types_filter(hwloc_topology, HWLOC_TYPE_FILTER_KEEP_ALL);
if (!hwloc_topology_load(hwloc_topology))
--- a/src/mpi/init/local_proc_attrs.c
+++ b/src/mpi/init/local_proc_attrs.c
@@ -79,10 +79,6 @@ int MPII_init_local_proc_attrs(int *p_thread_required)
/* Set the number of tag bits. The device may override this value. */
MPIR_Process.tag_bits = MPIR_TAG_BITS_DEFAULT;
- char *requested_kinds = MPIR_pmi_get_jobattr("PMI_mpi_memory_alloc_kinds");
- MPIR_get_supported_memory_kinds(requested_kinds, &MPIR_Process.memory_alloc_kinds);
- MPL_free(requested_kinds);
-
return mpi_errno;
}

45
pkgs/overlay.nix Normal file
View File

@ -0,0 +1,45 @@
final: prev:
{
# Set MPICH as default
mpi = final.mpich;
# Configure the network for MPICH
mpich = with final; let
# pmix comes with the libraries in .out and headers in .dev
pmixAll = symlinkJoin {
name = "pmix-all";
paths = [ pmix.dev pmix.out ];
};
in prev.mpich.overrideAttrs (old: {
patches = [
# See https://github.com/pmodels/mpich/issues/6946
./mpich-fix-hwtopo.patch
];
buildInputs = old.buildInputs ++ [
libfabric
pmixAll
];
configureFlags = [
"--enable-shared"
"--enable-sharedlib"
"--with-pm=no"
"--with-device=ch4:ofi"
"--with-pmi=pmix"
"--with-pmix=${pmixAll}"
"--with-libfabric=${libfabric}"
"--enable-g=log"
] ++ lib.optionals (lib.versionAtLeast gfortran.version "10") [
"FFLAGS=-fallow-argument-mismatch" # https://github.com/pmodels/mpich/issues/4300
"FCFLAGS=-fallow-argument-mismatch"
];
});
slurm = prev.slurm.overrideAttrs (old: {
patches = (old.patches or []) ++ [
# See https://bugs.schedmd.com/show_bug.cgi?id=19324
./slurm-rank-expansion.patch
];
});
prometheus-slurm-exporter = prev.callPackage ./slurm-exporter.nix { };
}

22
pkgs/slurm-exporter.nix Normal file
View File

@ -0,0 +1,22 @@
{ buildGoModule, fetchFromGitHub, lib }:
buildGoModule rec {
pname = "prometheus-slurm-exporter";
version = "0.20";
src = fetchFromGitHub {
rev = version;
owner = "vpenso";
repo = pname;
sha256 = "sha256-KS9LoDuLQFq3KoKpHd8vg1jw20YCNRJNJrnBnu5vxvs=";
};
vendorHash = "sha256-A1dd9T9SIEHDCiVT2UwV6T02BSLh9ej6LC/2l54hgwI=";
doCheck = false;
meta = with lib; {
description = "Prometheus SLURM Exporter";
homepage = "https://github.com/vpenso/prometheus-slurm-exporter";
platforms = platforms.linux;
};
}

View File

@ -0,0 +1,11 @@
--- a/src/plugins/mpi/pmix/pmixp_dmdx.c 2024-03-15 13:05:24.815313882 +0100
+++ b/src/plugins/mpi/pmix/pmixp_dmdx.c 2024-03-15 13:09:53.936900823 +0100
@@ -314,7 +314,7 @@ static void _dmdx_req(buf_t *buf, int no
}
nsptr = pmixp_nspaces_local();
- if (nsptr->ntasks <= rank) {
+ if ((long) nsptr->ntasks <= (long) rank) {
char *nodename = pmixp_info_job_host(nodeid);
PMIXP_ERROR("Bad request from %s: nspace \"%s\" has only %d ranks, asked for %d",
nodename, ns, nsptr->ntasks, rank);

16
rebuild.sh Executable file
View File

@ -0,0 +1,16 @@
#!/bin/sh -ex
if [ "$(id -u)" != 0 ]; then
echo "Needs root permissions"
exit 1
fi
if [ "$(hostname)" != "hut" ]; then
>&2 echo "must run from machine hut, not $(hostname)"
exit 1
fi
# Update all nodes
nixos-rebuild switch --flake .
nixos-rebuild switch --flake .#owl1 --target-host owl1
nixos-rebuild switch --flake .#owl2 --target-host owl2

21
secrets/ceph-user.age Normal file
View File

@ -0,0 +1,21 @@
age-encryption.org/v1
-> ssh-ed25519 AY8zKw J00a6ZOhkupkhLU5WQ0kD05HEF4KKsSs2hwjHKbnnHU
J14VoNOCqLpScVO7OLXbqTcLI4tcVUHt5cqY/XQmbGs
-> ssh-ed25519 sgAamA k8R/bSUdvVmlBI6yHPi5NBQPBGM36lPJwsir8DFGgxE
4ZKC3gYvic6AVrNGgNjwztbUzhxP8ViX5O3wFo9wlrk
-> ssh-ed25519 HY2yRg 966xf2fTnA6Wq0uYXbXZQOManqITJcCbQS9LZCGEOh4
Qg5echQSrzqeDqvaMx+5fqi8XyTjAeCsY/UFJX6YnDs
-> ssh-ed25519 tcumPQ e0U2okrGIoUpLfPYjIRx1V92rE3hZW13nJef+l3kBQg
LejAUKBl+tPhwocCF00ZHTzFISnwX8og8GvemiMIcyo
-> ssh-ed25519 JJ1LWg QkzTsPq9Gdh+FNz/a4bDb9LQOreFyxeTC51UNd1fsj0
ayrlKenETfQzH1Z9drVEWqszQebicGVJve0/pCnxAE8
-> ssh-ed25519 CAWG4Q lJLW9+dxvyoD4hYzeXeE/4rzJ6HIeEQOB1+fbhV3xw0
T2RrVCtTuQvya9HiJB7txk3QGrntpsMX9Tt1cyXoW5E
-> ssh-ed25519 MSF3dg JOZkFb2CfqWKvZIz7lYxXWgv8iEVDkQF8hInDMZvknc
MHDWxjUw4dNiC1h4MrU9uKKcI3rwkxABm0+5FYMZkok
-> ~8m;7f-grease
lDIullfC98RhpTZ4Mk87Td+VtPmwPdgz+iIilpKugUkmV5r4Uqd7yE+5ArA6ekr/
G/X4EA
--- Cz4sv9ZunBcVdZCozdTh1zlg1zIASjk2MjYeYfcN9eA
ÊN Å$[H˜ÝQËéŠ
d£š·'­±ö7…·Í²)ÖØÀÊx9yüÐëE¡þÓM7^Ø[ÐMŽ+É&éâö½$8tM¨Ð²

View File

@ -0,0 +1,9 @@
age-encryption.org/v1
-> ssh-ed25519 HY2yRg DQdgCk16Yu524BsrWVf0krnwWzDM6SeaJCgQipOfwCA
Ab9ocqra/UWJZI+QGMlxUhBu5AzqfjPgXl+ENIiHYGs
-> ssh-ed25519 CAWG4Q KF9rGCenb3nf+wyz2hyVs/EUEbsmUs5R+1fBxlCibC8
7++Kxbr3FHVdVfnFdHYdAuR0Tgfd+sRcO6WRss6LhEw
-> ssh-ed25519 MSF3dg aUe4DhRsu4X8CFOEAnD/XM/o/0qHYSB522woCaAVh0I
GRcs5cm2YqA/lGhUtbpboBaz7mfgiLaCr+agaB7vACU
--- 9Q7Ou+Pxq+3RZilCb2dKC/pCFjZEt4rp5KnTUUU7WJ8
1¬Mw4Í ì:Hµ@Á/ägLtMÇ,߯¥ô*¡žzñNV5ˆmÍNŽoÞáj1 $÷TøG_³E{Œ%“‰1ǯ<>îAÛp™

View File

@ -0,0 +1,9 @@
age-encryption.org/v1
-> ssh-ed25519 HY2yRg WvKK6U1wQtx2pbUDfuaUIXTQiCulDkz7hgUCSwMfMzQ
jLktUMqKuVxukqzz++pHOKvmucUQqeKYy5IwBma7KxY
-> ssh-ed25519 CAWG4Q XKGuNNoYFl9bdZzsqYYTY7GsEt5sypLW4R+1uk78NmU
8dIA2GzRAwTGM5CDHSM2BUBsbXzEAUssWUz2PY2PaTg
-> ssh-ed25519 MSF3dg T630RsKuZIF/bp+KITnIIWWHsg6M/VQGqbWQZxqT+AA
SraZcgZJVtmUzHF/XR9J7aK5t5EDNpkC/av/WJUT/G8
--- /12G8pj9sbs591OM/ryhoLnSWWmzYcoqprk9uN/3g18
ä·ù¼Â‡%å]yi"ô<>»LÓ âùH`ªa$Æþ)¦9ve<76>.0úmÉK<EFBFBD>vƒÀ ïu"|1cÞ-%ÔÕ"åWFï¡ÞA«<41>hº$•ºj<eñ¶xÅLx«ç.?œÈâ:L…¬ƒ,ëu»|³F|Õi²äÔ

Binary file not shown.

View File

@ -0,0 +1,10 @@
age-encryption.org/v1
-> ssh-ed25519 HY2yRg 3L1Y5upc5qN6fgiFAox5rD/W8n0eQUv5mT39QAdO5Ac
XkWsmPmzRgHjsvJgsDKJRgHZ7/sBZFmd1Doppj/y390
-> ssh-ed25519 CAWG4Q v03Qr+fckdIpsxvQG/viKxlF8WNpO4XUe//QcPzH4k0
afUwi3ccDCRfUxPDdF7ZkoL+0UX1XwqVtiyabDWjVQk
-> ssh-ed25519 MSF3dg c2hEUk4LslJpiL7v/4UpT8fK7ZiBJ8+uRhZ/vBoRUDE
YX9EpnJpHo1eDsZtapTVY6jD+81kb588Oik4NoY9jro
--- LhUkopNtCsyHCLzEYzBFs+vekOkAR4B3VBaiMF/ZF8w
<EFBFBD>×à»ÂCßHãáàùýy—LØ”ItMèÕåµI×±sMÆ\Í1-±K”ˆ¤‰G:õ™<02>¦
ÝgáºÙbpF¼Ó¶Í%Y·

BIN
secrets/munge-key.age Normal file

Binary file not shown.

12
secrets/nix-serve.age Normal file
View File

@ -0,0 +1,12 @@
age-encryption.org/v1
-> ssh-ed25519 HY2yRg d144D+VvxhYgKtH//uD2qNuVnYX6bh74YqkyM3ZjBwU
0IeVmFAf4U8Sm0d01O6ZwJ1V2jl/mSMl4wF0MP5LrIg
-> ssh-ed25519 CAWG4Q H4nKxue/Cj/3KUF5A+/ygHMjjArwgx3SIWwXcqFtyUo
4k5NJkLUrueLYiPkr2LAwQLWmuaOIsDmV/86ravpleU
-> ssh-ed25519 MSF3dg HpgUAFHLPs4w0cdJHqTwf8lySkTeV9O9NnBf49ClDHs
foPIUUgAYe1YSDy6+aMfjN7xv9xud9fDmhRlIztHoEo
-> vLkF\<-grease
3GRT+W8gYSpjl/a6Ix9+g9UJnTpl1ZH/oucfR801vfE8y77DV2Jxz/XJwzxYxKG5
YEhiTGMNbXw/V7E5aVSz6Bdc
--- GtiHKCZdHByq9j0BSLd544PhbEwTN138E8TFdxipeiA
¥¿£„ÝG$Sº¼ƒRAæÀ¾Th]nÄ8<C384>,ùHœsÈïÚ=p¼™Ù'»<>ô+ôjõÓõŒ9±)ñ:”)¸œYâþÑ8³IØõ8:ol<6F>ë<1F>åÃZÐæ3PM”F;ÊrYõ“ÞÛ<1F>­y¸LâÙœ¦ÎœàÕUús16Ǿ¡LŒb÷¨²

17
secrets/secrets.nix Normal file
View File

@ -0,0 +1,17 @@
let
keys = import ../keys.nix;
adminsKeys = builtins.attrValues keys.admins;
hut = [ keys.hosts.hut ] ++ adminsKeys;
# Only expose ceph keys to safe nodes and admins
safe = keys.hostGroup.safe ++ adminsKeys;
in
{
"gitea-runner-token.age".publicKeys = hut;
"gitlab-runner-docker-token.age".publicKeys = hut;
"gitlab-runner-shell-token.age".publicKeys = hut;
"nix-serve.age".publicKeys = hut;
"jungle-robot-password.age".publicKeys = hut;
"ceph-user.age".publicKeys = safe;
"munge-key.age".publicKeys = safe;
}

1
web/.gitignore vendored Normal file
View File

@ -0,0 +1 @@
/public

View File

@ -0,0 +1,6 @@
---
title: "{{ replace .Name "-" " " | title }}"
date: {{ .Date }}
draft: true
---

25
web/content/_index.md Normal file
View File

@ -0,0 +1,25 @@
![Rainforest](jungle.jpg)
Welcome to the jungle, a set of machines with no imposed rules that are fully
controlled and maintained by their users.
The configuration of all the machines is written in a centralized [git
repository][config] using the Nix language for NixOS. Changes in the
configuration of the machines are introduced by merge requests and pass a review
step before being deployed.
[config]: https://pm.bsc.es/gitlab/rarias/jungle
The machines have access to the large list of packages available in
[Nixpkgs][nixpkgs] and a custom set of packages named [bscpkgs][bscpkgs],
specifically tailored to our needs for HPC machines. Users can install their own
packages and make them available system-wide by opening a merge request.
[nixpkgs]: https://github.com/NixOS/nixpkgs
[bscpkgs]: https://pm.bsc.es/gitlab/rarias/bscpkgs
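As a rough illustration, a merge request that makes a package available
system-wide could boil down to a change like the one below. This is only a
sketch: `htop` is an arbitrary example package, and which module the change
belongs in is an assumption that depends on the machines that should get it.

```nix
{ pkgs, ... }:
{
  # Hypothetical change: install htop on every machine that imports
  # this module. The package name is only an example.
  environment.systemPackages = with pkgs; [
    htop
  ];
}
```

The change only takes effect once the merge request has been reviewed and the
configuration has been deployed to the machines.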
We have put a lot of effort into guaranteeing very good reproducibility properties in
the configuration of the machines and the software they use.
To enter the jungle machines, follow the [instructions](access) to submit a
request.

BIN
web/content/access/cave.jpg Normal file

Binary file not shown.

Size: 470 KiB

View File

@ -0,0 +1,22 @@
---
title: "Enter the jungle"
description: "Request access to the machines"
---
![Cave](./cave.jpg)
Before requesting access to the jungle machines, you must be able to access the
`ssfhead.bsc.es` node (only available via the intranet or VPN). You can request
access to the login machine using a resource petition in the BSC intranet.
Then, to request access to the machines we will need some information about you:
1. Which machines you want access to (hut, owl1, owl2, eudy, koro...)
1. Your user name and user id (to match the NFS permissions)
1. Your real name and surname (for identification purposes)
1. The salted hash of your login password, generated with `mkpasswd -m sha-512`
1. An SSH public key of type Ed25519 (can be generated with `ssh-keygen -t ed25519`)
Send an email to <jungle@bsc.es> with the details, or directly open a
merge request in the [jungle
repository](https://pm.bsc.es/gitlab/rarias/jungle/).
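If you open the merge request yourself, the details listed above would
presumably end up in an entry of the `users.jungleUsers` option, which accepts
the same attributes as `users.users` plus a `hosts` list. The following is a
minimal sketch with placeholder values only: the user name, uid, hash and key
are all hypothetical, and the file where the entry belongs is not shown here.

```nix
{
  # Hypothetical entry: every value below is a placeholder.
  users.jungleUsers.alice = {
    uid = 12345;                      # must match your NFS user id
    isNormalUser = true;
    description = "Alice Example";    # real name and surname
    hosts = [ "hut" "owl1" "owl2" ];  # machines you want access to
    hashedPassword = "$6$...";        # output of `mkpasswd -m sha-512`
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAA... alice@example"  # your Ed25519 public key
    ];
  };
}
```

The account is only created on the machines whose `networking.hostName`
appears in `hosts`.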

View File

@ -0,0 +1,10 @@
---
title: "Eudy"
description: "Linux kernel experiments"
---
[![Eudy](eudy.jpg)](https://commons.wikimedia.org/w/index.php?curid=5817408)
The *eudy* machine is intended as a playground for Linux kernel experiments. The
name is short for Eudyptula, the genus of little penguins found in New
Zealand and Australia.

BIN
web/content/eudy/eudy.jpg Normal file

Binary file not shown.

Size: 210 KiB

View File

@ -0,0 +1,6 @@
---
title: "Git"
description: "Gitea instance"
---
If you are reading this page, the proxy to the Gitea service is not working.

View File

@ -0,0 +1,6 @@
---
title: "Grafana"
description: "Monitor metrics"
---
If you are reading this page, the proxy to the Grafana service is not working.

18
web/content/hut/_index.md Normal file
View File

@ -0,0 +1,18 @@
---
title: "Hut"
description: "Control node"
date: 2023-06-13T19:36:57+02:00
---
![Hut](hut.jpg)
From the hut we monitor and control other nodes. It consists of one node only,
which is available at `hut` or `xeon07`. It runs the following services:
- Prometheus: to store the monitoring data.
- Grafana: to plot the data in the web browser.
- Slurmctld: to manage the SLURM nodes.
- Gitlab runner: to run CI jobs from Gitlab.
This node is prone to interruptions from all the services it runs, so it is not
a good candidate for low noise executions.

BIN
web/content/hut/hut.jpg Normal file

Binary file not shown.

Size: 178 KiB

View File

@ -0,0 +1,10 @@
---
title: "Lake"
description: "Data storage"
date: 2023-06-13T19:36:57+02:00
draft: true
---
![Lake](lake.jpg)
Data storage

BIN
web/content/lake/lake.jpg Normal file

Binary file not shown.

Size: 144 KiB

View File

@ -0,0 +1,6 @@
---
title: "Lists"
description: "Mailing lists"
---
If you are reading this page, the proxy to the public-inbox service is not working.

18
web/content/owl/_index.md Normal file
View File

@ -0,0 +1,18 @@
---
title: "Owl"
description: "Low system noise"
---
![Owl](owl.jpg)
Much like the silent flight of an owl at night, these nodes are configured to
minimize the system noise and let programs run undisturbed. The nodes are
`owl[1-2]`, and they are available for jobs through SLURM.
The contents of the nix store of the hut node are made available in the owl nodes
when a job is running. This allows jobs to access the same paths that are on hut
to load dependencies.
For now, only the hut node can be used to build new derivations so that they
appear in the compute nodes. This applies to the `nix build`, `nix develop` and
`nix shell` commands.

Some files were not shown because too many files have changed in this diff.