Update to nixpkgs 25.11 (Xantusia) #218

Manually merged
rarias merged 15 commits from upgrade/25.11 into master 2026-01-20 13:51:15 +01:00
Collaborator

Update to nixpkgs 25.11 (Xantusia)

  • [NixOS 25.11 release announcement](https://nixos.org/blog/announcements/2025/nixos-2511/)
  • [NixOS release notes](https://nixos.org/manual/nixos/stable/release-notes.html#sec-release-25.11)
  • [nixpkgs release notes](https://nixos.org/manual/nixpkgs/stable/release-notes#sec-nixpkgs-release-25.11)

Compiler changes:

LLVM has been updated to version 21. GCC remains at version 14. CMake was updated to version 4.

Broken:

  • mercurium (mcxx)
abonerib added 9 commits 2025-12-02 14:14:27 +01:00
Fixes:
```
To build with setuptools as before, set `pyproject = true` and `build-system = [ setuptools ]`.
```
See: https://github.com/NixOS/nixpkgs/pull/437723
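For reference, a minimal sketch of what the migrated packaging looks like, with a hypothetical package (name, version and hash are placeholders):

```nix
# Hypothetical sketch: migrating a python package off the removed
# implicit-setuptools path, per the nixpkgs message above.
{ lib, buildPythonPackage, fetchPypi, setuptools }:

buildPythonPackage rec {
  pname = "example";             # placeholder package name
  version = "1.0.0";             # placeholder version
  src = fetchPypi {
    inherit pname version;
    hash = lib.fakeHash;         # placeholder hash
  };

  pyproject = true;              # build as a PEP 517 project
  build-system = [ setuptools ]; # declare the build backend explicitly
}
```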
The option `systemd.watchdog.runtimeTime' defined in `/nix/store/m7h6slsq394m872xnhxsxqrkhndz1lqs-source/m/common/base/watchdog.nix' has been renamed to `systemd.settings.Manager.RuntimeWatchdogSec'.
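The rename itself is mechanical; a sketch with a placeholder value (the real value lives in watchdog.nix):

```nix
{
  # Before (25.05):
  # systemd.watchdog.runtimeTime = "30s";

  # After (25.11), per the rename message above:
  systemd.settings.Manager.RuntimeWatchdogSec = "30s";
}
```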
Upgrade to nixseparatedebuginfod2
Some checks failed
CI / build:all (pull_request) Failing after 49m58s
CI / build:cross (pull_request) Successful in 49m56s
4545fbf08f
Author
Collaborator

Evaluation warnings when building hut:

```
evaluation warning: linuxPackages.perf is now perf
evaluation warning: Runner registration tokens have been deprecated and disabled by default in GitLab >= 17.0.
                    Consider migrating to runner authentication tokens by setting `services.gitlab-runner.services.gitlab-bsc-docker.authenticationTokenConfigFile`.
                    https://docs.gitlab.com/17.0/ee/ci/runners/new_creation_workflow.html
Done. The new configuration is /nix/store/nffq9ynzlrlx4m7phqgn621dcy3731xm-nixos-system-hut-25.11.20251130.8bb5646
```
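The gitlab-runner warning asks for a one-option migration; a sketch, assuming the token file (path is a placeholder) is provisioned separately:

```nix
{
  services.gitlab-runner.services.gitlab-bsc-docker = {
    # Replaces the deprecated registration token. The file holds the
    # runner authentication token created in the GitLab UI, e.g. a line
    # CI_SERVER_TOKEN=glrt-...
    authenticationTokenConfigFile = "/run/secrets/gitlab-runner-token"; # placeholder path
  };
}
```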
abonerib added 1 commit 2025-12-02 14:40:37 +01:00
linuxPackages.perf is now perf
Some checks failed
CI / build:cross (pull_request) Successful in 23m50s
CI / build:all (pull_request) Failing after 23m55s
4fa4005056
abonerib added this to the 25.11 Release milestone 2025-12-02 14:45:40 +01:00
abonerib added 1 commit 2025-12-02 14:48:42 +01:00
Enable papi when cross-compiling
Some checks failed
CI / build:cross (pull_request) Successful in 15m45s
CI / build:all (pull_request) Failing after 15m47s
111fcc61d8
abonerib force-pushed upgrade/25.11 from 111fcc61d8 to 408b974433 2025-12-02 16:27:29 +01:00 Compare
abonerib force-pushed upgrade/25.11 from 408b974433 to 00a7122768 2025-12-02 17:53:23 +01:00 Compare
abonerib changed title from WIP: nixpkgs 25.11 to Update to nixpkgs 25.11 (Xantusia) 2025-12-02 17:54:17 +01:00
abonerib requested review from rarias 2025-12-02 17:59:00 +01:00
abonerib force-pushed upgrade/25.11 from 00a7122768 to 1d3bda33a0 2025-12-03 10:15:20 +01:00 Compare
Owner

Thanks! Looks good. I would need to upgrade all machines to test it (including Fox due to SLURM), so I would rather do it after Christmas unless we need some fixes before that. We have a custom AMD driver in Fox; could you also build the configuration for Fox to see if it still compiles?

CC: @varcila you were doing some experiments in Fox and this will upgrade the kernel (but not your development shell).

Collaborator

Thanks for the copy. FYI, I have finished most of the batch of jobs I needed to execute this year, so I will most probably not use Fox until the second of January when I come back from holidays. Just to say that I have no preference for when the upgrade is done :)

abonerib added 1 commit 2025-12-10 14:34:58 +01:00
Remove conflicting definitions in amd-uprof-driver
All checks were successful
CI / build:cross (pull_request) Successful in 8s
CI / build:all (pull_request) Successful in 47m43s
ee9af71da0
See: https://lkml.org/lkml/2025/4/9/1709
Author
Collaborator

> We have a custom AMD driver in Fox, could you also build the configuration for Fox to see if it still compiles?

`amd-uprof-driver` is broken: https://jungle.bsc.es/p/abonerib/B8gcl28j.log

It is caused by the definitions of `rdmsrq` and `wrmsrq` in `inc/PwrProfAsm.h`, which now collide with the kernel's own: https://lkml.org/lkml/2025/4/9/1709

Doing a grep on the driver source, it seems they are not used anywhere, and since `amd-uprof` comes from a binary blob, I think it should be safe to remove them? @varcila

I have added a patch to comment them out, and the fox config now builds.

> I would rather do it after Christmas unless we need some fixes before that.

No rush from my side, we can merge it once we come back from vacations.
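A minimal sketch of the same idea as a `postPatch` in the derivation (the sed pattern is hypothetical; the actual change is a committed patch file):

```nix
{
  # Hypothetical sketch: comment out the driver's own rdmsrq/wrmsrq
  # definitions in inc/PwrProfAsm.h so they no longer collide with the
  # ones the kernel now provides.
  postPatch = ''
    sed -i 's,^.*\b\(rdmsrq\|wrmsrq\)\b.*$,// &,' inc/PwrProfAsm.h
  '';
}
```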
Collaborator

> (...) I think it should be safe to remove them? @varcila

We can try, I think it makes sense to remove them.

rarias force-pushed upgrade/25.11 from ee9af71da0 to 14fe50fc2a 2026-01-07 16:48:19 +01:00 Compare
rarias added 1 commit 2026-01-07 17:50:01 +01:00
Fix infiniband interface name
All checks were successful
CI / build:all (pull_request) Successful in 54m23s
CI / build:cross (pull_request) Successful in 1h6m13s
7686a75fd5
Owner

Fixed infiniband name in hut and switched to 25.11. I have also updated the nixpkgs commit so we pick the backported fixes. Everything else seems to be working fine so far.

I will propagate the upgrade to the rest of machines in the following days.

Owner

Upgraded bay and lake2 (ceph storage). After rebooting lake2, three (of four) NVMe disks are missing:

```
lake2% ls /dev/nvme*
/dev/nvme0  /dev/nvme0n1

lake2% sudo dmesg | grep nvme
[    8.123120] nvme nvme0: pci function 0000:81:00.0
[    8.129975] nvme nvme0: 31/0/0 default/read/poll queues
[   16.436669] nvme nvme0: using unchecked data buffer
```

Let's see if rebooting it fixes it.
Owner

They are back:

```
lake2% ls /dev/nvme*
/dev/nvme0  /dev/nvme0n1  /dev/nvme1  /dev/nvme1n1  /dev/nvme2  /dev/nvme2n1  /dev/nvme3  /dev/nvme3n1

lake2% sudo dmesg | grep nvme
[    8.128416] nvme nvme0: pci function 0000:83:00.0
[    8.128615] nvme nvme1: pci function 0000:84:00.0
[    8.128791] nvme nvme2: pci function 0000:85:00.0
[    8.128968] nvme nvme3: pci function 0000:86:00.0
[    8.136478] nvme nvme3: 31/0/0 default/read/poll queues
[    8.136522] nvme nvme2: 31/0/0 default/read/poll queues
[    8.143575] nvme nvme0: 31/0/0 default/read/poll queues
[    8.147813] nvme nvme1: 31/0/0 default/read/poll queues
[   16.434508] nvme nvme3: using unchecked data buffer
```

Something must be going on with the BIOS / BMC boot, as the PCI address has changed for the nvme0 disk. I don't think it is related to the upgrade. Ceph is fine and recovering now:

```
lake2% sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 26m)
    mgr: bay(active, since 26m)
    mds: 1/1 daemons up, 1 standby
    osd: 8 osds: 8 up (since 3m), 8 in (since 3m); 37 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 545 pgs
    objects: 1.25M objects, 1.4 TiB
    usage:   4.3 TiB used, 4.5 TiB / 8.7 TiB avail
    pgs:     111627/3750123 objects misplaced (2.977%)
             516 active+clean
             22  active+remapped+backfill_wait
             7   active+remapped+backfilling

  io:
    recovery: 307 MiB/s, 223 objects/s
```
rarias force-pushed upgrade/25.11 from 7686a75fd5 to 4a6e36c7e9 2026-01-08 15:17:01 +01:00 Compare
Collaborator

@rarias Can we delay the upgrade of fox until the 17th of January? One day after the WAMTA deadline; turns out getting results never ends

Owner

> @rarias Can we delay the upgrade of fox until the 17th of January? One day after the WAMTA deadline; turns out getting results never ends

Sure, I will leave apex, fox, owl1 and owl2 as-is until after the 17th, as they all need SLURM to be upgraded at the same time.

Raccoon and tent (including this Gitea service) have just been upgraded; I haven't seen anything broken yet.

rarias added 1 commit 2026-01-08 17:44:00 +01:00
Remove unneeded perf package from eudy
All checks were successful
CI / build:cross (pull_request) Successful in 8s
CI / build:all (pull_request) Successful in 16s
d0e944d05c
It is already included in the base list of packages, which is now only
"perf" and doesn't depend on the kernel version.
rarias added 1 commit 2026-01-09 18:04:10 +01:00
Fix gitea user to allow sending email
All checks were successful
CI / build:cross (pull_request) Successful in 8s
CI / build:all (pull_request) Successful in 16s
fcfee6c674
In order to send email, the gitea user needs to be in the mail-robot
group.

Fixes: #220
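A sketch of the change, assuming the `mail-robot` group already exists in the config:

```nix
{
  # Add the gitea system user to the group that is allowed to send
  # mail, so Gitea can deliver email again.
  users.users.gitea.extraGroups = [ "mail-robot" ];
}
```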
Owner

Fox, owl1, owl2 and apex upgraded, no problems so far.

rarias force-pushed upgrade/25.11 from fcfee6c674 to 2577f6344b 2026-01-20 11:49:43 +01:00 Compare
rarias approved these changes 2026-01-20 12:31:32 +01:00
rarias force-pushed upgrade/25.11 from 2577f6344b to dda6a66782 2026-01-20 13:48:40 +01:00 Compare
rarias manually merged commit dda6a66782 into master 2026-01-20 13:51:15 +01:00
Reference: rarias/jungle#218