Failed to umount home when shutting down owl nodes #40

Open
opened 2023-09-18 18:33:24 +02:00 by rarias · 2 comments
rarias commented 2023-09-18 18:33:24 +02:00 (Migrated from pm.bsc.es)

After adding the shared Nix store, the nodes have problems shutting down:

<<< Welcome to NixOS 23.11.20230902.e569908 (x86_64) - ttyS0 >>>

Run 'nixos-help' for the NixOS manual.

owl1 login: [  OK  ] Finished Address configuration of eno1.
[  OK  ] Found device Omni-Path HFI Silicon 100 Series [discrete].
         Starting Address configuration of ibp5s0...
[  OK  ] Finished Address configuration of ibp5s0.
         Starting Networking Setup...
[  OK  ] Stopped target Host and Network Name Lookups.
         Stopping Host and Network Name Lookups...
[  OK  ] Stopped target User and Group Name Lookups.
         Stopping User and Group Name Lookups...
         Stopping Name Service Cache Daemon (nsncd)...
[  OK  ] Stopped Name Service Cache Daemon (nsncd).
         Starting Name Service Cache Daemon (nsncd)...
[  OK  ] Finished Networking Setup.
         Starting Extra networking commands....
[  OK  ] Started Name Service Cache Daemon (nsncd).
[  OK  ] Reached target Host and Network Name Lookups.
[  OK  ] Reached target User and Group Name Lookups.
[  OK  ] Finished Extra networking commands..
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
         Mounting /ceph...
         Mounting /home...
         Mounting /mnt/hut-nix-store...
         Starting munged.service...
         Starting Notify NFS peers of a restart...
         Starting SSH Daemon...
[  OK  ] Started Notify NFS peers of a restart.
[  OK  ] Started munged.service.
[  OK  ] Started SSH Daemon.
         Starting NFS status monitor for NFSv2/3 locking....
[  OK  ] Started NFS status monitor for NFSv2/3 locking..
[  OK  ] Mounted /ceph.
[  OK  ] Mounted /home.
[  OK  ] Mounted /mnt/hut-nix-store.
         Mounting /nix/store...
[  OK  ] Mounted /nix/store.
[  OK  ] Reached target Remote File Systems.
         Starting slurmd.service...
         Starting Permit User Sessions...
[  OK  ] Finished Permit User Sessions.
[  OK  ] Started Getty on tty1.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started slurmstepd.scope.
[  OK  ] Started slurmd.service.
[  OK  ] Reached target Multi-User System.
         Stopping Session 3 of User rarias...
         Stopping slurmstepd.scope...
[  OK  ] Removed slice Slice /system/modprobe.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped target Login Prompts.
[  OK  ] Stopped target Containers.
[  OK  ] Stopped target rpc_pipefs.target.
[  OK  ] Stopped target RPC Port Mapper.
[  OK  ] Stopped target Timer Units.
[  OK  ] Stopped Discard unused filesystem blocks once a week.
[  OK  ] Stopped logrotate.timer.
[  OK  ] Stopped nix-gc.timer.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Stop         Stopping serial-getty@ttyS0.service...
         Stopping slurmd.service...
         Stopping SSH Daemon...
         Stopping Load/Save OS Random Seed...
[  OK  ] Stopped serial-getty@ttyS0.service.
[  OK  ] Stopped NTP Daemon.
[  OK  ] Stopped SSH Daemon.
[  OK  ] Stopped munged.service.
[  OK  ] Stopped NFS status monitor for NFSv2/3 locking..
[  OK  ] Stopped Getty on tty1.
[  OK  ] Stopped slurmd.service.
[  OK  ] Mounted /run/initramfs.
[  OK  ] Unmounted RPC Pipe File System.
[  OK  ] Stopped Kernel Auditing.
[  OK  ] Stopped Load/Save OS Random Seed.
[  OK  ] Stopped Session 3 of User rarias.
[  OK  ] Stopped slurmstepd.scope.
[  OK  ] Removed slice Slice /system/getty.
[  OK  ] Removed slice Slice /system/serial-getty.
[  OK  ] Stopped target Host and Network Name Lookups.
         Starting Generate shutdown ramfs...
         Stopping User Login Management...
         Stopping Permit User Sessions...
         Stopping User Manager for UID 1880...
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped User Manager for UID 1880.
[  OK  ] Stopped target Remote File Systems.
         Unmounting /ceph...
         Unmounting /home...
         Unmounting /nix/store...
         Stopping Userspace Out-Of-Memory (OOM) Killer...
         Stopping User Runtime Directory /run/user/1880...
[  OK  ] Unmounted /run/user/1880.
[  OK  ] Stopped Userspace Out-Of-Memory (OOM) Killer.
[  OK  ] Finished Save Hardware Clock.
[  OK  ] Finished Generate shutdown ramfs.
[  OK  ] Unmounted /ceph.
[FAILED] Failed unmounting /nix/store.
[  OK  ] Stopped User Runtime Directory /run/user/1880.
[  OK  ] Removed slice Slice /user/1880.
         Unmounting /mnt/hut-nix-store...
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped User Login Management.
[  OK  ] Unmounted /mnt/hut-nix-store.
[  OK  ] Stopped target User and Group Name Lookups.
         Stopping Name Service Cache Daemon (nsncd)...
[  OK  ] Stopped Name Service Cache Daemon (nsncd).
[  OK  ] Unmounted /home.
[  OK  ] Stopped target Network is Online.
[  OK  ] Stopped target Network.
[  OK  ] Stopped target All Network Interfaces (deprecated).
[  OK  ] Stopped target Preparation for Remote File Systems.
[  OK  ] Stopped target NFS client services.
[  OK  ] Stopped Extra networking commands..
[  OK  ] Stopped Networking Setup.
         Stopping Address configuration of eno1...
         Stopping Address configuration of ibp5s0...
[  OK  ] Stopped Address configuration of eno1.
[  OK  ] Stopped Address configuration of ibp5s0.
[  OK  ] Stopped target Preparation for Network.
[  OK  ] Stopped resolvconf update.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Path Units.
[  OK  ] Stopped target Slice Units.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Stopped target Socket Units.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Closed Nix Daemon Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped target Local Encrypted Volumes.
[  OK  ] Stopped Dispatch Password …ts to Console Directory Watch.
[  OK  ] Stopped Forward Password R…uests to Wall Directory Watch.
[  OK  ] Closed Userspace Out-Of-Memory (OOM) Killer Socket.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Closed Process Core Dump Socket.
[  OK  ] Stopped Load Kernel Modules.
         Stopping Record System Boot/Shutdown in UTMP...
[  OK  ] Unmounted /run/credentials/systemd-sysctl.service.
[  OK  ] Stopped Record System Boot/Shutdown in UTMP.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
         Unmounting /run/agenix.d...
         Unmounting /run/credential…temd-tmpfiles-setup.service...
         Unmounting /run/credential…-tmpfiles-setup-dev.service...
         Unmounting /run/keys...
         Unmounting /run/wrappers...
[FAILED] Failed unmounting /run/agenix.d.
[FAILED] Failed unmounting /run/cre…md-tmpfiles-setup-dev.service.
[  OK  ] Unmounted /run/credentials…ystemd-tmpfiles-setup.service.
[FAILED] Failed unmounting /run/keys.
[FAILED] Failed unmounting /run/wrappers.
[  OK  ] Stopped target Preparation for Local File Systems.
[  OK  ] Stopped target Swaps.
         Deactivating swap /dev/disk/by-diskseq/2-part2...
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Reached target System Shutdown.
[FAILED] Failed deactivating swap /dev/disk/by-label/swap.
[FAILED] Failed deactivating swap /dev/sda2.
[FAILED] Failed deactivating swap /…40G7_PHDV64620013240AGN-part2.
[FAILED] Failed deactivating swap /…0-deee-4935-b95d-243b70a7a46a.
[FAILED] Failed deactivating swap /…/pci-0000:00:1f.2-ata-1-part2.
[FAILED] Failed deactivating swap /dev/disk/by-diskseq/2-part2.
[FAILED] Failed deactivating swap /…ci-0000:00:1f.2-ata-1.0-part2.
[FAILED] Failed deactivating swap /…/wwn-0x55cd2e414d535632-part2.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Reached target Late Shutdown Services.
[  OK  ] Finished System Reboot.
[  OK  ] Reached target System Reboot.
[  514.611663] IPMI Watchdog: Unexpected close, not stopping watchdog!
[  575.477855] systemd-journald[1405]: Failed tosend WATCHDOG=1 notification message: Connection refused
[  663.061720] systemd-journald[1405]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  755.958575] systemd-journald[1405]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

Systemd is probably trying to remove the /nix/store overlay mount while it is in use.

rarias commented 2023-09-18 18:39:49 +02:00 (Migrated from pm.bsc.es)

Maybe we can try to use the LazyUnmount option for the overlay nix store mount.

However, the home mount shouldn't fail either.
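
For reference, a minimal sketch of how `LazyUnmount` could be set on the overlay mount through the NixOS `systemd.mounts` option. The overlay `options` string, the `wantedBy` target and the `RequiresMountsFor` dependency are assumptions for illustration, not the actual configuration of the owl nodes. Only the `/nix/store` and `/mnt/hut-nix-store` paths come from the log above.

```nix
{
  systemd.mounts = [
    {
      what = "overlay";
      where = "/nix/store";
      type = "overlay";

      # Assumption: a read-only overlay that stacks the local store on top
      # of the shared store. The real lowerdir order may differ.
      options = "lowerdir=/nix/store:/mnt/hut-nix-store";

      # Assumption: make sure the shared store is mounted first.
      unitConfig.RequiresMountsFor = "/mnt/hut-nix-store";

      # Detach the mount point at unmount time and clean up the references
      # once the file system is no longer busy (like `umount -l`).
      mountConfig.LazyUnmount = true;

      # Assumption: pull the mount in at boot through this target.
      wantedBy = [ "multi-user.target" ];
    }
  ];
}
```

A lazy unmount only hides the "busy" condition: the file system stays in use until the last process holding it exits, so this does not explain which process keeps the mount busy.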

rarias commented 2023-09-18 20:41:02 +02:00 (Migrated from pm.bsc.es)

The LazyUnmount option solves the "Failed unmounting /nix/store." error, but the home problem remains.

Reference: rarias/jungle#40