Setup lake2 node for ceph storage #28
The lake2 machine will provide 7.2 TiB of disk space for the Ceph filesystem. To install NixOS:
marked the checklist item Enable serial in GRUB as completed
marked the checklist item Reboot and get a serial console as completed
marked the checklist item Boot NixOS with kexec as completed
Hmm, not sure if the NVMe disks are okay... They are taking forever to write files.
marked the checklist item Perform the installation as completed
The BMC keeps dropping the serial over LAN connection, making it a pain. Apparently, neither the BIOS nor GRUB can boot from the NVMe disks.
One option is to install NixOS in its own partition on the sda disk, after making some space. However, I would need to kexec into NixOS again so that the sda FS is not in use, then shrink the partition, create another one, install NixOS there and attempt to boot.
However, this risks leaving the node unbootable if the shrink process fails. Before attempting that, I should set up a secondary boot mechanism (maybe PXE) so I can load another rescue system to fix it. The BMC dropping the console every few minutes won't help either.
Configuring PXE with pixiecore doesn't seem to be straightforward, as I also need to set up a DNS server, and the serial console is too painful to use to even try to debug the PXE agent.
So, I'm overwriting the partition table from the kexec'd NixOS in RAM. Hopefully I manage to make the system boot again. Otherwise I will need to set up PXE or go there and load NixOS from a USB pendrive.
It says all went okay...
The disk is marked as bootable:
And the GRUB entry seems coherent:
And generated fstab looks good too:
Let's try our luck. I won't be able to see the output via serial, so if I don't get a shell via SSH, something went wrong. Rebooting...
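Since SSH is the only way to tell whether the reboot worked, a small poller can do the waiting. This is a minimal sketch, not what was actually run here: the function name `wait_for_port`, the timeout values and the assumption that the hostname `lake2` resolves from the machine running it are all illustrative. It only checks that the TCP port accepts connections, which suggests sshd is up but does not prove a login will succeed.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=600.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Returns the number of seconds waited on success, or None on timeout.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable or timed out: wait and try again.
            time.sleep(interval)
    return None
```

Calling `wait_for_port("lake2")` right after issuing the reboot would return once port 22 answers, or `None` after ten minutes, which would be the cue to fall back to PXE or a USB pendrive.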
GRUB loaded!
Network is up:
Aaand, done:
marked the checklist item Add NixOS entry to GRUB as completed
Lake2 now has visibility of the Ceph cluster:
marked the checklist item Configure the disks for ceph as completed
Writing several GB causes the write operation to hang:
Relevant issue: https://tracker.ceph.com/issues/54044
Maybe upgrading the MDS to Ceph 18 fixes the problem.
There is a PR to update Ceph to 18.2.0, but it hasn't landed in nixpkgs yet: https://github.com/NixOS/nixpkgs/pull/247849
I took the derivation and placed it in an overlay, let's see if I can upgrade Ceph.
Upgraded to 18.2.0, still hangs after 21 GB:
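To pin down where a large write stalls, a chunked writer that reports progress can help. The following is a minimal sketch and not the test used in this issue: the function name `write_test`, the chunk size and the sync interval are assumptions, and it writes zeros, which may behave differently from real data if compression is involved. If the filesystem hangs, the last printed line shows roughly how far the write got.

```python
import os
import time


def write_test(path, total_bytes, chunk_bytes=64 * 1024 * 1024, sync_every=16):
    """Write total_bytes of zeros to path, printing progress periodically.

    Returns the number of bytes written.
    """
    chunk = bytes(chunk_bytes)  # a buffer of zeros, reused for every write
    written = 0
    chunks_done = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            size = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:size])
            written += size
            chunks_done += 1
            if chunks_done % sync_every == 0:
                # Force the data out so the progress reflects what has
                # actually been handed to the filesystem, not just buffered.
                f.flush()
                os.fsync(f.fileno())
                elapsed = time.monotonic() - start
                print(f"{written / 1e9:.2f} GB in {elapsed:.1f} s", flush=True)
        f.flush()
        os.fsync(f.fileno())
    return written
```

For example, `write_test("/ceph/tmp/bigfile", 30 * 10**9)` would attempt a 30 GB write; the path is hypothetical and should point somewhere inside the Ceph mount.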
marked the checklist item Set default boot entry to NixOS as completed
changed the description
mentioned in issue #29
mentioned in merge request !20
Adding the lake1 (oss01) NVMe disks to lake2 and bay won't work, as only the last 4 bays support NVMe:
Without root_squash it runs fine:
So let's use this until #29 gets fixed.
marked the checklist item Do some benchmarks as completed