Set up a distributed filesystem #11

Closed
opened 2023-04-24 09:33:26 +02:00 by rarias · 13 comments
rarias commented 2023-04-24 09:33:26 +02:00 (Migrated from pm.bsc.es)

The cluster has 3 nodes specifically configured for storage: MDS, OSS-0 and OSS-1, which correspond to the MetaData Server and Object Storage Server roles needed by Lustre.

Based on the documentation, each of the two OSS nodes has 4 disks of 2 TB, for a total of 16 TB of storage that is currently completely unused. The current setup instead uses a single 1 TB disk in the login node, exported via NFS to the compute nodes, and it is almost full. Moreover, that storage is served over the 1 Gbit/s Ethernet port; serving it over the OmniPath network may be a better idea.

Lustre and Ceph seem to be appropriate candidates; however, Lustre appears to be incompatible with the latest kernel version.

- [x] Contact Ramón Nou to erase the disks in the MDS, OSS1 and OSS2 nodes (currently used by their Lustre installation).
- [x] Take control over mds01
- [x] Install nixos in one of the disks
- [x] Test Ceph
- [x] Mount the ceph FS in the other nodes
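As a rough sketch of the last step, this is roughly what mounting the CephFS on a compute node with the kernel client could look like; the monitor host `bay`, the client name `jungle` and the secret file path are placeholders for illustration, not the actual configuration:

```
# Hypothetical kernel-client mount of the CephFS on a compute node.
# Monitor host, client name and secret file path are placeholders.
sudo mount -t ceph bay:6789:/ /ceph \
    -o name=jungle,secretfile=/etc/ceph/jungle.secret

# Check that the mount is there and how much space it exposes.
df -h /ceph
```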
rarias commented 2023-05-02 15:12:07 +02:00 (Migrated from pm.bsc.es)

changed the description

rarias commented 2023-05-02 17:40:37 +02:00 (Migrated from pm.bsc.es)

marked the checklist item Contact Ramón Nou to erase the disks in the MDS, OSS1 and OSS2 nodes (currently used by their Lustre installation). as completed

rarias commented 2023-05-08 20:06:22 +02:00 (Migrated from pm.bsc.es)

changed the description

rarias commented 2023-05-08 20:06:25 +02:00 (Migrated from pm.bsc.es)

marked the checklist item Take control over mds01 as completed

rarias commented 2023-05-08 20:09:16 +02:00 (Migrated from pm.bsc.es)

The mds01 and oss02 nodes have a lot of unused disk space:

```
mds01$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0 223.6G  0 disk
|-sda1    8:1    0 486.3M  0 part /boot
|-sda2    8:2    0  30.5G  0 part [SWAP]
`-sda3    8:3    0 192.6G  0 part /
sdb       8:16   0 223.6G  0 disk
sdc       8:32   0 223.6G  0 disk
nvme0n1 259:1    0 372.6G  0 disk
nvme1n1 259:3    0 372.6G  0 disk
nvme2n1 259:2    0 372.6G  0 disk
nvme3n1 259:0    0 372.6G  0 disk

oss02$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0 223.6G  0 disk
|-sda1    8:1    0 486.3M  0 part /boot
|-sda2    8:2    0  30.5G  0 part [SWAP]
`-sda3    8:3    0 192.6G  0 part /
nvme0n1 259:0    0   1.8T  0 disk
nvme1n1 259:3    0   1.8T  0 disk
nvme2n1 259:2    0   1.8T  0 disk
nvme3n1 259:1    0   1.8T  0 disk
```
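For the record, a minimal sketch of how these unused drives could be turned into Ceph OSDs with `ceph-volume` (device names taken from the lsblk output above; whether all four NVMe drives per node end up as OSDs is still an open question):

```
# On a storage node, create one OSD per spare NVMe drive.
# ceph-volume prepares the device via LVM and activates the OSD daemon.
for dev in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1; do
    sudo ceph-volume lvm create --data "$dev"
done

# Confirm the new OSDs are up and in.
sudo ceph osd tree
```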
rarias commented 2023-07-28 21:23:47 +02:00 (Migrated from pm.bsc.es)

marked the checklist item Install nixos in one of the disks as completed

rarias commented 2023-08-17 16:18:38 +02:00 (Migrated from pm.bsc.es)

Ceph is set up and running on the MDS node, now renamed bay:

```
bay$ sudo ceph -s
  cluster:
    id:     9c8d06e0-485f-4aaf-b16b-06d6daf1232b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum bay (age 9d)
    mgr: bay(active, since 9d)
    mds: 1/1 daemons up, 1 standby
    osd: 4 osds: 4 up (since 9d), 4 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 280 objects, 1.0 GiB
    usage:   3.1 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     97 active+clean
```
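A couple of follow-up checks that could be run on bay to see the CephFS volume and where the reported usage lives (nothing here is specific to this cluster beyond the hostname):

```
# List the CephFS filesystems and their data/metadata pools.
sudo ceph fs ls

# Per-pool and raw usage, to see where the ~1 GiB of objects is stored.
sudo ceph df
```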
rarias commented 2023-08-22 11:31:13 +02:00 (Migrated from pm.bsc.es)

changed the description

rarias commented 2023-08-22 16:03:24 +02:00 (Migrated from pm.bsc.es)

mentioned in merge request !19

rarias commented 2023-08-22 19:03:50 +02:00 (Migrated from pm.bsc.es)

marked the checklist item Test Ceph as completed

rarias commented 2023-08-22 19:03:52 +02:00 (Migrated from pm.bsc.es)

marked the checklist item Mount the ceph FS in the other nodes as completed

rarias commented 2023-08-22 19:04:42 +02:00 (Migrated from pm.bsc.es)

The installation on bay seems to be working fine. Let's move on to the oss nodes.

The node oss01 is waiting for the voltage regulator (see #22), so I will start with oss02.

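Once oss02 is reinstalled and its 1.8T drives are added as OSDs, a quick sketch of how the new host could be verified and rebalancing watched (assuming oss02 keeps its name in the CRUSH map):

```
# OSDs grouped by host, with per-OSD usage; oss02 should list four ~1.8T OSDs.
sudo ceph osd df tree

# Overall status; PGs should go through backfill/recovery while rebalancing.
sudo ceph -s
```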
rarias commented 2023-08-24 12:25:37 +02:00 (Migrated from pm.bsc.es)

changed the description

Reference: rarias/jungle#11