Cephfs hangs after a few writes #29

Open
opened 2023-08-28 08:56:32 +02:00 by rarias · 5 comments
rarias commented 2023-08-28 08:56:32 +02:00 (Migrated from pm.bsc.es)

Following #28, using the following test from hut:

hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress
...
08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55
08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432
08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432
08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432
08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached
 <detached ...>

21+0 records in
20+0 records out
671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s

Causes the write to block, with this stack:

hut# cat /proc/1402180/stack
[<0>] wait_woken+0x54/0x70
[<0>] ceph_get_caps+0x4b3/0x6f0 [ceph]
[<0>] ceph_write_iter+0x316/0xdc0 [ceph]
[<0>] vfs_write+0x22e/0x3f0
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x3e/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1

Same with ceph 17.2.5 and 18.2.0, both tests on the linux kernel 6.4.11.

Enabling the dynamic debug for the caps file in the ceph kernel module with:

hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control

Results in the following messages: dmesg.log

Following #28, using the following test from hut: ``` hut$ strace -f -tt dd if=/dev/urandom of=/ceph/rarias/kk bs=32M count=$((32*1024)) status=progress ... 08:45:34.985445 write(2, "603979776 bytes (604 MB, 576 MiB"..., 55603979776 bytes (604 MB, 576 MiB) copied, 5 s, 117 MB/s) = 55 08:45:34.985525 read(0, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432 08:45:35.237872 write(1, "I2\223ss\201}\233b\204{\262:L\212/W\25t\301vuq}\354d\272\302*u0\253"..., 33554432) = 33554432 08:45:35.270395 read(0, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432 08:45:35.523104 write(1, "\372+\234\205g\332|\201\350A;\254\34P\215x\374\255'90\257\257\v\341\227\251\355A\32\350#"..., 33554432) = 33554432 08:45:35.556310 read(0, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432) = 33554432 08:45:35.808834 write(1, "I\352\203L\17\323\345\355\335L\304\334XB~\327'\177U\24\333\221I\273Sjz\177N\243Hh"..., 33554432^Cstrace: Process 1695154 detached <detached ...> 21+0 records in 20+0 records out 671088640 bytes (671 MB, 640 MiB) copied, 85,9042 s, 7,8 MB/s ``` Causes the write to block, with this stack: ``` hut# cat /proc/1402180/stack [<0>] wait_woken+0x54/0x70 [<0>] ceph_get_caps+0x4b3/0x6f0 [ceph] [<0>] ceph_write_iter+0x316/0xdc0 [ceph] [<0>] vfs_write+0x22e/0x3f0 [<0>] ksys_write+0x6f/0xf0 [<0>] do_syscall_64+0x3e/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x77/0xe1 ```` Same with ceph 17.2.5 and 18.2.0, both tests on the linux kernel 6.4.11. Enabling the dynamic debug for the caps file in the ceph kernel module with: ``` hut# echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control ``` Results in the following messages: [dmesg.log](/uploads/f0179e6227708af7ae6319f22ed5f2ee/dmesg.log)
rarias commented 2023-08-28 10:04:29 +02:00 (Migrated from pm.bsc.es)
Reported upstream https://tracker.ceph.com/issues/62604
rarias commented 2023-08-28 10:05:11 +02:00 (Migrated from pm.bsc.es)

changed the description

changed the description
rarias commented 2023-08-29 11:40:14 +02:00 (Migrated from pm.bsc.es)

It seems the ceph-fuse client is failing too, with another problem.

It seems the ceph-fuse client is failing too, with another problem.
rarias commented 2023-09-04 12:09:06 +02:00 (Migrated from pm.bsc.es)

This problems seems to be caused by the root_squash option. Without it, both dd and fio tests run fine.

This problems seems to be caused by the root_squash option. Without it, both dd and fio tests run fine.
rarias commented 2023-09-04 12:17:38 +02:00 (Migrated from pm.bsc.es)

mentioned in issue #28

mentioned in issue #28
Sign in to join this conversation.
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: rarias/jungle#29
No description provided.