BUG: using smp_processor_id() in preemptible [00000000] code: osu_bw/11417 #17

Closed
opened 2023-04-27 19:39:04 +02:00 by rarias · 11 comments
rarias commented 2023-04-27 19:39:04 +02:00 (Migrated from pm.bsc.es)

First attempt to use osu_bw causes a BUG in the hfi1 kernel module, stuck after 32K:

$ srun --mpi=pmix -N2 result/bin/osu_bw
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xeon01
  Local device: hfi1_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xeon02
  Local device: hfi1_0
--------------------------------------------------------------------------
# OSU MPI Bandwidth Test v5.7
# Size      Bandwidth (MB/s)
1                       1.86
2                       3.76
4                       7.52
8                      15.32
16                     27.13
32                     54.30
64                    111.09
128                   218.86
256                   426.96
512                   796.52
1024                 1430.20
2048                 2329.13
4096                 3406.67
8192                 4526.88
16384                5050.30
32768                5804.92

Here is the dmesg:

[21773.942016] hugetlbfs: osu_bw (11417): Using mlock ulimits for SHM_HUGETLB is obsolete
[21774.121302] BUG: using smp_processor_id() in preemptible [00000000] code: osu_bw/11417
[21774.130213] caller is sdma_select_user_engine+0x77/0xf0 [hfi1]
[21774.130336] CPU: 0 PID: 11417 Comm: osu_bw Tainted: G        W          6.1.21 #1-NixOS
[21774.130343] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
[21774.130347] Call Trace:
[21774.130353]  <TASK>
[21774.130359]  dump_stack_lvl+0x44/0x5c
[21774.130370]  check_preemption_disabled+0xe1/0xf0
[21774.130382]  sdma_select_user_engine+0x77/0xf0 [hfi1]
[21774.130464]  hfi1_user_sdma_process_request+0xc85/0x11e0 [hfi1]
[21774.130557]  hfi1_write_iter+0xe6/0x1f0 [hfi1]
[21774.130625]  do_iter_readv_writev+0xf3/0x150
[21774.130638]  do_iter_write+0x81/0x1c0
[21774.130645]  vfs_writev+0xed/0x1e0
[21774.130655]  do_writev+0x7e/0x160
[21774.130662]  do_syscall_64+0x3a/0x90
[21774.130670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[21774.130678] RIP: 0033:0x7f8a0df04407
[21774.130716] Code: 77 51 c3 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 7b 0c f8 ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 44 89 c7 48 89 4
4 24 08 e8 d4 0c f8 ff 48
[21774.130720] RSP: 002b:00007fffcfc76970 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[21774.130726] RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f8a0df04407
[21774.130730] RDX: 0000000000000002 RSI: 00007fffcfc769c0 RDI: 000000000000001b
[21774.130734] RBP: 00007fffcfc769c0 R08: 0000000000000000 R09: 00007f89fe7383f8
[21774.130738] R10: 00000000010d0d40 R11: 0000000000000293 R12: 0000000000000002
[21774.130741] R13: 00007f8a0e995d98 R14: 0000000000000002 R15: 0000000000000001
[21774.130748]  </TASK>
First attempt to use osu_bw causes a BUG in the hfi1 kernel module, stuck after 32K: ``` $ srun --mpi=pmix -N2 result/bin/osu_bw -------------------------------------------------------------------------- WARNING: There was an error initializing an OpenFabrics device. Local host: xeon01 Local device: hfi1_0 -------------------------------------------------------------------------- -------------------------------------------------------------------------- WARNING: There was an error initializing an OpenFabrics device. Local host: xeon02 Local device: hfi1_0 -------------------------------------------------------------------------- # OSU MPI Bandwidth Test v5.7 # Size Bandwidth (MB/s) 1 1.86 2 3.76 4 7.52 8 15.32 16 27.13 32 54.30 64 111.09 128 218.86 256 426.96 512 796.52 1024 1430.20 2048 2329.13 4096 3406.67 8192 4526.88 16384 5050.30 32768 5804.92 ``` Here is the dmesg: ``` [21773.942016] hugetlbfs: osu_bw (11417): Using mlock ulimits for SHM_HUGETLB is obsolete [21774.121302] BUG: using smp_processor_id() in preemptible [00000000] code: osu_bw/11417 [21774.130213] caller is sdma_select_user_engine+0x77/0xf0 [hfi1] [21774.130336] CPU: 0 PID: 11417 Comm: osu_bw Tainted: G W 6.1.21 #1-NixOS [21774.130343] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 [21774.130347] Call Trace: [21774.130353] <TASK> [21774.130359] dump_stack_lvl+0x44/0x5c [21774.130370] check_preemption_disabled+0xe1/0xf0 [21774.130382] sdma_select_user_engine+0x77/0xf0 [hfi1] [21774.130464] hfi1_user_sdma_process_request+0xc85/0x11e0 [hfi1] [21774.130557] hfi1_write_iter+0xe6/0x1f0 [hfi1] [21774.130625] do_iter_readv_writev+0xf3/0x150 [21774.130638] do_iter_write+0x81/0x1c0 [21774.130645] vfs_writev+0xed/0x1e0 [21774.130655] do_writev+0x7e/0x160 [21774.130662] do_syscall_64+0x3a/0x90 [21774.130670] entry_SYSCALL_64_after_hwframe+0x63/0xcd [21774.130678] RIP: 0033:0x7f8a0df04407 [21774.130716] Code: 77 51 c3 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 7b 0c f8 ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 44 89 c7 48 89 4 4 24 08 e8 d4 0c f8 ff 48 [21774.130720] RSP: 002b:00007fffcfc76970 EFLAGS: 00000293 ORIG_RAX: 0000000000000014 [21774.130726] RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f8a0df04407 [21774.130730] RDX: 0000000000000002 RSI: 00007fffcfc769c0 RDI: 000000000000001b [21774.130734] RBP: 00007fffcfc769c0 R08: 0000000000000000 R09: 00007f89fe7383f8 [21774.130738] R10: 00000000010d0d40 R11: 0000000000000293 R12: 0000000000000002 [21774.130741] R13: 00007f8a0e995d98 R14: 0000000000000002 R15: 0000000000000001 [21774.130748] </TASK> ```
rarias commented 2023-04-27 19:41:15 +02:00 (Migrated from pm.bsc.es)
https://www.spinics.net/lists/linux-rdma/msg107330.html
rarias commented 2023-04-28 11:21:26 +02:00 (Migrated from pm.bsc.es)

This problem is only observed in the node xeon01 not xeon02, probably the one making the sends.

I took a look at the kernel source and recent changes by Cornelis Networks, but I didn't saw anything fixing this problem, other than the above patch.

This problem is only observed in the node xeon01 not xeon02, probably the one making the sends. I took a look at the [kernel source][1] and recent [changes][2] by Cornelis Networks, but I didn't saw anything fixing this problem, other than the above patch. [1]: https://github.com/torvalds/linux/commits/master/drivers/infiniband/hw/hfi1/sdma.c [2]: https://github.com/cornelisnetworks/linux/commits/for-upstream_20230404/drivers/infiniband/hw/hfi1/sdma.c
rarias commented 2023-04-28 11:36:56 +02:00 (Migrated from pm.bsc.es)

Upgrading the kernel in xeon01 to 6.1.25 causes the problem to appear in xeon02.

Upgrading the kernel in xeon01 to 6.1.25 causes the problem to appear in xeon02.
rarias commented 2023-04-28 11:53:21 +02:00 (Migrated from pm.bsc.es)

Still failing, let's try the latest kernel 6.3.

Still failing, let's try the latest kernel 6.3.
rarias commented 2023-04-28 12:06:43 +02:00 (Migrated from pm.bsc.es)

In kernel 6.3.0 there is no such BUG error, but the osu_bw test gets stuck in the same point.

In kernel 6.3.0 there is no such BUG error, but the osu_bw test gets stuck in the same point.
rarias commented 2023-04-28 12:25:40 +02:00 (Migrated from pm.bsc.es)

Stuck with mpich too. Here are the strace interesting lines:

...
[pid  2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory)
[pid  2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory)
[pid  2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory)
...

And here the ulimits:

[rarias@xeon07:~/osumb]$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 515174
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 515174
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

This looks like it can be caused by the max locked memory limit. Let's try increasing it.

Stuck with mpich too. Here are the strace interesting lines: ``` ... [pid 2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory) [pid 2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory) [pid 2156] ioctl(27, HFI1_IOCTL_TID_UPDATE, 0x7ffc70bbaff0) = -1 ENOMEM (Cannot allocate memory) ... ``` And here the ulimits: ``` [rarias@xeon07:~/osumb]$ ulimit -a real-time non-blocking time (microseconds, -R) unlimited core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 515174 max locked memory (kbytes, -l) 8192 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 515174 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ``` This looks like it can be caused by the max locked memory limit. Let's try increasing it.
rarias commented 2023-04-28 12:40:51 +02:00 (Migrated from pm.bsc.es)

Yeah, this particular problem is caused by the locked memory limit:

$ srun -N2 --mpi=pmix bash -exc "ulimit -l 1048576; result-openmpi/bin/osu_bw"
+ ulimit -l 1048576
+ result-openmpi/bin/osu_bw
+ ulimit -l 1048576
+ result-openmpi/bin/osu_bw
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xeon01
  Local device: hfi1_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xeon02
  Local device: hfi1_0
--------------------------------------------------------------------------
# OSU MPI Bandwidth Test v5.7
# Size      Bandwidth (MB/s)
1                       1.83
2                       3.64
4                       7.40
8                      14.79
16                     25.87
32                     51.85
64                    105.71
128                   208.51
256                   406.30
512                   760.70
1024                 1354.76
2048                 2227.55
4096                 3306.22
8192                 4441.44
16384                4998.52
32768                5870.02
65536                7514.29
131072              10617.00
262144              11653.87
524288              12217.76
1048576             12263.11
2097152             12317.18
4194304             12356.58
Yeah, this particular problem is caused by the locked memory limit: ``` $ srun -N2 --mpi=pmix bash -exc "ulimit -l 1048576; result-openmpi/bin/osu_bw" + ulimit -l 1048576 + result-openmpi/bin/osu_bw + ulimit -l 1048576 + result-openmpi/bin/osu_bw -------------------------------------------------------------------------- WARNING: There was an error initializing an OpenFabrics device. Local host: xeon01 Local device: hfi1_0 -------------------------------------------------------------------------- -------------------------------------------------------------------------- WARNING: There was an error initializing an OpenFabrics device. Local host: xeon02 Local device: hfi1_0 -------------------------------------------------------------------------- # OSU MPI Bandwidth Test v5.7 # Size Bandwidth (MB/s) 1 1.83 2 3.64 4 7.40 8 14.79 16 25.87 32 51.85 64 105.71 128 208.51 256 406.30 512 760.70 1024 1354.76 2048 2227.55 4096 3306.22 8192 4441.44 16384 4998.52 32768 5870.02 65536 7514.29 131072 10617.00 262144 11653.87 524288 12217.76 1048576 12263.11 2097152 12317.18 4194304 12356.58 ```
rarias commented 2023-04-28 15:37:02 +02:00 (Migrated from pm.bsc.es)

The ulimits are not properly propagated to srun jobs:

$ (ssh xeon01 cat /proc/self/limits; srun cat /proc/self/limits) | grep locked
Max locked memory         1073741824           1073741824           bytes
Max locked memory         8388608              unlimited            bytes

They seem to be taken from the systemd service instead.

The ulimits are not properly propagated to srun jobs: ``` $ (ssh xeon01 cat /proc/self/limits; srun cat /proc/self/limits) | grep locked Max locked memory 1073741824 1073741824 bytes Max locked memory 8388608 unlimited bytes ``` They seem to be taken from the systemd service instead.
rarias commented 2023-04-28 15:55:21 +02:00 (Migrated from pm.bsc.es)

No, the slurmd service has the limit set to infinity, is slurm setting a lower limit:

$ sudo srun --slurmd-debug=6 -v cat /proc/self/limits
...
slurmstepd: debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:8388608
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded

Which is set to 8M in the launcher node (xeon07):

$ ulimit -l
8192
No, the slurmd service has the limit set to infinity, is slurm setting a [lower limit][1]: [1]: https://slurm.schedmd.com/faq.html#rlimit ``` $ sudo srun --slurmd-debug=6 -v cat /proc/self/limits ... slurmstepd: debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:8388608 slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded ``` Which is set to 8M in the launcher node (xeon07): ``` $ ulimit -l 8192 ```
rarias commented 2023-04-28 16:05:39 +02:00 (Migrated from pm.bsc.es)

mentioned in issue #16

mentioned in issue #16
rarias commented 2023-05-02 18:11:22 +02:00 (Migrated from pm.bsc.es)

Fixed for now by updating to 6.3

Fixed for now by updating to 6.3
Sign in to join this conversation.
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: rarias/jungle#17
No description provided.