BUG: using smp_processor_id() in preemptible [00000000] code: osu_bw/11417 #17
The first attempt to run osu_bw triggers a BUG in the hfi1 kernel module, and the test gets stuck after 32K:
Here is the dmesg:
https://www.spinics.net/lists/linux-rdma/msg107330.html
This problem is only observed on node xeon01, not on xeon02; xeon01 is probably the one doing the sends.
I took a look at the kernel source and the recent changes by Cornelis Networks, but I didn't see anything that fixes this problem other than the above patch.
Upgrading the kernel on xeon01 to 6.1.25 causes the problem to appear on xeon02.
Still failing, let's try the latest kernel 6.3.
With kernel 6.3.0 there is no such BUG error, but the osu_bw test gets stuck at the same point.
It gets stuck with mpich too. Here are the interesting lines from strace:
And here are the ulimits:
This looks like it can be caused by the max locked memory limit. Let's try increasing it.
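One way to see the limit a process actually runs with is to query it from inside the process. The following is a minimal sketch, not taken from the original session: it uses Python's standard `resource` module, and the helper names `format_limit` and `memlock_limits` are hypothetical. Running it once from a login shell and once under `srun` should show whether the two environments differ.

```python
import resource


def format_limit(value):
    """Render an rlimit value in bytes, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes ({value // 1024} KiB)"


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK pair of the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory (soft):", format_limit(soft))
    print("max locked memory (hard):", format_limit(hard))
```

The soft limit is the one enforced when the process tries to lock memory; an unprivileged process can raise it only up to the hard limit.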
Yeah, this particular problem is caused by the locked memory limit:
The ulimits are not properly propagated to srun jobs:
They seem to be taken from the systemd service instead.
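To check which limit an already running process (such as a daemon) has, Linux exposes the values in `/proc/<pid>/limits`. The sketch below is an illustration under assumptions, not the commands used in this thread: the function names `parse_locked_memory` and `read_locked_memory` are hypothetical, and the parser relies on the columns of that file being separated by runs of spaces.

```python
import re


def parse_locked_memory(limits_text):
    """Extract (soft, hard) from the 'Max locked memory' row of a limits file."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None


def read_locked_memory(pid="self"):
    """Read the limit of a live process; pid is a number or 'self'."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_locked_memory(f.read())


if __name__ == "__main__":
    print(read_locked_memory())
```

Passing the PID of the daemon instead of `"self"` shows the limits that daemon runs with, which may differ from the limits of the jobs it launches. Reading another user's `/proc/<pid>/limits` may require root.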
No, the slurmd service has the limit set to infinity; it is Slurm that is setting a lower limit:
Which is set to 8M on the launcher node (xeon07):
mentioned in issue #16
Fixed for now by updating to 6.3