Fix memory limit in fox and remove IPMI watchdog #257

Open
rarias wants to merge 3 commits from fox-limit into master

To prevent jobs from using more memory than is available, we enable the cgroup limit in SLURM, leaving 1% for the system. The limit is enforced through a cgroup, which is only applied when memory is also configured as a consumable resource. We also need to constrain swap; otherwise the excess memory usage is simply moved to swap.

The configuration has been updated on all SLURM nodes, so the limit should now be in effect.
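
As a rough illustration of the settings described above, the following fragment shows how such a limit is commonly expressed in plain SLURM configuration files. It is a sketch under assumptions, not the change made in this pull request: the split across files, the `select/cons_tres` plugin and the exact percentages are illustrative, and the repository may generate its configuration differently.

```
# slurm.conf (assumed layout): treat memory as a consumable resource
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/cgroup

# cgroup.conf (assumed layout)
ConstrainRAMSpace=yes    # enforce the RAM limit through the job cgroup
AllowedRAMSpace=99       # percent of the allocated memory, leaving 1% for the system
ConstrainSwapSpace=yes   # otherwise the excess usage can move to swap
AllowedSwapSpace=0       # percent of the allocated memory allowed as extra swap
```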

The following session shows how a test program is killed when it makes a large allocation, instead of triggering the kernel OOM killer:

apex% salloc -p fox --exclusive
salloc: Granted job allocation 49464
salloc: Nodes fox are ready for job

fox% cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job*/memory.max
798875836416

fox% cd /nfs/$HOME/jungle

fox% cat mem.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
        int size_gb = 4;
        if (argc > 1)
                size_gb = atoi(argv[1]);

        size_t size_mb = (size_t) size_gb * 1024ULL;
        size_t one_mb = 1024ULL * 1024ULL;
        printf("size = %zu (MiB)\n", size_mb);
        void *m = malloc(size_mb * one_mb);
        if (m == NULL) {
                perror("malloc failed");
                return 1;
        }
        /* Touch every MiB so the pages are actually backed by memory. */
        for (size_t i = 0; i < size_mb; i++) {
                void *p = (char *) m + i * one_mb;
                memset(p, 1, one_mb);
                if ((i % 256) == 0)
                        printf("i=%zu\n", i);
        }
        printf("m=%p\n", m);
        free(m);
        return 0;
}

fox% free -g
               total        used        free      shared  buff/cache   available
Mem:             755           7         749           0           1         747
Swap:              0           0           0

fox% ./mem 755
size = 773120 (MiB)
i=0
i=256
i=512
i=768
i=1024
...
i=759552
i=759808
i=760064
i=760320
zsh: killed     ./mem 755

fox% exit
[2026-04-01T20:29:02.147] error: Detected 1 oom_kill event in StepId=49463.interactive. Some of the step tasks have been OOM Killed.
srun: error: fox: task 0: Out Of Memory
salloc: Relinquishing job allocation 49463
salloc: Job allocation 49463 has been revoked.

We also blacklist the IPMI watchdog so that fox can run despite its buggy, unreliable BMC, and we give Dylan access to the owl nodes.
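
For readers unfamiliar with module blacklisting, a minimal sketch of what it usually looks like on a generic Linux system follows. The file path is an assumption and this is not necessarily how the repository implements it; `ipmi_watchdog` is the name of the upstream kernel module.

```
# /etc/modprobe.d/blacklist-ipmi-watchdog.conf (assumed path)
# Prevent the IPMI watchdog module from being loaded automatically.
blacklist ipmi_watchdog
```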

CC @varcila

rarias added 3 commits 2026-04-01 20:41:44 +02:00
Make sure that jobs cannot allocate more memory than available so we
don't trigger the OOM killer.

Fixes: #178
Disable IPMI watchdog in fox
All checks were successful
CI / build:all (pull_request) Successful in 1h38m3s
CI / build:cross (pull_request) Successful in 1h42m59s
5d38b0d3a5
Fixes: #231

Reference: rarias/jungle#257