Known sources of noise
in MareNostrum 4
ABSTRACT
The experiments run at MareNostrum 4 show that there are several
factors that can affect the execution time. Some may even become the
dominant part of the time, rendering the experiment invalid.
This document lists all known sources of variability and tries to give
an overview on how to detect and correct the problems.
1. Notable sources of variability
Most sources were found in the MareNostrum 4 cluster, but they may
apply to other machines. Some have a detection mechanism, so their
effect can be ruled out, but others don't. Also, some problems only
occur with low probability.
Other sources of variability with a low effect, say lower than 1% of
the mean time, are not listed here.
1.1 The daemon slurmstepd eats sys CPU in a new thread
From time to time, with unknown frequency, the slurmstepd process
creates a thread while a job is running. The thread lasts for about 10
seconds and uses quite a lot of CPU. It was first observed in the nbody
program, where it almost doubles the time per iteration, as the other
processes wait for the one with the slowed-down CPU before continuing
to the next iteration. The SLURM version was 17.11.7 and the program
was executed with sbatch+srun. See the issue for more details:
https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19
It can be detected by looking at the cycles per us view with Extrae,
with the PAPI counters enabled. It shows a slowdown in one process
when the problem occurs. Also, perf-sched(1) can be used to trace
context switches to other programs but requires access to the debugfs.
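The slowdown can also show up in the timings of the application itself.
A minimal sketch, assuming the program records the wall-clock time of
each iteration; the function name and the 1.5 factor are illustrative
choices and not part of Extrae, Paraver or SLURM:

```python
from statistics import median

def slow_iterations(times, factor=1.5):
    """Return the indices of iterations slower than factor * median.

    Both the name and the default factor are hypothetical; the factor
    only reflects that the slowdown described above almost doubles the
    time per iteration.
    """
    if not times:
        return []
    limit = factor * median(times)
    return [i for i, t in enumerate(times) if t > limit]

# Made-up timings, in seconds per iteration.
times = [1.00, 1.02, 0.99, 1.95, 1.97, 1.01, 1.00]
print(slow_iterations(times))  # [3, 4]
```

Such a check only shows that some iterations were slow. Confirming that
slurmstepd is the cause still requires a trace.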
1.2 MPICH uses ethernet rather than infiniband
Some MPI implementations (like MPICH) can silently use non-optimal
fabrics like ethernet rather than infiniband because they are
misconfigured.
It can be detected by running latency benchmarks like the OSU micro
benchmarks, which should report a low latency. It can also be checked
by using strace(1) to see which network card is being used.
1.3 CPU binding
A thread may switch between CPUs when running, leading to a drop in
performance. To ensure that it remains in the same CPU it can be
bound with srun(1) or sbatch(1) using the --cpu-bind option, or using
taskset(1).
It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, each process must show a different color (its
assigned CPU) and keep it constant. Otherwise, processes are switching
between CPUs.
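As a complement to the trace, a process can inspect its own affinity
mask. A minimal sketch, assuming Linux (os.sched_getaffinity is not
available elsewhere) and assuming that "bound" means a mask with a
single CPU; the function name is hypothetical:

```python
import os

def pinned_cpu(pid=0):
    """Return the CPU the process is bound to, or None if it may migrate.

    A pid of 0 refers to the calling process.
    """
    allowed = os.sched_getaffinity(pid)
    if len(allowed) == 1:
        return next(iter(allowed))
    return None

cpu = pinned_cpu()
if cpu is None:
    print("not bound, allowed CPUs:", sorted(os.sched_getaffinity(0)))
else:
    print("bound to CPU", cpu)
```

A mask with several CPUs is not necessarily wrong (for example, a
multithreaded process bound to a set of cores), so the result should be
read against the binding that was requested.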
1.4 Libraries that use dlopen(3)
Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime.
This behavior can cause the execution time of a program to change
depending on environment variables like LD_LIBRARY_PATH.
It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.
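The loader output is verbose, so it helps to extract only the files it
tried to open. A minimal sketch, assuming the glibc output format and
using LD_DEBUG=libs (a subset of LD_DEBUG=all); the function name and
the sample paths are made up:

```python
import re

# Matches glibc loader lines such as "  1234:  trying file=/lib/libfoo.so".
TRYING = re.compile(r"^\s*\d+:\s+trying file=(\S+)")

def tried_files(ld_debug_output):
    """Return the paths the dynamic loader tried, in order of appearance."""
    paths = []
    for line in ld_debug_output.splitlines():
        match = TRYING.match(line)
        if match:
            paths.append(match.group(1))
    return paths

# Made-up excerpt in the style printed with LD_DEBUG=libs.
sample = """\
      4242:     find library=libfoo.so [0]; searching
      4242:       trying file=/opt/a/libfoo.so
      4242:       trying file=/opt/b/libfoo.so
"""
print(tried_files(sample))  # ['/opt/a/libfoo.so', '/opt/b/libfoo.so']
```

The loader writes this output to stderr by default, so it can be saved
with a redirection such as 2> ld.log and compared between two
environments.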
1.5 Intel MPI library selection
The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
variable controls whether the multithread capabilities are enabled or
not.
1.6 LLVM and OpenMP problem
The LLVM OpenMP implementation is installed in libomp.so, however two
symbolic links are created for libgomp.so and libiomp5.so.
libgomp.so -> libomp.so
libiomp5.so -> libomp.so
libomp.so
So applications compiled with OpenMP by other compilers may end up
using the LLVM implementation. This can be observed by setting
LD_DEBUG=all or using strace(1) and looking for the libomp.so library
being loaded.
In bscpkgs the symbolic links have been removed for the clangOmpss2
compiler.
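For a process that is already running, the loaded OpenMP runtime can
also be read from /proc/<pid>/maps. A minimal sketch, assuming Linux;
the function names are hypothetical. The kernel lists the resolved
file, so a library opened through the libgomp.so symbolic link appears
as libomp.so:

```python
import os

OPENMP_RUNTIMES = ("libomp.so", "libgomp.so", "libiomp5.so")

def mapped_libraries(maps_text):
    """Return the set of file paths found in /proc/<pid>/maps content."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode and optional path.
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            paths.add(fields[5])
    return paths

def openmp_runtimes(maps_text):
    """Return the mapped files whose name looks like an OpenMP runtime."""
    return sorted(p for p in mapped_libraries(maps_text)
                  if os.path.basename(p).startswith(OPENMP_RUNTIMES))

# Made-up excerpt of a maps file.
sample = """\
7f0000000000-7f0000100000 r-xp 00000000 08:01 1234 /opt/llvm/lib/libomp.so
7f0000200000-7f0000300000 r-xp 00000000 08:01 5678 /usr/lib64/libc.so.6
7f0000400000-7f0000500000 rw-p 00000000 00:00 0
"""
print(openmp_runtimes(sample))  # ['/opt/llvm/lib/libomp.so']
```

For a real process the input would be the content of its maps file,
read while the program is still running.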
1.7 Nix-shell does not allow isolation
Nix-shell is not isolated, so the compilation process may try to use
headers and libraries from /usr.
This can cause compilation errors that don't happen inside nix-build.
Do not use nix-shell when reproducibility must be ensured.
1.8 Make doesn't rebuild objects
When using a local repository as the source code (e.g. with developer
mode on), a make clean at the preBuild stage is required.
Nix sets the same modification date (one second after the Epoch,
1970-01-01 at 00:00:01 in the UTC timezone) on all the files in the nix
store (also those copied from repositories). Make checks the
modification date of the files to decide whether to run the compilation
instructions. If any object or binary file already exists outside Nix
at the time we build within Nix, it will be copied with the current
date and consequently not updated during the Nix compilation process.
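The timestamp comparison can be reproduced outside Nix. A minimal
sketch that mimics only the basic rule of Make (rebuild when the target
is missing or older than its prerequisite); the function and file names
are hypothetical:

```python
import os
import tempfile
import time

def make_would_rebuild(target, prerequisite):
    """Mimic the basic Make rule: rebuild if the target is missing or
    older than its prerequisite."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(prerequisite) > os.path.getmtime(target)

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "main.c")
    obj = os.path.join(d, "main.o")
    for path in (src, obj):
        with open(path, "w") as f:
            f.write("")
    # Source as stored by Nix: one second after the Epoch.
    os.utime(src, (1, 1))
    # Stale object copied from outside Nix with a current timestamp.
    now = time.time()
    os.utime(obj, (now, now))
    stale = make_would_rebuild(obj, src)
    print(stale)  # False: the stale object is considered up to date
```

Since the object is always newer than a source dated at the Epoch, only
removing it beforehand (the make clean above) forces the rebuild.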
1.9 Sbatch silently fails on parsing
When submitting a job with a wrong specification in MN4 with SLURM
17.11.9-2, for example this bogus line:
#SBATCH --nodes=1 2
sbatch silently fails to parse the options and falls back to the
defaults, without reporting any error.
We have improved our checking to detect bogus options passed to SLURM,
so we prevent this problem from happening.
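A check of this kind could look as follows. This is a minimal sketch of
a heuristic and not the actual check implemented in bscpkgs; the
function name is hypothetical. It reports any bare token in a #SBATCH
line that cannot be the value of the preceding option:

```python
import shlex

def bogus_sbatch_tokens(script_text):
    """Return (line_number, token) pairs for suspicious #SBATCH tokens.

    Heuristic only: a token that does not start with '-' is accepted as
    the value of a preceding option written without '=' (as in "-N 1"
    or "--nodes 1"); any other bare token is reported.
    """
    problems = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if not line.startswith("#SBATCH"):
            continue
        expects_value = False
        for token in shlex.split(line[len("#SBATCH"):]):
            if token.startswith("-"):
                # An option such as "--opt=value" carries its own value.
                expects_value = "=" not in token
            elif expects_value:
                expects_value = False
            else:
                problems.append((number, token))
    return problems

script = """#!/bin/bash
#SBATCH --nodes=1 2
#SBATCH -n 4
"""
print(bogus_sbatch_tokens(script))  # [(2, '2')]
```

The heuristic does not know which options take a value, so it accepts
a bare token after a flag that takes none. A stricter check would need
the list of options accepted by the installed SLURM version.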
1.10 The srun program misses signals after MPI_Finalize
When a program receives a signal such as SIGSEGV after calling
MPI_Finalize, srun at version 17.11.7 doesn't return an error code but
exits with 0.
This can cause faulty programs to go undetected when only checking the
return code of srun. A better approach is to check the exit code with
sacct(1) or write the exit code to a file and check it later.
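The second approach can be sketched with a small wrapper that runs the
program and stores its return code. This is only an illustration under
the assumption that the wrapper is what gets launched in place of the
program; the function names and the file layout are hypothetical:

```python
import os
import subprocess
import sys
import tempfile

def run_and_record(command, status_file):
    """Run command and write its return code to status_file.

    subprocess reports a child killed by signal N as return code -N.
    """
    result = subprocess.run(command)
    with open(status_file, "w") as f:
        f.write(f"{result.returncode}\n")
    return result.returncode

def recorded_status(status_file):
    """Read back the return code written by run_and_record."""
    with open(status_file) as f:
        return int(f.read().strip())

with tempfile.TemporaryDirectory() as d:
    status_file = os.path.join(d, "status")
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_and_record(failing, status_file)
    print(recorded_status(status_file))  # 3
```

With several processes, each one would need its own status file so that
the results don't overwrite each other.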
/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */