Known sources of noise
in MareNostrum 4

ABSTRACT

Experiments run on MareNostrum 4 show that several factors can affect
the execution time. Some may even become the dominant part of the
time, rendering the experiment invalid.

This document lists all known sources of variability and gives an
overview of how to detect and correct the problems.

1. Notable sources of variability

Most of these sources were found in the MareNostrum 4 cluster, but
they may apply to other machines. Some can be detected, so their
effect can be ruled out, but others cannot. Also, some problems only
occur with low probability.

Sources of variability with a small effect, say lower than 1% of the
mean time, are not listed here.

1.1 The daemon slurmstepd eats sys CPU in a new thread

While a job is running, the slurmstepd process occasionally creates a
thread that uses quite a lot of CPU for a period of about 10 seconds.
The frequency of this event is unknown. It was first observed in the
nbody program, where it almost doubles the time per iteration, because
the other processes wait for the one with the slowed-down CPU before
continuing to the next iteration. The SLURM version was 17.11.7 and
the program was executed with sbatch+srun. See the issue for more
details:

https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19

It can be detected by looking at the cycles per us view with Extrae,
with the PAPI counters enabled. It shows a slowdown in one process
when the problem occurs. Also, perf-sched(1) can be used to trace
context switches to other programs, but it requires access to debugfs.
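
Since the problem shows up as iterations that take almost twice as
long, a coarse first check is to flag outliers in the per-iteration
times before opening a trace. The sketch below is only an
illustration: the function name, the 1.5 factor and the sample times
are assumptions, not measurements from MareNostrum 4.

```python
from statistics import median


def flag_slow_iterations(times, factor=1.5):
    """Return the indices of iterations slower than factor * median."""
    if not times:
        return []
    limit = factor * median(times)
    return [i for i, t in enumerate(times) if t > limit]


# Made-up per-iteration times in seconds; iterations 3 and 4 take
# roughly twice as long as the others.
times = [1.00, 1.02, 0.99, 1.95, 1.98, 1.01]
print(flag_slow_iterations(times))
```

A flagged iteration only says that something slowed the run down. The
Extrae or perf-sched(1) views are still needed to attribute it to
slurmstepd.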

1.2 MPICH uses ethernet rather than infiniband

Some MPI implementations (like MPICH) can silently use non-optimal
fabrics like ethernet rather than infiniband because they are
misconfigured.

It can be detected by running latency benchmarks like the OSU micro
benchmark, which should report a low latency. It can also be checked
by using strace(1) to see which network card is being used.
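
The latency check can be automated by parsing the benchmark output and
comparing the smallest-message latency against a limit. The sketch
below assumes the output consists of comment lines starting with "#"
followed by two whitespace-separated columns (message size in bytes
and latency in us). The function names, the 5 us limit and the sample
numbers are hypothetical, and the limit has to be calibrated for the
fabric in use.

```python
def parse_latency_table(output):
    """Map message size in bytes to latency in microseconds."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip comment/header lines and anything that is not two columns.
        if len(fields) != 2 or line.lstrip().startswith("#"):
            continue
        try:
            result[int(fields[0])] = float(fields[1])
        except ValueError:
            continue
    return result


def looks_slow(latencies, limit_us=5.0):
    """True if the smallest message size exceeds the assumed limit."""
    if not latencies:
        raise ValueError("no latency rows found")
    return latencies[min(latencies)] > limit_us


# Made-up sample in the assumed two-column format.
sample = """# Size          Latency (us)
0                       1.10
8                       1.15
"""
latencies = parse_latency_table(sample)
print(latencies, looks_slow(latencies))
```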

1.3 CPU binding

A thread may switch between CPUs when running, leading to a drop in
performance. To ensure that it remains on the same CPU, it can be
bound with srun(1) or sbatch(1) using the --cpu-bind option, or with
taskset(1).

It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, all processes must have a different color from
each other (the assigned CPU) and keep it constant. Otherwise, changes
of CPU are happening.
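
Besides inspecting a trace, a process can report the set of CPUs it is
allowed to run on, which shows whether a binding was applied at all.
The following sketch uses os.sched_getaffinity, which is only
available on Linux. The helper names are hypothetical, and "exactly
one CPU" is an assumption that only makes sense for a single-threaded
process.

```python
import os


def is_pinned(cpus):
    """True if the allowed CPU set contains exactly one CPU."""
    return len(cpus) == 1


def current_affinity():
    """Allowed CPUs for this process, or None where unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)
    return None


cpus = current_affinity()
if cpus is None:
    print("CPU affinity is not available on this platform")
else:
    print("allowed CPUs:", sorted(cpus), "pinned:", is_pinned(cpus))
```

Running the script under taskset(1) with a single CPU in the list
should report it as pinned. Without any binding, the set usually
contains every CPU available to the job.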

1.4 Libraries that use dlopen(3)

Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime.

This behavior can cause the execution time of a program to change
depending on environment variables like LD_LIBRARY_PATH.

It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.
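
A complementary check on Linux is to list the shared objects that a
process actually mapped, and to compare the lists obtained under
different environments. The sketch below parses text in the format of
/proc/<pid>/maps. The function name and the sample paths are
hypothetical.

```python
def loaded_libraries(maps_text):
    """Return the sorted shared object paths found in maps_text."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, device, inode, pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if ".so" in path.rsplit("/", 1)[-1]:
                libs.add(path)
    return sorted(libs)


# Made-up sample with two libraries, an anonymous mapping and the stack.
sample = "\n".join([
    "7f00-7f01 r-xp 00000000 08:01 123 /usr/lib/libc.so.6",
    "7f02-7f03 rw-p 00000000 00:00 0 ",
    "7f04-7f05 r--p 00000000 08:01 456 /opt/lib/libfoo.so",
    "7f06-7f07 r-xp 00000000 08:01 123 /usr/lib/libc.so.6",
    "7ffd-7ffe rw-p 00000000 00:00 0 [stack]",
])
print(loaded_libraries(sample))

# On Linux the same function can inspect the current process.
try:
    with open("/proc/self/maps") as f:
        own_libs = loaded_libraries(f.read())
except OSError:
    own_libs = []
```

If two runs with different LD_LIBRARY_PATH values produce different
lists, a component is being selected at run time.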

1.5 Intel MPI library selection

The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
variable controls whether the multithread capabilities are enabled or
not.

1.6 LLVM and OpenMP problem

The LLVM OpenMP implementation is installed in libomp.so, however two
symbolic links are created for libgomp.so and libiomp5.so.

libgomp.so -> libomp.so
libiomp5.so -> libomp.so
libomp.so

So applications compiled with OpenMP by other compilers may end up
using the LLVM implementation. This can be observed by setting
LD_DEBUG=all or using strace(1) and looking for the libomp.so library
being loaded.

In bscpkgs the symbolic links have been removed for the clangOmpss2
compiler.
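
Whether a given library name is really a link to the LLVM runtime can
also be checked directly on the file system. The sketch below follows
symbolic links with os.path.realpath and tests the final file name. It
recreates the layout described above in a temporary directory, so the
paths are only illustrative and the function name is hypothetical.

```python
import os
import tempfile


def resolves_to_llvm_omp(path):
    """True if path ends at a file named libomp.so* after following links."""
    real = os.path.realpath(path)
    return os.path.basename(real).startswith("libomp.so")


with tempfile.TemporaryDirectory() as libdir:
    libomp = os.path.join(libdir, "libomp.so")
    libgomp = os.path.join(libdir, "libgomp.so")
    other = os.path.join(libdir, "libother.so")
    for path in (libomp, other):
        with open(path, "w") as f:
            f.write("")
    # Same layout as the listing: libgomp.so -> libomp.so
    os.symlink("libomp.so", libgomp)

    link_is_llvm = resolves_to_llvm_omp(libgomp)
    other_is_llvm = resolves_to_llvm_omp(other)

print(link_is_llvm, other_is_llvm)
```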

1.7 Nix-shell does not allow isolation

Nix-shell is not isolated, so the compilation process may try to use
headers and libraries from /usr.

This can cause compilation errors that do not happen inside nix-build.
Do not use nix-shell when reproducibility is required.

1.8 Make doesn't rebuild objects

When using a local repository as the source code (e.g. with developer
mode on), a make clean at the preBuild stage is required.

Nix sets the same modification date, one second after the Epoch
(1970-01-01 at 00:00:01 in the UTC timezone), on all the files in the
nix store, including those copied from repositories. Make checks the
modification date of the files to decide whether to run the
compilation instructions. If any object or binary file exists outside
Nix at the time we build within Nix, it will be copied with the
current date and consequently not updated during the Nix compilation
process.
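
The effect of the timestamps can be reproduced outside Nix. The sketch
below approximates the rule used by make (rebuild when the target is
missing or older than a prerequisite) and applies it to a source file
stamped one second after the Epoch next to a stale object file. The
function and file names are hypothetical, and the rule is a
simplification of what make actually does.

```python
import os
import tempfile


def make_would_rebuild(target, prerequisites):
    """Rebuild if the target is missing or older than a prerequisite."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


with tempfile.TemporaryDirectory() as builddir:
    src = os.path.join(builddir, "main.c")
    obj = os.path.join(builddir, "main.o")
    for path in (src, obj):
        with open(path, "w") as f:
            f.write("")
    # The source gets the date used in the nix store.
    os.utime(src, (1, 1))
    # The object keeps the current date, like a stale file copied in.
    stale_rebuilt = make_would_rebuild(obj, [src])
    # Removing the object is what "make clean" achieves.
    os.remove(obj)
    clean_rebuilt = make_would_rebuild(obj, [src])

print(stale_rebuilt, clean_rebuilt)
```

The stale object is never older than the source, so it is kept until
it is removed explicitly.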

1.9 Sbatch silently fails on parsing

When submitting a job with a wrong specification in MN4 with SLURM
17.11.9-2, for example with this bogus line:

#SBATCH --nodes=1 2

sbatch silently fails to parse the options and falls back to the
defaults, without reporting any error.

We have improved our checking to detect bogus options passed to SLURM,
so we prevent this problem from happening.
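
A check of this kind could be sketched as follows. This is not the
check mentioned above, only a hypothetical heuristic: it flags any
#SBATCH line where a bare value appears without an option to attach
to, which is the case in the bogus line. It does not know which
options take a value, so it will miss other kinds of mistakes.

```python
import shlex


def suspicious_sbatch_lines(script):
    """Return (line_number, text) for #SBATCH lines with stray values."""
    found = []
    for number, line in enumerate(script.splitlines(), start=1):
        if not line.startswith("#SBATCH"):
            continue
        previous = None
        for token in shlex.split(line[len("#SBATCH"):]):
            is_option = token.startswith("-")
            # A bare value is only expected right after an option that
            # does not already carry its value with "=".
            attachable = (
                previous is not None
                and previous.startswith("-")
                and "=" not in previous
            )
            if not is_option and not attachable:
                found.append((number, line))
                break
            previous = token
    return found


script = """#!/bin/bash
#SBATCH --nodes=1 2
#SBATCH -N 1
#SBATCH --ntasks=4 --time=00:10:00
"""
print(suspicious_sbatch_lines(script))
```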

1.10 The srun program misses signals after MPI_Finalize

When a program receives a signal such as SIGSEGV after calling
MPI_Finalize, srun at version 17.11.7 doesn't return an error code but
exits with 0.

This can cause faulty programs to go undetected when only the return
code of srun is checked. A better approach is to check the exit code
with sacct(1), or to write the exit code to a file and check it later.
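
The second approach can be sketched with a small wrapper that runs the
program, stores its return code in a file and lets a later step read
it back. The function names and the status file are hypothetical, and
the child process here is a stand-in that simply exits with code 3.

```python
import os
import subprocess
import sys
import tempfile


def run_and_record(command, status_path):
    """Run command, write its return code to status_path and return it."""
    completed = subprocess.run(command)
    with open(status_path, "w") as f:
        f.write(str(completed.returncode) + "\n")
    return completed.returncode


def read_status(status_path):
    """Read back the return code stored by run_and_record."""
    with open(status_path) as f:
        return int(f.read().strip())


with tempfile.TemporaryDirectory() as workdir:
    status_file = os.path.join(workdir, "exit_status")
    # Stand-in for the real program: a child that exits with code 3.
    run_and_record([sys.executable, "-c", "import sys; sys.exit(3)"],
                   status_file)
    recorded = read_status(status_file)

print(recorded)
```

With subprocess on POSIX systems, a negative return code -N means the
child was terminated by signal N, so a crash after MPI_Finalize would
be recorded even if the launcher itself exits with 0.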

/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */