Remove NOISE file
parent
4533c94b4f
commit
f9c832654e
@ -1,147 +0,0 @@
Known sources of noise
in MareNostrum 4

ABSTRACT

Experiments run on MareNostrum 4 show that several factors can affect
the execution time. Some may even become the dominant part of the
time, rendering the experiment invalid.

This document lists all known sources of variability and gives an
overview of how to detect and correct the problems.

1. Notable sources of variability

Most of these sources were found in the MareNostrum 4 cluster, but
they may apply to other machines. Some have a detection mechanism, so
their effect can be ruled out, but others don't. Also, some problems
only occur with low probability.

Sources of variability with a low effect, say below 1% of the mean
time, are not listed here.

1.1 The daemon slurmstepd eats sys CPU in a new thread

While a job is running, the slurmstepd process occasionally creates a
thread that uses quite a lot of CPU for a period of about 10 seconds.
The frequency of this event is unknown. It was first observed in the
nbody program, where it almost doubles the time per iteration, because
the other processes wait for the one with the slow CPU before
continuing to the next iteration. The SLURM version was 17.11.7 and
the program was executed with sbatch+srun. See the issue for more
details:

  https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19

It can be detected by looking at the cycles per us view with Extrae,
with the PAPI counters enabled. The view shows a slowdown in one
process when the problem occurs. Also, perf-sched(1) can be used to
trace context switches to other programs, but it requires access to
the debugfs.

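As a cheaper first check than a full trace, the per-iteration times
can be scanned for outliers. The following is only a sketch: it
assumes the application records the time of each iteration, and both
the function name and the factor of 1.5 are hypothetical choices, not
part of any existing tool.

```python
from statistics import median


def slow_iterations(times, factor=1.5):
    """Return the indices of iterations slower than factor * median."""
    if not times:
        return []
    threshold = factor * median(times)
    return [i for i, t in enumerate(times) if t > threshold]


# Hypothetical per-iteration times in seconds: one iteration takes
# almost twice as long as the others.
times = [1.00, 1.02, 0.99, 1.95, 1.01]
print(slow_iterations(times))
```

A flagged iteration only says that something disturbed the run; the
Extrae view is still needed to attribute it to slurmstepd.
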
1.2 MPICH uses ethernet rather than infiniband

Some MPI implementations (like MPICH) can silently use non-optimal
fabrics like ethernet rather than infiniband because they are
misconfigured.

It can be detected by running latency benchmarks like the OSU micro
benchmark, which should report a low latency. It can also be checked
by using strace(1) to see which network card is being used.

1.3 CPU binding

A thread may switch between CPUs when running, leading to a drop in
performance. To ensure that it remains on the same CPU it can be
bound with srun(1) or sbatch(1) using the --cpu-bind option, or using
taskset(1).

It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, all processes must have a different color from
each other (the assigned CPU) and keep it constant. Otherwise, changes
of CPU are happening.

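A process can also inspect or restrict its own CPU affinity, which is
a quick way to confirm that a binding is in effect. The sketch below
is Linux-only and uses the os.sched_getaffinity and
os.sched_setaffinity calls from the Python standard library; the
helper names are hypothetical.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted list of CPUs the process may run on."""
    return sorted(os.sched_getaffinity(pid))


def pin_to_cpu(cpu, pid=0):
    """Restrict the process to a single CPU, like taskset(1) does."""
    os.sched_setaffinity(pid, {cpu})


def is_pinned(pid=0):
    """A process bound to exactly one CPU cannot migrate."""
    return len(os.sched_getaffinity(pid)) == 1


cpus = allowed_cpus()
print("allowed before:", cpus)
pin_to_cpu(cpus[0])
print("pinned:", is_pinned())
```

If more than one CPU is reported inside a job that was expected to be
bound, the scheduler may still move the thread between them.
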
1.4 Libraries that use dlopen(3)

Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime.

This behavior can cause the execution time of a program to change
depending on environment variables like LD_LIBRARY_PATH.

It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.

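Another way to see what was actually loaded is to read the memory
map of the running process from /proc/<pid>/maps (see proc(5)). The
sketch below parses that format and lists the shared objects; the
function name and the sample text are made up for illustration.

```python
def shared_objects(maps_text):
    """Return the sorted shared-object paths found in a maps listing."""
    libs = set()
    for line in maps_text.splitlines():
        # address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, there is no path
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if name.endswith(".so") or ".so." in name:
            libs.add(path)
    return sorted(libs)


# Made-up sample in the format of /proc/<pid>/maps.
SAMPLE = """\
7f00-7f10 r-xp 00000000 08:01 11 /usr/lib/libomp.so
7f10-7f20 rw-p 00000000 00:00 0
7f20-7f30 r-xp 00000000 08:01 12 /usr/lib/libc.so.6
7f30-7f40 r--p 00000000 08:01 12 /usr/lib/libc.so.6
7ffd-7ffe rw-p 00000000 00:00 0 [stack]
"""
print(shared_objects(SAMPLE))
```

On a live system the input would be the contents of /proc/self/maps
or of the maps file of the process under study. Comparing the list
between two environments shows which libraries differ.
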
1.5 Intel MPI library selection

The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
environment variable controls whether the multithread capabilities
are enabled or not.

1.6 LLVM and OpenMP problem

The LLVM OpenMP implementation is installed in libomp.so; however, two
symbolic links are created for libgomp.so and libiomp5.so.

  libgomp.so -> libomp.so
  libiomp5.so -> libomp.so
  libomp.so

So applications compiled with OpenMP by other compilers may end up
using the LLVM implementation. This can be observed by setting
LD_DEBUG=all or using strace(1) and looking for the libomp.so library
being loaded.

In bscpkgs the symbolic links have been removed for the clangOmpss2
compiler.

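The symbolic links can also be checked directly on disk. This is a
minimal sketch that resolves a library path and reports whether it
ends up at libomp.so; the function name is hypothetical and the path
to check depends on the installation.

```python
import os


def resolves_to_libomp(path):
    """True if path, after following symlinks, is the LLVM libomp."""
    real = os.path.realpath(path)
    return os.path.basename(real).startswith("libomp.so")
```

For example, calling it on the libgomp.so found in the library
directory of the compiler tells whether a GCC-built OpenMP program
would silently pick up the LLVM runtime from there.
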
1.7 Nix-shell does not allow isolation

Nix-shell is not isolated, so the compilation process may try to use
headers and libs from /usr.

This can induce compilation errors that do not happen inside
nix-build. Do not use nix-shell when reproducibility is required.

1.8 Make doesn't rebuild objects

When using a local repo as the source code (e.g. with developer mode
on), a make clean at the preBuild stage is required.

Nix sets the same modification date (one second after the Epoch,
1970-01-01 at 00:00:01 UTC) on all the files in the nix store
(including those copied from repos). Make compares the modification
dates of the files to decide whether to run the compilation commands.
If any object or binary file exists outside of Nix at the time we
build within Nix, it is copied with the current date and consequently
not rebuilt during the Nix compilation process.

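The effect can be reproduced without Nix. The sketch below mimics
only the basic timestamp rule of make (rebuild when the target is
missing or older than a prerequisite); the function and file names
are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Rebuild if the target is missing or older than a prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.stat(target).st_mtime
    return any(os.stat(p).st_mtime > target_mtime for p in prerequisites)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "main.c")
    obj = os.path.join(d, "main.o")
    open(src, "w").close()
    open(obj, "w").close()       # stale object, current date
    os.utime(src, (1, 1))        # source dated one second after the Epoch
    print(needs_rebuild(obj, [src]))  # False: the stale object is kept
    os.remove(obj)               # what make clean achieves
    print(needs_rebuild(obj, [src]))  # True: the object is rebuilt
```

This is why the make clean at the preBuild stage is needed: removing
the objects is the only way to force the rebuild when every source
looks older than them.
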
1.9 Sbatch silently fails on parsing

When submitting a job with a wrong specification in MN4 with SLURM
17.11.9-2, for example this bogus line:

    #SBATCH --nodes=1 2

sbatch silently fails to parse the options and falls back to the
defaults, without reporting any error.

We have improved our checking to detect bogus options passed to SLURM,
so we prevent this problem from happening.

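A check of this kind can be approximated with a small script run
before submission. The following is a heuristic sketch and not the
check we actually use: it flags bare values in a #SBATCH line that do
not directly follow an option lacking an attached "=value". The
function name is hypothetical, and forms such as "-N1 2" are not
caught.

```python
import shlex


def stray_tokens(line):
    """Return tokens in an #SBATCH line that look like leftover values."""
    stripped = line.strip()
    if not stripped.startswith("#SBATCH"):
        return []
    tokens = shlex.split(stripped[len("#SBATCH"):])
    stray = []
    prev = None
    for tok in tokens:
        if not tok.startswith("-"):
            # A bare value is only plausible right after an option
            # that has no "=value" attached, as in "-N 1".
            if prev is None or not prev.startswith("-") or "=" in prev:
                stray.append(tok)
        prev = tok
    return stray


print(stray_tokens("#SBATCH --nodes=1 2"))
```

Any non-empty result is a reason to reject the script before it
reaches sbatch.
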
1.10 The srun program misses signals after MPI_Finalize

When a program receives a signal such as SIGSEGV after calling
MPI_Finalize, srun at version 17.11.7 doesn't return an error code but
exits with 0.

This can cause faulty programs to go undetected when only checking the
return code of srun. A better approach is to check the exit code with
sacct(1) or write the exit code to a file and check it later.

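Writing the exit code to a file can be done with a small wrapper that
is launched in place of the program. This is a sketch under the
assumption that each rank gets its own status file; the function
names and the file layout are hypothetical.

```python
import subprocess


def run_and_record(argv, status_path):
    """Run argv and write its exit status to status_path.

    On POSIX a negative status means the child was killed by the
    signal with that number (e.g. -11 for SIGSEGV).
    """
    status = subprocess.run(argv).returncode
    with open(status_path, "w") as f:
        f.write(f"{status}\n")
    return status


def failed(status_path):
    """True if the recorded status is anything other than 0."""
    with open(status_path) as f:
        return int(f.read().strip()) != 0
```

After the job ends, every status file is checked with failed()
instead of trusting the return code of srun.
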
/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */