Known sources of noise
in MareNostrum 4


ABSTRACT

The experiments run at MareNostrum 4 show that there are several
factors that can affect the execution time. Some may even become the
dominant part of the time, rendering the experiment invalid.

This document lists all known sources of variability and tries to give
an overview on how to detect and correct the problems.

1. Notable sources of variability

Most of the sources were found in the MareNostrum 4 cluster, but they
may apply to other machines. Some have a detection mechanism, so their
effect can be ruled out, but others don't. Also, some problems only
occur with low probability.

Other sources of variability with a low effect, say lower than 1% of
the mean time, are not listed here.

1.1 The daemon slurmstepd eats sys CPU in a new thread

While a job is running, the slurmstepd process sometimes creates a
thread that uses quite a lot of sys CPU for about 10 seconds. This
event happens from time to time with unknown frequency. It was first
observed in the nbody program, where it almost doubles the time per
iteration, because the other processes wait for the one with the slow
CPU before continuing to the next iteration. The SLURM version was
17.11.7 and the program was executed with sbatch+srun. See the issue
for more details:

https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19

It can be detected by looking at the cycles per us view with Extrae,
with the PAPI counters enabled. It shows a slowdown in one process
when the problem occurs. Also, perf-sched(1) can be used to trace
context switches to other programs, but it requires access to debugfs.

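When the program prints its own time per iteration, a slowdown of this
kind may also be visible without a trace. The following Python sketch
is only an illustration and was not part of the experiments: the
function name find_slow_iterations and the 1.5 factor are assumptions,
and the sample times are made up. It flags the iterations that take
much longer than the median.

```python
from statistics import median


def find_slow_iterations(times, factor=1.5):
    """Return indices of iterations slower than factor * median.

    times:  per-iteration wall-clock times in seconds.
    factor: hypothetical threshold, chosen for illustration only.
    """
    if not times:
        return []
    limit = factor * median(times)
    return [i for i, t in enumerate(times) if t > limit]


# Made-up example: iterations 3 and 4 take almost twice as long.
times = [1.00, 1.02, 0.99, 1.95, 1.97, 1.01]
print(find_slow_iterations(times))
```

The median is used rather than the mean so that a few slow iterations
do not move the reference value. A flagged iteration only says that
something was slow; the trace is still needed to confirm the cause.
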
1.2 MPICH uses ethernet rather than infiniband

Some MPI implementations (like MPICH) can silently use non-optimal
fabrics like ethernet rather than infiniband because they are
misconfigured.

It can be detected by running latency benchmarks like the OSU micro
benchmarks, which should report a low latency. It can also be checked
by using strace(1) to see which network card is being used.

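The latency check can be automated by parsing the benchmark output.
The Python sketch below assumes the usual layout of the OSU latency
test (comment lines starting with "#", then one "size latency" pair
per line). The function names, the 5 us threshold and the sample
numbers are assumptions for illustration, not measurements; the
threshold should be calibrated against a run known to use infiniband.

```python
def parse_osu_latency(output):
    """Parse 'size latency' rows, skipping '#' comment lines."""
    rows = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            rows[int(fields[0])] = float(fields[1])
        except ValueError:
            continue
    return rows


def looks_slow(rows, max_small_latency_us=5.0):
    """True if the smallest message exceeds the assumed threshold."""
    smallest = min(rows)
    return rows[smallest] > max_small_latency_us


# Made-up sample in the assumed output layout.
sample = """\
# OSU MPI Latency Test
# Size          Latency (us)
0                       1.10
1                       1.12
8                       1.15
"""
rows = parse_osu_latency(sample)
print(rows[0], looks_slow(rows))
```

Only the smallest message size is checked, as it is the one most
sensitive to the fabric in use.
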
1.3 CPU binding

A thread may switch between CPUs when running, leading to a drop in
performance. To ensure that it remains on the same CPU, it can be
bound with srun(1) or sbatch(1) using the --cpu-bind option, or using
taskset(1).

It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, all processes must have a different color from
each other (the assigned CPU) and keep it constant. Otherwise, changes
of CPU are happening.

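A rough check without Extrae is to sample the CPU on which a process
last ran, which Linux exposes as field 39 (processor) of
/proc/[pid]/stat, see proc(5). The Python sketch below is an
illustration only; the function names are hypothetical, and
/proc/self/stat reports the main thread, so the other threads would
have to be read from /proc/[pid]/task/[tid]/stat.

```python
def last_cpu_from_stat(stat_line):
    """Return field 39 (processor) of a /proc/[pid]/stat line."""
    # The command name (field 2) is in parentheses and may contain
    # spaces, so split only the text after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def count_migrations(cpus):
    """Count how many times consecutive samples differ."""
    return sum(1 for a, b in zip(cpus, cpus[1:]) if a != b)


def sample_own_cpu(n=100):
    """Sample the CPU of the calling process n times (Linux only)."""
    cpus = []
    for _ in range(n):
        with open("/proc/self/stat") as f:
            cpus.append(last_cpu_from_stat(f.read()))
    return cpus
```

On Linux, count_migrations(sample_own_cpu()) is expected to be 0 for a
process bound to a single CPU. Sampling can miss a migration that
happens between two reads, so a zero count is not a proof of correct
binding.
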
1.4 Libraries that use dlopen(3)

Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime.

This behavior can cause the execution time of a program to change
depending on environment variables like LD_LIBRARY_PATH.

It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.

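The LD_DEBUG output is long, so it helps to summarize it. The Python
sketch below counts how many files the loader tried for each library.
It assumes the glibc message format ("find library=NAME" followed by
"trying file=PATH" lines), which may differ between versions; the
function name and the sample paths are made up for illustration.

```python
import re
from collections import Counter

FIND_RE = re.compile(r"find library=(\S+)")
TRY_RE = re.compile(r"trying file=(\S+)")


def count_lookup_attempts(ld_debug_text):
    """Count 'trying file=' lines following each 'find library='."""
    attempts = Counter()
    current = None
    for line in ld_debug_text.splitlines():
        m = FIND_RE.search(line)
        if m:
            current = m.group(1)
            attempts[current] += 0  # register the library name
            continue
        if current is not None and TRY_RE.search(line):
            attempts[current] += 1
    return attempts


# Made-up sample in the assumed format.
sample = """\
     1234:	find library=libfoo.so [0]; searching
     1234:	 search path=/a:/b		(LD_LIBRARY_PATH)
     1234:	  trying file=/a/libfoo.so
     1234:	  trying file=/b/libfoo.so
     1234:	find library=libc.so.6 [0]; searching
     1234:	  trying file=/lib/libc.so.6
"""
print(count_lookup_attempts(sample))
```

Comparing the counts of two runs with different LD_LIBRARY_PATH values
may show which libraries are being searched differently.
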
1.5 Intel MPI library selection

The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The
I_MPI_THREAD_SPLIT variable controls whether the multithread
capabilities are enabled or not.

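One way to see which variant a running process has loaded is to look
for libmpi in its /proc/[pid]/maps file. The Python sketch below
assumes that the variant name appears as a directory component of the
library path, which depends on how Intel MPI is installed and may not
hold everywhere. The function name and the sample path are
hypothetical.

```python
VARIANTS = ("debug_mt", "release_mt", "debug", "release")


def mpi_variants_in_maps(maps_text):
    """Return the variant names found in libmpi paths of a maps dump."""
    found = set()
    for line in maps_text.splitlines():
        fields = line.split()
        # A mapped file appears as the sixth field of a maps line.
        if len(fields) < 6 or "libmpi" not in fields[5]:
            continue
        for part in fields[5].split("/"):
            if part in VARIANTS:
                found.add(part)
    return found


# Made-up maps line with a hypothetical installation path.
sample = (
    "7f0000000000-7f0000100000 r-xp 00000000 08:01 42 "
    "/opt/impi/lib/release_mt/libmpi.so.12\n"
)
print(mpi_variants_in_maps(sample))
```

More than one name in the result, or a different name between two
runs, would suggest that the experiments are not using the same
library.
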
/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */