Known sources of noise
                       in MareNostrum 4


ABSTRACT

  The experiments run at MareNostrum 4 show that there are several 
  factors that can affect the execution time. Some may even become the 
  dominant part of the time, rendering the experiment invalid.

  This document lists all known sources of variability and tries to give 
  an overview on how to detect and correct the problems.

1. Notable sources of variability

  Most of these sources were found in the MareNostrum 4 cluster, but 
  they may apply to other machines. Some have a detection mechanism, so 
  their effect can be ruled out, but others don't. Also, some problems 
  only occur with low probability.

  Other sources of variability with a low effect, say lower than 1% of 
  the mean time, are not listed here.

1.1 The daemon slurmstepd eats sys CPU in a new thread

  While a job is running, the slurmstepd process occasionally creates a 
  thread that uses quite a lot of CPU for about 10 seconds. The 
  frequency of this event is unknown. It was first observed in the 
  nbody program, where it almost doubles the time per iteration, 
  because the other processes wait for the one with the slow CPU before 
  continuing to the next iteration. The SLURM version was 17.11.7 and 
  the program was executed with sbatch+srun. See the issue for more 
  details:

    https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19

  It can be detected by looking at the cycles per us view with Extrae, 
  with the PAPI counters enabled. It shows a slowdown in one process 
  when the problem occurs. Also, perf-sched(1) can be used to trace 
  context switches to other programs but requires access to the debugfs.
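
  When neither Extrae nor perf-sched(1) is at hand, a cheap first check 
  is to look for iterations that take much longer than the rest. The 
  sketch below is not part of the original experiments: it assumes the 
  program logs the time of each iteration, and the function name and 
  the 1.5 threshold are arbitrary choices for illustration.

```python
from statistics import median


def slow_iterations(times, factor=1.5):
    """Return the indices of the iterations slower than factor * median.

    The factor is an assumption; the slowdown described above almost
    doubles the iteration time, so any value between 1 and 2 may work.
    """
    limit = factor * median(times)
    return [i for i, t in enumerate(times) if t > limit]


# Hypothetical iteration times in seconds: iterations 3 and 4 take
# almost twice as long as the others.
times = [1.00, 1.02, 0.99, 1.95, 1.98, 1.01]
print(slow_iterations(times))  # prints [3, 4]
```

  A flagged iteration only tells that something slowed the program 
  down; the trace is still needed to attribute it to slurmstepd.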

1.2 MPICH uses ethernet rather than infiniband

  Some MPI implementations (like MPICH) can silently use non-optimal 
  fabrics like Ethernet rather than InfiniBand because they are 
  misconfigured.

  It can be detected by running latency benchmarks like the OSU micro 
  benchmarks, which should report a low latency. It can also be checked 
  by using strace(1) to verify which network card is being used.

1.3 CPU binding

  A thread may switch between CPUs when running, leading to a drop in 
  performance. To ensure that it remains on the same CPU, it can be 
  bound with srun(1) or sbatch(1) using the --cpu-bind option, or using 
  taskset(1).

  It can be detected by running the program with Extrae and using the 
  General/view/executing_cpu.cfg configuration in Paraver. After 
  adjusting the scale, all processes must have a different color from 
  each other (the assigned CPU) and keep it constant. Otherwise changes 
  of CPUs are happening.
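
  The binding can also be checked from the process itself. The 
  following is a minimal sketch for Linux, not the method used in the 
  experiments: it pins the calling process to one CPU with 
  sched_setaffinity(2) and then samples the CPU where it last ran from 
  /proc/self/stat (field 39, see proc(5)). The function names pin and 
  current_cpu are made up for this example.

```python
import os


def current_cpu(pid="self"):
    """Return the CPU on which the process last ran (stat field 39)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name (field 2) may contain spaces, so split after
    # its closing parenthesis. The rest starts at field 3, which
    # places field 39 at index 36.
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[36])


def pin(cpu=None):
    """Bind the calling process to one CPU and return that CPU."""
    if cpu is None:
        # Assumption: any allowed CPU is fine, take the lowest one.
        cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    return cpu


cpu = pin()
# Sample the executing CPU several times; with a working binding the
# set is expected to contain only the CPU chosen above.
observed = {current_cpu() for _ in range(1000)}
print("pinned to CPU", cpu, "observed CPUs:", sorted(observed))
```

  Running the sampling loop without calling pin first gives an idea of 
  how often the scheduler moves an unbound process.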

1.4 Libraries that use dlopen(3)

  Some libraries or programs try to determine which components are 
  available in a system by looking for specific libraries in the search 
  path determined at runtime.

  This behavior can cause the execution time of a program to change 
  depending on environment variables like LD_LIBRARY_PATH.

  It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using 
  strace(1) when running the program.
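
  As a complement to LD_DEBUG and strace(1), the libraries that a 
  running process ended up loading, including those opened with 
  dlopen(3), are listed in /proc/<pid>/maps on Linux. The sketch below 
  extracts them so that two runs with different environments can be 
  compared. The function names are hypothetical and the ".so" test on 
  the file name is a simplification.

```python
def parse_maps(text):
    """Return the sorted paths of the shared objects found in the
    contents of a /proc/<pid>/maps file."""
    libs = set()
    for line in text.splitlines():
        # Columns: address, perms, offset, dev, inode and, only for
        # file-backed mappings, the path name.
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue
        path = parts[5].strip()
        if ".so" in path.rsplit("/", 1)[-1]:
            libs.add(path)
    return sorted(libs)


def loaded_libraries(pid="self"):
    """Return the shared objects currently mapped by a process."""
    with open(f"/proc/{pid}/maps") as f:
        return parse_maps(f.read())


for lib in loaded_libraries():
    print(lib)
```

  Saving the output of two runs and comparing them with diff(1) shows 
  whether the environment changed the set of loaded libraries.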

1.5 Intel MPI library selection

  The Intel MPI library has several variants which are loaded at run 
  time: debug, release, debug_mt and release_mt. The 
  I_MPI_THREAD_SPLIT variable controls whether the multithread 
  capabilities are enabled or not.

1.6 LLVM and OpenMP problem

  The LLVM OpenMP implementation is installed in libomp.so, however two 
  symbolic links are created for libgomp.so and libiomp5.so.

    libgomp.so -> libomp.so
    libiomp5.so -> libomp.so
    libomp.so
  
  So applications compiled with OpenMP by other compilers may end up 
  using the LLVM implementation. This can be observed by setting 
  LD_DEBUG=all or using strace(1) and looking for the libomp.so library 
  being loaded.

  In bscpkgs the symbolic links have been removed for the clangOmpss2 
  compiler.
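
  The presence of the symbolic links can also be checked directly in a 
  library directory. The sketch below is an illustration only: the 
  function name is made up, and the demonstration uses a temporary 
  directory that mimics the layout shown above instead of a real LLVM 
  installation.

```python
import os
import tempfile

# Names that belong to other OpenMP runtimes (GNU and Intel).
ALIASES = ("libgomp.so", "libiomp5.so")


def llvm_aliases(libdir):
    """Return the names in ALIASES that are symbolic links ending up
    at a file whose name starts with libomp.so."""
    found = []
    for name in ALIASES:
        path = os.path.join(libdir, name)
        if os.path.islink(path):
            target = os.path.basename(os.path.realpath(path))
            if target.startswith("libomp.so"):
                found.append(name)
    return found


# Demonstration with a fake library directory.
with tempfile.TemporaryDirectory() as libdir:
    open(os.path.join(libdir, "libomp.so"), "w").close()
    for name in ALIASES:
        os.symlink("libomp.so", os.path.join(libdir, name))
    print(llvm_aliases(libdir))  # prints ['libgomp.so', 'libiomp5.so']
```

  An empty list for the library directory of the compiler means that 
  the links are not present there.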

1.7 Nix-shell does not allow isolation

  Nix-shell does not isolate the build environment, so the compilation
  process may try to use headers and libs from /usr.

  This can induce compilation errors that do not happen inside
  nix-build. Do not use nix-shell when reproducibility is required.

1.8 Make doesn't rebuild objects

  When using a local repo as the source code (e.g. with developer mode
  on), a make clean at the preBuild stage is required.

  Nix sets the same modification date (one second after the Epoch, 
  1970-01-01 at 00:00:01 UTC) on all the files in the Nix store, 
  including those copied from repos. Make checks the modification date 
  of the files to decide whether to run the compilation commands. If 
  any object or binary file already exists outside Nix at the time we 
  build within Nix, it will be copied with the current date and 
  consequently not rebuilt during the Nix compilation process.
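
  The effect of the timestamps can be reproduced outside Nix. The 
  sketch below implements a simplified version of the rule that Make 
  applies (rebuild when the target is missing or older than one of its 
  prerequisites) and shows that a stale object file is kept both when 
  it has the same date as the source and when it has a newer one. The 
  file names and the needs_rebuild function are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Simplified Make rule: rebuild when the target does not exist or
    is older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.stat(target).st_mtime
    return any(os.stat(p).st_mtime > target_mtime for p in prerequisites)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "main.c")
    obj = os.path.join(d, "main.o")
    for path in (src, obj):
        open(path, "w").close()

    # Source and stale object with the date used in the Nix store.
    os.utime(src, (1, 1))
    os.utime(obj, (1, 1))
    print(needs_rebuild(obj, [src]))  # prints False

    # Stale object with the current date instead.
    os.utime(obj, None)
    print(needs_rebuild(obj, [src]))  # prints False

    # After removing the object, as make clean would do.
    os.remove(obj)
    print(needs_rebuild(obj, [src]))  # prints True
```

  In both cases only removing the object forces the compilation, which 
  is why the make clean is needed.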

/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */