.\" Header point size
.ds HP "15 12 12 0 0 0 0 0 0 0 0 0 0 0"
.S 11p 1.3m
.PGFORM 14c 28c 3.5c
.\" .COVER
.\" .TL
.\" Garlic: User guide
.\" .AF "Barcelona Supercomputing Center"
.\" .AU "Rodrigo Arias Mallo"
.\" .COVEND
.H 1 "Overview"
.P
The garlic framework is designed to fulfill all the requirements of an
experimenter in all the steps up to publication. The experience gained
while using it suggests that we move along the three stages depicted in
the following diagram:
.DS CB
.S 9p 10p
.PS 5
linewid=1;
right
box "Source" "code"
arrow "Development" above
box "Program"
arrow "Experiment" above
box "Results"
arrow "Data" "exploration"
box "Figures"
.PE
.S P P
.DE
In the development phase the experimenter changes the source code in
order to introduce new features or fix bugs. Once the program is
considered functional, the next phase is experimentation, where
several experiment configurations are tested to evaluate the program. It
is common to spot some problems during this phase, which lead
the experimenter to go back to the development phase and change the
source code.
.P
Finally, when the experiment is considered complete, the
experimenter moves to the next phase, which involves the exploration of
the data generated by the experiment. During this phase, it is common to
generate results in the form of plots or tables which provide a clear
insight into the quantities of interest. It is also common that, after
looking at the figures, some changes need to be introduced in the
experiment configuration (or even in the source code of the program).
.P
Therefore, the experimenter may move back and forth along the three
phases several times. The garlic framework provides support for all
three stages (with different degrees of maturity).
.H 1 "Development (work in progress)"
.P
During the development phase, a functional program is produced by
modifying its source code. This process is generally cyclic: the
developer needs to compile, debug and correct mistakes. We want to
minimize delays, so that programs can be executed as soon as
needed, but within a controlled environment, so that the same behavior
occurs during the experimentation phase.
.P
In particular, we want several experimenters to be able to reproduce the
same development environment, so they can debug each other's programs
when reporting bugs. Therefore, the environment must be carefully
controlled to avoid non-reproducible scenarios.
.\" ===================================================================
.H 2 "Getting the development tools"
.P
To create a development
environment, first copy or download the sources of your program (not the
dependencies) into a new directory on the target machine
(MareNostrum\~4).
.P
The default environment contains packages commonly used to develop
programs, listed in the \fIgarlic/index.nix\fP file:
.\" FIXME: Unify garlic.unsafeDevelop in garlic.develop, so we can
.\" specify the packages directly
.DS I
.VERBON
develop = let
  packages = with self; [
    coreutils htop procps-ng vim which strace
    tmux gdb kakoune universal-ctags bashInteractive
    glibcLocales ncurses git screen curl
    # Add more nixpkgs packages here...
  ] ++ (with bsc; [
    slurm clangOmpss2 icc mcxx perf
    # Add more bsc packages here...
  ]);
  ...
.VERBOFF
.DE
If you need additional packages, add them to the list, so that they
become available in the environment. Those may include any dependency
required to build your program.
.P
Then use the build machine (xeon07) to build the
.I garlic.develop
derivation:
.DS I
.VERBON
build% nix-build -A garlic.develop
\&...
build% grep ln result
ln -fs /gpfs/projects/.../bin/stage1 .nix-develop
.VERBOFF
.DE
Copy the \fIln\fP command and run it on the target machine
(MareNostrum\~4), in the new directory used for your program
development. The link will be created as a hidden file named
\fI.nix-develop\fP and will be used to remember your environment.
Several environments can be stored using this method, with different
packages in different directories. Whenever the package list changes,
you will need to rebuild the
.I garlic.develop
derivation and update the
.I .nix-develop
link to update the environment. Otherwise, once the
environment link is created, there is no need to repeat these steps.
.P
Before entering the environment, you will need to access the required
resources for your program, which may include several compute nodes.
.\" ===================================================================
.H 2 "Allocating resources for development"
.P
Our target machine (MareNostrum\~4) provides an interactive shell that
can be requested with the computational resources required for
development. To do so, connect to the login node and allocate an
interactive session:
.DS I
.VERBON
% ssh mn1
login% salloc ...
target%
.VERBOFF
.DE
This operation may take some minutes to complete, depending on the load
of the cluster. Once the session is ready, however, any subsequent
execution of programs will be immediate.
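.P
As a sketch of what the elided
.I salloc
arguments might look like, the following request asks for two nodes
for one hour. The values are only illustrative assumptions and are
not taken from the garlic configuration; your site may also require
other options, such as a queue or an account:
.DS I
.VERBON
login% salloc -N 2 -t 01:00:00
target%
.VERBOFF
.DE
Here
.I -N
sets the number of nodes and
.I -t
the time limit of the session.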
.\" ===================================================================
.H 2 "Accessing the development environment"
.P
The utility program \fInix-develop\fP has been designed to access the
development environment of the current directory, by looking for the
\fI.nix-develop\fP file. It creates a namespace where the required
packages are installed and ready to be used. Now you can access the
newly created environment by running:
.DS I
.VERBON
target% nix-develop
develop%
.VERBOFF
.DE
The spawned shell contains all the packages predefined in the
\fIgarlic.develop\fP derivation, whose commands can now be run by
typing their names:
.DS I
.VERBON
develop% which gcc
/nix/store/azayfhqyg9...s8aqfmy-gcc-wrapper-9.3.0/bin/gcc
develop% which gdb
/nix/store/1c833b2y8j...pnjn2nv9d46zv44dk-gdb-9.2/bin/gdb
.VERBOFF
.DE
If you need additional packages, you can add them to the
\fIgarlic/index.nix\fP file as mentioned previously. To keep the
currently allocated resources, so you don't need to wait again for
them to be allocated, exit only from the development shell:
.DS I
.VERBON
develop% exit
target%
.VERBOFF
.DE
Then update the
.I .nix-develop
link and enter the new development environment:
.DS I
.VERBON
target% nix-develop
develop%
.VERBOFF
.DE
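.P
Putting the previous steps together, a complete update after changing
the package list could look like the following sketch. It only combines
commands already shown in this section; the path printed by
.I grep
is elided here exactly as above, so copy the real one from your own
build output and run the
.I ln
command in the directory of your program:
.DS I
.VERBON
build% nix-build -A garlic.develop
build% grep ln result
ln -fs /gpfs/projects/.../bin/stage1 .nix-develop
target% ln -fs /gpfs/projects/.../bin/stage1 .nix-develop
target% nix-develop
develop%
.VERBOFF
.DE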
.\" ===================================================================
.H 2 "Execution"
The allocated shell can only execute tasks on the current node, which
may be enough for some tests. To do so, you can directly run your
program as:
.DS I
.VERBON
develop$ ./program
.VERBOFF
.DE
If you need to run a multi-node program, typically using MPI
communications, you can do so by using srun. Notice that you need
to have allocated several nodes in the previous salloc call. The srun
command will execute the given program \fBoutside\fP the development
environment if executed as-is. So we re-enter the development environment by
calling nix-develop as a wrapper of the program:
.\" FIXME: wrap srun to reenter the develop environment by its own
.DS I
.VERBON
develop$ srun nix-develop ./program
.VERBOFF
.DE
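.P
For instance, assuming that two nodes were allocated with
.I salloc ,
a sketch of a run with four tasks spread over both nodes could be the
following (the node and task counts are illustrative assumptions):
.DS I
.VERBON
develop$ srun -N 2 -n 4 nix-develop ./program
.VERBOFF
.DE
The
.I -N
and
.I -n
options set the number of nodes and the total number of tasks,
respectively.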
.H 2 "Debugging"
The debugger can be used to directly execute the program, if it runs
on a single node, by using:
.DS I
.VERBON
develop$ gdb ./program
.VERBOFF
.DE
Or it can be attached to an already running program by using its PID.
You will first need to connect to the node running it (say target2), and
run gdb inside the nix-develop environment. Use
.I squeue
to see the compute nodes running your program:
.DS I
.VERBON
login$ ssh target2
target2$ cd project-develop
target2$ nix-develop
develop$ gdb -p $pid
.VERBOFF
.DE
You can repeat this step to control the execution of programs running on
different nodes simultaneously.
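.P
The transcript above assumes that the node and the PID are already
known. A minimal sketch of how they might be found, assuming that the
executable is named
.I program
and that it runs under your own user:
.DS I
.VERBON
login$ squeue -u $USER
develop$ pgrep -u $USER program
.VERBOFF
.DE
The first command lists your jobs along with the nodes assigned to
them, and the second one, run on the compute node, prints the PIDs of
the matching processes, which can then be passed to
.I "gdb -p" .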
.P
In those cases where the program crashes before you are able to attach the
debugger, enable the generation of core dumps:
.DS I
.VERBON
develop$ ulimit -c unlimited
.VERBOFF
.DE
Then rerun the program, which will generate a core file that can be
opened by gdb and contains the state of the memory when the crash
happened. Beware that the core dump file can be very large, depending on
the memory used by your program at the time of the crash.
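.P
As a sketch, and assuming that the core file is written to the current
directory with the name
.I core
(the actual name and location depend on the system configuration), it
can be loaded along with the executable, and the stack at the time of
the crash printed with the
.I bt
command:
.DS I
.VERBON
develop$ gdb ./program core
(gdb) bt
.VERBOFF
.DE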
.\" ===================================================================
.H 1 "Experimentation"
The experimentation phase begins with a functional program which is the
object of study. The experimenter then designs an experiment aimed at
measuring some properties of the program. The experiment is then
executed and the results are stored for further analysis.
.H 2 "Writing the experiment configuration"
.P
The term experiment is quite overloaded in this document. We are going
to see how to write the recipe that describes the execution pipeline of
an experiment.
.P
Within the garlic benchmark, experiments are typically sorted in a
hierarchy depending on the application to which they belong. Take a look at the
\fCgarlic/exp\fP directory and you will find some folders and .nix
files.
.P
Each of those recipe files describes a function that returns a
derivation which, once built, will result in the first stage script of
the execution pipeline.
.P
The first part of the file states the names of the attributes required as the
input of the function, typically some packages, common tools and options:
.DS I
.VERBON
{
  stdenv
, stdexp
, bsc
, targetMachine
, stages
, garlicTools
}:
.VERBOFF
.DE
.P
Notice the \fCtargetMachine\fP argument, which provides information
about the machine on which the experiment will run. You should write
your experiment in such a way that it can run on multiple clusters.
.DS I
.VERBON
varConf = {
  blocks = [ 1 2 4 ];
  nodes = [ 1 ];
};
.VERBOFF
.DE
.P
The \fCvarConf\fP attribute set allows you to vary some
factors in the experiment: each attribute lists the values that one
factor can take.
.DS I
.VERBON
genConf = var: fix (self: targetMachine.config // {
  expName = "example";
  unitName = self.expName + "-b" + toString self.blocks;
  blocks = var.blocks;
  nodes = var.nodes;
  cpusPerTask = 1;
  tasksPerNode = self.hw.socketsPerNode;
});
.VERBOFF
.DE
.P
The \fCgenConf\fP function is the central part of the description of the
experiment. It takes as input \fBone\fP configuration from the cartesian
product of
.I varConf
and returns the complete configuration. In our case, with three values
for the blocks and one for the nodes, it will be
called 3 times, with the following inputs each time:
.DS I
.VERBON
{ blocks = 1; nodes = 1; }
{ blocks = 2; nodes = 1; }
{ blocks = 4; nodes = 1; }
.VERBOFF
.DE
.P
The return value can be inspected by calling the function in the
interactive nix repl:
.DS I
.VERBON
nix-repl> genConf { blocks = 2; nodes = 1; }
{
  blocks = 2;
  cpusPerTask = 1;
  expName = "example";
  hw = { ... };
  march = "skylake-avx512";
  mtune = "skylake-avx512";
  name = "mn4";
  nixPrefix = "/gpfs/projects/bsc15/nix";
  nodes = 1;
  sshHost = "mn1";
  tasksPerNode = 2;
  unitName = "example-b2";
}
.VERBOFF
.DE
.P
Some configuration parameters were added by
.I targetMachine.config ,
such as the
.I nixPrefix ,
.I sshHost
or the
.I hw
attribute set, which are specific to the cluster on which the experiment is
going to run. Also, the
.I unitName
got assigned the proper name based on the number of blocks, and the
number of tasks per node was assigned based on the hardware description
of the target machine.
.P
By deriving such values from the hardware description instead of
hard-coding them, the experiments can easily be ported to machines
with other hardware characteristics, and we only need to define the
hardware details once. Then all the experiments will be updated based on
those details.
.H 2 "First steps"
.P
Obtaining the complete results generally takes a long time, so it is
advisable to design the experiments iteratively, in order to quickly
obtain some feedback. Some recommendations:
.BL
.LI
Start with one unit only.
.LI
Set the number of runs low (say 5) but more than one.
.LI
Use a small problem size, so the execution time is low.
.LI
Set the time limit low, so deadlocks are caught early.
.LE
.P
As soon as the first runs are complete, examine the results and check
that everything looks good. You will likely want to check that:
.BL
.LI
The resources were assigned as intended (nodes and CPU affinity).
.LI
There are no errors or warnings: look at the stderr and stdout logs.
.LI
There are no deadlocks: a deadlocked run only ends when it exceeds the time limit.
.LE
.P
As you gain confidence that the execution went as planned, begin
increasing the problem size, the number of runs, the time limit and
lastly the number of units. The rationale is that each unit that is
shared among experiments gets assigned the same hash. Therefore, you can
iteratively add more units to an experiment, and those that were already
executed (and whose results were generated) are reused.
.SK
.H 1 "Annex A: Branch name diagram"
.DS CB
.S -2
.PS 4.6/25.4
copy "gitbranch.pic"
.PE
.S P
.DE
.TC