.\" Point size fails when rending html in the code blocks
|
|
.\".nr PS 11p
|
|
.nr GROWPS 3
|
|
.nr PSINCR 2p
|
|
.fam P
|
|
.\" ===================================================================
|
|
.\" Some useful macros
|
|
.\" ===================================================================
|
|
.\"
|
|
.\" Code start (CS) and end (CE) blocks
|
|
.de CS
|
|
.DS L
|
|
\fC
|
|
..
|
|
.de CE
|
|
\fP
|
|
.DE
|
|
..
|
|
.\" Code inline:
|
|
.\" .CI "inline code"
|
|
.de CI
|
|
\fC\\$1\fP\\$2
|
|
..
|
|
.\" ===================================================================
|
|
.\" \&
|
|
.\" .sp 3c
|
|
.\" .LG
|
|
.\" .LG
|
|
.\" .LG
|
|
.\" .LG
|
|
.\" Garlic: User guide
|
|
.\" .br
|
|
.\" .NL
|
|
.\" Rodrigo Arias Mallo
|
|
.\" .br
|
|
.\" .I "Barcelona Supercomputing Center"
|
|
.\" .br
|
|
.\" \*[curdate]
|
|
.\" .sp 17c
|
|
.\" .DE
|
|
.\" .CI \*[gitcommit]
|
|
.TL
Garlic: User Guide
.AU
Rodrigo Arias Mallo
.AI
Barcelona Supercomputing Center
.AB
.LP
This document contains all the information to configure and use the
garlic benchmark. All stages from the development to the publication
are covered, as well as the introductory steps required to set up the
machines.
.DS L
.SM
\fC
Generated on \*[curdate]
Git commit: \*[gitcommit]
\fP
.DE
.AE
.\" ===================================================================
|
|
.NH 1
|
|
Introduction
|
|
.LP
|
|
The garlic framework is designed to fulfill all the requirements of an
|
|
experimenter in all the steps up to publication. The experience gained
|
|
while using it suggests that we move along three stages despicted in the
|
|
following diagram:
|
|
.DS L
|
|
.SM
|
|
.PS 5
|
|
linewid=1.4;
|
|
arcrad=1;
|
|
right
|
|
S: box "Source" "code"
|
|
line "Development" invis
|
|
P: box "Program"
|
|
line "Experimentation" invis
|
|
R:box "Results"
|
|
line "Data" "exploration" invis
|
|
F:box "Figures"
|
|
# Creates a "cycle" around two boxes
|
|
define cycle {
|
|
arc cw from 1/2 of the way between $1.n and $1.ne \
|
|
to 1/2 of the way between $2.nw and $2.n ->;
|
|
arc cw from 1/2 of the way between $2.s and $2.sw \
|
|
to 1/2 of the way between $1.se and $1.s ->;
|
|
}
|
|
cycle(S, P)
|
|
cycle(P, R)
|
|
cycle(R, F)
|
|
.PE
|
|
.DE
|
|
In the development phase the experimenter changes the source code in
|
|
order to introduce new features or fix bugs. Once the program is
|
|
considered functional, the next phase is the experimentation, where
|
|
several experiment configurations are tested to evaluate the program. It
|
|
is common that some problems are spotted during this phase, which lead
|
|
the experimenter to go back to the development phase and change the
|
|
source code.
|
|
.PP
|
|
Finally, when the experiment is considered completed, the
|
|
experimenter moves to the next phase, which envolves the exploration of
|
|
the data generated by the experiment. During this phase, it is common to
|
|
generate results in the form of plots or tables which provide a clear
|
|
insight in those quantities of interest. It is also common that after
|
|
looking at the figures, some changes in the experiment configuration
|
|
need to be introduced (or even in the source code of the program).
|
|
.PP
|
|
Therefore, the experimenter may move forward and backwards along three
|
|
phases several times. The garlic framework provides support for all the
|
|
three stages (with different degrees of madurity).
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Machines and clusters
|
|
.LP
|
|
Our current setup employs multiple machines to build and execute the
|
|
experiments. Each cluster and node has it's own name and will be
|
|
different in other clusters. Therefore, instead of using the names of
|
|
the machines we use machine classes to generalize our setup. Those
|
|
machine clases currently correspond to a physical machine each:
|
|
.IP \(bu 12p
|
|
.B Builder
|
|
(xeon07): runs the nix-daemon and performs the builds in /nix. Requires
|
|
root access to setup the
|
|
.I nix-daemon
|
|
with multiple users.
|
|
.IP \(bu
|
|
.B Target
|
|
(MareNostrum 4 compute nodes): the nodes where the experiments
|
|
are executed. It doesn't need to have /nix installed or root access.
|
|
.IP \(bu
|
|
.B Login
|
|
(MareNostrum 4 login nodes): used to allocate resources and run jobs. It
|
|
doesn't need to have /nix installed or root access.
|
|
.IP \(bu
|
|
.B Laptop
|
|
(where the keyboard is attached, can be anything): used to connect to the other machines.
|
|
No root access is required or /nix, but needs to be able to connect to
|
|
the builder.
|
|
.LP
|
|
The machines don't need to be different of each others, as one machine
|
|
can implement several classes. For example the laptop can act as the
|
|
builder too but is not recommended. Or the login machine can also
|
|
perform the builds, but is not possible yet in our setup.
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Reproducibility
|
|
.LP
|
|
An effort to facilitate the reproducibility of the experiments has been
|
|
done, with varying degrees of success. The names of the different levels
|
|
of reproducibility have not been yet standarized, so we define our own
|
|
to avoid any confusion. We define three levels of reproducibility based
|
|
on the people and the machine involved:
|
|
.IP \(bu 12p
|
|
R0: The \fIsame\fP people on the \fIsame\fP machine obtain the same result
|
|
.IP \(bu
|
|
R1: \fIDifferent\fP people on the \fIsame\fP machine obtain the same result
|
|
.IP \(bu
|
|
R2: \fIDifferent\fP people on a \fIdifferent\fP machine obtain the same result
|
|
.LP
|
|
The garlic framework distinguishes two types of results: the result of
|
|
\fIbuilding a derivation\fP (usually building a binary or a library from the
|
|
sources) and the results of the \fIexecution of an experiment\fP (typically
|
|
those are the measurements performed during the execution of the program
|
|
of study).
|
|
.PP
|
|
For those two types, the meaning of
|
|
.I "same result"
|
|
is different. In the case of building a binary, we define the same
|
|
result if it is bit-by-bit identical. In the packages provided by nixos
|
|
is usually the case except some rare cases. One example is that during the build process,
|
|
a directory is listed by the order of the inodes, giving a random order
|
|
which is different between builds. These problems are tracked by the
|
|
.URL https://r13y.com/ r13y
|
|
project. About 99% of the derivations of the minimal package set achieve
|
|
the R2 property.
|
|
.PP
|
|
On the other hand, the results of the experiments are always bit-by-bit
|
|
different. So we change the definition to state that they are the same
|
|
if the conclusions that can be obtained are the same. In particular, we
|
|
assume that the results are within the confidence interval. With this
|
|
definition, all experiments are currently R1. The reproducibility level
|
|
R2 is not posible yet as the software is compiled to support only the
|
|
target machine, with an specific interconnection.
|
|
.\" ===================================================================
|
|
.bp
|
|
.NH 1
|
|
Preliminary steps
|
|
.LP
|
|
The peculiarities of our setup require that users perform some actions
|
|
to use the garlic framework. The content of this section is only
|
|
intended for the users of our machines, but can serve as reference in
|
|
other machines.
|
|
.PP
|
|
The names of the machine classes are used in the command line prompt
|
|
instead of the actual name of the machine, to indicate that the command
|
|
needs to be executed in the stated machine class, for example:
|
|
.CS
|
|
builder% echo hi
|
|
hi
|
|
.CE
|
|
When the machine class is not important, it is ignored and only the
|
|
.CI "%"
|
|
prompt appears.
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Configure your laptop
|
|
.LP
|
|
To easily connect to the builder (xeon07) in one step, configure the SSH
|
|
client to perform a jump over the Cobi login node. The
|
|
.I ProxyJump
|
|
directive is only available in version 7.3 and upwards. Add the
|
|
following lines in the
|
|
.CI \(ti/.ssh/config
|
|
file of your laptop:
|
|
.CS
|
|
Host cobi
|
|
HostName ssflogin.bsc.es
|
|
User your-username-here
|
|
|
|
Host xeon07
|
|
ProxyJump cobi
|
|
HostName xeon07
|
|
User your-username-here
|
|
.CE
|
|
You should be able to connect to the builder typing:
|
|
.CS
|
|
laptop$ ssh xeon07
|
|
.CE
|
|
To spot any problems try with the
|
|
.CI -v
|
|
option to enable verbose output.
|
|
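.PP
Since
.I ProxyJump
requires OpenSSH 7.3 or later, you can also check the version of your
client first:
.CS
laptop$ ssh -V
.CE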
.\" ===================================================================
|
|
.NH 2
|
|
Configure the builder (xeon07)
|
|
.LP
|
|
In order to use nix you would need to be able to download the sources
|
|
from Internet. Usually the download requires the ports 22, 80 and 443
|
|
to be open for outgoing traffic.
|
|
.PP
|
|
Check that you have network access in
|
|
xeon07 provided by the environment variables \fIhttp_proxy\fP and
|
|
\fIhttps_proxy\fP. Try to fetch a webpage with curl, to ensure the proxy
|
|
is working:
|
|
.CS
|
|
xeon07$ curl x.com
|
|
x
|
|
.CE
|
|
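.PP
If the download fails, first check that the proxy variables are actually
set in your environment:
.CS
xeon07$ env | grep -i proxy
.CE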
.\" ===================================================================
|
|
.NH 3
|
|
Create a new SSH key
|
|
.LP
|
|
There is one DSA key in your current home called "cluster" that is no
|
|
longer supported in recent SSH versions and should not be used. Before
|
|
removing it, create a new one without password protection leaving the
|
|
passphrase empty (in case that you don't have one already created) by
|
|
running:
|
|
.CS
|
|
xeon07$ ssh-keygen
|
|
Generating public/private rsa key pair.
|
|
Enter file in which to save the key (\(ti/.ssh/id_rsa):
|
|
Enter passphrase (empty for no passphrase):
|
|
Enter same passphrase again:
|
|
Your identification has been saved in \(ti/.ssh/id_rsa.
|
|
Your public key has been saved in \(ti/.ssh/id_rsa.pub.
|
|
\&...
|
|
.CE
|
|
By default it will create the public key at \f(CW\(ti/.ssh/id_rsa.pub\fP.
|
|
Then add the newly created key to the authorized keys, so you can
|
|
connect to other nodes of the Cobi cluster:
|
|
.CS
|
|
xeon07$ cat \(ti/.ssh/id_rsa.pub >> \(ti/.ssh/authorized_keys
|
|
.CE
|
|
Finally, delete the old "cluster" key:
|
|
.CS
|
|
xeon07$ rm \(ti/.ssh/cluster \(ti/.ssh/cluster.pub
|
|
.CE
|
|
And remove the section in the configuration \f(CW\(ti/.ssh/config\fP
|
|
where the key was assigned to be used in all hosts along with the
|
|
\f(CWStrictHostKeyChecking=no\fP option. Remove the following lines (if
|
|
they exist):
|
|
.CS
|
|
Host *
|
|
IdentityFile \(ti/.ssh/cluster
|
|
StrictHostKeyChecking=no
|
|
.CE
|
|
By default, the SSH client already searchs for a keypair called
|
|
\f(CW\(ti/.ssh/id_rsa\fP and \f(CW\(ti/.ssh/id_rsa.pub\fP, so there is
|
|
no need to manually specify them.
|
|
.PP
|
|
You should be able to access the login node with your new key by using:
|
|
.CS
|
|
xeon07$ ssh ssfhead
|
|
.CE
|
|
.\" ===================================================================
|
|
.NH 3
|
|
Authorize access to the repository
|
|
.LP
|
|
The sources of BSC packages are usually downloaded directly from the PM
|
|
git server, so you must be able to access all repositories without a
|
|
password prompt.
|
|
.PP
|
|
Most repositories are open to read for logged in users, but there are
|
|
some exceptions (for example the nanos6 repository) where you must have
|
|
explicitly granted read access.
|
|
.PP
|
|
Copy the contents of your public SSH key in \f(CW\(ti/.ssh/id_rsa.pub\fP
|
|
and paste it in GitLab at
|
|
.CS
|
|
https://pm.bsc.es/gitlab/profile/keys
|
|
.CE
|
|
Finally verify the SSH connection to the server works and you get a
|
|
greeting from the GitLab server with your username:
|
|
.CS
|
|
xeon07$ ssh git@bscpm03.bsc.es
|
|
PTY allocation request failed on channel 0
|
|
Welcome to GitLab, @rarias!
|
|
Connection to bscpm03.bsc.es closed.
|
|
.CE
|
|
Verify that you can access the nanos6 repository (otherwise you
|
|
first need to ask to be granted read access), at:
|
|
.CS
|
|
https://pm.bsc.es/gitlab/nanos6/nanos6
|
|
.CE
|
|
Finally, you should be able to download the nanos6 git
|
|
repository without any password interaction by running:
|
|
.CS
|
|
xeon07$ git clone git@bscpm03.bsc.es:nanos6/nanos6.git
|
|
.CE
|
|
Which will create the nanos6 directory.
|
|
.\" ===================================================================
|
|
.NH 3
|
|
Authorize access to MareNostrum 4
|
|
.LP
|
|
You will also need to access MareNostrum 4 from the xeon07 machine, in
|
|
order to run experiments. Add the following lines to the
|
|
\f(CW\(ti/.ssh/config\fP file and set your user name:
|
|
.CS
|
|
Host mn0 mn1 mn2
|
|
User <your user name in MN4>
|
|
.CE
|
|
Then copy your SSH key to MareNostrum 4 (it will ask you for your login
|
|
password):
|
|
.CS
|
|
xeon07$ ssh-copy-id -i \(ti/.ssh/id_rsa.pub mn1
|
|
.CE
|
|
Finally, ensure that you can connect without a password:
|
|
.CS
|
|
xeon07$ ssh mn1
|
|
\&...
|
|
login1$
|
|
.CE
|
|
.\" ===================================================================
|
|
.NH 3
|
|
Clone the bscpkgs repository
|
|
.LP
|
|
Once you have Internet and you have granted access to the PM GitLab
|
|
repositories you can begin building software with nix. First ensure
|
|
that the nix binaries are available from your shell in xeon07:
|
|
.CS
|
|
xeon07$ nix --version
|
|
nix (Nix) 2.3.6
|
|
.CE
|
|
Now you are ready to build and install packages with nix. Clone the
|
|
bscpkgs repository:
|
|
.CS
|
|
xeon07$ git clone git@bscpm03.bsc.es:rarias/bscpkgs.git
|
|
.CE
|
|
Nix looks in the current folder for a file named \f(CWdefault.nix\fP for
|
|
packages, so go to the bscpkgs directory:
|
|
.CS
|
|
xeon07$ cd bscpkgs
|
|
.CE
|
|
Now you should be able to build nanos6 (which is probably already
|
|
compiled):
|
|
.CS
|
|
xeon07$ nix-build -A bsc.nanos6
|
|
\&...
|
|
/nix/store/...2cm1ldx9smb552sf6r1-nanos6-2.4-6f10a32
|
|
.CE
|
|
The installation is placed in the nix store (with the path stated in
|
|
the last line of the build process), with the \f(CWresult\fP symbolic
|
|
link pointing to the same location:
|
|
.CS
|
|
xeon07$ readlink result
|
|
/nix/store/...2cm1ldx9smb552sf6r1-nanos6-2.4-6f10a32
|
|
.CE
|
|
.\" ###################################################################
|
|
.NH 3
|
|
Configure garlic
|
|
.LP
|
|
In order to launch experiments in the
|
|
.I target
|
|
machine, it is required to configure nix to allow a directory to be
|
|
available during the build process, where the results will be stored
|
|
before being copied in the nix store. Create a new
|
|
.CI garlic
|
|
directory in your
|
|
personal cache directory and copy the full path:
|
|
.CS
|
|
xeon07$ mkdir -p \(ti/.cache/garlic
|
|
xeon07$ readlink -f \(ti/.cache/garlic
|
|
/home/Computational/rarias/.cache/garlic
|
|
.CE
|
|
Then create the nix configuration directory (if it has not already been
|
|
created):
|
|
.CS
|
|
xeon07$ mkdir -p \(ti/.config/nix
|
|
.CE
|
|
And add the following line in the
|
|
.CI \(ti/.config/nix/nix.conf
|
|
file, replacing it with the path you copied before:
|
|
.CS
|
|
.SM
|
|
extra-sandbox-paths = /garlic=/home/Computational/rarias/.cache/garlic
|
|
.CE
|
|
This option creates a virtual directory called
|
|
.CI /garlic
|
|
inside the build environment, whose contents are the ones you specify at
|
|
the right hand side of the equal sign (in this case the
|
|
.CI \(ti/.cache/garlic
|
|
directory). It will be used to allow the results of the experiments to
|
|
be passed to nix from the
|
|
.I target
|
|
machine.
|
|
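.PP
To verify that nix picked up the option, you can inspect the effective
configuration (the exact option name shown in the output may vary
between nix versions):
.CS
xeon07$ nix show-config | grep sandbox
.CE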
.\" ###################################################################
|
|
.NH 3
|
|
Run the garlic daemon (optional)
|
|
.LP
|
|
The garlic benchmark has a daemon which can be used to
|
|
automatically launch the experiments in the
|
|
.I target
|
|
machine on demand, when they are required to build other derivations, so
|
|
they can be launched without user interaction. The daemon creates some
|
|
FIFO pipes to communicate with the build environment, and must be
|
|
running to be able to run the experiments. To execute it, go to the
|
|
.CI bscpkgs/garlic
|
|
directory and run
|
|
.CS
|
|
xeon07$ nix-shell
|
|
nix-shell$
|
|
.CE
|
|
to enter the nix shell (or specify the path to the
|
|
.CI garlic/shell.nix
|
|
file as argument). Then, run the daemon inside the nix shell:
|
|
.CS
|
|
nix-shell$ garlicd
|
|
garlicd: Waiting for experiments ...
|
|
.CE
|
|
Notice that the daemon stays running in the foreground, waiting for
|
|
experiments. At this moment, it can only process one experiment at a
|
|
time.
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Configure the login and target (MareNostrum 4)
|
|
.LP
|
|
In order to execute the programs in MareNostrum 4, you first need load
|
|
some utilities in the PATH. Add to the end of the file
|
|
\f(CW\(ti/.bashrc\fP in MareNostrum 4 the following line:
|
|
.CS
|
|
export PATH=/gpfs/projects/bsc15/nix/bin:$PATH
|
|
.CE
|
|
Then logout and login again (our source the \f(CW\(ti/.bashrc\fP file)
|
|
and check that now you have the \f(CWnix-develop\fP command available:
|
|
.CS
|
|
login1$ which nix-develop
|
|
/gpfs/projects/bsc15/nix/bin/nix-develop
|
|
.CE
|
|
The new utilities are available both in the login nodes and in the
|
|
compute (target) nodes, as they share the file system over the network.
|
|
.\" ===================================================================
|
|
.bp
|
|
.NH 1
|
|
Development
|
|
.LP
|
|
During the development phase, a functional program is produced by
|
|
modifying its source code. This process is generally cyclic: the
|
|
developer needs to compile, debug and correct mistakes. We want to
|
|
minimize the delay times, so the programs can be executed as soon as
|
|
needed, but under a controlled environment so that the same behavior
|
|
occurs during the experimentation phase.
|
|
.PP
|
|
In particular, we want that several developers can reproduce the
|
|
same development environment so they can debug each other programs
|
|
when reporting bugs. Therefore, the environment must be carefully
|
|
controlled to avoid non-reproducible scenarios.
|
|
.PP
|
|
The current development environment provides an isolated shell with a
|
|
clean environment, which runs in a new mount namespace where access to
|
|
the filesystem is restricted. Only the project directory and the nix
|
|
store are available (with some other exceptions), to ensure that you
|
|
cannot accidentally link with the wrong library or modify the build
|
|
process with a forgotten environment variable in the \f(CW\(ti/.bashrc\fP
|
|
file.
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Getting the development tools
|
|
.LP
|
|
To create a development
|
|
environment, first copy or download the sources of your program (not the
|
|
dependencies) in a new directory placed in the target machine
|
|
(MareNostrum\~4).
|
|
.PP
|
|
The default environment contains packages commonly used to develop
|
|
programs, listed in the \fIgarlic/index.nix\fP file:
|
|
.\" FIXME: Unify garlic.unsafeDevelop in garlic.develop, so we can
|
|
.\" specify the packages directly
|
|
.CS
|
|
develop = let
|
|
commonPackages = with self; [
|
|
coreutils htop procps-ng vim which strace
|
|
tmux gdb kakoune universal-ctags bashInteractive
|
|
glibcLocales ncurses git screen curl
|
|
# Add more nixpkgs packages here...
|
|
];
|
|
bscPackages = with bsc; [
|
|
slurm clangOmpss2 icc mcxx perf tampi impi
|
|
# Add more bsc packages here...
|
|
];
|
|
...
|
|
.CE
|
|
If you need additional packages, add them to the list, so that they
|
|
become available in the environment. Those may include any dependency
|
|
required to build your program.
|
|
.PP
|
|
Then use the build machine (xeon07) to build the
|
|
.I garlic.develop
|
|
derivation:
|
|
.CS
|
|
build% nix-build -A garlic.develop
|
|
\&...
|
|
build% grep ln result
|
|
ln -fs /gpfs/projects/.../bin/stage1 .nix-develop
|
|
.CE
|
|
Copy the \fIln\fP command and run it in the target machine
|
|
(MareNostrum\~4), inside the new directory used for your program
|
|
development, to create the link \fI.nix-develop\fP (which is used to
|
|
remember your environment). Several environments can be stored in
|
|
different directories using this method, with different packages in each
|
|
environment. You will need
|
|
to rebuild the
|
|
.I garlic.develop
|
|
derivation and update the
|
|
.I .nix-develop
|
|
link after the package list is changed. Once the
|
|
environment link is created, there is no need to repeat these steps again.
|
|
.PP
|
|
Before entering the environment, you will need to access the required
|
|
resources for your program, which may include several compute nodes.
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Allocating resources for development
|
|
.LP
|
|
Our target machine (MareNostrum 4) provides an interactive shell, that
|
|
can be requested with the number of computational resources required for
|
|
development. To do so, connect to the login node and allocate an
|
|
interactive session:
|
|
.CS
|
|
% ssh mn1
|
|
login% salloc ...
|
|
target%
|
|
.CE
|
|
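.PP
For example, to request two nodes for one hour (the exact flags depend
on the resources you need; these are only illustrative):
.CS
login% salloc -N 2 -t 01:00:00
.CE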
This operation may take some minutes to complete, depending on the load
of the cluster. But once the session is ready, any subsequent execution
of programs will be immediate.
.\" ===================================================================
|
|
.NH 2
|
|
Accessing the developement environment
|
|
.PP
|
|
The utility program \fInix-develop\fP has been designed to access the
|
|
development environment of the current directory, by looking for the
|
|
\fI.nix-develop\fP file. It creates a namespace where the required
|
|
packages are installed and ready to be used. Now you can access the
|
|
newly created environment by running:
|
|
.CS
|
|
target% nix-develop
|
|
develop%
|
|
.CE
|
|
The spawned shell contains all the packages pre-defined in the
|
|
\fIgarlic.develop\fP derivation, and can now be accessed by typing the
|
|
name of the commands.
|
|
.CS
|
|
develop% which gcc
|
|
/nix/store/azayfhqyg9...s8aqfmy-gcc-wrapper-9.3.0/bin/gcc
|
|
develop% which gdb
|
|
/nix/store/1c833b2y8j...pnjn2nv9d46zv44dk-gdb-9.2/bin/gdb
|
|
.CE
|
|
If you need additional packages, you can add them in the
|
|
\fIgarlic/index.nix\fP file as mentioned previously. To keep the
|
|
same current resources, so you don't need to wait again for the
|
|
resources to be allocated, exit only from the development shell:
|
|
.CS
|
|
develop% exit
|
|
target%
|
|
.CE
|
|
Then update the
|
|
.I .nix-develop
|
|
link and enter into the new develop environment:
|
|
.CS
|
|
target% nix-develop
|
|
develop%
|
|
.CE
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Execution
|
|
.LP
|
|
The allocated shell can only execute tasks in the current node, which
|
|
may be enough for some tests. To do so, you can directly run your
|
|
program as:
|
|
.CS
|
|
develop$ ./program
|
|
.CE
|
|
If you need to run a multi-node program, typically using MPI
|
|
communications, then you can do so by using srun. Notice that you need
|
|
to allocate several nodes when calling salloc previously. The srun
|
|
command will execute the given program \fBoutside\fP the development
|
|
environment if executed as-is. So we re-enter the develop environment by
|
|
calling nix-develop as a wrapper of the program:
|
|
.\" FIXME: wrap srun to reenter the develop environment by its own
|
|
.CS
|
|
develop$ srun nix-develop ./program
|
|
.CE
|
|
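.PP
The usual srun flags still apply; for example, to launch four tasks
(illustrative):
.CS
develop$ srun -n 4 nix-develop ./program
.CE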
.\" ===================================================================
|
|
.NH 2
|
|
Debugging
|
|
.LP
|
|
The debugger can be used to directly execute the program if is executed
|
|
in only one node by using:
|
|
.CS
|
|
develop$ gdb ./program
|
|
.CE
|
|
Or it can be attached to an already running program by using its PID.
|
|
You will need to first connect to the node running it (say target2), and
|
|
run gdb inside the nix-develop environment. Use
|
|
.I squeue
|
|
to see the compute nodes running your program:
|
|
.CS
|
|
login$ ssh target2
|
|
target2$ cd project-develop
|
|
target2$ nix-develop
|
|
develop$ gdb -p $pid
|
|
.CE
|
|
You can repeat this step to control the execution of programs running in
|
|
different nodes simultaneously.
|
|
.PP
|
|
In those cases where the program crashes before being able to attach the
|
|
debugger, enable the generation of core dumps:
|
|
.CS
|
|
develop$ ulimit -c unlimited
|
|
.CE
|
|
And rerun the program, which will generate a core file that can be
|
|
opened by gdb and contains the state of the memory when the crash
|
|
happened. Beware that the core dump file can be very large, depending on
|
|
the memory used by your program at the crash.
|
|
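.PP
Once a core file has been generated, open it along with the binary (the
core file name may vary between systems, for example core or
core.<pid>):
.CS
develop$ gdb ./program core
.CE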
.\" ===================================================================
|
|
.NH 2
|
|
Git branch name convention
|
|
.LP
|
|
The garlic benchmark imposes a set of requirements to be meet for each
|
|
application in order to coordinate the execution of the benchmark and
|
|
the gathering process of the results.
|
|
.PP
|
|
Each application must be available in a git repository so it can be
|
|
included into the garlic benchmark. The different combinations of
|
|
programming models and communication schemes should be each placed in
|
|
one git branch, which are referred to as \fIbenchmark branches\fP. At
|
|
least one benchmark branch should exist and they all must begin with the
|
|
prefix \f(CWgarlic/\fP (other branches will be ignored).
|
|
.PP
|
|
The branch name is formed by adding keywords separated by the "+"
|
|
character. The keywords must follow the given order and can only
|
|
appear zero or once each. At least one keyword must be included. The
|
|
following keywords are available:
|
|
.IP \f(CWmpi\fP 5m
|
|
A significant fraction of the communications uses only the standard MPI
|
|
(without extensions like TAMPI).
|
|
.IP \f(CWtampi\fP
|
|
A significant fraction of the communications uses TAMPI.
|
|
.IP \f(CWsend\fP
|
|
A significant part of the MPI communication uses the blocking family of
|
|
methods
|
|
.I MPI_Send , (
|
|
.I MPI_Recv ,
|
|
.I MPI_Gather "...)."
|
|
.IP \f(CWisend\fP
|
|
A significant part of the MPI communication uses the non-blocking family
|
|
of methods
|
|
.I MPI_Isend , (
|
|
.I MPI_Irecv ,
|
|
.I MPI_Igather "...)."
|
|
.IP \f(CWrma\fP
|
|
A significant part of the MPI communication uses remote memory access
|
|
(one-sided) methods
|
|
.I MPI_Get , (
|
|
.I MPI_Put "...)."
|
|
.IP \f(CWseq\fP
|
|
The complete execution is sequential in each process (one thread per
|
|
process).
|
|
.IP \f(CWomp\fP
|
|
A significant fraction of the execution uses the OpenMP programming
|
|
model.
|
|
.IP \f(CWoss\fP
|
|
A significant fraction of the execution uses the OmpSs-2 programming
|
|
model.
|
|
.IP \f(CWtask\fP
|
|
A significant part of the execution involves the use of the tasking
|
|
model.
|
|
.IP \f(CWtaskfor\fP
|
|
A significant part of the execution uses the taskfor construct.
|
|
.IP \f(CWfork\fP
|
|
A significant part of the execution uses the fork-join model (including
|
|
hybrid programming techniques with parallel computations and sequential
|
|
communications).
|
|
.IP \f(CWsimd\fP
|
|
A significant part of the computation has been optimized to use SIMD
|
|
instructions.
|
|
.LP
|
|
In the
|
|
.URL #appendixA "Appendix A"
|
|
there is a flowchart to help the decision
|
|
process of the branch name. Additional user defined keywords may be
|
|
added at the end using the separator "+" as well. User keywords must
|
|
consist of capital alphanumeric characters only and be kept short. These
|
|
additional keywords must be different (case insensitive) to the already
|
|
defined above. Some examples:
|
|
.CS
|
|
garlic/mpi+send+seq
|
|
garlic/mpi+send+omp+fork
|
|
garlic/mpi+isend+oss+task
|
|
garlic/tampi+isend+oss+task
|
|
garlic/tampi+isend+oss+task+COLOR
|
|
garlic/tampi+isend+oss+task+COLOR+BTREE
|
|
.CE
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Initialization time
|
|
.LP
|
|
It is common for programs to have an initialization phase prior to the
|
|
execution of the main computation task which is the objective of the study.
|
|
The initialization phase is usually not considered when taking
|
|
measurements, but the time it takes to complete can limit seriously the
|
|
amount of information that can be extracted from the computation phase.
|
|
As an example, if the computation phase is in the order of seconds, but
|
|
the initialization phase takes several minutes, the number of runs would
|
|
need to be set low, as the units could exceed the time limits. Also, the
|
|
experimenter may be reluctant to modify the experiments to test other
|
|
parameters, as the waiting time for the results is unavoidably large.
|
|
.PP
|
|
To prevent this problem the programs must reduce the time of the
|
|
initialization phase to be no larger than the computation time. To do
|
|
so, the initialization phase can be optimized either with
|
|
parallelization, or it can be modified to store the result of the
|
|
initialization to the disk to be later at the computation phase. In the
|
|
garlic framework an experiment can have a dependency over the results of
|
|
another experiment (the results of the initialization). The
|
|
initialization results will be cached if the derivation is kept
|
|
invariant, when modifying the computation phase parameters.
|
|
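.PP
As an illustration of the second approach, a minimal sketch in C (the
.CI expensive_init
function and the file format are hypothetical; in the garlic framework
this dependency is expressed at the experiment level instead):
.CS
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical expensive initialization */
void expensive_init(double *data, size_t n);

double *load_or_init(const char *path, size_t n)
{
    double *data = malloc(n * sizeof(double));
    FILE *f = fopen(path, "rb");
    if(f)
    {
        /* Reuse the cached initialization */
        if(fread(data, sizeof(double), n, f) != n)
            exit(EXIT_FAILURE);
        fclose(f);
        return data;
    }
    /* Compute it once and store it for later runs */
    expensive_init(data, n);
    f = fopen(path, "wb");
    fwrite(data, sizeof(double), n, f);
    fclose(f);
    return data;
}
.CE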
.\" ===================================================================
|
|
.NH 2
|
|
Measurement of the execution time
|
|
.LP
|
|
The programs must measure the wall time of the computation phase following a
|
|
set of rules. The way in which the wall time is measured is very important to
|
|
get accurate results. The measured time must be implemented by using a
|
|
monotonic clock which is able to correct the drift of the oscillator of
|
|
the internal clock due to changes in temperature. This clock must be
|
|
measured in C and C++ with:
|
|
.CS
|
|
clock_gettime(CLOCK_MONOTONIC, &ts);
|
|
.CE
|
|
A helper function can be used the approximate value of the clock in a
|
|
double precision float, in seconds:
|
|
.CS
|
|
double get_time()
|
|
{
|
|
struct timespec tv;
|
|
if(clock_gettime(CLOCK_MONOTONIC, &tv) != 0)
|
|
{
|
|
perror("clock_gettime failed");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
return (double)(ts.tv_sec) +
|
|
(double)ts.tv_nsec * 1.0e-9;
|
|
}
|
|
.CE
|
|
The start and end points must be measured after the synchronization of
|
|
all the processes and threads, so the complete computation work can be
|
|
bounded to fit inside the measured interval. An example for a MPI
|
|
program:
|
|
.CS
|
|
double start, end, delta_time;
|
|
MPI_Barrier();
|
|
start = get_time();
|
|
run_simulation();
|
|
MPI_Barrier();
|
|
end = get_time();
|
|
delta_time = end - start;
|
|
.CE
|
|
.\" ===================================================================
|
|
.NH 2
|
|
Format of the execution time
|
|
.LP
|
|
The measured execution time must be printed to the standard output
|
|
(stdout) in scientific notation with at least 7 significative digits.
|
|
The following the printf format (or the strict equivalent in other languages)
|
|
must be used:
|
|
.CS
|
|
printf("time %e\\n", delta_time);
|
|
.CE
|
|
The line must be printed alone and only once: for MPI programs,
|
|
only one process shall print the time:
|
|
.CS
|
|
if(rank == 0) printf("time %e\\n", delta_time);
|
|
.CE
|
|
Other lines can be printed in the stdout, but without the
|
|
.I time
|
|
prefix, so that the following pipe can be used to capture the line:
|
|
.CS
|
|
% ./app | grep "^time"
|
|
1.234567e-01
|
|
.CE
|
|
Ensure that your program follows this convention by testing it with the
|
|
above
|
|
.I grep
|
|
filter; otherwise the results will fail to be parsed when building
|
|
the dataset with the execution time.
|
|
.\" ===================================================================
|
|
.bp
|
|
.NH 1
|
|
Experimentation
|
|
.LP
|
|
During the experimentation, a program is studied by running it and
|
|
measuring some properties. The experimenter is in charge of the
|
|
experiment design, which is typically controlled by a single
|
|
.I nix
|
|
file placed in the
|
|
.CI garlic/exp
|
|
subdirectory.
|
|
Experiments are formed by several
|
|
.I "experimental units"
|
|
or simply
|
|
.I units .
|
|
A unit is the result of each unique configuration of the experiment
|
|
(typically involves the cartesian product of all factors) and
|
|
consists of several shell scripts executed sequentially to setup the
|
|
.I "execution environment" ,
|
|
which finally launch the actual program being analyzed.
|
|
The scripts that prepare the environment and the program itself are
|
|
called the
|
|
.I stages
|
|
of the execution and altogether form the
|
|
.I "execution pipeline"
|
|
or simply the
|
|
.I pipeline .
|
|
The experimenter must know with very good details all the stages
|
|
involved in the pipeline, as they have a large impact on the execution.
|
|
.PP
|
|
Additionally, the execution time is impacted by the target machine in
|
|
which the experiments run. The software used for the benchmark is
|
|
carefully configured and tuned for the hardware used in the execution;
|
|
in particular, the experiments are designed to run in MareNostrum 4
|
|
cluster with the SLURM workload manager and the Omni-Path
|
|
interconnection network. In the future we plan to add
|
|
support for other clusters in order to execute the experiments in other
|
|
machines.
|
|
.\"#####################################################################
|
|
.NH 2
|
|
Isolation
|
|
.LP
|
|
The benchmark is designed so that both the compilation of every software
|
|
package and the execution of the experiment is performed under strict
|
|
conditions. We can ensure that two executions of the same experiment are
|
|
actually running the same program in the same software environment.
|
|
.PP
|
|
All the software used by an experiment is included in the
|
|
.I "nix store"
|
|
which is, by convention, located at the
|
|
.CI /nix
|
|
directory. Unfortunately, it is common for libraries to try to load
|
|
software from other paths like
|
|
.CI /usr
|
|
or
|
|
.CI /lib .
|
|
It is also common that configuration files are loaded from
|
|
.CW /etc
|
|
and from the home directory of the user that runs the experiment.
|
|
Additionally, some environment variables are recognized by the libraries
|
|
used in the experiment, which change their behavior. As we cannot
|
|
control the software and configuration files in those directories, we
|
|
couldn't guarantee that the execution behaves as intended.
|
|
.PP
|
|
In order to avoid this problem, we create a
|
|
.I sandbox
|
|
where only the files in the nix store are available (with some other
|
|
exceptions). Therefore, even if the libraries try to access any path
|
|
outside the nix store, they will find that the files are not there
|
|
anymore. Additionally, the environment variables are cleared before
|
|
entering the environment (with some exceptions as well).
|
|
.\"#####################################################################
|
|
.NH 2
|
|
Execution pipeline
|
|
.LP
|
|
Several predefined stages form the
|
|
.I standard
|
|
execution pipeline and are defined in the
|
|
.I stdPipeline
|
|
array. The standard pipeline prepares the resources and the environment
|
|
to run a program (usually in parallel) in the compute nodes. It is
|
|
divided in two main parts:
|
|
connecting to the target machine to submit a job and executing the job.
|
|
Finally, the complete execution pipeline ends by running the actual
|
|
program, which is not part of the standard pipeline, as should be
|
|
defined differently for each program.
|
|
.\"#####################################################################
|
|
.NH 3
|
|
Job submission
|
|
.LP
|
|
Some stages are involved in the job submission: the
|
|
.I trebuchet
|
|
stage connects via
|
|
.I ssh
|
|
to the target machine and executes the next stage there. Once in the
|
|
target machine, the
|
|
.I runexp
|
|
stage computes the output path to store the experiment results, using
|
|
the user in the target machine and changes the working directory there.
|
|
In MareNostrum 4 the output path is at
|
|
.CI /gpfs/projects/bsc15/garlic/$user/out .
|
|
Then the
|
|
.I isolate
|
|
stage is executed to enter the sandbox and the
|
|
.I experiment
|
|
stage begins, which creates a directory to store the experiment output,
|
|
and launches several
|
|
.I unit
|
|
stages.
|
|
.PP
|
|
Each unit executes a
|
|
.I sbatch
|
|
stage which runs the
|
|
.I sbatch(1)
|
|
program with a job script that simply calls the next stage. The
|
|
sbatch program internally reads the
|
|
.CW /etc/slurm/slurm.conf
|
|
file from outside the sandbox, so we must explicitly allow this file to
|
|
be available, as well as the
|
|
.I munge
|
|
socket used for authentication by the SLURM daemon. Once the jobs are
|
|
submitted to SLURM, the experiment stage ends and the trebuchet finishes
|
|
the execution. The jobs will be queued for execution without any other
|
|
intervention from the user.
|
|
.PP
|
|
The rationale behind running sbatch from the sandbox is because the
|
|
options provided in environment variables override the options from the
|
|
job script. Therefore, we avoid this problem by running sbatch from the
|
|
sandbox, where the interfering environment variables are removed. The
|
|
sbatch program is also provided in the
|
|
.I "nix store" ,
|
|
with a version compatible with the SLURM daemon running in the target
|
|
machine.
|
|
.\"#####################################################################
|
|
.NH 3
|
|
Job execution
|
|
.LP
|
|
Once an unit job has been selected for execution, SLURM
|
|
allocates the resources (usually several nodes) and then selects one of
|
|
the nodes to run the job script: it is not executed in parallel yet.
|
|
The job script runs from a child process forked from on of the SLURM
|
|
daemon processes, which are outside the sandbox. Therefore, we first run the
|
|
.I isolate
|
|
stage
|
|
to enter the sandbox again.
|
|
.PP
|
|
The next stage is called
|
|
.I control
|
|
and determines if enough data has been generated by the experiment unit
|
|
or if it should continue repeating the execution. At the current time,
|
|
it is only implemented as a simple loop that runs the next stage a fixed
|
|
amount of times (by default, it is repeated 30 times).
|
|
.PP
|
|
The following stage is
|
|
.I srun
|
|
which launches several copies of the next stage to run in
|
|
parallel (when using more than one task). Runs one copy per task,
|
|
effectively creating one process per task. The CPUs affinity is
|
|
configured by the parameter
|
|
.I --cpu-bind
|
|
and is important to set it correctly (see more details in the
|
|
.I srun(1)
|
|
manual). Appending the
|
|
.I verbose
|
|
value to the cpu bind option causes srun to print the assigned affinity
|
|
of each task, which is very valuable when examining the execution log.
|
|
.PP
|
|
The mechanism by which srun executes multiple processes is the same used
|
|
by sbatch, it forks from a SLURM daemon running in the computing nodes.
|
|
Therefore, the execution begins outside the sandbox. The next stage is
|
|
.I isolate
|
|
which enters again the sandbox in every task. All remaining stages are
|
|
running now in parallel.
|
|
.\" ###################################################################
|
|
.NH 3
|
|
The program
|
|
.LP
|
|
At this point in the execution, the standard pipeline has been
|
|
completely executed, and we are ready to run the actual program that is
|
|
the matter of the experiment. Usually, programs require some arguments
|
|
to be passed in the command line. The
|
|
.I exec
|
|
stage sets the arguments (and optionally some environment variables) and
|
|
executes the last stage, the
|
|
.I program .
|
|
.PP
|
|
The experimenters are required to define these last stages, as they
|
|
define the specific way in which the program must be executed.
|
|
Additional stages may be included before or after the program run, so
|
|
they can perform additional steps.
|
|
.\" ###################################################################
|
|
.NH 3
|
|
Stage overview
|
|
.LP
|
|
The complete execution pipeline using the standard pipeline is shown in
|
|
the Table 1. Some properties are also reflected about the execution
|
|
stages.
|
|
.DS L
|
|
.TS
|
|
center;
|
|
lB cB cB cB cB cB
|
|
l c c c c c.
|
|
_
|
|
Stage Where Safe Copies User Std
|
|
_
|
|
trebuchet * no no yes yes
|
|
runexp login no no no yes
|
|
isolate login no no no yes
|
|
experiment login yes no no yes
|
|
unit login yes no no yes
|
|
sbatch login yes no no yes
|
|
_
|
|
isolate target no no no yes
|
|
control target yes no no yes
|
|
srun target yes no no yes
|
|
isolate target no yes no yes
|
|
_
|
|
exec target yes yes no no
|
|
program target yes yes no no
|
|
_
|
|
.TE
|
|
.DE
|
|
.QS
|
|
.SM
|
|
.B "Table 1" :
|
|
The stages of a complete execution pipeline. The
|
|
.I where
|
|
column determines where the stage is running,
|
|
.I safe
|
|
states if the stage begins the execution inside the sandbox,
|
|
.I user
|
|
if it can be executed directly by the user,
|
|
.I copies
|
|
if there are several instances running in parallel and
|
|
.I std
|
|
if is part of the standard execution pipeline.
|
|
.QE
|
|
.\" ###################################################################
|
|
.NH 2
|
|
Writing the experiment
|
|
.LP
|
|
The experiments are generally written in the
|
|
.I nix
|
|
language as it provides very easy management for the packages an their
|
|
customization. An experiment file is formed by several parts, which
|
|
produce the execution pipeline when built. The experiment file describes
|
|
a function (which is typical in nix) and takes as argument an
|
|
attribute set with some common packages, tools and options:
|
|
.CS
|
|
{ stdenv, lib, bsc, stdexp, targetMachine, stages, garlicTools }:
|
|
.CE
|
|
The
|
|
.I bsc
|
|
attribute contains all the BSC and nixpkgs packages, as defined in the
|
|
overlay. The
|
|
.I stdexp
|
|
contains some useful tools and functions to build the experiments, like
|
|
the standard execution pipeline, so you don't need to redefine the
|
|
stages in every experiment. The configuration of the target machine is
|
|
specified in the
|
|
.I targetMachine
|
|
attribute which includes information like the number of CPUs per node or
|
|
the cache line length. It is used to define the experiments in such a
|
|
way that they are not tailored to an specific machine hardware
|
|
(sometimes this is not posible). All the execution stages are available
|
|
in the
|
|
.I stages
|
|
attribute which are used when some extra stage is required. And finally,
|
|
the
|
|
.I garlicTools
|
|
attribute provide some functions to aid common tasks when defining the
|
|
experiment configuration
|
|
.\" ###################################################################
|
|
.NH 3
|
|
Experiment configuration
|
|
.LP
|
|
The next step is to define some variables in a
|
|
.CI let
|
|
\&...
|
|
.CI in
|
|
\&...
|
|
.CI ;
|
|
construct, to be used later. The first one, is the variable
|
|
configuration of the experiment called
|
|
.I varConf ,
|
|
which include all
|
|
the factors that will be changed. All the attributes of this set
|
|
.I must
|
|
be arrays, even if they only contain one element:
|
|
.CS
|
|
varConf = {
|
|
blocks = [ 1 2 4 ];
|
|
nodes = [ 1 ];
|
|
};
|
|
.CE
|
|
In this example, the variable
|
|
.I blocks
|
|
will be set to the values 1, 2 and 4; while
|
|
.I nodes
|
|
will remain set to 1 always. These variables are used later to build the
|
|
experiment configuration. The
|
|
.I varConf
|
|
is later converted to a list of attribute sets, where every attribute
|
|
contains only one value, covering all the combinations (the Cartesian
|
|
product is computed):
|
|
.CS
|
|
[ { blocks = 1; nodes = 1; }
|
|
{ blocks = 2; nodes = 1; }
|
|
{ blocks = 4; nodes = 1; } ]
|
|
.CE
|
|
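.PP
This expansion is a plain Cartesian product over the attribute values.
For reference, recent versions of the nixpkgs library expose a helper
that performs an equivalent expansion (shown only as an illustration;
the garlic framework uses its own
.I stdexp.buildConfigs
function, described later):
.CS
nix-repl> lib.cartesianProductOfSets { blocks = [ 1 2 4 ]; nodes = [ 1 ]; }
[ { ... } { ... } { ... } ]
.CE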
These configurations are then passed to the
.I genConf
function one at a time, which is the central part of the description of
the experiment:
.CS
genConf = var: fix (self: targetMachine.config // {
  expName = "example";
  unitName = self.expName + "-b" + toString self.blocks;
  blocks = var.blocks;
  cpusPerTask = 1;
  tasksPerNode = self.hw.socketsPerNode;
  nodes = var.nodes;
});
.CE
It takes as input
.I one
configuration from the Cartesian product, for example:
.CS
{ blocks = 2; nodes = 1; }
.CE
And returns the complete configuration for that input, which usually
expands the input configuration with some derived variables along with
other constant parameters. The return value can be inspected by calling
the function in an interactive
.I "nix repl"
session:
.CS
nix-repl> genConf { blocks = 2; nodes = 1; }
{
  blocks = 2;
  cpusPerTask = 1;
  expName = "example";
  hw = { ... };
  march = "skylake-avx512";
  mtune = "skylake-avx512";
  name = "mn4";
  nixPrefix = "/gpfs/projects/bsc15/nix";
  nodes = 1;
  sshHost = "mn1";
  tasksPerNode = 2;
  unitName = "example-b2";
}
.CE
Some configuration parameters were added by
.I targetMachine.config ,
such as the
.I nixPrefix ,
.I sshHost
or the
.I hw
attribute set, which are specific to the cluster where the experiment is
going to run. Also, the
.I unitName
got assigned the proper name based on the number of blocks, while the
number of tasks per node was assigned based on the hardware description
of the target machine.
.PP
By following this rule, the experiments can easily be ported to machines
with other hardware characteristics, and we only need to define the
hardware details once. Then all the experiments will be updated based on
those details.
.\" ###################################################################
|
|
.NH 3
|
|
Adding the stages
|
|
.LP
|
|
Once the configuration is ready, it will be passed to each stage of the
|
|
execution pipeline which will take the parameters it needs. The
|
|
connection between the parameters and how they are passed to each stage
|
|
is done either by convention or manually. There is a list of parameters that
|
|
are recognized by the standard pipeline stages. For example the
|
|
attribute
|
|
.I nodes ,
|
|
it is recognized as the number of nodes in the standard
|
|
.I sbatch
|
|
stage when allocating resources:
|
|
.DS L
|
|
.TS
|
|
center;
|
|
lB lB cB cB lB
|
|
l l c c l.
|
|
_
|
|
Stage Attribute Std Req Description
|
|
_
|
|
* nixPrefix yes yes Path to the nix store in the target
|
|
unit expName yes yes Name of the experiment
|
|
unit unitName yes yes Name of the unit
|
|
control loops yes yes Number of runs of each unit
|
|
sbatch cpusPerTask yes yes Number of CPUs per task (process)
|
|
sbatch jobName yes yes Name of the job
|
|
sbatch nodes yes yes Number of nodes allocated
|
|
sbatch ntasksPerNode yes yes Number of tasks (processes) per node
|
|
sbatch qos yes no Name of the QoS queue
|
|
sbatch reservation yes no Name of the reservation
|
|
sbatch time yes no Maximum allocated time (string)
|
|
_
|
|
exec argv no no Array of arguments to execve
|
|
exec env no no Environment variable settings
|
|
exec pre no no Code before the execution
|
|
exec post no no Code after the execution
|
|
_
|
|
.TE
|
|
.DE
|
|
.QS
|
|
.SM
|
|
.B "Table 2" :
|
|
The attributes recognized by the stages in the execution pipeline. The
|
|
column
|
|
.I std
|
|
indicates if they are part of the standard execution pipeline. Some
|
|
attributes are required as indicated by the
|
|
.I req
|
|
column.
|
|
.QE
|
|
.LP
Other attribute names can be used to specify custom information used in
additional stages. The two most common stages required to complete the
pipeline are the
.I exec
and the
.I program .
Let's see an example of
.I exec :
.CS
exec = {nextStage, conf, ...}: stages.exec {
  inherit nextStage;
  argv = [ "--blocks" conf.blocks ];
};
.CE
The
.I exec
stage is defined as a function that uses the predefined
.I stages.exec
stage, which accepts the
.I argv
array, and sets the argv of the program. In our case, we fill the
.I argv
array by setting the
.I --blocks
parameter to the number of blocks, specified in the configuration in the
attribute
.I blocks .
The name of this attribute can be freely chosen, as long as the
.I exec
stage refers to it properly. The
.I nextStage
attribute is mandatory in all stages, and is automatically set when
building the pipeline.
.PP
The last step is to configure the actual program to be executed,
which can be specified as another stage:
.CS
program = {nextStage, conf, ...}: bsc.apps.example;
.CE
Notice that this function only returns the
.I bsc.apps.example
derivation, which will be translated to the path where the example
program is installed. If the program is located inside a directory
(typically
.I bin ),
it must define the attribute
.I programPath
in the
.I bsc.apps.example
derivation, which points to the executable program. An example:
.CS
stdenv.mkDerivation {
  \& ...
  programPath = "/bin/example";
  \& ...
};
.CE
.\" ###################################################################
|
|
.NH 3
|
|
Building the pipeline
|
|
.LP
|
|
With the
|
|
.I exec
|
|
and
|
|
.I program
|
|
stages defined and the ones provided by the standard pipeline, the
|
|
complete execution pipeline can be formed. To do so, the stages are
|
|
placed in an array, in the order they will be executed:
|
|
.CS
|
|
pipeline = stdexp.stdPipeline ++ [ exec program ];
|
|
.CE
|
|
The attribute
|
|
.I stdexp.stdPipeline
|
|
contains the standard pipeline stages, and we only append our two
|
|
defined stages
|
|
.I exec
|
|
and
|
|
.I program .
|
|
The
|
|
.I pipeline
|
|
is an array of functions, and must be transformed in something that can
|
|
be executed in the target machine. For that purpose, the
|
|
.I stdexp
|
|
provides the
|
|
.I genExperiment
|
|
function, which takes the
|
|
.I pipeline
|
|
array and the list of configurations and builds the execution pipeline:
|
|
.CS
|
|
stdexp.genExperiment { inherit configs pipeline; }
|
|
.CE
|
|
The complete example experiment can be shown here:
|
|
.CS
|
|
{ stdenv, lib, stdexp, bsc, targetMachine, stages }:
|
|
with lib;
|
|
let
|
|
# Initial variable configuration
|
|
varConf = {
|
|
blocks = [ 1 2 4 ];
|
|
nodes = [ 1 ];
|
|
};
|
|
# Generate the complete configuration for each unit
|
|
genConf = c: targetMachine.config // rec {
|
|
expName = "example";
|
|
unitName = "${expName}-b${toString blocks}";
|
|
inherit (targetMachine.config) hw;
|
|
inherit (c) blocks nodes;
|
|
loops = 30;
|
|
ntasksPerNode = hw.socketPerNode;
|
|
cpusPerTask = hw.cpusPerSocket;
|
|
jobName = unitName;
|
|
};
|
|
# Compute the array of configurations
|
|
configs = stdexp.buildConfigs {
|
|
inherit varConf genConf;
|
|
};
|
|
exec = {nextStage, conf, ...}: stages.exec {
|
|
inherit nextStage;
|
|
argv = [ "--blocks" conf.blocks ];
|
|
};
|
|
program = {nextStage, conf, ...}: bsc.garlic.apps.example;
|
|
pipeline = stdexp.stdPipeline ++ [ exec program ];
|
|
in
|
|
stdexp.genExperiment { inherit configs pipeline; }
|
|
.CE
|
|
.\" ###################################################################
|
|
.NH 3
|
|
Adding the experiment to the index
|
|
.LP
|
|
The experiment file must be located in a named directory inside the
|
|
.I garlic/exp
|
|
directory. The name is usually the program name. Once the experiment is
|
|
placed in a nix file, it must be added to the index of experiments, so
|
|
it can be build. The index is hyerarchically organized as attribute
|
|
sets, with
|
|
.I exp
|
|
containing all the experiments;
|
|
.I exp.example
|
|
the experiments of the
|
|
.I example
|
|
program; and
|
|
.I exp.example.test1
|
|
referring to the
|
|
.I test1
|
|
experiment of the
|
|
.I example
|
|
program. Additional attributes can be added, like
|
|
.I exp.example.test1.variantA
|
|
to handle more details.
|
|
.PP
|
|
For this example we are going to use the attribute path
|
|
.I exp.example.test
|
|
and add it to the index, in the
|
|
.I garlic/exp/index.nix
|
|
file. We append to the end of the attribute set, the following
|
|
definition:
|
|
.CS
|
|
\&...
|
|
example = {
|
|
test = callPackage ./example/test.nix { };
|
|
};
|
|
}
|
|
.CE
|
|
The experiment can now be built with:
|
|
.CS
|
|
builder% nix-build -A exp.example.test
|
|
.CE
|
|
.\" ###################################################################
|
|
.NH 2
|
|
Recommendations
|
|
.PP
|
|
The complete results generally take a long time to be finished, so it is
|
|
advisable to design the experiments iteratively, in order to quickly
|
|
obtain some feedback. Some recommendations:
|
|
.BL
|
|
.LI
|
|
Start with one unit only.
|
|
.LI
|
|
Set the number of runs low (say 5) but more than one.
|
|
.LI
|
|
Use a small problem size, so the execution time is low.
|
|
.LI
|
|
Set the time limit low, so deadlocks are caught early.
|
|
.LE
|
|
.PP
|
|
As soon as the first runs are complete, examine the results and test
|
|
that everything looks good. You would likely want to check:
|
|
.BL
|
|
.LI
|
|
The resources where assigned as intended (nodes and CPU affinity).
|
|
.LI
|
|
No errors or warnings: look at stderr and stdout logs.
|
|
.LI
|
|
If a deadlock happens, it will run out of the time limit.
|
|
.LE
|
|
.PP
|
|
As you gain confidence over that the execution went as planned, begin
|
|
increasing the problem size, the number of runs, the time limit and
|
|
lastly the number of units. The rationale is that each unit that is
|
|
shared among experiments gets assigned the same hash. Therefore, you can
|
|
iteratively add more units to an experiment, and if they are already
|
|
executed (and the results were generated) is reused.
|
|
.\" ###################################################################
|
|
.bp
|
|
.NH 1
|
|
Post-processing
|
|
.LP
|
|
After the correct execution of an experiment the results are stored for
|
|
further investigation. Typically the time of the execution or other
|
|
quantities are measured and presented later in a figure (generally a
|
|
plot or a table). The
|
|
.I "postprocess pipeline"
|
|
consists of all the steps required to create a set of figures from the
|
|
results. Similarly to the execution pipeline where several stages run
|
|
sequentially,
|
|
.[
|
|
garlic execution
|
|
.]
|
|
the postprocess pipeline is also formed by multiple stages executed
|
|
in order.
|
|
.PP
|
|
The rationale behind dividing execution and postprocess is
|
|
that usually the experiments are costly to run (they take a long time to
|
|
complete) while generating a figure require less time. Refining the
|
|
figures multiple times reusing the same experimental results doesn't
|
|
require the execution of the complete experiment, so the experimenter
|
|
can try multiple ways to present the data without waiting a large delay.
|
|
.NH 2
Results
.LP
The results are generated in the same
.I "target"
machine where the experiment is executed and are stored in the garlic
\fCout\fP
directory, organized into a tree structure following the experiment
name, the unit name and the run number (governed by the
.I control
stage):
.DS L
\fC
|-- 6lp88vlj7m8hvvhpfz25p5mvvg7ycflb-experiment
| |-- 8lpmmfix52a8v7kfzkzih655awchl9f1-unit
| | |-- 1
| | | |-- stderr.log
| | | |-- stdout.log
| | | |-- ...
| | |-- 2
\&...
\fP
.DE
In order to provide easier access to the results, an index is also
created by taking the
.I expName
and
.I unitName
attributes (defined in the experiment configuration) and linking them to
the appropriate experiment and unit directories. These links are
overwritten by the last experiment with the same names, so they are only
valid for the last execution. The out and index directories are
placed into a per-user directory, as we cannot guarantee the complete
execution of each unit when multiple users share units.
.PP
The messages printed to
.I stdout
and
.I stderr
are stored in the log files with the same name inside each run
directory. Additional data is sometimes generated by the experiments,
and is found in each run directory. As the generated data can be very
large, it is ignored by default when fetching the results.
.NH 2
Fetching the results
.LP
Consider a program of interest for which an experiment has been designed
to measure some properties that the experimenter wants to present in a
visual plot. When the experiment is launched, the execution
pipeline (EP) is completely executed and it will generate some
results. In this scenario, the execution pipeline depends on the
program\[em]any changes in the program will cause nix to build the
pipeline again
using the updated program. The results will also depend on the
execution pipeline, as well as the postprocess pipeline (PP) and the
plot on the results. This chain of dependencies can be shown in the
following dependency graph:
.PS
circlerad=0.22;
linewid=0.3;
right
circle "Prog"
arrow
circle "EP"
arrow
circle "Result"
arrow
circle "PP"
arrow
circle "Plot"
.PE
Ideally, the dependencies should be handled by nix, so it can detect any
change and rebuild the necessary parts automatically. Unfortunately, nix
is not able to build the result as a derivation directly, as it requires
access to the
.I "target"
machine with several user accounts. In order to let several users reuse
the same results from a shared cache, we would like to use the
.I "nix store" .
.PP
To generate the results from the
experiment, we add some extra steps that must be executed manually:
.PS
circle "Prog"
arrow
diag=linewid + circlerad;
far=circlerad*3 + linewid*4
E: circle "EP"
R: circle "Result" at E + (far,0)
RUN: circle "Run" at E + (diag,-diag) dashed
FETCH: circle "Fetch" at R + (-diag,-diag) dashed
move to R.e
arrow
P: circle "PP"
arrow
circle "Plot"
arrow dashed from E to RUN chop
arrow dashed from RUN to FETCH chop
arrow dashed from FETCH to R chop
arrow from E to R chop
.PE
The run and fetch steps are provided by the helper tool
.I "garlic(1)" ,
which launches the experiment using the user credentials at the
.I "target"
machine and then fetches the results, placing them in a directory known
by nix. When the result derivation needs to be built, nix will look in
this directory for the results of the execution. If the directory is not
found, a message is printed suggesting that the user launch the
experiment, and the build process is stopped. When the result is
successfully built by any user, it is stored in the
.I "nix store"
and it won't need to be rebuilt again until the experiment changes, as
the hash only depends on the experiment and not on the contents of the
results.
.PP
Notice that this mechanism violates the deterministic nature of the nix
store, as from a given input (the experiment) we can generate different
outputs (each result from different executions). We knowingly relaxed
this restriction by providing a guarantee that the results are
equivalent and there is no need to execute an experiment more than once.
.PP
To force the execution of an experiment you can use the
.I rev
attribute, which is a number assigned to each experiment
and can be incremented to create copies that differ only in that
number. The experiment hash will change, but the experiment will be the
same, as long as the revision number is ignored along the execution
stages.
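.PP
For example, the revision can be kept as one more attribute of the unit
configuration (a sketch; where exactly
.I rev
is placed depends on the experiment):
.CS
genConf = c: targetMachine.config // rec {
  expName = "example";
  rev = 2;  # Bump to force a new execution
  \& ...
};
.CE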
.NH 2
Postprocess stages
.LP
Once the results are completely generated in the
.I "target"
machine, there are several stages required to build a set of figures:
.PP
.I fetch \[em]
waits until all the experiment units are completed and then executes the
next stage. This stage is performed by the
.I garlic(1)
tool using the
.I -F
option, and also reports the current state of the execution.
.PP
.I store \[em]
copies from the
.I target
machine into the nix store all log files generated by the experiment,
keeping the same directory structure. It tracks the execution state of
each unit and only copies the results once the experiment is complete.
Other files are ignored, as they are often very large and not required
for the subsequent stages.
.PP
.I timetable \[em]
converts the results of the experiment into an NDJSON file with one
line per run for each unit. Each line is a valid JSON object, containing
the
.I exp ,
.I unit
and
.I run
keys and the unit configuration (as a JSON object) in the
.I config
key. The execution time is captured from the standard output and is
added in the
.I time
key.
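.PP
An illustrative line of such a dataset (the values are made up, and the
line is wrapped here for display; real entries take one line per run):
.DS L
.SM
\fC
{"exp":"6lp88vlj...","unit":"8lpmmfix...","run":1,
 "config":{"blocks":2,"nodes":1},"time":1.234567e-01}
\fP
.DE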
.PP
.I merge \[em]
one or more timetable datasets are joined, by simply concatenating them.
This step allows building one dataset to compare multiple experiments in
the same figure.
.PP
.I rPlot \[em]
one or more figures are generated by a single R script
.[
r cookbook
.]
which takes as input the previously generated dataset.
The path of the dataset is recorded in the figure as well, which
contains enough information to determine all the stages in the execution
and postprocess pipelines.
.NH 2
Current setup
.LP
As of this moment, the
.I build
machine which contains the nix store is
.I xeon07
and the
.I "target"
machine used to run the experiments is MareNostrum 4, with the
.I output
directory placed at
.CW /gpfs/projects/bsc15/garlic .
By default, the experiment results are never deleted from the
.I target
machine, so you may want to remove from there the results already stored
in the nix store, to free space.
.\" ###################################################################
|
|
.bp
|
|
.SH 1
|
|
Appendix A: Branch name diagram
|
|
.LP
|
|
.TAG appendixA
|
|
.DS B
|
|
.SM
|
|
.PS 4.4/25.4
|
|
copy "gitbranch.pic"
|
|
.PE
|
|
.DE
|