.\"usage: NS title
.de NS \" New Slide
.SK
.ev gp-top
.fam H
.vs 1.5m
.ll \\n[@ll]u
.lt \\n[@ll]u
.rs
.sp 2v
.ps +5
\\$*
.ps -5
.sp 1.5v
.br
.ev
..
.\" Remove headers
.de TP
..
.\" Bigger page number in footer
.de EOP
.fam H
.ps +2
. ie o .tl \\*[pg*odd-footer]
. el .tl \\*[pg*even-footer]
. ds hd*format \\g[P]
. af P 0
. ie (\\n[P]=1)&(\\n[N]=1) .tl \\*[pg*header]
. el .tl \\*[pg*footer]
. af P \\*[hd*format]
. tl ''\\*[Pg_type!\\n[@copy_type]]''
..
.\" Remove top and bottom margin
.VM 0 0
.\"
.\"
.\" Set virtual page dimensions for a physical size of 16x12 cm
.PGFORM 14c 12c 1c 1
.ND "November 24, 2020"
.\" .vs 1.5m
.S C 1.5m
.fam H
.\".PH "'cosas'''"
.COVER ms
.de cov@print-date
.DS C
.fam H
.B
\\*[cov*new-date]
.DE
..
.TL
.ps 20
.fam H
Garlic update
.AF "Barcelona Supercomputing Center"
.AU "Rodrigo Arias Mallo"
.COVEND
.PF "'''%'"
.\" Turn off justification
.SA 0
.\".PF '''%'
.\"==================================================================
.NS "Changelog"
Important changes since the last meeting (2020-09-23)
.BL
.LI
Execution of experiments is now \fBisolated\fP: no $HOME or /usr at run time
.LI
Added a \fBpostprocess\fP pipeline
.LI
New \fBgarlic(1)\fP helper tool (manual included)
.LI
A plot has an experiment result as \fBdependency\fP
.LI
Experiments run on demand based on article \fBfigures\fP
.LI
Fast pkg overrides (MPI)
.LE 1
.\"==================================================================
.NS "Execution pipeline (review)"
.2C
List of stages required to run the program of the experiment:
.BL
.S -1
.LI
The
.B target
column shows where the stage runs.
.LI
.B Safe
states whether the stage begins its execution inside the isolated namespace.
.LI
.B User
states whether it can be executed directly by the user.
.LI
.B Copies
states whether several instances run in parallel.
.LI
.B Std
states whether it is part of the standard execution pipeline.
.LE
.S P P
.\" Go to the next column
.NCOL
.KF
.defcolor white rgb #FFFFFF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
\m[white]\(rh\m[]\
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBtrebuchet\fP: connects via ssh to the target machine and executes the
next stage there.
.P
The target machine is set to MN4, which by default uses the host
\fBmn1\fP.
.P
Literally:
.P
.VERBON
ssh mn1 /path/to/next/stage
.VERBOFF
.P
You need to define the ssh config to be able to connect to mn1.
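.P
A minimal sketch of such an entry in ~/.ssh/config; the host name, user
and key path are assumptions, adjust them to your account:
.P
.VERBON
# Hypothetical values, not from garlic
Host mn1
    HostName mn1.bsc.es
    User bscXXXXX
    IdentityFile ~/.ssh/id_rsa
.VERBOFF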
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
\(rh \fBtrebuchet\fP xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBrunexp\fP: sets a few \fCGARLIC_*\fP environment variables used by the
benchmark and changes the current directory to the \fBout\fP directory.
.P
At build time, the next stages don't know these values (cyclic dependency),
so they are populated at execution time.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
\(rh \fBrunexp\fP login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBisolate\fP: once on the target machine, we enter an isolated
namespace to load the nix store.
.P
Notice that this and the previous stages require the \fBsh\fP shell to be
available on the target machine.
.P
They are not \fBsafe\fP, as they run code from the target machine.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
\(rh \fBisolate\fP login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBexperiment\fP: runs several units sequentially.
.P
Defines the \fCGARLIC_EXPERIMENT\fP environment variable.
.P
Creates a directory for the experiment and changes the current directory
there.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
\(rh \fBexperiment\fP login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBunit\fP: creates an index entry for the unit and the experiment.
.P
Creates a directory for the unit and changes the current directory
there.
.P
Copies the unit configuration into the \fCgarlic_config.json\fP file.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
\(rh \fBunit\fP login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBsbatch\fP: allocates resources and executes the next stage in the
first node.
.P
The execve call is performed by a SLURM daemon, so it runs \fBout\fP of the
isolated environment.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
\(rh \fBsbatch\fP login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBisolate\fP: enters the isolated namespace again, with the nix store.
.P
Notice that we are now running in the compute node allocated by SLURM.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
\(rh \fBisolate\fP comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBcontrol\fP: runs the next stage several times.
.P
It is controlled by the \fCloops\fP attribute, which specifies the number
of runs.
.P
Creates a directory with the number of the run and enters it.
.P
Generated results are placed in this directory.
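.P
In shell-like pseudocode, the behaviour is roughly the following. This is
only a sketch, not the actual stage script: the variable name, the
numbering from 1 and the path are assumptions.
.P
.VERBON
# loops: number of runs (assumed)
for n in $(seq 1 $loops); do
  mkdir -p $n
  (cd $n; /path/to/next/stage)
done
.VERBOFF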
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
\(rh \fBcontrol\fP comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBsrun\fP: launches the tasks in the compute nodes and sets the
affinity.
.P
From here on, all stages are executed in parallel for each task.
.P
The srun program also forks from a SLURM daemon, exiting the
previous isolated namespace.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
\(rh \fBsrun\fP comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBisolate\fP: enters the isolated namespace again.
.P
Now we are ready to execute the program of the experiment.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
\(rh \fBisolate\fP comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBexec\fP: sets the environment variables and argv of the program.
.P
Additional commands can be specified in the \fCpre\fP and \fCpost\fP
attributes.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
\(rh \fBexec\fP comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
\fBprogram\fP: the path to the program itself.
.P
This stage can be used to make some changes:
.BL
.LI
Set the MPI implementation of all dependencies.
.LI
Pass build options.
.LI
Use custom packages (nanos6 with jemalloc).
.LE
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
\(rh \fBprogram\fP comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Execution stages"
.2C
The \fCstdexp.nix\fP file defines the standard pipeline. The last two stages
are usually added to complete the pipeline:
.P
.VERBON
pipeline = stdPipeline ++
[ exec program ];
.VERBOFF
.P
Any stage can be modified to fit a custom experiment.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_ _ _ _ _ _
\m[white]\(rh\m[]\
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Isolated execution"
.2C
The filesystem is \fBnow\fP isolated to prevent irreproducible
scenarios.
.P
The nix store is mounted at /nix and only a few other paths are
available, such as:
.BL
.S -1 1m
.LI
/var/run/munge (required for SLURM)
.LI
/dev, /sys, /proc for MPI comm
.LI
/etc for hosts (FIXME)
.LI
/gpfs/projects/bsc15 to store data
.LE
.S P P
.P
Additional mounts can be requested by using the \fCextraMounts\fP
attribute.
.\" Go to the next column
.NCOL
.KF
.S 8 14p
.\".S C +0.2v
.TS
center expand;
lB lB cB cB cB cB cB
lB lB cB cB cB cB cB
r lw(5.5m) c c c c c.
_ _ _ _ _ _
Stage Target Safe Copies User Std
_ _ _ _ _ _
trebuchet xeon no no yes yes
runexp login no no yes yes
\(rh \fBisolate\fP login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_ _ _ _ _ _
\(rh \fBisolate\fP comp no no no yes
control comp yes no no yes
srun comp yes no no yes
\(rh \fBisolate\fP comp no yes no yes
_ _ _ _ _ _
exec comp yes yes no no
program comp yes yes no no
_ _ _ _ _ _
.TE
.S P P
.KE
.1C
.\"==================================================================
.NS "Generating figures"
The postprocess pipeline takes the results of the execution and produces
figures or tables to be used in a publication.
.DS CB
.PS 5.3
circlerad=0.3;
ellipsewid=1.2;
linewid=0.3;
boxwid=1;
right
box "Experiment"
arrow
ellipse "Execution"
arrow
box "Result"
arrow
ellipse "Postprocess"
arrow
box "Figure"
.PE
.DE
.P
Once the results are available, multiple figures can be created without
re-running the experiment.
.P
The postprocess pipeline is \fBexperimental\fP; there is no standard
yet.
.\"==================================================================
.NS "Executing experiments"
.P
We cannot access MN4 from nix, as it has neither the SSH keys nor
network access when building derivations.
.P
The garlic(1) tool is used to run experiments and fetch the results. See
the manual for details.
.P
.VERBON
xeon07$ nix-build -A fig.nbody.small
\&...
/tmp/garlic/1qcc44lx2nxwi7rmr6389sksq3gwy9w5-experiment: not found
Run the experiment and fetch the results with:
\f[CB]garlic -RFv /nix/store/5zhmdzi5mf0mfsran74cxngn07ba522m-trebuchet\fP
See garlic(1) for more details.
cannot continue building /nix/store/jql4...2cb0-resultTree, aborting
.VERBOFF
.\"==================================================================
.NS "Executing experiments"
.P
To run an experiment use \fB-R\fP and provide the trebuchet path:
.P
.VERBON
xeon07$ garlic -Rv /nix/store/5zh...22m-trebuchet
Running experiment 1qcc...9w5-experiment
sbatch: error: spank: x11.so: Plugin file not found
Submitted batch job 12719522
\&...
xeon07$
.VERBOFF
.P
Once the experiment is submitted, you can leave the session: it will run
on MN4 automatically at some point.
.\"==================================================================
.NS "Executing experiments"
.P
To wait and fetch the results, use \fB-F\fP:
.P
.VERBON
xeon07$ garlic -Fv /nix/store/5zhmd...522m-trebuchet
/mnt/garlic/bsc15557/out/1qc...9w5-experiment: checking units
3qnm6drx5y95kxrr43gnwqz8v4x641c7-unit: running 7 of 10
awd3jzbcw0cwwvjrcrxzjvii3mgj663d-unit: completed
bqnnrwcbcixag0dfflk1zz34zidk97nf-unit: no status
l32097db7hbggvj7l5hz44y1glzz6jcy-unit: no status
n1a26qa13fdz0ih1gg1m0wfcybs71hm9-unit: completed
rywcwvnpz3mk0gyp5dzk94by3q1h3ljp-unit: completed
yl8ygadghd1fyzjwab3csd8hq1q93cw3-unit: completed
\&...
/mn...w5-experiment: \f[CB]execution complete, fetching results\fP
these derivations will be built:
/nix/store/mqdr...q4z-resultTree.drv
\&...
\f[CB]/nix/store/jql41hms1dr49ipbjcw41i4dj4pq2cb0-resultTree\fP
.VERBOFF
.\"==================================================================
.NS "Execution"
The dependency graph shows the role of the garlic tool:
.DS CB
.PS
scale=1;
circlerad=0.25;
linewid=0.3;
diag=linewid + circlerad;
far=circlerad*3 + linewid*4
circle "Prog"
arrow
E: circle "EP"
R: circle "Result" at E + (far,0)
RUN: circle "Run" at E + (diag,-diag) dashed
FETCH: circle "Fetch" at R + (-diag,-diag) dashed
move to R.e
arrow
P: circle "PP"
arrow
circle "Plot"
arrow dashed from E to RUN chop
arrow dashed from RUN to FETCH chop
arrow dashed from FETCH to R chop
arrow from E to R chop
.PE
.DE
With the two pipelines:
.BL
.LI
EP: Execution pipeline
.LI
PP: Postprocess pipeline
.LE