diff --git a/garlic/doc/ug.mm b/garlic/doc/ug.mm
new file mode 100644
index 0000000..b0a4c48
--- /dev/null
+++ b/garlic/doc/ug.mm
@@ -0,0 +1,181 @@
+.COVER
+.TL
+Garlic: User guide
+.AF "Barcelona Supercomputing Center"
+.AU "Rodrigo Arias Mallo"
+.COVEND
+.H 1 "Overview"
+Dependency graph of a complete experiment that produces a figure. Each box
+is a derivation and arrows represent \fBbuild dependencies\fP.
+.DS CB
+.PS
+linewid=0.9;
+right
+box "Source" "code"
+arrow <-> "Develop" above
+box "Program"
+arrow <-> "Experiment" above
+box "Results"
+arrow <-> "Data" "exploration"
+box "Figures"
+.PE
+.DE
+.H 1 "Development"
+.P
+The development phase consists in creating a functional program by
+modifying the source code. This process is generally cyclic, where the
+developer needs to compile the program, correct mistakes and debug the
+program.
+.P
+It requires to be running in the target machine.
+.\" ===================================================================
+.H 1 "Experimentation"
+The experimentation phase begins with a functional program which is the
+object of study. The experimenter then designs an experiment aimed at
+measuring some properties of the program. The experiment is then
+executed and the results are stored for further analysis.
+.H 2 "Writing the experiment configuration"
+.P
+The term experiment is quite overloaded in this document. We are going
+to see how to write the recipe that describes the execution pipeline of
+an experiment.
+.P
+Within the garlic benchmark, experiments are typically sorted by a
+hierarchy depending on which application they belong. Take a look at the
+\fCgarlic/exp\fP directory and you will find some folders and .nix
+files.
+.P
+Each of those recipes files describe a function that returns a
+derivation, which, once built will result in the first stage script of
+the execution pipeline.
+.P
+The first part of states the name of the attributes required as the
+input of the function. Typically some packages, common tools and options:
+.DS I
+.VERBON
+{
+  stdenv
+, stdexp
+, bsc
+, targetMachine
+, stages
+, garlicTools
+}:
+.VERBOFF
+.DE
+.P
+Notice the \fCtargetMachine\fP argument, which provides information
+about the machine in which the experiment will run. You should write
+your experiment in such a way that runs in multiple clusters.
+.DS I
+.VERBON
+varConf = {
+  blocks = [ 1 2 4 ];
+  nodes = [ 1 ];
+};
+.VERBOFF
+.DE
+.P
+The \fCvarConf\fP is the attribute set that allows you to vary some
+factors in the experiment.
+.DS I
+.VERBON
+genConf = var: fix (self: targetMachine.config // {
+  expName = "example";
+  unitName = self.expName + "-b" + toString self.blocks;
+  blocks = var.blocks;
+  nodes = var.nodes;
+  cpusPerTask = 1;
+  tasksPerNode = self.hw.socketsPerNode;
+});
+.VERBOFF
+.DE
+.P
+The \fCgenConf\fP function is the central part of the description of the
+experiment. Takes as input \fBone\fP configuration from the cartesian
+product of
+.I varConfig
+and returns the complete configuration. In our case, it will be
+called 3 times, with the following inputs at each time:
+.DS I
+.VERBON
+{ blocks = 1; nodes = 1; }
+{ blocks = 2; nodes = 1; }
+{ blocks = 4; nodes = 1; }
+.VERBOFF
+.DE
+.P
+The return value can be inspected by calling the function in the
+interactive nix repl:
+.DS I
+.VERBON
+nix-repl> genConf { blocks = 2; nodes = 1; }
+{
+  blocks = 2;
+  cpusPerTask = 1;
+  expName = "example";
+  hw = { ... };
+  march = "skylake-avx512";
+  mtune = "skylake-avx512";
+  name = "mn4";
+  nixPrefix = "/gpfs/projects/bsc15/nix";
+  nodes = 1;
+  sshHost = "mn1";
+  tasksPerNode = 2;
+  unitName = "example-b2";
+}
+.VERBOFF
+.DE
+.P
+Some configuration parameters were added by
+.I targetMachine.config ,
+such as the
+.I nixPrefix ,
+.I sshHost
+or the
+.I hw
+attribute set, which are specific for the cluster they experiment is
+going to run. Also, the
+.I unitName
+got assigned the proper name based on the number of blocks, but the
+number of tasks per node were assigned based on the hardware description
+of the target machine.
+.P
+By following this rule, the experiments can easily be ported to machines
+with other hardware characteristics, and we only need to define the
+hardware details once. Then all the experiments will be updated based on
+those details.
+.H 2 "First steps"
+.P
+The complete results generally take a long time to be finished, so it is
+advisable to design the experiments iteratively, in order to quickly
+obtain some feedback. Some recommendations:
+.BL
+.LI
+Start with one unit only.
+.LI
+Set the number of runs low (say 5) but more than one.
+.LI
+Use a small problem size, so the execution time is low.
+.LI
+Set the time limit low, so deadlocks are caught early.
+.LE
+.P
+As soon as the first runs are complete, examine the results and test
+that everything looks good. You would likely want to check:
+.BL
+.LI
+The resources where assigned as intended (nodes and CPU affinity).
+.LI
+No errors or warnings: look at stderr and stdout logs.
+.LI
+If a deadlock happens, it will run out of the time limit.
+.LE
+.P
+As you gain confidence over that the execution went as planned, begin
+increasing the problem size, the number of runs, the time limit and
+lastly the number of units. The rationale is that each unit that is
+shared among experiments gets assigned the same hash. Therefore, you can
+iteratively add more units to an experiment, and if they are already
+executed (and the results were generated) is reused.
+.TC