# OpenMP model

The [OpenMP programming model](https://www.openmp.org) is a widely used API and
set of directives for parallel programming, allowing developers to write
multi-threaded and multi-process applications more easily. In this document we
refer to [version 5.2 of the OpenMP specification](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf).

The [LLVM OpenMP Runtime](https://openmp.llvm.org/design/Runtimes.html) provides
an implementation of the OpenMP specification as a component of the LLVM
compiler infrastructure. We have modified the LLVM OpenMP runtime (libomp) to
run on top of the [nOS-V](https://gitlab.bsc.es/nos-v/nos-v) runtime; this
modified runtime, named **libompv**, is distributed as part of the
[OmpSs-2 LLVM compiler](https://pm.bsc.es/llvm-ompss).

We have added instrumentation events to libompv, designed to be enabled
alongside the [nOS-V instrumentation](nosv.md). This document describes all the
instrumentation features included in our modified libompv runtime, which help
identify what the runtime is doing at any point of the execution. This data is
useful for both users and developers of the OpenMP runtime to analyze issues
and undesired behaviors.

!!! Note

    Instrumenting libomp is planned but not yet possible.
    For now you must use libompv.

## Enable the instrumentation

To generate runtime traces, you will have to:

1. **Build nOS-V with ovni support:** Refer to the
  [nOS-V documentation](https://github.com/bsc-pm/nos-v/blob/master/docs/user/tracing.md).
  Typically you should use the `--with-ovni` option at configure time to specify
  where ovni is installed.
2. **Build libompv with ovni and nOS-V support:** Use the `PKG_CONFIG_PATH`
  environment variable to point to the nOS-V and ovni installations
  when configuring CMake.
3. **Enable the instrumentation in nOS-V at runtime:** Refer to the
  [nOS-V documentation](https://github.com/bsc-pm/nos-v/blob/master/docs/user/tracing.md)
  to find out how to enable the tracing at runtime. Typically you can just set
  `NOSV_CONFIG_OVERRIDE="instrumentation.version=ovni"`.
4. **Enable the instrumentation of libompv at runtime:** Set the environment
  variable `OMP_OVNI=1` (a minimal test program is sketched below).

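As a quick check that everything is in place, a small program such as the
following sketch can be traced. The run command in the comment is only an
example; the compilation flags and paths depend on how the OmpSs-2 LLVM
compiler, nOS-V and ovni are installed on your system.

```c
/*
 * Minimal OpenMP program to check that traces are generated (a sketch,
 * not part of the libompv sources). Compile it with the OmpSs-2 LLVM
 * clang with OpenMP support enabled, so that it links against libompv
 * (the exact flags depend on your installation), then run it with,
 * for example:
 *
 *   NOSV_CONFIG_OVERRIDE="instrumentation.version=ovni" OMP_OVNI=1 ./test
 *
 * After the run, an ovni trace directory should appear in the current
 * working directory, ready to be processed by the ovni emulator.
 */
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 100; i++) {
            #pragma omp task
            printf("task %d\n", i);
        }
        #pragma omp taskwait
    }

    return 0;
}
```
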
The next sections describe each of the views included for analysis.

## Subsystem view

The view is complemented with the information of the [nOS-V views](nosv.md),
as libompv uses nOS-V tasks to run the workers. The subsystem view illustrates
the activity of each thread with the following states:

- **Work-distribution subsystem**: Related to work-distribution constructs,
    [in Chapter 11][workdis].

    - **Distribute**: Running a *Distribute* region.

    - **Dynamic for chunk**: Running a chunk of a dynamic *for*, which often
      involves running more than one iteration of the loop. See the
      [limitations](#dynamic-for) below.

    - **Dynamic for initialization**: Preparing a dynamic *for*.

    - **Static for chunk**: Executing the assigned iterations of a static
      *for*.

    - **Single**: Running a *Single* region. All threads of the parallel region
      participate.

    - **Section**: Running a *Section* region. All threads of the parallel region
      participate.

- **Task subsystem**: Related to tasking constructs, [in Chapter 12][tasking].

    - **Allocation**: Allocating the task descriptor.

    - **Check deps**: Checking if the task has pending dependencies to be
      fulfilled. When all dependencies are fulfilled the task will be scheduled.

    - **Duplicating**: Duplicating the task descriptor in a taskloop.

    - **Releasing deps**: Releasing dependencies at the end of a task. This
      state is always present even if the task has no dependencies.

    - **Running task**: Executing a task.

    - **Running task if0**: Executing a task if0.

    - **Scheduling**: Adding the task to the scheduler for execution.

    - **Taskgroup**: Waiting in a *taskgroup* construct.

    - **Taskwait**: Waiting in a *taskwait* construct.

    - **Taskwait deps**: Trying to execute tasks until dependencies have been
      fulfilled. This typically appears in a task if0 with dependencies or a
      taskwait with deps.

    - **Taskyield**: Performing a *taskyield* construct.

- **Critical subsystem**: Related to the *critical* construct, in [Section 15.2][critical].

    - **Acquiring**: Waiting to acquire a *Critical* section.

    - **Section**: Running the *Critical* section.

    - **Releasing**: Waiting to release a *Critical* section.

- **Barrier subsystem**: Related to barriers, in [Section 15.3][barrier].
    **All barriers can try to execute tasks**.

    - **Barrier: Fork**: Workers wait for a release signal from the master thread to
      continue. The master can continue as soon as it signals the workers. It is
      done at the beginning of a fork-join region.

    - **Barrier: Join**: The master thread waits until all workers finish their work.
      Workers can continue as soon as they signal the master. It is done at the
      end of a fork-join region.

    - **Barrier: Plain**: Performing a plain barrier, which waits for a release
      signal from the master thread to continue. It is done at the beginning of
      a fork-join region, in the `__kmp_join_barrier()` function.

    - **Barrier: Task**: Blocked in an additional tasking barrier *until all previous
      tasks have been executed*. Only happens when executed with `KMP_TASKING=1`.

- **Runtime subsystem**: Internal operations of the runtime.

    - **Attached**: Present after the call to `nosv_attach()` and before
      `nosv_detach()`. This state is a hack.

    - **Fork call**: Preparing a parallel section using the fork-join model.
      Only called from the master thread.

    - **Init**: Initializing the libompv runtime.

    - **Internal microtask**: Running an internal libompv function as a microtask.

    - **User microtask**: Running user code as a microtask in a worker thread.

    - **Worker main Loop**: Running the main loop, where the workers run the
      fork barrier, run a microtask and perform a join barrier until there is no
      more work.

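As a rough illustration of how these states relate to user code, consider the
following sketch. The mapping in the comments is an interpretation of the state
descriptions above, not an exact reproduction of a real trace.

```c
/* Sketch: approximate subsystem states for a simple OpenMP code. */
#include <stdio.h>

int main(void)
{
    /* Master thread: "Runtime: Fork call" while preparing the region;
     * workers run their "Runtime: Worker main Loop". */
    #pragma omp parallel
    {
        /* Worker threads: "Runtime: User microtask" while running this body. */

        #pragma omp for schedule(dynamic, 4)
        for (int i = 0; i < 1000; i++) {
            /* "Work-distribution: Dynamic for initialization" while the loop
             * is prepared, then "Dynamic for chunk" for each group of 4
             * iterations. */
        }

        #pragma omp single
        {
            /* All threads participate in the construct
             * ("Work-distribution: Single"); one of them runs this block. */
            int x = 0;

            #pragma omp task shared(x) depend(out: x)
            x = 42; /* Creator: "Task: Allocation", "Check deps" and
                     * "Scheduling"; executor: "Task: Running task". */

            #pragma omp task shared(x) depend(in: x)
            printf("x = %d\n", x);

            #pragma omp taskwait /* "Task: Taskwait" */
        }
    }
    /* Master thread: "Barrier: Join" while waiting for the workers at the
     * end of the fork-join region. */
    return 0;
}
```
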
!!! Note

    The generated HTML version of the OpenMP 5.2 specification has some parts
    missing, so we link directly to the PDF file which may not work in some
    browsers.

[workdis]:  https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#chapter.11
[tasking]:  https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#chapter.12
[critical]: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#section.15.2
[barrier]:  https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#section.15.3

## Label view

The label view displays the text in the `label()` clause of OpenMP
tasks and work-distribution constructs (static and dynamic for, single
and section). When the label is not provided, the source file and line
location is used instead.

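For example, in a code like the following sketch (assuming the clause takes a
quoted string, as in OmpSs-2), the first task would show up with the text
`init_a`, while the second one would fall back to its source file and line:

```c
void init(double *a, double *b, int n)
{
    /* Shown in the label view as "init_a". */
    #pragma omp task label("init_a")
    for (int i = 0; i < n; i++)
        a[i] = 1.0;

    /* No label: shown with the source file and line of the task instead. */
    #pragma omp task
    for (int i = 0; i < n; i++)
        b[i] = 2.0;

    #pragma omp taskwait
}
```
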
When nesting multiple tasks or work-distribution constructs, only the
innermost label is shown.

Note that in this view, the numeric event value is a hash of the label
text, so two distinct tasks (declared in different parts of the code)
with the same label will share the event value and have the same color.

## Task ID view

The task ID view represents the numeric ID of the OpenMP task that is
currently running on each thread. The ID is a monotonically increasing
identifier assigned on task creation. Lower IDs correspond to tasks
created earlier than tasks with higher IDs.

## Breakdown (simple)

A simplified breakdown view is generated when the emulator is run with the
`-b` flag; the trace is stored in `openmp-breakdown.prv`. This breakdown view
selects the label when it has a value or the subsystem otherwise. The view is
sorted so that rows with the same values are grouped together.

Notice that, unlike nOS-V or Nanos6, we don't yet include the information about
whether the runtime is waiting or making progress, but some of it can be
inferred from the subsystem states.

## Limitations

As the compiler generates the code that performs the calls to the libompv
runtime, there are some parts of the execution that are complicated to
instrument by just placing a pair of events to delimit a function.

For those cases we use an approximation, which is documented in the following
subsections.

### Dynamic for

The generated code of a *dynamic for* has the following structure:

```c
__kmpc_dispatch_init_4(...);
while (__kmpc_dispatch_next_4(...)) {
    for (i = ...; i <= ...; i++) {
        // User code ...
    }
}
```

The function `__kmpc_dispatch_next_4()` returns `true` if there are more
chunks (groups of iterations) to be executed by the thread, otherwise it
returns `false`.

Ideally we want to instrument each chunk with a pair of begin and end events.

The problem is that there is no easy way of determining whether a call to
`__kmpc_dispatch_next_4()` is about to process the first chunk, just after
`__kmpc_dispatch_init_4()`, or is coming from a previous chunk via the while
loop.

Therefore, from `__kmpc_dispatch_next_4()` alone, we cannot determine whether
we need to emit only a single "begin a new chunk" event or the pair of events
"finish the last chunk" and "begin a new one".

So, as a workaround, we emit an event from the end of `__kmpc_dispatch_init_4()`
starting a new chunk (which is fake), and then from `__kmpc_dispatch_next_4()` we
always emit the "finish the last chunk" and "begin a new one" events (unless
there are no more chunks, in which case we don't emit the "begin a new one"
event).

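The following sketch marks where the events end up being emitted with this
workaround; the comments are descriptive and do not correspond to actual ovni
event names.

```c
__kmpc_dispatch_init_4(...);          // at its end: emit "begin a new chunk"
                                      // (the fake first chunk)
while (__kmpc_dispatch_next_4(...)) { // in next: emit "finish the last chunk"
                                      // and, if another chunk was obtained,
                                      // "begin a new one"
    for (i = ...; i <= ...; i++) {
        // User code ...
    }
}
```
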
This will cause a spurious *Work-distribution: Dynamic for chunk* state at the
beginning of each dynamic for, which should be very short and is not really a
chunk.

### Static for

The generated code of a *static for* has the following structure:

```c
__kmpc_for_static_init_4(...);
for (i = ...; i <= ...; i++) {
    // User code ...
}
__kmpc_for_static_fini(...);
```

As this code is generated by the compiler, we cannot easily add the begin/end
pair of events to mark the *Work-distribution: Static for chunk* state.

We assume that placing the "begin processing a chunk" event at the end of
`__kmpc_for_static_init_4()` and the "end processing the chunk" event at
the beginning of `__kmpc_for_static_fini()` is equivalent to adding the
events surrounding the for loop.

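That is, the events are placed as sketched below (again, the comments are only
descriptive):

```c
__kmpc_for_static_init_4(...);  // at its end: emit "begin processing a chunk"
for (i = ...; i <= ...; i++) {
    // User code ...
}
__kmpc_for_static_fini(...);    // at its beginning: emit "end processing the chunk"
```
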
### Task if0

The generated code of an *if0 task* has the following structure:

```c
... = __kmpc_omp_task_alloc(...);
__kmpc_omp_taskwait_deps_51(...); // If task has dependencies
__kmpc_omp_task_begin_if0(...);
// Call to the user code
omp_task_entry_(...);
__kmpc_omp_task_complete_if0(...);
```

Instead of injecting the begin and end events in the user code, we
approximate it by placing the "begin if0 task" event at the end of the
`__kmpc_omp_task_begin_if0` function and the "end if0 task" event at the
beginning of `__kmpc_omp_task_complete_if0`. This state will be shown as
*Task: Running task if0*.

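The approximate placement of the events can be pictured as follows; the
comments are descriptive and do not name real ovni events.

```c
... = __kmpc_omp_task_alloc(...);
__kmpc_omp_taskwait_deps_51(...);  // If task has dependencies ("Task: Taskwait deps")
__kmpc_omp_task_begin_if0(...);    // at its end: emit "begin if0 task"
// Call to the user code
omp_task_entry_(...);              // shown as "Task: Running task if0"
__kmpc_omp_task_complete_if0(...); // at its beginning: emit "end if0 task"
```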