# OpenMP model

The [OpenMP programming model](https://www.openmp.org) is a widely used API
and set of directives for parallel programming, allowing developers to write
multi-threaded and multi-process applications more easily. In this document we
refer to
[version 5.2 of the OpenMP specification](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf).

The [LLVM OpenMP Runtime](https://openmp.llvm.org/design/Runtimes.html)
provides an implementation of the OpenMP specification as a component of the
LLVM compiler infrastructure. As part of the
[OmpSs-2 LLVM compiler](https://pm.bsc.es/llvm-ompss), we have modified the
LLVM OpenMP runtime to run on top of the
[nOS-V](https://gitlab.bsc.es/nos-v/nos-v) runtime; this modified runtime is
named **OpenMP-V**.

We have added instrumentation events to OpenMP-V, designed to be enabled
together with the [nOS-V instrumentation](nosv.md). This document describes
the instrumentation features included in our modified OpenMP-V runtime, which
help identify what is happening during execution. This data is useful for
both users and developers of the OpenMP runtime to analyze issues and
undesired behaviors.

!!! Note

    Instrumenting the original OpenMP runtime from the LLVM project is planned
    but is not yet possible. For now you must use the modified OpenMP-V
    runtime with nOS-V.

## Enable the instrumentation

To generate runtime traces, you will have to:

1. **Build nOS-V with ovni support:** Refer to the
   [nOS-V documentation](https://github.com/bsc-pm/nos-v/blob/master/docs/user/tracing.md).
   Typically you should use the `--with-ovni` option at configure time to
   specify where ovni is installed.
2. **Build OpenMP-V with ovni and nOS-V support:** Use the `PKG_CONFIG_PATH`
   environment variable to specify the nOS-V and ovni installations when
   configuring CMake.
3. **Enable the instrumentation in nOS-V at runtime:** Refer to the
   [nOS-V documentation](https://github.com/bsc-pm/nos-v/blob/master/docs/user/tracing.md)
   to find out how to enable tracing at runtime. Typically you can just set
   `NOSV_CONFIG_OVERRIDE="instrumentation.version=ovni"`.
4. **Enable the instrumentation of OpenMP-V at runtime:** Set the environment
   variable `OMP_OVNI=1`.

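As an illustration, the four steps could look like the following shell
session. The installation prefixes (`/opt/nosv`, `/opt/ovni`) and the
application name are placeholders you must adapt to your system; refer to the
linked documentation for the authoritative options.

```shell
# 1. Build nOS-V with ovni support (prefixes are hypothetical).
./configure --prefix=/opt/nosv --with-ovni=/opt/ovni
make && make install

# 2. Let CMake find nOS-V and ovni via pkg-config when building OpenMP-V.
export PKG_CONFIG_PATH=/opt/nosv/lib/pkgconfig:/opt/ovni/lib/pkgconfig

# 3. Enable the nOS-V instrumentation at runtime.
export NOSV_CONFIG_OVERRIDE="instrumentation.version=ovni"

# 4. Enable the OpenMP-V instrumentation and run the application.
export OMP_OVNI=1
./my_openmp_app
```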
Currently there is only support for the subsystem view, which is documented
below. The view is complemented by the [nOS-V views](nosv.md), as OpenMP-V
uses nOS-V tasks to run the workers.

## Subsystem view

![Subsystem view example](fig/openmp-subsystem.png)

This view illustrates the activities of each thread with different states:

- **Work-distribution subsystem**: Related to work-distribution constructs,
  [in Chapter 11][workdis].

    - **Distribute**: Running a *Distribute* region.

    - **Dynamic for chunk**: Running a chunk of a dynamic *for*, which often
      involves running more than one iteration of the loop. See the
      [limitations](#dynamic-for) below.

    - **Dynamic for initialization**: Preparing a dynamic *for*.

    - **Static for chunk**: Executing the assigned iterations of a static
      *for*.

    - **Single**: Running a *Single* region. All threads of the parallel
      region participate.

    - **Section**: Running a *Section* region. All threads of the parallel
      region participate.

- **Task subsystem**: Related to tasking constructs, [in Chapter 12][tasking].

    - **Allocation**: Allocating the task descriptor.

    - **Check deps**: Checking if the task has pending dependencies to be
      fulfilled. When all dependencies are fulfilled the task will be
      scheduled.

    - **Duplicating**: Duplicating the task descriptor in a taskloop.

    - **Releasing deps**: Releasing dependencies at the end of a task. This
      state is always present even if the task has no dependencies.

    - **Running task**: Executing a task.

    - **Running task if0**: Executing a task if0.

    - **Scheduling**: Adding the task to the scheduler for execution.

    - **Taskgroup**: Waiting in a *taskgroup* construct.

    - **Taskwait**: Waiting in a *taskwait* construct.

    - **Taskwait deps**: Trying to execute tasks until dependencies have been
      fulfilled. This typically appears in a task if0 with dependencies or a
      taskwait with deps.

    - **Taskyield**: Performing a *taskyield* construct.

- **Critical subsystem**: Related to the *critical* construct, in
  [Section 15.2][critical].

    - **Acquiring**: Waiting to acquire a *Critical* section.

    - **Section**: Running the *Critical* section.

    - **Releasing**: Waiting to release a *Critical* section.

- **Barrier subsystem**: Related to barriers, in [Section 15.3][barrier].
  **All barriers can try to execute tasks**.

    - **Barrier: Fork**: Workers wait for a release signal from the master
      thread to continue. The master can continue as soon as it signals the
      workers. It is done at the beginning of a fork-join region.

    - **Barrier: Join**: The master thread waits until all workers finish
      their work. Workers can continue as soon as they signal the master. It
      is done at the end of a fork-join region.

    - **Barrier: Plain**: Performing a plain barrier, which waits for a
      release signal from the master thread to continue. It is done at the
      beginning of a fork-join region, in the `__kmp_join_barrier()` function.

    - **Barrier: Task**: Blocked in an additional tasking barrier *until all
      previous tasks have been executed*. Only happens when executed with
      `KMP_TASKING=1`.

- **Runtime subsystem**: Internal operations of the runtime.

    - **Attached**: Present after the call to `nosv_attach()` and before
      `nosv_detach()`. This state is a hack.

    - **Fork call**: Preparing a parallel section using the fork-join model.
      Only called from the master thread.

    - **Init**: Initializing the OpenMP-V runtime.

    - **Internal microtask**: Running an internal OpenMP-V function as a
      microtask.

    - **User microtask**: Running user code as a microtask in a worker
      thread.

    - **Worker main loop**: Running the main loop, where the workers run the
      fork barrier, run a microtask and perform a join barrier until there is
      no more work.

!!! Note

    The generated HTML version of the OpenMP 5.2 specification has some parts
    missing, so we link directly to the PDF file, which may not work in some
    browsers.

[workdis]: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#chapter.11
[tasking]: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#chapter.12
[critical]: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#section.15.2
[barrier]: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf#section.15.3

## Limitations

As the compiler generates the code that performs the calls to the OpenMP-V
runtime, there are some parts of the execution that are complicated to
instrument by just placing a pair of events delimiting a function.

For those cases we use an approximation, documented in the following
subsections.

### Dynamic for

The generated code of a *dynamic for* has the following structure:

```c
__kmpc_dispatch_init_4(...);
while (__kmpc_dispatch_next_4(...)) {
    for (i = ...; i <= ...; i++) {
        // User code ...
    }
}
```

The function `__kmpc_dispatch_next_4()` returns `true` if there are more
chunks (groups of iterations) to be executed by the thread; otherwise it
returns `false`.

Ideally we want to instrument each chunk with a pair of begin and end events.

The problem with the instrumentation is that there is no easy way of
determining whether the call to `__kmpc_dispatch_next_4()` is processing the
first chunk, just after `__kmpc_dispatch_init_4()`, or coming from other
chunks due to the while loop.

Therefore, from `__kmpc_dispatch_next_4()` alone, we cannot determine whether
we need to emit only a single "begin a new chunk" event, or the pair of events
"finish the last chunk" and "begin a new one".

So, as a workaround, we emit a "begin a new chunk" event (which is fake) from
the end of `__kmpc_dispatch_init_4()`, and then from
`__kmpc_dispatch_next_4()` we always emit the "finish the last chunk" and
"begin a new one" events (unless there are no more chunks, in which case we
don't emit the "begin a new one" event).

This causes a spurious *Work-distribution: Dynamic for chunk* state at the
beginning of each dynamic for, which should be very short and is not really a
chunk.

### Static for

The generated code of a *static for* has the following structure:

```c
__kmpc_for_static_init_4(...);
for (i = ...; i <= ...; i++) {
    // User code ...
}
__kmpc_for_static_fini(...);
```

As this code is generated by the compiler, we cannot easily add the begin/end
pair of events to mark the *Work-distribution: Static for chunk* state.

We assume that placing the "begin processing a chunk" event at the end of
`__kmpc_for_static_init_4()` and the "end processing the chunk" event at the
beginning of `__kmpc_for_static_fini()` is equivalent to adding the events
surrounding the for loop.

### Task if0

The generated code of an *if0 task* has the following structure:

```c
... = __kmpc_omp_task_alloc(...);
__kmpc_omp_taskwait_deps_51(...); // If the task has dependencies
__kmpc_omp_task_begin_if0(...);
// Call to the user code
omp_task_entry_(...);
__kmpc_omp_task_complete_if0(...);
```

Instead of injecting the begin and end events in the user code, we approximate
them by placing the "begin if0 task" event at the end of the
`__kmpc_omp_task_begin_if0` function and the "end if0 task" event at the
beginning of `__kmpc_omp_task_complete_if0`. This state is shown as
*Task: Running task if0*.