ovni/doc/user/emulation/tampi.md

145 lines
7.6 KiB
Markdown
Raw Normal View History

2023-08-18 12:33:01 +02:00
# TAMPI model
The Task-Aware MPI (TAMPI) library extends the functionality of standard MPI
libraries by providing new mechanisms for improving the interoperability between
parallel task-based programming models, such as OpenMP and OmpSs-2, and MPI
communications. This library allows the safe and efficient execution of MPI
operations from concurrent tasks and guarantees the transparent management and
progress of these communications.
[tampi repo]: https://github.com/bsc-pm/tampi
[tampi docs]: https://github.com/bsc-pm/tampi#readme
[tampi blk]: https://github.com/bsc-pm/tampi#blocking-mode-ompss-2
[tampi nonblk]: https://github.com/bsc-pm/tampi#non-blocking-mode-openmp--ompss-2
The TAMPI library has instrumented the execution of its task-aware functions
with ovni. To obtain an instrumented library, TAMPI must be built passing the
`--with-ovni` configure option and specifying the ovni installation prefix. At
run-time, the user can enable the instrumentation by defining the environment
variable `TAMPI_INSTRUMENT=ovni`.
For more information regarding TAMPI or how to enable its instrumentation see
the TAMPI [repository][tampi repo] and [documentation][tampi docs].
TAMPI is instrumented to track the execution path inside the run-time library
to identify what is happening at each moment. This information can be used by
both users and developers to analyze problems or to better understand the
execution behavior of TAMPI communications and its background services. There is
one view generated to achieve this goal.
## Subsystem view
The subsystem view attempts to provide a general overview of what TAMPI is doing
at any point in time. The view shows the state inside the TAMPI library for each
thread (and for each CPU, the state of the running thread in that CPU). This
subsystem state view sticks to the definition of subsystem states from the
[Nanos6](nanos6.md#subsystem_view).
The states shown in this view are:
- **Library code subsystem**: Indicating whether the running thread is executing
effective TAMPI library code. These subsystem states wrap the rest of
subsystems that are described below. No other TAMPI state can appear outside
of a TAMPI library code subsystem state.
- **Interface function**: Running any TAMPI API function or an intercepted
MPI function which requires task-awareness. When the user application
disables a TAMPI mode, whether the [blocking][tampi blk] or
[non-blocking][tampi nonblk] mode, any call to an interface function
corresponding to the disabled mode will not appear in the view. Operations
that are directly forwarded to MPI (because TAMPI is not asked to apply
task-awareness) will not appear.
- **Polling function**: The TAMPI library can launch internal tasks to
execute polling functions in the background. Currently, TAMPI launches a
polling task that periodically checks and processes the pending MPI
requests generated by task-aware operations. This polling state may not
appear if none of the TAMPI modes are enabled by the user application.
- **Communication subsystem**: The running thread is communicating through MPI
or issuing an asynchronous communication operation.
- **Issuing a non-blocking operation**: Issuing a non-blocking MPI operation
that can generate an MPI request.
- **Ticket subsystem**: Creation and managing of tickets. A ticket is an
internal object that describes the relation between a set of pending MPI
requests and the user communication task that is *waiting* (synchronous or
asynchronously) on them. A ticket is used for both [blocking][tampi blk] and
[non-blocking][tampi nonblk] operations.
- **Creating a ticket**: Creating a ticket that is linked to a set of MPI
requests and a user task. The user task is the task that is *waiting* for
these requests to complete. Notice that *waiting* does not mean that the
task will synchronously wait for them. The ticket is initialized with a
counter of how many requests are still pending. The ticket is completed,
and thus, the task is notified, when this counter becomes zero.
- **Waiting for the ticket completion**: The user task, during a blocking
TAMPI operation, is waiting a ticket and its requests to complete. The
task may be blocked and yield the CPU meanwhile. Notice that user tasks
calling non-blocking TAMPI operations will not enter in this state.
- **Staging queue subsystem**: Queueing and dequeueing requests from the staging
queues before being transferred to the global array of requests and tickets.
These queues are used to optimize and control insertion of these objects into
the global array.
- **Adding to a queue**: A user communication task running a task-aware
TAMPI operation is pushing the corresponding MPI requests and the related
ticket into a staging queue.
- **Transfering from queues to the global array**: The polling task is
transferring the staged requests and tickets from the queues to the global
array.
- **Global array subsystem**: Managing the per-process global array of tickets
and MPI requests related to TAMPI operations.
- **Checking pending requests**: Testing all pending MPI requests from the
global array, processing the completed requests, and reorganizing the
array to keep it compacted.
- **Request subsystem**: Management and testing of pending MPI requests, and
processing the completed ones. This state considers only the management of MPI
requests concerning task-aware operations, which are exclusively tested by the
TAMPI library. Any testing function call made by the user application or other
libraries is not considered.
- **Testing a request with MPI_Test**: Testing a single MPI request by
calling MPI_Test inside the TAMPI library.
- **Testing requests with MPI_Testall**: Testing multiple MPI requests by
calling MPI_Testall inside the TAMPI library.
- **Testing requests with MPI_Testsome**: Testing multiple MPI requests by
calling MPI_Testsome inside the TAMPI library.
- **Processing a completed request**: Processing a completed MPI request by
decreasing the number of pending requests of the linked ticket. If the
ticket does not have any other request to wait, the ticket is completed
and the *waiting* task is notified. In such a case, a call to the tasking
runtime system will occur. If the operation was [blocking][tampi blk], the
*waiting* task will be unblocked and will eventually resume the execution.
If the operation was [non-blocking][tampi nonblk], the library will
decrease the external events of the *waiting* task.
The figure below shows an example of the subsystem view. The program executes a
distributed stencil algorithm with MPI and OmpSs-2. There are several MPI
processes and each process has OmpSs-2 tasks running exlusively on multiple CPU
resources.
![Subsystem view example](fig/tampi-subsystem.png)
The view show there are several user tasks running task-aware communication
operations. The light blue areas show when a user task is testing a request that
was generated by a non-blocking MPI communication function. There is also one
polling task per process. The yellow areas show when the polling tasks are
calling MPI_Testsome. Just after the testsome call, the violet areas show the
moment when the polling task is processing the completed requests.
This view shows that most of the time inside the TAMPI library is spent testing
requests. This could give us a clue that the underlying MPI library may have
concurrency issues (e.g., thread contention) when multiple threads try to test
requests in parallel.