Trace specification version 1

The ovni instrumentation library produces a trace with the following specification.

The complete trace is stored in a top-level directory named `ovni`. Inside this directory there is one directory per loom. The name of each loom directory is built by prefixing the loom parameter of ovni_proc_init() with `loom.`.

Each loom directory contains one directory per process of that loom. Its name is composed of the `proc.` prefix followed by the PID of the process, as given in the pid argument of ovni_proc_init().

Each process directory contains:

  • The metadata file metadata.json.
  • The thread traces, whose file names start with the `thread.` prefix.
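The directory layout described above can be explored with a few lines of Python. This is only a sketch: the function name `list_thread_traces` is hypothetical, and it relies on nothing beyond the `loom.`, `proc.` and `thread.` prefixes given in this section.

```python
from pathlib import Path


def list_thread_traces(trace_dir):
    """Return {loom dir name: {proc dir name: [thread file names]}}.

    trace_dir is the top-level "ovni" directory of a trace.
    """
    layout = {}
    for loom in sorted(Path(trace_dir).glob("loom.*")):
        if not loom.is_dir():
            continue
        procs = {}
        for proc in sorted(loom.glob("proc.*")):
            if not proc.is_dir():
                continue
            # metadata.json lives next to the thread traces and is skipped
            # here because it does not match the "thread." prefix.
            procs[proc.name] = sorted(
                p.name for p in proc.glob("thread.*") if p.is_file()
            )
        layout[loom.name] = procs
    return layout
```

Calling `list_thread_traces("ovni")` on a trace returns a nested dictionary that mirrors the loom, process and thread hierarchy.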

Process metadata

The metadata file contains information about the trace that does not change during the execution and that is generally needed before the events in the trace can be processed.

The metadata is stored in the JSON file metadata.json inside each process directory and contains the following keys:

  • version: a number specifying the version of the metadata format.
  • app_id: the application ID, used to distinguish between applications running on the same loom.
  • rank: the rank of the MPI process (optional).
  • nranks: number of total MPI processes (optional).
  • cpus: the array of `N_c` CPUs available in the loom. This key is mandatory, but only one process in the loom needs to contain it. Each element is a dictionary with the keys:
    • index: containing the logical CPU index from 0 to `N_c - 1`.
    • phyid: the number of the CPU as given by the operating system (which can exceed `N_c`).
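As an illustration of the keys listed above, the following sketch loads a metadata document and performs some basic checks. All values in `EXAMPLE` are made up, and the function name `check_metadata` is hypothetical. Treating version and app_id as required is an assumption, based only on them not being marked optional here.

```python
import json

# Hypothetical metadata.json content; every value is invented for illustration.
EXAMPLE = json.loads("""
{
  "version": 1,
  "app_id": 1,
  "rank": 0,
  "nranks": 2,
  "cpus": [
    {"index": 0, "phyid": 4},
    {"index": 1, "phyid": 5}
  ]
}
""")


def check_metadata(meta):
    """Return the list of CPUs (empty if this process does not carry it)."""
    # Assumption: version and app_id are always present.
    for key in ("version", "app_id"):
        if key not in meta:
            raise ValueError(f"missing key: {key}")
    cpus = meta.get("cpus", [])
    # The logical indices must be exactly 0 .. N_c - 1; phyid is unconstrained.
    indices = sorted(cpu["index"] for cpu in cpus)
    if indices != list(range(len(cpus))):
        raise ValueError("CPU indices must cover 0..N_c-1 exactly once")
    return cpus
```

A reader that merges the metadata of all processes in a loom would additionally need to verify that at least one of them provides the cpus key.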

Thread trace

The thread trace is a binary file composed of events stored one after another. Each event consists of a header, optionally followed by a payload, with the following information:

  • Event flags
  • Payload size in a special format
  • Model, category and value codes
  • Time in nanoseconds
  • Payload (optional)

The payload size is encoded in 4 bits: the value 0x0 means no payload, and any other value `v` means `v + 1` bytes of payload. This allows payloads of up to 16 bytes (value 0xf), at the cost of not being able to represent a payload of exactly one byte.
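The size encoding can be written as a pair of small helper functions. This is a minimal sketch; the names `decode_payload_size` and `encode_payload_size` are hypothetical.

```python
def decode_payload_size(code):
    """Convert the 4-bit size code into a payload length in bytes."""
    if not 0 <= code <= 0xF:
        raise ValueError("size code must fit in 4 bits")
    return 0 if code == 0 else code + 1


def encode_payload_size(nbytes):
    """Convert a payload length in bytes into the 4-bit size code."""
    if nbytes == 0:
        return 0
    if not 2 <= nbytes <= 16:
        # A payload of exactly one byte has no code assigned to it.
        raise ValueError("payload must be 0 or 2..16 bytes long")
    return nbytes - 1
```

With this scheme the code 0x3 corresponds to 4 bytes and 0xf to 16 bytes.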

There are two types of events, depending on the size needed for the payload:

  • Normal: with a payload up to 16 bytes
  • Jumbo: with a payload up to 2^32 - 1 bytes (the largest size that fits in a 32-bit integer)

Normal events

The normal events are composed of:

  • 4 bits of flags
  • 4 bits of payload size
  • 3 bytes for the MCV
  • 8 bytes for the clock
  • 0, or 2 to 16, bytes of payload

Here is an example of a normal event without payload, a total of 12 bytes:

% dd if=thread.552943 skip=5258 bs=1 | hexdump -C 
00000000  00 4f 48 65 01 c5 cf 1d  96 d0 12 00              |.OHe........|

The following figure shows every field annotated:

Figure: Normal event without payload

Here is another example, this time of a normal event with 16 bytes of payload, for a total of 28 bytes as reported by hexdump:

% dd if=thread.552943 bs=1 count=28 | hexdump -C
00000000  0f 4f 48 78 58 c1 b0 b5  95 43 11 00 00 00 00 00  |.OHxX....C......|
00000010  ff ff ff ff 00 00 00 00  00 00 00 00              |............|

In the following figure you can see each field annotated:

Figure: Normal event with payload content
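The two examples above can be decoded with the following sketch. It makes two assumptions that are inferred from the example bytes rather than stated in the text: the flags occupy the high 4 bits of the first byte and the payload size the low 4 bits (the byte 0x0f precedes a 16-byte payload), and the trace is little endian. The function name `parse_normal_event` is hypothetical.

```python
import struct

# Header layout: 1 byte (flags + size), 3 bytes of MCV, 8 bytes of clock.
# "<" selects little endian without padding (an assumption, see above).
HEADER = struct.Struct("<B3sQ")


def parse_normal_event(buf, offset=0):
    """Decode one normal event; return (event dict, offset of next event)."""
    first, mcv, clock = HEADER.unpack_from(buf, offset)
    flags = first >> 4        # assumed: high nibble holds the flags
    size_code = first & 0x0F  # assumed: low nibble holds the size code
    nbytes = 0 if size_code == 0 else size_code + 1
    start = offset + HEADER.size
    payload = bytes(buf[start:start + nbytes])
    if len(payload) != nbytes:
        raise ValueError("truncated payload")
    event = {"flags": flags, "mcv": mcv.decode("ascii"),
             "clock": clock, "payload": payload}
    return event, start + nbytes


# The two hexdump examples from this section.
NO_PAYLOAD = bytes.fromhex("00 4f 48 65 01 c5 cf 1d 96 d0 12 00")
WITH_PAYLOAD = bytes.fromhex(
    "0f 4f 48 78 58 c1 b0 b5 95 43 11 00 00 00 00 00"
    "ff ff ff ff 00 00 00 00 00 00 00 00"
)
```

Parsing `NO_PAYLOAD` yields the MCV "OHe" and consumes 12 bytes, while `WITH_PAYLOAD` yields "OHx" with a 16-byte payload and consumes 28 bytes.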

Jumbo events

Jumbo events are like normal events, but they can hold large amounts of data. The size of the jumbo data is stored as a 32-bit integer in the normal payload, and the jumbo data immediately follows the event.

  • 4 bits of flags
  • 4 bits of payload size (always the value 0x3, meaning 4 bytes)
  • 3 bytes for the MCV
  • 8 bytes for the clock
  • 4 bytes of payload with the size of the jumbo data
  • 0 to 2^32 - 1 bytes of jumbo data

Example of a jumbo event of 30 bytes in total, with 14 bytes of jumbo data:

00000000  13 56 59 63 eb c1 4b 1a  96 d0 12 00 0e 00 00 00  |.VYc..K.........|
00000010  01 00 00 00 74 65 73 74  74 79 70 65 31 00        |....testtype1.|

In the following figure you can see each field annotated:

Figure: Jumbo event
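The jumbo example above can be decoded in the same spirit. This sketch assumes a little-endian trace and that the jumbo flag is the lowest bit of the flags nibble. Both are inferred from the single example (first byte 0x13, size bytes 0e 00 00 00 for 14 bytes of data) and are not stated in the text. The function name `parse_jumbo_event` is hypothetical.

```python
import struct

# 12-byte header followed by the 4-byte payload holding the jumbo size.
JUMBO_HEADER = struct.Struct("<B3sQI")


def parse_jumbo_event(buf, offset=0):
    """Decode one jumbo event; return (event dict, offset of next event)."""
    first, mcv, clock, jumbo_size = JUMBO_HEADER.unpack_from(buf, offset)
    if not (first >> 4) & 0x1:  # assumed position of the jumbo flag
        raise ValueError("not a jumbo event")
    if first & 0x0F != 0x3:     # the payload size code is always 0x3
        raise ValueError("unexpected payload size code")
    start = offset + JUMBO_HEADER.size
    data = bytes(buf[start:start + jumbo_size])
    if len(data) != jumbo_size:
        raise ValueError("truncated jumbo data")
    event = {"mcv": mcv.decode("ascii"), "clock": clock, "data": data}
    return event, start + jumbo_size


# The hexdump example from this section (30 bytes).
JUMBO = bytes.fromhex(
    "13 56 59 63 eb c1 4b 1a 96 d0 12 00 0e 00 00 00"
    "01 00 00 00 74 65 73 74 74 79 70 65 31 00"
)
```

Parsing `JUMBO` gives the MCV "VYc" and 14 bytes of jumbo data ending in the NUL-terminated string "testtype1". The meaning of the jumbo data depends on the event and is left uninterpreted here.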

Design considerations

The trace format has been designed to be very simple, so that writing a parser library should take no more than two days.

The most common events don't carry any payload, so their size is kept at the minimum of 12 bytes.

Important: The events are stored on disk following the endianness of the machine where they are generated. So a trace generated on a little-endian machine differs from one generated on a big-endian machine. Using the native endianness avoids the cost of serialization when writing the trace at runtime.
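To show why the byte order matters to a reader, the sketch below decodes the 8 clock bytes of the first example event (01 c5 cf 1d 96 d0 12 00) under both byte orders. The function name `decode_clock` is hypothetical. How a reader learns the byte order of a given trace is not covered here, so it is passed in explicitly.

```python
import sys


def decode_clock(raw, byteorder=sys.byteorder):
    """Decode the 8-byte clock field; byteorder is "little" or "big".

    The default is the byte order of the machine running the reader,
    which is only correct if the trace was generated on a machine
    with the same endianness.
    """
    if len(raw) != 8:
        raise ValueError("the clock field is 8 bytes long")
    return int.from_bytes(raw, byteorder)


CLOCK_BYTES = bytes.fromhex("01 c5 cf 1d 96 d0 12 00")
as_little = decode_clock(CLOCK_BYTES, "little")  # 0x0012d0961dcfc501
as_big = decode_clock(CLOCK_BYTES, "big")        # 0x01c5cf1d96d01200
```

The two results differ by several orders of magnitude, so decoding a trace with the wrong byte order produces meaningless timestamps rather than an obvious failure.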

The events are designed to be easily identified when looking at the raw trace in binary, as the MCV codes can be read as ASCII characters:

00000000  0f 4f 48 78 58 c1 b0 b5  95 43 11 00 00 00 00 00  |.OHxX....C......|
00000010  ff ff ff ff 00 00 00 00  00 00 00 00 00 36 53 72  |.............6Sr|
00000020  ab cb b0 b5 95 43 11 00  00 36 53 73 78 c3 b9 b5  |.....C...6Ssx...|
00000030  95 43 11 00 00 36 53 40  87 a4 c2 b5 95 43 11 00  |.C...6S@.....C..|
00000040  00 36 53 68 9c 4b cb b5  95 43 11 00 00 36 53 66  |.6Sh.K...C...6Sf|
00000050  85 44 d4 b5 95 43 11 00  00 36 53 5b cb e7 dc b5  |.D...C...6S[....|
00000060  95 43 11 00 00 36 53 5d  cf ca e5 b5 95 43 11 00  |.C...6S].....C..|
00000070  00 36 53 75 8c db ee b5  95 43 11 00 00 36 53 55  |.6Su.....C...6SU|
00000080  5a 70 f8 b5 95 43 11 00  00 36 55 5b 1b ae 01 b6  |Zp...C...6U[....|
00000090  95 43 11 00 00 36 55 5d  aa 19 0b b6 95 43 11 00  |.C...6U].....C..|

This allows a human to detect signs of corruption by just visually inspecting the trace.
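The same visual check can be automated. The sketch below walks the dump shown above and reports events whose MCV bytes are not printable ASCII. It assumes a little-endian trace, flags in the high nibble of the first byte and the jumbo flag in the lowest flag bit. These are inferred from the examples in this document, not stated by it. The names `iter_mcv` and `suspicious_events` are hypothetical.

```python
import struct

DUMP = bytes.fromhex("""
0f 4f 48 78 58 c1 b0 b5  95 43 11 00 00 00 00 00
ff ff ff ff 00 00 00 00  00 00 00 00 00 36 53 72
ab cb b0 b5 95 43 11 00  00 36 53 73 78 c3 b9 b5
95 43 11 00 00 36 53 40  87 a4 c2 b5 95 43 11 00
00 36 53 68 9c 4b cb b5  95 43 11 00 00 36 53 66
85 44 d4 b5 95 43 11 00  00 36 53 5b cb e7 dc b5
95 43 11 00 00 36 53 5d  cf ca e5 b5 95 43 11 00
00 36 53 75 8c db ee b5  95 43 11 00 00 36 53 55
5a 70 f8 b5 95 43 11 00  00 36 55 5b 1b ae 01 b6
95 43 11 00 00 36 55 5d  aa 19 0b b6 95 43 11 00
""")


def iter_mcv(buf):
    """Yield (offset, mcv bytes) for every event, reading forward."""
    offset = 0
    while offset < len(buf):
        first, mcv = struct.unpack_from("<B3s", buf, offset)
        size_code = first & 0x0F
        size = 12 + (0 if size_code == 0 else size_code + 1)
        if (first >> 4) & 0x1:  # assumed jumbo flag: add the jumbo data
            (jumbo_size,) = struct.unpack_from("<I", buf, offset + 12)
            size += jumbo_size
        yield offset, mcv
        offset += size


def suspicious_events(buf):
    """Return the offsets of events whose MCV is not printable ASCII."""
    return [off for off, mcv in iter_mcv(buf)
            if not all(0x20 <= b <= 0x7E for b in mcv)]
```

On the dump above, `iter_mcv` finds 12 events starting with "OHx" and "6Sr", and `suspicious_events` returns an empty list. A non-printable MCV is only a hint of corruption, since a damaged size field can also desynchronize the scan.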

Limitations

The traces are designed to be read forward only: the size of each event is stored only in its own header, so the start of the previous event cannot be determined from the current one.
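A reader that needs to step backwards can work around this limitation by scanning the trace forward once and remembering where each event starts. The sketch below does that, under the same inferred assumptions as the other examples (little endian, flags in the high nibble of the first byte, jumbo flag in the lowest flag bit). The function name `index_events` is hypothetical.

```python
import struct

HEADER_SIZE = 12  # flags/size byte + 3 bytes of MCV + 8 bytes of clock


def index_events(buf):
    """Scan forward once and return the start offset of every event."""
    offsets = []
    offset = 0
    while offset < len(buf):
        if offset + HEADER_SIZE > len(buf):
            raise ValueError("truncated header")
        first = buf[offset]
        size_code = first & 0x0F
        size = HEADER_SIZE + (0 if size_code == 0 else size_code + 1)
        if (first >> 4) & 0x1:  # assumed jumbo flag
            # The 4-byte payload holds the length of the jumbo data.
            (jumbo_size,) = struct.unpack_from("<I", buf, offset + HEADER_SIZE)
            size += jumbo_size
        if offset + size > len(buf):
            raise ValueError("truncated event")
        offsets.append(offset)
        offset += size
    return offsets
```

With the returned list, the event before the one at `offsets[i]` starts at `offsets[i - 1]`, at the cost of one full pass over the file and one stored integer per event.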