Tracking experiment stability after upgrade #221

Open
opened 2026-01-13 12:07:54 +01:00 by rarias · 14 comments
Owner

One of the reproducibility properties provided by Nix is that the whole environment is controlled by the user, so we can upgrade without fear of changing the experimental results. However, a kernel upgrade could still cause changes in performance, even if the user space is the same.

As we are currently running experiments in Fox, we can run some performance experiments and record the results before and after the upgrade. The results should be the same if this hypothesis holds.

CC @varcila

rarias added the nix label 2026-01-13 12:07:54 +01:00
rarias self-assigned this 2026-01-13 12:07:54 +01:00
Collaborator

I have executed the cholesky benchmark before and after the upgrade. The results show some degradation, very noticeable with large tasksizes:

Tasksize 384 (best tasksize)

x after_384.out
+ before_384.out
+--------------------------------------------------------------------------------+
|                    + +                                                         |
|           + ++   *x+ +  x   x                   +         xx   x x   x         |
|+      +x*+++++  ****+++ x   x            +     ++    + +* xx xxxxx xxx x      x|
|        |__________M___|A_______________|______A____________M__________|        |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      4.064465       4.35126     4.2728615     4.2213375   0.095528835
+  30      4.030699      4.263528      4.108358     4.1269543   0.065797707
Difference at 95.0% confidence
	-0.0943833 +/- 0.0423981
	-2.23586% +/- 1.00438%
	(Student's t, pooled s = 0.0820216)

Tasksize 512 (somewhat large)

x after.out
+ before.out
+--------------------------------------------------------------------------------+
|    + +                                                                         |
|  +++ +                                                                         |
|  +++++                                                       x                 |
| +++++++                +                 x   x      x        x                 |
| ++++++++             +++      x x  xx  xxx  xx    x xxxxx xxxx xxx  x x       x|
||____M_A______|                          |___________A_M_________|              |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      3.852624        4.6072     4.2233735     4.2029263    0.19033377
+  30      3.378232      3.748664      3.437565      3.470745    0.10714594
Difference at 95.0% confidence
	-0.732181 +/- 0.0798354
	-17.4208% +/- 1.89952%
	(Student's t, pooled s = 0.154446)
Author
Owner

Thanks for the report. Those are very interesting, albeit unexpected results.

I would like to reproduce them on my end. Do you have a suggested set of parameters (size, blocksize...) for cholesky? I'm planning to use the [one in bench6](https://gitlab.pm.bsc.es/rarias/bench6/-/blob/master/src/cholesky/cholesky.c?ref_type=heads). It is safer if we don't exchange the code, only the steps to reproduce it, so that we can prevent [systematic errors](https://en.wikipedia.org/wiki/Observational_error).

One potential explanation is that between Linux 6.15.6 and 6.18.3 new mitigations were introduced that cause a performance impact. If that is the case, it should be easy to reproduce by testing the old and new kernel (keeping the user space intact) or by disabling mitigations in the new version.

This suggests that we should add a performance monitor so that we catch regressions like this before merging an upgrade.

Author
Owner

Also, I will leave here the revision log, so we know the problem is located between e42058f08b and fcfee6c674.

fox% tail /var/configrev.log
2025-10-02T16:52:16+02:00 booted=unknown current=unknown next=f3bfe89f275384a5def14c9fd229129416218ba2
2025-10-02T17:06:52+02:00 booted=unknown current=unknown next=e42058f08bcf7c0606e92415fad590e7758f271a
2025-10-22T13:33:21+02:00 booted=unknown current=unknown next=e42058f08bcf7c0606e92415fad590e7758f271a
2025-10-24T15:42:44+02:00 booted=e42058f08bcf7c0606e92415fad590e7758f271a current=e42058f08bcf7c0606e92415fad590e7758f271a next=5b041f233975b58c2f92a71ebd8ada13e6e4fcbe
2025-10-27T11:39:49+01:00 booted=e42058f08bcf7c0606e92415fad590e7758f271a current=5b041f233975b58c2f92a71ebd8ada13e6e4fcbe next=84b7e316a56831949cff4dd4582b050c93d1dc3e
2025-10-28T12:36:29+01:00 booted=e42058f08bcf7c0606e92415fad590e7758f271a current=84b7e316a56831949cff4dd4582b050c93d1dc3e next=a7018250ca932679935e8e838b3aa7a6c965539f
2025-10-29T16:46:27+01:00 booted=e42058f08bcf7c0606e92415fad590e7758f271a current=a7018250ca932679935e8e838b3aa7a6c965539f next=5ff1b1343b734a4aab5fea91823ebd78043e3897
2026-01-20T10:56:24+01:00 booted=e42058f08bcf7c0606e92415fad590e7758f271a current=5ff1b1343b734a4aab5fea91823ebd78043e3897 next=859eebda988ab3dbb86a7c1a18197820ed926bb9
2026-01-20T11:00:10+01:00 booted=unknown current=unknown next=859eebda988ab3dbb86a7c1a18197820ed926bb9
2026-01-20T11:07:44+01:00 booted=unknown current=unknown next=fcfee6c6740fedb20e60825a92ee38e15733a85b
Collaborator

About the parameters of cholesky: size 32*1024, tasksize 512 (the one with the largest performance difference). Immediate successor was activated.

It is important to note that in my version I allocate memory using mmap and set its memory policy to MPOL_LOCAL. This yielded performance differences with respect to the version using malloc.

Author
Owner

Using the cholesky from bench6 with MKL and a slightly smaller size $((16*1024)), comparing kernels 6.18 and 6.12, I get the opposite effect (the newer kernel is faster). All measurements report time in seconds (lower is better):

fox% which -a b6_cholesky_nodes
/nix/store/g9p4aifnwmwxlawkrmh9rlq43d53hvif-bench6-fe30c2c/bin/b6_cholesky_nodes

fox% NOSV_CONFIG_OVERRIDE='debug.dump_config=true' b6_cholesky_nodes $((16*1024)) 512
Using configuration file /nix/store/ak91zlrfy62jgjg95xal9dbqcb7avc62-nosv-4.0.0/share/nosv.toml
Parsed options:
scheduler.quantum_ns = 20000000
scheduler.queue_batch = 64
scheduler.cpus_per_queue = 1
scheduler.in_queue_size = 256
scheduler.immediate_successor = 1
shared_memory.name = "nosv"
shared_memory.isolation_level = "process"
shared_memory.start = 0x200000000000
shared_memory.size = 2147483648
task_affinity.default = "all"
task_affinity.default_policy = "strict"
thread_affinity.compat_support = 1
topology.binding = "inherit"
topology.numa_nodes = "(null)"
topology.complex_sets = "(null)"
topology.print = 0
debug.dump_config = 1
debug.print_binding = 0
governor.policy = "hybrid"
governor.spins = 10000
hwcounters.verbose = 0
hwcounters.backend = "none"
hwcounters.papi_events = ["PAPI_TOT_INS","PAPI_TOT_CYC"]
turbo.enabled = 1
monitoring.enabled = 0
monitoring.verbose = 0
instrumentation.version = "none"
misc.stack_size = 8388608
ovni.level = 2
ovni.events = []
ovni.kernel_ringsize = 4194304
  1.706950e-01   1.667047e+03          16384            512
  
fox% bigotes -o oldkernel-16K-512-6.12 b6_cholesky_nodes $((16*1024)) 512

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 8.534e-01  8.932e-01  9.104e-01  9.282e-01  9.461e-01  1.091e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30      126.8  4.357e-02  5.815e-02  1.004e+00  3.629e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       4.79       6.26  1.062e-02       2.24

    Cmd: b6_cholesky_nodes 16384 512
    Shapiro-Wilk: W=9.02e-01, p-value=9.53e-03 (NOT normal)

                █
             █  █
 █       █   █ ██        █     █
 ███   █ █ █ ████  █ █   █     █ █     █ █      █  █            █
 
fox% bigotes -i < baseline-16K-512-6.18

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 8.352e-01  8.539e-01  8.781e-01  8.870e-01  8.997e-01  1.007e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30        0.0  3.660e-02  4.463e-02  1.040e+00  2.311e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       4.17       5.03  8.148e-03       1.80

    Read from stdin
    Shapiro-Wilk: W=8.82e-01, p-value=3.12e-03 (NOT normal)

        █            █
 ▅   ▅ ▅█            █▅                            ▅
 █▂  █▂██  ▂▂▂▂    ▂ ██  ▂▂  ▂▂     ▂             ▂█            ▂
 ██  ████  ████    █ ██  ██  ██     █             ██            █

fox% ministat -w80 oldkernel-16K-512-6.12 baseline-16K-512-6.18
x oldkernel-16K-512-6.12
+ baseline-16K-512-6.18
+--------------------------------------------------------------------------------+
|                 +                                                              |
|                 +                                                              |
|      +          +  x                                                           |
|+    +*    +     +  x xxx         x      +                                      |
|++ +++**+* + x *x+x **xx*+ x *    x     x*+x      x x+      x  x               x|
|  |________|___MA______M_____A|________________|                                |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      0.853399      1.091382      0.910699    0.92820313   0.058148578
+  30      0.835164      1.006968      0.884693    0.88701993   0.044626157
Difference at 95.0% confidence
        -0.0411832 +/- 0.0267918
        -4.43687% +/- 2.88642%
        (Student's t, pooled s = 0.0518303)

Kernel 6.15 has already been removed from the current Nixpkgs, so we would need to boot into the same old kernel + userspace configuration if we want to compare against it.

I will try with the larger problem size and see if that is enough to replicate it. Otherwise we may need to also do the mmap change.

Author
Owner

No significant difference with 32K size, but my times are higher than yours:

fox% bigotes -o baseline-32K-512-6.18 b6_cholesky_nodes $((32*1024)) 512

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 4.819e+00  5.098e+00  5.140e+00  5.156e+00  5.208e+00  5.482e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30      584.9  9.813e-02  1.297e-01  2.715e-02  5.738e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       1.91       2.51  2.367e-02       0.90

    Cmd: b6_cholesky_nodes 32768 512
    Shapiro-Wilk: W=9.73e-01, p-value=6.11e-01 (may be normal)

                               ██
                             ▅ ██            ▅ ▅
 ▂              ▂▂▂   ▂ ▂▂ ▂▂█▂██  ▂▂▂▂▂     █ █▂   ▂           ▂
 █              ███   █ ██ ██████  █████     █ ██   █           █

fox% bigotes -i < oldkernel-32K-512-6.12

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 4.925e+00  5.062e+00  5.132e+00  5.165e+00  5.220e+00  5.596e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30        0.0  1.182e-01  1.460e-01  7.763e-01  5.475e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       2.30       2.83  2.666e-02       1.01

    Read from stdin
    Shapiro-Wilk: W=9.49e-01, p-value=1.58e-01 (may be normal)

                 ██      █           █  █
                 ██      █           █  █
 █ █   █ ██ ███ ██████ ████  █      ███ █    █                  █
 █ █   █ ██ ███ ██████ ████  █      ███ █    █                  █

fox% ministat -w80 baseline-32K-512-6.18 oldkernel-32K-512-6.12
x baseline-32K-512-6.18
+ oldkernel-32K-512-6.12
+--------------------------------------------------------------------------------+
|                         +  ++x+xx + *  x         +  +                          |
|x          +  + x* x++x+ *x *****xx+ ** x+    xx***  +x   +        x           +|
|                    ||__________MMAA___________|__|                             |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      4.819304      5.482095      5.141714       5.15587    0.12966408
+  30      4.924967      5.596419      5.135468     5.1652136    0.14601492
No difference proven at 95.0% confidence
Collaborator

As commented on Slack, I would say my results being faster is probably because in my benchmarks I use this instead of malloc:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <numaif.h>

void *tglib_mmap_wrapper(size_t length) {
        void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (addr == MAP_FAILED) {
                fprintf(stderr, "mmap failed for length %zu\n", length);
                abort();
        }

        /* Bind the mapping to the local node of the allocating thread. */
        int ret = mbind(addr, length, MPOL_LOCAL, NULL, 0, MPOL_MF_STRICT);
        if (ret != 0) {
                fprintf(stderr, "mbind failed!\n");
                abort();
        }

        return addr;
}

Since we were close to a deadline when I made this change, I did not investigate why it yields different results, given that the default policy is supposed to be MPOL_LOCAL. So I cannot say what the real reason is that this is faster for some workloads.

Author
Owner

I've [added support for NUMA](https://gitlab.pm.bsc.es/rarias/bench6/-/commits/cholesky-numa), but I don't observe any performance difference:

fox$ bigotes -o numa-off.csv build/src/cholesky/b6_cholesky_nodes $((32*1024)) 512 0

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 4.708e+00  5.060e+00  5.112e+00  5.101e+00  5.145e+00  5.443e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30      582.6  8.598e-02  1.651e-01 -4.015e-01  6.626e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         5      16.67       1.68       3.24  3.014e-02       1.16

    Cmd: build/src/cholesky/b6_cholesky_nodes 32768 512 0
    Shapiro-Wilk: W=9.13e-01, p-value=1.78e-02 (NOT normal)

                                   █
                               █   ██ █
                               █   ██ █    █                    █
 █  █  █        █          ███ █  ███ ██ █ ██    █  █           █


fox$ bigotes -o numa-on.csv build/src/cholesky/b6_cholesky_nodes $((32*1024)) 512 1

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 4.891e+00  4.996e+00  5.091e+00  5.091e+00  5.147e+00  5.374e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30      580.7  1.059e-01  1.077e-01  2.425e-01 -4.202e-02

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       2.08       2.12  1.966e-02       0.76

    Cmd: build/src/cholesky/b6_cholesky_nodes 32768 512 1
    Shapiro-Wilk: W=9.72e-01, p-value=5.98e-01 (may be normal)

                                 █
           ▅  ▅         ▅ ▅ ▅    █▅
 ▂  ▂▂    ▂█  █     ▂▂  █▂█ █    ██ ▂▂▂▂▂ ▂        ▂            ▂
 █  ██    ██  █     ██  ███ █    ██ █████ █        █            █

fox$ ministat -w80 numa*.csv
x numa-off.csv
+ numa-on.csv
+--------------------------------------------------------------------------------+
|                                      x    xx +                                 |
|                            +  +    + x++ *xx **   + x                         x|
|x   x   x          x+  ++  ++  + xx * x+++*xx **++*+x*      x+ x        +      x|
|                         |____|__________AM_M________|______|                   |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      4.707616      5.443485      5.117572     5.1014736    0.16508331
+  30      4.891141      5.373693      5.095974     5.0907868    0.10767855
No difference proven at 95.0% confidence

Given that [only one CPU is initializing all the memory](https://gitlab.pm.bsc.es/rarias/bench6/-/blob/a0f4bc652f2c3a33552d8233ea97c4869b22cb30/src/cholesky/cholesky.c#L52-66), I don't think the local policy behaves as intended. AFAIK, the first CPU touching a page gets it allocated on its own NUMA node, and as only one CPU is doing the initialization, all the memory will end up on the same NUMA node.

Did you perhaps modify the initialization to handle this? I also suspect the malloc behavior would be the same as your custom allocator, as that is the default policy. Even if we change the allocator so that it distributes the initialization among CPUs, we would still need to run the computing tasks on their proper NUMA region.

Given that the other few benchmarks I tested don't show appreciable differences between the old and new kernel, I think we can reject the hypothesis that there is a new mitigation or other change affecting performance system-wide, which would be a blocker for the upgrade. In any case, I would like to add a performance monitor to prevent this in the future.

I suggest leaving the investigation here for now; perhaps in the future we can try to replicate your current results on the new kernel, which are about 4.22 seconds of execution time vs 5.11 on my end. For that I would need to take a look at your environment and see what is different.

Collaborator

> Given that only one CPU is initializing all the memory, I don't think that...

Yes, I agree with that analysis: with one CPU touching all the memory, it should be allocated on that CPU's NUMA node. Nevertheless, for other benchmarks I saw results contradicting this hypothesis, and using the direct mmap allocation solved it. My hypothesis was that using explicit first-touch allocation prevented NUMA auto-balancing, which moves pages depending on which CPU uses them. NUMA auto-balancing is enabled in MN5 but disabled in Fox (https://jungle.bsc.es/git/rarias/jungle/src/branch/master/m/fox/configuration.nix#L37), so it is probably not important to use mmap rather than malloc here. I don't have a strong reason to keep using mmap other than consistency with what I used to get the results I showed in my first comment.

> Did you perhaps modify the initialization to handle this?

Not for cholesky, no.

I upload the results of the old userspace and the new userspace on kernel 6.12.63. Notably, the old userspace uses nOS-V 3.2 and AMD-BLIS 5.0, while the new userspace uses nOS-V 4.0 and AMD-BLIS 5.1. Also important to note: the first results shown in this issue were using the old userspace on the new kernel.

x new_userspace_linux6.12.63_32k_512.out
+ old_userspace_linux6.12.63_32k_512.out
+--------------------------------------------------------------------------------+
|                                     +                      x                   |
|                    x      x     xxx +           +    *x    *                   |
|+    +   x  xx+ * * xx x   *++   **xx*    + * *  + x++** * +*  +       +       +|
|                    |__|___________MA_____A___M_____|________|                  |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      3.771313       4.29377       4.03268     4.0457563    0.16528416
+  30      3.678669      4.485322      4.143362     4.1084013     0.1953015
No difference proven at 95.0% confidence
Author
Owner

I am uploading the results of the old userspace and the new userspace on kernel 6.12.63. Notably, the old userspace uses nOS-V 3.2 and AMD-BLIS 5.0, whereas the new userspace uses nOS-V 4.0 and AMD-BLIS 5.1. Also note that the first results shown in this issue used the old userspace on the new kernel.

Thanks!

Note that this is not what I mean when I refer to "userspace". The NixOS configuration installed in Fox is composed of two parts: the kernel, and all the other programs and libraries installed, which is what I call the userspace. This is the part that you cannot change as a user.

Aside from that, you control your own software stack by two means: the `nix develop` environment controlled by flake.nix and flake.lock, and the environment that leaks into it (unless you use `nix develop -i`), which is a mix of the system userspace and what you change via home-manager or other means.

I will refer to the "system" userspace configuration as just *userspace*, and the environment that you get with a mix of `nix develop` and home-manager as your *environment*.

A careful observation of the original data shows that you have a much larger standard deviation:

x after.out
+ before.out
+--------------------------------------------------------------------------------+
|    + +                                                                         |
|  +++ +                                                                         |
|  +++++                                                       x                 |
| +++++++                +                 x   x      x        x                 |
| ++++++++             +++      x x  xx  xxx  xx    x xxxxx xxxx xxx  x x       x|
||____M_A______|                          |___________A_M_________|              |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      3.852624        4.6072     4.2233735     4.2029263    0.19033377 <--- almost 2x
+  30      3.378232      3.748664      3.437565      3.470745    0.10714594
Difference at 95.0% confidence
	-0.732181 +/- 0.0798354
	-17.4208% +/- 1.89952%
	(Student's t, pooled s = 0.154446)

And this persists with the new results that we observe using the new userspace, even if we switch the kernel to an old one.

Among the userspace changes there may be a daemon whose behavior has changed and that is now causing more system noise. This would persist even if you change your shell or home-manager configuration, as you don't control that part.

We can run some noise-sensitive benchmarks to see whether we observe interrupts from other processes. This can also be measured by running the benchmarks again under `perf stat` and looking at the context switches.

I did a quick test in owl1 (with a smaller size) and I already see some problems (I'm using -r 5 to repeat it 5 times):

apex% srun -p owl --exclusive nix shell \
  'git+https://jungle.bsc.es/git/rarias/jungle?rev=b9f2e936dece429498dd4bdc04f61cd67f2dd009#bench6' \
  -c perf stat -r 5 -- b6_cholesky_nodes 8192 512
  
  6.718480e-01   2.700804e+02           8192            512
  6.884560e-01   2.635651e+02           8192            512
  6.679520e-01   2.716557e+02           8192            512
  6.529450e-01   2.778993e+02           8192            512
  6.545270e-01   2.772276e+02           8192            512

 Performance counter stats for 'b6_cholesky_nodes 8192 512' (5 runs):

    67.356.682.680      task-clock                       #   14,140 CPUs utilized               ( +-  0,30% )
               713      context-switches                 #   10,585 /sec                        ( +-  2,81% )
                56      cpu-migrations                   #    0,831 /sec                        ( +-  2,22% )
           308.065      page-faults                      #    4,574 K/sec                       ( +-  0,35% )
   106.064.384.315      instructions                     #    1,32  insn per cycle              ( +-  0,10% )
    80.571.815.200      cycles                           #    1,196 GHz                         ( +-  0,30% )
     3.615.450.418      branches                         #   53,676 M/sec                       ( +-  0,43% )
         7.807.170      branch-misses                    #    0,22% of all branches             ( +-  0,38% )

            4,7635 +- 0,0157 seconds time elapsed  ( +-  0,33% )

I see quite a lot of context switches and several CPU migrations. Ideally these should be 0, or at least smaller numbers.

Collaborator

I will refer to the "system" userspace configuration as just userspace, and the environment that you get with a mix of nix develop and home manager as your environment.

Ok, makes sense, thanks for the clarification.

What do you think about this: unless we can have the true old environment, userspace and kernel, it will be much harder and more time-consuming to reach a sound conclusion. I know that there are technical challenges to getting exactly the same system as when the previous results were obtained, but I would suggest we wait until we can have that system. And if we can't have it, then I think we should not spend much more time trying to reproduce those results, and should instead use our time developing a reliable performance monitoring system.

Whatever you decide is fine by me. Let me know if I can help with anything :)

Author
Owner

What do you think about this: unless we can have the true old environment, userspace and kernel, it will be much harder and more time-consuming to reach a sound conclusion. I know that there are technical challenges to getting exactly the same system as when the previous results were obtained, but I would suggest we wait until we can have that system. And if we can't have it, then I think we should not spend much more time trying to reproduce those results, and should instead use our time developing a reliable performance monitoring system.

I mostly agree with this position. My idea is to collect as much information as possible now, so that I can investigate later before it is gone. I also need to know how to design the benchmarks so that they can detect a similar problem in the future.

After seeing those context switches, I took a quick look with perf and saw some occasions in which fail2ban steals the CPU. I did a quick test with and without fail2ban just to check whether that has any effect, but it doesn't seem to be significant:

% ministat -w80 *512.csv
x with-f2b-32768-512.csv
+ without-f2b-32768-512.csv
+--------------------------------------------------------------------------------+
|                            xx   +                                              |
|                    +    +x *x*  +x          x    +                             |
|+    xx      x  +  ++x   +x****++**+ *xx  x++*xx+x++ *                         x|
|                   |_|________M_MAA__________|__|                               |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      4.796694      5.718658      5.106731     5.1526377    0.17765302
+  29      4.734572       5.38994       5.13133     5.1418823    0.14934108
No difference proven at 95.0% confidence

Notice that my results all use MKL; I should switch to BLIS, which should reach performance similar to yours (around 4.03 s of time). We can compare the assembly of both programs if that is not the case.

Another question I have is whether you have changed anything regarding huge pages, as I believe they may affect the performance of this program. I reviewed our conversation over the last months but I cannot find any mention of it.

Collaborator

I mostly agree with this position. My idea is to collect as much information as possible now, so that I can investigate later before it is gone. I also need to know how to design the benchmarks so that they can detect a similar problem in the future.

Sounds fair.

Another question I have is whether you have changed anything regarding huge pages, as I believe they may affect the performance of this program. I reviewed our conversation over the last months but I cannot find any mention of it.

No, I am not using huge pages.

Related to the context switches, what about using `perf sched map` as in https://www.brendangregg.com/blog/2017-03-16/perf-sched.html? There are other perf sched commands that seem promising.
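For example, something along these lines (a sketch following the linked post; it assumes a perf build with sched support and permission to trace):

```shell
# Record scheduler events around one benchmark run, then inspect them.
perf sched record -- b6_cholesky_nodes 8192 512
perf sched map                 # per-CPU timeline of which task ran where
perf sched latency --sort max  # per-task wakeup/run latencies
```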

Author
Owner

Related to the context switches, what about using `perf sched map` as in https://www.brendangregg.com/blog/2017-03-16/perf-sched.html? There are other perf sched commands that seem promising.

Yes, that's what I have used. A trace with kernel events in ovni may also be useful.

I tested cholesky from bench6 with blis (https://jungle.bsc.es/git/rarias/devshell/src/commit/0775e1ce7336433562531a6284fc376295905f00/vincent/chol/flake.nix) and I get the time down to around 4.34 seconds:

fox% ./run.sh
+ size=32768
+ bs=512
++ which b6_cholesky_nodes
++ awk -F/ '{print $4}'
+ b6dir=b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53
+ wdir=out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53
+ mkdir -p out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53
+ log=out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53/b6_cholesky_nodes-32768-512.csv
+ bigotes -o out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53/b6_cholesky_nodes-32768-512.csv -- b6_cholesky_nodes 32768 512

       MIN         Q1     MEDIAN       MEAN         Q3        MAX
 4.044e+00  4.238e+00  4.345e+00  4.344e+00  4.399e+00  4.640e+00

         N       WALL        MAD      STDEV       SKEW   KURTOSIS
        30      536.3  1.280e-01  1.320e-01  1.841e-01 -2.161e-01

       FAR       %FAR       %MAD     %STDEV        SEM       %SEM
         0       0.00       2.95       3.04  2.410e-02       1.09

    Cmd: b6_cholesky_nodes 32768 512
    Shapiro-Wilk: W=9.81e-01, p-value=8.50e-01 (may be normal)

                                 █
                                 █
                    ██       █   █ ██  █
 █           █  █ █ ██ █  █  ██  █ ██  █ █  █      ██  █ █      █

+ ministat -w80 out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53/b6_cholesky_nodes-32768-512.csv
x out/b5py11silcyhk6dmwvmh8mm8vmnqksmk-bench6-bf29a53/b6_cholesky_nodes-32768-512.csv
+--------------------------------------------------------------------------------+
|                                        x                                       |
|                                        x                                       |
|                         x          x   x x    x                                |
|x              x   x x x xxx   x   xx   x xxx  x  x   x       xx    x x        x|
|                      |_________________A________________|                      |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  30      4.043786      4.640053      4.345529     4.3437177     0.1319854

In your environment you also build nOS-V with the "native" flags, which I'm guessing will also have some impact.
