Tracking experiment stability after upgrade #221
One of the reproducibility properties provided by Nix is that the whole environment is controlled by the user, so we can upgrade without fear of changing the experimental results. However, a kernel upgrade could still cause changes in performance, even if the userspace is the same.
As we are currently running experiments in Fox, we can run some performance experiments and record the results before and after the upgrade. The results should be the same if this hypothesis holds.
CC @varcila
I have executed the cholesky benchmark before and after the upgrade. The results show some degradation, very noticeable with large tasksizes:
Tasksize 384 (best tasksize)
Tasksize 512 (somewhat large)
Thanks for the report. Those are very interesting, albeit unexpected results.
I would like to reproduce them on my end; do you have a suggested set of parameters (size, blocksize...) for cholesky? I'm planning to use the one in bench6. It is safer if we don't exchange the code, only the steps to reproduce it, so that we can prevent systematic errors.
One potential explanation is that between Linux 6.15.6 and 6.18.3 there are new mitigations that cause a performance impact. If that is the case, it should be easy to reproduce if we test the old and new kernel (keeping the userspace intact) or if we disable mitigations in the new version.
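To see whether the set of active mitigations actually changed between the two kernels, we can dump the vulnerability status the kernel exposes under /sys and diff it across boots. A quick sketch (nothing specific to our benchmarks):

```c
/* Sketch: print the mitigation status the kernel reports for each known
 * CPU vulnerability, so the output can be diffed between the old and new
 * kernel boots. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/devices/system/cpu/vulnerabilities";
	DIR *d = opendir(dir);
	struct dirent *e;
	char path[512], line[256];

	if (!d) {
		perror(dir);
		return 1;
	}
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(line, sizeof(line), f))
			printf("%-24s %s", e->d_name, line);
		fclose(f);
	}
	closedir(d);
	return 0;
}
```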
This suggests that we should add a performance monitor so that we catch this before we merge an upgrade.
Also, I will leave here the revision log, so we know the problem is located between `e42058f08b` and `fcfee6c674`.

About the parameters of cholesky: size 32*1024, tasksize 512 (the one with the larger performance difference). Immediate successor was activated.
Important to note that in my version, I allocate memory using `mmap` and set its memory policy to `MPOL_LOCAL`. This yielded performance differences wrt the version using `malloc`.
Using the cholesky from bench6 with MKL and a bit smaller size, `$((16*1024))`, while comparing kernel 6.18 and 6.12 I get the opposite effect (the newer kernel goes faster). All measurements report time in seconds (lower is better):

Kernel 6.15 is already deprecated from the current Nixpkgs, so we would need to boot into the same old kernel + userspace configuration if we want to compare that one.
I will try with the larger problem size and see if that is enough to replicate it. Otherwise we may need to also do the mmap change.
No significant difference with 32K size, but my times are higher than yours:
As commented on Slack, I would say the difference with my results being faster is probably because in my benchmarks I allocate with `mmap` and set the policy to `MPOL_LOCAL` instead of using `malloc`.
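Roughly, the allocation looks like this (a minimal sketch of an `mmap` allocation bound with `mbind` to `MPOL_LOCAL`; the helper name, flags and error handling here are illustrative, not the exact benchmark code):

```c
/* Sketch: allocate with mmap and bind the region to MPOL_LOCAL instead of
 * going through malloc. Link with -lnuma for mbind(). */
#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h> /* mbind, MPOL_LOCAL */

static void *alloc_local(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Pages will be allocated on the NUMA node of the CPU that first
	 * touches them. */
	if (mbind(p, size, MPOL_LOCAL, NULL, 0, 0) != 0) {
		munmap(p, size);
		return NULL;
	}
	return p;
}
```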
Since we were close to a deadline at the time I made this change, I did not investigate why this yields different results, knowing that the default policy is supposed to be `MPOL_LOCAL`. So I cannot say what the real reason is for this being faster for some workloads.

I've added support for NUMA, but I don't observe any performance difference:
Given that only one CPU is initializing all the memory, I don't think the local policy would behave as intended. AFAIK the page will be allocated on the NUMA node of the first CPU touching that memory, and as only one CPU is initializing it, all the memory will go to the same NUMA node.
Did you perhaps modify the initialization to handle this? I also suspect that the `malloc` behavior would be the same as your custom allocator, as that is the default policy. Even if we change the allocator so that it distributes the initialization among CPUs, we still need to run the computing tasks on their proper NUMA region.
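For reference, distributing the initialization (and thus the first touch) among the CPUs would look roughly like this; a sketch using OpenMP only to illustrate the idea, since the benchmark actually runs its tasks through nOS-V:

```c
/* Sketch: first-touch initialization spread over the CPUs, so each page
 * ends up on the NUMA node of the CPU that is expected to work on it. */
#include <stddef.h>
#include <stdlib.h>

static double *alloc_first_touch(size_t n)
{
	double *a = malloc(n * sizeof(double));
	if (!a)
		return NULL;

	/* Static schedule: the thread (and thus CPU/NUMA node) that touches
	 * a block here should be the one computing on it later. */
	#pragma omp parallel for schedule(static)
	for (size_t i = 0; i < n; i++)
		a[i] = 0.0;

	return a;
}
```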
Given that the other few benchmarks I tested don't show appreciable differences between the old and new kernel, I think we can reject the hypothesis that there is a new mitigation or other change that affects the performance system-wide, which would be a blocker for the upgrade. In any case, I would like to add a performance monitor to prevent this in the future.
I suggest leaving the investigation here for now; perhaps in the future we can try to replicate your current results on the new kernel, which seem to be about 4.22 seconds of execution time vs 5.11 on my end. For that I would need to take a look at your environment and see what is different.
Yes, I agree with that analysis: with one CPU touching all the memory, it should all be allocated on that CPU's NUMA node. Nevertheless, for other benchmarks I saw results contradicting this hypothesis, and I also saw that using the direct `mmap` allocation solved it. My hypothesis was that maybe using explicit first-touch allocation prevented NUMA auto-balancing, which moves pages depending on which CPU uses them. NUMA auto-balancing is enabled in MN5 but disabled in Fox (https://jungle.bsc.es/git/rarias/jungle/src/branch/master/m/fox/configuration.nix#L37), so it is probably not important whether we use `mmap` rather than `malloc`. I don't have a strong reason to keep using `mmap` other than consistency with what I used to get the results I showed in my first comment.
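For reference, whether auto-balancing is active on a given machine can be read directly from /proc; a trivial check:

```c
/* Sketch: print the kernel numa_balancing setting (0 means NUMA
 * auto-balancing is disabled). */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
	int value;

	if (!f || fscanf(f, "%d", &value) != 1) {
		perror("numa_balancing");
		return 1;
	}
	fclose(f);
	printf("kernel.numa_balancing = %d\n", value);
	return 0;
}
```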
Not for cholesky, no.
I'm uploading the results of the old userspace and the new userspace on kernel 6.12.63. Notably, the old userspace uses nOS-V 3.2 and AMD-BLIS 5.0, whereas the new userspace uses nOS-V 4.0 and AMD-BLIS 5.1. Also important to note that the first results shown in this issue were using the old userspace on the new kernel.
Thanks!
Notice that that is not what I mean when I refer to "userspace". The NixOS configuration installed in fox is composed of two parts: the kernel, and all the other programs and libraries installed, which is what I call the userspace. This is the part that you cannot change as a user.
Aside from that, you control your own stack of software by two means: the `nix develop` environment controlled by the flake.nix and flake.lock, and the environment that leaks into that (unless you use `nix develop -i`), which is a mix of the system userspace and what you change via home-manager or other means.

I will refer to the "system" userspace configuration as just the userspace, and the environment that you get with a mix of `nix develop` and home-manager as your environment.
A careful observation of the original data shows that you have a much larger standard deviation:
And this persists with the new results that we observe using the new userspace, even if we switch the kernel to an old one.
Among the userspace changes there may be a daemon that has changed behavior and is now causing more system noise. This would persist even if you change your shell or home-manager, as you don't control that part.
We can run some noise-sensitive benchmarks to see if we observe any interference from other processes. This can also be measured if we run the benchmarks again with `perf stat` and take a look at the context switches.

I did a quick test in owl1 (with a smaller size) and I already see some problems (I'm using `-r 5` to repeat it 5 times):
I see quite a lot of context-switches and several CPU migrations. Ideally this should be 0, or at least a smaller number.
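As a cross-check independent of perf, the benchmark process could also report the context switches it experienced itself via `getrusage()`; a sketch of a hypothetical helper (not something we currently run):

```c
/* Sketch: report the context switches experienced by the process itself,
 * as a cross-check of the perf stat counters. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;

	/* ... run the benchmark kernel here ... */

	if (getrusage(RUSAGE_SELF, &ru) != 0) {
		perror("getrusage");
		return 1;
	}
	printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
	printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
	return 0;
}
```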
Ok, makes sense, thanks for the clarification.
What do you think about this: unless we can have the true old environment, userspace and kernel, it will be much harder and more time-consuming to reach a sound conclusion. I know that there are technical challenges to having exactly the same system as when the previous results were achieved, but I would suggest we wait until we can have that system. And if we can't have that system, then I think we should not spend much more time trying to reproduce those results, and instead use our time developing a reliable performance monitoring system.
Whatever you decide is fine by me. Let me know if I can help with anything :)
I mostly agree with this position. My idea is to collect as much information now as I can, so that I can investigate later on before it is gone. I also need to know how to design the benchmarks so that they can detect a similar problem in the future.
After seeing those context switches, I took a quick look with perf and saw some occasions in which fail2ban is stealing the CPU. I did a quick test with and without fail2ban just to check if that would have any effect, but it doesn't seem to be significant:
Notice that my results all use MKL and I should change it to BLIS, which should reach a performance similar to yours (around 4.03 s). We can compare the assembly of both programs if that's not the case.
Another question I have is whether you have changed something regarding huge pages, as I believe it may affect the performance of this program. I reviewed our conversation over the last months but I cannot find any mention of it.
Sounds fair.
No, I am not using huge pages.
Related to the context switches, what about using `perf sched map` as in https://www.brendangregg.com/blog/2017-03-16/perf-sched.html? There are other perf sched commands that seem promising.

Yes, that's what I have used. Probably a trace with kernel events in ovni may also be useful.
I tested cholesky from bench6 with blis and I get the time down to around 4.34 seconds:
In your environment you also build nOS-V with the "native" flags, which I'm guessing will also have some impact.