Add support for AMD uProf in fox #125
Adds support for AMD μProf, which allows users to extract low-level performance metrics from the AMD CPUs in fox. It comes with a custom driver, which needs a bit of patching to adapt to the latest kernel.
Added a systemd service instead of a udev rule, because the module does not emit uevents and so does not seem to trigger udevd.
@varcila let us know if you find any issue, so far it seems to open fine.
e0ecd2be0c to 9c32f54bf4
Title changed from "WIP: Add support for AMD uProf in fox" to "Add support for AMD uProf in fox"
51a77777f1 to 93cc24a40b
Sorry for the delay! The following command fails as follows:
The following command should be enough to test that we can capture L3 uncore events. If this works, I am pretty confident the rest of the perf events can be captured:
Note the warning on not finding libnuma. Other similar warnings happen with other commands, but are not critical.
I would add the uProf user guide to the docs (Gitea does not allow me to upload it here, the file is too large).
Online uprof user guide updated to June 2025
Added a couple of fixes, now it should work:
I would prefer to add a link to the document so we don't make the repository too large. Maybe we can add this one to the fox page? https://docs.amd.com/r/en-US/57368-uProf-user-guide
The file is heavy, but since AMD/Intel links sometimes change or break, I’d prefer including the PDF. It’s about 20 MiB, so not huge. That said, it's totally fine if you’d rather just link it.
We already have some pdf files in jungle in doc:
Notice that of the total 30 MiB in the flake source, 25 MiB are PDFs. This is not ideal, since the source is copied to the store every time we evaluate a different revision (even though we only need the PDFs for hut/tent, which host the docs).
This is already a problem with the nixpkgs repo, where the flake source takes 400 MB, which can add up quickly.
I am not opposed to adding the pdf, but we should be wary of this and consider other alternatives to host large files out of our source tree.
If you worry it disappears we can keep a copy elsewhere. However, it is likely that the user guide will change as they update their version.
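The size figures quoted above can be re-measured with a short script whenever documents are added. This is only a sketch in Python: the helper name `size_by_suffix` is hypothetical, and skipping `.git` is an assumption about what counts as source, so the numbers may not match what Nix actually copies to the store.

```python
import os


def size_by_suffix(root: str, suffix: str = ".pdf") -> tuple[int, int]:
    """Return (bytes in files ending with `suffix`, total bytes) under `root`."""
    matched = 0
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: Git metadata is not part of the source we care about.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow or count symlinks
            size = os.path.getsize(path)
            total += size
            if name.lower().endswith(suffix):
                matched += size
    return matched, total


if __name__ == "__main__":
    pdf_bytes, all_bytes = size_by_suffix(".")
    print(f"{pdf_bytes / 2**20:.1f} MiB of {all_bytes / 2**20:.1f} MiB are PDFs")
```

Run from the top of a checkout, it prints the PDF share of the tree in MiB.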
I wanted to test Git LFS for this use case, but I have never used it before. Gitea has support for it.
I’m aware of how big the repo already is. If Git LFS works okay, we may be able to rewrite the history to get rid of the big blobs and keep the other docs as well.
Let’s first address the blockers of this PR so we can merge it, and then evaluate other solutions in another issue.
ba98023645 to c8bc19d891
Uploaded a copy of the PDF to jungle web server for now:
ba98023645..c8bc19d891
It is here: https://jungle.bsc.es/pub/57368-uprof-user-guide.pdf
Kernel module amd_hsmp seems to be missing. I tried a command I used for getting the power metrics and a bunch of other events, and it complains about not having said module. I can confirm that the version of uProf used for the thesis did report power metrics with this specific command.
The command, also showing the output:
Fixed now:
Regarding that "Unable to find libnuma.so", they seem to be dlopening libnuma from a hardcoded path:
So even if we add it to the runpath it doesn't work. Let me see if I can patch the string so that it dlopens only "libnuma.so".
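The idea of shortening the hardcoded path to a bare library name can be illustrated outside radare. This is a sketch in Python of the technique only, not the patch used in this PR: the path `/usr/lib64/libnuma.so` is a made-up placeholder (the real string in the uProf binaries is not shown in this thread) and `patch_c_string` is a hypothetical helper. The replacement is padded with NUL bytes so the file size and all other offsets stay unchanged.

```python
def patch_c_string(data: bytes, old: bytes, new: bytes) -> bytes:
    """Overwrite the NUL-terminated string `old` with `new`, keeping the size."""
    if len(new) > len(old):
        raise ValueError("replacement must not be longer than the original")
    needle = old + b"\x00"
    hits = data.count(needle)
    if hits != 1:
        # Refuse to guess: zero or several matches means the binary changed.
        raise ValueError(f"expected exactly one match, found {hits}")
    offset = data.find(needle)
    # Pad with NUL bytes so every later offset in the file stays the same.
    padded = new.ljust(len(old), b"\x00") + b"\x00"
    return data[:offset] + padded + data[offset + len(needle):]


# Placeholder data standing in for a binary with an embedded absolute path.
blob = b"head\x00/usr/lib64/libnuma.so\x00tail"
patched = patch_c_string(blob, b"/usr/lib64/libnuma.so", b"libnuma.so")
assert len(patched) == len(blob)
```

With only a bare name left, dlopen goes through the dynamic linker's usual library search instead of the fixed location. The uniqueness check makes the patch fail loudly rather than silently write to the wrong place.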
Patched:
LGTM
@ -0,0 +25,4 @@
version = "5.1.701";
tarball = "AMDuProf_Linux_x64_${version}.tar.bz2";
uprofSrc = runCommandLocal tarball {
I would add a comment to remember to update the radare patch addresses when changing the source.
I'll add a md5sum check as well.
@ -0,0 +2,4 @@
, lib
, amd-uprof
, curl
, cacert
curl and cacert were left over.
@ -0,0 +2,4 @@
# so it matches NixOS:
#
# Change OS name to NixOS
wz NixOS @ 0x00550a43
I was fiddling with radare, and we could try to make this more robust by doing something like:
/ original string ; wz NixOS @ hit0_0
and idem for libnuma.
But I don't know how to get the address of the mov ecx instruction so it's a bit pointless. I could not figure out how to get radare to stop the execution of the script when an error occurs either.
It is not safe to adapt the patch automatically; it would need to be updated manually. Ideally we should ask upstream to add a NixOS option (or use it as a fallback if none matches).
Yeah, I wanted the derivation to fail if radare found something amiss, but the md5 checksum is cleaner.
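The checksum guard discussed above can be pictured as a check that runs before any write at a fixed address. This is a minimal Python sketch under stated assumptions: `require_md5` and `patch_at` are hypothetical helpers, the sample bytes and offset are placeholders, and the real digest and addresses are those in the PR, not shown here. The write mimics a zero-terminated string write at a given offset.

```python
import hashlib


def require_md5(data: bytes, expected_hex: str) -> None:
    """Raise if `data` does not hash to `expected_hex`."""
    actual = hashlib.md5(data).hexdigest()
    if actual != expected_hex:
        raise RuntimeError(f"md5 mismatch: expected {expected_hex}, got {actual}")


def patch_at(data: bytes, offset: int, new: bytes, expected_md5: str) -> bytes:
    """Write `new` plus a NUL at `offset`, but only into the known input."""
    require_md5(data, expected_md5)
    payload = new + b"\x00"
    if offset < 0 or offset + len(payload) > len(data):
        raise ValueError("patch does not fit inside the data")
    return data[:offset] + payload + data[offset + len(payload):]


# Placeholder bytes; "SomeDistro" stands in for whatever string is replaced.
original = b"AAAA" + b"SomeDistro\x00" + b"BBBB"
known_md5 = hashlib.md5(original).hexdigest()
patched = patch_at(original, 4, b"NixOS", known_md5)
```

If the tarball is updated and the addresses go stale, the digest no longer matches and the patch step stops before touching anything.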
4e3c41c9fa to 967709982a
967709982a to 017e0d82f7