-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for sanitised builds #182
Conversation
I'd have a couple of general comments on this:
|
@KubaSz thanks for your points, I add some replies:
We usually run CMake through Python tasks, and I add the typo checking there.
Oh so you are saying the output of sanitised runs is non-deterministic? That can be an issue. The problem is that there are now many warnings. What I was trying to do was to check for just newly added warnings. I was deliberatly not using |
True, I just like having stringly-typed values checked at every step if it doesn't impact performance. The scripts might evolve at some point and the Cmake or python version might become out of sync, and without the sanitizer it could just silently pass tests.
I don't remember the exact format of those outputs, but at least leak checking did output general memory statistics when I last used it, so it would be mostly deterministic, but change with any code changes due to added/removed allocations.
Even if it is hard to read, it's useful to have it - and in the actions view you can just collapse that section if you don't want to look at it at this time. |
135b66d
to
495b26a
Compare
495b26a
to
1e98852
Compare
4c35084
to
7620977
Compare
7620977
to
51b7fac
Compare
…ned up on exceptions and any scope exit
This is now ready for review, the only thing that I would potentially look into before merging is the MPI TSan failures (currently disabled in the ignore list file), send/recv do some things with buffers that I don't quite understand and TSan sees it as a data race - it might be a false positive (if the synchronization happens through sockets which is invisible for tsan), or not. |
@csegarragonz It would be good if you review this too, I can't ask you via github as you opened the pr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KubaSz @csegarragonz awesome work guys, this looks great. A few nitpicky comments mixed in with some small changes/ questions.
.github/workflows/sanitisers.yml
Outdated
- uses: actions/cache@v2 | ||
with: | ||
path: '~/.conan' | ||
key: ${{ runner.os }}-${{ steps.get-conan-version.outputs.conan-version }}-${{ hashFiles('cmake/ExternalProjects.cmake') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large is the resulting cache file? I know there's a limit on GHA caches, at which point it will evict the oldest keys, just want to make sure this isn't causing us to hit this limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
430 MB, the limit is 10GB per repo (https://github.com/actions/cache), and this should be a cache hit most times as the dependency versions don't change all that often
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this will yield a great speedup to the runtime of general tests if we use the caching there (#187), but @KubaSz it looks as though we may using the wrong key: https://github.com/faasm/faabric/runs/4337385997?check_suite_focus=true#step:7:4
(if the logs are to be trusted of course)
could you double-check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, seems that hashFiles doesn't work, let me look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to replace it with manual hashing in bash+sha256sums, it works in the most recent commit. @csegarragonz This change will probably need porting over to #187
- name: "Build dependencies to be shared by all sanitiser runs" | ||
run: inv dev.cmake -b Debug | ||
|
||
address-sanitiser: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a job for each different sanitizer? This file is quite bloated as a result, so I'd be interested to know the time difference in running them sequentially in a single job, repeating the call to dev.sanitise
and faabric_tests
each time. The GHA docs say each job in a workflow runs in a "fresh instance" of the VM, but I don't know whether this means a completely separate VM on a different host, or if it's actually just cramming lots of VMs together on the same allocated physical resources (in which case it might be counterproductive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It all runs in parallel, and each sanitizer requires a full rebuild of the source (but not dependencies), so this saves about 20 mins of time on tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool, is that in theory or in practice? I'm just surprised that GHA gives out resources that freely. I don't think we should change it, am just curious.
# Catch2 allocates in its signal handler, this prevents showing the wrong crash report | ||
signal:* | ||
|
||
# TODO: Remove: There's something weird going on with MPI code I don't understand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for me to see the races tsan is complaining about I'd just have to re-run commenting out the next line right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would terminate on the first detected race. For me that was in the AllGather implementation
The only thing I would add to Simon's comments is to see if you can bump the faabric dependency in faasm as well; just to double check nothing is broken. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome LGTM 👍
It looks like a faasm distributed test fails with this merged (on top of PR faasm/faasm#540):
|
@KubaSz yes, this should be fixed by #181. but wait, you said you patched on top of faasm/faasm#540, that should contain #181 in a way already 🤔 |
@csegarragonz It won't contain it as this is a couple of commits behind on faabric, and the way faabric replacement works is it fully replaces the tree with this one. I will merge master and rerun the tests just to make sure, then it'll be ready to merge (I can't as I don't have perms on this repo) |
This is a moment where a tool like https://github.com/bors-ng/bors-ng is useful, but I don't think faasm is at a scale where it's worth setting up yet, it's just a temporary influx of PRs with all of us working on some changes right now |
@KubaSz yes, this is ready to go; once it's ready ping me on here and i will merge it. thanks again, this looks great |
Huh |
@csegarragonz I think it's ready to merge, it fails on the same things that your PR was failing in faasm, and only in dist-tests, so I don't think it needs to be blocked by that |
In this PR I introduce a task to build the tests with different sanitisers.
I also add a workflow that runs in parallel to the tests that checks for newly added sanitising errors. It checks against a set of files I commit in this PR containing the output of running the sanitising as of today.
After offline discussion, I think the changes to allow for builds with sanitising are to be kept. The actual workflow file is a different discussion. Before using it, the codebase would have to be in a state where we trigger no warnings. The worflow file is usable and ready to be merged in: it only runs
on: workflow_dispatch
i.e., if manually triggered from the UI.The low-hanging fruit to reduce a bunch of warnings is:
-Wuninitialized
and-Wmaybe-uninitialized
.