SegFault in BP5Serializer::CollectFinalShapeValues() #3671
Thanks. Probably we need more detail from the variable declarations, any calls to SetShape, etc. If the variable is classified as a GlobalArray, the Shape value presumably was set at some point. The question is what happened to it? Was it reset somehow? Was the variable destroyed? Maybe you can point me to the source somewhere? |
Hi @eisenhauer, thank you for your reply. You can find
Let me know if this is helpful and if any further information would be useful. |
Hmm. I haven't looked at the Sensei source before. It is C++ code using the ADIOS C interfaces, which complicates things a bit, and at least from my initial look at Sensei I don't have a good guess as to what might be going on. The ADIOS change that seems to be implicated here, grabbing the Shape of a variable at EndStep rather than at Put so that it can be changed after the Put, shouldn't have an impact on anything I see, but the logic is complex enough that I can't be completely sure of what is going on just by inspection. Unfortunately that probably means trying to reproduce this somewhere I can either examine a core dump or add additional diagnostics, which may not happen right away. Will do what I can though. |
I tried to find the commit which introduced this issue and went back to the 14th of March for now.
I will go back in time a bit further to older commits tomorrow ... |
OK, so that puts a different spin on things. I don't recall exactly when we made BP5 the default serializer for SST, but I suspect that if this code worked on a prior version of ADIOS then perhaps it was using the older "bp" marshalling method. You can see if that works by setting the engine parameter "MarshalMethod" to a value of "bp" (even with the newest ADIOS). If it does, that may narrow down the problem. |
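For readers who want to try the suggestion above, here is a minimal sketch of the parameter set involved. It deliberately avoids linking against ADIOS2: it only builds the key/value map, relying on the fact that `adios2::Params` is a `std::map<std::string, std::string>`. The helper name `sst_params_with_bp_marshalling` and the IO name in the comments are hypothetical, chosen for illustration.

```cpp
#include <map>
#include <string>

// Same underlying type as adios2::Params, declared locally so the sketch
// compiles without the ADIOS2 headers.
using Params = std::map<std::string, std::string>;

// Hypothetical helper: engine parameters that ask SST to use the older
// "bp" marshalling method instead of BP5, as suggested in the thread.
Params sst_params_with_bp_marshalling() {
    Params p;
    p["MarshalMethod"] = "bp";
    return p;
}

// With ADIOS2 available, the map would be applied roughly like this
// ("writer" is a placeholder IO name, not taken from Sensei):
//   adios2::IO io = adios.DeclareIO("writer");
//   io.SetEngine("SST");
//   io.SetParameters(sst_params_with_bp_marshalling());
// From the C bindings, the equivalent call would be
//   adios2_set_parameter(io, "MarshalMethod", "bp");
```

How Sensei itself forwards engine parameters (for example through its transport XML file) is not shown in this thread, so the wiring above is an assumption about a plain ADIOS2 application rather than a description of Sensei.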
You are right! With |
So you have a workaround for the moment. By and large, BP4 operates on metadata provided to it (shape, start, count arrays) at the moment of Put(), but BP5 gains efficiency through bulk processing in EndStep. In looking at Sensei code, it appears that the metadata arrays are often stack-allocated at the time of Put() and EndStep doesn't appear in the same subroutine, so it's a pretty good guess that somehow the deallocation of those arrays is tied to what's going on. We still have to sort out whether this is something that happens only when going through the C bindings, and how best to fix it. When you call things like adios2_set_selection and adios2_set_shape, you pass in the address of metadata arrays in application space, but I'm not sure we're clear on the requirements for how long that metadata should persist, if ADIOS commits to copying it when provided, etc. I think I can work from the Sensei code in ADIOS2Schema.cxx to replicate the issue in some test code to sort out exactly what's going on and where to go from here. It'll likely be a few days though. |
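The timing difference described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `EagerWriter`, `DeferredWriter`, and `shape_seen_after_caller_update` are hypothetical names, not ADIOS2 classes, and the thread does not establish that ADIOS2 actually keeps the caller's pointers. The example keeps the caller's array alive until after `EndStep()`, so it is well defined; it shows only that a writer reading metadata at `EndStep()` observes whatever the caller's array holds at that moment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy types for illustration only; not ADIOS2 code.
using Dims = std::vector<std::size_t>;

// Strategy A: copy the shape at Put() time (the thread describes BP4
// as working from metadata supplied at the moment of Put()).
struct EagerWriter {
    std::vector<Dims> recorded;
    void Put(const std::size_t *shape, std::size_t ndims) {
        recorded.emplace_back(shape, shape + ndims); // deep copy now
    }
    std::vector<Dims> EndStep() {
        std::vector<Dims> out;
        out.swap(recorded);
        return out;
    }
};

// Strategy B: remember only where the shape lives and read it in
// EndStep() (a stand-in for bulk processing at the end of the step).
struct DeferredWriter {
    struct Pending {
        const std::size_t *shape;
        std::size_t ndims;
    };
    std::vector<Pending> pending;
    void Put(const std::size_t *shape, std::size_t ndims) {
        pending.push_back({shape, ndims}); // no copy yet
    }
    std::vector<Dims> EndStep() {
        std::vector<Dims> out;
        for (const Pending &p : pending) {
            out.emplace_back(p.shape, p.shape + p.ndims); // read happens here
        }
        pending.clear();
        return out;
    }
};

struct Observed {
    std::size_t eager;
    std::size_t deferred;
};

// The caller overwrites its shape array between Put() and EndStep().
// The array stays in scope the whole time, so both reads are valid.
Observed shape_seen_after_caller_update() {
    std::size_t shape[2] = {4, 8};
    EagerWriter eager;
    DeferredWriter deferred;
    eager.Put(shape, 2);
    deferred.Put(shape, 2);
    shape[0] = 16; // metadata changes after Put()
    return {eager.EndStep().at(0).at(0), deferred.EndStep().at(0).at(0)};
}
```

In this model the eager writer reports the value from `Put()` time while the deferred writer reports the updated one. If the array had instead gone out of scope before `EndStep()`, the deferred read would be undefined behaviour, which is the kind of failure the comment above speculates about.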
If it were easy to get a core dump file of the original failure in CollectFinalShapeValues and print VB->m_Name, that might help narrow down exactly which usage was problematic... |
I have run the example with debug flags and core dumps enabled. Here it segfaults with
Here is the full backtrace from the core file (output of `gdb` -> `bt full`):
You can find the whole job-output including the core files here |
Well, I thought I could easily recreate what I thought was happening in a simple test and debug the problem. I tried that and so far I've failed. I expect that I'm going to have to build SENSEI and try your examples in order to reproduce, but it might be a few weeks before I'm able to do that. Just FYI... |
Thank you for looking into this. |
Hi @eisenhauer, |
Sorry, got as far as downloading SENSEI last week and then got distracted by a critical demo (and the need to wipe and reinstall my laptop because of an ongoing problem). This is on my list for this week, possibly later today. |
Cool, I keep my fingers crossed :) |
OK, I've spent enough time on this that I've got it running, but I'm not able to reproduce the problem. Some things to note: I built with SENSEI github master and ADIOS2 github master (which is close enough to 2.9.0 that it shouldn't matter). The first thing I found is that SENSEI had compilation failures with ADIOS 2.9.0 because of the changes in the ADIOS API (elimination of DebugMode in adios2_init()). I edited sensei/ADIOS2AnalysisAdaptor.cxx and sensei/ADIOS2Schema.cxx to eliminate the debug mode parameter and things compiled fine. I skipped the slurm script and instead ran the two clients using MPI on my laptop. I get no segfaults, but I do see some weird behaviour, some of which I can trace to the sensei-transport.xml file. For example, RendezvousReaderCount=0 means that the oscillator can and will produce data that is dropped on the floor until the sensei process shows up. QueueFullPolicy=discard also means that even after the connection is made, if the producer is producing data faster than it can be sent or consumed, that data will be discarded. (None of this means that the code where you seemed to be seeing the segfault wouldn't be executed; it would. It's just that the data and metadata block it produced would be discarded.) I guess the upshot is that I'm at a dead end. I've tried to reproduce the issue both with and without Sensei without having any luck. I'm wondering a bit about what version of Sensei you might be using, since I had to do source-level tweaks just to get it to compile with post-2.9.0 ADIOS. I've seen some anomalies, but nothing that should result in the symptoms that you are seeing. Not quite sure where to go from here... |
Hi Greg, this is very surprising. I will go through this again based on your information and get back to you in the next few days. |
Sorry, it is taking longer than expected to find the time to continue. But I am on it ... |
ADIOS2 (2.9.0 and latest) segfaults in BP5Serializer::CollectFinalShapeValues() when using SST in combination with SENSEI (latest).
I assume this happens at THIS code line.