MRG, BUG: Fix errant writing of too large int (part 2/2) #8324
Conversation
@larsoner - First, just want to say an extra thank you for looking into this issue. It is really appreciated! Hopefully this will avoid future problems of a similar nature. Second - I switched to this branch and tried rerunning the code that originally gave the problem. First I verified the version:
Then I loaded the large epochs file that saved fine (13 GB).
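A minimal sketch of what those first two steps might look like (the file name is hypothetical; the exact commands were not preserved here):

```python
import mne

# confirm that the dev/branch version is the one being used
print(mne.__version__)

# load the large (~13 GB) epochs file that saved without problems
epochs = mne.read_epochs("session-epo.fif", preload=True)
```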
I then extracted my 10 classes from the data and pulled out an equal number from each so that the 10 classes are balanced:
I then concatenated these together:
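A sketch of the extraction, balancing, and concatenation described above, assuming the 10 classes map onto event names (the names here are hypothetical):

```python
import mne

# hypothetical event names for the 10 classes
class_names = [f"class_{i}" for i in range(10)]

# select the epochs of each class, then keep an equal number per class
per_class = [epochs[name] for name in class_names]
n_keep = min(len(ep) for ep in per_class)
balanced = [ep[:n_keep] for ep in per_class]

# concatenate back into a single, class-balanced Epochs object
epochs_balanced = mne.concatenate_epochs(balanced)
```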
This produces the following output:
I typically plot the event time samples to see whether they have a sharp drop in them (which happens when the values overflow), and that drop does happen in this case. This means that when I attempt to save the concatenated, now equally balanced epochs with the following command:
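A minimal sketch of that check and the save that follows (assuming the `epochs_balanced` object from the sketch above; the output file name is hypothetical):

```python
import matplotlib.pyplot as plt

# the first column of epochs.events holds the sample number of each event;
# an int32 overflow shows up as a sharp drop into negative values
plt.plot(epochs_balanced.events[:, 0])
plt.show()

# this is the save that fails
epochs_balanced.save("balanced-epo.fif", overwrite=True)
```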
I get the FIFF saving error.
I've included the only changes I make to the data in the hope that this might help clarify / give an idea as to what the issue might be.
Can you share those two files?
Sure thing. Here is the link to those two files.
Fingers crossed this allows you to see the issue underlying all this.
@alex159 it looks like this events list is already corrupted on disk:
There is a workaround you can use for now. To store data relevant to your epochs (e.g., if the sample numbers are critical to keep, or represent something other than actual sample numbers), I recommend using metadata: https://mne.tools/dev/auto_examples/preprocessing/plot_metadata_query.html
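One plausible form of such a workaround, consistent with the metadata suggestion (this is a sketch, not necessarily the exact snippet suggested):

```python
import numpy as np
import pandas as pd

# keep the original (possibly overflowed) sample numbers around if needed
epochs.metadata = pd.DataFrame({"orig_sample": epochs.events[:, 0].copy()})

# replace the sample numbers with small consecutive integers so they fit
# safely within the int32 range used when writing FIF files
epochs.events[:, 0] = np.arange(len(epochs))
```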
Hi Eric, just an FYI that the suggested fix didn't work; I still get the FIFF saving error. Not sure if this changes the proposed fix or anything. If I keep the dtype as int32 then it doesn't save, but if I force the dtype of the concatenated epochs to be int64 first and then apply your fix, I get this error when saving:
I've not come across this one before. Looks like the issue is elsewhere.
What is the traceback you get for this?
Good, this is intended behavior. You should not change the dtype yourself and then save; on this PR we now handle this during writing. Where in your code do you get negative events now? Is it immediately after concatenation?
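One simple way to answer that question is to check the event samples after each processing step; an illustrative check (assuming an `epochs` object as above):

```python
import numpy as np

# count events whose sample number has wrapped around to a negative value
n_negative = int((epochs.events[:, 0] < 0).sum())
print(f"{n_negative} events have negative sample numbers")
```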
Pushed a commit that should "correct" the events when reading a FIF that had events incorrectly saved.
Thanks for the response 👍 Keeping things as they are, there are no negative events anymore as I save. That's what I meant before: I don't think that can be the problem causing the failure. I hope something in the traceback can point those who are much more familiar with the codebase toward the solution / fix.
There must be some bug with our estimation of the size we write out. I'll see if I can replicate locally. In the meantime, the workaround above should keep you going.
As a side note -- why not just use `equalize_event_counts`?
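The method being referred to here is presumably `Epochs.equalize_event_counts`; a minimal sketch, with hypothetical event names:

```python
# drop epochs in place so that every condition has the same count;
# the default strategy keeps the retained events as close in time as possible
epochs.equalize_event_counts([f"class_{i}" for i in range(10)])
```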
I added tests with metadata, and even with splitting and concatenating epochs I can't replicate it, unfortunately. If you want to upload your huge file somewhere with a script, I can give it a try; or, if you can live with the workaround for now, that works too.
I didn't know about this method either. I have had really big problems with overfitting though, and am currently trying to have a really random distribution of samples over different recording sessions. It seems the suggested method tries to keep events as close together as possible, which I can certainly see has its place in most studies. However, I do feel better when I am in control of the exact preprocessing pipeline. I will look into this for future analyses though. Thanks for all the help :)
Good to know -- we clearly have a big problem with some aspect of the writing-size estimation. It would be great if you could share the file with me for testing, but it's huge, so let's see if we can make it more reasonable first! If your files are roughly 10 GB and we want them to be 100 times smaller (~100 MB), how about you do something like:
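One plausible form of that suggestion (the file name is hypothetical, and it assumes the `epochs_balanced` object from the earlier sketch):

```python
# keep every 100th epoch to shrink the dataset by roughly a factor of 100
epochs_small = epochs_balanced[::100]

# save the reduced set, still exercising the split logic
epochs_small.save("small-epo.fif", overwrite=True, split_size="10MB")
```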
Also feel free to try with higher decimation factors. If this doesn't work, I can also live with the ~10 GB transfer if you have somewhere you can put it up for even a short period of time.
@alex159 do the ideas above about how to share make sense?
Ping @alex159, any chance of getting a version of the problematic file? @agramfort, in the meantime I think this can be merged; it already has some useful fixes.
@larsoner Apologies, I've been out of the country for the past 8 days and just got back. I will attempt what you proposed this afternoon. Looks like a good way to figure out the problem. I will update with the link once it's up.
Here are the steps I followed and the resulting data: I loaded the same data I was working with before. I then extracted my 10 classes, took the same number (9640) from each, and concatenated them together. This resulted in a new Epochs data structure with 96400 events, 9.24 GB in memory.
I then attempted to extract every 100th event using the suggested code above:
This resulted in an Epochs data structure with 964 events, approximately 94.8 MB, with each of the 10 classes having 96 or 97 samples.
That's fine so far. However, when I try to save this with the split size set to "10MB", I then get over 25 files, each 591.6 MB in size. I almost ran out of memory performing this, so I interrupted the process, because this is obviously now larger than the original data I loaded and I don't know how much longer it would have gone on making more of those files. There is something definitely wrong here. I will try to play around a bit more with it, but this last result has very much confused me, as I was not expecting it. Is there anything else you would like me to try?
Edit: I can also confirm:
I think I will have to load some of these 500 MB files and see what is inside them, because it's not just the data that was extracted, and it adds up to more than the 9.2 GB that I loaded originally.
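One way to inspect what was produced is simply to list the split files and their sizes on disk (the file name pattern here is hypothetical):

```python
import glob
import os

# list the split files written by epochs.save() and their sizes
for fname in sorted(glob.glob("small-epo*.fif")):
    print(fname, f"{os.path.getsize(fname) / 1e6:.1f} MB")
```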
If you copy just the first two files to a different directory, can you still read them? If so, then if you can at least upload those first two files (~1 GB) somewhere, I'm happy to download them and I should be able to replicate your issue. If you want to try making things smaller, or if just having the two files will not be enough, let me know.
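A sketch of the read check being asked for here (the directory and file names are hypothetical):

```python
import mne

# reading the first file follows the split chain; this fails if a
# referenced continuation file is missing from the directory
epochs = mne.read_epochs("copied_splits/small-epo.fif", preload=True)
```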
I copied 3 of the files over to a new folder, but unfortunately reading them still fails. The files each reference the next file in the split sequence, and calling the read function on the first one produces an error when it goes looking for a later file in the chain.
Okay can you upload those first two files anyway? I'll try to hack the MNE code not to look for the third, and see if I can replicate the issue. |
Sure thing. Here is a link. I hope it's a bit clearer to see what might be going on now that these files are downloadable.
Okay @alex159 it looks like:
This PR fixes this by:
Can you try your big concatenation / splitting / saving code with an up-to-date checkout of this branch?
Hi @larsoner, thanks so much for this fix. I re-ran the same portion of the code that earlier gave rise to all of these problems, and it all worked fine. At first I forgot to add a manual check that the saved data match the original.
I then added that check in, and I can confirm it matches, so it does save all the data. I have to say, this fix has dramatically increased the speed of loading and saving, at least on my data. I used to have time to go and make a cup of coffee between calling the load / save function and those operations actually completing; now it happens in a fraction of the time. Seriously, a huge thank you is in order! This is a very much appreciated fix.
So ~8GB gets written to disk when the drop log and selection are there, but 4 x 576 MB = ~1.9 GB of this is drop log. So really it sounds like you have something like ~6 GB of data that should be written (?).
So that's ~4.8 GB on disk. Keep in mind that the default format for saving epochs is single precision (float32), and the data are upconverted to float64 when loaded into memory, which doubles their size. This ~4.8 GB of data is lower than the expected ~6 GB above, but I'm probably forgetting to factor something in here. As long as the I/O round-trips work, I think we're good!
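A rough back-of-the-envelope check of that point, using the sizes quoted earlier in this thread:

```python
# ~9.24 GB of float64 data in memory (the 96400-epoch object mentioned above)
in_memory_gb = 9.24

# saved with the default single-precision format, the data take roughly half
on_disk_gb = in_memory_gb / 2
print(f"expected on disk: ~{on_disk_gb:.1f} GB")  # ~4.6 GB, close to the observed ~4.8 GB
```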
Please do let us know if there are other nasty bottlenecks like this in your code. Things shouldn't take this long :) (Unless of course you are doing some crazy high-dimensional statistical or machine learning task.)
When I originally said ~2 GB, it was actually 1.85 GB, so making that correction it all seems to fall into place. I didn't know about the upconversion to float64 doubling the size of the data; that was the key piece of information that was missing. So, overall, the expectation is more around ~5 GB to be written, and we see 4.8 GB, so I think that's close enough; some slight imprecision is probably creeping in somewhere.
I will definitely do that. I remember that MNE used to be really fast at loading / saving, then about a year ago it became frustratingly slow, which is around the time this data first got put together, and I haven't worked with other neural data in MNE since then. I thought it was a fault with the computer. Knowing what I know now, I can imagine all sorts of reasons that might cause bottlenecks, and I'll be happier to raise an issue if I think something is wrong. I'm sure this was a patch waiting to happen at some point anyway, and it's good that people in the future won't even bump into this error now that there's a solution. Lucky them :)
Splitting into two PRs.
Closes #7897
@alex159 can you switch to this branch and try your original code that failed and see where you get an error now? Hopefully it's in the instantiation of the Epochs object or during writing. If it's during writing, then we should figure out what other functions you're using that are modifying `events` without checking to see if the int32-max value is being exceeded.
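As an illustration of the kind of check being described, any code that rewrites the events array could guard against overflow before saving (assuming an `epochs` object as above):

```python
import numpy as np

# FIF stores event samples as 32-bit integers, so values beyond this
# range cannot be written faithfully
INT32_MAX = np.iinfo(np.int32).max
if np.abs(epochs.events).max() > INT32_MAX:
    raise ValueError("event values exceed the int32 range; reset the sample "
                     "numbers (or store them in metadata) before saving")
```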