MRG, BUG: Fix errant writing of too large int (part 2/2) #8324

Closed
wants to merge 7 commits

Conversation

larsoner
Member

Closes #7897

@alex159 can you switch to this branch and try your original code that failed and see where you get an error now? Hopefully it's in the instantiation of the Epochs object or during writing. If it's during writing, then we should figure out what other functions you're using that are modifying events without checking to see if the int32-max value is being exceeded.

@Alxmrphi

Alxmrphi commented Oct 1, 2020

@larsoner - First, just want to say an extra thank you for looking into this issue. It is really appreciated! Hopefully this will avoid future problems of a similar nature.

Second - I switched to this branch and tried rerunning the code that originally gave the problem. First I verified the version:

assert mne.__version__ == '0.22.dev0'

Then I loaded the large epochs file that saved fine (13 GB).

epochs_concat = mne.read_epochs('epochs-epo.fif')

I then extracted my 10 classes from the data and pulled out an equal number of epochs from each so that the 10 classes are balanced:

min_val = 9640
class1 = epochs_concat["class1"][:min_val]
class2 = epochs_concat["class2"][:min_val]
class3 = epochs_concat["class3"][:min_val]
class4 = epochs_concat["class4"][:min_val]
class5 = epochs_concat["class5"][:min_val]
class6 = epochs_concat["class6"][:min_val]
class7 = epochs_concat["class7"][:min_val]
class8 = epochs_concat["class8"][:min_val]
class9 = epochs_concat["class9"][:min_val]
class10 = epochs_concat["class10"][:min_val]

I then concatenate these together:

epochs_concat = mne.concatenate_epochs([class1, class2, class3, class4, class5, class6, class7, class8, class9, class10])

This produces the following output:

C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py:2975: RuntimeWarning: overflow encountered in long_scalars
  int((10 + tmax) * epochs.info['sfreq']))

Adding metadata with 16 columns
Replacing existing metadata with 16 columns
96400 matching events found
Applying baseline correction (mode: mean)
Created an SSP operator (subspace dimension = 1)

<ipython-input-6-7b59483c15a2>:15: RuntimeWarning: The events passed to the Epochs constructor are not chronologically ordered.

0 bad epochs dropped

I typically plot the event time samples to see if they have a sharp drop in them (which is what happens when overflow occurs), and it does happen in this case.
Here is a picture of the plot: [plot of the event sample numbers, showing the sharp negative drop]
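
That check is easy to reproduce; a minimal sketch, assuming matplotlib is available and epochs_concat is the concatenated object from above:

import matplotlib.pyplot as plt

# Plot the event onset samples (first column of the events array); an int32
# overflow shows up as a sharp drop into large negative values.
plt.plot(epochs_concat.events[:, 0])
plt.xlabel('epoch index')
plt.ylabel('event sample number')
plt.show()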

This means that when I attempt to save the concatenated, now balanced, epochs with the following command:

epochs_concat.save('epochs_balanced-epo.fif')

I get the OSError:

OSError                                   Traceback (most recent call last)
<ipython-input-7-09f8c55bb895> in <module>
      1 # save
----> 2 epochs_concat_reduced.save('bigram_10class_equal_epochs-epo.fif')

<decorator-gen-200> in save(self, fname, split_size, fmt, overwrite, verbose)

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in save(self, fname, split_size, fmt, overwrite, verbose)
   1662             # avoid missing event_ids in splits
   1663             this_epochs.event_id = self.event_id
-> 1664             _save_split(this_epochs, fname, part_idx, n_parts, fmt)
   1665 
   1666     def equalize_event_counts(self, event_ids, method='mintime'):

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in _save_split(epochs, fname, part_idx, n_parts, fmt)
     92 
     93     with start_file(fname) as fid:
---> 94         _save_part(fid, epochs, fmt, n_parts, next_fname, next_idx)
     95 
     96 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in _save_part(fid, epochs, fmt, n_parts, next_fname, next_idx)
    190     end_block(fid, FIFF.FIFFB_PROCESSED_DATA)
    191     end_block(fid, FIFF.FIFFB_MEAS)
--> 192     end_file(fid)
    193 
    194 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\io\write.py in end_file(fid)
    347     """Write the closing tags to a fif file and closes the file."""
    348     write_nop(fid, last=True)
--> 349     check_fiff_length(fid)
    350     fid.close()
    351 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\io\write.py in check_fiff_length(fid, close)
    340         if close:
    341             fid.close()
--> 342         raise IOError('FIFF file exceeded 2GB limit, please split file or '
    343                       'save to a different format')
    344 

OSError: FIFF file exceeded 2GB limit, please split file or save to a different format

I've included the only changes I make to the data in the hope that this gives an idea of what the issue might be. If there is anything else I can try, please let me know.

  • Alex

@larsoner
Member Author

larsoner commented Oct 1, 2020

epochs_concat = mne.read_epochs('epochs-epo.fif')

Can you np.savez_compressed('out.npz', epochs.events), mne.io.write_info('out-info.fif', epochs.info) (anonymize the instance first if you want), upload these somewhere and let me know the len(epochs) and len(epochs.times)? This will hopefully be enough for me to reproduce the issue with only a couple MB uploaded somewhere rather than several GB.
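
Spelled out, that request is roughly the following (a sketch; anonymizing is optional):

import numpy as np
import mne

epochs = mne.read_epochs('epochs-epo.fif')
epochs.anonymize()  # optional: strip identifying fields before sharing
np.savez_compressed('out.npz', epochs.events)   # the events array, a few MB at most
mne.io.write_info('out-info.fif', epochs.info)  # the measurement info
print(len(epochs), len(epochs.times))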

@Alxmrphi

Alxmrphi commented Oct 1, 2020

Sure thing.

Here is the link to those two files.

print(len(epochs))
print(len(epochs.times))

>> 96400
>> 201

Fingers crossed this allows you to see what the issue is underlying all this.
Thanks! - Alex

@larsoner
Member Author

larsoner commented Oct 1, 2020

@alex159 it looks like this events list is already corrupted on disk:

[[     830494           0         103]
 [     831020           0         103]
 [     833887           0         103]
 ...
 [-1847587768           0         110]
 [-1847584272           0         110]
 [-1847584156           0         110]]

A workaround for you for now is to do epochs.events[:, 0] = np.arange(1, len(epochs.events) + 1). I assume the problem arose when doing concatenate_epochs, so I've made it so that concatenate_epochs now does this substitution (with a warning) when concatenating epochs whose event numbers are too large.

To store data relevant to your epochs (e.g., if the sample numbers are critical to keep, or represent something other than actual sample numbers), I recommend using metadata: https://mne.tools/dev/auto_examples/preprocessing/plot_metadata_query.html
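
A sketch of that combination, keeping the original sample numbers around before resetting them (the metadata column name here is made up; this Epochs instance already carries metadata):

import numpy as np

md = epochs.metadata.copy()
md['orig_sample'] = epochs.events[:, 0]   # hypothetical column name for the old sample numbers
epochs.metadata = md
epochs.events[:, 0] = np.arange(1, len(epochs.events) + 1)  # the workaround above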

@Alxmrphi

Alxmrphi commented Oct 5, 2020

Hi Eric,

Just an FYI that the suggested fix didn't work - still get the FIFF saving error. Not sure if this changes the proposed fix or anything. If I keep the dtype as int32 then it doesn't save, but if I force the dtype of the concatenated epochs to be int64 first, then when applying your fix, I get this error when saving:

ValueError: events array values must not exceed 2147483647, got 2447383140

I've not come across this one before; it looks like the issue is elsewhere.
This error comes up in the mne.concatenate_epochs stage, before I am able to overwrite the events (it is triggered before I even have access to the new variable that needs to be overwritten).

@larsoner
Member Author

larsoner commented Oct 5, 2020

If I keep the dtype as int32 then it doesn't save

What is the traceback you get for this?

if I force the dtype of the concatenated epochs to be int64 first, then when applying your fix, I get this error when saving:

Good, this is intended behavior. You should not change the dtype, it is unsafe. Try instead doing:

epochs.events[:, 0] = np.arange(1, len(epochs.events) + 1)

then saving. On this PR we do this now in concatenate_epochs.

Where in your code do you get negative events now? Is it immediately after epochs = read_epochs(fname)? If so, we can add a little bit of code to read_epochs to "reset" the event numbers when it sees negative ones.
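
A quick way to answer that question (a sketch, reusing the file name from earlier in the thread):

import mne

epochs = mne.read_epochs('epochs-epo.fif')
n_neg = (epochs.events[:, 0] < 0).sum()
print(n_neg, 'events have negative (overflowed) sample numbers')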

@larsoner
Member Author

larsoner commented Oct 5, 2020

Pushed a commit that should "correct" the events when reading a FIF that had events incorrectly saved.

@Alxmrphi

Alxmrphi commented Oct 5, 2020

If I keep the dtype as int32 then it doesn't save

What is the traceback you get for this?

Thanks for the response 👍

Keeping the dtype as int32, I again extract equal amounts of the 10 classes and concatenate the results. I then use the proposed fix to overwrite the first column with np.arange, and when I call the save method, this is the traceback:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-23-9a7da58ff897> in <module>
      1 # save
----> 2 epochs_concat.save('epochs_concat-epo.fif', overwrite=True)

<decorator-gen-200> in save(self, fname, split_size, fmt, overwrite, verbose)

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in save(self, fname, split_size, fmt, overwrite, verbose)
   1662             # avoid missing event_ids in splits
   1663             this_epochs.event_id = self.event_id
-> 1664             _save_split(this_epochs, fname, part_idx, n_parts, fmt)
   1665 
   1666     def equalize_event_counts(self, event_ids, method='mintime'):

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in _save_split(epochs, fname, part_idx, n_parts, fmt)
     92 
     93     with start_file(fname) as fid:
---> 94         _save_part(fid, epochs, fmt, n_parts, next_fname, next_idx)
     95 
     96 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\epochs.py in _save_part(fid, epochs, fmt, n_parts, next_fname, next_idx)
    190     end_block(fid, FIFF.FIFFB_PROCESSED_DATA)
    191     end_block(fid, FIFF.FIFFB_MEAS)
--> 192     end_file(fid)
    193 
    194 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\io\write.py in end_file(fid)
    347     """Write the closing tags to a fif file and closes the file."""
    348     write_nop(fid, last=True)
--> 349     check_fiff_length(fid)
    350     fid.close()
    351 

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\mne\io\write.py in check_fiff_length(fid, close)
    340         if close:
    341             fid.close()
--> 342         raise IOError('FIFF file exceeded 2GB limit, please split file or '
    343                       'save to a different format')
    344 

OSError: FIFF file exceeded 2GB limit, please split file or save to a different format

By the time I save, there are no negative events any more. That's what I meant before: I don't think that can be the problem causing the failure. I hope something in the traceback can point those who are much more familiar with the codebase in the direction of a fix.

@larsoner
Member Author

larsoner commented Oct 6, 2020

There must be some bug with our estimation of the size we write out. I'll see if I can replicate locally. In the meantime, doing split_size='1.9GB' in epochs.save or so is probably a suitable workaround.
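
That workaround in full would look something like this (file name reused from earlier in the thread):

epochs_concat.save('epochs_balanced-epo.fif', split_size='1.9GB', overwrite=True)  # split below the 2 GB FIFF limit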

I again extract out equal amounts of the 10 classes and then concatenate the results.

As a side note -- why not just use equalize_event_counts for this?
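
For reference, a sketch of that alternative (class names as used earlier in the thread); equalize_event_counts trims the conditions in place so they all end up with the same number of epochs:

epochs_eq = epochs_concat.copy()
epochs_eq.equalize_event_counts(['class%d' % ii for ii in range(1, 11)])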

@larsoner
Member Author

larsoner commented Oct 6, 2020

I added tests with metadata and even splitting and concatenating epochs and can't replicate unfortunately. If you want to upload your huge file somewhere with a script I can give it a try, or if you can live with doing split_size='1.9GB' (and it works) then we can just live with the imperfect code for now :(

@Alxmrphi

Alxmrphi commented Oct 9, 2020

There must be some bug with our estimation of the size we write out. I'll see if I can replicate locally. In the meantime, doing split_size='1.9GB' in epochs.save or so is probably a suitable workaround.

"1.9GB" did not work, but "1GB" did (even though the files that were saved were all 1.4 GB. As long as I can get around it this way, I am happy. I didn't know there was a split_size parameter, so I'm really glad you mentioned that.

I again extract out equal amounts of the 10 classes and then concatenate the results.

As a side note -- why not just use equalize_event_counts for this?

I didn't know about this method either. I have had really big problems with overfitting, though, and am currently trying to have a really random distribution of samples over different recording sessions. It seems that method tries to keep events as close together in time as possible, which I can certainly see has its place in most studies. However, I do feel better if I am in control of the exact preprocessing pipeline. I will look into it for future analyses, though.

Thanks for all the help :)

@larsoner
Member Author

larsoner commented Oct 9, 2020

but "1GB" did (even though the files that were saved were all 1.4 GB.

Good to know -- we clearly have a big problem with some aspect of the writing-size estimation. It would be great if you could share the file with me for testing. But it's huge so let's see if we can make it more reasonable first!

So if your files are roughly 10 GB and we want them to be 100 times smaller (~100MB), how about you do:

  1. Save 1/100th of them to a new split file:
    out_epochs = epochs[::100]
    out_epochs.save('smalller-epo.fif', split_size='10M')
    
  2. Check that this results in N files, and that they (at least the first N-1) are bigger than 10 MB? Hopefully 14MB-ish? This would mimic the 1GB->1.4GB behavior.
  3. Upload these files somewhere (hopefully they are ~100 MB in total)?

Also feel free to try with higher decimations (::200, ::1000) to get fewer ~10-14 MB files if you want to reduce upload size, but 100 MB-ish is already fine for me.

If this doesn't work, I can also live with the ~10 GB transfer if you have somewhere you can put it up for even a short period of time.

@larsoner
Member Author

@alex159 do the ideas above about how to share make sense?

@larsoner
Member Author

Ping @alex159 , any chance to get a version of the problematic file?

@agramfort in the meantime I think this can be merged, it already has some useful fixes.

@Alxmrphi

@larsoner Apologies, I've been out of the country for the past 8 days and just got back. I will attempt what you proposed this afternoon. Looks like a good way to figure out the problem. I will update with the link once it's up.

@Alxmrphi

Alxmrphi commented Oct 23, 2020

Here are the steps I followed and the resulting data:

I loaded the same data I was working with from before. I then extracted my 10 classes, extracted the same amount from all of them (9640) and then concatenated them together. This resulted in a new Epochs data structure with 96400 events, of size 9.24 GB in memory.

<Epochs |  96400 events (all good), -0.1 - 0.7 sec, baseline [-0.1, -0.1], ~9.24 GB, data loaded, with metadata,
 'class1': 9640
 'class2': 9640
 'class3': 9640
 'class4': 9640
 'class5': 9640
 'class6': 9640
 'class7': 9640
 'class8': 9640
 'class9': 9640
 'class10': 9640>

I then attempted to extract every 100th event using the suggested code above:

out_epochs = epochs[::100]

This resulted in an Epochs data structure with 964 events, approximately 94.8 MB, with each of the 10 classes having 96 or 97 samples.

<Epochs |  964 events (all good), -0.1 - 0.7 sec, baseline [-0.1, -0.1], ~94.8 MB, data loaded, with metadata,
 'class1': 97
 'class2': 96
 'class3': 96
 'class4': 97
 'class5': 96
 'class6': 97
 'class7': 96
 'class8': 96
 'class9': 97
 'class10': 96>

That's fine so far. However, if I try to then save this and select the split size as "10MB" using the following code:
out_epochs.save('smalller-epo.fif', split_size='10MB')

I then get over 25 files, each 591.6 MB in size. I almost ran out of memory doing this, so I interrupted the process because, obviously, this is now larger than the original data I loaded. I don't know how much longer it would have gone on making more of those files.

There is something definitely wrong here. I will try to play around a bit more with it, but this last result has very much confused me as I was not expecting this result. Is there anything else you would like me to try?

Edit:

I can also confirm:

print(out_epochs.get_data().shape)
print(out_epochs.events.shape)
>> (964, 64, 201)
>> (964, 3)

I think I will have to load some of these 500 MB files and see what is inside them, because it's not just the data that was extracted, and it's more than the original 9.2 GB that I loaded in the first place.
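
For scale, a rough estimate of how big this subset should be on disk (float32 is the default save format; FIF overhead ignored):

n_epochs, n_channels, n_times = 964, 64, 201   # from the shapes printed above
n_bytes = n_epochs * n_channels * n_times * 4  # float32 on disk
print(round(n_bytes / 1e6), 'MB')              # ~50 MB, nowhere near 25 x 591.6 MB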

@larsoner
Member Author

I then get over 25 files, each 591.6 MB in size.

If you copy just the first two files to a different directory, can you epochs_read = mne.read_epochs(..., preload=True) and have it work (probably with a warning about some split files missing)? If so, if you then epochs_read.save('smaller-er-epochs.fif', split_size='10MB') does it still write out > 10 MB files?

If so, then if you can at least upload those first two files (~1GB) somewhere, I'm happy to download them and I should be able to replicate your issue.

If you want to try making things smaller, or if just having the two files will not read_epochs properly, you might be able to do epochs[:2].save(..., split_size='10MB') or :10 or :100 or something...

@Alxmrphi

Alxmrphi commented Oct 23, 2020

I copied 3 of the files over to a new folder but unfortunately mne.read_epochs throws a FileNotFoundError when it can't fetch the rest of the files.

FileNotFoundError: [Errno 2] No such file or directory: 'test\\smaller-epo-3.fif'

The files smaller-epo.fif, smaller-epo-1.fif & smaller-epo-2.fif were in the directory and I called mne.read_epochs('smaller-epo.fif', preload=True). Then I followed your second idea, to only use the first two samples, i.e.

out_epochs = epochs[:2]
out_epochs.save('tiny-epo.fif', split_size='10MB')

And the first line produces this:

<Epochs |  2 events (all good), -0.1 - 0.7 sec, baseline [-0.1, -0.1], ~380 kB, data loaded, with metadata,
 'class1': 2>

When calling the save method on the object, it starts making those 590 MB files again. I stopped the process when it looked like it was going to repeat the behaviour from before. I added some print statements to visualise a few of the variable values computed in the save method in epochs.py, and in the extreme case of only two events, calling save produces the following:

len(self) = 2
total_size = 606305574

epoch_idxs = [array([0]), array([1]), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32), array([], dtype=int32)]

The total_size is updated continually in the save method; the value above is the final one it ends up as (and it is not the same as when trying to save all the data, where it becomes total_size = 5603078476).
So it looks like it wants to split the two events into this many different files, each about 590 MB.
What that data is exactly, I'm not sure.

@larsoner
Member Author

Okay can you upload those first two files anyway? I'll try to hack the MNE code not to look for the third, and see if I can replicate the issue.

@Alxmrphi

Alxmrphi commented Oct 23, 2020

Okay can you upload those first two files anyway? I'll try to hack the MNE code not to look for the third, and see if I can replicate the issue.

Sure thing. Here is a link.

It's smalller-epo.fif, smalller-epo-1.fif and smalller-epo-2.fif. The three L's confused me a bit, too, but it came from me copying over your instructions earlier :)

I hope it's a bit easier to see what might be going on now that these files are downloadable.

@larsoner
Member Author

Okay @alex159 it looks like:

  1. epochs.drop_log gets written to every file
  2. Your drop log had 100k+ entries (maybe 1m+), the vast majority of which were IGNORE for each write (the epochs that are not going into a specific file get set to IGNORE for that file)

This PR fixes this by:

  1. Properly accounting for the sizes that get written to every split file (drop log, info, some FIF tag overhead) when computing split sizes -- this should ensure that you actually get files of split_size or smaller.
  2. Warning if the drop_log overhead is excessive.
  3. Providing a reset_drop_log_selection method that kills the drop_log and selection attributes.

Can you try your big concatenation / splitting / saving code with a reset_drop_log_selection put in before the save call? It should make the file sizes much more reasonable.
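
A sketch of that, using the method added in this PR (file name and split size reused from earlier in the thread):

epochs_concat.reset_drop_log_selection()  # discard the huge drop_log / selection bookkeeping
epochs_concat.save('epochs_balanced-epo.fif', split_size='2GB', overwrite=True)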

@Alxmrphi

Hi @larsoner,

Thanks so much for this fix. I re-ran the same portion of the code that earlier gave rise to all of these problems, and it all worked fine. I actually forgot to add a manual call to reset_drop_log_selection before calling the save method, and the file splitting still worked, giving four ~2 GB files with the following output:

Splitting into 4 parts

<ipython-input-5-af8b8b0fda9d>:1: RuntimeWarning: epochs.drop_log contains 46536100 entries which will incur up to a 576.9 MB writing overhead (per split file), consider using epochs.reset_drop_log_selection() prior to writing
  epochs_concat_reduced.save('test_10class_balanced-epo.fif', split_size='2GB')

I then added the call to reset_drop_log_selection and attempted to save the data in the same way again, and it worked. The saving process was much quicker than I am accustomed to this time around. The only thing now is that the size reported when printing the Epochs is ~9.24 GB, but only 3 files of 1.6 GB are saved, so those files don't seem to be the right size. Perhaps there is some compression going on somewhere. I will try to load the files again, call the __repr__ method on the Epochs and see what it says.
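
That round-trip check is roughly (a sketch; file name taken from the warning output above):

import mne

epochs_check = mne.read_epochs('test_10class_balanced-epo.fif', preload=True)
print(epochs_check)  # should report the same 96400 events / ~9.24 GB as before saving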

I can confirm it matches. So it does save all the data. I have to say, this fix at least on my data has dramatically increased the speed of loading and saving. I used to have time to go and make a cup of coffee between calling the load / save function and those operations actually completing. Now it happens in a fraction of the time.

Seriously, a huge thank you is in order! This is a very much appreciated fix.

@larsoner
Member Author

larsoner commented Oct 28, 2020

getting four ~ 2GB files with the following output:
...which will incur up to a 576.9 MB writing overhead...

So ~8 GB gets written to disk when the drop log and selection are there, but 4 x 576 MB = ~2.3 GB of this is drop log. So really it sounds like you have something like ~6 GB of data that should be written (?).

The only thing now is that the size reported in the print call to the Epochs ~ 9.24 GB but only 3 files of 1.6 GB are now saved. Those (now 3) files don't seem to be the right size.

So that's ~4.8 GB on disk. Keep in mind that the default format for save is single, i.e., 32-bit float gets written to disk. When you load into memory the data are upconverted to 64-bit float, so this will double to ~9.6 GB. So they're not off by so much.
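
A quick back-of-the-envelope check with the numbers from this thread supports that:

n_epochs, n_channels, n_times = 96400, 64, 201
n_values = n_epochs * n_channels * n_times
print(n_values * 4 / 1024 ** 3)  # float32 on disk: ~4.6 GiB, i.e. roughly the 3 x 1.6 GB files
print(n_values * 8 / 1024 ** 3)  # float64 in memory: ~9.2 GiB, matching the ~9.24 GB repr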

This ~4.8 GB of data is lower than the expected ~6 GB of data above, but I'm probably forgetting to factor something in here. But as long as the I/O round-trips work I think we're good!

I have to say, this fix at least on my data has dramatically increased the speed of loading and saving. I used to have time to go and make a cup of coffee between calling the load / save function and those operations actually completing. Now it happens in a fraction of the time.

Please do let us know if there are other nasty bottlenecks like this in your code. Things shouldn't take this long :) (Unless of course you are doing some crazy high-dimensional statistical or machine learning task.)

@Alxmrphi

When I originally said ~2 GB, it was actually 1.85 GB, so making that correction it all seems to fall into place. I didn't know about the upconversion to float64 doubling the size of the data; that was the key piece of information that was missing. So, overall, the expectation is more like ~5 GB to be written, and we see 4.8 GB, so I think that's close enough, with some slight imprecision somewhere.

Please do let us know if there are other nasty bottlenecks like this in your code. Things shouldn't take this long :) (Unless of course you are doing some crazy high-dimensional statistical or machine learning task.)

I will definitely do that. I remember that MNE used to be really fast at loading / saving then about a year ago it became frustratingly slow, which is round about the time this data first got put all together and I haven't worked with other neural data in MNE since then. I thought it was a fault with the computer. Knowing what I know now, I can imagine all sorts of reasons that might cause bottlenecks and I'll be happier to raise an issue if I think something is wrong.

I'm sure this was just a patch waiting to happen at some point anyway, and it's good that people in the future won't even bump into this error now that there's a solution. Lucky them :)

@larsoner larsoner changed the title MRG, BUG: Fix errant writing of too large int MRG, BUG: Fix errant writing of too large int (part 2/2) Oct 28, 2020
@larsoner larsoner closed this Oct 28, 2020
@larsoner
Copy link
Member Author

Splitting into two PRs

@larsoner larsoner deleted the epochs branch October 29, 2020 14:22