
Selecting channels in a large dataset with many trials takes ages #454

Closed
dfsp-spirit opened this issue Mar 24, 2023 · 3 comments
Assignees: dfsp-spirit
Labels: Bug (An error that is serious but does not break (parts of) the package. However, it clearly impedes the …), Performance (Improve the number crunching)

Comments

@dfsp-spirit (Collaborator) commented Mar 24, 2023

Describe the bug
Selecting channels in a large dataset with many trials takes ages. This was discovered by @kajal5888.

To Reproduce
Steps to reproduce the behavior:

import syncopy as spy
import numpy as np
dt = spy.load("~/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy")   # file from DJ, available on ESI cs
dt2 = dt.selectdata(channel=np.arange(90, 200))  # Takes several hours, has not finished yet

DJ had this running for more than 10 hours without it finishing.

info on dt:

dt
Syncopy AnalogData object with fields

                cfg : dictionary with keys ''
            channel : [220] element <class 'numpy.ndarray'>
          container : DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy
               data : 182603 trials of length 50.0 defined on [9312753 x 220] float32 Dataset of size 7.63 GB
             dimord : time by channel
           filename : /home/schaefert/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.analog
               info : dictionary with keys ''
               mode : r
         sampleinfo : [182603 x 2] element <class 'numpy.ndarray'>
         samplerate : 1017.2526245117188
          selection : None
                tag : None
               time : 182603 element list
          trialinfo : [182603 x 0] element <class 'numpy.ndarray'>
     trialintervals : [182603 x 2] element <class 'numpy.ndarray'>
             trials : 182603 element iterable

Use `.log` to see object history

Note the number of trials.

Expected behavior
It should perform the selection in a reasonable amount of time (seconds, not hours).

@dfsp-spirit dfsp-spirit self-assigned this Mar 24, 2023
@dfsp-spirit dfsp-spirit added the Bug and Performance labels Mar 24, 2023
@dfsp-spirit dfsp-spirit mentioned this issue Mar 24, 2023
@dfsp-spirit (Collaborator, Author) commented Mar 24, 2023

The issue seems to be that the BaseData.selection setter calls data._get_time, which in turn calls best_match (in shared/tools.py) once per trial. The best_match function, however, takes about 0.5 seconds on this dataset, resulting in a runtime of roughly 25 hours for the ~180,000 trials.
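The scaling problem can be illustrated with a toy stand-in for best_match (the real implementation lives in Syncopy's shared/tools.py; the function below is a hypothetical simplification): invoking a closest-match search once per trial multiplies its per-call cost by the trial count.

```python
import numpy as np

# Hypothetical, simplified stand-in for best_match: for each target value,
# return the index of the closest entry in the time axis.
def best_match_toy(values, targets):
    return np.abs(values[:, None] - np.asarray(targets)[None, :]).argmin(axis=0)

n_trials = 1_000                   # scaled down from the ~182,603 trials reported
samplerate = 1017.2526245117188    # samplerate from the dataset dump above
time_axis = np.arange(n_trials * 50) / samplerate  # 50 samples per trial

# The per-trial call pattern: one best_match invocation per trial. At
# ~0.5 s per call on the real dataset, 182,603 calls come to ~25 hours.
onsets = [best_match_toy(time_axis, [t * 50 / samplerate]) for t in range(n_trials)]
print(len(onsets))  # one lookup result per trial
```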

One option is to improve the performance of best_match itself, but that function genuinely has work to do. However, for a channel-only selection (as in this case), we should be able to avoid touching the trials entirely, shouldn't we? We do, of course, still parallelize and copy trial-wise.
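The channel-only observation can be sketched with toy NumPy data (not Syncopy's actual selection code): trials only partition the time (row) axis of the [samples x channels] dataset, so a channel selection is a single slice along the column axis, independent of how many trials exist.

```python
import numpy as np

# Toy data with the same layout as the AnalogData dataset: [samples x channels].
data = np.arange(40, dtype=np.float32).reshape(10, 4)  # 10 samples, 4 channels

# Toy trial definition: trials partition the time (row) axis only.
trialdefinition = np.array([[0, 5], [5, 10]])          # two trials of 5 samples

# A channel-only selection is one column slice over the whole dataset;
# no per-trial work is required, regardless of the number of trials.
chan_sel = data[:, 1:3]
print(chan_sel.shape)  # all samples, two channels
```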

For now, as a workaround, we strongly recommend selecting channels before applying a trial definition that results in hundreds of thousands of trials.

@dfsp-spirit dfsp-spirit changed the title Selecting channels in a large dataset takes ages Selecting channels in a large dataset with many trials takes ages Mar 24, 2023
@dfsp-spirit (Collaborator, Author) commented

This is being investigated in 454_chan_sel.

@tensionhead (Contributor) commented

With #455 we got a performance increase of around 2 to 3 orders of magnitude 🚀
