
Selecting channels in a large dataset with many trials takes ages #454

Closed
dfsp-spirit opened this issue Mar 24, 2023 · 3 comments
Assignees: dfsp-spirit
Labels: Bug (An error that is serious but does not break (parts of) the package. However, it clearly impedes the …), Performance (Improve the number crunching)

Comments

@dfsp-spirit (Collaborator) commented Mar 24, 2023

Describe the bug
Selecting channels in a large dataset with many trials takes ages. This was discovered by @kajal5888.

To Reproduce
Steps to reproduce the behavior:

import syncopy as spy
import numpy as np
dt = spy.load("~/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy")   # file from DJ, available on ESI cs
dt2 = dt.selectdata(channel=np.arange(90, 200))  # Takes several hours, has not finished yet

DJ had this running for more than 10 hours without it finishing.

info on dt:

dt
Syncopy AnalogData object with fields

                cfg : dictionary with keys ''
            channel : [220] element <class 'numpy.ndarray'>
          container : DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy
               data : 182603 trials of length 50.0 defined on [9312753 x 220] float32 Dataset of size 7.63 GB
             dimord : time by channel
           filename : /home/schaefert/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.spy/DataCombined_StimulusCPAll_vertical_Epoched_WindowLength50msSlidWinNr1.analog
               info : dictionary with keys ''
               mode : r
         sampleinfo : [182603 x 2] element <class 'numpy.ndarray'>
         samplerate : 1017.2526245117188
          selection : None
                tag : None
               time : 182603 element list
          trialinfo : [182603 x 0] element <class 'numpy.ndarray'>
     trialintervals : [182603 x 2] element <class 'numpy.ndarray'>
             trials : 182603 element iterable

Use `.log` to see object history

Note the number of trials.

Expected behavior
It should perform the selection in a reasonable amount of time (seconds, not hours).

@dfsp-spirit dfsp-spirit self-assigned this Mar 24, 2023
@dfsp-spirit dfsp-spirit added the Bug and Performance labels Mar 24, 2023
@dfsp-spirit dfsp-spirit mentioned this issue Mar 24, 2023
@dfsp-spirit (Collaborator, Author) commented Mar 24, 2023

The issue seems to be that the BaseData.selection setter calls data._get_time, which in turn calls best_match (in shared/tools.py) once per trial. The best_match function, however, takes about 0.5 seconds on this dataset, resulting in a runtime of roughly 25 hours for the ~180,000 trials.
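The scaling problem can be illustrated with a toy stand-in for best_match (the real implementation lives in Syncopy's shared/tools.py; the function below is a hypothetical simplification): invoking a closest-match search once per trial multiplies its per-call cost by the trial count.

```python
import numpy as np

# Hypothetical, simplified stand-in for best_match: for each target value,
# return the index of the closest entry in the time axis.
def best_match_toy(values, targets):
    return np.abs(values[:, None] - np.asarray(targets)[None, :]).argmin(axis=0)

n_trials = 1_000                   # scaled down from the ~182,603 trials reported
samplerate = 1017.2526245117188    # samplerate from the dataset dump above
time_axis = np.arange(n_trials * 50) / samplerate  # 50 samples per trial

# The per-trial call pattern: one best_match invocation per trial. At
# ~0.5 s per call on the real dataset, 182,603 calls come to ~25 hours.
onsets = [best_match_toy(time_axis, [t * 50 / samplerate]) for t in range(n_trials)]
print(len(onsets))  # one lookup result per trial
```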

One option is to improve the performance of best_match itself, but that function genuinely has work to do. However, for a channel-only selection (as in this case), we should be able to avoid touching the trials entirely, shouldn't we? We do, of course, still parallelize and copy trial-wise.
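The channel-only observation can be sketched with toy NumPy data (not Syncopy's actual selection code): trials only partition the time (row) axis of the [samples x channels] dataset, so a channel selection is a single slice along the column axis, independent of how many trials exist.

```python
import numpy as np

# Toy data with the same layout as the AnalogData dataset: [samples x channels].
data = np.arange(40, dtype=np.float32).reshape(10, 4)  # 10 samples, 4 channels

# Toy trial definition: trials partition the time (row) axis only.
trialdefinition = np.array([[0, 5], [5, 10]])          # two trials of 5 samples

# A channel-only selection is one column slice over the whole dataset;
# no per-trial work is required, regardless of the number of trials.
chan_sel = data[:, 1:3]
print(chan_sel.shape)  # all samples, two channels
```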

For now, as a workaround, we strongly recommend selecting channels before applying a trial definition that results in hundreds of thousands of trials.

@dfsp-spirit dfsp-spirit changed the title Selecting channels in a large dataset takes ages Selecting channels in a large dataset with many trials takes ages Mar 24, 2023
@dfsp-spirit (Collaborator, Author) commented

This is being investigated in 454_chan_sel.

@tensionhead (Contributor) commented

With #455 we got a performance increase of around 2 to 3 orders of magnitude 🚀
