
FIX: Replace very inefficient discrete _get_trial #403

Merged · 12 commits into dev · Jan 18, 2023

Conversation

KatharineShapcott
Contributor

On my data this gives a ×20 speedup.

Author Guidelines

  • Is the change set < 600 lines?
  • Was the code checked for memory leaks/performance bottlenecks?
  • Is the code running locally and on the ESI cluster?
  • Is the code running on all supported platforms?

Reviewer Checklist

  • Are testing routines present?
  • Do parallel loops have a set length and correct termination conditions?
  • Do objects in the global package namespace perform proper parsing of their input?
  • Do code-blocks provide novel functionality, i.e., no re-factoring using builtin/external packages possible?
  • Code layout
    • Is the code PEP8 compliant?
    • Does the code adhere to agreed-upon naming conventions?
    • Are keywords clearly named and easy to understand?
    • No commented-out code?
  • Are all docstrings complete and accurate?
  • Is the CHANGELOG.md up to date?

@KatharineShapcott
Contributor Author

My assumption is that most spike/event data will be sorted like ours is and therefore only part of the file needs to be loaded from disk, hence the dramatic speedup. But it doesn't require sorted data to work.
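The intuition can be sketched with a toy example (all names and numbers here are hypothetical, not the actual Syncopy code): on a sorted sample column, two binary searches replace a full-array scan, and the resulting contiguous slice lets an on-disk backend read only the rows belonging to the trial.

```python
import numpy as np

# Toy data: a sorted spike-sample column plus one channel column.
rng = np.random.default_rng(0)
samples = np.sort(rng.integers(0, 1000, size=500))
data = np.column_stack([samples, rng.integers(0, 4, size=500)])

trial_start, trial_stop = 200, 400  # hypothetical trial bounds, half-open

# Old approach: a boolean mask touches every row (a full read for on-disk data).
masked = data[(data[:, 0] >= trial_start) & (data[:, 0] < trial_stop)]

# Sorted approach: two binary searches yield a contiguous slice, so only
# the rows inside the trial need to be loaded.
lo, hi = np.searchsorted(data[:, 0], [trial_start, trial_stop])
sliced = data[lo:hi]

assert np.array_equal(masked, sliced)
```

The two results agree even when the data contain duplicate sample values, since `searchsorted` with the default `side='left'` picks the first index with `sample >= bound`.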

@KatharineShapcott KatharineShapcott marked this pull request as draft December 30, 2022 09:23
@codecov

codecov bot commented Dec 30, 2022

Codecov Report

Base: 68.22% // Head: 67.64% // Decreases project coverage by 0.58% ⚠️

Coverage data is based on head (ff23719) compared to base (5816ae2).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head ff23719 differs from pull request most recent head 6d6dce7. Consider uploading reports for the commit 6d6dce7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #403      +/-   ##
==========================================
- Coverage   68.22%   67.64%   -0.58%     
==========================================
  Files          80       80              
  Lines        9602     9600       -2     
  Branches     1993     1995       +2     
==========================================
- Hits         6551     6494      -57     
- Misses       2527     2583      +56     
+ Partials      524      523       -1     
Impacted Files Coverage Δ
syncopy/datatype/discrete_data.py 67.28% <100.00%> (-12.42%) ⬇️
syncopy/datatype/methods/definetrial.py 83.50% <100.00%> (-0.41%) ⬇️
syncopy/specest/freqanalysis.py 61.70% <0.00%> (-3.65%) ⬇️
syncopy/plotting/_helpers.py 85.91% <0.00%> (-2.82%) ⬇️
syncopy/nwanalysis/wilson_sf.py 92.75% <0.00%> (-1.45%) ⬇️
syncopy/plotting/mp_plotting.py 44.66% <0.00%> (-1.34%) ⬇️
syncopy/specest/compRoutines.py 89.41% <0.00%> (-0.69%) ⬇️
syncopy/datatype/base_data.py 76.82% <0.00%> (-0.30%) ⬇️
syncopy/datatype/util.py 71.64% <0.00%> (+1.49%) ⬆️



@KatharineShapcott
Contributor Author

While we're at it, I noticed that printing is very slow (although a lot faster after this speedup) since many different attributes are actually calling _get_trial under the hood.

sample
817 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time
401 ms ± 5.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
trials
384 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
trialtime
496 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
trialinfo
392 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What if I make a _get_trial specifically for the SpikeData class that I can assume is sorted. Then storing a searchsorted call in a private variable _trialslice would be very fast e.g.

def _get_trial(self, trialno):
    # Cache each trial's [start, stop) spike-row indices on first access;
    # searchsorted is valid because the sample column is assumed sorted.
    if not hasattr(self, '_trialslice'):
        bounds = np.searchsorted(self._data[:, self.dimord.index("sample")],
                                 self.sampleinfo.ravel())
        self._trialslice = bounds.reshape(self.sampleinfo.shape)
    idx = slice(self._trialslice[trialno, 0], self._trialslice[trialno, 1])
    return self._data[idx, :]

@tensionhead
Contributor

Puh, I am not sure I can follow you here... so when I do this for some synthetic data spd (syncopy.tests.synth_data.poisson_noise) I get:

In [19]: spd.sampleinfo
Out[19]: 
array([[      0.,  180609.],
       [ 200000.,  396256.],
       [ 400000.,  582885.],
       [ 600000.,  783863.],
       [ 800000.,  996461.],
       [1000000., 1184701.],
       [1200000., 1392257.],
       [1400000., 1592449.],
       [1600000., 1783830.],
       [1800000., 1999999.]])

and after your proposed operation:

In [17]: trialslice.reshape(spd.sampleinfo.shape)
Out[17]: 
array([[     0,  17986],
       [ 19928,  39816],
       [ 40191,  58440],
       [ 60106,  78473],
       [ 80111,  99810],
       [100169, 118774],
       [120309, 139543],
       [140340, 159565],
       [160378, 178619],
       [180300, 200000]])

and I am not sure what this is exactly; it looks like sub-intervals of the 1st trial?!

@KatharineShapcott
Contributor Author

Yes, definitely confusing compared to continuous data... trialslice is not in samples anymore but in units of number of spikes. So what this tells me is that your first trial contains 17,986 spikes, while your second trial contains 19,888 spikes, and the first spike that falls within that trial is at index 19,928, etc.

The problem with this solution is that if you change the trialdefinition or use selectdata you need to update trialslice at the same time. Are you around for a meeting about this at some point?
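That mapping can be illustrated with a toy example (hypothetical numbers, not the data above): sampleinfo holds trial bounds in sample units, and searchsorted converts them into spike-row indices, so each trialslice row is a [start, stop) pair counted in spikes.

```python
import numpy as np

# Sorted spike samples and trial boundaries in sample units (toy values).
samples = np.array([0, 5, 9, 12, 20, 21, 30])
sampleinfo = np.array([[0, 10], [10, 25], [25, 35]])

# Convert sample-unit bounds into row indices into the spike array.
trialslice = np.searchsorted(samples, sampleinfo.ravel()).reshape(sampleinfo.shape)

print(trialslice)
# [[0 3]
#  [3 6]
#  [6 7]]  -> trial 0 spans rows 0:3 (3 spikes), trial 1 rows 3:6, trial 2 rows 6:7
```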

@KatharineShapcott
Contributor Author

Okay, I think I get why this crashed now. With the previous version of .trials it was impossible to return an empty array; it would just be skipped over. So when you tried to select trials[0] here you would really get trials[1]. That's fine, but the problem seems to be that if you use selectdata to select that empty trial and there is no data at all, then everything gets set to None, including the trialdefinition.

spikes.trials[0]
Out[36]: array([], shape=(0, 3), dtype=int64)

In [37]: spikes.trialdefinition
Out[37]: 
array([[   0.,   10.,    0.],
       [  10.,  100.,    0.],
       [ 200.,  500.,    0.],
       [ 600.,  700.,    0.],
       [ 800., 1000.,    0.]])

In [38]: selected.trialdefinition
Out[38]: array(None, dtype=object)

Which causes the following error

selected.trials[0]

SyNCoPy encountered an error in 

/gs/home/shapcottk/.conda/envs/syncopyoe/lib/python3.8/site-packages/IPython/core/interactiveshell.py, line 3361 in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)

--------------------------------------------------------------------------------
Abbreviated traceback:

<ipython-input-40-391b5f5f858d>, line 1 in <cell line: 1>
        selected.trials[0]

Use `import traceback; import sys; traceback.print_tb(sys.last_traceback)` for full error traceback.

TypeError: 'NoneType' object is not subscriptable

@tensionhead
Contributor

Mmh, so in the tests we select empty trials, that's somewhat surprising! I would consider this an edge case.. There's loads of CI output so I can't really see what was happening though. Shall I look into this or you think you can get it to work?

@KatharineShapcott
Contributor Author

Yeah, it's kind of an edge case, but since the test data contain very few spikes it occurs quite often in the tests, I think.

I tried to avoid it by returning the empty arrays instead of None, let's see what happens now.
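A minimal sketch of the idea (hypothetical helper, not the actual Syncopy code): an empty trial comes back as a (0, ncols) array, which stays subscriptable where None would raise a TypeError.

```python
import numpy as np

def get_trial_rows(data, trialslice, trialno):
    """Return the rows of one trial; empty trials yield a (0, ncols) array."""
    start, stop = trialslice[trialno]
    return data[start:stop, :]

data = np.arange(12).reshape(4, 3)
trialslice = np.array([[0, 2], [2, 2], [2, 4]])  # middle trial holds no spikes

empty = get_trial_rows(data, trialslice, 1)
assert empty.shape == (0, 3)   # an empty array, not None
assert empty[0:1].size == 0    # still indexable, no TypeError
```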

@KatharineShapcott
Contributor Author

Hmm, one error seems to be due to this, I don't know why that's suddenly a problem?

FAILED test_discretedata.py::TestEventData::test_ed_dataselection - syncopy.shared.errors.SPYTypeError: Wrong type of value slice(-2, None, None) for key 'eventid': expected serializable data type, e.g. floats, lists, tuples, ... found slice

slice(-2, None) # negative-start slice

eventidSelections = [
            [0, 0, 1],  # preserve repetition, don't convert to slice
            range(0, 2),  # narrow range
            slice(-2, None)  # negative-start slice
        ]

@tensionhead
Contributor

Mmh, we just deprecated slice as a valid selection parameter with #407, as slices are not serializable and hence the cfg can't be saved to JSON. But you branched out before that got merged today, no?!
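To illustrate the serialization point (toy example; n_events is a hypothetical event count): a slice cannot be JSON-serialized, but the equivalent explicit index list can, which is what a saved cfg needs.

```python
import json

selection = slice(-2, None)  # the deprecated selection style

try:
    json.dumps({"eventid": selection})
except TypeError:
    # Materialize the slice into an explicit index list for a known length.
    n_events = 5  # hypothetical number of events
    selection = list(range(n_events))[selection]

assert selection == [3, 4]
assert json.dumps({"eventid": selection}) == '{"eventid": [3, 4]}'
```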

@KatharineShapcott
Contributor Author

Yes but something is weird, the line numbers in the tests don't match the ones in my branch :/

@tensionhead
Contributor

ok, I will try to merge dev into your branch 🤞

@tensionhead
Contributor

tensionhead commented Jan 5, 2023

So dev itself got repaired; I can't really tell atm what is happening here, tbh. Maybe @dfsp-spirit has an idea?!

EDIT: or we take the most straightforward answer: the changes actually broke something, as now locally I get:

	T1.test_general()
/xxx/syncopy/syncopy/tests/test_selectdata.py, line 450 in test_general
	assert np.array_equal(selected.trials[tk],

particularly line 443 looks like a good candidate: propArr = np.unique(selected.data[:, propIdx]).astype(np.intp)

Changes to be committed:
	modified:   syncopy/datatype/discrete_data.py
@KatharineShapcott
Contributor Author

I don't think we need to revert removing the unique from sample; the tests passed with that change before:

[screenshot: passing CI test results]

@KatharineShapcott
Contributor Author

KatharineShapcott commented Jan 17, 2023

@tensionhead The one that's failing is a test_ed_dataselection within the test_discretedata. Can we simply remove that test?

assert np.array_equal(obj.trials[trialno][selector.time[tk], :],

@KatharineShapcott
Contributor Author

KatharineShapcott commented Jan 17, 2023

I'm pretty sure it's failing because of the relative indexing thing. trialno is 3 and tk is 0 when it fails

@tensionhead
Contributor

tensionhead commented Jan 17, 2023

Phew.. right, we also have selection tests within the individual data class tests 😅 Yup, I think it is safe to remove; this whole EventData probably needs an overhaul anyway.. the event id indexing is so weird.

@KatharineShapcott KatharineShapcott added the Performance Improve the number crunching label Jan 18, 2023
@KatharineShapcott KatharineShapcott marked this pull request as ready for review January 18, 2023 07:05
@tensionhead left a comment


Looks good, nice small surgical changes! I just have 1 minor question, see below.

EDIT: Looks like the original failing test just evaporated with the re-write of the test_selectdata.py 🙂

Review comments (resolved) on:
  • syncopy/datatype/discrete_data.py
  • syncopy/datatype/methods/definetrial.py
@KatharineShapcott
Contributor Author

@tensionhead Should I merge in the other speedups?

@tensionhead
Contributor

Yeah, I was wondering myself, but if the next CI run passes I think we are good.

@tensionhead left a comment


Great, together with #415 we hopefully made significant performance gains for DiscreteData.

@tensionhead tensionhead merged commit ce5707c into dev Jan 18, 2023
@tensionhead tensionhead deleted the discrete-speedup branch March 31, 2023 10:56
Labels
Performance Improve the number crunching