While watching DAMNIT run on p4507, I noticed that `run.split_trains()` was pathologically slow when splitting a long run into many small pieces. Splitting 10k trains into 5k chunks took 8 seconds for a single source, and the 16 LPD modules in a run of 27k trains (45 minutes) took over 7 minutes to split into 2-train chunks. I took a screenshot to prove to myself I'm not imagining it.

Profiling revealed the culprit was finding the files that belong to each chunk. For each chunk we were checking for an overlap with each file relevant to that source. For a given chunk size, a longer run means both more files and more chunks, so it's $\mathcal{O}(N^2)$ in the number of trains.
Stack dump from py-spy
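For a sense of where the quadratic behaviour comes from, here's a minimal sketch of the old pattern. This is not the actual extra_data code; the file and chunk objects are hypothetical stand-ins with a `train_ids` attribute.

```python
def files_for_chunk_slow(files, chunk_tids):
    """Return the files whose train IDs overlap the chunk (full scan per chunk)."""
    chunk = set(chunk_tids)
    return [f for f in files if chunk.intersection(f.train_ids)]

# Called once per chunk, so with T trains the number of chunks and the
# number of files both grow with T, giving O(T^2) work overall:
# selected = [files_for_chunk_slow(source_files, tids) for tids in chunks]
```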
This change creates a temporary index of which file each train belongs to, and uses that to select the relevant files, which is $\mathcal{O}(N)$. This now takes about 16 ms for the single-source, 10k-train example, and about 4 seconds for the 27k trains of LPD data. That's still slower than I might like, but it's much better than it was.
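As a rough sketch of the idea, using the `tid_to_ix` and `tids_files` names mentioned below; everything else here (the file objects, `run_train_ids`, function names) is a stand-in, not the real extra_data internals:

```python
def build_index(run_train_ids, source_files):
    """Map each train ID to its position in the run, and each position to its file."""
    tid_to_ix = {tid: ix for ix, tid in enumerate(run_train_ids)}
    tids_files = [None] * len(run_train_ids)
    for f in source_files:
        for tid in f.train_ids:
            ix = tid_to_ix.get(tid)
            if ix is not None:
                tids_files[ix] = f
    return tid_to_ix, tids_files

def files_for_chunk(chunk_tids, tid_to_ix, tids_files):
    """Select a chunk's files via the index: O(len(chunk)) rather than a scan over all files."""
    found = set()
    for tid in chunk_tids:
        ix = tid_to_ix.get(tid)
        if ix is not None and tids_files[ix] is not None:
            found.add(tids_files[ix])
    return list(found)
```

Building the index is done once per source, and each chunk then only touches its own trains, so the total work is linear in the number of trains.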
I played around a bit with making the index using NumPy arrays and set operations (e.g. `intersect1d`), but at least for the cases I was testing, the pure-Python approach was somewhat faster, and I find it more readable.

The index obviously uses some extra memory. `sys.getsizeof` tells me that for 27k trains, the `tid_to_ix` dict is ~1.3 MiB per source, and `tids_files` about 200 KiB. This seems like an acceptable trade-off.
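If you want to sanity-check the order of magnitude, `sys.getsizeof` on a dict of the same shape gives similar numbers; the train IDs below are made up, only the size matters.

```python
import sys

n_trains = 27_000
# Fake train IDs, just to build a dict of the right size.
tid_to_ix = {1_000_000_000 + i: i for i in range(n_trains)}

# getsizeof counts the dict's own hash table, not the int keys/values.
print(f"tid_to_ix container: {sys.getsizeof(tid_to_ix) / 1024 ** 2:.2f} MiB")
```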