Faster split_trains for long runs #459

Merged: 6 commits into master, Nov 1, 2023

Conversation


@takluyver commented Oct 26, 2023

While watching DAMNIT run on p4507, I noticed that run.split_trains() was pathologically slow when splitting a long run into many small pieces. Splitting 10k trains into 5k chunks took 8 seconds for a single source, and the 16 LPD modules in a run of 27k trains (45 minutes) took over 7 minutes to split into 2-train chunks. I took a screenshot to prove to myself I'm not imagining it:

Screenshot from 2023-10-26 17-18-15

Profiling revealed the culprit was finding the files that belong to each chunk. For each chunk we were checking for an overlap with each file relevant to that source. For a given chunk size, a longer run means both more files and more chunks, so it's $\mathcal{O}(N^2)$ in the number of trains.

Stack dump from py-spy
Thread 67309 (active+gil): "MainThread"
    _unique1d (numpy/lib/arraysetops.py:352)
    unique (numpy/lib/arraysetops.py:274)
    unique (<__array_function__ internals>:200)
    intersect1d (numpy/lib/arraysetops.py:445)
    intersect1d (<__array_function__ internals>:200)
    has_train_ids (extra_data/file_access.py:270)
    <listcomp> (extra_data/sourcedata.py:273)
    _only_tids (extra_data/sourcedata.py:271)
    <dictcomp> (extra_data/reader.py:1094)
    select_trains (extra_data/reader.py:1093)
    split_trains (extra_data/reader.py:1127)
    <listcomp> (analysis_helpers.py:105)
    integrate_run (analysis_helpers.py:105)
    lpd_ai (<string>:126)
    get_default_inputs (damnit_ctx.py:246)
    create (ctxrunner.py:337)
    main (ctxrunner.py:579)
    <module> (ctxrunner.py:603)
    _run_code (runpy.py:87)
    _run_module_as_main (runpy.py:197)
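
To make the quadratic pattern concrete, the old selection was roughly equivalent to the sketch below. This is a simplified illustration inferred from the stack trace above, not the actual extra_data code; `files` and `chunk_tids` are stand-in names.

```python
import numpy as np

# Rough sketch of the old behaviour: every chunk re-scans every file for
# the source, and each check does a set intersection. For a fixed chunk
# size, a longer run means more chunks *and* more files, hence O(N^2).
def files_for_chunk(files, chunk_tids):
    chunk_tids = np.asarray(chunk_tids)
    return [
        f for f in files
        if np.intersect1d(f.train_ids, chunk_tids).size > 0
    ]

def split_files(files, chunks):
    # `chunks` is one array of train IDs per output piece
    return [files_for_chunk(files, tids) for tids in chunks]
```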

This change creates a temporary index of which file each train belongs to, and uses that to select the relevant files, which is $\mathcal{O}(N)$. This now takes about 16 ms for the single-source 10k-train example, and about 4 seconds for the 27k trains of LPD data. That's still slower than I might like, but it's much better than it was.
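
A minimal sketch of the idea, assuming a per-source list of file objects with a `train_ids` attribute (the names `tid_to_ix` and `tids_files` echo the description above, but the exact layout here is illustrative rather than the real implementation):

```python
# Build the train-ID -> file index once, in O(N) over the trains.
def build_train_index(files):
    tids_files = []   # file object holding each indexed train
    tid_to_ix = {}    # train ID -> position in tids_files
    for f in files:
        for tid in f.train_ids:
            if tid not in tid_to_ix:
                tid_to_ix[tid] = len(tids_files)
                tids_files.append(f)
    return tid_to_ix, tids_files

# Each chunk then only looks up its own train IDs, instead of
# intersecting with every file's train IDs.
def files_for_chunk(tid_to_ix, tids_files, chunk_tids):
    selected = []
    for tid in chunk_tids:
        ix = tid_to_ix.get(tid)
        if ix is not None and tids_files[ix] not in selected:
            selected.append(tids_files[ix])
    return selected
```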

I played around a bit with making the index using NumPy arrays and set operations (e.g. intersect1d), but at least for the cases I was testing, the pure-Python approach was somewhat faster, and I find it more readable.

The index obviously uses some extra memory. sys.getsizeof tells me that for 27k trains, the tid_to_ix dict is ~1.3 MiB per source, and tids_files about 200 KiB. This seems like an acceptable trade-off.
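
For reference, that kind of estimate can be reproduced along these lines (sys.getsizeof reports the size of the dict container itself, not the integers it references; the train-ID values here are made up):

```python
import sys

# Hypothetical run of 27k consecutive train IDs; the real index is built
# from the run's actual selected trains.
tid_to_ix = {tid: ix for ix, tid in enumerate(range(1_000_000_000, 1_000_027_000))}
print(f"tid_to_ix: {sys.getsizeof(tid_to_ix) / 2**20:.2f} MiB")
```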

@JamesWrigley

That's really cool 🎉 I ran into this last week with MID too.

Review comment on extra_data/keydata.py (outdated, resolved)

@JamesWrigley left a comment


LGTM! Looking forward to this one 😀

@takluyver added this to the 1.15 milestone Nov 1, 2023
@takluyver

prnote: Fix split_trains being very slow when splitting a long run into many pieces.

@takluyver merged commit e1f00da into master Nov 1, 2023
9 checks passed
@takluyver deleted the split-faster branch November 1, 2023 13:36