While watching DAMNIT run on p4507, I noticed that `run.split_trains()` was pathologically slow when splitting a long run into many small pieces. Splitting 10k trains into 5k chunks took 8 seconds for a single source, and the 16 LPD modules in a run of 27k trains (45 minutes) took over 7 minutes to split into 2-train chunks. I took a screenshot to prove to myself I'm not imagining it.

Profiling revealed the culprit was finding the files that belong to each chunk. For each chunk we were checking for an overlap with each file relevant to that source. For a given chunk size, a longer run means both more files and more chunks, so it's $\mathcal{O}(N^2)$ in the number of trains.
Stack dump from py-spy
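For a sense of where the quadratic behaviour comes from, here's a minimal sketch of the old pattern. This is not the actual extra_data code; the file and chunk objects are hypothetical stand-ins with a `train_ids` attribute.

```python
def files_for_chunk_slow(files, chunk_tids):
    """Return the files whose train IDs overlap the chunk (full scan per chunk)."""
    chunk = set(chunk_tids)
    return [f for f in files if chunk.intersection(f.train_ids)]

# Called once per chunk, so with T trains the number of chunks and the
# number of files both grow with T, giving O(T^2) work overall:
# selected = [files_for_chunk_slow(source_files, tids) for tids in chunks]
```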
This change creates a temporary index of which file each train belongs to, and uses that to select the relevant files, which is $\mathcal{O}(N)$. This now takes about 16 ms for the single-source, 10k-train example, and about 4 seconds for the 27k trains of LPD data. That's still slower than I might like, but it's much better than it was.
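As a rough sketch of the idea, using the `tid_to_ix` and `tids_files` names mentioned below; everything else here (the file objects, `run_train_ids`, function names) is a stand-in, not the real extra_data internals:

```python
def build_index(run_train_ids, source_files):
    """Map each train ID to its position in the run, and each position to its file."""
    tid_to_ix = {tid: ix for ix, tid in enumerate(run_train_ids)}
    tids_files = [None] * len(run_train_ids)
    for f in source_files:
        for tid in f.train_ids:
            ix = tid_to_ix.get(tid)
            if ix is not None:
                tids_files[ix] = f
    return tid_to_ix, tids_files

def files_for_chunk(chunk_tids, tid_to_ix, tids_files):
    """Select a chunk's files via the index: O(len(chunk)) rather than a scan over all files."""
    found = set()
    for tid in chunk_tids:
        ix = tid_to_ix.get(tid)
        if ix is not None and tids_files[ix] is not None:
            found.add(tids_files[ix])
    return list(found)
```

Building the index is done once per source, and each chunk then only touches its own trains, so the total work is linear in the number of trains.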
I played around a bit with making the index using NumPy arrays and set operations (e.g. `intersect1d`), but at least for the cases I was testing, the pure-Python approach was somewhat faster, and I find it more readable.

The index obviously uses some extra memory. `sys.getsizeof` tells me that for 27k trains, the `tid_to_ix` dict is ~1.3 MiB per source, and `tids_files` about 200 KiB. This seems like an acceptable trade-off.
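If you want to sanity-check the order of magnitude, `sys.getsizeof` on a dict of the same shape gives similar numbers; the train IDs below are made up, only the size matters.

```python
import sys

n_trains = 27_000
# Fake train IDs, just to build a dict of the right size.
tid_to_ix = {1_000_000_000 + i: i for i in range(n_trains)}

# getsizeof counts the dict's own hash table, not the int keys/values.
print(f"tid_to_ix container: {sys.getsizeof(tid_to_ix) / 1024 ** 2:.2f} MiB")
```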