This PR should improve the dataframe benchmark performance by removing two sources of redundant computation for SegArray columns. It also fixes a bug in the benchmark that only saved the result of the final run.
Both redundant computations appear in the `SegArray` initializer, which in my profiling contributed the most to the dataframe benchmark execution time. The first was the number of non-empty segments, which was calculated once when setting the `_non_empty` field and again when counting them. The second was the segment lengths, which were calculated in the `gen_ranges` function and then again in the `SegArray` initializer. I added an optional argument to `gen_ranges` that returns the lengths, so they can be passed straight to the initializer instead of being recalculated during initialization.
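To illustrate the idea, here is a minimal NumPy sketch of both changes. It is not the code in this PR: `gen_ranges_sketch`, `SegArraySketch`, the `return_lengths` flag, and the `_non_empty_count` attribute are hypothetical names chosen for the example, and the sketch assumes every `ends[i] >= starts[i]`.

```python
import numpy as np


def gen_ranges_sketch(starts, ends, return_lengths=False):
    """Concatenate the ranges [starts[i], ends[i]) into one flat array.

    Hypothetical stand-in for a gen_ranges-style helper; assumes ends >= starts.
    """
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    lengths = ends - starts                  # computed exactly once
    segs = np.cumsum(lengths) - lengths      # offset of each segment in the output
    total = int(lengths.sum())
    # Element i of segment k has value starts[k] + (i - segs[k]).
    values = np.arange(total, dtype=np.int64) - np.repeat(segs - starts, lengths)
    if return_lengths:
        return segs, values, lengths
    return segs, values


class SegArraySketch:
    """Toy segmented array that can reuse precomputed segment lengths."""

    def __init__(self, segments, values, lengths=None):
        self.segments = np.asarray(segments, dtype=np.int64)
        self.values = np.asarray(values)
        if lengths is None:
            # Fallback: derive lengths from consecutive offsets.
            bounds = np.append(self.segments, self.values.size)
            lengths = np.diff(bounds)
        self.lengths = np.asarray(lengths, dtype=np.int64)
        # Build the mask once and derive the count from it,
        # rather than evaluating `lengths > 0` a second time.
        self._non_empty = self.lengths > 0
        self._non_empty_count = int(self._non_empty.sum())


segs, vals, lens = gen_ranges_sketch([0, 5, 7], [3, 5, 9], return_lengths=True)
sa = SegArraySketch(segs, vals, lengths=lens)  # lengths are not recomputed
```

Keeping `lengths` optional means callers that only have offsets and values still work, while callers that already hold the lengths skip the extra pass.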