Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SegArray optimization & bug fix #3021

Merged
merged 1 commit into from
Mar 11, 2024

Conversation

brandon-neth
Copy link
Collaborator

This PR should improve the dataframe benchmark performance by removing two sources of redundant computation for SegArray columns. It also fixes a bug in the benchmark that only saved the result of the final run.

Both redundant computations appear in the SegArray initializer, which in my profiling contributed the most to the dataframe benchmark execution time. The first redundant computation occurred when calculating the number of non-empty segments, which was calculated both when setting the _non_empty field and when counting them. The second redundant computation was the calculation of segment lengths, which occurs in the gen_ranges function, then again in the SegArray initializer. I added an optional argument to gen_ranges to return the lengths to pass straight to the initializer instead of calculating it again during initialization.

Copy link
Contributor

@bmcdonald3 bmcdonald3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, good catch! Just curious, do you have any performance runs you can put in the description so we have a sense of what kind of improvement to expect? An example of that would be something like #2290

Copy link
Member

@stress-tess stress-tess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!! I would also like to see the improved times, but I'll go ahead and merge in the meantime

@stress-tess stress-tess added this pull request to the merge queue Mar 11, 2024
Merged via the queue into Bears-R-Us:master with commit 31e2c82 Mar 11, 2024
13 checks passed
@brandon-neth
Copy link
Collaborator Author

Sure, these results are from my Mac. Average of 5 runs for 10k-element-long columns.

Original version: .9077 seconds
With non-empty fix: .7495 seconds
Reusing lengths: 0.6762 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants