Convert nested join in Vector Queries to Pandas Merge. #1298
For the 20% speedup, how many rows does the table contain?
100k
for col_name in column_list:
    res_row[col_name] = row[col_name]
res_row_list[idx] = res_row
result_df = pd.merge(
Instead of doing O(n) merges, would we get better performance if we fetched all batches from the child and merged only once?
Thanks for the suggestion. I have also changed the code so that child frames are not added to the result df before merging, which avoids unnecessary processing. The speedup is now 2x.
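For readers following the thread, here is a minimal sketch of the difference being discussed: merging every child batch separately versus concatenating the batches and merging once. The function names, the `distance` and `label` columns, and the shape of the data are hypothetical and are not taken from this PR; only `pd.merge` and `pd.concat` are real pandas calls.

```python
import pandas as pd


def merge_per_batch(child_batches, features):
    """One pd.merge call per child batch, then stitch the pieces together."""
    merged = [
        pd.merge(batch, features, left_index=True, right_index=True, how="left")
        for batch in child_batches
    ]
    return pd.concat(merged)


def merge_once(child_batches, features):
    """Collect all child batches first, then issue a single pd.merge call."""
    all_children = pd.concat(child_batches)
    return pd.merge(
        all_children, features, left_index=True, right_index=True, how="left"
    )


# Hypothetical data: each batch is indexed by row id and carries a distance.
child_batches = [
    pd.DataFrame({"distance": [0.1, 0.2]}, index=[3, 1]),
    pd.DataFrame({"distance": [0.3]}, index=[2]),
]
# Hypothetical lookup frame holding the extra columns, also indexed by row id.
features = pd.DataFrame({"label": ["a", "b", "c", "d"]}, index=[0, 1, 2, 3])

result = merge_once(child_batches, features)
```

Both functions should return the same rows in the same order for data like this; the second simply pays the fixed cost of a merge once instead of once per batch.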
left_index=True,
right_index=True,
how="left",
# sort=False
Just remove this?
Done
Profiling on Vector Scan showed that we are spending a lot of time in the post-processing logic, which performs a nested join. This is an initial commit that replaces it with a join using Pandas. The change showed a ~50% improvement in similarity queries.
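As a rough illustration of the change described above, the sketch below contrasts a row-by-row nested join with an index-based `pd.merge`. It assumes both frames are indexed by row id; `nested_join`, `merge_join`, `index_df`, `table_df`, and the sample columns are hypothetical names chosen for the example and do not reproduce the project's actual code.

```python
import pandas as pd


def nested_join(index_df, table_df, column_list):
    """Row-by-row join: look up each row id and copy the requested columns."""
    rows = []
    for row_id, idx_row in index_df.iterrows():
        res_row = idx_row.to_dict()
        table_row = table_df.loc[row_id]
        for col_name in column_list:
            res_row[col_name] = table_row[col_name]
        rows.append(res_row)
    return pd.DataFrame(rows, index=index_df.index)


def merge_join(index_df, table_df, column_list):
    """Same join expressed as one pandas merge on the shared index."""
    return pd.merge(
        index_df,
        table_df[column_list],
        left_index=True,
        right_index=True,
        how="left",
    )


# Hypothetical similarity-search output: row ids with their distances.
index_df = pd.DataFrame({"distance": [0.1, 0.2, 0.3]}, index=[3, 1, 2])
# Hypothetical base table holding the columns to attach.
table_df = pd.DataFrame(
    {"label": ["a", "b", "c", "d"], "frame": [10.0, 11.0, 12.0, 13.0]},
    index=[0, 1, 2, 3],
)

looped = nested_join(index_df, table_df, ["label"])
merged = merge_join(index_df, table_df, ["label"])
```

The merge version pushes the per-row lookups into pandas instead of a Python loop, which is where a speedup would be expected to come from; the exact gain depends on the table size and is not measured here.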