-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #1300 - Improve Performance of DataFrame Display #1334
Closes #1300 - Improve Performance of DataFrame Display #1334
Conversation
…ion. Benchmarks configured to time _repr_html_(), and compare the head tail methods.
8d07ef8
to
19b6916
Compare
Added @joshmarshall1 as a reviewer even though he helped write this code. There are some elements that I handled that would be good for him to review as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a participant in developing this, I won't approve it, but after a second review I think we should go through and add comments throughout the chpl code, since there are almost none, just to make future maintenance easier. I also noticed a few places that TODO tags were left in and a stray #
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall! Just a couple requested changes in the benchmark, and a suggestion about when/how we switch to the new implementation.
Updated benchmark results on single node with requested updates from @reuster986 implemented. array size = 10,000
number of trials = 5
>>> arkouda dataframe display
numLocales = 1, N = 10,000
_get_head_tail_server Average time = 0.0261 sec
_get_head_tail_server Average rate = 0.05 GiB/sec
_get_head_tail Average time = 0.0547 sec
_get_head_tail Average rate = 0.02 GiB/sec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple comments, nothing major. The only thing that might need to be updated is "support uint for segarray values" question? But the logic all looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Closes #1300
_get_head_tail()
method to function by creating a single server message instead of 1 per column in the DataFrame. This was done by adding_get_head_tail_server()
which allows us to benchmark the new code against the old. The old function has been retained for benchmarking.DataFrameIndexingMsg.chpl
to parse and process the server message for indexing the columns. The message is configured so the the column type, column name (in the df), column object name (server-side) are sent to the server to indexing to the appropriate head/tail._rep_html()
which now calls_get_head_tail_server()
. The benchmark also checks_get_head_tail_server()
and_get_head_tail()
directly to allow for easy comparison.Per the issue description, @joshmarshall1 and I did this without using aggregation because the datasets should be relatively small.
Benchmarking on a single node shows an performance already, but we will need to benchmark this on a multi-node system. Results from single node:
@reuster986 - I will leave it up to you if you would like to request an out of band benchmark.