Closes #1300 - Improve Performance of DataFrame Display #1334

Merged

Ethan-DeBandi99
Contributor

Closes #1300

  • Updates head/tail retrieval to send a single server message instead of one per column in the DataFrame. The new logic lives in _get_head_tail_server(); the old _get_head_tail() has been retained so the new code can be benchmarked against it.
  • Adds DataFrameIndexingMsg.chpl to parse and process the server message for indexing the columns. The message carries each column's type, column name (in the df), and column object name (server-side), so the server can index every column to produce the appropriate head/tail.
  • Benchmarking is configured to test _repr_html_(), which now calls _get_head_tail_server(). The benchmark also times _get_head_tail_server() and _get_head_tail() directly to allow for easy comparison.
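
To illustrate the idea behind the first two bullets, here is a minimal sketch of batching column descriptors into one request instead of sending one request per column. It is not the code in this PR: the JSON layout, `build_single_message`, `head_tail_indices`, `CountingServer`, and the `id_*` object names are all hypothetical stand-ins, and the real message format is whatever DataFrameIndexingMsg.chpl parses.

```python
import json


def head_tail_indices(length, n=5):
    """Row positions for the first and last n rows (all rows if they overlap)."""
    if length <= 2 * n:
        return list(range(length))
    return list(range(n)) + list(range(length - n, length))


def build_single_message(columns, length, n=5):
    """Pack every column's descriptor into one request payload.

    `columns` is a list of (col_type, col_name, server_obj_name) tuples.
    The JSON layout is an assumption for illustration, not arkouda's format.
    """
    return json.dumps({
        "idx": head_tail_indices(length, n),
        "columns": [
            {"type": t, "name": name, "obj": obj} for t, name, obj in columns
        ],
    })


class CountingServer:
    """Stand-in for a server; it only counts how many requests arrive."""

    def __init__(self):
        self.requests = 0

    def send(self, payload):
        self.requests += 1
        return payload


# Hypothetical three-column DataFrame with 10,000 rows.
columns = [("pdarray", "a", "id_1"), ("Strings", "b", "id_2"), ("pdarray", "c", "id_3")]

# Old approach: one round trip per column.
per_column = CountingServer()
for col in columns:
    per_column.send(build_single_message([col], length=10_000))

# New approach: one round trip for the whole DataFrame.
batched = CountingServer()
batched.send(build_single_message(columns, length=10_000))

print(per_column.requests, batched.requests)
```

The point of the sketch is only the request count: the per-column path scales with the number of columns, while the batched path stays at one message regardless of DataFrame width.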

Per the issue description, @joshmarshall1 and I did this without using aggregation because the datasets should be relatively small.

Benchmarking on a single node already shows a performance improvement, but we will need to benchmark this on a multi-node system. Results from a single node:

array size = 10,000
number of trials =  5
>>> arkouda dataframe display
numLocales = 1, N = 10,000
  _repr_html_ Average time = 0.0401 sec
  _repr_html_ Average rate = 0.03 GiB/sec
  _get_head_tail_server Average time = 0.0406 sec
  _get_head_tail_server Average rate = 0.03 GiB/sec
  _get_head_tail Average time = 0.0825 sec
  _get_head_tail Average rate = 0.01 GiB/sec

@reuster986 - I will leave it up to you if you would like to request an out of band benchmark.

@Ethan-DeBandi99 Ethan-DeBandi99 force-pushed the 1300_dataframe_display_perf branch from 8d07ef8 to 19b6916 on April 27, 2022 15:39
@Ethan-DeBandi99
Contributor Author

Added @joshmarshall1 as a reviewer even though he helped write this code. There are some elements that I handled that would be good for him to review as well.

Contributor

@joshmarshall1 joshmarshall1 left a comment

As a participant in developing this, I won't approve it, but after a second review I think we should go through and add comments throughout the chpl code, since there are almost none, just to make future maintenance easier. I also noticed a few places where TODO tags were left in, and a stray #.

arkouda/dataframe.py (outdated, resolved)
src/DataFrameIndexingMsg.chpl (outdated, resolved)
src/DataFrameIndexingMsg.chpl (outdated, resolved)
Collaborator

@reuster986 reuster986 left a comment

Looks good overall! Just a couple requested changes in the benchmark, and a suggestion about when/how we switch to the new implementation.

benchmarks/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (outdated, resolved)
arkouda/dataframe.py (resolved)
@Ethan-DeBandi99
Contributor Author

Updated benchmark results on single node with requested updates from @reuster986 implemented.

array size = 10,000
number of trials =  5
>>> arkouda dataframe display
numLocales = 1, N = 10,000
  _get_head_tail_server Average time = 0.0261 sec
  _get_head_tail_server Average rate = 0.05 GiB/sec
  _get_head_tail Average time = 0.0547 sec
  _get_head_tail Average rate = 0.02 GiB/sec
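
As a quick check on the two sets of single-node numbers reported in this thread, the speedup of _get_head_tail_server() over _get_head_tail() can be computed directly from the average times. The snippet below uses only the times quoted above; the variable names are illustrative.

```python
# Average times in seconds, copied from the two single-node runs in this thread.
runs = {
    "initial": {"old": 0.0825, "new": 0.0406},
    "updated": {"old": 0.0547, "new": 0.0261},
}

# Speedup = old time / new time; values above 1 mean the new path is faster.
speedups = {name: t["old"] / t["new"] for name, t in runs.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Both runs come out at roughly a 2x improvement on one locale with N = 10,000, which says nothing yet about multi-node behavior.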

Collaborator

@reuster986 reuster986 left a comment

Looks great, thanks!

Member

@stress-tess stress-tess left a comment

A couple of comments, nothing major. The only thing that might need to be updated is the "support uint for segarray values" question. But the logic all looks good to me.

arkouda/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (outdated, resolved)
benchmarks/dataframe.py (resolved)
src/DataFrameIndexingMsg.chpl (outdated, resolved)
src/DataFrameIndexingMsg.chpl (outdated, resolved)
src/DataFrameIndexingMsg.chpl (outdated, resolved)
Member

@stress-tess stress-tess left a comment

Looks good!

@mhmerrill mhmerrill merged commit b11e22b into Bears-R-Us:master Apr 29, 2022
@Ethan-DeBandi99 Ethan-DeBandi99 deleted the 1300_dataframe_display_perf branch May 2, 2022 17:36
Development

Successfully merging this pull request may close these issues.

Improve performance of dataframe display
5 participants