Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Part of #1867: String GroupBy optimizations are timing out nightly tests #1869

Conversation

stress-tess
Copy link
Member

This PR (part of #1867):

  • Changes the medium str logic (8 <= max_bytes < 16) to not use the 2 array return variation of computeOnSegments
  • Updates bytesToUintArr to run in parallel

If the nightly still has issues, I think the next step is to remove the special logic for medium str until the next iteration

…nightly tests

This PR (part of Bears-R-Us#1867):
- Changes the medium str logic (`8 <= max_bytes < 16`) to not use the array return variation of `computeOnSegments`
- Updates `bytesToUintArr` to run in parallel

If the nightly still has issues, I think the next step is to remove the special logic for medium str until the next iteration
}
return concat;
var localSlice = new lowLevelLocalizingSlice(values, rng);
return | reduce [i in 0..#rng.size] (localSlice.ptr(i):uint)<<(8*(rng.size-1-i));
Copy link
Member Author

@stress-tess stress-tess Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this should be equivalent to a more explicit forall, something like

var concat: uint;
forall i in 0..#rng.size with (| reduce concat) {
  concat |= (localSlice.ptr(i):uint)<<(8*(rng.size-1-i));
}
return concat;

@ronawho @bradcray, I was wondering if there's any performance benefit between the two? I feel like the one line version could save on memory because (i don't think) we would need shadow variables?

In this particular the example the loop should be small, so probably not a big deal, but it be nice to know for future reference

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't feel like the localizing slice should be necessary since we're calling using computeOnSegments, so things should be local? But I know a lot of the other computeOnSegments functions call either interpretAsString or interpretAsBytes which do a localizing slice, so it seemed worth a try

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierce314159 : Sorry for the delayed response, I was out most of last week. I think the difference between your new code in 1441-1442, and your more explicit forall in the comment above should be a wash. Specifically, Chapel typically rewrites op reduce operand-expr expressions into the equivalent forall loop with shadow variables, so as long as the expression driving the forall loop and the operand-expr are the same, you should feel free to use whichever style you prefer. The shadow variables in cases like this are very modest in size (a scalar per task), so should not be a big deal.

I'm not terribly familiar with lowLevelLocalizingSlice(), so also found myself wondering whether this could just be written as something like: return | reduce [i in 0..<rng.size] (values[i]:uint)<<(8*(rng.size-1-i)); without much loss in performance by eliminating the need for the slice. Or, if you're correct that all indices i are local, values.localAccess(i) instead of values[i]. This paragraph is very much a shot in the dark, though, and if the performance is good enough and the style doesn't bother you, definitely ignorable.

Comment on lines -134 to -135
proc twoReturnComputeOnSegments(segments: [?D] int, values: [?vD] ?t, param function: SegFunction, type retType) throws {
// I hate duplicating the code from computeOnSegments but I need to return 2 arrays
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't immediately see any issues with this code but it's def the most complicated/part i understand the least, so it seemed like the natural choice to revert considering the timeout issue

Copy link
Contributor

@Ethan-DeBandi99 Ethan-DeBandi99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me as a good starting point to fix the issues with the nightly. I am interested in the input of @ronawho and @bradcray which is why I am leaving this as a comment and not an approval overall.

@bmcdonald3
Copy link
Contributor

This appears to fix the issue and is now completing in multi locale tests:

array size = 100,000,000
number of trials =  1
numLocales = 16, N = 1,600,000,000
small str array Average time = 5.1369 sec
small str array Average rate = 1.1603 GiB/sec
medium str array Average time = 11.2357 sec
medium str array Average rate = 0.9284 GiB/sec
big str array Average time = 10.1455 sec
big str array Average rate = 1.9093 GiB/sec

And also seems to be scaling:

array size = 100,000,000
number of trials =  1
numLocales = 2, N = 200,000,000
small str array Average time = 6.7848 sec
small str array Average rate = 0.1098 GiB/sec
medium str array Average time = 7.3457 sec
medium str array Average rate = 0.1775 GiB/sec
big str array Average time = 6.6441 sec
big str array Average rate = 0.3645 GiB/sec

Copy link
Contributor

@bmcdonald3 bmcdonald3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have not done a thorough review, but seems to fix the timeout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants