[FEA] Out of core sorts #19
Is your feature request related to a problem? Please describe.
Currently when sorting, the plugin has to hold all of the data for a single partition in memory. It would be great if we supported spilling some of that data to host memory/disk, so that if someone had a horribly skewed sort the plugin could still work, even if performance was not ideal.

Comments
Sorry for the wall of text. I didn't have the energy to come up with some pictures. I have been playing around with a few algorithms on paper to try to understand the ideal way to do this. For now I think we want an external merge sort, but with a few twists because of the limitations we have with spilling the data, the APIs available to us through cuDF, and the fact that we don't know ahead of time how much data we are going to sort.

The main idea is that we sort each batch as it comes in and split it into N chunks (I'll describe this in more detail later). Each chunk is stored in the spill-able cache, but before we put a chunk into the cache we record its minimum key. (I'll describe this in terms of an ascending sort order for all columns, but it should be simple to generalize to any order for any of the columns.)

Once we have done this for all of the incoming batches, we order the chunks by their minimum values. We take the N chunks with the lowest minimums and do a merge sort on them. We then look at the minimum value of the next chunk that was not included. Everything less than or equal to that key is now in sorted order and can be output for later processing. Everything greater than it needs to be split into smaller chunks again, and the process repeats until all of the data is sorted.

Conceptually I think we want to deal in bytes instead of row counts: when splitting the data we should divide the target output batch size by N to size each chunk. Additionally, when loading the data we could keep track of the size of each chunk and load chunks until we would overflow the target batch size.

This works great on paper so long as N is large enough. If we have a lot of input data (enough to warrant multiple merge-sort passes through the data), then I see us outputting relatively small amounts of data in many cases. In the worst case all of the input batches hold the same data and we can only output one row per input batch. To avoid this I think we want a target output size for the sort. If we ran into a situation where we could not output enough data, we would store the data to output in the spill-able cache and go on until we had enough data to output. We need to be careful here too, because we might end up with a batch that would exceed the target size.

There are also some games we can play with not putting data into the spill-able cache if we know we are going to use it in the next round of processing. I have seen this be a big benefit on paper, but I have no idea how it would work in real life. We need to be careful though: we need to make sure that all of the data we are not going to output is in the cache when we hand control off downstream. We don't currently do that for several algorithms, but I think we need to work on doing it more.
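To make the control flow concrete, here is a minimal in-memory sketch of the merge pass described above. It assumes Int keys, with plain Vectors standing in for cuDF tables in the spillable cache and `.sorted` standing in for a GPU k-way merge; all the names (`Chunk`, `n`, `targetOutputRows`) are illustrative, not real plugin APIs.

```scala
import scala.collection.mutable

object OutOfCoreSortSketch {
  // A chunk is already sorted; in the real design only its minimum key
  // would be kept out of the spillable cache.
  final case class Chunk(rows: Vector[Int]) { def minKey: Int = rows.head }

  // Split sorted rows into (at most) n chunks and queue them by min key.
  private def enqueueSplit(rows: Vector[Int], n: Int,
                           q: mutable.PriorityQueue[Chunk]): Unit = {
    val per = math.max(1, math.ceil(rows.length.toDouble / n).toInt)
    rows.grouped(per).foreach(c => q.enqueue(Chunk(c.toVector)))
  }

  def sort(batches: Iterator[Vector[Int]], n: Int,
           targetOutputRows: Int): Seq[Vector[Int]] = {
    // Reversed ordering so dequeue() returns the lowest minimum key.
    val pending = mutable.PriorityQueue.empty[Chunk](
      Ordering.by[Chunk, Int](_.minKey).reverse)

    // Pass 1: sort each incoming batch and split it into N chunks.
    batches.filter(_.nonEmpty).foreach(b => enqueueSplit(b.sorted, n, pending))

    val out = mutable.ArrayBuffer.empty[Vector[Int]]
    var carry = Vector.empty[Int] // rows held back until we hit the target
    // Pass 2: merge the N chunks with the lowest minimums, emit what is
    // known to be globally ordered, and re-split the rest.
    while (pending.nonEmpty) {
      val picked = Seq.fill(math.min(n, pending.size))(pending.dequeue())
      val merged = picked.flatMap(_.rows).sorted.toVector
      // Rows <= the next chunk's minimum cannot be beaten by unseen data.
      val boundary = if (pending.nonEmpty) pending.head.minKey else Int.MaxValue
      val (done, rest) = merged.partition(_ <= boundary)
      carry ++= done
      while (carry.length >= targetOutputRows) {
        out += carry.take(targetOutputRows)
        carry = carry.drop(targetOutputRows)
      }
      if (rest.nonEmpty) enqueueSplit(rest, n, pending)
    }
    if (carry.nonEmpty) out += carry
    out.toSeq
  }
}
```

For example, `OutOfCoreSortSketch.sort(Iterator(Vector(5, 3, 9), Vector(1, 8, 2)), n = 2, targetOutputRows = 4)` yields `Seq(Vector(1, 2, 3, 5), Vector(8, 9))`. The `carry` buffer is the "store the output until we have enough" idea, and progress is guaranteed because every dequeued chunk's minimum is at or below the boundary, so each pass emits at least one row.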
Just as a side note, there was some discussion about implementing a histogram-like operation for the GPU to be used with distributed sort. I have not seen that API, but in theory it would let us get a dynamically bucketed histogram of the data. We could then do an algorithm where we make one pass through the data to create this histogram, pull it back, and figure out all of the cut points. We would then read back in all of the data, sort it, and cut it, putting it back in the cache. After that we pull in just the parts we care about for each final merge sort and output. This would be closer to what Spark does for distributed sort (where we also hope to eventually use this histogram code). We might want to do some analysis on this once we have the histogram code in place. The big issue I see with the histogram approach is that it could make us susceptible to data skew.
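A rough sketch of that flow, under the assumption that the histogram step can be approximated by taking evenly spaced quantiles of the keys as cut points (everything here, `cutPoints`, `bucketOf`, `numBuckets`, is hypothetical):

```scala
import scala.collection.mutable

object HistogramSortSketch {
  // Pass 1 stand-in: derive cut points that split the keys into roughly
  // equal buckets. Heavy duplication of one key (data skew) collapses
  // several cut points together and leaves a single oversized bucket.
  def cutPoints(keys: Vector[Int], numBuckets: Int): Vector[Int] = {
    val sorted = keys.sorted
    (1 until numBuckets).map(i => sorted(i * sorted.length / numBuckets)).toVector
  }

  // First bucket whose upper cut point covers the key.
  def bucketOf(key: Int, cuts: Vector[Int]): Int = {
    val idx = cuts.indexWhere(key <= _)
    if (idx < 0) cuts.length else idx
  }

  def sort(batches: Seq[Vector[Int]], numBuckets: Int): Vector[Int] = {
    require(batches.exists(_.nonEmpty), "need some data to sample")
    val cuts = cutPoints(batches.flatten.toVector, numBuckets)
    // Pass 2 stand-in: route every row to its bucket; the real version
    // would sort and cut each batch and spill the pieces to the cache.
    val buckets = Array.fill(numBuckets)(mutable.ArrayBuffer.empty[Int])
    for (batch <- batches; key <- batch) buckets(bucketOf(key, cuts)) += key
    // Final pass: buckets cover disjoint key ranges, so sorting each bucket
    // on its own and concatenating yields the full sorted output.
    buckets.iterator.flatMap(_.sorted).toVector
  }
}
```

The appeal is that each bucket can be merged and output independently; the skew concern shows up directly in `cutPoints`, where repeated keys leave some buckets far larger than the target batch size.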
Removing from 0.4 as it has not been prioritized for this release. However, we should consider coming back to this soon.