[jvm-packages] Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives #3489
Comments
I think we could do |
Calling sortBy() on a large dataset is very expensive, and it does not preserve the order of items within the same group. To work around the problem, I used a wrapper class that holds one group of training data and partitioned on that. Instead of XGBLabeledPoint, I use the following data structures: The method looks like this: I repartition the queriesRDD and expand it into XGBLabeledPoint with flatMap(..) after the repartition. This keeps the order of the groups and the order of items within each group. |
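The wrapper idea described in this comment can be sketched roughly as follows. This is not the commenter's code, which did not survive in this thread; it is a minimal illustration using plain Scala collections in place of an RDD. `Point`, `QueryGroup`, and `GroupedRepartition` are hypothetical names, and round-robin assignment of groups is an assumed stand-in for whatever partitioning Spark would apply.

```scala
// Hypothetical stand-in for XGBLabeledPoint; only the fields needed here.
final case class Point(label: Float, group: Int, features: Vector[Float])

// Hypothetical wrapper: one query group with its items in original order.
final case class QueryGroup(group: Int, points: Vector[Point])

object GroupedRepartition {
  // Collapse consecutive rows that share a group id into one QueryGroup.
  // Assumes the input is already grouped, as ranking training requires.
  def toGroups(rows: Seq[Point]): Vector[QueryGroup] =
    rows.foldLeft(Vector.empty[QueryGroup]) { (acc, p) =>
      acc.lastOption match {
        case Some(g) if g.group == p.group =>
          acc.init :+ g.copy(points = g.points :+ p)
        case _ =>
          acc :+ QueryGroup(p.group, Vector(p))
      }
    }

  // Deal whole groups (not single rows) into numWorkers buckets, then
  // expand each bucket back into points, mirroring flatMap on the RDD.
  def repartition(rows: Seq[Point], numWorkers: Int): Vector[Vector[Point]] = {
    val buckets = toGroups(rows).zipWithIndex
      .groupBy { case (_, i) => i % numWorkers }
    (0 until numWorkers).toVector.map { w =>
      buckets.getOrElse(w, Vector.empty)
        .sortBy(_._2)          // keep original group order inside a bucket
        .flatMap(_._1.points)  // item order inside each group is untouched
    }
  }
}
```

Because the unit being moved is a whole group, a group can never be split across buckets, and items inside a group are never reordered. In real Spark code the same shape would presumably apply to an RDD of wrapper objects, but the ordering guarantees of an actual shuffle would need to be checked separately.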
That would be very nice. Thanks! |
Repartitioning an RDD[XGBLabeledPoint] also has another issue: it does not guarantee that items within the same group stay in the same partition after repartitioning. |
Some more issues with XGBoost.Watches class:
I am trying to refactor my custom code to fix all the above issues without breaking the existing code structure. |
…y shuffle all data and lose ordering required for ranking objectives (dmlc#3654)
Training for a ranking objective requires the Spark DataFrame to be grouped by group, i.e. the rows of each query group must be next to each other. This assumption is made at the following places in the code:
Currently, xgboost-spark code repartitions the data here to have partitions = num_workers. This can potentially shuffle all the data and lose the above ordering, breaking the above assumptions.
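To make the failure mode concrete, here is a small sketch in plain Scala (no Spark) that checks the two layout properties ranking relies on and shows how row-level redistribution can break one of them. `RankingLayoutCheck` and its methods are hypothetical helpers, and the round-robin dealing is only a simplified stand-in for a real shuffle, whose exact placement of rows is not specified in this issue.

```scala
object RankingLayoutCheck {
  // True when every group id forms a single unbroken run in the partition.
  def isContiguous(groupIds: Seq[Int]): Boolean = {
    // Compress consecutive duplicates, e.g. 0,0,1,1,0 becomes 0,1,0.
    val runs = groupIds.foldLeft(Vector.empty[Int]) { (acc, g) =>
      if (acc.lastOption.contains(g)) acc else acc :+ g
    }
    runs.distinct.length == runs.length
  }

  // True when no group id appears in more than one partition.
  def noGroupSplit(partitions: Seq[Seq[Int]]): Boolean = {
    val all = partitions.flatMap(_.distinct) // group ids present per partition
    all.distinct.length == all.length
  }

  // Row-level round-robin dealing: an assumed, simplified shuffle model.
  def roundRobin(groupIds: Seq[Int], numWorkers: Int): Vector[Vector[Int]] =
    (0 until numWorkers).toVector.map { w =>
      groupIds.zipWithIndex.collect {
        case (g, i) if i % numWorkers == w => g
      }.toVector
    }
}
```

With group ids `0,0,0,1,1,2,2,2` dealt row by row to two workers, every group ends up on both workers, so `noGroupSplit` fails even though the original input was correctly grouped.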
Workaround:
Make sure that #partitions = num_workers, so that xgboost-spark does not repartition the data for you.
@CodingCat, @hcho3: My current workaround is good enough for me, but this should be improved inside the xgboost-spark code. Any suggestions?