
[jvm-packages] Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives #3489

Closed
ngoyal2707 opened this issue Jul 18, 2018 · 5 comments

Comments

@ngoyal2707
Contributor

Training for a ranking objective requires the Spark DataFrame to be grouped by query group, i.e. all rows of the same query group must be next to each other. This assumption is relied on in the following places in the code:

  1. Making group data
  2. Calculating ranking objective loss

Currently the xgboost-spark code repartitions the data here so that the number of partitions equals num_workers. This can potentially shuffle all the data, losing the ordering described above and breaking both assumptions.

Workaround:
Make sure that #partitions = num_workers, so that xgboost-spark does not repartition the data for you.

@CodingCat, @hcho3: my current workaround is good enough for me, but this should be improved inside the xgboost-spark code. Any suggestions?
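
A minimal sketch of this workaround on the caller's side, assuming a hypothetical DataFrame df with a "group" column and a numWorkers value equal to the num_workers parameter passed to xgboost-spark:

import org.apache.spark.sql.functions.col

// Hash-partition by the query-group column so every row of a group lands in the
// same partition, then sort within each partition so the groups stay contiguous.
val prepared = df
  .repartition(numWorkers, col("group"))
  .sortWithinPartitions("group")
// prepared now has numWorkers partitions, so xgboost-spark should not need to
// repartition it again.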

@hcho3
Collaborator

hcho3 commented Jul 21, 2018

I think we could do sortBy() with numPartitions, so as to sort by the group IDs. @CodingCat @ngoyal2707 I'm not that familiar with Spark; does this sound feasible to you?
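
For reference, a minimal sketch of that suggestion, assuming a hypothetical RDD of (groupId, XGBLabeledPoint) pairs named pointsWithGroup and a numWorkers value:

// Sort by group ID while fixing the partition count. Note that RDD.sortBy
// performs a full range-partitioned shuffle of the dataset.
val sorted = pointsWithGroup
  .sortBy(_._1, ascending = true, numPartitions = numWorkers)
  .map(_._2)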

@weitian
Contributor

weitian commented Aug 20, 2018

sortBy() on a large dataset is very expensive, and it does not keep the order of the items within the same group.

I used a wrapper class to hold a group of training data and partitioned on that to work around the problem.
I plan to make it more generic and submit a PR later.

Instead of XGBLabeledPoint, I use the following data structures:
case class IndexedFeaturePair(key: Short, value: String)
case class IndexedSuggestion(label: Double, features: Array[IndexedFeaturePair])
case class IndexedQuery(suggestion_count: Int, suggestions: Array[IndexedSuggestion])

The method looks like this:
def trainDistributed(
    queriesRDD: RDD[IndexedQuery],
    params: Map[String, Any],
    round: Int,
    nWorkers: Int,
    trainingCount: LongAccumulator,
    testingCount: LongAccumulator,
    obj: ObjectiveTrait = null,
    eval: EvalTrait = null,
    useExternalMemory: Boolean = false,
    missing: Float = Float.NaN): (Booster, Map[String, Array[Float]]) = {

I repartition the queriesRDD and expand it into XGBLabeledPoint with flatMap(..) after the repartition.

This way it keeps the order of the groups and the order of the items within each group, too.
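
A minimal sketch of that repartition-then-expand step, where toLabeledPoints is a hypothetical helper (not part of xgboost-spark) that converts one IndexedQuery into its XGBLabeledPoint rows:

// Hypothetical conversion helper; a real implementation would parse the feature pairs.
def toLabeledPoints(query: IndexedQuery): Seq[XGBLabeledPoint] = ???

// Shuffle whole IndexedQuery objects, then expand each query into labeled points
// only after the shuffle, so group members stay contiguous and in order.
val labeledPoints: RDD[XGBLabeledPoint] =
  queriesRDD
    .repartition(nWorkers)
    .flatMap(toLabeledPoints)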

@hcho3
Collaborator

hcho3 commented Aug 20, 2018

@weitian

I used a wrapper class to hold a group of training data and partitioned on that to work around the problem.
I plan to make it more generic and submit a PR later.

That would be very nice. Thanks!

@weitian
Contributor

weitian commented Aug 21, 2018

Repartitioning an RDD[XGBLabeledPoint] also has another issue: it does not guarantee that the items within the same group stay in the same partition after repartitioning.

@weitian
Contributor

weitian commented Aug 24, 2018

Some more issues with the Watches object in XGBoost:

  1. The random split of the training and test datasets in object Watches also breaks the groups.
  2. The line "val (trainIter1, trainIter2) = trainPoints.duplicate" used to generate group data may cause memory issues with a large dataset. A Scala Iterator is lazily evaluated, but a duplicated iterator may be forced to buffer the data in the gap between the two iterators (see https://www.scala-lang.org/old/node/4943); see the sketch after this list.
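
A minimal standalone sketch of the Iterator.duplicate buffering behavior described in item 2 (plain Int values stand in for the labeled points):

// Elements pulled through iterA but not yet through iterB are buffered in memory,
// so draining one copy far ahead of the other holds the whole gap of data at once.
val trainPoints: Iterator[Int] = Iterator.range(0, 1000000)
val (iterA, iterB) = trainPoints.duplicate
val total = iterA.size   // fully drains iterA; every element is now buffered for iterB
val first = iterB.next() // iterB still sees all elements, served from the buffer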

I am trying to refactor my custom code to fix all the above issues without breaking the existing code structure.

alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this issue Dec 4, 2018
…y shuffle all data and lose ordering required for ranking objectives (dmlc#3654)
lock bot locked as resolved and limited conversation to collaborators Jan 1, 2019