
[jvm-packages] Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives #3489

Closed
ngoyal2707 opened this issue Jul 18, 2018 · 5 comments

Comments

@ngoyal2707
Contributor

Training for a ranking objective requires the Spark DataFrame to be grouped by query group, i.e. all rows of the same query group must be next to each other. This assumption is relied on in the following places in the code:

  1. Making group data
  2. Calculating ranking objective loss

Currently the xgboost-spark code repartitions the data here so that the number of partitions equals num_workers. This can potentially shuffle all the data, losing the ordering described above and breaking both assumptions.

Workaround:
Make sure that #partitions = num_workers, so that xgboost-spark does not repartition the data for you.

@CodingCat, @hcho3: my current workaround is good enough for me, but this should be improved inside the xgboost-spark code. Any suggestions?
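
A minimal sketch of this workaround on the caller's side, assuming a hypothetical DataFrame df with a "group" column and a numWorkers value equal to the num_workers parameter passed to xgboost-spark:

import org.apache.spark.sql.functions.col

// Hash-partition by the query-group column so every row of a group lands in the
// same partition, then sort within each partition so the groups stay contiguous.
val prepared = df
  .repartition(numWorkers, col("group"))
  .sortWithinPartitions("group")
// prepared now has numWorkers partitions, so xgboost-spark should not need to
// repartition it again.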

@hcho3
Collaborator

hcho3 commented Jul 21, 2018

I think we could do sortBy() with numPartitions, so as to sort by the group IDs. @CodingCat @ngoyal2707 I'm not that familiar with Spark; does this sound feasible to you?
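
For reference, a minimal sketch of that suggestion, assuming a hypothetical RDD of (groupId, XGBLabeledPoint) pairs named pointsWithGroup and a numWorkers value:

// Sort by group ID while fixing the partition count. Note that RDD.sortBy
// performs a full range-partitioned shuffle of the dataset.
val sorted = pointsWithGroup
  .sortBy(_._1, ascending = true, numPartitions = numWorkers)
  .map(_._2)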

@weitian
Contributor

weitian commented Aug 20, 2018

sortBy() on a large dataset is very expensive, and it does not keep the order of the items within the same group.

I used a wrapper class to hold a group of training data and partitioned on that to work around the problem.
I plan to make it more generic and submit a PR later.

Instead of XGBLabeledPoint, I use the following data structures:
case class IndexedFeaturePair(key: Short, value: String)
case class IndexedSuggestion(label: Double, features: Array[IndexedFeaturePair])
case class IndexedQuery(suggestion_count: Int, suggestions: Array[IndexedSuggestion])

The method looks like this:
def trainDistributed(
    queriesRDD: RDD[IndexedQuery],
    params: Map[String, Any],
    round: Int,
    nWorkers: Int,
    trainingCount: LongAccumulator,
    testingCount: LongAccumulator,
    obj: ObjectiveTrait = null,
    eval: EvalTrait = null,
    useExternalMemory: Boolean = false,
    missing: Float = Float.NaN): (Booster, Map[String, Array[Float]]) = {

I repartition the queriesRDD and expand it into XGBLabeledPoint with flatMap(..) after the repartition.

This way it keeps the order of the groups and the order of the items within each group, too.
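
A minimal sketch of that repartition-then-expand step, where toLabeledPoints is a hypothetical helper (not part of xgboost-spark) that converts one IndexedQuery into its XGBLabeledPoint rows:

// Hypothetical conversion helper; a real implementation would parse the feature pairs.
def toLabeledPoints(query: IndexedQuery): Seq[XGBLabeledPoint] = ???

// Shuffle whole IndexedQuery objects, then expand each query into labeled points
// only after the shuffle, so group members stay contiguous and in order.
val labeledPoints: RDD[XGBLabeledPoint] =
  queriesRDD
    .repartition(nWorkers)
    .flatMap(toLabeledPoints)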

@hcho3
Collaborator

hcho3 commented Aug 20, 2018

@weitian

I used a wrapper class to hold a group of training data and partitioned on that to work around the problem.
I plan to make it more generic and submit a PR later.

That would be very nice. Thanks!

@weitian
Contributor

weitian commented Aug 21, 2018

Repartitioning an RDD[XGBLabeledPoint] also has another issue: it does not guarantee that the items within the same group stay in the same partition after repartitioning.

@weitian
Contributor

weitian commented Aug 24, 2018

Some more issues with the Watches object in XGBoost:

  1. The random split of the training and test datasets in object Watches also breaks the groups.
  2. The line "val (trainIter1, trainIter2) = trainPoints.duplicate" used to generate group data may cause memory issues with a large dataset. A Scala Iterator is lazily evaluated, but a duplicated iterator may be forced to buffer the data in the gap between the two iterators (see https://www.scala-lang.org/old/node/4943); see the sketch after this list.
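
A minimal standalone sketch of the Iterator.duplicate buffering behavior described in item 2 (plain Int values stand in for the labeled points):

// Elements pulled through iterA but not yet through iterB are buffered in memory,
// so draining one copy far ahead of the other holds the whole gap of data at once.
val trainPoints: Iterator[Int] = Iterator.range(0, 1000000)
val (iterA, iterB) = trainPoints.duplicate
val total = iterA.size   // fully drains iterA; every element is now buffered for iterB
val first = iterB.next() // iterB still sees all elements, served from the buffer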

I am trying to refactor my custom code to fix all the above issues without breaking the existing code structure.

alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this issue Dec 4, 2018
…y shuffle all data and lose ordering required for ranking objectives (dmlc#3654)
lock bot locked as resolved and limited conversation to collaborators Jan 1, 2019