Training LightGBMRanker several times gives different NDCG on testing set #580

Open
daureg opened this issue Jun 5, 2019 · 12 comments
@daureg

daureg commented Jun 5, 2019

I noticed that training several times on Databricks, with the same parameters and the same data, produces models that don't give the same predictions, as shown by different NDCG values on a separate test set.
Here is my training function. My training set has 400K examples in 5K lists, with 60 features:

def train(): Unit = {
  val lgbm = new LightGBMRanker()
    .setCategoricalSlotIndexes(Array(0, 2, 3, 4, 6, 7, 8, 59))
    .setFeaturesCol("features")
    .setGroupCol("query_id")
    .setLabelCol("label")
    .setMaxPosition(10)
    .setParallelism("voting")
    .setNumIterations(15)
    .setMaxDepth(4)
    .setNumLeaves(12)
  val training = table("training")
  val model = lgbm.fit(training)
}

Is that inherent to distributed training (on 5 executors) or should I change some parameters of my LightGBMRanker instance?

@daniloascione

If the table is repartitioned to 1 partition (table(s"training").repartition(1)), then the results are consistent, but this means no parallelism.
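
One general reason a single partition can behave differently from several, not confirmed as the cause in this thread, is that floating-point addition is not associative: summing the same values in different groupings can give different totals. A minimal plain-Python sketch follows; the helper names `sequential_sum` and `partitioned_sum` are made up for illustration and are not part of LightGBM or Spark.

```python
def sequential_sum(values):
    """Add values left to right, as a single worker would."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, num_partitions):
    """Split values into contiguous chunks, sum each chunk, then add the partial sums."""
    size = -(-len(values) // num_partitions)  # ceiling division
    partials = [sequential_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sequential_sum(partials)


# Illustrative values chosen so that rounding depends on the grouping.
values = [1e16, 1.0, -1e16, 1.0]
one_partition = partitioned_sum(values, 1)
two_partitions = partitioned_sum(values, 2)
print(one_partition, two_partitions)
```

The two totals differ even though the inputs are identical. With one partition the grouping is fixed, which would be consistent with the observation above, but nothing in this thread establishes it as the root cause.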

@imatiach-msft
Contributor

@daureg thank you for reporting this issue. This looks similar to the issue here:
#564
I will need to investigate further to find the root cause of the randomness; I'm not sure whether it is fixable. It's on my to-do list now, but at lower priority than:
#569
#483
Does a single model always give the same predictions? Or do only different models trained on the same data differ?

@daureg
Author

daureg commented Jun 5, 2019

Indeed, it's the same as #564 (unless there is something specific to the ranker, but most likely not). I will also try predicting several times with the same model, but for now it's different models trained on the same data.

@daniloascione

@imatiach-msft maybe we need to ensure that each partition gets all the elements of a given group, and to enforce the group sorting by adding a sortWithinPartitions here https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBase.scala#L45 (similarly to https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMRanker.scala#L67)

@imatiach-msft
Contributor

imatiach-msft commented Jun 5, 2019

@daniloascione yes, that was something I was going to add later. I'm not sure whether it should be a separate utility or be done in the ranker itself, which may hurt performance significantly since it would incur a shuffle across partitions. Note that it wouldn't go into LightGBMBase, because that is the base class for the classifier and regressor as well, and this is needed only for the ranker. I sort in LightGBMRanker so that the groups are ordered, but I don't ensure that a group doesn't cross partitions; as you said, in the ranker case each group should be in only one partition. I'm not sure it is related to your specific issue, though. Even if each group is confined to a single partition, you may still get different results from run to run, although the difference from model to model should at least be smaller.
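
The requirement described above (every row of a group in one partition, with rows ordered by group inside the partition) can be modelled in plain Python. This is a sketch of the idea only, not Spark's or mmlspark's implementation; `assign_partitions`, the CRC32-based hash, and the sample rows are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def assign_partitions(rows, num_partitions, group_key="query_id"):
    """Place each row in a partition chosen from its group key, then sort by group.

    A stable hash (CRC32) is used so the assignment is the same on every run;
    Python's built-in hash() of strings is randomized per process.
    """
    partitions = defaultdict(list)
    for row in rows:
        digest = zlib.crc32(str(row[group_key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    # sorted() is stable, so rows of one group keep their original relative order.
    return {
        pid: sorted(part, key=lambda r: str(r[group_key]))
        for pid, part in partitions.items()
    }


# Hypothetical sample data: three query groups spread over five rows.
rows = [
    {"query_id": "q1", "label": 2},
    {"query_id": "q2", "label": 0},
    {"query_id": "q1", "label": 1},
    {"query_id": "q3", "label": 3},
    {"query_id": "q2", "label": 1},
]
partitions = assign_partitions(rows, num_partitions=2)
```

Because the partition is derived only from the group key, a group can never be split across partitions, at the price of moving rows between partitions (the shuffle mentioned above).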

imatiach-msft self-assigned this Jun 6, 2019
@imatiach-msft
Contributor

@daniloascione @daureg just out of curiosity, how are you computing the NDCG? I would like to add an evaluator for LightGBMRanker, similar to the Spark ML evaluators and MLlib metrics. Is there one out there already? I couldn't find anything in Spark ML.

@daniloascione

@imatiach-msft I tried to add ranking metrics to Spark ML in the past (apache/spark#16618 and https://issues.apache.org/jira/browse/SPARK-14409), but things got stuck for several reasons. Currently, we are using a UDF-based implementation of NDCG, which is similar to this one: http://lobotomys.blogspot.com/2016/08/normalised-discounted-cumulative-gain.html
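
For reference, NDCG@k for a single query can be sketched in plain Python as below. This is neither the UDF mentioned above nor the linked implementation; it assumes the common exponential gain 2^rel - 1 and a log2 position discount, and `dcg_at_k` and `ndcg_at_k` are illustrative names.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(labels, scores, k):
    """NDCG@k for one query: DCG of the score-induced ranking over the ideal DCG."""
    # Rank items by descending score; the sort is stable, so ties keep input order.
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    # Convention assumed here: a query with no relevant items scores 0.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

The per-query values would then be averaged over all queries in the test set. Using k = 10 would mirror the `setMaxPosition(10)` setting in the original post, though the thread does not say which cutoff was used for evaluation.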

@kbafna-antuit

kbafna-antuit commented Mar 28, 2020

@daniloascione @daureg I am facing a similar issue, where training the model on the same data with the same parameters results in different predictions each time.
Did you find a fix for this?

@daniloascione

@KeertiBafna No, I didn't find a fix, unfortunately. I haven't tried the "sort within partitions" idea yet (see above); maybe it is time to look at it.

@kbafna-antuit

@daniloascione Can I repartition by a key, as below?
For example, if I repartition my data into 8 partitions and add a column 'key' with values from 0 to 7, will the line below ensure that each partition gets the same key group and the same order every time?
df.repartition(8, 'key').sortWithinPartitions('order_col')

@daniloascione

Yes, I think so. The partitions should stay sorted at least until the next operation that involves a shuffle.
I recommend writing tests anyway.
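
Such a test could check two properties on the collected partitions: no key value appears in more than one partition, and the rows within each partition are in sorted order. The plain-Python sketch below assumes the partitions have already been gathered as lists of rows (for example with PySpark's `df.rdd.glom().collect()`); `check_partitions` is a hypothetical helper, and `key` and `order_col` mirror the column names from the question.

```python
def check_partitions(partitions, key="key", order_col="order_col"):
    """Return a list of problems found; an empty list means both properties hold."""
    problems = []
    owner = {}  # key value -> index of the first partition where it was seen
    for index, rows in enumerate(partitions):
        for row in rows:
            first = owner.setdefault(row[key], index)
            if first != index:
                problems.append(f"key {row[key]!r} is in partitions {first} and {index}")
        order = [row[order_col] for row in rows]
        if order != sorted(order):
            problems.append(f"partition {index} is not sorted by {order_col}")
    return problems


# Hypothetical data: `good` satisfies both properties, `bad` violates both.
good = [
    [{"key": 0, "order_col": 1}, {"key": 0, "order_col": 2}],
    [{"key": 1, "order_col": 1}, {"key": 1, "order_col": 5}],
]
bad = [
    [{"key": 0, "order_col": 2}, {"key": 0, "order_col": 1}],
    [{"key": 0, "order_col": 3}],
]
```

Running the check on the same input several times, and after any later transformation, would show whether the layout really is stable across runs.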

@daniloascione

@imatiach-msft is this issue solved in later versions? I believe you mentioned in another issue that you added a sortWithinPartitions to preserve the sorting.
