Training LightGBMRanker several times gives different NDCG on testing set #580
If the table is repartitioned to 1 partition (table(s"training").repartition(1)), then the results are consistent, but this means no parallelism.
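For context, a minimal sketch of that workaround (the table name "training" is taken from the snippet above; `spark` and `ranker` are placeholders for the session and an already-configured LightGBMRanker):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Collapse the training table to a single partition: every query group then lives
// on one LightGBM worker, which makes runs deterministic but removes parallelism.
// `ranker` stands for an already-configured LightGBMRanker instance (assumption).
val singlePartition = spark.table("training").repartition(1)
val model = ranker.fit(singlePartition)
```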
@daureg thank you for reporting this issue. This looks similar to the issue here:
Indeed, it's the same as #564 (unless there is something specific to the ranker, but most likely not). I will also try to predict several times with the same model, but for now these are different models trained on the same data.
@imatiach-msft maybe there is a need to ensure that each partition gets all the elements from the same group, and to enforce the group sorting by adding a sortWithinPartition here https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBase.scala#L45 (similarly to https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMRanker.scala#L67)
@daniloascione yes, that was something I was going to add later; I'm not sure if it should be a separate utility or if it should be done in the ranker itself (which may hurt performance significantly, since it would incur a shuffle across partitions). Note it wouldn't go into LightGBMBase, because that's the base class for the classifier and regressor as well, and this is needed just for the ranker. I sort in LightGBMRanker so that the groups are ordered, but I don't ensure that a group doesn't cross partitions; as you said, one group should only be in one partition in the ranker case. I'm not sure if it is related to your specific issue, though. Even if each group is kept within a single partition, you may still get different results from run to run, although at least the difference from model to model should be smaller.
@daniloascione @daureg just out of curiosity, how are you computing the NDCG? I would like to add an evaluator for LGBMRanker, similar to the Spark ML evaluators and MLLib metrics. Is there one that exists out there already? I couldn't find anything in Spark ML.
@imatiach-msft I tried to add ranking metrics to Spark ML in the past (apache/spark#16618 and https://issues.apache.org/jira/browse/SPARK-14409), but things got stuck for several reasons. Currently we are using a UDF-based implementation of NDCG, similar to this one: http://lobotomys.blogspot.com/2016/08/normalised-discounted-cumulative-gain.html
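For reference, a minimal sketch of such a UDF-based NDCG@k in Spark/Scala; the column names ("query_id", "label", "prediction"), the Double-typed label, and the cutoff k are assumptions, not values taken from this thread:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{avg, col, collect_list, struct, udf}

def ndcgAt(k: Int, scored: DataFrame): Double = {
  // DCG over relevance labels that are already in ranked order; assumes Double labels.
  val dcg: Seq[Double] => Double = labels =>
    labels.take(k).zipWithIndex.map { case (rel, i) =>
      (math.pow(2.0, rel) - 1.0) / (math.log(i + 2.0) / math.log(2.0))
    }.sum

  // Per-query NDCG: order labels by descending model score, compare to the ideal ordering.
  val ndcgUdf = udf { (pairs: Seq[Row]) =>
    val byScore = pairs.map(r => (r.getDouble(0), r.getDouble(1))).sortBy(-_._1).map(_._2)
    val idcg = dcg(byScore.sorted(Ordering[Double].reverse))
    if (idcg == 0.0) 0.0 else dcg(byScore) / idcg
  }

  // Collect (prediction, label) pairs per query, compute NDCG per query, average over queries.
  scored
    .groupBy("query_id")
    .agg(collect_list(struct(col("prediction"), col("label"))).as("pairs"))
    .select(avg(ndcgUdf(col("pairs"))).as("ndcg"))
    .head()
    .getDouble(0)
}
```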
@daniloascione @daureg I am facing a similar issue, where training the model on the same data with the same parameters results in different predictions each time.
@KeertiBafna No, I didn't find a fix, unfortunately. I haven't tried the idea to "sort within partitions" yet (see above); maybe it is time to look at this.
@daniloascione Can I use repartitioning by a key as below?
Yes, I think so; the partitions should stay sorted at least until the next operation that involves a shuffle.
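The snippet the question refers to is not preserved here; below is a minimal sketch of the approach being discussed, assuming a "query_id" group column and one partition per executor:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Hash-partition by the group key so all rows of a query group land in the same
// partition, then sort within each partition so a group's rows are contiguous.
// The "query_id" column and the partition count (5) are assumptions.
val partitioned = spark.table("training")
  .repartition(5, col("query_id"))
  .sortWithinPartitions("query_id")

// `ranker` stands for an already-configured LightGBMRanker instance (assumption).
val model = ranker.fit(partitioned)
```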
@imatiach-msft is this issue solved in later versions? I believe you mentioned in another issue that you added a sortWithinPartitions to preserve the sorting.
I noticed that when training several times on Databricks with the same parameters on the same data, the resulting models don't give the same predictions, as evidenced by different NDCG on a separate testing set.
Here is my training function; my training set has 400K examples in 5K lists, with 60 features:
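The reporter's training function is not preserved in this thread; a minimal sketch of what a LightGBMRanker training function could look like follows, with hypothetical column names and hyper-parameters (the package path also varies across mmlspark versions):

```scala
import org.apache.spark.sql.DataFrame
import com.microsoft.ml.spark.{LightGBMRanker, LightGBMRankerModel}

// Illustrative only: the column names and hyper-parameters are assumptions,
// not the reporter's actual settings.
def trainRanker(training: DataFrame): LightGBMRankerModel = {
  val ranker = new LightGBMRanker()
    .setLabelCol("label")        // graded relevance label
    .setFeaturesCol("features")  // 60-dimensional feature vector
    .setGroupCol("query_id")     // identifies the ~5K lists (query groups)
    .setNumIterations(200)
    .setLearningRate(0.1)
    .setNumLeaves(63)
  ranker.fit(training)
}
```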
Is that inherent to distributed training (on 5 executors), or should I change some parameters of my LightGBMRanker instance?