
LightGBM run twice with the same parameters, but got different results in validation #564

Closed
wxh001qq opened this issue May 9, 2019 · 12 comments

@wxh001qq

wxh001qq commented May 9, 2019

I ran LightGBM twice with the same parameters but got different results in validation. The only random-seed parameter I can find is baggingSeed, and even after fixing baggingSeed the problem still occurs. Should I fix any other parameters? Thanks.


@imatiach-msft

imatiach-msft commented May 9, 2019

@wxh001qq would you be able to send a sample? I believe that in the Spark distributed case, the order of the rows can differ from run to run; see https://issues.apache.org/jira/browse/SPARK-16207. For example, one of the comments from a Spark committer on that issue:

> "Generally, things like RDD and DataFrame don't guarantee any order at all, unless they are product of an ordering operation like sort. I don't think blogs/SO are relevant as much as Spark docs, and they do cover this in places"

@wxh001qq
Copy link
Author

The sample:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
from mmlspark import LightGBMClassifier  # import path for mmlspark 0.17-era releases

# 1. Load the data ('xxx/' is the poster's path placeholder)
trainDF = spark.read.format('csv').options(header='true', inferSchema='true').load('xxx/')

# 2. Assemble the feature columns into a single vector column
# (feature_cols: list of feature column names, defined elsewhere by the poster)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = assembler.transform(trainDF).select('features', col('Label').alias('label'))

# 3. Train the LightGBM classifier
classifier = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=70, maxDepth=10).fit(train_data)
```

And when we train XGBoost (ml.dmlc.xgboost4j on Spark) in the same way, the results are reproducible.

@kbafna-antuit

@imatiach-msft Hello, I am facing the same issue. Is there a way to fix the random_state, or any other parameter?

@imatiach-msft

@KeertiBafna what version of lightgbm are you using in Python? I wonder if it's a Python version difference. It may also be that the dataset coming in differs between Spark and pandas (e.g., the precision of the values may be different).

@kbafna-antuit

> @KeertiBafna what version of lightgbm are you using in Python? I wonder if it's a Python version difference. It may also be that the dataset coming in differs between Spark and pandas (e.g., the precision of the values may be different).

@imatiach-msft Thanks for the quick reply.
I am using mmlspark version 0.17 on Azure Databricks.

@kbafna-antuit

@imatiach-msft I am using mmlspark version 0.17; the workspace is Azure Databricks.
Even if I set the bagging seed to a constant and re-run the model, I get different accuracies on the test set each time. Is this inherent to Databricks because of the parallelism?
My objective is to tune the hyperparameters. Could you kindly suggest a way to do this?
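A sketch (not from the thread) of one way to do seeded tuning with Spark ML's CrossValidator. The `mmlspark` import path, the `baggingSeed` value, and the grid values are assumptions for illustration, and, per the rest of this thread, the underlying run-to-run nondeterminism can still make fold metrics noisy between runs.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from mmlspark import LightGBMClassifier  # 0.17-era import path (assumption)

lgb = LightGBMClassifier(baggingSeed=42)

# Illustrative grid over two hyperparameters.
grid = (ParamGridBuilder()
        .addGrid(lgb.numLeaves, [31, 70])
        .addGrid(lgb.learningRate, [0.05, 0.1])
        .build())

cv = CrossValidator(estimator=lgb,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol='label'),
                    numFolds=3,
                    seed=42)  # seeds the fold assignment, not LightGBM itself

cvModel = cv.fit(train_data)  # train_data as assembled in the sample above
```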

@kbafna-antuit

@imatiach-msft Any updates on this issue? How do I get consistent results using mmlspark 0.17?
I am still getting different results each time I run the model with the same parameters and data.
Thanks.

@LIkensust

I have the same question: same training dataset fed in a different row order, and the same test dataset, yet the NDCG differs. Does the input order matter?

@yangbingjiao

I have the same problem in version 0.18.1. I see there is only a parameter called baggingSeed, which seeds the bagging, so an equivalent of feature_fraction_seed may be missing. Is this issue resolved in the latest version?

@andrew-arkhipov

andrew-arkhipov commented Sep 8, 2021

Also facing this issue right now. I'm providing a specific train set and test set, and the test-set evaluation metric is different every time I train the model (LightGBM regressor). Any ideas, @imatiach-msft?

@shenglaiyin

In R, we have to call set.seed() before running machine learning algorithms, including GBM, because these algorithms involve stochastic processes.

@imatiach-msft

Closing, as LightGBM is now deterministic with merged PR #1387, as long as the seed and deterministic=True parameters are set.
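For reference, a sketch of what that resolution looks like in code. The `synapse.ml.lightgbm` import path is the post-rename package name and is an assumption here (older releases used `mmlspark`):

```python
from synapse.ml.lightgbm import LightGBMClassifier  # post-rename path (assumption)

classifier = LightGBMClassifier(
    learningRate=0.05,
    numIterations=100,
    numLeaves=70,
    maxDepth=10,
    seed=42,             # one seed covering LightGBM's RNG streams
    deterministic=True,  # required, per the merged PR, for reproducible runs
)
model = classifier.fit(train_data)
```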
