using same dataset and same parameters while trained two different model #1332

Closed
luoguohao opened this issue Jan 6, 2022 · 3 comments

luoguohao commented Jan 6, 2022

Describe the bug
When I run the unit tests on Linux, the Serialization Fuzzing test sometimes fails. The failure shows that predictions on the same training dataset can be inconsistent after the model is deserialized from disk. Why aren't the results the same?

To Reproduce
I also wrote some code to test this, and it also fails sometimes:

 test("Verify LightGBM Classifier output same model with same parameters") {
    val fileName = "PimaIndian.csv"
    val labelColumnName = "Diabetes mellitus"
    val fileLocation = DatasetUtils.binaryTrainFile(fileName).toString
    val dataset = readCSV(fileName, fileLocation).repartition(numPartitions)
    val featuresColumn = "_features"
    val rawPredCol = "rawPrediction"
    
    val featurizer = LightGBMUtils.featurizeData(dataset, labelColumnName, featuresColumn)
    val trainData = featurizer.transform(dataset).select(labelColumnName, featuresColumn)
    
    val lgbm1 = new LightGBMClassifier()
      .setLabelCol(labelColumnName)
      .setFeaturesCol(featuresColumn)
      .setRawPredictionCol(rawPredCol)
      .setDefaultListenPort(LightGBMConstants.DefaultLocalListenPort + portIndex)
      .setNumLeaves(5)
      .setNumIterations(10)
      .setObjective(binaryObjective)

    val lgbm2 = new LightGBMClassifier()
      .setLabelCol(labelColumnName)
      .setFeaturesCol(featuresColumn)
      .setRawPredictionCol(rawPredCol)
      .setDefaultListenPort(LightGBMConstants.DefaultLocalListenPort + portIndex)
      .setNumLeaves(5)
      .setNumIterations(10)
      .setObjective(binaryObjective)

    val model1 = lgbm1.fit(trainData)

    println("<=================== model ====================>")
    val model2 = lgbm2.fit(trainData)
    val df1 = model1.transform(trainData)
    val df2 = model2.transform(trainData)
    assertDFEq(df1, df2)
  }

Info (please complete the following information):

  • SynapseML Version: [v0.18]
  • Spark Version [2.4.3]
  • Spark Platform [Spark On Yarn]

AB#1751557

@luoguohao luoguohao changed the title using same dataset and same parameters while trained two different model unit test Serialization Fuzzing failed sometimes Jan 6, 2022
@luoguohao luoguohao changed the title unit test Serialization Fuzzing failed sometimes using same dataset and same parameters while trained two different model Jan 7, 2022
imatiach-msft (Contributor) commented Jan 12, 2022

@luoguohao the random seed is different, please see related issues:
#928
#997
#564
there are actually multiple parameters in lightgbm for this, deterministic and seed:
https://github.com/microsoft/LightGBM/blob/67b4205c8043326553e294fa2c01ad1189784631/docs/Parameters.rst#deterministic
https://github.com/microsoft/LightGBM/blob/67b4205c8043326553e294fa2c01ad1189784631/docs/Parameters.rst#seed
However, I have tried setting both in the distributed case and I still saw randomness, although it did seem to significantly reduce variance. So this is currently still an unsolved issue. When using a single partition (almost like a local Python LightGBM run), these parameters did make the model deterministic. So perhaps it's something about the syncing logic in the native distributed LightGBM code, or how we call it from SynapseML.

luoguohao (Author) commented

@imatiach-msft thanks for the reply. I also tried LightGBM in the distributed case, and it looks like I get the same result. I will keep an eye on it, thanks~

imatiach-msft (Contributor) commented

Closing, as LightGBM is now deterministic with merged PR #1387, as long as the seed and deterministic=True parameters are set.
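
For readers landing here, a rough sketch of what setting both parameters might look like from Scala. This is not taken from the thread: it assumes a SynapseML build that includes PR #1387, the `com.microsoft.azure.synapse.ml.lightgbm` package name, and that `LightGBMClassifier` exposes `setSeed` and `setDeterministic`. The helper name `deterministicClassifier` and the seed value 42 are hypothetical.

```scala
// Assumption: package name used by SynapseML releases after the rename from MMLSpark.
import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier

// Hypothetical helper: builds a classifier with the same tree settings as the
// repro above, plus the two LightGBM parameters discussed in this thread.
def deterministicClassifier(labelCol: String, featuresCol: String): LightGBMClassifier =
  new LightGBMClassifier()
    .setLabelCol(labelCol)
    .setFeaturesCol(featuresCol)
    .setNumLeaves(5)
    .setNumIterations(10)
    .setObjective("binary")
    .setSeed(42)             // any fixed value; it only has to match across runs
    .setDeterministic(true)  // LightGBM's `deterministic` parameter
```

With this in place, the repro above would build `lgbm1` and `lgbm2` from the same helper call, fit both on `trainData`, and compare the transformed DataFrames as before.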


2 participants