
deterministic and related flags don't guarantee same result on different machines #6683

Open
deltamacht opened this issue Oct 15, 2024 · 4 comments

@deltamacht

deltamacht commented Oct 15, 2024

I'm using LGBMRegressor through the scikit-learn API. The problem is that some models give me different results when calling .predict() in my Docker environment on my local Mac and in the same Docker environment on an AWS EC2 instance, despite the model using deterministic=True, force_row_wise=True, and num_threads=1.
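For reference, the estimator is configured roughly like this (simplified; the real setup has more parameters):

import lightgbm as lgb

# simplified sketch of the configuration described above
model = lgb.LGBMRegressor(
    deterministic=True,
    force_row_wise=True,
    num_threads=1,
)
# fitting and then calling model.predict() on the same data gives
# different outputs on the two machines described above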

First question: is it expected that, even with these flags set, results might differ on different machines? Under the deterministic section of the docs, I see the following bullet point:

when you use the different seeds, different LightGBM versions, the binaries compiled by different compilers, or in different systems, the results are expected to be different

This makes it seem like this may be expected behavior, although I had hoped that running in a Docker environment would allow for reproducible behavior. The problem, of course, is that as I'm writing tests for my code base, I can't guarantee that tests which pass locally on my computer (or elsewhere) will also pass in CI/CD. If this is expected behavior, how are people including LightGBM code in test suites that don't run on the same hardware?

If this is not expected behavior, is there a data or model setup that might not be covered by setting the flags this way? Ahead of the LGBMRegressor, I have a pipeline that applies various data transformations. Purely by guessing and checking, I found that removing a CyclicalFeatures (https://feature-engine.trainindata.com/en/1.7.x/api_doc/creation/CyclicalFeatures.html#feature_engine.creation.CyclicalFeatures) step from the pipeline gave me reproducible results between my local machine and the EC2 box. This transformation isn't doing anything stochastic; it simply maps a feature to sine and cosine representations. Is there a reason why mapping a feature into the -1 to 1 range would introduce non-deterministic behavior?
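For context, the transformation is essentially the following (my own simplified version, not the actual feature_engine code; the period of 24 is just an example):

import numpy as np

# e.g. an hour-of-day feature with period 24
x = np.array([0.0, 6.0, 12.0, 18.0, 23.0])
period = 24

x_sin = np.sin(2 * np.pi * x / period)  # values land in [-1, 1]
x_cos = np.cos(2 * np.pi * x / period)  # values land in [-1, 1]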

I have a minimal example which includes data, a saved pipeline, and a driver script. If useful, I could relabel the data to remove any sensitive information and provide it, provide a minimal working Docker environment, etc., but just wanted to ask the above questions first.

Thanks.

@jameslamb
Collaborator

jameslamb commented Oct 15, 2024

Thanks for using LightGBM.

This is similar to many other discussions here; you might find some of those useful.

We know this is an area of confusion with LightGBM. I have some ideas for improving that but haven't put them into writing yet, apologies. I'll try to help you here.

This is despite the model using deterministic=True, force_row_wise=True and num_threads=1

Those are not sufficient to make the training output deterministic. At a minimum, you should also set random_state to something other than 0.

Here's a minimal, reproducible example that produces identical results for me (Python 3.10, lightgbm built from source on the latest commit here) on my M2 Mac.

import lightgbm as lgb
from sklearn.datasets import make_regression

# synthetic regression data generated with a fixed seed
X, y = make_regression(n_samples=10_000, n_features=5, n_informative=5, random_state=123)

params = {
    "deterministic": True,
    "force_row_wise": True,
    "n_jobs": 1,
    "n_estimators": 10,
    "seed": 708
}

# train the same model three times on the same data and compare
# the serialized model strings
mod1 = lgb.LGBMRegressor(**params).fit(X, y)
mod1_str = mod1.booster_.model_to_string()

mod2 = lgb.LGBMRegressor(**params).fit(X, y)
mod2_str = mod2.booster_.model_to_string()

mod3 = lgb.LGBMRegressor(**params).fit(X, y)
mod3_str = mod3.booster_.model_to_string()

assert mod1_str == mod2_str
assert mod2_str == mod3_str

If you can modify that in a way that still shows some non-determinism, we'd be happy to investigate further.

running in a Docker environment would allow for reproducible behavior

  • are you sure you're getting identical versions of all libraries in both of those environments?
  • is that "Docker environment" using the same operating system as the EC2 instance?
  • do they both have the same CPU architecture (e.g. x86_64 vs. arm64)?

If the answer to any of those is "no", that could explain why you're seeing different results.
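One quick way to check, for example, is to run something like this in both environments and diff the output:

import platform
import sys

import lightgbm
import numpy as np
import sklearn

print("python       :", sys.version)
print("platform     :", platform.platform())  # OS details
print("architecture :", platform.machine())   # e.g. x86_64 vs. arm64
print("lightgbm     :", lightgbm.__version__)
print("numpy        :", np.__version__)
print("scikit-learn :", sklearn.__version__)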

Is there a reason why mapping a feature to the -1 to 1 range would introduce a behavior that would be non-deterministic?

It's possible.

There are multiple ways, but in general they could be summarized as "numerical precision".

Feature values that are distinct before the transformation could become identical once forced into the [-1, 1] range, due to limited floating-point precision. You could try checking the number of unique values before and after that transformation to see if that's happening with your dataset.
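For example, something along these lines (the feature values and period here are made up):

import numpy as np

# made-up stand-in for the raw feature column
rng = np.random.default_rng(708)
x = rng.uniform(0.0, 24.0, size=100_000)
period = 24  # the cycle length used for this feature

x_sin = np.sin(2 * np.pi * x / period)

# if the second count is lower, distinct raw values collapsed to the same float
print("unique before:", len(np.unique(x)))
print("unique after :", len(np.unique(x_sin)))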

Even if that cardinality isn't changed, using very small floating-point numbers can lead to non-deterministic results in any multi-threaded operations that involve multiplication and division. I know you said you're using deterministic=True and num_threads=1, which should remove most such operations, but you haven't told us, for example, whether you're using LightGBM on the GPU. If you are, that kind of parallel computation is going to happen, and the exact result will depend on the order in which different threads complete their work, which is non-deterministic.

@deltamacht
Author

James, thank you for your detailed response.

Those are not sufficient to make the training output deterministic. At a minimum, you should also set random_state to something other than 0.

Apologies for this omission in my original post. I also set random_state in these runs; I simply forgot to mention it because it was a given in my mind.

There are multiple ways, but in general they could be summarized as "numerical precision".

This currently seems like the most likely explanation to me. I've been experimenting with applying a "rounding" transformer that rounds the features after the cyclical transformation. My first test with it gave me identical results on both machines I'm testing on. We'll see whether I get the same results in the CI/CD pipeline run by GitHub.
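In case it's useful context, the rounding step is essentially just this (simplified; the number of decimals is illustrative):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# round the sine/cosine features to a fixed number of decimals
# before they reach the LGBMRegressor
rounder = FunctionTransformer(np.round, kw_args={"decimals": 6})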

Will post an update back here once I learn a bit more.

@jameslamb
Collaborator

edit: there was an important typo in my code sample above. I've corrected it, re-run the example, and confirmed that it still produces identical results across multiple runs.

Will post an update back here once I learn a bit more.

Ok sounds good.
