Seed value ignored? Reproducibility not given. #113

Closed
jakob-r opened this issue Nov 26, 2014 · 5 comments

Comments

@jakob-r
Contributor

jakob-r commented Nov 26, 2014

We observed different behaviors depending on the OS.
Given this code:

seed1 = 2
seed2 = 3
set.seed(1)
library(xgboost)
d = matrix(runif(1000), nrow = 200)
y = 3 + d %*% runif(ncol(d), min = 0.1, max = 5) + rnorm(nrow(d), sd = 10)
d = xgb.DMatrix(data = d, label = y)
set.seed(seed1)
m = xgb.train(data = d, nrounds = 3, objective = "reg:linear", lambda = 0.4, alpha = 0.2, booster = "gblinear")
p1 = predict(m, newdata = d)
set.seed(seed2)
m = xgb.train(data = d, nrounds = 3, objective = "reg:linear", lambda = 0.4, alpha = 0.2, booster = "gblinear")
p2 = predict(m, newdata = d)
all(p1 == p2)

On OS X this is always TRUE, even if seed1 and seed2 are different.
On Linux this is always FALSE, even if seed1 and seed2 are the same.

On Windows, in a test I ran some time ago, the behavior seemed to be the same as on OS X.

@tqchen
Member

tqchen commented Nov 26, 2014

This is not a random seed problem. gblinear uses multi-threaded coordinate descent, and for efficiency each thread eagerly updates the parameters without syncing with the others. The difference is due to the order in which the threads run. Setting nthread=1 will disable this behavior.

For gbtree, most updates are synced, so there is usually no such behavior.
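
To illustrate the point about running order, here is a minimal sketch in plain Python. It is not xgboost's implementation: `coordinate_descent` is a hypothetical helper that runs ordinary least-squares coordinate descent sequentially, and the `order` argument stands in for the order in which threads happen to apply their eager updates. The data only loosely mirrors the example above.

```python
import random


def coordinate_descent(X, y, order, rounds=3):
    """Least-squares coordinate descent with immediate (unsynced) updates.

    `order` is the sequence in which coordinates are visited in each round.
    Hypothetical helper for illustration only, not xgboost code.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - X w, since w starts at zero
    for _ in range(rounds):
        for j in order:
            col = [X[i][j] for i in range(n)]
            denom = sum(c * c for c in col)
            # Exact minimizer along coordinate j given the current residual.
            step = sum(c * r for c, r in zip(col, residual)) / denom
            w[j] += step
            # Apply the update right away, so later coordinates see it.
            residual = [r - step * c for r, c in zip(residual, col)]
    return w


rng = random.Random(1)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [3 + sum(row) + rng.gauss(0, 10) for row in X]

w_a = coordinate_descent(X, y, order=[0, 1, 2, 3, 4])
w_b = coordinate_descent(X, y, order=[4, 3, 2, 1, 0])
w_c = coordinate_descent(X, y, order=[0, 1, 2, 3, 4])

print("same order, same weights:     ", w_a == w_c)
print("different order, same weights:", w_a == w_b)
```

With a fixed order the weights are reproduced exactly, while a different visiting order gives different weights after the same number of rounds, because each update changes the residual that the next coordinate sees. No random numbers are drawn during training in this sketch, so a seed would have no effect on it either way.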

@tqchen
Member

tqchen commented Dec 10, 2014

@jakob-r As explained, this was not a seed problem but undetermined behavior in multi-threading. I think this issue can be closed.

@tqchen tqchen closed this as completed Dec 10, 2014
@dataforager

I don't think multi-threading is the cause. For the results below, seeds were set either by passing the 'seed' param to xgb.train or by using R's set.seed() function.

I independently verified, by inspecting .Random.seed after calling set.seed(), that the seed was indeed being changed.

nthread was set to 1 for both training/test runs with the same model parameters. Predicted probabilities appear below and are identical for two different seed values.

Predicted probabilities (seed=1):
0.4745588005
0.9879690409
0.5989014506
0.9906733632
0.5989014506
0.9928959012
0.1146880165
0.9928619266
0.9917168021
0.9958292842

Predicted probabilities (seed = 2):
0.4745588005
0.9879690409
0.5989014506
0.9906733632
0.5989014506
0.9928959012
0.1146880165
0.9928619266
0.9917168021
0.9958292842

@dataforager

I can confirm the exact same behavior (with the exact same parameter settings) in the Python version, though the predicted probabilities differ from the R ones.

Predicted probabilities (seed = 1):
0.141121
0.98446
0.805141
0.949947
0.805141
0.979856
0.511622
0.990588
0.985136
0.987054

Predicted probabilities (seed = 2):
0.141121
0.98446
0.805141
0.949947
0.805141
0.979856
0.511622
0.990588
0.985136
0.987054

hcho3 pushed a commit to hcho3/xgboost that referenced this issue May 9, 2018
[TRACKER] refactor tracker
@vaughnkoch

I'm seeing this behavior as well on macOS, using nthread=1 and different values of seed, including seed=0, which according to the docs should disable a specific seed. The outputs from the engine are unchanged.

Versions:
macOS 10.12.6
Python 3.6.4
xgboost==0.71

@lock lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018