Overhaul all notebooks #9
Conversation
3a320f1 to ac38224 (compare)
Note for reviewer: I unified the constant of make_cv_splits from num_folds / n_splits to n_folds / N_FOLDS.
@crusaderky saw your ping on this. I might be able to take a look later this week.
@ncclementi I understand Paul is rather busy now; could you pick this up instead?
Overall this LGTM. I left some comments and questions to address, but I do not think they are blockers to merging this. That being said, I think notebook 4 is quite advanced for the average user; I like having it handy, but I would not put it in front of users right now.
One small detail I noticed is that the environment file is large and probably includes multiple unnecessary packages. @crusaderky if you have a better version, it would probably be better to include it.
"outputs": [], | ||
"source": [ | ||
"ddf = dd.read_parquet(\n", | ||
" \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/processed_data.parquet\",\n", |
I'm not sure what the intent of this notebook is, but have we checked that this is publicly readable? The last time I checked it wasn't.
I have not checked. Can we keep the action on that separate from this PR though?
@@ -41,8 +40,18 @@
"source": [
"# Coiled account\n",
"ACCOUNT = \"dask-engineering\"\n",
"# Location of the feature table\n",
"FILEPATH = \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/feature_table.parquet\""
"# Location of feature table\n",
I'm not sure I like the account being mentioned here (and everywhere else), but that's more of a nitpick/comment.
"# Here we subset data for cross-validation\n", | ||
"def make_cv_splits(\n", | ||
" n_folds: int = N_FOLDS,\n", | ||
") -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:\n", |
I personally like being explicit about typing like this, but I wonder if DS people would understand it. Just a comment.
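For context, a minimal sketch of what a fully typed splitter along these lines might look like; the ddf argument and the fold logic here are illustrative, not the notebook's exact implementation:

    from collections.abc import Iterator

    import dask.dataframe as dd

    N_FOLDS = 5  # illustrative default


    def make_cv_splits(
        ddf: dd.DataFrame,
        n_folds: int = N_FOLDS,
    ) -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:
        """Yield (train, test) pairs of dask DataFrames, one per fold."""
        fractions = [1 / n_folds] * n_folds
        pieces = ddf.random_split(fractions, shuffle=True)
        for i in range(n_folds):
            train = dd.concat([p for j, p in enumerate(pieces) if j != i])
            test = pieces[i]
            yield train, test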
" predictions = xgb.dask.predict(client, model, X_test)\n", | ||
" print(\"Training model...\")\n", | ||
" with warnings.catch_warnings():\n", | ||
" warnings.simplefilter(\"ignore\", category=RuntimeWarning)\n", |
What's the purpose of catching this warning? Why does this warning pop up?
The warning is raised by xgboost, which is out of our control. I could of course just let it show up; it would make the code slightly more readable and the output slightly more alarming.
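For reference, the suppression is just the standard-library context manager around the training call; a rough sketch follows, where client, params, and d_train are assumed from the surrounding notebook:

    import warnings

    import xgboost as xgb

    with warnings.catch_warnings():
        # xgboost emits a RuntimeWarning we cannot act on; ignore it only
        # inside this block so other warnings still surface normally.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        model = xgb.dask.train(client, params, d_train)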
" -1,\n", | ||
" ),\n", | ||
" predictions.to_dask_array(lengths=True),\n", | ||
" y_test.to_dask_array(),\n", |
In the original notebook we were computing the lengths and reshaping the array; I remember Greg saying this was necessary. What changed so that we no longer need this?
I don't know why Greg said that; the output of the function is the same with and without those extra steps.
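As a hypothetical check of that claim (not part of the PR; predictions and y_test are assumed to be aligned dask Series as in the notebook), both forms should produce the same score:

    from dask_ml.metrics import mean_squared_error

    # Old approach: compute chunk lengths and reshape before scoring.
    rmse_old = mean_squared_error(
        y_test.to_dask_array(lengths=True).reshape(-1),
        predictions.to_dask_array(lengths=True),
        squared=False,
    )
    # New approach: pass the arrays straight through.
    rmse_new = mean_squared_error(
        y_test.to_dask_array(),
        predictions.to_dask_array(),
        squared=False,
    )
    print(rmse_old, rmse_new)  # expected to match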
Modeling 3 - Parallel HPO of XGBoost with Optuna and Dask (multi cluster).ipynb (outdated; conversation resolved)
"# Create a single study\n", | ||
"start = datetime.now()\n", | ||
"study = optuna.create_study(study_name=\"parallel-nyc-travel-time-model\")\n", | ||
"study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)\n", |
If I understand correctly, instead of using the thread pool executor, now each N_JOB creates a cluster and runs the 5 trials in that cluster.
This starts a thread pool executor with N_JOB threads, each creating a cluster, which then runs 5 trials.
This is exactly like before. The only difference is that it is now done internally by optuna.
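A rough sketch of that flow, with the training helper, cluster size, and hyperparameters as stand-ins rather than the notebook's exact code:

    import threading

    import coiled
    import optuna
    from dask.distributed import Client

    N_TRIALS = 20  # illustrative
    N_JOBS = 4     # optuna runs the objective from a ThreadPoolExecutor of this size

    _local = threading.local()


    def get_client() -> Client:
        # One cluster per optuna worker thread, reused across that thread's trials.
        if not hasattr(_local, "client"):
            cluster = coiled.Cluster(n_workers=10)
            _local.client = Client(cluster)
        return _local.client


    def objective(trial: optuna.Trial) -> float:
        params = {
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        }
        return train_and_score(get_client(), params)  # hypothetical helper


    study = optuna.create_study(study_name="parallel-nyc-travel-time-model")
    # With N_TRIALS=20 and N_JOBS=4, each thread (and hence each cluster)
    # ends up running roughly 5 trials.
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)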
" score = mean_squared_error(\n", | ||
" y_test.to_dask_array(),\n", | ||
" predictions.to_dask_array(),\n", | ||
" squared=False,\n", |
Do we need a compute=False here or not in this case?
No. It would be a very big mistake to put compute=False because that would cause a task to return a Future - which sometimes works thanks to race conditions but will randomly cause your computation to remain stuck.
XREF dask/distributed#5671. I've added a comment to clarify.
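To illustrate the pitfall with a toy sketch (score_task, y_test, and predictions are hypothetical names, not the notebook's code):

    from dask_ml.metrics import mean_squared_error


    def score_task(y_test, predictions) -> float:
        """Hypothetical scoring step that runs inside a dask task."""
        # compute=True (the default) blocks until the metric is a plain float,
        # which is what a task must return. Passing compute=False would hand
        # back a lazy object whose resolution from inside a task can get stuck
        # (see dask/distributed#5671).
        return mean_squared_error(
            y_test.to_dask_array(),
            predictions.to_dask_array(),
            squared=False,
        )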
" sem: distributed.Semaphore,\n", | ||
" study_params: dict[str, float],\n", | ||
"):\n", | ||
" distributed.secede()\n", |
Are this and the semaphores a workaround for the blocking caused by xgboost?
Do we have issues that we could link?
The semaphore is there because the dask scheduler will reach 100% CPU and slow down to a crawl if you feed it 200k+ tasks all at once. This is one of the biggest showstoppers for reaching terabyte scale.
So we need to feed it enough tasks to saturate the cluster, but not so many that it drowns in them.
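A minimal sketch of that throttling pattern, assuming a hypothetical run_trial body and a stand-in local cluster (details differ from the notebook):

    import distributed
    from dask.distributed import Client

    client = Client()  # stand-in for the Coiled cluster client

    # At most 10 trials hold a lease at a time, so only a bounded slice of
    # the overall work is expanded into tasks on the scheduler at once.
    sem = distributed.Semaphore(max_leases=10, name="trial-throttle")


    def run_trial(sem: distributed.Semaphore, params: dict) -> float:
        with sem:  # blocks until a lease is free
            return train_and_score(params)  # hypothetical helper


    futures = [client.submit(run_trial, sem, {"seed": i}, pure=False) for i in range(200)]
    scores = client.gather(futures)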
The secede() is because both DaskDMatrix.__init__ and train call compute() internally, which blocks. If this were a proper non-blocking system that just creates dask collections (like dask/dask does), we'd still need to call:

    score = ...  # build dask graph
    distributed.secede()
    score = score.compute()
    distributed.rejoin()
    return score
The proper way around it is, again, tasks from tasks.
Another way around it is publish_dataset, which however will give you a cluster-wide leak if for any reason the waiting task dies.
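Put together, a rough sketch of the task body; the names and exact xgboost calls are illustrative, and the point is only the secede/rejoin bracket around the blocking calls:

    import distributed
    import xgboost as xgb


    def train_model(sem, study_params, X_train, y_train):
        """Hypothetical task body: xgboost's dask API blocks internally
        (DaskDMatrix.__init__ and train both call compute), so the task
        secedes from the worker's thread pool while it waits."""
        client = distributed.get_client()
        with sem:
            distributed.secede()      # free the worker thread for other tasks
            try:
                d_train = xgb.dask.DaskDMatrix(client, X_train, y_train)
                result = xgb.dask.train(client, study_params, d_train)
            finally:
                distributed.rejoin()  # reclaim a thread before returning
        return result["booster"]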
Reduce all cluster sizes.
Reduce end-to-end runtime of feature engineering from 25 to 3 minutes.
Optimize and clean up training notebooks.
Make all notebooks stable.
Add new notebook "Modelling 4", which is a variant of Modelling 3 but without the hack of spawning multiple clusters. However, it's not very performant for now (xgboost.dask's fault); it needs major intervention upstream.