
Overhaul all notebooks #9
Merged: 2 commits into coiled:main from feature_eng, Feb 6, 2023
Conversation

@crusaderky (Contributor) commented Jan 23, 2023:

Reduce all cluster sizes.
Reduce end-to-end runtime of feature engineering from 25 to 3 minutes.
Optimize and clean up training notebooks.
Make all notebooks stable.

Add a new notebook, "Modelling 4", which is a variant of Modelling 3 without the hack of spawning multiple clusters. However, it's not very performant for now (xgboost.dask's fault); it needs major intervention upstream.

@crusaderky crusaderky changed the title Overhaul Feature Engineering Overhaul all notebooks Jan 31, 2023
@crusaderky crusaderky self-assigned this Jan 31, 2023
@crusaderky crusaderky requested review from hayesgb and phobson January 31, 2023 16:26
@crusaderky (Contributor, Author):

Note for reviewers: I unified the constants num_folds / n_splits of make_cv_splits into a single N_FOLDS. Unsure if the name is appropriate or if there's a better one.
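For context, a minimal sketch of the unified constant and the signature it feeds (the value 5 is illustrative, not necessarily what the notebook uses):

from collections.abc import Iterator

import dask.dataframe as dd

N_FOLDS = 5  # single constant replacing the mixed num_folds / n_splits

def make_cv_splits(
    n_folds: int = N_FOLDS,
) -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:
    # yield (train, test) pairs of the feature table, one per fold
    ...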

@phobson commented Jan 31, 2023:

@crusaderky saw your ping on this. I might be able to take a look later this week.

@crusaderky (Contributor, Author):

@ncclementi I understand Paul is rather busy now; could you pick this up instead?

@ncclementi (Contributor) left a comment:

Overall this LGTM; I left some comments and questions to address, but I don't think they are blockers to merging. That said, I think notebook 4 is quite advanced for the average user; I like having it handy, but I would not put it in front of users right now.

One small detail I noticed: the environment file is big and probably contains multiple unnecessary packages. @crusaderky, if you have a better version, it would probably be better to include it.

"outputs": [],
"source": [
"ddf = dd.read_parquet(\n",
" \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/processed_data.parquet\",\n",
Contributor:

I'm not sure what the intent of this notebook is, but have we checked that this is publicly readable? The last time I checked, it wasn't.

Contributor Author (@crusaderky):

I have not checked. Can we keep the action on that separate from this PR though?

@@ -41,8 +40,18 @@
"source": [
"# Coiled account\n",
"ACCOUNT = \"dask-engineering\"\n",
"# Location of the feature table\n",
"FILEPATH = \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/feature_table.parquet\""
"# Location of feature table\n",
Contributor:

I'm not sure I like the account being mentioned here (and everywhere), but that's more of a nitpick/comment.

"# Here we subset data for cross-validation\n",
"def make_cv_splits(\n",
" n_folds: int = N_FOLDS,\n",
") -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:\n",
Contributor:

I personally like being explicit about typing like this, but I wonder if data science folks would understand it. Just a comment.

" predictions = xgb.dask.predict(client, model, X_test)\n",
" print(\"Training model...\")\n",
" with warnings.catch_warnings():\n",
" warnings.simplefilter(\"ignore\", category=RuntimeWarning)\n",
Contributor:

What's the purpose of catching this warning? Why does this warning pop up?

Contributor Author (@crusaderky):

The warning is raised by xgboost, which is out of our control. I could of course just let it show up; that would make the code slightly more readable and the output slightly more alarming.
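For reference, the suppression pattern under discussion looks roughly like this (params and dtrain are placeholders for the notebook's actual objects):

import warnings

import xgboost as xgb

with warnings.catch_warnings():
    # xgboost.dask emits a RuntimeWarning we cannot act on;
    # silence it so the notebook output stays clean
    warnings.simplefilter("ignore", category=RuntimeWarning)
    output = xgb.dask.train(client, params, dtrain)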

" -1,\n",
" ),\n",
" predictions.to_dask_array(lengths=True),\n",
" y_test.to_dask_array(),\n",
Contributor:

In the original notebook we were computing the lengths and reshaping the array; I remember Greg saying this was necessary. What changed so that we don't need this anymore?

Contributor Author (@crusaderky):

I don't know why Greg said that; the output of the function is the same with and without those extra steps.
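For anyone comparing the two versions, a hedged sketch of the difference (variable names as in the diff above):

# lengths=True computes each partition's length up front, so the
# resulting dask array has known chunk sizes along axis 0
arr_known = predictions.to_dask_array(lengths=True)

# without it, chunk sizes stay unknown (NaN); elementwise metrics
# still work, but shape-dependent operations such as reshape do not
arr_unknown = y_test.to_dask_array()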

"# Create a single study\n",
"start = datetime.now()\n",
"study = optuna.create_study(study_name=\"parallel-nyc-travel-time-model\")\n",
"study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)\n",
Contributor:

If I understand correctly, instead of using the thread pool executor, each of the N_JOBS now creates a cluster and runs its 5 trials on that cluster.

Contributor Author (@crusaderky) commented Feb 6, 2023:

This starts a thread pool executor with N_JOBS threads, each creating a cluster, which then runs 5 trials. This is exactly like before; the only difference is that it is now done internally by optuna.
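In other words, the parallelism now lives inside optuna. Annotated, the call from the diff above reads roughly as follows (comments reflect the explanation here, not verified optuna internals):

import optuna

study = optuna.create_study(study_name="parallel-nyc-travel-time-model")
# optuna starts a thread pool with n_jobs threads; each thread calls
# objective(), which creates its own cluster and runs its trials there,
# matching the previous hand-rolled ThreadPoolExecutor behavior
study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)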

" score = mean_squared_error(\n",
" y_test.to_dask_array(),\n",
" predictions.to_dask_array(),\n",
" squared=False,\n",
Contributor:

Do we need compute=False here or not in this case?

Contributor Author (@crusaderky):

No. It would be a very big mistake to put compute=False, because that would cause a task to return a Future, which sometimes works thanks to race conditions but will randomly cause your computation to remain stuck.
XREF dask/distributed#5671. I've added a comment to clarify.
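To make the pitfall concrete, a sketch assuming the metric is dask_ml.metrics.mean_squared_error (which accepts both squared and compute keywords):

from dask_ml.metrics import mean_squared_error

def score_fold(y_test, predictions):
    # the default compute=True blocks and returns a plain float
    score = mean_squared_error(
        y_test.to_dask_array(),
        predictions.to_dask_array(),
        squared=False,
    )
    # compute=False would return a lazy object instead; returning that
    # (effectively a Future) from a task running on a worker can randomly
    # deadlock, see dask/distributed#5671
    return score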

" sem: distributed.Semaphore,\n",
" study_params: dict[str, float],\n",
"):\n",
" distributed.secede()\n",
Contributor:

Is this, together with the semaphores, a workaround for the blocking caused by xgboost?
Do we have issues that we could link?

Contributor Author (@crusaderky) commented Feb 6, 2023:

The semaphore is there because the dask scheduler will reach 100% CPU and slow to a crawl if you feed it 200k+ tasks all at once; this is one of the biggest showstoppers for reaching terabyte scale.
So we need to feed it enough tasks to saturate the cluster, but not so many that it drowns in them.
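A minimal sketch of the throttling idea (the max_leases value is illustrative):

from distributed import Semaphore

# let only a few folds submit work at once, so the scheduler never
# sees the full 200k+ task graph in one go
sem = Semaphore(max_leases=4, name="training-throttle")

def train_fold(fold):
    with sem:  # blocks until a lease is free
        ...    # build and compute this fold's graph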

The secede() is because both DaskDMatrix.__init__ and train call compute() internally, which blocks. Even if this were a proper non-blocking system that just created dask collections (like dask/dask does), we'd still need to call:

score = ...  # build the dask graph (lazy)
distributed.secede()
score = score.compute()
distributed.rejoin()
return score

The proper way around it is, again, tasks from tasks.
Another way around it is publish_dataset, which however will give you a cluster-wide leak if the waiting task dies for any reason.
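For completeness, distributed.worker_client wraps the same secede/rejoin dance in a context manager; a minimal sketch (build_graph is a hypothetical helper that returns a lazy dask collection):

from distributed import worker_client

def train_task(params):
    # secedes from the worker's thread pool on entry and rejoins on
    # exit, so blocking on the inner computation doesn't starve the worker
    with worker_client() as client:
        score = client.compute(build_graph(params)).result()
    return score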

@ncclementi ncclementi merged commit 2ad31b4 into coiled:main Feb 6, 2023
@crusaderky crusaderky deleted the feature_eng branch February 7, 2023 15:20