
Overhaul all notebooks #9
Merged: 2 commits into coiled:main from feature_eng, Feb 6, 2023
Conversation

@crusaderky (Contributor) commented Jan 23, 2023:

Reduce all cluster sizes.
Reduce end-to-end runtime of feature engineering from 25 to 3 minutes.
Optimize and clean up training notebooks.
Make all notebooks stable.

Add a new notebook, "Modelling 4", which is a variant of Modelling 3 without the hack of spawning multiple clusters. However, it's not very performant for now (xgboost.dask's fault); it needs major intervention upstream.

@crusaderky crusaderky changed the title Overhaul Feature Engineering Overhaul all notebooks Jan 31, 2023
@crusaderky crusaderky self-assigned this Jan 31, 2023
@crusaderky crusaderky requested review from hayesgb and phobson January 31, 2023 16:26
@crusaderky (Contributor, Author):

Note for reviewers: I unified the constants num_folds / n_splits of make_cv_splits into a single N_FOLDS. Unsure if the name is appropriate or if there's a better one.
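For context, a minimal sketch of the unified constant and the signature it feeds (the value 5 is illustrative, not necessarily what the notebook uses):

from collections.abc import Iterator

import dask.dataframe as dd

N_FOLDS = 5  # single constant replacing the mixed num_folds / n_splits

def make_cv_splits(
    n_folds: int = N_FOLDS,
) -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:
    # yield (train, test) pairs of the feature table, one per fold
    ...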

@phobson commented Jan 31, 2023:

@crusaderky saw your ping on this. I might be able to take a look later this week.

@crusaderky (Contributor, Author):

@ncclementi I understand Paul is rather busy now; could you pick this up instead?

@ncclementi (Contributor) left a comment:

Overall this LGTM; I left some comments and questions to address, but I don't think they are blockers to merging. That said, I think notebook 4 is quite advanced for the average user; I like having it handy, but I would not put it in front of users right now.

One small detail I noticed: the environment file is big and probably contains multiple unnecessary packages. @crusaderky, if you have a better version, it would probably be better to include it.

"outputs": [],
"source": [
"ddf = dd.read_parquet(\n",
" \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/processed_data.parquet\",\n",
Contributor:

I'm not sure what the intent of this notebook is, but have we checked that this is publicly readable? The last time I checked, it wasn't.

Contributor Author (@crusaderky):

I have not checked. Can we keep the action on that separate from this PR though?

@@ -41,8 +40,18 @@
"source": [
"# Coiled account\n",
"ACCOUNT = \"dask-engineering\"\n",
"# Location of the feature table\n",
"FILEPATH = \"s3://coiled-datasets/prefect-dask/nyc-uber-lyft/feature_table.parquet\""
"# Location of feature table\n",
Contributor:

I'm not sure I like the account being mentioned here (and everywhere), but that's more of a nitpick/comment.

"# Here we subset data for cross-validation\n",
"def make_cv_splits(\n",
" n_folds: int = N_FOLDS,\n",
") -> Iterator[tuple[dd.DataFrame, dd.DataFrame]]:\n",
Contributor:

I personally like being explicit about typing like this, but I wonder if data science folks would understand it. Just a comment.

" predictions = xgb.dask.predict(client, model, X_test)\n",
" print(\"Training model...\")\n",
" with warnings.catch_warnings():\n",
" warnings.simplefilter(\"ignore\", category=RuntimeWarning)\n",
Contributor:

What's the purpose of catching this warning? Why does this warning pop up?

Contributor Author (@crusaderky):

The warning is raised by xgboost, which is out of our control. I could of course just let it show up; that would make the code slightly more readable and the output slightly more alarming.
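For reference, the suppression pattern under discussion looks roughly like this (params and dtrain are placeholders for the notebook's actual objects):

import warnings

import xgboost as xgb

with warnings.catch_warnings():
    # xgboost.dask emits a RuntimeWarning we cannot act on;
    # silence it so the notebook output stays clean
    warnings.simplefilter("ignore", category=RuntimeWarning)
    output = xgb.dask.train(client, params, dtrain)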

" -1,\n",
" ),\n",
" predictions.to_dask_array(lengths=True),\n",
" y_test.to_dask_array(),\n",
Contributor:

In the original notebook we were computing the lengths and reshaping the array; I remember Greg saying this was necessary. What changed so that we don't need this anymore?

Contributor Author (@crusaderky):

I don't know why Greg said that; the output of the function is the same with and without those extra steps.
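For anyone comparing the two versions, a hedged sketch of the difference (variable names as in the diff above):

# lengths=True computes each partition's length up front, so the
# resulting dask array has known chunk sizes along axis 0
arr_known = predictions.to_dask_array(lengths=True)

# without it, chunk sizes stay unknown (NaN); elementwise metrics
# still work, but shape-dependent operations such as reshape do not
arr_unknown = y_test.to_dask_array()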

"# Create a single study\n",
"start = datetime.now()\n",
"study = optuna.create_study(study_name=\"parallel-nyc-travel-time-model\")\n",
"study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)\n",
Contributor:

If I understand correctly, instead of using the thread pool executor, each of the N_JOBS now creates a cluster and runs its 5 trials on that cluster.

Contributor Author (@crusaderky) commented Feb 6, 2023:

This starts a thread pool executor with N_JOBS threads, each creating a cluster, which then runs 5 trials. This is exactly like before; the only difference is that it is now done internally by optuna.
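In other words, the parallelism now lives inside optuna. Annotated, the call from the diff above reads roughly as follows (comments reflect the explanation here, not verified optuna internals):

import optuna

study = optuna.create_study(study_name="parallel-nyc-travel-time-model")
# optuna starts a thread pool with n_jobs threads; each thread calls
# objective(), which creates its own cluster and runs its trials there,
# matching the previous hand-rolled ThreadPoolExecutor behavior
study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)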

" score = mean_squared_error(\n",
" y_test.to_dask_array(),\n",
" predictions.to_dask_array(),\n",
" squared=False,\n",
Contributor:

Do we need compute=False here or not in this case?

Contributor Author (@crusaderky):

No. It would be a very big mistake to put compute=False, because that would cause a task to return a Future, which sometimes works thanks to race conditions but will randomly cause your computation to remain stuck.
XREF dask/distributed#5671. I've added a comment to clarify.
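To make the pitfall concrete, a sketch assuming the metric is dask_ml.metrics.mean_squared_error (which accepts both squared and compute keywords):

from dask_ml.metrics import mean_squared_error

def score_fold(y_test, predictions):
    # the default compute=True blocks and returns a plain float
    score = mean_squared_error(
        y_test.to_dask_array(),
        predictions.to_dask_array(),
        squared=False,
    )
    # compute=False would return a lazy object instead; returning that
    # (effectively a Future) from a task running on a worker can randomly
    # deadlock, see dask/distributed#5671
    return score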

" sem: distributed.Semaphore,\n",
" study_params: dict[str, float],\n",
"):\n",
" distributed.secede()\n",
Contributor:

Is this, together with the semaphores, a workaround for the blocking caused by xgboost?
Do we have issues that we could link?

Contributor Author (@crusaderky) commented Feb 6, 2023:

The semaphore is there because the dask scheduler will reach 100% CPU and slow to a crawl if you feed it 200k+ tasks all at once; this is one of the biggest showstoppers for reaching terabyte scale.
So we need to feed it enough tasks to saturate the cluster, but not so many that it drowns in them.
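A minimal sketch of the throttling idea (the max_leases value is illustrative):

from distributed import Semaphore

# let only a few folds submit work at once, so the scheduler never
# sees the full 200k+ task graph in one go
sem = Semaphore(max_leases=4, name="training-throttle")

def train_fold(fold):
    with sem:  # blocks until a lease is free
        ...    # build and compute this fold's graph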

The secede() is because both DaskDMatrix.__init__ and train call compute() internally, which blocks. Even if this were a proper non-blocking system that just created dask collections (like dask/dask does), we'd still need to call:

score = ...  # build the dask graph (lazy)
distributed.secede()
score = score.compute()
distributed.rejoin()
return score

The proper way around it is, again, tasks from tasks.
Another way around it is publish_dataset, which however will give you a cluster-wide leak if the waiting task dies for any reason.
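For completeness, distributed.worker_client wraps the same secede/rejoin dance in a context manager; a minimal sketch (build_graph is a hypothetical helper that returns a lazy dask collection):

from distributed import worker_client

def train_task(params):
    # secedes from the worker's thread pool on entry and rejoins on
    # exit, so blocking on the inner computation doesn't starve the worker
    with worker_client() as client:
        score = client.compute(build_graph(params)).result()
    return score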

@ncclementi ncclementi merged commit 2ad31b4 into coiled:main Feb 6, 2023
@crusaderky crusaderky deleted the feature_eng branch February 7, 2023 15:20