Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise 10 minutes notebook. #10738

Merged
merged 6 commits into from
May 13, 2022
Merged

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Apr 26, 2022

Follow-up from #10685 to fix deprecation warnings in the 10 minute notebook.

Fixes: #10613
Changes:

  • Fixed deprecation warning for Series.applymap ➡️ Series.apply
  • Removed two cells demonstrating Series.append. This has also been removed from the Pandas 10 minute notebook because the feature is deprecated.
  • Refactored ORC file path logic to be a bit simpler

@bdice bdice added doc Documentation non-breaking Non-breaking change labels Apr 26, 2022
@bdice bdice self-assigned this Apr 26, 2022
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@bdice bdice added the Python Affects Python cuDF API. label Apr 26, 2022
@bdice bdice requested review from shwina and mmccarty April 26, 2022 15:56
@codecov
Copy link

codecov bot commented Apr 26, 2022

Codecov Report

Merging #10738 (49f60ae) into branch-22.06 (e0d94f3) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.06   #10738      +/-   ##
================================================
+ Coverage         86.28%   86.32%   +0.03%     
================================================
  Files               144      144              
  Lines             22654    22656       +2     
================================================
+ Hits              19548    19558      +10     
+ Misses             3106     3098       -8     
Impacted Files Coverage Δ
python/cudf/cudf/core/resample.py 88.97% <ø> (ø)
python/dask_cudf/dask_cudf/groupby.py 97.42% <100.00%> (+0.01%) ⬆️
python/dask_cudf/dask_cudf/tests/test_groupby.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/dask_cudf/dask_cudf/core.py 73.62% <0.00%> (+0.26%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ad1e51...49f60ae. Read the comment docs.

"+-----------------------------------------------------------------------------+\n"
"Tue Apr 26 10:47:09 2022 \r\n",
"+-----------------------------------------------------------------------------+\r\n",
"| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\r\n",
Copy link
Contributor Author

@bdice bdice Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina I'm not sure if these nvidia-smi outputs are showing what we intend. It's called twice, before and after the dask dataframe is persisted. There's a comment before that indicates the memory usage should change. However, I don't see a difference in memory usage before and after the persist() call.

Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using persist. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline).

Is this a bug or a change in behavior? Should we change that notice and/or remove the nvidia-smi output entirely so that the notebook's results are less system-dependent?

Copy link
Contributor

@shwina shwina Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that persist() returns immediately. It takes a bit for the DataFrame to materialize. If you wait a bit after the call to persist() and before the second nvidia-smi, the increase in memory is obvious. Unfortunately, this doesn't lend very well to automated notebook execution -- maybe we insert a time.sleep() with an explanation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that's a little surprising. Definitely wouldn't have considered that possibility. I will take a look at this tomorrow, and probably add a sleep command as you suggest. (Note that every sleep command will increase the time to build the docs, so I reduced the final sleep "wait" at the end of this notebook to be less than 60 seconds.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I realize it's not ideal..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally got a sleep command added here. (Found #10829 in the process, which was a blocker.) The memory usage grows after the .persist() call.

@bdice
Copy link
Contributor Author

bdice commented May 11, 2022

This is waiting on bugfix #10829 / #10830, which affects this notebook.

@bdice bdice added the 0 - Blocked Cannot progress due to external reasons label May 11, 2022
@bdice bdice removed the 0 - Blocked Cannot progress due to external reasons label May 13, 2022
@bdice bdice requested a review from mmccarty May 13, 2022 03:45
@bdice
Copy link
Contributor Author

bdice commented May 13, 2022

@shwina @mmccarty This is ready for further review. Thanks! cc: @galipremsagar

Copy link
Contributor

@mmccarty mmccarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@bdice
Copy link
Contributor Author

bdice commented May 13, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0802451 into rapidsai:branch-22.06 May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Documentation non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] 10 Minutes to cuDF and dask-cuDF uses deprecated Append method
4 participants