Revise 10 minutes notebook. #10738
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. Powered by ReviewNB.
Codecov Report

```
@@             Coverage Diff              @@
##           branch-22.06   #10738    +/-   ##
================================================
+ Coverage        86.28%   86.32%    +0.03%
================================================
  Files              144      144
  Lines            22654    22656        +2
================================================
+ Hits             19548    19558       +10
+ Misses            3106     3098        -8
```

Continue to review the full report at Codecov.
"+-----------------------------------------------------------------------------+\n" | ||
"Tue Apr 26 10:47:09 2022 \r\n", | ||
"+-----------------------------------------------------------------------------+\r\n", | ||
"| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\r\n", |
@shwina I'm not sure these `nvidia-smi` outputs are showing what we intend. It's called twice, before and after the Dask DataFrame is persisted, and there's a comment beforehand indicating that the memory usage should change. However, I don't see a difference in memory usage before and after the `persist()` call.

"Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using persist. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)."

Is this a bug or a change in behavior? Should we revise that notice and/or remove the `nvidia-smi` output entirely so that the notebook's results are less system-dependent?
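For context, a minimal sketch of the pattern under discussion; the cluster setup, the example data, and the way `nvidia-smi` is invoked here are illustrative assumptions, not the notebook's actual code:

```python
import subprocess

import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Under a distributed scheduler, persist() is asynchronous.
client = Client(LocalCUDACluster())

# Made-up example data; the notebook builds its own DataFrame.
ddf = dask_cudf.from_cudf(
    cudf.DataFrame({"a": range(1_000_000)}), npartitions=20
)

# First measurement: before persisting.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

ddf = ddf.persist()  # submits the lazy task graph for execution

# Second measurement: the output quoted above comes from a cell like this.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```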
The issue here is that `persist()` returns immediately; it takes a moment for the DataFrame to materialize. If you wait a bit after the call to `persist()` and before the second `nvidia-smi`, the increase in memory is obvious. Unfortunately, this doesn't lend itself well to automated notebook execution -- maybe we insert a `time.sleep()` with an explanation?
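Continuing the sketch above (reusing `ddf` from it), that suggestion would look roughly like this; the sleep duration is a guess, not a value from the PR:

```python
import subprocess
import time

ddf = ddf.persist()  # returns immediately under the distributed scheduler
time.sleep(10)       # assumed duration: give the workers time to materialize

# The second measurement should now show the increased memory usage.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```

A `dask.distributed.wait(ddf)` call would block deterministically instead, but a sleep lets the notebook also explain the delay to readers.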
Wow, that's a little surprising; I definitely wouldn't have considered that possibility. I will take a look at this tomorrow and probably add a sleep command as you suggest. (Note that every sleep command increases the time to build the docs, so I reduced the final sleep "wait" at the end of this notebook to less than 60 seconds.)
Yeah, I realize it's not ideal...
I finally got a `sleep` command added here. (Found #10829 in the process, which was a blocker.) The memory usage grows after the `.persist()` call.

@shwina @mmccarty This is ready for further review. Thanks! cc: @galipremsagar
Looks good. Thanks!
@gpucibot merge
Follow-up from #10685 to fix deprecation warnings in the 10 minute notebook.

Fixes: #10613

Changes:
- `Series.applymap` ➡️ `Series.apply` (see the sketch after this list)
- Removed `Series.append`. This has also been removed from the Pandas 10 minute notebook because the feature is deprecated.
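For illustration, a minimal sketch of the `Series.applymap` ➡️ `Series.apply` migration; the data is made up, not from the notebook:

```python
import cudf

s = cudf.Series([1, 2, 3])

# Before (deprecated spelling, removed from the notebook):
# s.applymap(lambda x: x + 1)

# After: Series.apply compiles the UDF and applies it elementwise.
print(s.apply(lambda x: x + 1))
```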