Enable patching #27

Merged (37 commits) on Mar 19, 2024
Conversation

@vyasr (Contributor) commented Feb 27, 2024

This PR adds the ability to monkey-patch imports of dask and distributed whenever those imports occur, simply by installing rapids-dask-dependency.

There's a tiny bit of scope creep here: because this PR adds real Python code to the repo for the first time, I also added pre-commit hooks, which in turn modified some unrelated files (only minimally, though).

TODO:

- [x] Update conda CI and packaging
- [ ] Stress test extensively
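
For anyone unfamiliar with the mechanics: one way to implement this kind of interception is a finder on sys.meta_path (activated from a .pth file at interpreter startup) that delegates to the real loader and then patches the loaded module. The following is a minimal illustrative sketch with hypothetical names, not the exact code in this PR:

# Minimal sketch of import-time patching via a meta path finder.
# Hypothetical names; the real implementation in this repo differs.
import importlib.abc
import importlib.machinery
import sys


class PatchingLoader(importlib.abc.Loader):
    def __init__(self, wrapped):
        self._wrapped = wrapped

    def create_module(self, spec):
        return self._wrapped.create_module(spec)

    def exec_module(self, module):
        # Run the real module first, then apply our patches to it.
        self._wrapped.exec_module(module)
        module.test_attr = "hello world"


class PatchingFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, name, path, target=None):
        if name not in ("dask", "distributed"):
            return None
        # Locate the real package, then substitute our wrapping loader.
        spec = importlib.machinery.PathFinder.find_spec(name, path)
        if spec is not None and spec.loader is not None:
            spec.loader = PatchingLoader(spec.loader)
        return spec


sys.meta_path.insert(0, PatchingFinder())

Because the finder sits in front of the normal path-based finders, any import of the targeted packages anywhere in the process goes through it first.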

@vyasr vyasr added the feature request label Feb 27, 2024
@vyasr vyasr self-assigned this Feb 27, 2024
@pentschev (Member)

I was testing this branch out and I think it may have some issues. I added some prints in Dask to help me with debugging:

diff --git a/dask/dataframe/__init__.py b/dask/dataframe/__init__.py
index 87e945ff..3a619ad2 100644
--- a/dask/dataframe/__init__.py
+++ b/dask/dataframe/__init__.py
@@ -161,6 +161,7 @@ else:
         raise ImportError(msg) from e

     if dask.config.get("dataframe.query-planning-warning"):
+        print("dask.dataframe warning enabled", flush=True)
         warnings.warn(
             """The current Dask DataFrame implementation is deprecated.
 In a future release, Dask DataFrame will use a new implementation that
@@ -196,6 +197,10 @@ To disable this warning in the future, set dask config:
             DeprecationWarning,
             stacklevel=2,
         )
+    else:
+        print("dask.dataframe warning disabled", flush=True)


 from dask.dataframe._testing import test_dataframe
+
+print("dask.dataframe imported", flush=True)

With the changes above and the current rapids-dask-dependency package from the rapidsai-nightly channel, I see:

>>> import dask
>>> import dask.dataframe
dask.dataframe warning enabled
<stdin>:1: DeprecationWarning: The current Dask DataFrame implementation is deprecated.
In a future release, Dask DataFrame will use a new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues

To disable this warning in the future, set dask config:

    # via Python
    >>> dask.config.set({'dataframe.query-planning-warning': False})

    # via CLI
    dask config set dataframe.query-planning-warning False


dask.dataframe imported
>>>

In the case above, import dask.dataframe clearly worked as expected. However, with this branch the result is as follows:

>>> import dask
>>> import dask.dataframe
>>> dask.test_attr
'hello world'
>>>

It seems that with the changes here, import dask.dataframe has no effect. @vyasr can you think of a reason why this would happen?

@vyasr (Contributor, Author) commented Feb 27, 2024

It's likely because I wasn't careful in my handling of submodules and am only dealing with the top-level package right now.
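
As an illustration of the distinction (hypothetical code, not necessarily the eventual fix): a finder that matches only the exact package name never sees submodule imports, so intercepting dask.dataframe as well requires matching the dotted prefix:

# Hypothetical predicate: checking the exact name misses submodules
# such as "dask.dataframe"; matching the dotted prefix catches them too.
def should_intercept(name):
    for pkg in ("dask", "distributed"):
        if name == pkg or name.startswith(pkg + "."):
            return True
    return False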

@vyasr (Contributor, Author) commented Feb 27, 2024

I've pushed fixes that should resolve the dask.distributed issue.

To be clear, I don't know enough about all the ways in which dask and distributed can be imported and interact to know the exact right solution here. I'll definitely need help from you all in writing the logic so that it only intercepts the desired imports and only patches the required modules. For instance, with the current version of the code, every submodule of dask and distributed will also have a test_attr attribute. You'll have to tell me what filtering criteria should be applied for which patches.

Currently I have separate dask and distributed patches, but it might instead make sense to have a single list of patches and require each patch to determine what action to take based on the module name.
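
That design might look something like the following sketch (names are illustrative, not actual code from this PR):

# Sketch of the suggested design: one registry of patches, where each
# patch decides whether it applies based on the module name.
# All names here are illustrative.
PATCHES = []


def register_patch(predicate, apply):
    """Register apply(module), to run when predicate(module_name) is true."""
    PATCHES.append((predicate, apply))


def apply_patches(module):
    for predicate, apply in PATCHES:
        if predicate(module.__name__):
            apply(module)


# Example: patch only the top-level packages, not their submodules.
register_patch(
    lambda name: name in ("dask", "distributed"),
    lambda module: setattr(module, "test_attr", "hello world"),
)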

@vyasr (Contributor, Author) commented Feb 27, 2024

Also, just FYI, this approach will not work with editable installs. I assume nobody is trying that; this is just a warning (I don't know how to make editable installs work).

@vyasr (Contributor, Author) commented Feb 27, 2024

I added a tests module. Please feel free to check out this code and push additional tests that you think are appropriate. I can help fix failing cases once I know what they are.

@pentschev (Member)

Thanks Vyas! I can confirm the import now works, but for some reason DeprecationWarnings are not raised back to the user, whereas UserWarnings are. I'll try to investigate this further.

On a separate note, I was thinking of a problem that we (or at least I) may have overlooked: with this change we now alter Dask's behavior for everyone who has rapids-dask-dependency installed, not just those actively using Dask-cuDF, Dask-CUDA, etc. For instance, people may create self-contained environments for general use that include both RAPIDS and non-RAPIDS packages. If we omit the DeprecationWarning (unrelated to the issue in the paragraph above; omitting the dask-expr warning is what I'm thinking we will use this for in 24.04), we will effectively omit it for everyone, even those who may be using CPU-only Dask. In the broader case I think this is OK: we will normally use pth to apply fixes that cannot make their way into Dask/Distributed releases in time for us. But the DeprecationWarning here is a bit different; we would essentially be modifying Dask's default behavior even for CPU users. I don't have strong thoughts on this either way, but maybe @rjzamora will, so we should make a conscious decision about whether this is OK.

@rjzamora (Member)

> Thanks Vyas! I can confirm the import now works, but for some reason DeprecationWarnings are not raised back to the user, whereas UserWarnings are. I'll try to investigate this further.

I just ran into this as well. We do want to avoid having an effect on "normal" dask/distributed behavior unless it is behavior that we are specifically targeting in a patch.

> On a separate note I was thinking of one problem that we may (or at least I did) have overlooked: ...

This is definitely an important consideration when deciding to add a new patch. For the case of a "hot fix" that we would have preferred to merge into dask/distributed proper if time allowed, this is a no-brainer to me: there is little to no downside in modifying CPU-only behavior. For the "query-planning" config change, this is a much tougher question to answer. While I am very comfortable silencing the deprecation warning for our CPU and GPU users alike, I am less comfortable changing the default for dask>=2024.3.0.

It may ultimately make sense to embrace this pth-patching idea (kudos to @vyasr), but to still pin to dask<2024.3.0 for the 24.04 release.

@vyasr (Contributor, Author) commented Feb 28, 2024

Is the dask warning you're referring to a true Python DeprecationWarning? If so, I would be curious how it's being surfaced by default, since the default behavior of DeprecationWarning in Python is to be hidden. If dask is not using a custom warning class that merely looks like DeprecationWarning, they must be doing some sort of warnings configuration for it to show up, and maybe that configuration is somehow being masked by this importing approach.

@vyasr (Contributor, Author) commented Feb 28, 2024

IMO if someone is installing RAPIDS and dask into the same environment, it is OK for patching to take effect even if they're not using RAPIDS. There's really no meaningful way to differentiate those two cases. Even our more drastic solutions to the patching problem (like forking) would have the same problem.

@pentschev (Member)

> I am less comfortable changing the default for dask>=2024.3.0.

I think we should generally leave this discussion for later, but can you elaborate on why that is, @rjzamora?

> If dask is not using a custom warning class that merely looks like DeprecationWarning, they must be doing some sort of warnings configuration for it to show up, and maybe that configuration is somehow being masked by this importing approach.

@vyasr This is what they're doing: https://github.com/dask/dask/blob/d66c5c88906ed51b5cb47dcfa0030717657675d2/dask/dataframe/__init__.py#L173-L207

AFAIK you don't need to do anything besides adjusting to the proper stacklevel, which they do. I wasn't able to get a warning raised during Dask import even when playing around with different stacklevels, so I'm really not sure why this happens, given that a UserWarning does get raised even with the pth patch.

@rjzamora (Member) commented Feb 28, 2024

> I think we should generally leave this discussion for later, but can you elaborate on why that is, @rjzamora?

I'm not stating that we shouldn't do it, just that it has larger consequences than the other cases mentioned, and so I am relatively less comfortable. With that said, I tend to agree with Vyas that this is no worse than forking/publishing our own proxy version of dask.

@pentschev (Member)

> With that said, I tend to agree with Vyas that this is no worse than forking/publishing our own proxy version of dask.

Yes, I agree with that statement as well, but this is something I hadn't considered for either case (forking or pth) until earlier today. If you both think we're still OK eventually disabling the query-planning warning, even though this will disable it for CPU-only users who install RAPIDS packages as well, I'm OK with that too; I just wanted to raise this point to make sure everyone is aware of it.

@vyasr vyasr mentioned this pull request Feb 29, 2024
galipremsagar pushed a commit that referenced this pull request Feb 29, 2024
- This rolls back the changes in
#25 until we have
a plan to deal with the loud `dask.dataframe` deprecation warning in
`dask>=2024.2.0`
- This adds `dask_expr` as a dependency in preparation for
rapidsai/cudf#14805 (happy to add this in a
follow-up PR if others prefer)
- We may be able to remove the pin for 24.04 if/when
#27 is merged
(fingers crossed). However, the *default* plan is still to keep the
2024.1.1 dask pin in place until 24.06 (see:
#26)
@vyasr (Contributor, Author) commented Mar 5, 2024

The reason the DeprecationWarning is currently being hidden is that DeprecationWarnings in Python are hidden by default unless they are in __main__. Whether or not a warning thinks it's in __main__ (the top-level module/environment) is based on the stacklevel. For example, if I do python -c "import warnings; warnings.warn('test', DeprecationWarning)" that will display:

(rapids) coder _ ~/rmm $ python -c "import warnings; warnings.warn('test', DeprecationWarning)"
<string>:1: DeprecationWarning: test

but if I put that same code into a file and import it, it won't display because the warning is coming from one level deeper in the stack, not __main__:

(rapids) coder _ ~/rmm $ echo "import warnings; warnings.warn('test', DeprecationWarning)" > test.py
(rapids) coder _ ~/rmm $ python -c "import test"

If we wanted the warning to show in the second case, we could set the stacklevel so that Python would instead check whether one frame up is __main__:

(rapids) coder _ ~/rmm $ echo "import warnings; warnings.warn('test', DeprecationWarning, stacklevel=2)" > test.py
(rapids) coder _ ~/rmm $ python -c "import test"
<string>:1: DeprecationWarning: test

The problem with the patching approach in this PR is that it adds many extra stack frames to the import of dask*. That completely changes the stack level and means that the DeprecationWarning raised on import of dask will not think it's coming from __main__. The same is true of all other warnings, but in those cases the warnings won't be hidden (because they aren't hidden by default); they'll simply show up as coming from the wrong file.

We have a few options for how to deal with this:

  1. Keep the status quo of hiding DeprecationWarnings. That seems suboptimal, but it's worth noting that this only changes the behavior of imports. Any other DeprecationWarning in dask, e.g. from calling a function, will still show up.
  2. Surface all DeprecationWarnings during import. This means that any DeprecationWarning thrown anywhere in the call stack of import dask.dataframe (or any other dask import) will be shown to the user. In practice this shouldn't be many warnings, so I think this is a good compromise solution (a rough sketch follows below).
  3. Try to parse the call stack to determine the correct stack level and wrap the call to warnings.warn before importing. This is the most complex and sophisticated solution. It may be possible in theory, but I'm not confident that it's worth the effort it would take to get right.
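
For reference, a minimal sketch of what option 2 could look like, assuming the wrapping loader uses a helper like this around the real module execution (hypothetical helper name; not a final design):

# Sketch of option 2: make DeprecationWarnings visible while the wrapped
# module's import executes, then restore the previous warning filters.
import warnings


def exec_module_showing_deprecations(loader, module):
    with warnings.catch_warnings():
        # Override the default "ignore" action for DeprecationWarning
        # for the duration of this import only.
        warnings.simplefilter("always", DeprecationWarning)
        loader.exec_module(module)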

@vyasr (Contributor, Author) commented Mar 6, 2024

I think this PR is good to go on my end. Packaging has been fully updated. I'll leave it as a draft for now so that we don't accidentally merge before 24.06.

@jakirkham jakirkham mentioned this pull request Mar 7, 2024
@vyasr vyasr changed the base branch from branch-24.04 to branch-24.06 March 18, 2024 21:21
@vyasr vyasr marked this pull request as ready for review March 18, 2024 21:29
@vyasr vyasr requested a review from a team as a code owner March 18, 2024 21:29
@vyasr vyasr added the non-breaking label Mar 18, 2024
@vyasr vyasr changed the title Mar 18, 2024
@raydouglass raydouglass added and removed the feature request label Mar 18, 2024
@rjzamora (Member) left a comment

Seems good to me. Just some possible/minor README suggestions.

Co-authored-by: Richard (Rick) Zamora <[email protected]>
@vyasr vyasr changed the title Mar 19, 2024
@pentschev (Member) left a comment

Thanks so much for this @vyasr !

@galipremsagar (Contributor)

/merge

@galipremsagar galipremsagar merged commit 2304660 into rapidsai:branch-24.06 Mar 19, 2024
5 checks passed
bdice pushed a commit to bdice/rapids-dask-dependency that referenced this pull request Apr 3, 2024
vyasr pushed a commit that referenced this pull request Apr 4, 2024
Currently, the tests added in #27 (backported to 24.04 in #36) do not
check the exit codes of their subprocesses. This means that failing
tests are not caught. This PR fixes the test utilities to check the exit
codes and print any stdout/stderr outputs.
vyasr added a commit to vyasr/rapids-dask-dependency that referenced this pull request Apr 4, 2024
vyasr pushed a commit to vyasr/rapids-dask-dependency that referenced this pull request Apr 4, 2024
raydouglass pushed a commit that referenced this pull request Apr 5, 2024
This PR backports #27, #37, and #39 to 24.04

---------

Signed-off-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Richard (Rick) Zamora <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Labels: feature request, non-breaking
5 participants