Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Ensure global_output_type is thread-safe #3497

Merged

Conversation

wphicks
Copy link
Contributor

@wphicks wphicks commented Feb 12, 2021

Introduce a global_settings object implemented as a thread-local borg for storing global settings in a thread-safe way
Ensure that global_output_type is manipulated only in thread-safe ways and provide associated unit tests
Move "root" context manager to global_settings object to avoid unnecessary existence check for root_cm

Close #3234
Mitigate #3237

@wphicks wphicks added bug Something isn't working 2 - In Progress Currenty a work in progress Cython / Python Cython or Python issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. non-breaking Non-breaking change labels Feb 12, 2021
@wphicks
Copy link
Contributor Author

wphicks commented Feb 12, 2021

A few notes for those reviewing:

  • While I expect most people to access the global settings object via cuml.global_settings rather than separately instantiating a GlobalSettings object, I still thought it was important to make that class a borg to minimize the chance of hard-to-debug errors that would crop up if a second GlobalSettings object were accidentally invoked somehow. See the class docstring for an explanation of why this is a class at all.
  • I introduced the module-level __getattr__ for 2 reasons. One is to maintain cuml.global_output_type as an alias for cuml.global_settings.output_type (although I would like to deprecate that; see below). The subtler but more important reason is the same as why this check was necessary before this PR. In that case, this assignment is only executed in the main thread, not in any others, leaving root_cm unassigned. Similarly, if we were to just say global_settings = GlobalSettings() in the __init__.py file, then global_settings would reference the borg local to the main thread for all threads, even though its __dict__ has been correctly set up in thread-local storage. This is generally not a problem for thread-local borgs because they are generally invoked within functions which are executed independently in each thread rather than being instantiated as an import side-effect. By using __getattr__, we both avoid the overhead of repeated instantiations of GlobalSettings and ensure that each thread accesses its own borg.
  • I have omitted the usual declaration of __dir__ that accompanies __getattr__ because we explicitly want to discourage access to global settings objects outside of the library.
  • I would like to deprecate cuml.global_output_type in favor of cuml.global_settings.output_type as part of this PR, but I wanted to make sure I had buy-in from others (especially @mdemoret-nv) before doing so. Specifically, while our documents already tell external users to use set_api_output_type or using_output_type rather than accessing cuml.global_output_type directly, it would be a relatively easy mistake for either an internal or external developer to type e.g. cuml.global_output_type = 'cupy', which would immediately remove all the thread-safety infrastructure. There is no __setattr__ equivalent for modules, so the only way to prevent this mistake would be to use the hack where an object replaces a module in the module list at import time. This is generally not encouraged, and module-level __getattr__ was introduced in part to alleviate the need for such shenanigans. Therefore, my preferred solution would be to replace all instances of cuml.global_output_type with cuml.global_settings.output_type internally and add a strongly-worded warning in the global_output_type branch of __getattr__ saying that accessing global_output_type directly is discouraged and will be removed in 0.20. I'm going to leave this PR as a draft until there's consensus on whether or not that's an acceptable solution.

@wphicks wphicks changed the title Ensure global_output_type is thread-safe [REVIEW] Ensure global_output_type is thread-safe Feb 19, 2021
@wphicks wphicks added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currenty a work in progress labels Feb 19, 2021
@wphicks wphicks marked this pull request as ready for review February 19, 2021 20:01
@wphicks wphicks requested a review from a team as a code owner February 19, 2021 20:01
@wphicks
Copy link
Contributor Author

wphicks commented Feb 19, 2021

Since there hasn't been any comment on this and offline discussion was generally positive for deprecating cuml.global_output_type, I went ahead and did so. This is ready for review.

Copy link
Contributor

@mdemoret-nv mdemoret-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor suggestions and one request. I like the new system for global settings. I think it could be very flexible and make debugging issues easier. For example, what do you think about adding to global_settings.py a logging wrapper:

class LoggingGlobalSettings(GlobalSettings):

    @GlobalSettings.output_type.setter
    def output_type(self, value):
        print("Setting `output_type` = {}".format(value))
        self._output_type = value

And then this could be controlled with an environment variable:

class GlobalSettings(object):
    def __new__(cls):
        if ("CUML_DEBUG_GLOBALSETTINGS" in os.environ):
            return super(GlobalSettings, cls).__new__(LoggingGlobalSettings)
        else:
            return super(GlobalSettings, cls).__new__(cls)

    def __init__(self):
        self.__dict__ = _global_settings_data.shared_state

This is just an example, but the idea would be great to add to this PR. Let me know what you think.

I also have one request: Can you move the aliased decorators in api_decorators.py (everything below line 657) into a separate file so they can be discovered easily? This has been on my todo list and would fit well with this PR.

Other than that, the PR looks great. Love the Dask tests.

python/cuml/internals/api_context_managers.py Outdated Show resolved Hide resolved
python/cuml/internals/api_context_managers.py Show resolved Hide resolved
python/cuml/internals/global_settings.py Show resolved Hide resolved
Co-authored-by: Michael Demoret <[email protected]>
@wphicks
Copy link
Contributor Author

wphicks commented Feb 23, 2021

For example, what do you think about adding to global_settings.py a logging wrapper:

If it's all right by you, I'd like to leave this for another PR if we decide to include it. I think this might be one case where "less is more" in terms of providing debugging tools for the team. Adding something controlled by an environment variable that deep in the guts of the code does not necessarily seem like the right way to approach this. If someone gets to the point where they know they need to understand how the output type is being set and changed, they'll probably be looking at the global settings code anyway, and it's easy enough to set a breakpoint or (if we must! ;) ) add a logging statement. If someone sets the environment variable and starts to follow the logic of how the output types are getting changed, they're probably at a point where they'll need problem-specific debugging code anyway.

All that being said, I'm happy to be overruled if other people think this would be a handy, often-used tool. Even then, though, I think it would be worth deferring to a separate PR, since there could be useful discussion about the details of the implementation. For instance, I don't think I'd necessarily jump to overriding __new__, cuml's logging methods should be used, we need to decide on an appropriate default logging level, etc. With enough things to discuss about this, I think it makes sense to at least look at it in its own PR.

@mdemoret-nv
Copy link
Contributor

Fair enough. Your implementation looks great and sparked some ideas but I agree with your assessment. Better to leave it for another PR, if one is even necessary.

@wphicks
Copy link
Contributor Author

wphicks commented Feb 24, 2021

I also have one request: Can you move the aliased decorators in api_decorators.py (everything below line 657) into a separate file so they can be discovered easily? This has been on my todo list and would fit well with this PR.

I started to do this, but then I wasn't quite sure about what should and shouldn't be moved over. Everything under 657 or just the aliases? Would you mind terribly just popping that into its own PR? It seems like a discrete change, and I want to make sure it's exactly what you're looking for. I'd be happy to give it a quick review!

@mdemoret-nv
Copy link
Contributor

Would you mind terribly just popping that into its own PR?

No problem. I have it on my TODO list already.

Copy link
Contributor

@mdemoret-nv mdemoret-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nice improvement.

@wphicks wphicks added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 24, 2021
@JohnZed
Copy link
Contributor

JohnZed commented Feb 25, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fc718e4 into rapidsai:branch-0.19 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working Cython / Python Cython or Python issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] AssertionError in multithreaded KBinsDiscretizer
4 participants