-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interruptible execution #4463
Interruptible execution #4463
Conversation
rerun tests |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, and glad to see the docs were updated as well!
### Cooperative-style interruptible C++ threads. This proposal introduces `raft::interruptible` introducing three functions: ```C++ static void synchronize(rmm::cuda_stream_view stream); static void yield(); static void cancel(std::thread::id thread_id); ``` `synchronize` and `yield` serve as cancellation points for the executing CPU thread. `cancel` allows to throw an async exception in a target CPU thread, which is observed in the nearest cancellation point. Altogether, these allow to cancel a long-running job without killing the OS process. The key to make this work is an obvious observation that the CPU spends most of the time waiting on `cudaStreamSynchronize`. By replacing that with `interruptible::synchronize`, we introduce cancellation points in all critical places in code. If that is not enough in some edge cases (the cancellation points are too far apart), a developer can use `yield` to ensure that a cancellation request is received sooner rather than later. #### Implementation ##### C++ `raft::interruptible` keeps an `std::atomic_flag` in the thread-local storage in each thread, which tells whether the thread can continue executing (being in non-cancelled state). [`cancel`](https://github.com/rapidsai/raft/blob/6948cab96483ddc7047b1ae0a162574e32bcd8f0/cpp/include/raft/interruptible.hpp#L122) clears this flag, and [`yield`](https://github.com/rapidsai/raft/blob/6948cab96483ddc7047b1ae0a162574e32bcd8f0/cpp/include/raft/interruptible.hpp#L194-L204) checks it and resets to the signalled state (throwing a `raft::interrupted_exception` exception if necessary). [`synchronize`](https://github.com/rapidsai/raft/blob/6948cab96483ddc7047b1ae0a162574e32bcd8f0/cpp/include/raft/interruptible.hpp#L206-L217) implements a spinning lock querying the state of the stream and `yield`ing on each iteration. I also add an overload [`sync_stream`](https://github.com/rapidsai/raft/blob/ee99523ff6a8257ec213e5ad15292f2132a2a687/cpp/include/raft/handle.hpp#L133) to the raft handle type, to make it easier to modify the behavior of all synchronization calls in raft and cuml. ##### python This proposal adds a context manager [`cuda_interruptible`](https://github.com/rapidsai/raft/blob/36e8de5f73e9ec7e604b38a4290ac82bc35be4b7/python/raft/common/interruptible.pyx#L28) to handle Ctrl+C requests during C++ calls (using posix signals). `cuda_interruptible` simply calls `raft::interruptible::cancel` on the target C++ thread. #### Motivation See rapidsai/cuml#4463 Resolves rapidsai/cuml#4384 Authors: - Artem M. Chirkin (https://github.com/achirkin) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Tamas Bela Feher (https://github.com/tfeher) URL: #433
@gpucibot merge |
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #4463 +/- ##
===============================================
Coverage ? 85.71%
===============================================
Files ? 236
Lines ? 19364
Branches ? 0
===============================================
Hits ? 16597
Misses ? 2767
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
### Cooperative-style interruptible C++ threads. This proposal attempts to make cuml experience more responsive by allowing easier way to interrupt/cancel long running cuml tasks. It replaces calls `cudaStreamSynchronize` with `raft::interruptible::synchronize`, which serve as a cancellation points in the algorithms. With a small extra hook on the python side, Ctrl+C requests now can interrupt the execution (almost) immediately. At this moment, I adapted just a few models as a proof-of-concept. Example: ```python import sklearn.datasets import cuml.svm X, y = sklearn.datasets.fetch_olivetti_faces(return_X_y=True) model = cuml.svm.SVC() print("Data loaded; fitting... (try Ctrl+C now)") try: model.fit(X, y) print("Done! Score:", model.score(X, y)) except Exception as e: print("Canceled!") print(e) ``` #### Implementation details rapidsai/raft#433 #### Adoption costs From the changeset in this PR you can see that I introduce two types of changes: 1. Change `cudaStreamSynchronize` to either `handle.sync_thread` or `raft::interruptible::synchronize` 2. Wrap the cython calls with [`cuda_interruptible`](https://github.com/rapidsai/raft/blob/36e8de5f73e9ec7e604b38a4290ac82bc35be4b7/python/raft/common/interruptible.pyx#L28) and `nogil` Change (1) is straightforward and can mostly be automated. Change (2) is a bit more involved. You definitely have to wrap a C++ call with `interruptibleCpp` to make `Ctrl+C` work, but that is also rather simple. The tricky part is adding `nogil`, because you have to make sure there is no python objects within `with nogil` block. However, `nogil` does not seem to be strictly required for the signal handler to successfully interrupt the C++ thread. It worked in my tests without `nogil` as well. Yet, I chose to add `nogil` in the code where possible, because in theory it should reduce the interrupt latency and enable more multithreading. #### Motivation In general, this proposal makes executing threads (and thus algos/models) more controllable. The main use cases I see: 1. Being able to Ctrl+C the running model using signal handlers. 2. Stopping the thread programmatically, e.g. we can create the tests of sort "if running for more than n seconds, stop and fail". Resolves rapidsai#4384 Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#4463
Cooperative-style interruptible C++ threads.
This proposal attempts to make cuml experience more responsive by allowing easier way to interrupt/cancel long running cuml tasks. It replaces calls
cudaStreamSynchronize
withraft::interruptible::synchronize
, which serve as a cancellation points in the algorithms. With a small extra hook on the python side, Ctrl+C requests now can interrupt the execution (almost) immediately. At this moment, I adapted just a few models as a proof-of-concept.Example:
Implementation details
rapidsai/raft#433
Adoption costs
From the changeset in this PR you can see that I introduce two types of changes:
cudaStreamSynchronize
to eitherhandle.sync_thread
orraft::interruptible::synchronize
cuda_interruptible
andnogil
Change (1) is straightforward and can mostly be automated.
Change (2) is a bit more involved. You definitely have to wrap a C++ call with
interruptibleCpp
to makeCtrl+C
work, but that is also rather simple. The tricky part is addingnogil
, because you have to make sure there is no python objects withinwith nogil
block. However,nogil
does not seem to be strictly required for the signal handler to successfully interrupt the C++ thread. It worked in my tests withoutnogil
as well. Yet, I chose to addnogil
in the code where possible, because in theory it should reduce the interrupt latency and enable more multithreading.Motivation
In general, this proposal makes executing threads (and thus algos/models) more controllable. The main use cases I see:
Resolves #4384