Wait for async tasks to finish in ProcessPipeline#close #79
Conversation
@m50d Thanks for the PR.
By the way, there are several cases in which processors may be closed.
For 2, there's already logic to wait for pending tasks up to a timeout. So the only case we want to handle in this PR is case 1; thus just waiting for pending tasks before destroying processors in the shutdown sequence would be fine. The remaining concern is whether we should have a timeout for case 1 or not. I think we should - @kawamuray, what do you think?
Thanks for the PR @m50d :)
I agree with this first of all.
I agree that case 1 should have a configurable timeout, but disagree with reusing the existing one. I can imagine several possible interfaces to let users do this, but I'm not yet convinced which one would be the most suitable considering realistic use cases.
Expecting use from Spring, the property approach would be suitable because Spring calls close automatically on managed beans. The close-with-timeout approach would be simplest to implement, and it might be sufficient for the purpose without needing to introduce another interface or another property whose consistency with existing/future interfaces we would have to care about.
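To make the two options concrete, here is a minimal sketch (not decaton's actual API; the class and method names are made up for illustration): option A reads a configured timeout when close() runs, option B exposes an explicit close-with-timeout overload.

```java
// Option A: timeout supplied as a configuration property, used by the normal close().
class SubscriptionWithProperty implements AutoCloseable {
    private final long shutdownTimeoutMillis;            // hypothetical configured property

    SubscriptionWithProperty(long shutdownTimeoutMillis) {
        this.shutdownTimeoutMillis = shutdownTimeoutMillis;
    }

    @Override
    public void close() {
        waitForPendingTasks(shutdownTimeoutMillis);       // Spring-style close() needs no extra argument
    }

    void waitForPendingTasks(long timeoutMillis) { /* wait up to timeoutMillis */ }
}

// Option B: the caller passes the timeout explicitly to a close overload.
class SubscriptionWithTimeoutArg implements AutoCloseable {
    @Override
    public void close() {
        close(Long.MAX_VALUE);                            // plain close() waits indefinitely
    }

    public void close(long timeoutMillis) {
        waitForPendingTasks(timeoutMillis);
    }

    void waitForPendingTasks(long timeoutMillis) { /* wait up to timeoutMillis */ }
}
```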
Makes sense. The polling loop is not ideal (though I guess that's a separate issue).
I found the existing logic quite hard to follow, because it's scattered across a lot of classes. IMO each processor/scope should be responsible for itself - shutting down a subscription should delegate to its partitions to shut them down, and shutting down a partition should delegate to its threads to shut them down. If I pushed down the existing polling-loop logic to run at thread level (so we'd run the polling loop once for each processor rather than at top level as now) it would be somewhat inelegant, but maybe it's ok? WDYT?
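As an illustration of that delegation shape (class names are placeholders, not the real decaton classes): each level shuts down only its direct children, and the existing polling-loop wait moves down to the leaf level.

```java
import java.util.List;

class SubscriptionSketch {
    private final List<PartitionSketch> partitions;

    SubscriptionSketch(List<PartitionSketch> partitions) { this.partitions = partitions; }

    void shutdown(long timeoutMillis) {
        for (PartitionSketch p : partitions) {
            p.shutdown(timeoutMillis);        // subscription only delegates to its partitions
        }
    }
}

class PartitionSketch {
    private final List<WorkerSketch> workers;

    PartitionSketch(List<WorkerSketch> workers) { this.workers = workers; }

    void shutdown(long timeoutMillis) {
        for (WorkerSketch w : workers) {
            w.shutdown(timeoutMillis);        // partition only delegates to its worker threads
        }
    }
}

class WorkerSketch {
    void shutdown(long timeoutMillis) {
        // the leaf runs the existing polling-loop wait for its own pending tasks
    }
}
```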
SGTM. Will do.
Yeah, that's very much the best fit for our use case IMO.
I'm not sure which piece of code exactly you're talking about. Can you provide some pointers to the code that you think isn't done at the right place now?
Sorry for the overly flippant previous comment. I also couldn't see why this isn't sharing an implementation with the existing wait logic.
Does that mean you think an implementation like the following is better in terms of proper abstraction?

```java
// at PartitionContext
private void waitRemainingTasks(long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
        updateHighWatermark();
        if (pendingTasks() == 0) {
            return;
        }
    }
}

// at ProcessorSubscription
public void waitForRemainingTasksCompletion(long timeoutMillis) {
    for (PartitionContext context : partitionContexts) {
        context.waitRemainingTasks(remainingTimeout); // remaining share of the overall timeout
    }
}
```

In that case, I think it doesn't work, because we must ensure we call consumer.poll() in one case but not the other. The reason is, on partition rebalance we don't need to (actually can't) call consumer.poll(), while we should in the case of a property reload.
Hmm. How about an interface along those lines, then?
Force-pushed from a4b3fa9 to 5f2f275.
Reworked this branch with the approach I was trying to describe - WDYT?
```diff
@@ -29,7 +29,7 @@
  * Represents consumption processing progress of records consumed from a single partition.
  * This class manages sequence of offsets and a flag which represents if each of them was completed or not.
  */
-public class OutOfOrderCommitControl {
+public class OutOfOrderCommitControl implements AsyncShutdownPollable {
```
Hm, this doesn't feel right, because OOOCC is just a data structure that is intended to be controlled by PartitionContext.

Calling it "shutdownable" makes me feel like it represents a flow from start to something that eventually shuts down, which doesn't really exist. The lifetime of this object totally depends on its user class (i.e., this class keeps working and never shuts down as long as the user class keeps calling reportOffset, regardless of whether initiateShutdown has ever been called - so what this interface says is "pollShutdown() will eventually return true if you stop calling reportFetchedOffset", which doesn't sound like "shutdown"). So this sounds semantically incorrect and makes it unnecessarily abstract, whereas we knew exactly what was happening when we were calling updateHighWatermark + pendingOffsetsCount explicitly.
```diff
@@ -77,6 +78,8 @@
  */
 private TracingProvider tracingProvider = NoopTracingProvider.INSTANCE;

+private Optional<Long> waitForProcessingOnClose = Optional.empty();
```
Let's make it one of ProcessorProperties?

What we want to configure here is just the timeout that Decaton waits for already-in-progress tasks to complete before ProcessorSubscription.close() returns. Having a property expressing that timeout benefits us by:

- consistent handling with other configurables
- users can configure it through the existing PropertySupplier.
Sorry for the delay in review, and thanks for sharing your idea. Besides the inline comments, let me summarize my feeling about the proposed approach.

Sorry if my explanation isn't straightforward to understand. It's about high-level class design, and it was a bit hard to express my feeling exactly in statements, TBH.
Thanks.
Agreed; I did wonder about making it something like "pollWorkInProgress", because it's a bit misleading to call it shutdown if it's not actually shutting down.
I agree, but I think something along those lines has to be the right direction overall - figuring out how far processing has got should surely follow the same hierarchy of responsibility as processing itself, so that each object has a clear area of responsibility? I found it hard to follow how tasks were handed out down to processors but then the progress information came back by a separate path.

I think it makes sense for larger systems that might want to do the same thing themselves - e.g. a program might want to initiate shutdown itself and observe completion across several subscriptions.
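For example (a hypothetical caller, using the initiateShutdown()/awaitShutdown() pair discussed in this thread), a service stopping several subscriptions could begin shutdown on all of them before waiting on any, so the waits overlap instead of running sequentially:

```java
import java.util.List;

import com.linecorp.decaton.processor.runtime.ProcessorSubscription;

class GracefulStopper {
    // Hypothetical caller-side shutdown: kick off every subscription's shutdown first,
    // then wait for each one in turn.
    void stopAll(List<ProcessorSubscription> subscriptions) throws InterruptedException {
        for (ProcessorSubscription s : subscriptions) {
            s.initiateShutdown();   // non-blocking
        }
        for (ProcessorSubscription s : subscriptions) {
            s.awaitShutdown();      // blocks until that subscription has terminated
        }
    }
}
```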
Hm, while I could partially understand that you had a hard time tracking down the path that delivers tasks from the consumer to ProcessorUnit and gets offset information back from there, my take is that it isn't caused by insufficient abstraction or responsibility separation, but mainly by some limitations we must follow when implementing a processing (threading) model like Decaton's on top of a single consumer instance.
Maybe this is because I'm the original author of those classes, but I take classes like these as each having a deliberately narrow role.
Okay, so it will be like the following, right?

…which is a bit confusing, isn't it? What would be the right expected "meaning" of those timeouts?
I agree with your 3 points; I think it's OOOCC that I found confusing; it owns both … I think it'd make more sense to me if we kept the …

If we provide a version of …
Ok, now I see your point a bit more clearly.

To tell you a bit of history, OOOCC has been the most important core of Decaton and its implementation has evolved several times in the past to optimize performance, because it has tended to be a bottleneck. So the class's responsibility is limited and isolated from other parts of the Decaton core with strong intention, to keep it producing optimal performance for its very specific role.

You can see the previous implementations here, as we keep all of them to track performance comparisons when we update the logic.
That's one possible way of doing it.
That's possible, but would that make it really better?
Hm, it might depend on the interface, but to me an interface like that …
Yeah, that would make more sense. For that usage, what about adding an interface and implementing it in ProcessorSubscription like below?

```java
public CompletableFuture<Void> initShutdown();
```
Yes, that's exactly the intention. All I mean is that …

That's not the intention - if two threads call it, …

I'm modelling the behaviour on …
hm, that's a good point.
Yeah, originally I was thinking of adding more methods to …

Okay, I didn't imagine that. Besides, calling …
Is that so? I was imagining something like this:

```java
class ProcessorSubscription {
    private final CompletableFuture<Void> shutdownFuture = new CompletableFuture<>();
    private volatile boolean shutdownFlag;

    public void run() {
        try {
            // ... consume loop ...
        } finally {
            shutdownFuture.complete(null);
        }
    }

    public CompletableFuture<Void> initShutdown() {
        shutdownFlag = true;
        return shutdownFuture;
    }
}
```
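If it did return a CompletableFuture like that, a caller could bound its own wait or chain follow-up work using only standard CompletableFuture operations (a sketch; initShutdown() is the hypothetical method from the snippet above):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ShutdownCaller {
    // Sketch of consuming the future returned by the hypothetical initShutdown() above.
    static void shutdownWithTimeout(ProcessorSubscription subscription) throws InterruptedException {
        CompletableFuture<Void> done = subscription.initShutdown();
        try {
            done.get(10, TimeUnit.SECONDS);                       // bounded wait
        } catch (TimeoutException e) {
            // shutdown still in progress; the caller decides what to do next
        } catch (ExecutionException e) {
            // shutdown finished exceptionally
        }
        done.thenRun(() -> System.out.println("subscription terminated"));
    }
}
```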
Makes sense. Thanks for explaining.
Ok, I misunderstood this; I wanted to reuse the same "shutdown" code between … (Side point: I also found it a bit confusing that …)
Force-pushed from 5f2f275 to 54b1b46.
Reworked with that idea applied.
Not sure. The original intention of making it a Thread is to enable running it independently without needing to get any "runtime" from users (e.g., an ExecutorService). Which is good so far, because most users use it by just calling …

Hm, while I still think the CF approach would be the most convenient and understandable for users, I think … However, I think we should enhance it a bit by adding a method that takes a timeout, like awaitShutdown(timeout). So we should:

… I think?
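A minimal sketch of what the timeout-taking variant could look like for the Thread-based design (this mirrors the join()-based shape discussed here, not necessarily the final code):

```java
import java.time.Duration;

class SubscriptionThreadSketch extends Thread {
    public void awaitShutdown() throws InterruptedException {
        join();                                          // unbounded: wait until the thread exits
    }

    public boolean awaitShutdown(Duration limit) throws InterruptedException {
        // join(0) would wait forever, so clamp a zero/negative limit to 1 ms.
        join(Math.max(1, limit.toMillis()));
        return !isAlive();                               // true only if shutdown completed in time
    }
}
```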
I think that makes sense. Do you want to treat that as a blocker for this PR or push it out as a separate task?
Force-pushed from 54b1b46 to d08c30d.
Might be okay to do in a separate PR, but I want it to be done before making a release.
Left some comments.
```diff
-public void awaitShutdown() throws InterruptedException {
-    join();
+public boolean awaitShutdown(Duration limit) throws InterruptedException {
+    join(limit.toMillis());
     metrics.close();
```
I think an aliveness check needs to be performed before this, because we should not close metrics while it's still running.
Hmm. I think the right thing is for the thread to call metrics.close() when it's finished everything else, since initiateShutdown() should cause the shutdown to happen and awaitShutdown should observe that process.
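In other words (a sketch of that arrangement, with placeholder names): the subscription thread owns its metrics and closes them as the very last step of run(), and awaitShutdown() only observes whether the thread has finished.

```java
class SubscriptionThreadOwningMetrics extends Thread {
    private final AutoCloseable metrics;

    SubscriptionThreadOwningMetrics(AutoCloseable metrics) {
        this.metrics = metrics;
    }

    @Override
    public void run() {
        try {
            // consume loop runs until the termination flag set by initiateShutdown()
        } finally {
            try {
                metrics.close();     // closed by the thread that used them, as the final step
            } catch (Exception e) {
                // log and swallow; metrics cleanup must not break shutdown
            }
        }
    }
}
```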
```java
metrics.close();
log.info("Subscription thread terminated: {}", getName());
if (isAlive()) {
    log.warn("Subscription thread didn't terminate within time limit: {}", getName());
```
hmm, I don't agree with writing such a log, because the caller of awaitShutdown(timeout) may supply a short timeout and maybe it is totally normal to end up without completing the shutdown process. It's a matter for the caller, not ours.
Happy to lower it to info, but I think we do want to know which case applied, and it's not always clear from outside (we return a boolean, but that doesn't necessarily tell us the details of which subpartition didn't terminate).
"which subpartition" - you mean which subscription?
Can we check the "absence" of the below log instead?

decaton/processor/src/main/java/com/linecorp/decaton/processor/runtime/ProcessorSubscription.java, line 256 in e0b209d:

```java
log.info("ProcessorSubscription {} terminated in {} ms", scope,
```
Anyway, my opinion on such logging is that it's allowed only when:

- the interface intends to complete the whole shutdown process within the method call (either with or without timeout) - I think awaitShutdown doesn't intend that (it observes shutdown for the given amount of time and just returns the result) - or
- termination failed with an obvious error (e.g., an exception).

Otherwise, we're putting an assumption on it that this is the single last invocation from the user and they'll leak some resources if we don't complete everything here, which might not be the case, making it inconsistent with the interface design I think.

Still fine to have such logging, but debug level is appropriate I think? (if the above "absence" check works)
```diff
 for (ProcessorUnit unit : units) {
-    unit.awaitShutdown();
+    final Duration unitLimit = Duration.between(Instant.now(), absLimit);
+    clean &= unit.awaitShutdown(unitLimit.isNegative() ? Duration.ZERO : unitLimit);
 }
 Utils.runInParallel(
```
Hm, I think here too: if !clean, then we should not proceed to this process.
Hmm. I think this can only be done in a blocking close (or an indefinite awaitTermination), because it has to be done after shutdown success.
```diff
-public void awaitShutdown() throws InterruptedException {
-    executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
+public boolean awaitShutdown(Duration limit) throws InterruptedException {
+    final boolean clean = executor.awaitTermination(limit.toMillis(), TimeUnit.MILLISECONDS);
     metrics.close();
```
and this too.
```java
asyncProcessingStarted.await();
subscription.initiateShutdown();
assertTrue(consumer.committed(singleton(tp)).isEmpty());
letTasksComplete.countDown();
```
hmm, is this working as a case for this? #79 (comment)

I think in this implementation, the 2 tasks are first fed into DecatonProcessor anyway (because they return immediately) and Decaton's internal queue quickly becomes empty, so it isn't testing that "Decaton processes tasks that are fetched but still in the queue at the beginning of shutdown", I think?
The first task can't complete until letTasksComplete.countDown() is called, so the second task will still be queued. I'll add an assert that checks they're still pending at this point?
Decaton gives the 2nd task to DecatonProcessor#process even when the 1st task hasn't yet completed, though? When a processor uses DeferredCompletion, it is the processor's responsibility to serialize execution of multiple tasks with the same key after #process returns.
Ah indeed. I've added blocking in the synchronous part of processing as well so that we test that case.
That's why I thought the CF style works well for such an interface. I've tried to mock up my idea.

```java
public interface AsyncShutdownable extends AutoCloseable {
    CompletableFuture<Void> beginShutdown();

    /**
     * Shut down, blocking until shutdown is complete
     */
    @Override
    default void close() throws Exception {
        beginShutdown().get();
    }
}

public class ProcessorSubscription extends Thread implements AsyncShutdownable {
    ...
    public void run() {
        ...
        termFuture.complete(null);
        updateState(SubscriptionStateListener.State.TERMINATED);
    }

    private final CompletableFuture<Void> termFuture = new CompletableFuture<>();
    private CompletableFuture<Void> shutdownFuture;

    @Override
    public synchronized CompletableFuture<Void> beginShutdown() {
        if (!terminated.get()) {
            shutdownFuture = termFuture.whenComplete((unused, throwable) -> {
                metrics.close();
            });
            terminated.set(true);
        }
        return shutdownFuture;
    }
}

public class PartitionProcessor implements AsyncShutdownable {
    ...
    private CompletableFuture<Void> shutdownFuture;

    @Override
    public synchronized CompletableFuture<Void> beginShutdown() {
        if (shutdownFuture == null) {
            CompletableFuture[] cfs = new CompletableFuture[units.size()];
            int i = 0;
            for (ProcessorUnit unit : units) {
                try {
                    cfs[i] = unit.beginShutdown();
                } catch (RuntimeException e) {
                    logger.error("Processor unit threw exception on shutdown", e);
                }
                i++;
            }
            try {
                rateLimiter.close();
            } catch (Exception e) {
                logger.error("Error thrown while closing rate limiter", e);
            }
            shutdownFuture = CompletableFuture.allOf(cfs).thenCompose(
                    unused -> Utils.runInParallel(
                            "DestroyThreadScopedProcessors",
                            IntStream.range(0, units.size())
                                     .mapToObj(this::destroyThreadProcessorTask)
                                     .collect(toList())));
        }
        return shutdownFuture;
    }
}
```

By using CF, we can manage a multi-step shutdown sequence within a single CF instance that is returned by beginShutdown(). However, one issue turned up while implementing it for ProcessorUnit:

```java
public class ProcessorUnit implements AsyncShutdownable {
    ...
    private CompletableFuture<Void> shutdownFuture;

    @Override
    public synchronized CompletableFuture<Void> beginShutdown() {
        if (shutdownFuture == null) {
            terminated = true;
            pipeline.close();
            CompletableFuture<Void> executorShutdown = new CompletableFuture<>();
            shutdownFuture = executorShutdown.whenComplete((unused, throwable) -> metrics.close());
            executor.execute(() -> executorShutdown.complete(null));
            executor.shutdown();
        }
        return shutdownFuture;
    }
}
```

I'm not sure if that way suits us best, because it certainly introduces some complexity, especially unpredictability about who's calling the shutdown method of an object (e.g., who's calling …). Maybe a name like … would express it better.
left some comments.
Yeah. I tried to stick to mirroring what … does.

Agreed, and that's how I've implemented it in the current version of this branch.
Maybe a bit late, but let me confirm my understanding of the initial intention and the current patch.

Then, I still don't get why we should have … And, I'm also wondering about the reason to add … So, I think we should simply do the following:

…

What do you think?
With this change:

As-is: if we just change the behaviour of …

To implement a "time-bounded await shutdown" for …
I see, thanks for the explanation. As a summary, my thoughts about this PR:

Hm, I'm not sure we need to implement a bounded version of awaitShutdown. And, for ProcessorUnit, if a task being processed at shutdown takes a long time to process (which is possible because users can implement arbitrary logic in DecatonProcessor#process), …
It's possible, but it's very rare for users to do this; it means you have to spawn a thread during shutdown to implement the timeout, and then interrupt your shutdown thread, which is complex to reason about. I'd say that in a Java library it's normal good practice that whenever you implement a method that could block indefinitely, you offer a variant that takes a time limit as well, as with Thread#join or ExecutorService#awaitTermination.

Like I said to @kawamuray in my earlier comment (#79 (comment)), I think it's a logically separate change, so I'd be happy to split it into a separate PR. But I do think it makes sense.
Ah yeah, that's true, my mistake. Still, the current implementation works by having a polling loop in …
True in the current implementation - ideally I'd like to implement asynchronous processors by returning a CompletableFuture instead.
H-m, this discussion is getting a bit hard to follow across all the different ideas and their intentions (I didn't expect the way to implement the async shutdown interface to get this controversial :p). IMO, the interface that provides "initiate shutdown sequence and asynchronously observe the completion" must:

…
Let me try to summarize my opinion on each proposed idea so that it might be clearer to you what I'm thinking and preferring. If you don't mind, can you please follow the same style and summarize pros/cons for each idea from your point of view (I especially want to hear the cons you see for Idea 1 and 2), so I might be able to understand your intention better (sorry if I missed any idea - please complement in that case).

Idea 1: Use CompletableFuture

The one I showed a POC for in this comment. Please also see #89 for the full POC.

Pros: …
Cons: …

Idea 2: Use AsyncShutdownable as-is, but adding awaitShutdown(timeout)

Pros: …
Cons: …

Idea 3: Use AsyncShutdownable as-is without adding awaitShutdown(timeout)

Pros: …
Cons: …

Idea 4: Use AsyncShutdownable as-is, but adding awaitShutdown(timeout) which is just an "observer" (the current state of this branch)

Cons: …

Besides the above comments, as @okadaruma briefly mentioned, I think it is an option too to introduce a separate interface from the internal-only AsyncShutdownable.
@m50d Thanks for the explanation. I understood the intention, and now things look much more reasonable. To provide an "initiate shutdown" I/F, I vote for Idea 1, for almost the same reasons as you described.

Idea 1: Use CompletableFuture

Cons: …

Idea 2: Use AsyncShutdownable as-is, but adding awaitShutdown(timeout)

Cons: …
I'm not actively against the CF approach. I don't really see this as a choice between alternatives so much as a series of steps: A) make …

My big concern about a CF-style API was that it didn't play nice with shutting down …
Thanks for the follow-up.

True, but it's not likely to happen in the case of …
AsyncShutdownable is still a bit confusing as an interface, due to the co-existence of CF-style and existing-style APIs plus the necessity of calling initiateShutdown even though we just want to kick off the shutdown and get the CF back. However, it might be a good compromise between compatibility and fulfilling the required functionality at this moment. We might get a chance to reorganize these interfaces and internals with some breaking changes (deprecation) when we release the next major version.
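For reference, one possible shape of that compromise (a guess at the shape being described, not necessarily the merged interface): the existing initiate/await style coexists with a CompletableFuture view of termination, and the blocking close() is just "initiate then wait".

```java
import java.util.concurrent.CompletableFuture;

interface AsyncShutdownableSketch extends AutoCloseable {
    void initiateShutdown();                      // kick off the shutdown sequence (existing style)

    CompletableFuture<Void> shutdownFuture();     // CF-style view: completes when shutdown finishes

    @Override
    default void close() throws Exception {
        initiateShutdown();
        shutdownFuture().get();                   // blocking close = initiate + wait
    }
}
```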
LGTM, thanks 👍
Thanks for reworking.
Added a comment.
```diff
@@ -83,13 +91,15 @@ private void processTask(TaskRequest request) {
     @Override
     public void initiateShutdown() {
         terminated = true;
         pipeline.close();
```
I think we should close the pipeline in initiateShutdown, to avoid causing unnecessary delay on shutdown.

Current (and expected) behavior:

1. ProcessorUnit#initiateShutdown()
2. ProcessPipeline#close()
   - the pipeline's termination flag is on, so processing further tasks will be skipped (return immediately without completing the DeferredCompletion)
   - ExecutionScheduler is also closed, so sleep() returns immediately
We discussed this earlier on this PR - if we skip further tasks, then in general (with out-of-order commit) some tasks will have completed but their offsets will not be able to be committed. With this implementation we treat it similarly to a rebalance (so max.poll.records needs to be small enough to be processed within the time limit - but that's already the case when we rebalance).
I suppose the discussion is this, right? #79 (comment)

My point is rather different. The current patch changes the order of calling ProcessPipeline#close() like below:

AS-IS: ProcessorUnit#initiateShutdown() => ProcessPipeline#close() => ProcessorUnit#awaitShutdown()
TO-BE: ProcessorUnit#initiateShutdown() => submit ProcessPipeline#close() to the executor (i.e. processed after all current tasks in the queue finish) => ProcessorUnit#awaitShutdown()

As long as pending tasks are finished within shutdown.timeout.ms, that should be fine.
But things are different if waiting for remaining tasks exceeds the timeout (e.g. due to long per-task duration or a large max.pending.records or both).

By calling ProcessPipeline#close() first, tasks are skipped (i.e. not passed to the DecatonProcessor) without completing them, so ProcessorUnit#awaitShutdown() returns almost immediately (except for a task which is being processed right now).
If we call ProcessPipeline#close() last, all pending tasks are passed to the DecatonProcessor even AFTER shutdown.timeout.ms has passed, which could make the shutdown time unreasonably long.

Expected behavior should be like below:

1. ProcessorSubscription#initiateShutdown()
2. waitRemainingTasks up to shutdown.timeout.ms
3. After shutdown.timeout.ms, if there are still remaining tasks, we proceed with shutdown without waiting for them (so re-processing is likely to happen)
4. Complete the shutdown sequence.

I think that's the reason why @kawamuray calls ProcessPipeline#close() at the time of initiating shutdown in his PoC (https://github.com/line/decaton/pull/89/files#diff-505ddd33c135748a163c4146ae63acb3d7c40a754a7f2bae4015afbb2e956c1aR89).
Ah indeed, sorry for misunderstanding your point. I guess we need to implement something like a shutdownNow() path for forcibly terminating if we've reached the timeout. Will have a look.
Hmm, I don't think the thing is that complicated - just calling ProcessPipeline#close() at the beginning of ProcessorUnit#initiateShutdown should be fine?
I have sketched out an implementation, but I'm not sure if it's the right approach - PTAL.
H-m, I still don't get the point of adding shutdownNow() to AsyncShutdownable and implementing it.

What we expect of ProcessorSubscription#close() is "to wait for remaining tasks up to shutdown.timeout.ms, after that proceed with the shutdown sequence and then close all processors and all other resources gracefully", rather than providing another path (shutdownNow()) to proceed with shutdown forcibly.

So the only problem with the patch (as I commented in #79 (comment)) was just the timing of calling ProcessPipeline#close() in ProcessorUnit#initiateShutdown(), which should be right after setting terminate = true.
Ah, sorry, yeah, I see - I had forgotten that this isn't implemented by delegation, so ProcessorUnit#initiateShutdown() doesn't get called until after the wait has finished. Will do it that way.
Force-pushed from 68f5088 to cd84d5e.
Thanks for the fix! LGTM
Possible approach to #78

Keep track of in-flight async processing in ProcessPipeline via a Semaphore with a very large number of permits, and wait for the same number of permits on close.

Advantages:

- Hooks into the existing whenComplete logic, no need to add extra tracking to async processors
- No need to propagate initiateShutdown and awaitShutdown up the whole object graph.

Disadvantages:

- ExecutionScheduler doesn't have any visibility over when tasks actually finish
- Could block indefinitely if a processor calls deferCompletion and then never completes at all.
- Should it belong to Scope?

WDYT?
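A self-contained sketch of the Semaphore idea (hypothetical names; the real wiring into ProcessPipeline and its whenComplete callback would differ):

```java
import java.util.concurrent.Semaphore;

class InFlightTracker implements AutoCloseable {
    private static final int MAX_IN_FLIGHT = Integer.MAX_VALUE;   // "very large number of permits"
    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

    void taskStarted() throws InterruptedException {
        permits.acquire();                        // one permit held per in-flight completion
    }

    void taskCompleted() {
        permits.release();                        // e.g. hooked into the async whenComplete callback
    }

    @Override
    public void close() throws InterruptedException {
        // Blocks until every started task has completed; a bounded variant could use
        // tryAcquire(MAX_IN_FLIGHT, timeout, unit). As noted above, this waits forever
        // if a processor defers completion and never completes it.
        permits.acquire(MAX_IN_FLIGHT);
        permits.release(MAX_IN_FLIGHT);           // restore permits so the tracker stays reusable
    }
}
```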
?WDYT?