Support async commit for ExchangeSink #10699

arhimondr · 2022-01-20T01:03:07Z

This came up during a discussion with @linzebing. It looks like the noMorePages and destroy methods can currently be called from a tiny thread pool designed to handle lightweight task notifications, for example:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L640
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L568
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L602

Where the notificationExecutor is shared between all tasks and by default only has 5 threads in the pool: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L79

The commit operation on ExchangeSink can be quite time consuming (it may need to flush existing buffers, create files and so on), so a blocking call could tie up one of those few notification threads. It therefore looks better to provide a non-blocking ExchangeSink interface.
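To illustrate the idea, here is a minimal sketch of a sink whose finish returns a future instead of blocking the caller. The names AsyncSink and SlowCommitSink and the dedicated commit executor are assumptions made for this example, not the actual Trino SPI or any real ExchangeSink implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical, simplified stand-in for the sink interface discussed here.
interface AsyncSink
{
    // Returns immediately; the future completes once the commit is done.
    CompletableFuture<Void> finish();
}

// Illustrative implementation: the slow commit runs on a separate executor
// (an assumption of this sketch), so the calling thread is never blocked.
class SlowCommitSink
        implements AsyncSink
{
    private final Executor commitExecutor;
    private volatile boolean committed;

    SlowCommitSink(Executor commitExecutor)
    {
        this.commitExecutor = commitExecutor;
    }

    @Override
    public CompletableFuture<Void> finish()
    {
        return CompletableFuture.runAsync(() -> {
            // Placeholder for flushing buffers, creating files and so on.
            committed = true;
        }, commitExecutor);
    }

    boolean isCommitted()
    {
        return committed;
    }
}
```

A caller running on a small notification pool can attach a callback to the returned future and return right away, rather than holding one of the pool's threads for the duration of the commit.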

I will update the commit message.

losipiuk · 2022-01-20T11:26:45Z

Can you provide some rationale? Would be nice to have it in commit message anyway.

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

losipiuk · 2022-01-20T11:34:03Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

+import static io.trino.execution.buffer.BufferState.OPEN;
+import static io.trino.execution.buffer.BufferState.TERMINAL_BUFFER_STATES;
+
+public class OutputBufferStateMachine


It looks like we are not using the boolean stateChanged return values most of the time. Would it make sense to return void for methods where we do not care about the returned value?

We need a boolean for most of the methods. The return value is not used for noMoreBuffers and fail. But I thought it might be better to be consistent with other methods.

losipiuk · 2022-01-20T11:45:12Z

core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java

+                    .orElseGet(() -> new TrinoException(GENERIC_INTERNAL_ERROR, "Output buffer is failed but the failure cause is missing"));
+            taskStateMachine.failed(failureCause);
+            return;
+        }


What about ABORTED? Why is it not expected here? Worth a comment?

core/trino-main/src/main/java/io/trino/execution/buffer/BufferState.java

losipiuk

LGTM

arhimondr · 2022-01-21T17:03:16Z

Can you provide some rationale? Would be nice to have it in commit message anyway.

This came up during a discussion with @linzebing. It looks like the noMorePages and destroy methods can currently be called from a tiny thread pool designed to handle lightweight task notifications, for example:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L640
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L568
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java#L602

Where the notificationExecutor is shared between all tasks and by default only has 5 threads in the pool: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L79

The commit operation on ExchangeSink can be quite time consuming (it may need to flush existing buffers, create files and so on), so a blocking call could tie up one of those few notification threads. It therefore looks better to provide a non-blocking ExchangeSink interface.

I will update the commit message.

losipiuk

@martint you may want to look at changes in BufferState

losipiuk · 2022-01-21T18:10:41Z

@martint you may want to look at changes in BufferState

or @sopel39 / @findepi maybe :)

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/BroadcastOutputBuffer.java

sopel39 · 2022-01-25T13:42:16Z

core/trino-main/src/main/java/io/trino/execution/buffer/BroadcastOutputBuffer.java

@@ -412,7 +407,7 @@ private void noMoreBuffers()

    private void checkFlushComplete()
    {
-        if (state.get() != FLUSHING && state.get() != NO_MORE_BUFFERS) {


This should probably be BufferState state = stateMachine.get
and then you should perform the check on that local value. Otherwise the state could move from NO_MORE_BUFFERS to FLUSHING between the two stateMachine.getState() calls, which seems racy

Yeah, it does seem weird. I also thought about that. I don't know exactly why it is implemented this way. At the end of the day I decided not to touch it and to keep the change as close to mechanical as possible.

Still, I think this should be fixed (in a separate commit). I can imagine the state transitioning from NO_MORE_BUFFERS to FLUSHING between the two reads, and this method then destroying the buffers

It's been like that for a very long time. It doesn't seem to be likely that the implementation is incorrect. But I agree, it's super confusing. Let me add a commit that simplifies it.
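For reference, the single-read pattern suggested above could look roughly like the sketch below. The FlushCheck class and its reduced State enum are made up for this illustration and are not the actual BroadcastOutputBuffer code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, reduced model of a buffer state holder.
class FlushCheck
{
    enum State { OPEN, NO_MORE_BUFFERS, FLUSHING, FINISHED }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);

    void set(State newState)
    {
        state.set(newState);
    }

    // Reads the state once, so both comparisons see the same value even if
    // another thread changes the state concurrently.
    boolean mayCompleteFlush()
    {
        State current = state.get();
        return current == State.FLUSHING || current == State.NO_MORE_BUFFERS;
    }
}
```

With two separate reads, a transition from NO_MORE_BUFFERS to FLUSHING between them could make both comparisons fail even though each state on its own would have passed the check.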

core/trino-main/src/main/java/io/trino/execution/buffer/LazyOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/PartitionedOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/ArbitraryOutputBuffer.java

sopel39 · 2022-01-25T14:11:21Z

core/trino-main/src/main/java/io/trino/execution/SqlTaskExecution.java

+        }
+
+        // The only terminal state that remains is ABORTED.
+        // Buffer is expected to be aborted only if the task itself is aborted. In this scenario the following statement is expected to be noop.


In this scenario the following statement is expected to be noop.

why? because task is aborted so this line should never execute?

Failing an aborted task is a noop, as the ABORTED state is a terminal state.
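As a rough illustration of why that is a noop, here is a toy state holder that refuses any transition out of a terminal state. TinyTaskState is a hypothetical class written for this example; the real task state machine is more involved.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical miniature of a task state machine with terminal states.
class TinyTaskState
{
    enum State
    {
        RUNNING(false), FAILED(true), ABORTED(true);

        final boolean terminal;

        State(boolean terminal)
        {
            this.terminal = terminal;
        }
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

    boolean abort()
    {
        return transition(State.ABORTED);
    }

    boolean fail()
    {
        return transition(State.FAILED);
    }

    State get()
    {
        return state.get();
    }

    // Moves to the target state unless the current state is already terminal.
    private boolean transition(State target)
    {
        while (true) {
            State current = state.get();
            if (current.terminal) {
                return false;
            }
            if (state.compareAndSet(current, target)) {
                return true;
            }
        }
    }
}
```

Calling fail() after abort() returns false and leaves the state at ABORTED.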

sopel39 · 2022-01-25T14:13:24Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

+    {
+        requireNonNull(throwable, "throwable is null");
+
+        failureCause.compareAndSet(null, throwable);


verify that failureCause is not overwritten (it is null)?

This method is allowed to be called multiple times (similar to how it is implemented in other state machines). The contract is that the method has to preserve only the first failure that made the transition.

Could you add a comment: the method has to preserve only the first failure that made the transition.?

The code seems to be self explanatory and aligns with what is done in other state machines. How strongly do you feel about having an explicit comment here?
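For readers less familiar with the idiom, a small sketch of the "first failure wins" contract follows. FirstFailureHolder is a hypothetical class used only to isolate the compareAndSet(null, ...) behaviour.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder that keeps only the first reported failure.
class FirstFailureHolder
{
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();

    void fail(Throwable throwable)
    {
        Objects.requireNonNull(throwable, "throwable is null");
        // Succeeds only while no cause is recorded; later calls are ignored.
        failureCause.compareAndSet(null, throwable);
    }

    Throwable getFailureCause()
    {
        return failureCause.get();
    }
}
```

Because the expected value is null, the method can safely be called any number of times, from any thread, without overwriting the original cause.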

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

sopel39 · 2022-01-25T14:25:20Z

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSink.java


    /**
     * Notifies the exchange that the write operation has been aborted
+     *
+     * @return future that will be resolved when the abort operation either succeeds or fails


What should abort do when finish is already running?
What should abort do when another abort is running?

What should abort do when finish is already running?

I think it can be implementation specific. The implementation may decide to keep the finish running, or may decide to cancel finish and abort. It doesn't really make a difference from the engine perspective.

What should abort do when another abort is running?

Same here. It is implementation specific. As long as the sink is properly invalidated, the engine doesn't really care what happens underneath, as the task is already aborted / failed anyway. Regardless, I guess it is better to make the abort method idempotent. I will change it to first transition the buffer to the ABORTED state and then call ExchangeSink#abort, the result of which is technically ignored anyway (abort is only called when the task itself is failed or aborted).
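A sketch of the "transition first, then abort the sink" ordering is shown below. IdempotentAbortBuffer, its boolean flag and its call counter are simplifications invented for this example; the actual buffer uses a full state machine and a real sink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical buffer that forwards abort to its sink at most once.
class IdempotentAbortBuffer
{
    private final AtomicBoolean aborted = new AtomicBoolean();
    private final AtomicInteger sinkAbortCalls = new AtomicInteger();

    void abort()
    {
        // Only the caller that wins the state transition reaches the sink.
        if (!aborted.compareAndSet(false, true)) {
            return;
        }
        // The returned future is deliberately ignored in this sketch.
        sinkAbort();
    }

    // Stand-in for the sink's asynchronous abort.
    private CompletableFuture<Void> sinkAbort()
    {
        sinkAbortCalls.incrementAndGet();
        return CompletableFuture.completedFuture(null);
    }

    int getSinkAbortCalls()
    {
        return sinkAbortCalls.get();
    }
}
```

Repeated or concurrent abort calls then reach the sink only once, so the sink implementation does not need to handle overlapping aborts itself.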

sopel39 · 2022-01-25T14:25:38Z

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSink.java

     */
-    void finish();
+    CompletableFuture<?> finish();


what should finish do when abort is already running? Contract is undefined here
what should finish do when another finish is running?

what should finish do when abort is already running?

finish should never be called after abort. If it is - it's a bug. Let me document it.

Contract is undefined here what should finish do when another finish is running?

finish shouldn't be called when another finish is running. If it is - it's a bug.

Updated the Javadoc.

sopel39 · 2022-01-25T14:29:06Z

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

-            finally {
-                updateMemoryUsage(exchangeSink.getMemoryUsage());
-            }
+        if (stateMachine.getState().canAddPages()) {


Is it racy with setNoMorePages? E.g. abort can be called after setNoMorePages started running finish

setNoMorePages starts running finish after transitioning the state. If destroy is called before setNoMorePages it means that the task got cancelled prematurely and the buffer has to be invalidated. If there's a race and setNoMorePages is called at the same time when destroy is called it is legit to finish the sink, as the data written to the sink at that point is complete.

I don't fully understand. If you say this race is fine, then why do we need the if (stateMachine.getState().canAddPages()) { check here? In case of a race (between setNoMorePages and destroy), it would be as if this check did not exist.

In normal flow there shouldn't be a race. When the output is completely written and the setNoMorePages is called the task is only finished after ExchangeSink#finish is done and the buffer is transitioned to the FINISHED state. When the task itself is transitioned to FINISHED the destroy method is called and we don't want the sink to be aborted under normal circumstances. That's why there's a check.

However, a race is possible when all the data is written but the task is cancelled before ExchangeSink#finish is completed. This shouldn't happen in practice, as the scheduler is not expected to cancel tasks that are writing to a spooling exchange. However, from the interface perspective it is possible. I was thinking about the best way to handle this situation. When the output is complete and the task is cancelled, the output itself is valid, so letting it finish should be perfectly fine. However, sending an "abort" to the sink gives the ExchangeSink implementation a chance to cancel the commit if possible.

Discussed offline.

Removing the check to ensure abort is always called if the finish hasn't succeeded.

linzebing · 2022-01-26T19:05:58Z

It feels that abort doesn't have to be blocking, as we can just abort the multi part upload asynchronously.

arhimondr · 2022-01-26T19:31:59Z

It feels that abort doesn't have to be blocking, as we can just abort the multi part upload asynchronously.

Currently it is not blocking. It returns a future and the OutputBuffer doesn't wait for it; it only logs an exception if one occurred.
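Roughly, the "don't wait, only log" handling can be sketched as follows. AbortLogging and the Consumer-based logger are assumptions made for this example; the real buffer would use its own logger.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical helper showing fire-and-forget handling of an abort future.
class AbortLogging
{
    // Returns immediately; the callback runs whenever the future completes.
    static void abortWithoutWaiting(CompletableFuture<?> abortFuture, Consumer<Throwable> failureLogger)
    {
        abortFuture.whenComplete((ignored, failure) -> {
            if (failure != null) {
                failureLogger.accept(failure);
            }
        });
    }
}
```

The caller never joins the future, so a slow abort (for example, cancelling a multipart upload) does not hold up the thread that requested it.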

linzebing · 2022-01-27T19:06:21Z

Need to wait for futures to complete here https://github.com/trinodb/trino/blob/master/testing/trino-testing/src/main/java/io/trino/testing/AbstractTestExchangeManager.java#L167,L170

arhimondr · 2022-01-27T19:57:45Z

Need to wait for futures to complete here https://github.com/trinodb/trino/blob/master/testing/trino-testing/src/main/java/io/trino/testing/AbstractTestExchangeManager.java#L167,L170

Good catch
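One way the test could wait is a small helper with a bounded timeout, sketched below. FutureAwait and the 10 second limit are arbitrary choices for this example, not what AbstractTestExchangeManager actually does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper that blocks until an asynchronous operation is done.
class FutureAwait
{
    static void awaitDone(CompletableFuture<?> future)
    {
        try {
            // A bounded wait keeps a stuck commit from hanging the test forever.
            future.get(10, TimeUnit.SECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        catch (ExecutionException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Without such a wait, assertions on the written data could run before the asynchronous finish has completed.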

arhimondr · 2022-01-28T04:51:31Z

Rebased on top of #10507

Applied necessary changes to DeduplicatingDirectExchangeBuffer. @losipiuk @sopel39 @linzebing Please take a look

sopel39

lgtm % comments

sopel39 · 2022-01-28T11:05:36Z

core/trino-main/src/main/java/io/trino/operator/DirectExchangeClient.java

+            log.warn(e, "error closing buffer");
+        }
+        finally {
+            memoryContext.setBytes(0);


nit: can we have a test for this?

That would probably require creating Exchange mocks that can throw an exception on close. I wonder if it's worth it given that we don't have memory counting tests even for happy path scenarios.

sopel39 · 2022-01-28T11:58:31Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

+    {
+        requireNonNull(throwable, "throwable is null");
+
+        failureCause.compareAndSet(null, throwable);


Could you add a comment: the method has to preserve only the first failure that made the transition.?

sopel39 · 2022-01-28T11:59:16Z

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBufferStateMachine.java

+
+        failureCause.compareAndSet(null, throwable);
+        return state.setIf(FAILED, oldState -> !oldState.isTerminal());
+    }


failure cause can be set before state transitions to FAILED. Are we sure that won't cause any troubles?

The failureCause is only expected to be inspected when the buffer is in the FAILED state. If the buffer transitioned to ABORTED in the meantime, the failure cause is not expected to be queried.
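To make the ordering argument concrete, here is a reduced sketch. FailureOrdering, its three-state enum and the failureCauseIfFailed accessor are hypothetical and exist only for this illustration; the real OutputBufferStateMachine has more states and a different API.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical reduction of a buffer state machine with a failure cause.
class FailureOrdering
{
    enum State { OPEN, FAILED, ABORTED }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();

    boolean fail(Throwable throwable)
    {
        // Record the cause first, so a reader that observes FAILED always finds one.
        failureCause.compareAndSet(null, throwable);
        return state.compareAndSet(State.OPEN, State.FAILED);
    }

    boolean abort()
    {
        return state.compareAndSet(State.OPEN, State.ABORTED);
    }

    // The cause is surfaced only in the FAILED state; a cause recorded by a
    // fail() call that lost the race to abort() is never reported.
    Optional<Throwable> failureCauseIfFailed()
    {
        if (state.get() != State.FAILED) {
            return Optional.empty();
        }
        return Optional.ofNullable(failureCause.get());
    }
}
```

Setting the cause after the transition instead would open a window in which a reader sees FAILED but finds no cause, which is the situation the "failure cause is missing" fallback in SqlTaskExecution guards against.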

core/trino-main/src/main/java/io/trino/execution/buffer/OutputBuffer.java

sopel39 · 2022-01-28T12:03:45Z

core/trino-main/src/main/java/io/trino/execution/buffer/SpoolingExchangeOutputBuffer.java

-                updateMemoryUsage(exchangeSink.getMemoryUsage());
-            }
-        }
+        // Abort the buffer if it hasn't been finished. This is possible when a task is cancelled early by the coordinator.


that description is confusing:

This is possible when a task is cancelled early by the coordinator.

and

Task cancellation is not supported as the task output is expected to be deterministic.

Both can't be true at same time, right?

Task cancellation is not expected to be requested by the coordinator. It can only be requested if there's a bug in the scheduler. However, if this situation happens (e.g. due to a bug), it is safer to invalidate the buffer with abort to avoid publishing incomplete data to the exchange service.

Added one more sentence to elaborate it.