Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document TableFunctionSplitProcessor thread-safety #16955

Conversation

findepi
Copy link
Member

@findepi findepi commented Apr 11, 2023

No description provided.

@findepi findepi requested review from homar, losipiuk, ebyhr and kasiafi April 11, 2023 10:05
@cla-bot cla-bot bot added the cla-signed label Apr 11, 2023
findepi referenced this pull request Apr 11, 2023
Rename the existing TableFunctionProcessor to TableFunctionDataProcessor,
and introduce another interface for processing splits.
@findepi findepi force-pushed the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch from e2cf177 to f37a79b Compare April 11, 2023 11:06
@findepi findepi force-pushed the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch from f37a79b to ffe92d9 Compare April 11, 2023 11:09
@findepi
Copy link
Member Author

findepi commented Apr 11, 2023

CI #16882

* for a {@link ConnectorTableFunctionHandle}.
* <p>
* Thread-safety: implementations do not have to be thread-safe. The {@link #process} method may be called from
* multiple threads, but will never be called from two threads at the same time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean that implementation of TableFunctionSplitProcessor need to ensure memory visibility of internal data structures which may be changed by one thread and then accessed by the other. Or are we sure this is ensured by the caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formally we would call on JLS's happen-before semantics and say that there is a happens-before relation between previous method call end and the next method call... i didn't want to be very formal here. but, i wanted to indicate the implementor should not take note of things iike thread id, or use ThreadLocal internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, to reason about thread-safety, we must consider both TableFunctionSplitProcessor and TableFunctionProcessorProvider.
The LeafTableFunctionOperator calls TableFunctionProcessorProvider.getSplitProcessor(session, handle) for each split it has. The function author implements the TableFunctionProcessorProvider and they can decide on the lifecycle of the Processor. One extreme would be to keep a single TableFunctionSplitProcessor, and return it from each call to the provider -- and deal with multiple threads. The other extreme is to instantiate a new TableFunctionSplitProcessor for each call to the provider. The latter is easy and clear, and imo should be considered the default approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LeafTableFunctionOperator calls TableFunctionProcessorProvider.getSplitProcessor(session, handle) for each split it has.

Can it provide a split already in this method call?
it would make it clear the processor serves one split.

(i understand we wanted the leaf processor to be similar to intermediate processor, but reality is that a function implementor implements only one of them at the same time, so making things just simpler would be beneficial)

then the TableFunctionSplitProcessor just provides Pages until it's done. So becomes equivalent to ConnectorPageSource. Maybe we could reuse that interface?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just throwing in my 2 cents that I also found this part of the interface a bit unintuitive when reviewing @homar's CDF table function implantation, specifically that process is initially called with a Split and then continues to be called with null arguments. It's something you only need to learn once, but at first glance I was expecting this to work more like a ConnectorPageSource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it provide a split already in this method call?

It totally makes sense to call TableFunctionProcessorProvider.getSplitProcessor(session, handle, split), and remove the split argument from the TableFunctionSplitProcessor.process method. It should simplify the operator a lot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will take a stab.

What about replacing TableFunctionSplitProcessor with ConnectorPageSource, as a consequence of that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one seems to be designed for reading input. We'd have to think about how we implement getReadTimeNanos() etc. so that it makes sense for a particular table function. Or maybe subclass to ensure that those methods cannot be used. Even though they are never used anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @param split a {@link ConnectorSplit} representing a subtask.
* @param split a {@link ConnectorSplit} representing a subtask, or {@code null} if a split has already started to be processed,
* and the implementation returned a {@link TableFunctionProcessorState.Processed} with
* {@link TableFunctionProcessorState.Processed#isUsedInput()} being {@code true}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm - this is interesting contract :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, the states were primarily designed for the TableFunctionDataProcessor.
That processor gets one portion of data at a time, and it declares isUsedInput() when the portion is ingested. When it has ingested all the due data, it gets null until it's finished.
In case of the TableFunctionSplitProcessor, instead of input data, we have a Split. It is presented to the Processor as a single portion of data. If the processor declares isUsedInput(), it gets null in subsequent calls until it's finished.

@findepi findepi merged commit f2c1fcc into trinodb:master Apr 11, 2023
@findepi findepi deleted the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch April 11, 2023 21:05
@github-actions github-actions bot added this to the 413 milestone Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants