-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document TableFunctionSplitProcessor thread-safety #16955
Document TableFunctionSplitProcessor thread-safety #16955
Conversation
Rename the existing TableFunctionProcessor to TableFunctionDataProcessor, and introduce another interface for processing splits.
core/trino-spi/src/main/java/io/trino/spi/ptf/TableFunctionSplitProcessor.java
Show resolved
Hide resolved
e2cf177
to
f37a79b
Compare
f37a79b
to
ffe92d9
Compare
CI #16882 |
* for a {@link ConnectorTableFunctionHandle}. | ||
* <p> | ||
* Thread-safety: implementations do not have to be thread-safe. The {@link #process} method may be called from | ||
* multiple threads, but will never be called from two threads at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean that implementation of TableFunctionSplitProcessor
need to ensure memory visibility of internal data structures which may be changed by one thread and then accessed by the other. Or are we sure this is ensured by the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formally we would call on JLS's happen-before
semantics and say that there is a happens-before
relation between previous method call end and the next method call... i didn't want to be very formal here. but, i wanted to indicate the implementor should not take note of things iike thread id, or use ThreadLocal internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, to reason about thread-safety, we must consider both TableFunctionSplitProcessor
and TableFunctionProcessorProvider
.
The LeafTableFunctionOperator
calls TableFunctionProcessorProvider.getSplitProcessor(session, handle)
for each split it has. The function author implements the TableFunctionProcessorProvider
and they can decide on the lifecycle of the Processor. One extreme would be to keep a single TableFunctionSplitProcessor
, and return it from each call to the provider -- and deal with multiple threads. The other extreme is to instantiate a new TableFunctionSplitProcessor
for each call to the provider. The latter is easy and clear, and imo should be considered the default approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
LeafTableFunctionOperator
callsTableFunctionProcessorProvider.getSplitProcessor(session, handle)
for each split it has.
Can it provide a split already in this method call?
it would make it clear the processor serves one split.
(i understand we wanted the leaf processor to be similar to intermediate processor, but reality is that a function implementor implements only one of them at the same time, so making things just simpler would be beneficial)
then the TableFunctionSplitProcessor
just provides Pages until it's done. So becomes equivalent to ConnectorPageSource
. Maybe we could reuse that interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just throwing in my 2 cents that I also found this part of the interface a bit unintuitive when reviewing @homar's CDF table function implantation, specifically that process
is initially called with a Split and then continues to be called with null
arguments. It's something you only need to learn once, but at first glance I was expecting this to work more like a ConnectorPageSource
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it provide a split already in this method call?
It totally makes sense to call TableFunctionProcessorProvider.getSplitProcessor(session, handle, split)
, and remove the split
argument from the TableFunctionSplitProcessor.process
method. It should simplify the operator a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will take a stab.
What about replacing TableFunctionSplitProcessor
with ConnectorPageSource
, as a consequence of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That one seems to be designed for reading input. We'd have to think about how we implement getReadTimeNanos()
etc. so that it makes sense for a particular table function. Or maybe subclass to ensure that those methods cannot be used. Even though they are never used anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param split a {@link ConnectorSplit} representing a subtask. | ||
* @param split a {@link ConnectorSplit} representing a subtask, or {@code null} if a split has already started to be processed, | ||
* and the implementation returned a {@link TableFunctionProcessorState.Processed} with | ||
* {@link TableFunctionProcessorState.Processed#isUsedInput()} being {@code true}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm - this is interesting contract :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, the states were primarily designed for the TableFunctionDataProcessor
.
That processor gets one portion of data at a time, and it declares isUsedInput()
when the portion is ingested. When it has ingested all the due data, it gets null
until it's finished.
In case of the TableFunctionSplitProcessor
, instead of input data, we have a Split. It is presented to the Processor as a single portion of data. If the processor declares isUsedInput()
, it gets null
in subsequent calls until it's finished.
No description provided.