Document TableFunctionSplitProcessor thread-safety #16955

findepi · 2023-04-11T10:05:32Z

No description provided.

Rename the existing TableFunctionProcessor to TableFunctionDataProcessor, and introduce another interface for processing splits.

core/trino-spi/src/main/java/io/trino/spi/ptf/TableFunctionSplitProcessor.java

findepi · 2023-04-11T13:03:50Z

CI #16882

losipiuk · 2023-04-11T19:47:40Z

core/trino-spi/src/main/java/io/trino/spi/ptf/TableFunctionSplitProcessor.java

+ * for a {@link ConnectorTableFunctionHandle}.
+ * <p>
+ * Thread-safety: implementations do not have to be thread-safe. The {@link #process} method may be called from
+ * multiple threads, but will never be called from two threads at the same time.


Does this mean that implementations of TableFunctionSplitProcessor need to ensure memory visibility of internal data structures that may be changed by one thread and then accessed by another? Or are we sure this is ensured by the caller?

Formally, we would invoke the JLS's happens-before semantics and say that there is a happens-before relation between the end of one method call and the start of the next. I didn't want to be very formal here, but I wanted to indicate that the implementor should not rely on things like thread id, or use ThreadLocal internally.
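To make the contract concrete, here is a minimal sketch of what "called from multiple threads, but never concurrently" can look like from the caller's side. `CountingProcessor` and `SequentialCallerSketch` are hypothetical names, not Trino classes, and the executor-based hand-off is only an assumption about one way a caller could provide the happens-before edge between calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a split processor; not the Trino SPI type.
class CountingProcessor
{
    // Plain field: no volatile, no locks, no ThreadLocal, no thread-id checks.
    private int calls;

    int process()
    {
        calls++;
        return calls;
    }
}

class SequentialCallerSketch
{
    // Calls process() `rounds` times on pool threads, one call at a time.
    static int run(int rounds)
    {
        CountingProcessor processor = new CountingProcessor();
        Callable<Integer> task = processor::process;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int last = 0;
            for (int i = 0; i < rounds; i++) {
                // submit() orders earlier actions before the task runs, and
                // get() orders the task's actions before the next iteration,
                // so each call sees the writes made by the previous call.
                last = pool.submit(task).get();
            }
            return last;
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            pool.shutdown();
        }
    }
}
```

Under these assumptions the processor needs no synchronization of its own: the visibility guarantee comes entirely from how the caller hands the object from one thread to the next.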

Generally, to reason about thread-safety, we must consider both TableFunctionSplitProcessor and TableFunctionProcessorProvider.
The LeafTableFunctionOperator calls TableFunctionProcessorProvider.getSplitProcessor(session, handle) for each split it has. The function author implements the TableFunctionProcessorProvider and they can decide on the lifecycle of the Processor. One extreme would be to keep a single TableFunctionSplitProcessor, and return it from each call to the provider -- and deal with multiple threads. The other extreme is to instantiate a new TableFunctionSplitProcessor for each call to the provider. The latter is easy and clear, and imo should be considered the default approach.
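The "new processor per call" extreme can be sketched as follows. The interfaces are hypothetical, simplified stand-ins (the real provider method takes a session and a handle, which are omitted here), so this only illustrates the lifecycle choice, not the actual SPI.

```java
// Hypothetical, simplified stand-ins; the real SPI methods take more arguments.
interface SketchSplitProcessor
{
    int nextValue();
}

interface SketchProcessorProvider
{
    SketchSplitProcessor getSplitProcessor();
}

class FreshInstanceProvider
        implements SketchProcessorProvider
{
    @Override
    public SketchSplitProcessor getSplitProcessor()
    {
        // A new instance per call: mutable state is never shared between splits.
        return new SketchSplitProcessor()
        {
            private int position;

            @Override
            public int nextValue()
            {
                return position++;
            }
        };
    }
}
```

Because every call returns an independent object, progress made on one split cannot leak into another, which is why this approach avoids most thread-safety questions at the provider level.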

The LeafTableFunctionOperator calls TableFunctionProcessorProvider.getSplitProcessor(session, handle) for each split it has.

Can it provide a split already in this method call?
it would make it clear the processor serves one split.

(i understand we wanted the leaf processor to be similar to intermediate processor, but reality is that a function implementor implements only one of them at the same time, so making things just simpler would be beneficial)

then the TableFunctionSplitProcessor just provides Pages until it's done. So becomes equivalent to ConnectorPageSource. Maybe we could reuse that interface?

Just throwing in my 2 cents that I also found this part of the interface a bit unintuitive when reviewing @homar's CDF table function implementation, specifically that process is initially called with a Split and then continues to be called with null arguments. It's something you only need to learn once, but at first glance I was expecting this to work more like a ConnectorPageSource.

Can it provide a split already in this method call?

It totally makes sense to call TableFunctionProcessorProvider.getSplitProcessor(session, handle, split), and remove the split argument from the TableFunctionSplitProcessor.process method. It should simplify the operator a lot.
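A rough sketch of the shape being proposed, with the split handed to the provider and a no-argument process method. All types are hypothetical stand-ins: `RangeSplit` replaces ConnectorSplit, the session and handle arguments are omitted, and returning `null` to signal completion is an assumption made only to keep the example short.

```java
// Hypothetical stand-in for a ConnectorSplit: a half-open integer range.
final class RangeSplit
{
    final int start;
    final int end;

    RangeSplit(int start, int end)
    {
        this.start = start;
        this.end = end;
    }
}

// process() takes no split; the processor already knows which split it serves.
interface ProposedSplitProcessor
{
    // Returns the next value, or null when the split is exhausted (assumption).
    Integer process();
}

interface ProposedProcessorProvider
{
    ProposedSplitProcessor getSplitProcessor(RangeSplit split);
}

class RangeProcessorProvider
        implements ProposedProcessorProvider
{
    @Override
    public ProposedSplitProcessor getSplitProcessor(RangeSplit split)
    {
        return new ProposedSplitProcessor()
        {
            private int next = split.start;

            @Override
            public Integer process()
            {
                if (next >= split.end) {
                    return null;
                }
                return Integer.valueOf(next++);
            }
        };
    }
}
```

With this shape, the one-processor-per-split relationship is visible in the signature itself rather than being a convention between the operator and the implementor.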

i will take a stab.

What about replacing TableFunctionSplitProcessor with ConnectorPageSource, as a consequence of that?

That one seems to be designed for reading input. We'd have to think about how we implement getReadTimeNanos() etc. so that it makes sense for a particular table function. Or maybe subclass to ensure that those methods cannot be used. Even though they are never used anyway.

losipiuk · 2023-04-11T19:49:45Z

core/trino-spi/src/main/java/io/trino/spi/ptf/TableFunctionSplitProcessor.java

-     * @param split a {@link ConnectorSplit} representing a subtask.
+     * @param split a {@link ConnectorSplit} representing a subtask, or {@code null} if a split has already started to be processed,
+     * and the implementation returned a {@link TableFunctionProcessorState.Processed} with
+     * {@link TableFunctionProcessorState.Processed#isUsedInput()} being {@code true}.


hmm - this is an interesting contract :)

Yup, the states were primarily designed for the TableFunctionDataProcessor.
That processor gets one portion of data at a time, and it declares isUsedInput() when the portion is ingested. When it has ingested all the due data, it gets null until it's finished.
In case of the TableFunctionSplitProcessor, instead of input data, we have a Split. It is presented to the Processor as a single portion of data. If the processor declares isUsedInput(), it gets null in subsequent calls until it's finished.
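The calling pattern described above can be sketched as a small driver loop. `SketchState` is a hypothetical, simplified model loosely inspired by TableFunctionProcessorState (it is not the SPI type), splits and pages are represented as strings, and `TwoPageProcessor` is an invented processor that emits two pages per split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the processor states discussed above.
final class SketchState
{
    enum Kind { PROCESSED, FINISHED }

    final Kind kind;
    final boolean usedInput;
    final String output;

    private SketchState(Kind kind, boolean usedInput, String output)
    {
        this.kind = kind;
        this.usedInput = usedInput;
        this.output = output;
    }

    static SketchState processed(boolean usedInput, String output)
    {
        return new SketchState(Kind.PROCESSED, usedInput, output);
    }

    static SketchState finished()
    {
        return new SketchState(Kind.FINISHED, false, null);
    }
}

interface NullableSplitProcessor
{
    // `split` is non-null until the processor reports that it used the input.
    SketchState process(String split);
}

// Invented example: emits exactly two pages for the split it is given.
class TwoPageProcessor
        implements NullableSplitProcessor
{
    private String current;
    private int emitted;

    @Override
    public SketchState process(String split)
    {
        if (split != null) {
            current = split;
        }
        if (emitted == 2) {
            return SketchState.finished();
        }
        String page = current + "#" + emitted;
        emitted++;
        return SketchState.processed(split != null, page);
    }
}

class OperatorLoopSketch
{
    static List<String> drive(NullableSplitProcessor processor, String split)
    {
        List<String> pages = new ArrayList<>();
        String input = split;
        while (true) {
            SketchState state = processor.process(input);
            if (state.kind == SketchState.Kind.FINISHED) {
                return pages;
            }
            if (state.usedInput) {
                // Once the split is declared used, later calls receive null.
                input = null;
            }
            if (state.output != null) {
                pages.add(state.output);
            }
        }
    }
}
```

In this sketch the processor sees the split exactly once, declares it used, and is then polled with `null` until it reports that it is finished.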

Document TableFunctionSplitProcessor thread-safety

5732726

findepi requested review from homar, losipiuk, ebyhr and kasiafi April 11, 2023 10:05

cla-bot bot added the cla-signed label Apr 11, 2023

homar approved these changes Apr 11, 2023

findepi referenced this pull request Apr 11, 2023

Introduce TableFunctionSplitProcessor

8be312f

Rename the existing TableFunctionProcessor to TableFunctionDataProcessor, and introduce another interface for processing splits.

findepi commented Apr 11, 2023

findepi force-pushed the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch from e2cf177 to f37a79b Compare April 11, 2023 11:06

Document TableFunctionSplitProcessor split argument and finishing

ffe92d9

findepi force-pushed the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch from f37a79b to ffe92d9 Compare April 11, 2023 11:09

losipiuk reviewed Apr 11, 2023

losipiuk approved these changes Apr 11, 2023

findepi merged commit f2c1fcc into trinodb:master Apr 11, 2023

findepi deleted the findepi/document-tablefunctionsplitprocessor-thread-safety-38bd32 branch April 11, 2023 21:05

github-actions bot added this to the 413 milestone Apr 11, 2023

colebow mentioned this pull request Apr 12, 2023

Add Trino 413 release notes #16997

Merged

findepi mentioned this pull request Apr 14, 2023

Simplify TableFunctionSplitProcessor interface #17032

Merged
