[BEAM-12384] Set output typeDescriptor explictly in Read.Bounded transform #14854

iemejia · 2021-05-21T07:37:13Z

sdks/java/core/src/test/java/org/apache/beam/sdk/io/ReadTest.java

boyuanzz · 2021-05-21T18:06:58Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java

@@ -151,7 +150,8 @@ private Bounded(@Nullable String name, BoundedSource<T> source) {
          .apply(ParDo.of(new OutputSingleSource<>(source)))
          .setCoder(SerializableCoder.of(new TypeDescriptor<BoundedSource<T>>() {}))
          .apply(ParDo.of(new BoundedSourceAsSDFWrapperFn<>()))
-          .setCoder(source.getOutputCoder());
+          .setCoder(source.getOutputCoder())
+          .setTypeDescriptor(source.getOutputCoder().getEncodedTypeDescriptor());


Did we maintain the TypeDescriptor information for Read before? I was under the impression that in most cases we only set the Coder for an output PCollection.

You are right, and I don't know why we don't pay more attention to this. Probably because coders seem to include the TypeDescriptor. Any ideas, @kennknowles? Is this redundant somehow?

In any case having this information seems important for the downstream transforms.

It seems like the typeDescriptor can be inferred from Coder.getEncodedTypeDescriptor(). If we really want to populate this information in a consistent way, we could consider changing PCollection.getTypeDescriptor() to infer the typeDescriptor from the Coder if the typeDescriptor is not set.
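The fallback proposed here can be sketched roughly as follows. This is a minimal model only: SketchTypeDescriptor, SketchCoder, and SketchPCollection are hypothetical stand-ins written for illustration, not Beam's real classes, and the real PCollection may handle more cases than this shows.

```java
// Hypothetical stand-ins for illustration; these are NOT Beam's classes.
class SketchTypeDescriptor {
  private final String typeName;

  SketchTypeDescriptor(String typeName) {
    this.typeName = typeName;
  }

  String typeName() {
    return typeName;
  }
}

// Models only the one coder method discussed in this thread.
interface SketchCoder {
  SketchTypeDescriptor getEncodedTypeDescriptor();
}

class SketchPCollection {
  private SketchCoder coder; // null until setCoder is called
  private SketchTypeDescriptor typeDescriptor; // null until set explicitly

  SketchPCollection setCoder(SketchCoder coder) {
    this.coder = coder;
    return this;
  }

  SketchPCollection setTypeDescriptor(SketchTypeDescriptor typeDescriptor) {
    this.typeDescriptor = typeDescriptor;
    return this;
  }

  // An explicitly set descriptor wins; otherwise fall back to the coder's.
  SketchTypeDescriptor getTypeDescriptor() {
    if (typeDescriptor != null) {
      return typeDescriptor;
    }
    if (coder != null) {
      return coder.getEncodedTypeDescriptor();
    }
    return null; // neither a descriptor nor a coder is known
  }
}
```

With this shape, a transform that only calls setCoder would still expose a type descriptor to downstream transforms, while an explicit setTypeDescriptor call keeps priority.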

Yes, the expected use of this method is to set the type descriptor but not the coder. This way, the coder registry still can choose the coder.

Setting both is redundant, in theory. Setting just the coder should suffice. Maybe some plumbing needed? It was not really expected to look at either one in this way.

Another angle to consider is that type descriptor is Java-specific, while coder is the portable "type" of the data. I don't know if that matters here.

I'm thinking about changes like: #14870

I like @boyuanzz's fix because, even in the presence of different Coders, the TypeDescriptor is commonly preserved inside the Coders. WDYT @kennknowles, can you spot any particular issues with it?
I can rebase this PR to target a generic implementation like the one in #14870. I did not do it that way originally because I was not really familiar with the reasoning behind not relying on the coder's typeDescriptor.

Yea makes lots of sense.

boyuanzz · 2021-05-21T18:09:34Z

Run Java PreCommit

boyuanzz · 2021-05-21T18:09:41Z

Run Java_Examples_Dataflow PreCommit

boyuanzz · 2021-05-21T18:09:58Z

Run Java_Examples_Dataflow_Java11 PreCommit

boyuanzz · 2021-05-21T21:33:16Z

Run Java PreCommit

boyuanzz · 2021-05-21T21:33:23Z

Run Java_Examples_Dataflow_Java11 PreCommit


iemejia · 2021-05-26T13:58:54Z

I cherry-picked @boyuanzz's commit from the other PR. Note, however, that I could not get the type to be preserved after the two DoFns are applied.

In OutputSingleSource I was able to export the type correctly by overriding:

    @Override
    public TypeDescriptor<SourceT> getOutputTypeDescriptor() {
      return (TypeDescriptor<SourceT>)
          new TypeDescriptor<Source<T>>(getClass()) {}.where(
              new TypeParameter<T>() {}, source.getOutputCoder().getEncodedTypeDescriptor());
    }

but in the second DoFn, BoundedSourceAsSDFWrapperFn, I did not find a way to recover the real type of T when overriding getOutputTypeDescriptor. Any suggestions for this? Otherwise I suppose the solution on Read is good enough.
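The difficulty with the second DoFn is consistent with how Java type tokens behave under erasure. The following sketch uses plain reflection rather than Beam: TypeToken and GenericWrapper are hypothetical names invented for this example. It shows that an anonymous subclass captures a concrete type argument, but inside a generic class the same trick only captures the type variable T.

```java
// Hypothetical illustration using only java.lang.reflect; not Beam code.
abstract class TypeToken<T> {
  private final java.lang.reflect.Type captured;

  protected TypeToken() {
    // The anonymous subclass records its generic superclass, e.g. TypeToken<String>.
    java.lang.reflect.ParameterizedType superType =
        (java.lang.reflect.ParameterizedType) getClass().getGenericSuperclass();
    this.captured = superType.getActualTypeArguments()[0];
  }

  java.lang.reflect.Type captured() {
    return captured;
  }
}

// Plays the role of a DoFn that is generic in T and has no instance
// carrying the concrete type (no source, no coder).
class GenericWrapper<T> {
  java.lang.reflect.Type elementType() {
    // Here T is still a type variable, so the token captures "T" itself.
    return new TypeToken<T>() {}.captured();
  }
}

class TypeTokenDemo {
  public static void main(String[] args) {
    // Concrete argument at the creation site: the real class is captured.
    System.out.println(new TypeToken<String>() {}.captured());
    // Generic creation site: only the unresolved type variable is available.
    System.out.println(new GenericWrapper<String>().elementType());
  }
}
```

In OutputSingleSource the missing piece can be supplied at runtime because the source instance and its coder are at hand. A wrapper that holds no such instance has nothing to resolve T against, which is why setting the descriptor on the output of Read is the practical workaround.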

boyuanzz · 2021-05-27T17:59:24Z

I would say, let's unblock you first : ) Can you file a JIRA issue for this? I think we should find an ultimate solution there.

boyuanzz · 2021-05-27T17:59:42Z

Run Java PreCommit

boyuanzz · 2021-05-27T17:59:50Z

Run Java_Examples_Dataflow PreCommit

boyuanzz

Thanks!

iemejia · 2021-05-28T08:28:01Z

Thanks @boyuanzz, I filed BEAM-12420 (TypeDescriptor information gets lost when applying multiple DoFn on Composite Transform) as a follow-up.
I need to focus on finishing the Convert work, so feel free to take it if you have some free cycles.

iemejia requested review from kennknowles and boyuanzz May 21, 2021 07:37

kennknowles reviewed May 21, 2021

sdks/java/core/src/test/java/org/apache/beam/sdk/io/ReadTest.java

boyuanzz reviewed May 21, 2021

iemejia force-pushed the BEAM-12384-read-typedescriptor branch from 2857fa7 to e2bfde9 Compare May 21, 2021 20:33

iemejia and others added 3 commits May 26, 2021 10:49

[BEAM-12384] Refine generic types on Read.Bounded internals

331c67c

[BEAM-12384] Set output typeDescriptor explictly in Read.Bounded transform

3b705c1

[BEAM-12384] Infer typeDescriptor from coder if typeDescriptor is not set explicitly.

83bccf9

iemejia force-pushed the BEAM-12384-read-typedescriptor branch from e2bfde9 to 83bccf9 Compare May 26, 2021 13:54

boyuanzz approved these changes May 27, 2021

iemejia merged commit b03e429 into apache:master May 28, 2021

iemejia deleted the BEAM-12384-read-typedescriptor branch May 28, 2021 08:51

damccorm mentioned this pull request Jun 4, 2022

TypeDescriptor information gets lost when applying multiple DoFn on Composite Transform #20901

Open
