Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-12384] Set output typeDescriptor explictly in Read.Bounded transform #14854

Merged
merged 3 commits into from
May 28, 2021

Conversation

iemejia
Copy link
Member

@iemejia iemejia commented May 21, 2021

@iemejia iemejia requested review from kennknowles and boyuanzz May 21, 2021 07:37
@@ -151,7 +150,8 @@ private Bounded(@Nullable String name, BoundedSource<T> source) {
.apply(ParDo.of(new OutputSingleSource<>(source)))
.setCoder(SerializableCoder.of(new TypeDescriptor<BoundedSource<T>>() {}))
.apply(ParDo.of(new BoundedSourceAsSDFWrapperFn<>()))
.setCoder(source.getOutputCoder());
.setCoder(source.getOutputCoder())
.setTypeDescriptor(source.getOutputCoder().getEncodedTypeDescriptor());
Copy link
Contributor

@boyuanzz boyuanzz May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maintain the TypeDescriptor information before for Read? I was under impression that for most of cases we only set Coder for a output PCollection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right and I don't know why we don't pay more attention to this. Probably because coders seem to include the TypeDescriptor, any ideas @kennknowles ? is this redundant somehow?

In any case having this information seems important for the downstream transforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the typeDescriptor can be inferred from Coder.getEncodedTypeDescriptor(). If we really want to populate this information in a consistent way, probably we can consider changing PCollection.getTypeDescriptor() to infer the typeDescriptor from Coder if the typeDescriptor is set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the expected use of this method is to set the type descriptor but not the coder. This way, the coder registry still can choose the coder.

Setting both is redundant, in theory. Setting just the coder should suffice. Maybe some plumbing needed? It was not really expected to look at either one in this way.

Another angle to consider is that type descriptor is Java-specific, while coder is the portable "type" of the data. I don't know if that matters here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about changes like: #14870

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @boyuanzz fix because even in the presence of different Coders the TypeDescriptor is commonly preserved inside of the Coders. WDYT @kennknowles can you spot some particular issues about it?
I can rebase this PR targetting a generic implementation like the one on #14870 but I did not do it like that because I was not really familiar with the reasoning behind not relying on the coder typeDescriptor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea makes lots of sense.

@boyuanzz
Copy link
Contributor

Run Java PreCommit

@boyuanzz
Copy link
Contributor

Run Java_Examples_Dataflow PreCommit

@boyuanzz
Copy link
Contributor

Run Java_Examples_Dataflow_Java11 PreCommit

@iemejia iemejia force-pushed the BEAM-12384-read-typedescriptor branch from 2857fa7 to e2bfde9 Compare May 21, 2021 20:33
@boyuanzz
Copy link
Contributor

Run Java PreCommit

@boyuanzz
Copy link
Contributor

Run Java_Examples_Dataflow_Java11 PreCommit

@iemejia iemejia force-pushed the BEAM-12384-read-typedescriptor branch from e2bfde9 to 83bccf9 Compare May 26, 2021 13:54
@iemejia
Copy link
Member Author

iemejia commented May 26, 2021

I cherry-picked @boyuanzz commit from the other PR notice however that I could not get the Type to be preserved after the two DoFns are applied.

In OutputSingleSource I was able to export correctly the type by overriding

    @Override
    public TypeDescriptor<SourceT> getOutputTypeDescriptor() {
      return (TypeDescriptor<SourceT>)
          new TypeDescriptor<Source<T>>(getClass()) {}.where(
              new TypeParameter<T>() {}, source.getOutputCoder().getEncodedTypeDescriptor());
    }

but in the second DoFn BoundedSourceAsSDFWrapperFn

I did not find a way to recover the real type of T when overwriting getOutputTypeDescriptor. Any suggestion for this? Otherwise I suppose the solution on Read is good enough.

@boyuanzz
Copy link
Contributor

I cherry-picked @boyuanzz commit from the other PR notice however that I could not get the Type to be preserved after the two DoFns are applied.

In OutputSingleSource I was able to export correctly the type by overriding

    @Override
    public TypeDescriptor<SourceT> getOutputTypeDescriptor() {
      return (TypeDescriptor<SourceT>)
          new TypeDescriptor<Source<T>>(getClass()) {}.where(
              new TypeParameter<T>() {}, source.getOutputCoder().getEncodedTypeDescriptor());
    }

but in the second DoFn BoundedSourceAsSDFWrapperFn

I did not find a way to recover the real type of T when overwriting getOutputTypeDescriptor. Any suggestion for this? Otherwise I suppose the solution on Read is good enough.

I would say, let's unblock you first : ) Can you file a JIRA issue on this? I think we should find a ultimate solution there.

@boyuanzz
Copy link
Contributor

Run Java PreCommit

@boyuanzz
Copy link
Contributor

Run Java_Examples_Dataflow PreCommit

Copy link
Contributor

@boyuanzz boyuanzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@iemejia
Copy link
Member Author

iemejia commented May 28, 2021

Thanks @boyuanzz I filled BEAM-12420 TypeDescriptor information gets lost when applying multiple DoFn on Composite Transform as a follow up.
I need to focus on finishing the Convert work so feel free to take it if you have some free cycles.

@iemejia iemejia merged commit b03e429 into apache:master May 28, 2021
@iemejia iemejia deleted the BEAM-12384-read-typedescriptor branch May 28, 2021 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants