-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-13416] Introduce Schema provider for AWS model classes extending SdkPojo #16947
Conversation
@pabloem I remember you made related changes on your recent PR to JdbcIO to add support for generic partitioners. I wonder if you have pointers to documentation or simply advice on the topic :) |
@reuvenlax Could you advise on this please? |
Can you explain what SdkPojoSchema is? We already have an optimized Pojo schema provider (JavaFieldSchema) |
@reuvenlax JavaFieldSchema doesn't work for AWS models, the sdk v2 uses immutable objects based on builders / final fields. SdkPojo is the common interface of all of these models. |
SdkPojoSchema uses the metadata exposed by SdkPojos to read fields and write them via an intermediate builder. |
JavaFieldSchema should work for immutable objects with builders too. Can we
elucidate what the gap is there?
JavaFieldSchema is fairly optimized using bytecode generation, so it
doesn't have to convert back and forth to a Row object (rather the Row
object simply forwards it accesses to direct accesses of the POJO using
bytecode). If we are able to leverage that, it will be a
significant performance advantage over the SdkPojo schema provider.
…On Mon, Feb 28, 2022 at 10:49 AM Moritz Mack ***@***.***> wrote:
@reuvenlax <https://github.com/reuvenlax> JavaFieldSchema doesn't work
for AWS models, the sdk v2 uses immutable objects based on builders / final
fields. SdkPojo is the common interface of all of these models.
—
Reply to this email directly, view it on GitHub
<#16947 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVKQRE7ZTS7RTWNVHJ3U5O7UFANCNFSM5PKC75YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks, I'll have a closer look |
@reuvenlax I had a closer look why JavaFieldSchema does not work. There's quite some issues, and this is by far not an exhaustive list:
Also, I've seen all the optimisations done in JavaFieldSchema and JavaBeanSchema. Honestly, at this point I don't bother much about that yet. I'm more interested to understand what the right approach forward is giving all the AWS IOs currently use hand crafted low level coders. |
Also, for context, here's a stripped down example how such a public interface SdkPojo {
List<SdkField<?>> sdkFields();
}
@Generated("software.amazon.awssdk:codegen")
public final class DeleteMessageRequest extends SqsRequest implements SdkPojo {
private static final SdkField<String> QUEUE_URL_FIELD = SdkField.<String> builder(MarshallingType.STRING)
.memberName("QueueUrl")
.getter(DeleteMessageRequest::queueUrl)
.setter(Builder::queueUrl)
.build();
private static final SdkField<String> RECEIPT_HANDLE_FIELD = SdkField.<String> builder(MarshallingType.STRING)
.memberName("ReceiptHandle")
.getter(DeleteMessageRequest::receiptHandle)
.setter(Builder::receiptHandle)
.build();
private static final List<SdkField<?>> SDK_FIELDS = Collections.unmodifiableList(Arrays.asList(QUEUE_URL_FIELD,
RECEIPT_HANDLE_FIELD));
private final String queueUrl;
private final String receiptHandle;
private DeleteMessageRequest(BuilderImpl builder) {
super(builder);
this.queueUrl = builder.queueUrl;
this.receiptHandle = builder.receiptHandle;
}
public final String queueUrl() {
return queueUrl;
}
public final String receiptHandle() {
return receiptHandle;
}
public static Builder builder() {
return new BuilderImpl();
}
public final List<SdkField<?>> sdkFields() {
return SDK_FIELDS;
}
public interface Builder extends SdkPojo {
Builder queueUrl(String queueUrl);
Builder receiptHandle(String receiptHandle);
}
static final class BuilderImpl implements Builder {
private String queueUrl;
private String receiptHandle;
private BuilderImpl() {}
public final Builder queueUrl(String queueUrl) {
this.queueUrl = queueUrl;
return this;
}
public final Builder receiptHandle(String receiptHandle) {
this.receiptHandle = receiptHandle;
return this;
}
public DeleteMessageRequest build() {
return new DeleteMessageRequest(this);
}
public List<SdkField<?>> sdkFields() {
return SDK_FIELDS;
}
}
} |
On Wed, Mar 2, 2022 at 3:02 AM Moritz Mack ***@***.***> wrote:
@reuvenlax <https://github.com/reuvenlax> I had a closer look why
JavaFieldSchema does not work. There's quite some issues, and this is by
far not an exhaustive list:
- Only public, non-static, non-transient fields
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ReflectUtils.java#L129-L151>
are considered. Generated models in the AWS SDK are private, however. Using
some kind of Field filter in JavaFieldSchema, that can then be overwritten,
this could be addressed. But this obviously also means to generate
different FieldValueGetter using ByteBuddy.
But the getter methods are public, right?
- Next, more involved I suppose, such AWS SDK models contain final
fields only, but there's no known creator. The annotation based approach
isn't feasible for such "external" classes.
That's something I think could be solved. Yes today we require an
annotation to specify the builder, however we could easily add a
programmatic way to specify this.
- An annoyance, but has to be handled as well: These model classes may
contain maps with non primitive types. This is not supported. Also, in some
cases there might be recursive type hierarchies that can't be supported and
have to be dealt with.
I think we do support this, or are you referring to the key type?
Recursive type hierarchies are also definitely support.
…
Also, I've seen all the optimisations done in JavaFieldSchema and
JavaBeanSchema. Honestly, at this point I don't bother much about that yet.
I'm more interested to understand what the right approach forward is giving
all the AWS IOs currently use hand crafted low level coders.
—
Reply to this email directly, view it on GitHub
<#16947 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVNIELMVNBKPM6QFDOTU55DFZANCNFSM5PKC75YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There's public getters of course, but not following the "getX" naming convention. I've also looked briefly at just extending static class SdkFieldValueGetter<ObjectT, ValueT> implements FieldValueGetter<ObjectT, ValueT> {
private final SdkField<ValueT> field;
public SdkFieldValueGetter(SdkField<ValueT> field) {
this.field = field;
}
@Override
@Nullable
public ValueT get(ObjectT object) {
return field.getValueOrDefault(object);
}
@Override
public String name() {
return field.memberName();
}
}
Problem with recursive types is the immutable |
Do you mean mutually recursive? We considered supporting this with schema,
but the complexity was high and there didn't seem to be many good use cases.
…On Wed, Mar 2, 2022 at 10:37 AM Moritz Mack ***@***.***> wrote:
But the getter methods are public, right?
There's public getters of course, but not following the "getX" naming
convention.
I've also looked briefly at just extending GetterBasedSchemaProvider,
that works really well actually. The read part is rather trivial. The field
metadata of each SDK model provides an easy way to read fields (without any
reflection involved).
static class SdkFieldValueGetter<ObjectT, ValueT> implements FieldValueGetter<ObjectT, ValueT> {
private final SdkField<ValueT> field;
public SdkFieldValueGetter(SdkField<ValueT> field) {
this.field = field;
}
@OverRide
@nullable
public ValueT get(ObjectT object) {
return field.getValueOrDefault(object);
}
@OverRide
public String name() {
return field.memberName();
}
}
Recursive type hierarchies are also definitely support.
Problem with recursive types is the immutable Schema. I'm not aware of
any way to generate such a schema for a recursive hierarchy ... same
problem exists for FieldValueTypeInformation.
Just verified it quickly, I get a stackoverflow as expected ... so
recursive types are definitely not supported.
—
Reply to this email directly, view it on GitHub
<#16947 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVIAMSNFPACIJDYOFVDU56YODANCNFSM5PKC75YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Unfortunately if you truly need recursive or mutually-recursive types, Beam
Schemas won't work today. We supported nested types, but not true recursion.
…On Wed, Mar 2, 2022 at 10:40 AM Reuven Lax ***@***.***> wrote:
Do you mean mutually recursive? We considered supporting this with schema,
but the complexity was high and there didn't seem to be many good use cases.
On Wed, Mar 2, 2022 at 10:37 AM Moritz Mack ***@***.***>
wrote:
> But the getter methods are public, right?
>
> There's public getters of course, but not following the "getX" naming
> convention.
>
> I've also looked briefly at just extending GetterBasedSchemaProvider,
> that works really well actually. The read part is rather trivial. The field
> metadata of each SDK model provides an easy way to read fields (without any
> reflection involved).
>
> static class SdkFieldValueGetter<ObjectT, ValueT> implements FieldValueGetter<ObjectT, ValueT> {
> private final SdkField<ValueT> field;
>
> public SdkFieldValueGetter(SdkField<ValueT> field) {
> this.field = field;
> }
>
> @OverRide
> @nullable
> public ValueT get(ObjectT object) {
> return field.getValueOrDefault(object);
> }
>
> @OverRide
> public String name() {
> return field.memberName();
> }
> }
>
> Recursive type hierarchies are also definitely support.
>
> Problem with recursive types is the immutable Schema. I'm not aware of
> any way to generate such a schema for a recursive hierarchy ... same
> problem exists for FieldValueTypeInformation.
> Just verified it quickly, I get a stackoverflow as expected ... so
> recursive types are definitely not supported.
>
> —
> Reply to this email directly, view it on GitHub
> <#16947 (comment)>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AFAYJVIAMSNFPACIJDYOFVDU56YODANCNFSM5PKC75YA>
> .
> Triage notifications on the go with GitHub Mobile for iOS
> <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
> or Android
> <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
>
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
|
2f09257
to
fb0f27f
Compare
@reuvenlax Trying to implement this on the existing Schema providers, I kept running into various issues. Also I introduce a similar wrapper row delegating to an underlying object very similar to |
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/values/AwsPojoRow.java
Outdated
Show resolved
Hide resolved
fb0f27f
to
c0790f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mosche thanks for this! I can help review, I have a few initial questions.
FYI @apilloud (relevant for schema proliferation), and @chamikaramj (could enable cross-language AWS IOs)
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/values/AwsPojoRow.java
Outdated
Show resolved
Hide resolved
...amazon-web-services2/src/test/java/org/apache/beam/sdk/io/aws2/schemas/AwsPojoUtilsTest.java
Outdated
Show resolved
Hide resolved
/** | ||
* Custom Coder for WrappedSnsResponse. | ||
* | ||
* @deprecated Coder of deprecated {@link SnsResponse}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the deprecation plan? Should users be opting in to using the SchemaCoder based approach, or are we keeping these coders here so that they can opt-out of schemas and use these instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to turn on schema coders by default by removing all "legacy" CoderProviderRegistrar
s and instead adding the generalSchemaProviderRegistrar
for all AWS models. Users can opt-out using schemas using the deprecated withCoder
setting.
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/values/package-info.java
Outdated
Show resolved
Hide resolved
Ping. What is a state of this PR? What are the next steps? |
This is blocked / waiting for #17172 |
Just wrapping up some other work, then I'll pick this up again 👍 There's a bit of work to be done to rebase this onto master. |
c0790f6
to
2c189b1
Compare
@TheNeuralBit I've migrated this PR onto |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few high-level comments and nits, this looks good overall. I haven't been able to look at the implementation closely though and probably won't be able to before I leave for vacation at the end of today. If you'd like to keep making progress on this while I'm gone maybe @reuvenlax or @lukecwik could take a look.
...java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/coders/AwsCoders.java
Outdated
Show resolved
Hide resolved
* | ||
* <p>Note: Beam doesn't support self-recursive schemas. Some AWS models, such as {@link | ||
* software.amazon.awssdk.services.dynamodb.model.AttributeValue} are self-recursive and cannot be | ||
* used with this schema provider. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we recommend continuing to use
AwsCoders
for this case? - It could also be helpful to link to the issue tracking recursive schemas.
- nit: I'd say either "recursive" or "self-referential" instead of "self-recursive"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed to self-referential 👍 DynamoDB AttributeValue
is the only self-referential type I'm aware of and has a dedicated coder independent of AwsCoders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lack of support of self-referential types in Beam was apparently by design, so there's no such issue. My initial idea for AWS was to "fake" it by just skipping over self-referential fields at some depth. But that is obviously not correct and potentially dangerous as it might silently ignore some data.
...mazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/schemas/AwsSchemaProvider.java
Outdated
Show resolved
Hide resolved
import software.amazon.awssdk.services.sqs.model.Message; | ||
|
||
/** Custom Coder for handling SendMessageRequest for using in Write. */ | ||
public class MessageCoder extends AtomicCoder<Message> implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is public and not @Internal
so technically we shouldn't just remove it outright. That being said this and SendMessageRequestCoder
, MessageCoder
are pretty clearly internal coders for use in their respective IOs, so it's probably fine.
I'll leave it up to you - if you want you could mark these coders deprecated and schedule them for removal in a later release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right pointing this out 👍 Considering MessageCoder.of()
/ SendMessageRequestCoder.of()
are both package private I think direct removal is justifiable in this case.
} | ||
|
||
private static byte[] toBytes(Object sdkBytes) { | ||
return ((SdkBytes) sdkBytes).asByteArrayUnsafe(); // TODO copy or use unsafe? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these TODOs something we can address now, or should there be an issue tracking it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at what GetterBasedSchemaProvider does for byte[]
/ ByteBuffer
. It's exposing the mutable bytes as well, so I'll keep the unsafe operations here. 👍
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/sns/SnsIO.java
Show resolved
Hide resolved
* sdkHttpMetadata with the HTTP response headers. | ||
* | ||
* @deprecated Writes fail fast in case of errors, no need to check headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is this a change in behavior that we need to communicate to users?
- Will these functions be removed later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a clarification why there's no need for a coder that includes HTTP response metadata, there's no change in behavior.
Plans is to remove these after a few releases.
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/sns/SnsIO.java
Outdated
Show resolved
Hide resolved
return str.getBytes(UTF_8); | ||
} | ||
|
||
private <T> Condition<Row> field(String name, T value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handy, maybe it should be in SchemaTestUtils for discoverability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code depends on assertj-core
and probably doesn't fit into SchemaTestUtils for that reason.
I could rewrite it as Matcher
if you think it's worth it,e.g.:
public static Matcher<Row> containsFieldValues(Map<String, Object> expectedFieldValues)
import software.amazon.awssdk.services.sqs.model.MessageAttributeValue; | ||
import software.amazon.awssdk.services.sqs.model.SendMessageRequest; | ||
|
||
public class AwsSchemaProviderTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this exercise all the types in the mapping from AwsTypes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact not, thanks for checking ... I'll make sure to add more test cases to cover everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done ✅
retest this please |
Running tests to get a run of the newly forked integration tests (Java_Amazon-Web-Services2_IO_Direct) |
Thanks a lot for your review @TheNeuralBit 🙏 |
2a95375
to
38d9641
Compare
@TheNeuralBit Could you take a final look, I just rebased to account for the byte buddy changes on master ... |
@TheNeuralBit Kind ping, are you ok merging this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
You might consider creating an issue for removing the deprecated functionality, and linking it to a future release milestone. That will help us remember to actually remove it.
Sorry for the delay |
Thanks a lot @TheNeuralBit :) |
@echauchot @aromanenko-dev Not really a PR (yet), I was just looking to understand the usage of schemas vs coders better.
The (most) interesting part here is probably
SdkPojoSchema
that works for all the generated AWS models in the SDK v2 (though, in some cases recursive types are used which obviously impose a challenge).I wasn't able to find a lot of good technical documentation (design decisions) for schemas in Beam, though definitely see the value over low level coders! It seems to make sense to bother users as little as possible with coders and not expose these on any public API (M11). Currently coders are all over the place for AWS IOs, with this it seems feasible to drop them (or deprecate).
Though, I was wondering a couple of things:
SdkPojo
), that means I have to "announce" it globally using an auto service.If you have any good resources on the topic, I'd be happy to hear ...
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.