Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc/developer: add design doc for adding docs to Avro sink schemas #21564

Merged
merged 7 commits into from
Sep 21, 2023

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Sep 3, 2023

@moulimukherjee — here's an initial sketch of a design and implementation for Avro field documentation. Could I hand this off to you to address and resolve any resulting discussion?


This is a design for MaterializeInc/database-issues#6480.

Motivation

  • This PR adds a design document.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • n/a

@moulimukherjee
Copy link
Contributor

@benesch Sure, thank you!

@sjwiesman
Copy link
Contributor

sjwiesman commented Sep 5, 2023

It's unclear to me in the design doc, can you set a top level doc comment for the full schema?

@moulimukherjee
Copy link
Contributor

@sjwiesman Avro does support a top level doc for the record. We should be able to allow it as well.

@benesch benesch force-pushed the design-avro-doc branch 2 times, most recently from 987607e to 0012be4 Compare September 8, 2023 07:22
@benesch
Copy link
Contributor Author

benesch commented Sep 8, 2023

Ugh, just thought of a very big wrinkle. Nested records don't retain their nice names! Given e.g.

CREATE TYPE point AS (x integer, y integer);
CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2;
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...;

the generated schema won't have a point record! Instead it'll have a record with an autogenerated name (record0). So to use the DOC ON option as designed, you'd have to somehow guess the autogenerated names. Urgh! Mayyybe it's okay if we discourage use of DOC ON and instead tell people to use COMMENT, and then Materialize does the translation from semantic name to generated name automatically, but still gross to have DOC ON hanging around as a user-facing feature with such a footgun.

@benesch
Copy link
Contributor Author

benesch commented Sep 8, 2023

I wonder how hard it'd be to change sink schema generation to use the Materialize names of the nested record types rather than auto-generating names.

Alternatively, perhaps the DOC ON specifier should be in SQL terms, rather than Avro terms? E.g., you say DOC ON foo.bar.baz instead of DOC ON com.materialize.sink.record3::baz. We'd still have a problem down the road with sum types, if we ever implemented those, but we could cross that bridge when we got there.

@benesch
Copy link
Contributor Author

benesch commented Sep 12, 2023

I just pushed up a big change that uses SQL names rather than Avro field specifiers to indicate on which fields to attach the comments. It's quite a bit more verbiage in the design document, but I think it is not really any harder to implement, and insulates the DOC ON syntax from future changes to how we autogenerate Avro record names.

def-
def- previously requested changes Sep 13, 2023
@moulimukherjee
Copy link
Contributor

+1 to specifying on sql names instead of avro field names, because that's already known to the user.

Copy link
Contributor

@umanwizard umanwizard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. For implementation, I think it would make sense to do this in a test-driven way, since the rules for choosing which doc string to use are rather complex. I.e., it might be better to first land some failing tests (and obviously disable them in CI) that encode the desired behavior, before writing the implementation.

Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

This all feels Complicated, and it makes me nervous that every future addition of support for new avro things will leave us with an unmanageable sink syntax. I appreciate why we haven't chosen the other options, though, so this still seems worthwhile to unlock the functionality.

If anything, though, I'd be happy to see some unpacking of why we support comment-on-type... since I don't quite understand the motivation and it ~doubles the surface area.


#### Planner

The planner will "freeze" any comments that have been promoted to documentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🆒

* It is not ergonomic. The provided schema must exactly match the schema
Materialize generates, *except* for the `doc` fields. Minor errors in
constructing the schema (e.g., using a `long` where an `int` is required, or
ordering fields wrong) will result in hard to debug failures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea, which came up recently in another context, was to add some avro_schema_for(<relation>) function that would output our generated avro schema as a string.

That would mitigate this concern a bit, since users could just edit the generated schema instead of having to get the types right a priori. I don't think it helps with the other concern however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had a similar request to see the schema without creating the sink https://github.com/MaterializeInc/materialize/issues/21661

@umanwizard
Copy link
Contributor

If anything, though, I'd be happy to see some unpacking of why we support comment-on-type

Basically just because Avro supports it. doc appears in three places in Avro: (1) as an attribute of records, (2) as an attribute of fields of records, and (3) as an attribute of enums.

I don't think we support enums at all, but the first two correspond directly to comment-on-type and comment-on-column here.

@bkirwi
Copy link
Contributor

bkirwi commented Sep 21, 2023

@umanwizard - Yeah! That's true now but it was not originally specced that way -- they were both ways of writing field-level docs: see ece1b20. Agree that with the updated semantics all looks good!

@moulimukherjee moulimukherjee enabled auto-merge (squash) September 21, 2023 17:53
@moulimukherjee
Copy link
Contributor

Thanks for the reviews folks! Enabled auto-merge.

@umanwizard umanwizard dismissed def-’s stale review September 21, 2023 18:58

His comment has been addressed

@moulimukherjee moulimukherjee merged commit e724754 into MaterializeInc:main Sep 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants