doc/developer: add design doc for adding docs to Avro sink schemas #21564

benesch · 2023-09-03T16:55:54Z

@moulimukherjee — here's an initial sketch of a design and implementation for Avro field documentation. Could I hand this off to you to address and resolve any resulting discussion?

This is a design for MaterializeInc/database-issues#6480.

Motivation

This PR adds a design document.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- n/a

doc/developer/design/20230903_avro_doc.md

moulimukherjee · 2023-09-05T07:32:17Z

@benesch Sure, thank you!

sjwiesman · 2023-09-05T15:59:29Z

It's unclear to me from the design doc: can you set a top-level doc comment for the full schema?

moulimukherjee · 2023-09-07T15:55:16Z

@sjwiesman Avro does support a top level doc for the record. We should be able to allow it as well.

doc/developer/design/20230903_avro_doc.md

benesch · 2023-09-08T07:26:26Z

Ugh, just thought of a very big wrinkle. Nested records don't retain their nice names! Given e.g.

CREATE TYPE point AS (x integer, y integer);
CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2;
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...;

the generated schema won't have a point record! Instead it'll have a record with an autogenerated name (record0). So to use the DOC ON option as designed, you'd have to somehow guess the autogenerated names. Urgh! Mayyybe it's okay if we discourage use of DOC ON and instead tell people to use COMMENT, and then Materialize does the translation from semantic name to generated name automatically, but still gross to have DOC ON hanging around as a user-facing feature with such a footgun.
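
To make the problem concrete, here is a rough Python sketch of the shape of the value schema generated for `v`. Only the name `record0` comes from the comment above. The top-level name `envelope`, the field types, and the omission of nullability unions are assumptions for illustration, not the exact output of Materialize.

```python
# Assumed shape of the generated schema; only `record0` is taken from the discussion.
generated = {
    "type": "record",
    "name": "envelope",  # assumed top-level name
    "fields": [
        {
            "name": "c1",
            "type": {
                "type": "record",
                "name": "record0",  # autogenerated, not `point`
                "fields": [
                    {"name": "x", "type": "int"},
                    {"name": "y", "type": "int"},
                ],
            },
        },
        {"name": "c2", "type": "string"},
    ],
}


def record_names(schema):
    """Collect the names of all records in a (simplified) Avro schema."""
    names = []
    if isinstance(schema, dict) and schema.get("type") == "record":
        names.append(schema["name"])
        for field in schema["fields"]:
            names.extend(record_names(field["type"]))
    return names


names = record_names(generated)
print(names)  # ['envelope', 'record0']
```

The SQL type name `point` appears nowhere in the schema, so a user who wants to document `x` in Avro terms has nothing stable to refer to.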

benesch · 2023-09-08T07:28:12Z

I wonder how hard it'd be to change sink schema generation to use the Materialize names of the nested record types rather than auto-generating names.

Alternatively, perhaps the DOC ON specifier should be in SQL terms, rather than Avro terms? E.g., you say DOC ON foo.bar.baz instead of DOC ON com.materialize.sink.record3::baz. We'd still have a problem down the road with sum types, if we ever implemented those, but we could cross that bridge when we got there.
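
A minimal sketch of how keying documentation by SQL names could work: the comments are looked up by type and column name while the schema is generated, so the user never needs to know the generated record names. The `TYPES` and `DOCS` tables, the `record_for` helper, and the `recordN` / `envelope` naming are all hypothetical and only loosely modeled on the example above.

```python
import itertools

SCALARS = {"integer": "int", "text": "string"}

# SQL-level description of the example: a named type and the sink's relation.
TYPES = {
    "point": [("x", "integer"), ("y", "integer")],
    "v": [("c1", "point"), ("c2", "text")],
}

# Docs keyed by SQL names: (type or relation, column).
# A column of None means the doc belongs to the record itself.
DOCS = {
    ("point", None): "A point in the plane.",
    ("point", "x"): "Horizontal coordinate.",
    ("v", "c2"): "Free-form label.",
}


def record_for(sql_name, ids, avro_name=None):
    """Build an Avro record for a SQL type, attaching docs by SQL name."""
    record = {"type": "record", "name": avro_name or f"record{next(ids)}"}
    if (sql_name, None) in DOCS:
        record["doc"] = DOCS[(sql_name, None)]
    record["fields"] = []
    for column, sql_type in TYPES[sql_name]:
        if sql_type in SCALARS:
            avro_type = SCALARS[sql_type]
        else:
            # Nested record: the Avro name is generated, the doc lookup is not.
            avro_type = record_for(sql_type, ids)
        field = {"name": column, "type": avro_type}
        if (sql_name, column) in DOCS:
            field["doc"] = DOCS[(sql_name, column)]
        record["fields"].append(field)
    return record


schema = record_for("v", itertools.count(), avro_name="envelope")
nested = schema["fields"][0]["type"]
print(nested["name"], "-", nested["doc"])
```

Because the lookup key never mentions `record0`, changing the naming scheme for generated records would not invalidate any user-written documentation targets.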

This is a design for #21557.

benesch · 2023-09-12T05:13:39Z

I just pushed up a big change that uses SQL names rather than Avro field specifiers to indicate on which fields to attach the comments. It's quite a bit more verbiage in the design document, but I think it is not really any harder to implement, and insulates the DOC ON syntax from future changes to how we autogenerate Avro record names.

doc/developer/design/20230903_avro_doc.md

moulimukherjee · 2023-09-13T14:40:32Z

+1 to specifying on sql names instead of avro field names, because that's already known to the user.

doc/developer/design/20230903_avro_doc.md

fmt

umanwizard

Overall looks good to me. For implementation, I think it would make sense to do this in a test-driven way, since the rules for choosing which doc string to use are rather complex. I.e., it might be better to first land some failing tests (and obviously disable them in CI) that encode the desired behavior, before writing the implementation.

doc/developer/design/20230903_avro_doc.md

Co-authored-by: umanwizard <[email protected]>

bkirwi

Cool!

This all feels Complicated, and it makes me nervous that every future addition of support for new avro things will leave us with an unmanageable sink syntax. I appreciate why we haven't chosen the other options, though, so this still seems worthwhile to unlock the functionality.

If anything, though, I'd be happy to see some unpacking of why we support comment-on-type... since I don't quite understand the motivation and it ~doubles the surface area.

doc/developer/design/20230903_avro_doc.md

bkirwi · 2023-09-20T20:44:03Z

doc/developer/design/20230903_avro_doc.md

+
+#### Planner
+
+The planner will "freeze" any comments that have been promoted to documentation


bkirwi · 2023-09-20T20:46:47Z

doc/developer/design/20230903_avro_doc.md

+  * It is not ergonomic. The provided schema must exactly match the schema
+    Materialize generates, *except* for the `doc` fields. Minor errors in
+    constructing the schema (e.g., using a `long` where an `int` is required, or
+    ordering fields wrong) will result in hard to debug failures.


One idea, which came up recently in another context, was to add some avro_schema_for(<relation>) function that would output our generated avro schema as a string.

That would mitigate this concern a bit, since users could just edit the generated schema instead of having to get the types right a priori. I don't think it helps with the other concern however.

Yes, we had a similar request to see the schema without creating the sink: https://github.com/MaterializeInc/materialize/issues/21661
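
If users did edit a generated schema by hand, the "must match except for `doc`" rule could be checked along these lines. This is only a sketch: `strip_docs` and `matches_except_docs` are hypothetical helpers and the example schemas are made up. Nothing here reflects an existing Materialize function.

```python
def strip_docs(schema):
    """Return a copy of a parsed Avro schema with every `doc` attribute removed."""
    if isinstance(schema, dict):
        return {k: strip_docs(v) for k, v in schema.items() if k != "doc"}
    if isinstance(schema, list):
        return [strip_docs(item) for item in schema]
    return schema


def matches_except_docs(generated, provided):
    """True if the two schemas differ at most in their `doc` attributes."""
    return strip_docs(generated) == strip_docs(provided)


generated = {
    "type": "record",
    "name": "envelope",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "label", "type": "string"},
    ],
}

# Same schema with documentation added by the user.
documented = {
    "type": "record",
    "name": "envelope",
    "doc": "One row per item.",
    "fields": [
        {"name": "id", "type": "long", "doc": "Primary key."},
        {"name": "label", "type": "string"},
    ],
}

# A hand-written schema with a subtle type error (`int` instead of `long`).
wrong_type = {
    "type": "record",
    "name": "envelope",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "label", "type": "string"},
    ],
}

print(matches_except_docs(generated, documented))  # True
print(matches_except_docs(generated, wrong_type))  # False
```

List comparison is order-sensitive, so reordered fields are also rejected, which mirrors the failure modes called out in the quoted design text.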

Co-authored-by: Ben Kirwin <[email protected]>

umanwizard · 2023-09-21T14:04:49Z

> If anything, though, I'd be happy to see some unpacking of why we support comment-on-type

Basically just because Avro supports it. doc appears in three places in Avro: (1) as an attribute of records, (2) as an attribute of fields of records, and (3) as an attribute of enums.

I don't think we support enums at all, but the first two correspond directly to comment-on-type and comment-on-column here.
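
For reference, a small sketch of the three places `doc` can sit in an Avro schema, written as Python data. The schema itself (`Order`, `Status`, and the field names) is invented for illustration, and `doc_sites` is a hypothetical helper.

```python
schema = {
    "type": "record",
    "name": "Order",
    "doc": "One row per order.",  # (1) doc on a record
    "fields": [
        {
            "name": "id",
            "type": "long",
            "doc": "Primary key.",  # (2) doc on a field of a record
        },
        {
            "name": "status",
            "type": {
                "type": "enum",
                "name": "Status",
                "doc": "Lifecycle state.",  # (3) doc on an enum
                "symbols": ["OPEN", "CLOSED"],
            },
        },
    ],
}


def doc_sites(node):
    """List (kind, name) for every schema node that carries a `doc` attribute."""
    sites = []
    if isinstance(node, dict):
        if "doc" in node:
            # Records and enums declare their kind in `type`; anything else is a field.
            kind = node["type"] if node.get("type") in ("record", "enum") else "field"
            sites.append((kind, node["name"]))
        for value in node.values():
            sites.extend(doc_sites(value))
    elif isinstance(node, list):
        for item in node:
            sites.extend(doc_sites(item))
    return sites


print(doc_sites(schema))
```

Cases (1) and (2) are the ones that line up with comment-on-type and comment-on-column.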

bkirwi · 2023-09-21T14:47:41Z

@umanwizard - Yeah! That's true now but it was not originally specced that way -- they were both ways of writing field-level docs: see ece1b20. Agree that with the updated semantics all looks good!

moulimukherjee · 2023-09-21T17:53:23Z

Thanks for the reviews folks! Enabled auto-merge.

His comment has been addressed

benesch requested review from bkirwi, umanwizard, moulimukherjee and dseisun-materialize September 3, 2023 16:55

benesch force-pushed the design-avro-doc branch from ba14cb3 to 44d18fd Compare September 3, 2023 16:56

umanwizard reviewed Sep 4, 2023

benesch force-pushed the design-avro-doc branch from 44d18fd to 57af193 Compare September 5, 2023 19:42

moulimukherjee reviewed Sep 8, 2023

moulimukherjee reviewed Sep 8, 2023

moulimukherjee reviewed Sep 8, 2023

moulimukherjee reviewed Sep 8, 2023

benesch force-pushed the design-avro-doc branch 2 times, most recently from 987607e to 0012be4 Compare September 8, 2023 07:22

doc/developer: add design doc for adding docs to Avro sink schemas

b802781

This is a design for #21557.

benesch force-pushed the design-avro-doc branch from 0012be4 to b802781 Compare September 12, 2023 05:12

def- previously requested changes Sep 13, 2023

umanwizard reviewed Sep 13, 2023

Removing duplicate

28a72ae

moulimukherjee reviewed Sep 13, 2023

moulimukherjee requested review from def- and umanwizard September 19, 2023 20:11

moulimukherjee reviewed Sep 20, 2023

Specifying, will use existing csr connection options

8bc1fb5

fmt

moulimukherjee force-pushed the design-avro-doc branch from be5dc67 to 8bc1fb5 Compare September 20, 2023 15:30

umanwizard approved these changes Sep 20, 2023

Apply suggestions from code review

9d292e7

Co-authored-by: umanwizard <[email protected]>

bkirwi approved these changes Sep 20, 2023

moulimukherjee and others added 2 commits September 20, 2023 14:15

Apply suggestions from code review

8bc450e

Co-authored-by: Ben Kirwin <[email protected]>

design: fix specification of sink Avro DOC ON option

ece1b20

Updated BTreeMap key from String to DocTarget

7aafd3f

moulimukherjee enabled auto-merge (squash) September 21, 2023 17:53

moulimukherjee merged commit e724754 into MaterializeInc:main Sep 21, 2023

moulimukherjee mentioned this pull request Oct 6, 2023

adapter,interchange,parser: Support avro sink comments #22019

Merged

5 tasks
