doc/developer: add design doc for adding docs to Avro sink schemas

This is a design for #21557.
MaterializeInc · Sep 8, 2023 · 0012be4 · 0012be4
1 parent d9a011e
commit 0012be4
Showing 1 changed file with 329 additions and 0 deletions.
diff --git a/doc/developer/design/20230903_avro_doc.md b/doc/developer/design/20230903_avro_doc.md
@@ -0,0 +1,329 @@
+# Schema documentation for Avro-formatted sinks
+
+Associated issue: [#21557]
+
+## Context
+
+Avro schemas allow users to annotate records, record fields, and enums with
+documentation. Note the `doc` fields in the following example Avro schema:
+
+```json
+{
+  "type": "record",
+  "name": "row",
+  "doc": "represents a row in some materialized view",
+  "fields" : [
+    {"name": "c1", "type": "string", "doc": "this is the meaning of column a"},
+    {"name": "c2", "type": "int"}
+  ]
+}
+```
+
+When using an Avro-formatted sink, Materialize generates the Avro schema based
+on the shape of the underlying relation. For example, given the following
+sink:
+
+```sql
+CREATE MATERIALIZED VIEW v AS SELECT 'text' AS c1, 'int'::integer as c2;
+CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ...;
+```
+
+Materialize will automatically generate the following Avro schema and publish it
+to the schema registry:
+
+```json
+{
+  "type": "record",
+  "name": "envelope",
+  "fields" : [
+    {"name": "c1", "type": "string"},
+    {"name": "c2", "type": "int"}
+  ]
+}
+```
+
+Today, there is no way to instruct Materialize to include `doc` fields in the
+generated Avro schema. This poses a challenge for organizations with
+requirements on the use of documentation—e.g., a policy that requires that every
+field in every schema published to the organization's schema registry has
+documentation.
+
+## Goals
+
+Must haves:
+
+  * Enable users with organizational policies on the use of `doc` fields in Avro
+    schemas to use Avro-formatted sinks.
+
+Nice to haves:
+
+  * Consistency with Materialize's existing approach to generating Avro schemas
+    from SQL relations.
+
+  * Good ergonomics.
+
+## Non-goals
+
+None yet.
+
+## Detailed description
+
+### Design
+
+#### SQL
+
+We'll extend the `FORMAT AVRO` clause with a new `DOC ON` option:
+
+```sql
+CREATE SINK ...
+FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ... (
+    DOC ON <field-specifier> = '<documentation string>',
+    DOC ON <field-specifier> = '<documentation string>',
+    ...
+)
+WITH (...)
+```
+
+The option will be located in a new parenthesized option list immediately
+following the existing `FORMAT AVRO` clause. After this change, `DOC ON` will
+be the only supported option, but we expect to add further options to control
+the Avro schema in the future.
+
+Aside: the current design of the `USING CONFLUENT SCHEMA REGISTRY` option is
+historical and predates the introduction of the standardized option list. In the
+future, we may also move the `USING CONFLUENT SCHEMA REGISTRY` into the option
+list, where it can be freely intermixed with other options, e.g.:
+
+```sql
+FORMAT AVRO (
+    DOC ON ...,
+    CONFLUENT SCHEMA REGISTRY = '...',
+    DOC ON ...,
+)
+```
+
+Each `DOC ON` option describes a documentation comment that should be attached
+to a single field in the Avro schema. See [Avro schema node
+specifiers](#avro-schema-node-specifiers) below for details on how users name
+the Avro schema node on which to attach documentation.
+
+Materialize will reject `CREATE SINK` commands that specify multiple `DOC ON`
+options naming the same field.
+
+Materialize will automatically use [comments] on the columns of materialized
+views, or its constituent types, as the default documentation for a field.
+Specifically, for a given node of the Avro schema, Materialize will consider the
+following sources of documentation in order:
+
+  * A `DOC ON` option specifically naming that schema node.
+  * A comment on the column that produced the schema node. If the schema
+    element is at the top level, this will be a comment on a column of the
+    sink's materialized view. If the schema node is nested, this will be a
+    comment on a column of a record type.
+  * A comment on the type of the column that produced the schema node.
+
+For example, consider the following sink:
+
+```sql
+CREATE TYPE point AS (x integer, y integer);
+CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2;
+CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...;
+```
+
+For `c1`, Materialize will consider the following sources of documentation
+in order:
+
+  * A `DOC ON` option naming `envelope::c1`.
+  * A comment on column `v.c1`.
+  * A comment on type `point`.
+
+`c2` is analogous.
+
+For `c1.x`, Materialize will consider the following sources of documentation
+in order:
+
+  * A `DOC ON` comment naming `com.materialize.sink.point::x`.
+  * A comment on column `point.x`.
+  * A comment on type `integer`.
+
+`c1.y` is analogous.
+
+Materialize will also use comments on the materialized view itself as the
+default documentation for the `envelope` node. Specifically, Materialize will
+consider the following sources of documentation in order:
+
+  * A `DOC ON` comment naming `envelope`.
+  * A comment on materialized view `v`.
+
+Changes made to comments *after* the the sink is created will not be reflected
+in the Avro schema.
+
+#### Avro schema node specifiers
+
+[Brennan](https://github.com/umanwizard) invented a way of unambiguously
+naming Avro schema nodes in the `kgen` load generator. Its syntax and semantics
+are described here.
+
+An Avro schema node can appear in the following contexts:
+
+  1. The top-level node of the schema,
+  2. One of the fields of a record,
+  3. One of the branches of a union,
+  4. The value type of an array, or
+  5. The value type of a map.
+
+We define the "unambiguous name" of a schema node recursively as follows:
+
+ * If the node is of a [named
+   type](https://avro.apache.org/docs/1.10.2/spec.html#names) (record, enum, or
+   fixed) then its unambiguous name is the same as its "fullname" as defined by
+   the Avro spec; e.g. just its name if it is not part of a namespace, or its
+   namespace, followed by `.`, followed by its name if it is.
+ * Otherwise, if the node is the top-level node of the schema, its unambiguous
+   name is the empty string (we will never have to deal with this, as sinks in
+   Materialize can't have a schema that is just a timestamp field and nothing
+   else. Things are always at least nested inside a record.)
+ * Otherwise, if the node is a record field, its unambiguous name is the
+   record's unambiguous name (i.e., its fullname), followed by `::`, followed by
+   the field name. Note that you don't need to account for multiple levels of
+   nesting here-- if record `my.namespace.a` contains record `my.namespace.b`
+   which contains record `my.namespace.c` which contains field `f1`, you can
+   just call that field `my.namespace.c::f1`, unambiguously.
+ * Otherwise, if the node is one of the variants of a union, its unambiguous
+   name is the union's unambiguous name, followed by a `.`, followed by the
+   node's index within a union. I think we should actually change this and use
+   the type name instead (e.g. `union_unambiguous_name.long` instead of
+   `union_unambiguous_name.4`). That would work because unions are not allowed
+   to contain multiple variants of the same kind of type unless they are named
+   types.
+ * Otherwise, if the node is the value type of an array or map, its unambiguous
+   name is the array or map's unambiguous name, followed by `[]`.
+
+### User-facing documentation
+
+The details in the [Design](#design) section above will need to be worked into
+the Avro sink documentation. In particular, we should be sure to precisely
+explain Avro schema node specifiers and the order in which Materialize searches
+for sources of documentation.
+
+### Implementation
+
+#### Parser
+
+The extensions to the parser are straightforward.
+
+#### Planner
+
+The planner will "freeze" any comments that have been promoted to documentation
+by rewriting the `CREATE SINK` statement to contain those comments as explicit
+`DOC ON` options. This ensures that the Avro schema generated for a sink is
+deterministic—i.e., that future changes to comments on the columns and types
+used by the sink will *not* result in Materialize generating a different schema
+on restart.
+
+#### Catalog
+
+No changes to the catalog are required. The documentation will be automatically
+stored in the catalog via the addition of the `DOC ON` option, as the catalog
+stores the raw DDL statement for each sink.
+
+#### Avro schema generation
+
+The `interchange::json::build_row_schema_json` will need to be adjusted to
+take the requested documentation as a parameter. Loosely, it could grow a
+`docs: BTreeMap<String, String>` parameter, where the key represents the
+field specifier and the value represents the requested documentation for that
+field.
+
+The sink code in the coordinator would need to convert the `DOC ON` comments
+to this `BTreeMap`.
+
+## Alternatives
+
+There is one known alternative: Materialize could instead allow users to
+*provide* an Avro schema with all required `doc` fields. For example, users
+could pre-publish the desired schema to the schema registry, and then
+Materialize could load that schema at `CREATE SINK` time.
+
+There are several issues with this approach:
+
+  * It is not ergonomic. The provided schema must exactly match the schema
+    Materialize generates, *except* for the `doc` fields. Minor errors in
+    constructing the schema (e.g., using a `long` where an `int` is required, or
+    ordering fields wrong) will result in hard to debug failures.
+
+  * It prevents Materialize from making backwards compatible changes to sink
+    schemas.
+
+    Today, we can make changes the generated Avro sink schemas as long as they
+    are backwards compatible, according to Avro's schema resolution rules. For
+    example, Materialize could choose to add a new field to the top-level record
+    containing additional metadata. This is a backwards compatible change as
+    long as Materialize can publish a new schema to the registry containing the
+    new field. Downstream consumers using the old schema (as a reader schema)
+    will simply discard the new field.
+
+    We'd not be able to evolve schemas like this if users provide Materialized
+    with the schema to use. The user-provided schema would not contain the new
+    fields, and Materialize would not have the authority to edit that schema to
+    contain the new fields.
+
+At a high level, allowing users to specify their desired Avro schema would make
+Materialize responsible for reconciling that Avro schema with the shape of the
+materialized view provided to the sink. This reconciliation would be quite
+delicate and difficult, and liable to cause sink breakage during version
+upgrades.
+
+So, as a philosophy, we propose that Materialize is always solely responsible
+for generating the Avro schema used by an Avro-formatted sink. Customizations to
+the generated Avro schema must be made via purpose-built options in the `FORMAT
+AVRO` clause.
+
+## Future work
+
+Possibly future work includes:
+
+* Allowing changes to documentation comments for an existing sink.
+
+  Hypothetical syntax:
+
+  ```sql
+  ALTER SINK ... FORMAT AVRO (
+    ADD DOC ON <field-specifier> = '<documentation string>',
+    ADD DOC ON <field-specifier> = '<documentation string>',
+    ...
+  );
+  ```
+
+* Refreshing the Avro schema with any changes made to comments since the sink
+  was created.
+
+  Hypothetical syntax:
+
+  ```sql
+  ALTER SINK ... FORMAT AVRO (REFRESH DOCS);
+  ```
+
+* Disabling the automatic use of Materialize comments as Avro documentation.
+
+  Hypothetical syntax:
+
+  ```sql
+  CREATE SINK ... FORMAT AVRO ... (IGNORE COMMENTS);
+  ```
+
+* Supporting configurable sink policies that e.g. prevent creating sinks without
+  documentation for every field.
+
+  Hypothetical syntax:
+
+  ```sql
+  CREATE SINK POLICY ... FORMAT AVRO (REQUIRE DOCUMENTATION FOR ALL FIELDS);
+  ```
+
+## Open questions
+
+None.
+
+[#21557]: https://github.com/MaterializeInc/materialize/issues/21557
+[comments]: https://github.com/MaterializeInc/materialize/issues/1212