-
Notifications
You must be signed in to change notification settings - Fork 467
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
doc/developer: add design doc for adding docs to Avro sink schemas
This is a design for #21557.
- Loading branch information
Showing
1 changed file
with
329 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,329 @@ | ||
# Schema documentation for Avro-formatted sinks | ||
|
||
Associated issue: [#21557] | ||
|
||
## Context | ||
|
||
Avro schemas allow users to annotate records, record fields, and enums with | ||
documentation. Note the `doc` fields in the following example Avro schema: | ||
|
||
```json | ||
{ | ||
"type": "record", | ||
"name": "row", | ||
"doc": "represents a row in some materialized view", | ||
"fields" : [ | ||
{"name": "c1", "type": "string", "doc": "this is the meaning of column a"}, | ||
{"name": "c2", "type": "int"} | ||
] | ||
} | ||
``` | ||
|
||
When using an Avro-formatted sink, Materialize generates the Avro schema based | ||
on the shape of the underlying relation. For example, given the following | ||
sink: | ||
|
||
```sql | ||
CREATE MATERIALIZED VIEW v AS SELECT 'text' AS c1, 'int'::integer as c2; | ||
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ...; | ||
``` | ||
|
||
Materialize will automatically generate the following Avro schema and publish it | ||
to the schema registry: | ||
|
||
```json | ||
{ | ||
"type": "record", | ||
"name": "envelope", | ||
"fields" : [ | ||
{"name": "c1", "type": "string"}, | ||
{"name": "c2", "type": "int"} | ||
] | ||
} | ||
``` | ||
|
||
Today, there is no way to instruct Materialize to include `doc` fields in the | ||
generated Avro schema. This poses a challenge for organizations with | ||
requirements on the use of documentation—e.g., a policy that requires that every | ||
field in every schema published to the organization's schema registry has | ||
documentation. | ||
|
||
## Goals | ||
|
||
Must haves: | ||
|
||
* Enable users with organizational policies on the use of `doc` fields in Avro | ||
schemas to use Avro-formatted sinks. | ||
|
||
Nice to haves: | ||
|
||
* Consistency with Materialize's existing approach to generating Avro schemas | ||
from SQL relations. | ||
|
||
* Good ergonomics. | ||
|
||
## Non-goals | ||
|
||
None yet. | ||
|
||
## Detailed description | ||
|
||
### Design | ||
|
||
#### SQL | ||
|
||
We'll extend the `FORMAT AVRO` clause with a new `DOC ON` option: | ||
|
||
```sql | ||
CREATE SINK ... | ||
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ... ( | ||
DOC ON <field-specifier> = '<documentation string>', | ||
DOC ON <field-specifier> = '<documentation string>', | ||
... | ||
) | ||
WITH (...) | ||
``` | ||
|
||
The option will be located in a new parenthesized option list immediately | ||
following the existing `FORMAT AVRO` clause. After this change, `DOC ON` will | ||
be the only supported option, but we expect to add further options to control | ||
the Avro schema in the future. | ||
|
||
Aside: the current design of the `USING CONFLUENT SCHEMA REGISTRY` option is | ||
historical and predates the introduction of the standardized option list. In the | ||
future, we may also move the `USING CONFLUENT SCHEMA REGISTRY` into the option | ||
list, where it can be freely intermixed with other options, e.g.: | ||
|
||
```sql | ||
FORMAT AVRO ( | ||
DOC ON ..., | ||
CONFLUENT SCHEMA REGISTRY = '...', | ||
DOC ON ..., | ||
) | ||
``` | ||
|
||
Each `DOC ON` option describes a documentation comment that should be attached | ||
to a single field in the Avro schema. See [Avro schema node | ||
specifiers](#avro-schema-node-specifiers) below for details on how users name | ||
the Avro schema node on which to attach documentation. | ||
|
||
Materialize will reject `CREATE SINK` commands that specify multiple `DOC ON` | ||
options naming the same field. | ||
|
||
Materialize will automatically use [comments] on the columns of materialized | ||
views, or its constituent types, as the default documentation for a field. | ||
Specifically, for a given node of the Avro schema, Materialize will consider the | ||
following sources of documentation in order: | ||
|
||
* A `DOC ON` option specifically naming that schema node. | ||
* A comment on the column that produced the schema node. If the schema | ||
element is at the top level, this will be a comment on a column of the | ||
sink's materialized view. If the schema node is nested, this will be a | ||
comment on a column of a record type. | ||
* A comment on the type of the column that produced the schema node. | ||
|
||
For example, consider the following sink: | ||
|
||
```sql | ||
CREATE TYPE point AS (x integer, y integer); | ||
CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2; | ||
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...; | ||
``` | ||
|
||
For `c1`, Materialize will consider the following sources of documentation | ||
in order: | ||
|
||
* A `DOC ON` option naming `envelope::c1`. | ||
* A comment on column `v.c1`. | ||
* A comment on type `point`. | ||
|
||
`c2` is analogous. | ||
|
||
For `c1.x`, Materialize will consider the following sources of documentation | ||
in order: | ||
|
||
* A `DOC ON` comment naming `com.materialize.sink.point::x`. | ||
* A comment on column `point.x`. | ||
* A comment on type `integer`. | ||
|
||
`c1.y` is analogous. | ||
|
||
Materialize will also use comments on the materialized view itself as the | ||
default documentation for the `envelope` node. Specifically, Materialize will | ||
consider the following sources of documentation in order: | ||
|
||
* A `DOC ON` comment naming `envelope`. | ||
* A comment on materialized view `v`. | ||
|
||
Changes made to comments *after* the the sink is created will not be reflected | ||
in the Avro schema. | ||
|
||
#### Avro schema node specifiers | ||
|
||
[Brennan](https://github.com/umanwizard) invented a way of unambiguously | ||
naming Avro schema nodes in the `kgen` load generator. Its syntax and semantics | ||
are described here. | ||
|
||
An Avro schema node can appear in the following contexts: | ||
|
||
1. The top-level node of the schema, | ||
2. One of the fields of a record, | ||
3. One of the branches of a union, | ||
4. The value type of an array, or | ||
5. The value type of a map. | ||
|
||
We define the "unambiguous name" of a schema node recursively as follows: | ||
|
||
* If the node is of a [named | ||
type](https://avro.apache.org/docs/1.10.2/spec.html#names) (record, enum, or | ||
fixed) then its unambiguous name is the same as its "fullname" as defined by | ||
the Avro spec; e.g. just its name if it is not part of a namespace, or its | ||
namespace, followed by `.`, followed by its name if it is. | ||
* Otherwise, if the node is the top-level node of the schema, its unambiguous | ||
name is the empty string (we will never have to deal with this, as sinks in | ||
Materialize can't have a schema that is just a timestamp field and nothing | ||
else. Things are always at least nested inside a record.) | ||
* Otherwise, if the node is a record field, its unambiguous name is the | ||
record's unambiguous name (i.e., its fullname), followed by `::`, followed by | ||
the field name. Note that you don't need to account for multiple levels of | ||
nesting here-- if record `my.namespace.a` contains record `my.namespace.b` | ||
which contains record `my.namespace.c` which contains field `f1`, you can | ||
just call that field `my.namespace.c::f1`, unambiguously. | ||
* Otherwise, if the node is one of the variants of a union, its unambiguous | ||
name is the union's unambiguous name, followed by a `.`, followed by the | ||
node's index within a union. I think we should actually change this and use | ||
the type name instead (e.g. `union_unambiguous_name.long` instead of | ||
`union_unambiguous_name.4`). That would work because unions are not allowed | ||
to contain multiple variants of the same kind of type unless they are named | ||
types. | ||
* Otherwise, if the node is the value type of an array or map, its unambiguous | ||
name is the array or map's unambiguous name, followed by `[]`. | ||
|
||
### User-facing documentation | ||
|
||
The details in the [Design](#design) section above will need to be worked into | ||
the Avro sink documentation. In particular, we should be sure to precisely | ||
explain Avro schema node specifiers and the order in which Materialize searches | ||
for sources of documentation. | ||
|
||
### Implementation | ||
|
||
#### Parser | ||
|
||
The extensions to the parser are straightforward. | ||
|
||
#### Planner | ||
|
||
The planner will "freeze" any comments that have been promoted to documentation | ||
by rewriting the `CREATE SINK` statement to contain those comments as explicit | ||
`DOC ON` options. This ensures that the Avro schema generated for a sink is | ||
deterministic—i.e., that future changes to comments on the columns and types | ||
used by the sink will *not* result in Materialize generating a different schema | ||
on restart. | ||
|
||
#### Catalog | ||
|
||
No changes to the catalog are required. The documentation will be automatically | ||
stored in the catalog via the addition of the `DOC ON` option, as the catalog | ||
stores the raw DDL statement for each sink. | ||
|
||
#### Avro schema generation | ||
|
||
The `interchange::json::build_row_schema_json` will need to be adjusted to | ||
take the requested documentation as a parameter. Loosely, it could grow a | ||
`docs: BTreeMap<String, String>` parameter, where the key represents the | ||
field specifier and the value represents the requested documentation for that | ||
field. | ||
|
||
The sink code in the coordinator would need to convert the `DOC ON` comments | ||
to this `BTreeMap`. | ||
|
||
## Alternatives | ||
|
||
There is one known alternative: Materialize could instead allow users to | ||
*provide* an Avro schema with all required `doc` fields. For example, users | ||
could pre-publish the desired schema to the schema registry, and then | ||
Materialize could load that schema at `CREATE SINK` time. | ||
|
||
There are several issues with this approach: | ||
|
||
* It is not ergonomic. The provided schema must exactly match the schema | ||
Materialize generates, *except* for the `doc` fields. Minor errors in | ||
constructing the schema (e.g., using a `long` where an `int` is required, or | ||
ordering fields wrong) will result in hard to debug failures. | ||
|
||
* It prevents Materialize from making backwards compatible changes to sink | ||
schemas. | ||
|
||
Today, we can make changes the generated Avro sink schemas as long as they | ||
are backwards compatible, according to Avro's schema resolution rules. For | ||
example, Materialize could choose to add a new field to the top-level record | ||
containing additional metadata. This is a backwards compatible change as | ||
long as Materialize can publish a new schema to the registry containing the | ||
new field. Downstream consumers using the old schema (as a reader schema) | ||
will simply discard the new field. | ||
|
||
We'd not be able to evolve schemas like this if users provide Materialized | ||
with the schema to use. The user-provided schema would not contain the new | ||
fields, and Materialize would not have the authority to edit that schema to | ||
contain the new fields. | ||
|
||
At a high level, allowing users to specify their desired Avro schema would make | ||
Materialize responsible for reconciling that Avro schema with the shape of the | ||
materialized view provided to the sink. This reconciliation would be quite | ||
delicate and difficult, and liable to cause sink breakage during version | ||
upgrades. | ||
|
||
So, as a philosophy, we propose that Materialize is always solely responsible | ||
for generating the Avro schema used by an Avro-formatted sink. Customizations to | ||
the generated Avro schema must be made via purpose-built options in the `FORMAT | ||
AVRO` clause. | ||
|
||
## Future work | ||
|
||
Possibly future work includes: | ||
|
||
* Allowing changes to documentation comments for an existing sink. | ||
|
||
Hypothetical syntax: | ||
|
||
```sql | ||
ALTER SINK ... FORMAT AVRO ( | ||
ADD DOC ON <field-specifier> = '<documentation string>', | ||
ADD DOC ON <field-specifier> = '<documentation string>', | ||
... | ||
); | ||
``` | ||
|
||
* Refreshing the Avro schema with any changes made to comments since the sink | ||
was created. | ||
|
||
Hypothetical syntax: | ||
|
||
```sql | ||
ALTER SINK ... FORMAT AVRO (REFRESH DOCS); | ||
``` | ||
|
||
* Disabling the automatic use of Materialize comments as Avro documentation. | ||
|
||
Hypothetical syntax: | ||
|
||
```sql | ||
CREATE SINK ... FORMAT AVRO ... (IGNORE COMMENTS); | ||
``` | ||
|
||
* Supporting configurable sink policies that e.g. prevent creating sinks without | ||
documentation for every field. | ||
|
||
Hypothetical syntax: | ||
|
||
```sql | ||
CREATE SINK POLICY ... FORMAT AVRO (REQUIRE DOCUMENTATION FOR ALL FIELDS); | ||
``` | ||
|
||
## Open questions | ||
|
||
None. | ||
|
||
[#21557]: https://github.com/MaterializeInc/materialize/issues/21557 | ||
[comments]: https://github.com/MaterializeInc/materialize/issues/1212 |