adapter,interchange,parser: Support avro sink comments #22019

Merged

Conversation

@moulimukherjee moulimukherjee commented Sep 27, 2023

Added an implementation to:

  • automatically add comments to generated Avro schemas from Materialize comments (via COMMENT ON)
  • allow the user to specify documentation in the SQL

Currently, COMMENT ON (in Materialize) does not support adding comments for a column in a custom type, and this PR also skips that. I will do a follow-up to add support for that: https://github.com/MaterializeInc/database-issues/issues/6696.
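
For readers unfamiliar with Avro's `doc` attribute, the following is a minimal sketch of the idea, not the Rust code in this PR. The helper `record_schema` and its parameters are hypothetical; the column names and comment strings are borrowed from the test output later in this thread.

```python
import json


def record_schema(name, columns, record_doc=None, column_docs=None):
    """Build an Avro record schema dict, adding `doc` only where a comment exists.

    `columns` is a list of (column_name, avro_type) pairs. `record_doc` and
    `column_docs` stand in for comments created with COMMENT ON (hypothetical
    inputs for this sketch).
    """
    column_docs = column_docs or {}
    fields = []
    for col_name, avro_type in columns:
        field = {"name": col_name, "type": avro_type}
        if col_name in column_docs:
            field["doc"] = column_docs[col_name]
        fields.append(field)

    schema = {"type": "record", "name": name}
    if record_doc is not None:
        schema["doc"] = record_doc
    schema["fields"] = fields
    return schema


schema = record_schema(
    "row",
    [("l_k", "string"), ("l_v1", ["null", "string"])],
    record_doc="comment on view sink_source_comments_view",
    column_docs={"l_v1": "doc on l_v1"},
)
print(json.dumps(schema))
```

Columns without a comment simply carry no `doc` key, which matches the shape of the schemas shown in the test logs below.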

Motivation

Partially fixes https://github.com/MaterializeInc/database-issues/issues/6480; it does not yet allow comments on columns in a custom type.

Tips for reviewer

Added inline comments.

Checklist

cc: @benesch @dseisun-materialize

@def- def- requested review from def- and removed request for def- September 27, 2023 22:14
@moulimukherjee moulimukherjee force-pushed the support-avro-comments branch 4 times, most recently from cd4e56f to a626ba7 Compare October 5, 2023 15:58
column name is taken to be a name of a column in the top level of the
materialized view. Object names are looked up according to usual SQL name
resolution rules for the search path and active database.
format `[[db.]schema.]object.column`. Object names are looked up according to usual
Contributor Author

@moulimukherjee moulimukherjee Oct 5, 2023

Updated the implementation to keep it consistent with the name resolution for the type below ([[db.]schema.]object) and with the COMMENT ON SQL, which expects the object to be specified. Also, sinks can't be queried and they don't have a relation desc.
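
To make the `[[db.]schema.]object.column` format concrete, here is a small sketch of how such a reference splits into an object name and a column. It is an illustration only: `split_column_reference` is a hypothetical helper, and it assumes unquoted identifiers that contain no dots, which the real parser does not assume.

```python
def split_column_reference(reference):
    """Split '[[db.]schema.]object.column' into (object_name_parts, column).

    Assumes plain, unquoted identifiers. The object part is always required,
    so a bare column name is rejected.
    """
    parts = reference.split(".")
    if not 2 <= len(parts) <= 4 or any(part == "" for part in parts):
        raise ValueError(f"expected [[db.]schema.]object.column, got {reference!r}")
    *object_name, column = parts
    return object_name, column


print(split_column_reference("t.c1"))
print(split_column_reference("db.schema.t.c1"))
```

A bare `c1` raises `ValueError` in this sketch, mirroring the point above that the object must be specified.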

Contributor

👍🏽

# Test Avro UPSERT sinks doc comments

$ postgres-execute connection=postgres://mz_system:materialize@${testdrive.materialize-internal-sql-addr}
ALTER SYSTEM SET enable_comment = true;
Contributor Author

@moulimukherjee moulimukherjee Oct 5, 2023

Note: the automatic Avro documentation derived from Materialize comments is also gated by this flag. If the flag is not set, users can't create comments in Materialize, and no documentation will be added automatically to the generated Avro schema.

This does not prevent the user from explicitly adding DOC ON options in their CREATE SINK SQL, though.

Contributor

@benesch benesch Oct 10, 2023

> This does not prevent the user from explicitly adding DOC ON options in their CREATE SINK SQL, though.

We should add a separate feature flag for the use of DOC ON comments, so that we can properly follow the feature lifecycle for this feature. It should start out in private preview while we're validating its design and implementation with our preview customers. That way we can make breaking changes if necessary.

Contributor Author

Sounds good. Let me add that.

Contributor Author

Done! Put it behind enable_sink_doc_on_option
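
The resulting gating can be pictured roughly as below. This is a sketch, not the planner code: `effective_docs` is a hypothetical helper, the use of `PermissionError` is illustrative, and the rule that an explicit DOC ON option wins over a comment on the same item is an assumption of this sketch rather than something stated in the thread. Only the two flag names come from the discussion.

```python
def effective_docs(comments, doc_on_options, flags):
    """Return the docs that would end up in the Avro schema.

    comments:        item -> text, as created via COMMENT ON
    doc_on_options:  item -> text, as given via DOC ON in CREATE SINK
    flags:           feature flag name -> bool
    """
    if doc_on_options and not flags.get("enable_sink_doc_on_option", False):
        # Hypothetical error type; the real system reports a feature-flag error.
        raise PermissionError("DOC ON options require enable_sink_doc_on_option")

    docs = {}
    if flags.get("enable_comment", False):
        docs.update(comments)
    # Assumption: explicit DOC ON options override comments for the same item.
    docs.update(doc_on_options)
    return docs


flags = {"enable_comment": True, "enable_sink_doc_on_option": True}
print(effective_docs({"t.c1": "from comment"}, {"t.c1": "from DOC ON"}, flags))
```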

@moulimukherjee moulimukherjee changed the title from "WIP: Support avro sink comments" to "Support avro sink comments" Oct 5, 2023
@moulimukherjee moulimukherjee requested a review from def- October 5, 2023 18:58
@moulimukherjee moulimukherjee marked this pull request as ready for review October 5, 2023 19:02
@moulimukherjee moulimukherjee requested a review from a team October 5, 2023 19:02
@moulimukherjee moulimukherjee requested a review from a team as a code owner October 5, 2023 19:02
@moulimukherjee moulimukherjee requested a review from a team October 5, 2023 19:02
shepherdlybot bot commented Oct 5, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag.

| Risk Score | Probability Buggy | File Hotspots |
| --- | --- | --- |
| 🔴 80 / 100 | 59% | 3 |

Buggy File Hotspots:

| File | Percentile |
| --- | --- |
| ../session/vars.rs | 98 |
| ../src/catalog.rs | 100 |
| ../src/parser.rs | 90 |

@umanwizard
Contributor

> Currently, COMMENT ON does not support adding comments for a column in a custom type and this PR also skips that

FWIW, this works fine in Postgres:

brennan=> create type t as (f1 text);
CREATE TYPE
brennan=> comment on column t.f1 is 'hello, world!';
COMMENT

@moulimukherjee
Contributor Author

moulimukherjee commented Oct 5, 2023

@umanwizard Yeah (updated the text). It's just not implemented yet in Materialize. I will do a follow-up to add both COMMENT ON support and DOC ON COLUMN support for them. This PR was becoming a bit too big, and they'll have overlapping logic.

(Tracking https://github.com/MaterializeInc/database-issues/issues/6696)

src/sql-parser/src/ast/defs/ddl.rs Outdated

fn parse_column_name(&mut self) -> Result<(RawItemName, Ident), ParserError> {
    let start = self.peek_pos();
    let mut item_name = self.parse_raw_name()?;
Contributor

This differs from

let mut identifiers = self.parse_identifiers()?;
which calls parse_identifiers. I thiiiink the one in this PR is correct because that's what AstInfo expects. @ParkMyCar fyi: comment on column should probably be taught to call this new function once it supports AstInfo.

src/sql/src/plan/statement/ddl.rs Outdated
(AvroValueFullname, String),
(NullDefaults, bool, Default(false))
);
/// Creating this by hand instead of using generate_extracted_config! macro
Contributor

Can you open an issue about this so we don't forget? I think keeping everything in the macro is good, so we should eventually add enum support to it.

Contributor Author

Sure 👍. I did not find any other similar use cases, so this could be an outlier.

Contributor Author

Filed https://github.com/MaterializeInc/materialize/issues/22213; I will add it to the comment as well.

Contributor

@umanwizard umanwizard left a comment

Changes to Avro schema generation look fine to me. Someone from Adapter should review the rest.

I think we could use an end-to-end test that verifies that other Avro-speaking tools (e.g. the official Java or Python Avro libraries) can properly interpret these schemas and see the docs on them. But I wouldn't think that should block this PR as we have nothing like that currently. @philip-stoev or @def- , what do you think?

src/interchange/src/json.rs Outdated
src/interchange/src/json.rs Outdated
src/sql-parser/src/parser.rs Outdated
src/interchange/src/json.rs Outdated
src/interchange/src/json.rs Outdated
src/interchange/src/json.rs Outdated
@def-
Contributor

def- commented Oct 5, 2023

> I think we could use an end-to-end test that verifies that other Avro-speaking tools (e.g. the official Java or Python Avro libraries) can properly interpret these schemas and see the docs on them. But I wouldn't think that should block this PR as we have nothing like that currently. @philip-stoev or @def- , what do you think?

Makes sense to have that. I'll try implementing that test on Friday, probably using Python Avro lib, since that's easiest from our side. I'd prefer if you can wait for the test.

Contributor

@maddyblue maddyblue left a comment

Adapter parts lgtm.

src/sql/src/plan/statement/ddl.rs Outdated
Contributor

@benesch benesch left a comment

I'm planning to take a look at this tomorrow as well! Might slip into the weekend but will get you a review by Monday morning at the latest.

@philip-stoev
Contributor

> Makes sense to have that. I'll try implementing that test on Friday, probably using Python Avro lib, since that's easiest from our side. I'd prefer if you can wait for the test.

Instead of bringing another tool into the mix, can we simply issue a GET request against the schema service and fish for the required parts in the response?
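
A minimal sketch of that idea, using only the Python standard library. It relies on the Confluent Schema Registry endpoint `GET /subjects/<subject>/versions/latest`, whose response carries the schema as a JSON string in its `schema` field. The helper names `fetch_latest_schema` and `collect_docs`, and the path notation they produce, are hypothetical; the sample schema is trimmed from the test output in this thread.

```python
import json
import urllib.request


def fetch_latest_schema(registry_url, subject):
    """Fetch the latest registered schema for a subject and decode it."""
    url = f"{registry_url}/subjects/{subject}/versions/latest"
    with urllib.request.urlopen(url) as response:
        body = json.load(response)
    # The registry returns the schema itself as an embedded JSON string.
    return json.loads(body["schema"])


def collect_docs(schema, path=""):
    """Walk an Avro schema and return {location: doc} for every `doc` found."""
    docs = {}
    if isinstance(schema, list):  # union: visit each branch
        for branch in schema:
            docs.update(collect_docs(branch, path))
    elif isinstance(schema, dict):
        if schema.get("type") == "record":
            here = f"{path}/{schema['name']}" if path else schema["name"]
            if "doc" in schema:
                docs[here] = schema["doc"]
            for field in schema.get("fields", []):
                field_path = f"{here}.{field['name']}"
                if "doc" in field:
                    docs[field_path] = field["doc"]
                docs.update(collect_docs(field["type"], field_path))
        else:
            # Arrays, maps, and wrapped types: descend into nested schemas.
            for key in ("items", "values", "type"):
                if isinstance(schema.get(key), (dict, list)):
                    docs.update(collect_docs(schema[key], path))
    # Plain strings (primitives or named-type references) carry no docs.
    return docs


sample = {
    "type": "record",
    "name": "envelope",
    "fields": [
        {"name": "before", "default": None, "type": ["null", {
            "type": "record",
            "name": "row",
            "doc": "comment on view sink_source_comments_view",
            "fields": [
                {"name": "l_k", "type": "string"},
                {"name": "l_v1", "type": ["null", "string"],
                 "default": None, "doc": "doc on l_v1"},
            ],
        }]},
        {"name": "after", "type": ["null", "row"], "default": None},
    ],
}
print(collect_docs(sample))
```

In a test, `collect_docs(fetch_latest_schema(url, subject))` could then be compared against the expected comments without depending on an Avro library.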

@moulimukherjee
Contributor Author

Moving to draft to look into the tests

@def-
Contributor

def- commented Oct 6, 2023

The failure is:

Running validate() from <materialize.checks.sink.SinkComments object at 0x7f1d4a52bad0>
$ docker compose exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --materialize-internal-url=postgres://materialize@materialized:6877 --variable-length-row-encoding --aws-endpoint=http://localstack:4566 --no-reset --default-timeout=300s --seed=1 --persist-blob-url=file:///mzdata/persist/blob --persist-consensus-url=postgres://root@materialized:26257?options=--search_path=consensus --var=replicas=1 --var=default-replica-size=4-4 --var=default-storage-size=4-1 --source=/home/deen/git/materialize/misc/python/materialize/checks/sink.py:691
Verifying contents of latest schema for subject "sink-sink-comments1-key" in the schema registry...
Verifying contents of latest schema for subject "sink-sink-comments2-key" in the schema registry...
Verifying contents of latest schema for subject "sink-sink-comments3-key" in the schema registry...
Verifying contents of latest schema for subject "sink-sink-comments1-value" in the schema registry...
^^^ +++
11:1: error: schema did not match
expected:
{"type":"record","name":"envelope","doc":"comment on view sink_source_comments_view","fields":[{"name":"before","type":["null",{"type":"record","name":"row","fields":[{"name":"l_k","type":"string"},{"name":"l_v1","type":["null","string"],"default":null,"doc":"doc on l_v1"},{"name":"l_v2","type":["null","long"],"default":null,"doc":"value doc on l_v1"},{"name":"c","type":"long"}]}],"default":null},{"name":"after","type":["null","row"],"default":null}]}

actual:
{"type":"record","name":"envelope","doc":"comment on view sink_source_comments_view","fields":[{"name":"before","type":["null",{"type":"record","name":"row","fields":[{"name":"l_k","type":"string"},{"name":"l_v1","type":["null","string"],"default":null},{"name":"l_v2","type":["null","long"],"default":null},{"name":"c","type":"long"}]}],"default":null},{"name":"after","type":["null","row"],"default":null}]}
     |
  10 |
  11 | $ schema-registry-verify schema-type=avro subject=sink-sink-comments1-value
     | ^
+++ !!! Error Report

Maybe I'm holding it wrong?

@def- def- force-pushed the support-avro-comments branch from e761c56 to 2464117 Compare October 6, 2023 22:45
@moulimukherjee
Contributor Author

moulimukherjee commented Oct 6, 2023

@def- I had run into a similar issue locally because the "doc" field wasn't getting serialized properly for columns/fields; I fixed that in this PR. I'm running mzcompose to check this out (it's taking a while).

Btw, is this running for a previous version? What scenario should I run this with?

@moulimukherjee
Contributor Author

@def- Oh I think it's the envelope DEBEZIUM. I could reproduce it in testdrive. Thanks for catching this! Will put up a fix.

@moulimukherjee
Contributor Author

@def- It should be fixed now.

@moulimukherjee moulimukherjee requested a review from def- October 7, 2023 03:17
@moulimukherjee moulimukherjee changed the title from "WIP: adapter,interchange,parser: Support avro sink comments" to "adapter,interchange,parser: Support avro sink comments" Oct 7, 2023
Contributor

@def- def- left a comment

Thanks for waiting for my tests. I postponed the Python-avro-based check until we reach a conclusion on how we want to test it.


$ schema-registry-verify schema-type=avro subject=sink-sink-comments2-value
{"type":"record","name":"envelope","doc":"comment on view sink_source_comments_view","fields":[{"name":"before","type":["null",{"type":"record","name":"row","fields":[{"name":"l_k","type":"string"},{"name":"l_v1","type":["null","string"],"default":null,"doc":"doc on l_v1"},{"name":"l_v2","type":["null","long"],"default":null,"doc":"value doc on l_v1"},{"name":"c","type":"long"}]}],"default":null},{"name":"after","type":["null","row"],"default":null}]}
{"type":"record","name":"envelope","fields":[{"name":"before","type":["null",{"type":"record","name":"row","doc":"comment on view sink_source_comments_view","fields":[{"name":"l_k","type":"string"},{"name":"l_v1","type":["null","string"],"default":null,"doc":"doc on l_v1"},{"name":"l_v2","type":["null","long"],"default":null,"doc":"value doc on l_v2"},{"name":"c","type":"long"}]}],"default":null},{"name":"after","type":["null","row"],"default":null}]}
Contributor

Is it expected to have the comment on the row instead of on the envelope?

@umanwizard
Contributor

umanwizard commented Oct 9, 2023 via email

Mouli Mukherjee and others added 4 commits October 9, 2023 21:20
Addressing review comments

updated error message

Moar test
platform-checks: Explicit sink comment check (currently fails)
platform-checks: Added comments to Identifiers check
parallel-workload: Enabled COMMENT ON
testdrive: Added failure case on NULL and escaping
sqlsmith: will come separately in MaterializeInc/sqlsmith#3
Contributor

@benesch benesch left a comment

I only had time to quickly skim, but looks really solid. Basically no substantive comments. 👌🏽

def-
def- previously requested changes Oct 10, 2023
@@ -69,6 +69,7 @@
"persist_streaming_snapshot_and_fetch_enabled": "true",
"enable_unified_clusters": "true",
"enable_jemalloc_profiling": "true",
"enable_comment": "true",
Contributor

should also add "enable_sink_doc_on_option": "true" here so we can use it easily in testing.

Contributor Author

done!

@moulimukherjee
Contributor Author

All the comments are addressed, enabling auto-merge.

@moulimukherjee moulimukherjee enabled auto-merge (squash) October 10, 2023 17:13
@moulimukherjee moulimukherjee dismissed def-’s stale review October 10, 2023 18:14

Addressed the feedback, enabled flag for tests

@moulimukherjee moulimukherjee merged commit b414021 into MaterializeInc:main Oct 10, 2023
@moulimukherjee moulimukherjee deleted the support-avro-comments branch October 10, 2023 18:14