
changefeedccl: support a CSV format for changefeeds #79853

Merged (1 commit) on Apr 27, 2022

Conversation

@sherman-grewal (Contributor) commented Apr 12, 2022

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds, with the intention of supporting use cases
similar to EXPORT. Only works with initial_scan='only',
and does not work with diff/resolved options.

@sherman-grewal sherman-grewal requested a review from a team as a code owner April 12, 2022 18:38
@sherman-grewal sherman-grewal requested review from stevendanna and removed request for a team April 12, 2022 18:38

@HonoreDB (Contributor) left a comment

Copy/pasting from Slack for posterity: my thinking here is that resolved and diff don't make any sense as options for initial_scan=only anyway--we won't emit resolved timestamps, and the "before" field in initial_scan mode is meaningless. So disallow them for initial_scan=only, which in turn means we don't have to figure out how to encode them in CSV right now. Then later, if we need to support those options for CSV, we can figure out an encoding scheme.
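For posterity, here's a minimal sketch of what that validation rule could look like (hypothetical function and option names; the PR's actual checks live in the changefeed statement code):

package csvopts

import (
	"errors"
	"fmt"
)

// validateCSVOptions sketches the rule described above: format=csv is
// only usable with initial_scan='only', and the diff/resolved options
// are rejected in that mode.
func validateCSVOptions(opts map[string]string) error {
	if opts["format"] != "csv" {
		return nil // other formats are unaffected by this rule
	}
	if opts["initial_scan"] != "only" {
		return errors.New("format=csv is only supported with initial_scan='only'")
	}
	for _, o := range []string{"diff", "resolved"} {
		if _, ok := opts[o]; ok {
			return fmt.Errorf("option %s is not supported with format=csv", o)
		}
	}
	return nil
}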



pkg/ccl/changefeedccl/encoder.go, line 373 at r1 (raw file):

}

type csvEncoder struct {

Format number 3 (ish) strikes me as the right moment to break out different encoders into their own files.


pkg/ccl/changefeedccl/encoder.go, line 409 at r1 (raw file):

	include := make(map[descpb.ColumnID]struct{}, len(family.ColumnIDs))
	var yes struct{}
	for _, colID := range family.ColumnIDs {

Could take advantage of the fact that this is restricted to initial_scan=only to cache this map--we shouldn't be seeing multiple versions of a table. (There might be 2? Not 3.)


pkg/ccl/changefeedccl/encoder.go, line 413 at r1 (raw file):

	}

	var csvRow []string

Is there a way to do this without building this slice? Does the csv writer support something like writer.WriteValue(tree.AsString(datum.Datum)) and then writer.EndOfRow() after we're done?


pkg/ccl/changefeedccl/encoder.go, line 439 at r1 (raw file):

	_ context.Context, _ string, resolved hlc.Timestamp,
) ([]byte, error) {
	return nil, nil

Probably better to error here.
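Something along these lines, perhaps (a sketch against the quoted method; the exact wording of the error is up to you):

func (e *csvEncoder) EncodeResolvedTimestamp(
	_ context.Context, _ string, _ hlc.Timestamp,
) ([]byte, error) {
	// Resolved timestamps are rejected for format=csv, so fail loudly
	// instead of silently returning (nil, nil).
	return nil, errors.New("EncodeResolvedTimestamp is not supported with the CSV encoder")
}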


pkg/ccl/changefeedccl/changefeed_test.go, line 5868 at r1 (raw file):

				sqlDB.Exec(t, "INSERT INTO foo VALUES (4, 'Doug'), (5, 'Elaine'), (6, 'Fred')")

				expectedMessages := []string{

So we're not expected to emit a header message, I take it? It'd be a little tricky but we could.

@miretskiy (Contributor) left a comment

This is very nice. A few nits/questions...

Resolved review threads: pkg/ccl/changefeedccl/changefeed_stmt.go, pkg/ccl/changefeedccl/sink_webhook.go, pkg/ccl/changefeedccl/encoder.go
@sherman-grewal sherman-grewal force-pushed the cdc/csv-format branch 3 times, most recently from 1d800a3 to ee566a1 Compare April 14, 2022 15:40
@sherman-grewal (Contributor, Author) left a comment


pkg/ccl/changefeedccl/encoder.go, line 373 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Format number 3 (ish) strikes me as the right moment to break out different encoders into their own files.

Yeah, I totally agree - I added new files for the three encoders. Thanks for the suggestion 😄


pkg/ccl/changefeedccl/encoder.go, line 409 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Could take advantage of the fact that this is restricted to initial_scan=only to cache this map--we shouldn't be seeing multiple versions of a table. (There might be 2? Not 3.)

Yeah I added a mechanism to cache the map (which is then rebuilt if we see a new version number). If we see another version, does it mean that we won't see rows with the previous version number? If we can still see both version numbers, I was considering storing the cache for each version we see (which is at most 2). Let me know what you think.


pkg/ccl/changefeedccl/encoder.go, line 413 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Is there a way to do this without building this slice? Does the csv writer support something like writer.WriteValue(tree.AsString(datum.Datum)) and then writer.EndOfRow() after we're done?

As mentioned on Slack, there is no way to build this without a slice if we want to use the CSV writer package. I added a TODO to keep an eye on the performance of CSV encoding for large rows; if we incur a significant performance loss, we can look into building our own CSV writer that writes values directly into the buffer rather than collecting them in a slice and then writing the whole slice out.
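For reference, a sketch of why the slice is unavoidable, assuming the writer mirrors Go's standard encoding/csv API: the only record-level call is Write([]string), with no per-value alternative.

package main

import (
	"encoding/csv"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)
	// encoding/csv exposes Write(record []string) and WriteAll only;
	// there is no WriteValue/EndOfRow pair, so each row must be
	// materialized as a string slice before it can be emitted.
	if err := w.Write([]string{"1", "Alice"}); err != nil {
		panic(err)
	}
	w.Flush()
	if err := w.Error(); err != nil {
		panic(err)
	}
}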


pkg/ccl/changefeedccl/encoder.go, line 439 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Probably better to error here.

Done - Thanks for the suggestion!


pkg/ccl/changefeedccl/changefeed_test.go, line 5868 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

So we're not expected to emit a header message, I take it? It'd be a little tricky but we could.

As mentioned on Slack, we decided not to emit a header for a few reasons:

  • The EXPORT statement does not do this
  • We need to establish the product requirements for the behaviour of header messages in file-based sinks (e.g. do we re-emit the header message for each new file we create? or is it done once?)
  • We need to establish the behaviour of re-emitting the header message if the changefeed was paused and then unpaused

I think that once we iron out these issues, we can provide an option to include a header message.

@HonoreDB (Contributor) left a comment


pkg/ccl/changefeedccl/encoder.go, line 409 at r1 (raw file):

Previously, sherman-grewal (Sherman) wrote…

Yeah I added a mechanism to cache the map (which is then rebuilt if we see a new version number). If we see another version, does it mean that we won't see rows with the previous version number? If we can still see both version numbers, I was considering storing the cache for each version we see (which is at most 2). Let me know what you think.

I'd just store the cache for each version. If it's possible to see multiple versions it's possible to see them out of order.
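A simplified sketch of that shape (stand-in types; the real code would key off the descriptor's ID and version):

package sketch

// includeCache memoizes the included-column set per (table, version),
// so rows carrying either version, in any order, hit the cache.
type versionKey struct {
	tableID int64 // stand-in for descpb.ID
	version int64 // stand-in for descpb.DescriptorVersion
}

type includeCache map[versionKey]map[int64]struct{} // stand-in for descpb.ColumnID sets

func (c includeCache) get(k versionKey, build func() map[int64]struct{}) map[int64]struct{} {
	if cols, ok := c[k]; ok {
		return cols
	}
	cols := build() // populate from the column family, as in the original code
	c[k] = cols
	return cols
}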


pkg/ccl/changefeedccl/encoder_avro.go, line 131 at r2 (raw file):

// Get the raw SQL-formatted string for a table name
// and apply full_table_name and avro_schema_prefix options
func (e *confluentAvroEncoder) rawTableName(

Oh man did I really not delete this function when I was doing the topic refactor? Whoops. This should just be a call to TopicNamer.Name but I can do that in a separate PR.

@sherman-grewal (Contributor, Author) left a comment


pkg/ccl/changefeedccl/encoder.go, line 409 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

I'd just store the cache for each version. If it's possible to see multiple versions it's possible to see them out of order.

Done


pkg/ccl/changefeedccl/encoder_avro.go, line 131 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Oh man did I really not delete this function when I was doing the topic refactor? Whoops. This should just be a call to TopicNamer.Name but I can do that in a separate PR.

ahh ok sounds good, thanks for catching that!!

@sherman-grewal sherman-grewal requested a review from HonoreDB April 14, 2022 16:50
@sherman-grewal sherman-grewal force-pushed the cdc/csv-format branch 2 times, most recently from cf8232d to 45e5c77 Compare April 14, 2022 21:11
@sherman-grewal (Contributor, Author) left a comment


pkg/ccl/changefeedccl/encoder_csv.go, line 91 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

At the very least, we should memoize this array; store csvRow array in the encoder struct, initialize it with csvRow: make([]string, 0, len(row.tableDesc.PublicColumns())) and then zero it out here:

csvRow = csvRow[:0]

I completely agree; I implemented this memoization in my latest commit. Thanks again for the suggestions 👍
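In the shape suggested above, the memoization looks roughly like this (simplified; not the PR's exact code):

package sketch

// csvEncoder keeps a scratch slice that is allocated once, sized to the
// table's column count, and re-sliced to length zero for each row so the
// backing array is reused instead of reallocated per row.
type csvEncoder struct {
	csvRow []string
}

func newCSVEncoder(numColumns int) *csvEncoder {
	return &csvEncoder{csvRow: make([]string, 0, numColumns)}
}

func (e *csvEncoder) rowStrings(datums []string) []string {
	e.csvRow = e.csvRow[:0] // reset length, keep capacity
	e.csvRow = append(e.csvRow, datums...)
	return e.csvRow
}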


pkg/ccl/changefeedccl/sink_webhook.go, line 137 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

looks like we are not updating some of the fields in the result -- such as mvcc timestamp.

ahh yes - thanks for catching this! 🙌

@miretskiy (Contributor) commented

@sherman-grewal are you waiting for additional comments?

@miretskiy (Contributor) commented

I'm giving it a 👍; let's see if @HonoreDB has more comments before merge, though.

@HonoreDB (Contributor) left a comment

LGTM after one comment is addressed; I'll pre-emptively approve it too.



pkg/ccl/changefeedccl/changefeed_test.go line 5870 at r4 (raw file):

		`initial scan only with csv`:     `CREATE CHANGEFEED FOR foo WITH initial_scan_only, format = csv`,
		`initial backfill only with csv`: `CREATE CHANGEFEED FOR foo WITH initial_scan = 'only', format = csv`,
	}

Mind also adding a test with one changefeed on two different tables? It'll be one encoder for both tables so there can be bugs there.
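Something like the following, maybe (a rough sketch; the helper names follow this file's conventions, and the expected CSV payloads are illustrative, not exact):

func TestChangefeedCSVMultipleTables(t *testing.T) {
	testFn := func(t *testing.T, db *gosql.DB, f cdctest.TestFeedFactory) {
		sqlDB := sqlutils.MakeSQLRunner(db)
		sqlDB.Exec(t, `CREATE TABLE foo (a INT PRIMARY KEY, b STRING)`)
		sqlDB.Exec(t, `CREATE TABLE bar (a INT PRIMARY KEY, b STRING)`)
		sqlDB.Exec(t, `INSERT INTO foo VALUES (1, 'Alice')`)
		sqlDB.Exec(t, `INSERT INTO bar VALUES (2, 'Bob')`)

		// A single changefeed over both tables shares one encoder.
		csvFeed := feed(t, f, `CREATE CHANGEFEED FOR foo, bar WITH format = csv, initial_scan = 'only'`)
		defer closeFeed(t, csvFeed)

		assertPayloads(t, csvFeed, []string{
			`foo: [1]->1,Alice`,
			`bar: [2]->2,Bob`,
		})
	}
	// Wire testFn into the harness the same way the other changefeed
	// tests in this file do.
	_ = testFn
}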

@sherman-grewal (Contributor, Author) left a comment

Done, thanks for the reviews!


@sherman-grewal (Contributor, Author) commented

bors r+

craig bot commented Apr 26, 2022

Build failed (retrying...):

craig bot commented Apr 26, 2022

Build failed (retrying...):

craig bot commented Apr 27, 2022

Build succeeded:

samiskin commented Jun 2, 2022

blathers backport 22.1

blathers-crl bot commented Jun 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f7ca7b1 to blathers/backport-release-22.1-79853: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
