Allow to configure prefix for internal Kafka fields #14224

ssheikin · 2022-09-20T22:39:00Z

Description

When we process Kafka topics, we add a number of internal columns and fill them with data coming from Kafka (RecordSet), for example:
case PARTITION_OFFSET_FIELD -> longValueProvider(message.offset());
In the current implementation these columns have hardcoded names such as:
_partition_id, _message_corrupt, etc.
A Kafka topic can itself have fields with the same names, and in that case we can get a conflict during processing (two columns with the same name, for example _key).
The reason for this PR is to make it possible to tune internal column names with a custom prefix, giving names like XX_my_prefix_XX_key, so that such a conflict is rarer than with the current default name _key.
This PR will not resolve the problem, but just postpone it until users have a more unusual column name that still conflicts with one of our internal columns.
This change is backward compatible, so people who don't have colliding field names don't need to change their existing queries to work with the newer version.
The drawback is that this can't anticipate a collision ahead of time. It does, however, allow administrators to use a unique prefix unlikely to ever exist - e.g. __kafka_message_metadata_.
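
To make the naming scheme concrete, here is a minimal sketch in plain Java. It is not the connector's code: the class `InternalColumnNames`, its methods, and the choice of three suffixes are hypothetical and only reuse the column names mentioned in this description.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a class from the Kafka connector.
final class InternalColumnNames
{
    // Suffixes of the three internal columns named in this description.
    private static final List<String> SUFFIXES = List.of("partition_id", "message_corrupt", "key");

    private InternalColumnNames() {}

    // Builds the internal column names for a prefix, keeping the suffix order.
    static List<String> withPrefix(String prefix)
    {
        return SUFFIXES.stream()
                .map(suffix -> prefix + suffix)
                .collect(Collectors.toList());
    }

    // Returns the internal names that also appear among the topic's own field names.
    static List<String> collisions(String prefix, Set<String> topicFields)
    {
        return withPrefix(prefix).stream()
                .filter(topicFields::contains)
                .collect(Collectors.toList());
    }
}
```

With the prefix `_` and a topic that has its own `_key` field, `collisions` reports `_key`. With the prefix `__kafka_message_metadata_` the same topic produces no collision, which is the effect this PR is after.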

Thank you, @vlad-lyutenko and @hashhar for the description :)

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(*) Release notes are required, with the following suggested text:

# Kafka
* Allow configuring the prefix for internal column names using `kafka.internal-column-prefix` catalog configuration property. The default value is `_` to maintain the same behaviour that exists today. ({issue}`14224`)

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaInternalFieldManager.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaConfig.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaFilterManager.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaInternalFieldManager.java

plugin/trino-kafka/src/test/java/io/trino/plugin/kafka/TestKafkaConnectorTest.java

hashhar · 2022-09-21T07:00:04Z

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaInternalFieldManager.java

-                        TIMESTAMP_MILLIS))
-                .buildOrThrow();
+        String prefix = kafkaConfig.getInternalFieldPrefix();
+        internalFields = Stream.of(


we have this to preserve order of columns, correct?

In the earlier code, was the order being preserved mostly an implementation detail of ImmutableMap (insertion order = iteration order)?

Don't know about ordering. I guess it is not relevant here.

This code is tricky.
It once looked like an enum, and was refactored this way to use a dynamic value from TypeManager.

I'd say that order here does not matter. It's just a holder.
IIRC order matters for column handlers, and they are used within cursor.
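
For readers following the ordering question, here is a small sketch of what "insertion order = iteration order" means. It uses `java.util.LinkedHashMap` as a stand-in, on the assumption that it behaves like `ImmutableMap` in this one respect. The class `OrderedFieldHolder` and its method are hypothetical and not part of this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical holder for illustration only; the PR itself uses Guava collections.
final class OrderedFieldHolder
{
    private OrderedFieldHolder() {}

    // Maps each name to its position; iteration follows the order in which names were inserted.
    // A repeated name keeps its first slot in the iteration order but gets its value overwritten.
    static Map<String, Integer> index(Stream<String> names)
    {
        Map<String, Integer> positions = new LinkedHashMap<>();
        names.forEachOrdered(name -> positions.put(name, positions.size()));
        return positions;
    }
}
```

Iterating `index(Stream.of("_partition_id", "_message_corrupt", "_key")).keySet()` yields the names in exactly that order, so a holder built from a `Stream.of(...)` keeps the declaration order even if nothing depends on it.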

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaFilterManager.java

vlad-lyutenko · 2022-09-21T12:28:06Z

I tried to understand what is going on in this PR. Let me try to summarise it, and please correct me if I am wrong or missed something.

When we process Kafka topics, we add a number of internal columns and fill them with data coming from Kafka (RecordSet), for example:

case PARTITION_OFFSET_FIELD -> longValueProvider(message.offset());

In the current implementation these columns have hardcoded names such as:

_partition_id, _message_corrupt, etc.

A Kafka topic can itself have fields with the same names, and in that case we can get a conflict during processing (two columns with the same name, for example _key).

So the reason for this PR is to make it possible to tune internal column names with a custom prefix, giving names like XX_my_prefix_XX_key,
so that such a conflict is rarer than with the current default name _key.

If my description is correct and I understand correctly, this PR will not resolve the problem, but just postpone it until users have a more unusual column name that still conflicts with one of our internal columns.

But maybe I missed something, please correct me if I am wrong.

hashhar · 2022-09-21T12:37:40Z

@vlad-lyutenko Exactly correct understanding.

this PR will not resolve the problem, but just postpone it until users have a more unusual column name that still conflicts with one of our internal columns.

True. This change has some benefits - it's backward compatible, so people who don't have colliding field names don't need to change their existing queries to work with the newer version.

The drawback is that this can't anticipate a collision ahead of time. It does, however, allow administrators to use a unique prefix unlikely to ever exist - e.g. __kafka_message_metadata_.

Is there some other solution you can think of? One idea which I was thinking was to expose all internal columns as a ROW type with multiple fields instead of separate column per field - that way you only have to worry about one collision.

ssheikin · 2022-09-21T13:13:39Z

One idea which I was thinking was to expose all internal columns as a ROW type with multiple fields instead of separate column per field - that way you only have to worry about one collision.

The probability in O() notation is the same; however, it's a more complex implementation.
With 10 columns it's O(10 * n) = O(n).

vlad-lyutenko · 2022-09-21T13:34:58Z

Is there some other solution you can think of? One idea which I was thinking was to expose all internal columns as a ROW type with multiple fields instead of separate column per field - that way you only have to worry about one collision.

I think in this case we will lose backward compatibility for people who already have some queries.
I am thinking more about a solution that gives users a more detailed description of what happened and how to resolve the problem using a custom prefix.
It looks like the current exception is coming from somewhere in core/engine (not from the connector).
So maybe we can add such a check for column collisions inside the connector, with a more detailed description, so that we don't get open issues/tickets from users in the future.

hashhar · 2022-09-21T18:11:09Z

I am thinking more about a solution that gives users a more detailed description of what happened and how to resolve the problem using a custom prefix.

This I agree with - see #14224 (comment)

Praveen2112 · 2022-09-22T06:02:57Z

What if we could generate the names of these meta-columns based on the table structure? For example, if we have a column _timestamp_n, then we could generate the meta-column name as _timestamp_(n+1). It would be backward compatible and would also avoid the conflict if it happens anytime in the future.

ssheikin · 2022-09-22T08:16:28Z

What if we could generate the names of these meta-columns based on the table structure? For example, if we have a column _timestamp_n, then we could generate the meta-column name as _timestamp_(n+1). It would be backward compatible and would also avoid the conflict if it happens anytime in the future.

If these internal columns are used by end users (and according to the docs it's possible to query them), I think it's a no-go, as clients most probably rely on some concrete names.
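
A rough sketch of one possible reading of the generated-name proposal may help show why it was considered a no-go. The class `GeneratedMetaColumnName` and the exact suffix rule are assumptions made for illustration; the proposal itself does not spell them out.

```java
import java.util.Set;

// Hypothetical illustration of the proposal; nothing like this exists in the connector.
final class GeneratedMetaColumnName
{
    private GeneratedMetaColumnName() {}

    // Returns baseName if it is free, otherwise the first free baseName_1, baseName_2, ...
    static String resolve(String baseName, Set<String> tableColumns)
    {
        if (!tableColumns.contains(baseName)) {
            return baseName;
        }
        int n = 1;
        while (tableColumns.contains(baseName + "_" + n)) {
            n++;
        }
        return baseName + "_" + n;
    }
}
```

Under this reading the internal column is `_timestamp` for most topics, but becomes `_timestamp_1` or `_timestamp_2` for topics that already use those names. A query written against `_timestamp` would then read the topic's own field instead of the internal one, which is the concern about clients relying on concrete names.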

hashhar · 2022-09-22T09:10:30Z

I think the current solution is the best we can do for now while also not breaking user queries.

An alternative might be to expose these columns using prefixes which are invalid for the target system - in Hive that's $ but AFAIK in Kafka connector there are no limitations on field names other than for Avro which requires prefix to be [A-Za-z_].

hashhar · 2022-09-22T09:19:13Z

LGTM % remaining comments. Please let me know when addressed @ssheikin

ssheikin · 2022-09-22T13:21:11Z

Looks like current exception is coming somewhere from core/engine (not from connector).
So maybe we can add such check on column collision inside connector, with more detailed description, not to get any open issues/tickets from users in future.

Current exception comes from connector.
I've improved the exception message.

ssheikin · 2022-09-22T13:22:29Z

@hashhar @vlad-lyutenko @Praveen2112 All comments are addressed. Please take a look one more time.

vlad-lyutenko · 2022-09-22T14:06:33Z

Looks like current exception is coming somewhere from core/engine (not from connector).
So maybe we can add such check on column collision inside connector, with more detailed description, not to get any open issues/tickets from users in future.

Current exception comes from connector. I've improved the exception message.

Ok, thx, now I see it was from buildOrThrow of immutable map.
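
As background, Guava's `ImmutableMap.Builder.buildOrThrow()` throws an `IllegalArgumentException` when the same key was added twice, and its message does not mention the Kafka prefix setting. The sketch below shows the kind of explicit check discussed above, written with plain JDK collections. The class `FieldNameConflictCheck` and the message wording are assumptions, not the code or text from this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check for illustration only; the real connector code differs.
final class FieldNameConflictCheck
{
    private FieldNameConflictCheck() {}

    // Combines topic-defined and internal column names, failing with a descriptive message on a clash.
    static Map<String, String> merge(List<String> topicFields, List<String> internalFields)
    {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String name : topicFields) {
            columns.put(name, "topic");
        }
        for (String name : internalFields) {
            // putIfAbsent returns the existing value when the name is already taken.
            if (columns.putIfAbsent(name, "internal") != null) {
                // Message wording is an assumption; it only points the user at the new property.
                throw new IllegalArgumentException(
                        "Internal Kafka column name conflicts with a topic column: " + name
                                + ". Consider changing the kafka.internal-column-prefix configuration property.");
            }
        }
        return columns;
    }
}
```

Calling `merge(List.of("_key", "value"), List.of("_partition_id", "_key"))` fails with a message that names `_key` and the property to change, rather than a generic duplicate-key error.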

vlad-lyutenko · 2022-09-22T14:08:06Z

I am ok with changes

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaRecordSet.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaMetadata.java

docs/src/main/sphinx/connector/kafka.rst

plugin/trino-kafka/src/test/java/io/trino/plugin/kafka/TestInternalFieldConflict.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaMetadata.java

hashhar

LGTM % comments.

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaFilterManager.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaConfig.java

plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/KafkaInternalFieldManager.java

- switch expressions
- unused parameters
- simplify loops
- inline effectively final variables
- use early return

ssheikin · 2022-10-11T15:41:55Z

@wendigo please approve

docs/src/main/sphinx/connector/kafka.rst

hashhar

Thanks, will merge once CI is done.

ssheikin · 2022-10-13T09:26:29Z

Restarting checks

cla-bot bot added the cla-signed label Sep 20, 2022

ssheikin force-pushed the ssheikin/56/oss/kafka_internal_column_prefix branch from ef8ea29 to 0f4785b Compare September 20, 2022 22:50

ssheikin requested review from wendigo, hashhar and vlad-lyutenko September 20, 2022 22:54

hashhar reviewed Sep 21, 2022

hashhar requested a review from Praveen2112 September 21, 2022 07:00

ssheikin commented Sep 21, 2022

ssheikin force-pushed the ssheikin/56/oss/kafka_internal_column_prefix branch from 0f4785b to 7b64f31 Compare September 21, 2022 16:01

hashhar approved these changes Sep 22, 2022

ssheikin force-pushed the ssheikin/56/oss/kafka_internal_column_prefix branch 4 times, most recently from 59ce82c to 47d03b5 Compare September 22, 2022 13:19

ssheikin requested a review from hashhar September 22, 2022 13:21

ssheikin force-pushed the ssheikin/56/oss/kafka_internal_column_prefix branch from 47d03b5 to e6113b4 Compare September 22, 2022 13:46