Allow to configure prefix for internal Kafka fields #14224
TIMESTAMP_MILLIS))
.buildOrThrow();
String prefix = kafkaConfig.getInternalFieldPrefix();
internalFields = Stream.of(
We have this to preserve the order of columns, correct?
In the earlier code, was the order preserved mostly as an implementation detail of ImmutableMap, where insertion order equals iteration order?
I don't know about ordering. I guess it is not relevant here.
This code is tricky.
It once looked like an enum, and it was refactored this way to use a dynamic value from TypeManager.
I'd say that order here does not matter. It's just a holder.
IIRC order matters for column handlers, and they are used within cursor.
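To make the ordering question above concrete, here is a minimal sketch of a holder that builds prefixed internal field names with a stream and keeps them in insertion order. It is not the connector's code: the class name `InternalFieldNames`, the method `build`, and the reduced list of suffixes are assumptions made for illustration, and a plain `LinkedHashMap` stands in for Guava's `ImmutableMap`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

public class InternalFieldNames
{
    // Hypothetical subset of internal field suffixes, taken from the names mentioned in this PR.
    private static final String[] SUFFIXES = {"partition_id", "message_corrupt", "key"};

    // Maps each suffix to its prefixed column name.
    // LinkedHashMap iterates in insertion order, so the declaration order is kept.
    public static Map<String, String> build(String prefix)
    {
        Map<String, String> fields = new LinkedHashMap<>();
        Stream.of(SUFFIXES).forEach(suffix -> fields.put(suffix, prefix + suffix));
        return fields;
    }

    public static void main(String[] args)
    {
        // Default prefix keeps the existing names: [_partition_id, _message_corrupt, _key]
        System.out.println(build("_").values());
        // Custom prefix changes every internal column name consistently.
        System.out.println(build("__kafka_message_metadata_").values());
    }
}
```

If the holder is only ever looked up by key, the ordering guarantee is incidental; it matters only where the collection is iterated to produce column handles.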
I tried to understand what is going on in this PR. Let me summarise it, and please correct me if I am wrong or missed something. When we process Kafka topics, we add a number of additional internal columns and fill them with data coming from Kafka (RecordSet), for example: `case PARTITION_OFFSET_FIELD -> longValueProvider(message.offset());`
By default, in the current implementation these columns have hardcoded names such as `_partition_id` and `_message_corrupt`.
So a Kafka topic may itself have fields with the same names, and in that case we get a conflict during processing (two columns with the same name, for example `_key`). The reason for this PR is to make it possible to tune internal column names with a custom prefix like `XX_my_prefix_XX_key`. If my description is correct, this PR will not resolve the problem, but just postpone it until users have a more unusual column name that conflicts with our internal column. But maybe I missed something; please correct me if I am wrong.
@vlad-lyutenko Exactly correct understanding.
True. This change has some benefits: it's backward compatible, so people who don't have colliding field names don't need to change their existing queries to work with the newer version. The drawback is that this can't anticipate a collision ahead of time. It does, however, allow administrators to use a unique prefix unlikely to ever exist, e.g. `__kafka_message_metadata_`. Is there some other solution you can think of? One idea I was thinking of was to expose all internal columns as a
The probability in O() notation is the same; however, it's a more complex implementation.
I think in this case, we will lose backward compatibility for people who already have some queries,
This I agree with - see #14224 (comment)
What if we could generate the names of these meta-columns based on the table structure? Like if we have column
If these internal columns are used by end users (and according to the docs it's possible to query them), I think it's a no-go, as clients most probably rely on concrete names.
I think the current solution is the best we can do for now while also not breaking user queries. An alternative might be to expose these columns using prefixes which are invalid for the target system - in Hive that's
LGTM % remaining comments. Please let me know when addressed @ssheikin
The current exception comes from the connector.
@hashhar @vlad-lyutenko @Praveen2112 All comments are addressed. Please take a look one more time.
Ok, thx, now I see it was from
I am OK with the changes.
LGTM % comments.
- switch expressions
- unused parameters
- simplify loops
- inline effectively final variables
- use early return
@wendigo please approve
Thanks, will merge once CI is done.
Description
When we process Kafka topics, we add a number of additional internal columns and fill them with data coming from Kafka (RecordSet), for example:
`case PARTITION_OFFSET_FIELD -> longValueProvider(message.offset());`
By default, in the current implementation these columns have hardcoded names such as:
`_partition_id`
`_message_corrupt`
etc.
A Kafka topic may itself have fields with the same names, in which case a conflict arises during processing (two columns with the same name, for example `_key`).
The reason for this PR is to make it possible to tune internal column names with a custom prefix, as in `XX_my_prefix_XX_key`, so that such a conflict is rarer than with the current default prefix `_` (as in `_key`).
This PR will not resolve the problem, but just postpone it until users have a more unusual column name that conflicts with our internal column.
This change is backward compatible: people who don't have colliding field names don't need to change their existing queries to work with the newer version.
The drawback is that this can't anticipate a collision ahead of time. It does, however, allow administrators to use a unique prefix unlikely to ever exist, e.g. `__kafka_message_metadata_`.
Thank you, @vlad-lyutenko and @hashhar for the description :)
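The collision described above can be illustrated with a small, self-contained sketch. It is only an approximation under stated assumptions: the class `InternalFieldConflictCheck`, the method `conflicts`, and the reduced suffix list are hypothetical and do not reproduce the connector's actual validation, and name matching here is a plain case-sensitive string comparison.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class InternalFieldConflictCheck
{
    // Hypothetical subset of internal field suffixes, taken from the names mentioned in this PR.
    private static final List<String> SUFFIXES = List.of("partition_id", "message_corrupt", "key");

    // Returns the internal column names that clash with columns defined by the topic itself.
    public static Set<String> conflicts(String prefix, Set<String> topicColumns)
    {
        Set<String> result = new TreeSet<>();
        for (String suffix : SUFFIXES) {
            String internalName = prefix + suffix;
            if (topicColumns.contains(internalName)) {
                result.add(internalName);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // A topic that happens to define its own "_key" field.
        Set<String> topicColumns = Set.of("_key", "payload");

        // With the default prefix the internal "_key" column clashes: prints [_key]
        System.out.println(conflicts("_", topicColumns));

        // With an unusual prefix there is no clash for this topic: prints []
        System.out.println(conflicts("__kafka_message_metadata_", topicColumns));
    }
}
```

As the description notes, a custom prefix only makes a clash less likely; a topic that defines a column named `__kafka_message_metadata_key` would still conflict.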
Non-technical explanation
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(*) Release notes are required, with the following suggested text: