Use bytes for record keys instead of String #157
- Have you considered how to handle the places where the key is used in toString?
- Would there be a way to inject the serializer / deserializer and make things generic instead of using only raw bytes?
        return true;
    }

    final String stringKey = new String(key, StandardCharsets.UTF_8);
What if the key cannot be read as a String? This can happen: https://stackoverflow.com/questions/70667113/why-cant-i-decode-any-byte-using-utf-8#comment124924879_70667113
It looks like ignoreKeys should keep a set of byte arrays instead.
Here I assumed new String(key, StandardCharsets.UTF_8) never throws an exception. Instead, it silently accepts malformed text by replacing the irregular byte sequences with a placeholder.
"One commonly used approach in UTF-8 decoders is to replace any malformed UTF-8 sequence by a replacement character (U+FFFD), which looks a bit like an inverted question mark, or a similar symbol."
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Also, I assumed that we support the key denylist only when the target key is a String:
- Support key blacklisting only if the key is the type of String (the filter just tries to instantiate String from whatever the byte array is and do comparison which might always fail when key is not a type of String).
Make key deserializer of KafkaConsumer configurable #61 (comment)
If those assumptions are valid, I think we can just ignore non-UTF8 keys here because they will never match the configured keys stored in the denylist.
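To make the first assumption concrete, here is a minimal sketch of the two decoding behaviours available in the JDK. The class and method names (`Utf8DecodeDemo`, `decodeLenient`, `decodeStrict`, `isValidUtf8`) are hypothetical and exist only for illustration; they are not part of Decaton.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (an assumption for illustration, not Decaton code).
final class Utf8DecodeDemo {
    private Utf8DecodeDemo() {}

    // Lenient decoding: the String constructor never throws on malformed input.
    // Each malformed sequence is replaced with U+FFFD instead.
    static String decodeLenient(byte[] key) {
        return new String(key, StandardCharsets.UTF_8);
    }

    // Strict decoding: a CharsetDecoder configured to REPORT errors
    // throws CharacterCodingException on malformed input.
    static String decodeStrict(byte[] key) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(key))
                .toString();
    }

    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] key) {
        try {
            decodeStrict(key);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With the lenient path, a key that is not valid UTF-8 decodes to a String containing U+FFFD, so it can only match a denylist entry that itself contains that replacement character.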
I see. It seems we might want to support other input for this denylist, maybe longs...
Let's leave the possibility open even if we don't implement it in this PR
Ah not yet. In this sense, this change will make it hard to debug it since printed keys in log or elsewhere won't be human-readable anymore. I'm not sure what's the most preferred way to display it (print in hex or try interpreting as UTF-8). Do you guys have any ideas about that?
One of the solutions I came up with is to inject a serializer into
However, adding a type variable in |
I don't see how that would be a bad thing, TBH. Jokes aside, maybe we need to dissociate the "context key" (String, human-readable key) from the "record key" (bytes from Kafka), and just give a way to convert between them (by default: a StringSerializer / StringDeserializer, but it could be Long serdes + toString).
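One way to picture the "record key to context key" conversion suggested above is a set of pluggable functions from bytes to a readable String. The sketch below uses only the JDK; the holder class `ContextKeys` and its constants are hypothetical names, and the real proposal would presumably build on Kafka serdes instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Decaton.
final class ContextKeys {
    private ContextKeys() {}

    // Default conversion: interpret the record key as UTF-8 text.
    static final Function<byte[], String> UTF8 =
            bytes -> new String(bytes, StandardCharsets.UTF_8);

    // For keys that are 8-byte big-endian longs, which is the layout
    // produced by Kafka's LongSerializer. Throws if fewer than 8 bytes.
    static final Function<byte[], String> LONG =
            bytes -> Long.toString(ByteBuffer.wrap(bytes).getLong());

    // Fallback that stays readable whatever the key encoding is.
    static final Function<byte[], String> HEX = bytes -> {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    };
}
```

The processor would keep `byte[]` as the record key everywhere and apply one of these functions only where a human-readable form is needed, such as logging.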
Thanks for tackling this!
The overall strategy looks good. Here are my thoughts:
- decaton-processor => +1 for changing ProcessingContext's key type to byte[] (which will be considered a breaking change).
  - As @mauhiz mentioned, making things generic would be an option, but it could introduce another type parameter in DecatonProcessor, so the impact on users would be huge.
  - On the other hand, I suppose only a limited number of users refer to ProcessingContext#key in their code, so only changing it to byte[] would be a good compromise.
- decaton-client => I think we can keep the key type as String for DecatonClient.
  - Because DecatonClient is just a default client to produce Decaton tasks in the "standard" format, rather than a general producer to produce tasks in arbitrary key/value serialization formats.
WDYT? @kawamuray
@ocadaruma Agree with your comment.
Before I made a significant refactoring of the class hierarchies, Decaton was holding the task type as a generic parameter in almost all classes in the object hierarchy, while most of those classes referred to the type parameter only to hold a field, and it was really messy. It might be an option to add another type parameter for the key, to the [...] I'm happy to hear if there's a better idea though.
Thank you all for sharing your ideas!
I also agree with this strategy. Let me try to implement the subsequent changes in this way.
Added 33d026e to let Please note that the underlying producer ( I believe that user-impact due to this change is still small, since |
Left a few comments.
Besides, could you fix the points below as well?
- We need to fix here too: https://github.com/line/decaton/blob/master/processor/src/main/java/com/linecorp/decaton/processor/LoggingContext.java#L47
- Let's mention in CONFIG_IGNORE_KEYS's javadoc that the key-blocking feature expects only String keys, while Decaton accepts any type of key.
            key -> new ArrayList<>()).add(record);
    }

    @Override
    public void doAssert() {
        // Checks there's no overlap between two consecutive records' processing time
-       for (Entry<String, List<ProcessedRecord>> entry : records.entrySet()) {
+       for (Entry<ByteBuffer, List<ProcessedRecord>> entry : records.entrySet()) {
Then we should convert the key to a String when generating the assert message, to get a readable result here: https://github.com/line/decaton/pull/157/files#diff-141d497c7aafcef2217814fbe1d2ea75d00a4089212776327360030a935d2278L56
processor/src/main/java/com/linecorp/decaton/processor/runtime/internal/SubPartitioner.java
processor/src/it/java/com/linecorp/decaton/processor/ArbitraryTopicTypeTest.java
    }

    @Test(timeout = 30000)
    public void testPrintableAsciiStringKeyValue() throws Exception {
I guess the String-key based scenario is already tested in other basic tests.
What is the intention of having another test here?
I just thought that there were no test scenarios that produce String-key records with an arbitrary Producer rather than DecatonClient.
But it looks like we already have such scenarios? Let me consider removing this one then.
d79abd2 Replaced with byte key/value test case (or we don't need two tests?)
processor/src/it/java/com/linecorp/decaton/processor/ArbitraryTopicTypeTest.java
@@ -153,17 +154,17 @@ public void testSingleThreadProcessing() throws Exception {
        // Note that this processing semantics is not be considered as Decaton specification which users can rely on.
        // Rather, this is just a expected behavior based on current implementation when we set concurrency to 1.
        ProcessingGuarantee noDuplicates = new ProcessingGuarantee() {
-           private final Map<String, List<TestTask>> produced = new HashMap<>();
-           private final Map<String, List<TestTask>> processed = new HashMap<>();
+           private final Map<ByteBuffer, List<TestTask>> produced = new HashMap<>();
ByteBuffer is mutable, and its equals/hashCode methods depend on position/limit/remaining. Meaning, reading from it will affect findability in the map.
I don't know if this is overkill, but what about defining a new wrapper type that simply delegates equals/hashCode to Arrays.equals / Arrays.hashCode?
It could also cache hashCode (like String does).
And have a toString that just attempts to decode it with UTF-8.
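The suggestion above can be sketched roughly as follows. The class name `BytesKey` is hypothetical (the PR's own wrapper may look different), and the hash is computed eagerly here rather than lazily as String does. The static helper reproduces the ByteBuffer pitfall described in the comment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable wrapper that makes a byte[] usable as a Map/Set key.
final class BytesKey {
    private final byte[] bytes;
    private final int hash; // computed once, since the content never changes

    BytesKey(byte[] bytes) {
        this.bytes = bytes.clone(); // defensive copy keeps the key immutable
        this.hash = Arrays.hashCode(this.bytes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    // Best-effort readable form; malformed UTF-8 shows up as U+FFFD.
    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Demonstrates the pitfall: a relative read advances the buffer's position,
    // which changes hashCode/equals, so the same object is no longer found.
    static boolean byteBufferStillFoundAfterRead() {
        Map<ByteBuffer, String> map = new HashMap<>();
        ByteBuffer key = ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8));
        map.put(key, "value");
        key.get(); // consumes one byte; remaining content is now "bc"
        return map.containsKey(key);
    }
}
```

Two `BytesKey` instances built from equal byte content compare equal and hash identically, regardless of what later happens to the arrays they were created from.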
        // Preceding isEmpty() check is for reducing tiny overhead applied for each contains() by calling
        // Object#hashCode. Since ignoreKeys should be empty for most cases..
        if (!ignoreKeys.isEmpty() && ignoreKeys.contains(record.key())) {
Reading this again, I thought we might want to change ignoreKeys to a Set<ByteBuffer> to avoid the charset decoding step (= encode only once when reading the blacklist config).
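A rough sketch of that idea, assuming the configured keys arrive as a collection of Strings. `IgnoreKeysSketch` and `shouldIgnore` are hypothetical names and this is not the actual BlacklistedKeysFilter implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter sketch (an assumption, not Decaton's real filter).
final class IgnoreKeysSketch {
    // The stored buffers are never read with relative operations,
    // so their position, and therefore their hash, stays stable.
    private final Set<ByteBuffer> ignoreKeys = new HashSet<>();

    // Encode each configured String key exactly once, at configuration time.
    IgnoreKeysSketch(Collection<String> configuredKeys) {
        for (String key : configuredKeys) {
            ignoreKeys.add(ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8)));
        }
    }

    // Per-record check: wraps the raw key without copying or decoding it.
    boolean shouldIgnore(byte[] recordKey) {
        if (recordKey == null || ignoreKeys.isEmpty()) {
            return false;
        }
        return ignoreKeys.contains(ByteBuffer.wrap(recordKey));
    }
}
```

Because the comparison is byte-wise, a record key that is not valid UTF-8 simply never matches, which is consistent with the earlier assumption that the denylist only targets String keys.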
This is finally fixed by 6d3cc6c
processor/src/main/java/com/linecorp/decaton/processor/runtime/internal/TaskRequest.java
testing/src/main/java/com/linecorp/decaton/testing/processor/ProcessedRecord.java
For LoggingContext to print record keys in bytes as human-readable String, this commit introduces a new interface `KeyFormatter` that translates keys from bytes into `String`. This interface has a canonical implementation that reads the byte key as a UTF-8 byte sequence. For other places, this commit just prevents the byte keys from being printed, or prints it as String if it's inside a test case.
To avoid charset decoding step.
Addressed all of the comments above. Notably, we introduced two new classes here. EDIT (2022-06-21) Those were removed #157 (comment)
TaskKey This is a tiny wrapper of record keys in
KeyFormatter This is a functional interface that translates keys from |
Somehow I could only re-request a review of @ocadaruma but @mauhiz could you take a look too? |
processor/src/main/java/com/linecorp/decaton/processor/runtime/internal/TaskKey.java
...sor/src/main/java/com/linecorp/decaton/processor/runtime/internal/BlacklistedKeysFilter.java
processor/src/main/java/com/linecorp/decaton/processor/runtime/internal/SubPartitioner.java
processor/src/main/java/com/linecorp/decaton/processor/formatter/KeyFormatter.java
Based on the comments, I removed/replaced the following new classes.
|
Hmm, seems the CI failed because of unresolvable dependencies (due to CloudFlare incident?) and I have no right to rerun the workflow
|
Left a few trivial comments, but it almost looks good!
processor/src/main/java/com/linecorp/decaton/processor/LoggingContext.java
testing/src/main/java/com/linecorp/decaton/testing/TestUtils.java
processor/src/main/java/com/linecorp/decaton/processor/HashableKey.java
    }

    @Override
    public String toString() {
Can we delegate to a static, reusable method, e.g. for use in LoggingContext?
I'm a bit afraid that adding such a static method, like HashableByteArray.asString(byte[]) (note: nope, we can use it), would give multiple responsibilities to the class. toString can't be used due to a name collision.
It might be better to create a utility class with a static method ByteArrays.toString(byte[])? WDYT (cc. @ocadaruma)
The method may support both null and non-null arrays.
I'm also not sure if we can treat UTF-8 conversion as if it were the canonical way (ByteArrays.toString) to convert byte[] into String... Would just continuing to use String(byte[], Charset) here and there be an option?
Anyway, let me add ByteArrays.toString as a PoC.
Added de1cb5f: com.linecorp.decaton.processor.internal.ByteArrays with a static toString method.
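For readers without the diff at hand, a utility of this shape might look roughly like the sketch below. It is not the code from de1cb5f; the class name `ByteArraysSketch` and the choice to render a null array as the text "null" are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical utility sketch; the actual ByteArrays class in the PR may differ.
final class ByteArraysSketch {
    private ByteArraysSketch() {}

    // Accepts both null and non-null arrays, as requested in the review.
    // Rendering null as "null" mirrors String.valueOf(Object) and is an assumption.
    static String toString(byte[] array) {
        if (array == null) {
            return "null";
        }
        // Lenient UTF-8 decoding: malformed sequences become U+FFFD, never an exception.
        return new String(array, StandardCharsets.UTF_8);
    }
}
```

Keeping the conversion in one static method means logging code and the key wrapper's toString can share the same behaviour, at the cost of implying that UTF-8 is the canonical reading of a key, which is the concern raised in the review.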
- rename class and fields - make it `final` class - move to decaton.processor.runtime.internal
processor/src/main/java/com/linecorp/decaton/processor/runtime/internal/HashableByteArray.java
LGTM. Great progress, thanks!
Left last comment
processor/src/main/java/com/linecorp/decaton/processor/internal/ByteArrays.java
I found one more point to fix....
I hope this is the last. :)
processor/src/main/java/com/linecorp/decaton/processor/internal/HashableByteArray.java
Thanks for the update.
LGTM!
Fixes #126.
Related: #61 (previous attempt to support non-String keys, but abandoned)
Motivation
These days, Decaton has been used to subscribe to arbitrary topics. Such topics may not have String as keys and DecatonTask as values.
Currently, however, Decaton does not fully support topics with non-String keys. One of the major unsupported features is retry queuing, as reported in #126.
This is due to the previous design decision that Decaton should only subscribe to topics whose contents are produced by DecatonClient with String keys.
Solution
To unlock all features of Decaton for non-String keys, this PR updates the key type from String to byte[]. This change was originally proposed by @kawamuray in #61 (comment).
Please note that this PR contains breaking changes, which update the signatures of public interfaces and classes like DecatonClient.
EDIT: Based on the comment, we decided to keep using String for DecatonClient. Only decaton-processor classes and underlying producers (e.g., DecatonTaskProducer) will be affected.
Basically, this PR is made of the two changes below.
- ArbitraryTopicTypeTest to reproduce the problem "Retry fails if a non-ASCII string is used for the Kafka Record Key" (#126)
- String key definition with byte[]. If the hash of a key is needed for Map or Set, wrap the key with ByteBuffer.
TODO
- byte[] keys - toString won't be human-readable: #157 (review)
- DecatonClient, so that users can avoid serializing keys every time they put a task: #157 (review). EDIT: We try to keep using String keys in DecatonClient, as discussed in #157 (review). @ajalab will make a PoC.