Add Kafka headers as column #4462

Merged
merged 3 commits into from Sep 10, 2020

@0xE282B0 0xE282B0 commented Jul 15, 2020

This PR adds the internal _headers field to the Kafka connector.
Its type is a map with String keys and arrays of byte[] as values.
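
To illustrate why the value side is an array: a Kafka record may carry the same header key more than once, and a header value may be null (the test data later in this thread adds "foo" twice, once with a null value). The following is a minimal sketch in plain Java, not the connector's code. `HeaderEntry` and `groupByKey` are hypothetical names, and `Map<String, List<byte[]>>` only stands in for the engine's map type.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderGroupingSketch {
    // Hypothetical stand-in for a single record header: non-null key, nullable value.
    static final class HeaderEntry {
        final String key;
        final byte[] value;

        HeaderEntry(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    // Collects every value seen for a key, keeping the order in which headers were added.
    static Map<String, List<byte[]>> groupByKey(List<HeaderEntry> headers) {
        Map<String, List<byte[]>> grouped = new LinkedHashMap<>();
        for (HeaderEntry header : headers) {
            grouped.computeIfAbsent(header.key, ignored -> new ArrayList<>()).add(header.value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<HeaderEntry> headers = Arrays.asList(
                new HeaderEntry("foo", "bar".getBytes(StandardCharsets.UTF_8)),
                new HeaderEntry("foo", null),
                new HeaderEntry("baz", new byte[0]));
        Map<String, List<byte[]>> grouped = groupByKey(headers);
        System.out.println(grouped.keySet() + ", values under foo: " + grouped.get("foo").size());
    }
}
```

A plain `Map<String, byte[]>` would silently drop all but one value per key, which is the reason for the array-valued map in this sketch.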

To make review easier, the changes are split into two commits:

  1. Extract the type definitions from the enum that describes the internal fields to a map that is initialized in the constructor where the TypeManager can be used.
  2. Add the Header column and adapt the KafkaRecordCursor

I would appreciate it if someone could review the code.

@cla-bot cla-bot bot added the cla-signed label Jul 15, 2020
@0xE282B0 0xE282B0 requested review from findepi, aalbu and losipiuk July 16, 2020 19:33
@findepi findepi requested a review from charlesjmorgan July 16, 2020 20:43
Member

@aalbu aalbu left a comment

Thank you for your PR, great addition to the connector.

Member

@findepi findepi left a comment

thanks @0xE282B0 for your PR

thanks @aalbu for the review; I am just skimming at this point

Member

@charlesjmorgan charlesjmorgan left a comment

Minor suggestions, thanks for your work on this :)

@martint martint requested a review from findepi July 29, 2020 06:48
@@ -269,4 +278,53 @@ public void close()
kafkaConsumer.close();
}
}

public static FieldValueProvider headerMapValueProvider(MapType varcharMapType, Headers headers)
Member

I'd assume the empty-headers case is not rare, so we may want to cache the result for it (an empty block).

@dain wdyt?
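
The caching idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the connector's implementation: `HeadersValue`, `EMPTY_HEADERS` and `headersValue` are hypothetical names, and an immutable `Map` stands in for the engine's map block.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EmptyHeadersCacheSketch {
    // Hypothetical stand-in for a per-column value provider.
    interface HeadersValue {
        Map<String, List<byte[]>> get();
    }

    // Built once and shared; sharing is safe because the wrapped map is immutable.
    private static final HeadersValue EMPTY_HEADERS = createEmptyHeadersValue();

    private static HeadersValue createEmptyHeadersValue() {
        Map<String, List<byte[]>> empty = Collections.emptyMap();
        return () -> empty;
    }

    static HeadersValue headersValue(List<Map.Entry<String, byte[]>> headers) {
        if (headers.isEmpty()) {
            // Common case (by assumption): reuse the cached instance, allocate nothing.
            return EMPTY_HEADERS;
        }
        Map<String, List<byte[]>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> header : headers) {
            grouped.computeIfAbsent(header.getKey(), ignored -> new ArrayList<>()).add(header.getValue());
        }
        Map<String, List<byte[]>> result = Collections.unmodifiableMap(grouped);
        return () -> result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> none = new ArrayList<>();
        List<Map.Entry<String, byte[]>> one = new ArrayList<>();
        one.add(new SimpleEntry<String, byte[]>("foo", new byte[] {1}));
        System.out.println("empty instance shared: " + (headersValue(none) == headersValue(none)));
        System.out.println("keys for non-empty record: " + headersValue(one).get().keySet());
    }
}
```

The cached value must never be mutated by callers, which is why the sketch only hands out an immutable map.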

Member

@0xE282B0 can you address this one, unless the assumption that empty headers are common is false?

Author

sure.

Author

Is there an easier way to get an empty MapBlock than creating a MapType and using a MapBlockBuilder?

MapType mapType = new MapType(VarcharType.VARCHAR, new ArrayType(VarbinaryType.VARBINARY),
        MethodHandles.empty(methodType(Boolean.class, Block.class, int.class, long.class)),
        MethodHandles.empty(methodType(Boolean.class, Block.class, int.class, Block.class, int.class)),
        MethodHandles.empty(methodType(long.class, Object.class)),
        MethodHandles.empty(methodType(long.class, Object.class)));
BlockBuilder mapBlockBuilder = new MapBlockBuilder(mapType, null, 0);
mapBlockBuilder.beginBlockEntry();
mapBlockBuilder.closeEntry();
Block emptyMapBlock = mapType.getObject(mapBlockBuilder, 0);

414ddcb#diff-1938e94844afe6ab2ee40936951490d9R300-R308

Member

Not that I am aware of. Looks good to me.

Nit: can you rename createEmptyMapBlockProvider to createEmptyHeadersFieldProvider
and EMPTY_MAP_BLOCK_PROVIDER to EMPTY_HEADERS_FIELD_PROVIDER?

Member

Also, please add a test for reading the _headers column when the headers map is empty.

Author

Naming sounds better - more specific.
The empty-header test is not as straightforward as I wanted it to be, but it works. Since assertQuery does not work on maps and VARBINARY does not work with JSON, I converted the map's values with from_utf8 and then cast the map to JSON to compare it with empty objects.

SELECT cast(transform_values(_headers,(k,v)->transform(v,x->from_utf8(x))) AS JSON)

Recommendations welcome 😉

Member

I think the test could be a bit more readable if the headers topic also defined a field named id. It would make the assertions simpler.
Then, if the messages with empty headers had ids 1 and 2,
the assertion would look like:

assertQuery("SELECT id FROM default." + headersTopic + " WHERE cardinality(_headers) = 0",
        "VALUES (1), (2)");

The other test query could be more readable too if you used id in the WHERE clause.

WDYT?

Author

Unfortunately we don't have an id, but we can identify the messages by the value of the message.

assertQuery("SELECT _message FROM default." + headersTopic + " WHERE cardinality(_headers) = 0",
        "VALUES ('1'), ('2')");

Member

I was thinking of extending the schema with id, but using _message is totally fine too.

record = new ProducerRecord<>(topicName, null, "{}".getBytes(UTF_8));
record.headers()
.add("foo", "bar".getBytes(UTF_8))
.add("foo", null)
Member

Thanks for taking care of null values.
Can a header key be null as well?

Author

No, at least the Kafka Java client throws an exception if you try to create a Header with a null key.
See: https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/header/internals/RecordHeader.java#L32
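
The behavior described above can be illustrated with a stand-in class. This sketch assumes the client rejects a null key with a plain null check in the header constructor. `SketchHeader` is a hypothetical class written for illustration and is not the Kafka `RecordHeader`.

```java
import java.util.Objects;

final class HeaderNullKeySketch {
    // Hypothetical stand-in for a header class that rejects null keys but accepts null values.
    static final class SketchHeader {
        private final String key;
        private final byte[] value;

        SketchHeader(String key, byte[] value) {
            // Assumption: the key is validated eagerly, so a null key never reaches a record.
            this.key = Objects.requireNonNull(key, "header key must not be null");
            this.value = value; // null values are allowed
        }

        String key() {
            return key;
        }

        byte[] value() {
            return value;
        }
    }

    // Returns true when constructing a header with a null key fails.
    static boolean rejectsNullKey() {
        try {
            new SketchHeader(null, new byte[0]);
            return false;
        }
        catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null key rejected: " + rejectsNullKey());
        System.out.println("null value accepted: " + (new SketchHeader("foo", null).value() == null));
    }
}
```

Under that assumption, a reader of the _headers column only has to handle null values, never null keys.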

Member

@losipiuk losipiuk left a comment

@0xE282B0 To me, the split of responsibility between the static enum InternalField and the runtime InternalFieldDescription is very counterintuitive.

Since we are forced to introduce InternalFieldDescription, which is built at runtime, it does not make sense to me to keep a rich InternalField holding a subset of the column information.
Instead, I would suggest moving all the information to InternalFieldDescription and keeping the InternalField enum just as a name marker.

Given some naming changes it may look like this:

public class KafkaInternalFieldManager {
    enum InternalFieldKey {
        PARTITION_ID_FIELD,
        PARTITION_OFFSET_FIELD,
        MESSAGE_CORRUPT_FIELD,
        ...
    }

    public static class InternalField {
        private final InternalFieldKey key;
        private final String columnName;
        private final String comment;
        private final Type type;

        // constructor, getter, setters
    }

    private final Map<InternalFieldKey, InternalField> internalFields;
}

We could even drop InternalFieldKey and just define constants to be used for matching based on columnName.

WDYT?
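
A compilable variant of the outline above, as a rough sketch only. Everything here is an assumption made for illustration: the class name, the `HEADERS_FIELD` key, the comment strings, the use of a `Function<String, String>` as a stand-in for the TypeManager, and a `String` as a stand-in for the engine's Type. The point it shows is that the field map is built in the constructor, where type resolution is available.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

final class InternalFieldManagerSketch {
    // The enum is only a name marker; all column details live in InternalField.
    enum InternalFieldKey { PARTITION_ID_FIELD, PARTITION_OFFSET_FIELD, HEADERS_FIELD }

    static final class InternalField {
        private final InternalFieldKey key;
        private final String columnName;
        private final String comment;
        private final String type; // stand-in for the engine's Type

        InternalField(InternalFieldKey key, String columnName, String comment, String type) {
            this.key = key;
            this.columnName = columnName;
            this.comment = comment;
            this.type = type;
        }

        InternalFieldKey getKey() { return key; }
        String getColumnName() { return columnName; }
        String getComment() { return comment; }
        String getType() { return type; }
    }

    private final Map<InternalFieldKey, InternalField> internalFields;

    // typeLookup is a hypothetical stand-in for resolving a type signature at runtime.
    InternalFieldManagerSketch(Function<String, String> typeLookup) {
        Map<InternalFieldKey, InternalField> fields = new EnumMap<>(InternalFieldKey.class);
        register(fields, InternalFieldKey.PARTITION_ID_FIELD, "_partition_id", "Partition id", typeLookup.apply("bigint"));
        register(fields, InternalFieldKey.PARTITION_OFFSET_FIELD, "_partition_offset", "Offset within the partition", typeLookup.apply("bigint"));
        register(fields, InternalFieldKey.HEADERS_FIELD, "_headers", "Message headers", typeLookup.apply("map(varchar,array(varbinary))"));
        this.internalFields = Collections.unmodifiableMap(fields);
    }

    private static void register(Map<InternalFieldKey, InternalField> fields, InternalFieldKey key, String columnName, String comment, String type) {
        fields.put(key, new InternalField(key, columnName, comment, type));
    }

    InternalField getField(InternalFieldKey key) {
        return internalFields.get(key);
    }

    Map<InternalFieldKey, InternalField> getFields() {
        return internalFields;
    }

    public static void main(String[] args) {
        InternalFieldManagerSketch manager = new InternalFieldManagerSketch(signature -> signature);
        InternalField headers = manager.getField(InternalFieldKey.HEADERS_FIELD);
        System.out.println(headers.getColumnName() + " : " + headers.getType());
    }
}
```

The fields are immutable here, so the sketch has getters only; the map-typed column is the one that cannot be described in a static enum without a runtime type lookup.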

@0xE282B0
Author

Thanks for the review @losipiuk,
I agree that it feels odd to have both the static enum and a class representation that gets its type at runtime. Therefore I'd like to get rid of the enum.
@findepi WDYT?

@0xE282B0 0xE282B0 requested a review from losipiuk August 20, 2020 07:07
Member

@losipiuk losipiuk left a comment

Thanks, looks good. Minor comments.

Member

@losipiuk losipiuk left a comment

Thanks.
One more request for a test and we are good to go :)

@findepi findepi requested a review from losipiuk September 7, 2020 10:52
Sven Pfennig added 3 commits September 9, 2020 08:56
The KafkaInternalFieldManager creates the internalFields map in the
constructor where the TypeManager can be used.

Signed-off-by: Sven Pfennig <[email protected]>
Column definition has been added to KafkaInternalFieldDescription with
map(VARCHAR,array(VARBINARY)) type from TypeManager.
ValueProvider has been added to KafkaRecordSet

Signed-off-by: Sven Pfennig <[email protected]>
@losipiuk losipiuk merged commit d446959 into trinodb:master Sep 10, 2020
@losipiuk
Member

Thanks! I did not notice previously that you already made a change to the test :)

@losipiuk losipiuk mentioned this pull request Sep 10, 2020
9 tasks
@martint martint added this to the 342 milestone Sep 24, 2020