[QST] - Propose directly invoking librdkafka from libcudf #2473
Comments
What is the format of Kafka messages? Is it CSV/JSON, etc.? i.e., can we just use them as a data source for the existing file readers?
Kafka messages are arbitrary bytes. It's common to see delimited text, JSON, and Avro in Kafka. So when users fill their Kafka topics with those formats, yes, we can use cuDF's existing readers. The problem is that existing Kafka Python APIs pull individual messages, lowering throughput to the point that the GPU is underutilized. Kafka can supply messages faster and more efficiently via librdkafka, but we lack a means of getting them into cuDF, hence @jdye64's proposal.
@harrism Sure, I'll start getting something together now.
@harrism I made an initial WIP commit that does nothing more than add the librdkafka library to the CMake project. Just wanted to get your thoughts before continuing any further with the implementation. Thanks!
IIUC, the ask here is to add support to our cuIO CSV/JSON/... readers to accept Kafka messages as input. Is that correct? I don't know how we get from Kafka messages to CSV or JSON, so it's difficult to comment on feasibility. But I would imagine this requires an extra mem2mem pass and can't be easily integrated into the cuIO readers.
You are correct about the extra copy. Given the nature of Kafka and how it operates, I really don't see a way around that. However, this would be a big improvement over what we currently have, which is 3 different memory-to-memory copies that slow down the entire pipeline, so even if it's 1 extra copy it's still better than what we have. Also, this PR didn't have any "code"; I mostly just wanted to make sure I was following the standards everyone is looking for when it comes to adding another external project to CMake.
So Kafka messages have a header and payload? And all we need to do is assemble the payloads into a buffer containing the CSV/JSON/... data?
Correct. There is a C struct that defines a message, rd_kafka_message_t to be exact, which defines the length of the payload, the offset in the Kafka topic, etc. However, all we need to do is take that payload and parse it via CSV/JSON/... or whichever readers we choose to give the Kafka functionality. The payload is held in a local C buffer, so it would be a copy from that memory buffer to our desired memory location. That memory buffer would be local to the cuDF process with this approach as well.
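
For illustration, here is a minimal sketch (not cuDF code) of how a batch of messages could be pulled with the librdkafka C API and assembled into a single host buffer for an existing reader to parse. The function name, batch/timeout parameters, and the newline delimiter are assumptions made for this example.

```cpp
// Hedged sketch: consume a batch of Kafka messages with the librdkafka C API
// and concatenate their payloads into one host buffer that an existing
// CSV/JSON reader could then parse. Assumes a consumer `rk` that has already
// been created with rd_kafka_new(RD_KAFKA_CONSUMER, ...) and subscribed to a topic.
#include <librdkafka/rdkafka.h>
#include <string>

std::string read_kafka_batch(rd_kafka_t* rk, int batch_size, int timeout_ms) {
  std::string buffer;  // payloads assembled back-to-back; the delimiter depends on the format
  for (int i = 0; i < batch_size; ++i) {
    rd_kafka_message_t* msg = rd_kafka_consumer_poll(rk, timeout_ms);
    if (msg == nullptr) break;                 // no message arrived within the timeout
    if (msg->err == RD_KAFKA_RESP_ERR_NO_ERROR) {
      buffer.append(static_cast<char const*>(msg->payload), msg->len);
      buffer.push_back('\n');                  // assuming newline-delimited text/JSON records
    }
    rd_kafka_message_destroy(msg);             // release the message back to librdkafka
  }
  return buffer;  // the single extra mem2mem copy discussed above
}
```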
Let's tackle this in 0.11. @jdye64, you have already taken on the CMake changes needed to incorporate librdkafka. I learned this week from @mjsamoht that cuIO already has a base class that can be implemented for different data sources (as I understand it). Perhaps this can be extended to read from Kafka payloads?
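
For context, here is a purely hypothetical sketch of what a Kafka-backed data source might look like if such a base class were extended. The `datasource` interface shown (size()/host_read()) is illustrative only, not the exact cuIO API.

```cpp
// Hypothetical sketch: a data source that serves Kafka payloads already
// assembled into host memory (e.g. by the batching helper shown earlier).
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

class datasource {  // stand-in for the cuIO base class mentioned above
 public:
  virtual ~datasource() = default;
  virtual size_t size() const = 0;
  virtual size_t host_read(size_t offset, size_t length, uint8_t* dst) = 0;
};

class kafka_datasource : public datasource {
 public:
  // `payloads` would be filled from the consumed Kafka messages.
  explicit kafka_datasource(std::string payloads) : buffer_(std::move(payloads)) {}

  size_t size() const override { return buffer_.size(); }

  size_t host_read(size_t offset, size_t length, uint8_t* dst) override {
    size_t const n = std::min(length, buffer_.size() - offset);
    std::memcpy(dst, buffer_.data() + offset, n);  // copy into the reader's staging buffer
    return n;
  }

 private:
  std::string buffer_;  // assembled Kafka payloads held in host memory
};
```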
I have since opened a feature request at #3405, which has an active PR. I think we can close this QST and instead use that feature issue and PR.
What is your question?
There is currently a substantial bottleneck when reading messages from Kafka -> librdkafka (the C Kafka library) -> python-confluent-kafka (the Python Kafka library that wraps librdkafka) -> Python -> and then to cuDF.
In a nutshell, the slowest part is the creation of the PyObjects that are returned from the librdkafka C library back to Python. In the case of cuDF, this bottleneck could be completely avoided, as we simply receive those PyObjects and turn around to immediately place them on the GPU as a cuDF DataFrame.
I wanted to discuss the possibility of simply overloading the cudf.from_csv(), from_json(), etc. functions to allow libcudf to read directly from librdkafka and circumvent the Python library completely. This would greatly speed up our streaming strategy and allow larger datasets to be placed directly on the GPU without the creation of PyObjects.
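
To make the proposed flow concrete, here is a hedged sketch of the kind of libcudf-side entry point being described. `read_kafka_batch` is the helper from the earlier sketch, while `parsed_table` and `parse_csv_host_buffer` are placeholders, not real libcudf names.

```cpp
// Hypothetical end-to-end flow: libcudf consumes the Kafka batch itself (no
// per-message PyObjects) and hands the assembled host buffer to an existing
// reader. Names that are not librdkafka's are placeholders for this sketch.
#include <librdkafka/rdkafka.h>
#include <cstddef>
#include <memory>
#include <string>

struct parsed_table {};  // placeholder for the reader's device-backed output

// From the earlier sketch: drains up to `batch_size` messages into one buffer.
std::string read_kafka_batch(rd_kafka_t* rk, int batch_size, int timeout_ms);

// Stand-in for whichever cuIO reader ends up parsing the host buffer.
std::unique_ptr<parsed_table> parse_csv_host_buffer(char const* data, std::size_t size);

std::unique_ptr<parsed_table> read_csv_from_kafka(rd_kafka_t* consumer,
                                                  int batch_size,
                                                  int timeout_ms) {
  std::string host_buffer = read_kafka_batch(consumer, batch_size, timeout_ms);
  // The only extra mem2mem pass is this host buffer; Python never touches
  // the individual messages.
  return parse_csv_host_buffer(host_buffer.data(), host_buffer.size());
}
```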
I have attached 2 rough whiteboard screenshots (sorry for the terrible light glare). Curious what others' thoughts are on this? The librdkafka library is quite robust and should be simple to integrate into libcudf.