Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] - Propose directly invoking librdkafka from libcudf #2473

Closed
jdye64 opened this issue Aug 6, 2019 · 11 comments
Closed

[QST] - Propose directly invoking librdkafka from libcudf #2473

jdye64 opened this issue Aug 6, 2019 · 11 comments
Assignees
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code question Further information is requested

Comments

@jdye64
Copy link
Contributor

jdye64 commented Aug 6, 2019

What is your question?

kafka_current
kafka_proposed

There is currently a substantial bottleneck when reading messages from Kafka -> librdkafka (C kafka library) -> python-confluent-kafka (python kafka library that wraps librdkafka) -> Python -> and then to cuDF.

In a nutshell the slowest part is the creation of the PyObjects that are returned from the librdkafka C library back to Python. In the case of cuDF this bottleneck could be completely avoided as we simply receive those PyObjects and turn around to immediately place them on the GPU as a cuDF DataFrame.

I wanted to discuss the possibility of simply overloading the cudf.from_csv(), from_json(), etc functions to allow for libcudf to read directly from librdkafka and circumvent the Python library completely. This would greatly speed up our streaming strategy and allow for larger datasets to be placed directly on the GPU without the creation of PyObjects.

I have attached 2 rough whiteboard screenshots (sorry for the terrible light glare). Curious what others thoughts are on this? The librdkafka library is quite robust and sound be simple to integrate into libcudf.

@jdye64 jdye64 added Needs Triage Need team to review and classify question Further information is requested labels Aug 6, 2019
@harrism
Copy link
Member

harrism commented Aug 6, 2019

What is the format of Kafka? Is it CSV/Json, etc? i.e. can we just use it as a data source for existing file readers?

@randerzander randerzander added the proposal Change current process or code label Aug 7, 2019
@randerzander
Copy link
Contributor

randerzander commented Aug 7, 2019

Kafka messages are arbitrary bytes. It's common to see delimited text, JSON, Avro, in Kafka.

So when users fill their kafka topics with those formats, yes we can use cudf's existing readers.

The problem is that existing Kafka Python APIs pull individual messages, lowering throughput to the point that the GPU is underutilized. Kafka can supply messages faster and more efficiently, via librdkafka, but we lack a means of getting them into cuDF, hence @jdye64's proposal.

@harrism
Copy link
Member

harrism commented Aug 8, 2019

@jdye64 are you comfortable proposing an implementation in a PR? Either way, @mjsamoht and team should chime in.

@harrism harrism added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. labels Aug 8, 2019
@jdye64
Copy link
Contributor Author

jdye64 commented Aug 8, 2019

@harrism sure. I'll start getting something together now

@jdye64
Copy link
Contributor Author

jdye64 commented Aug 9, 2019

@harrism I made an initial WIP commit that does nothing more than add the librdkafka library to the cmake project. Just wanted to get your thoughts before continuing any further with the implementation. Thanks!

@mjsamoht
Copy link
Contributor

mjsamoht commented Aug 9, 2019

cc @j-ieong @OlivierNV

IIUC the ask here is to add support to our cuIO CSV/JSON/... readers to accept Kafka messages as input. Is that correct?

I don't know how we get from Kafka messages to CSV or JSON so it's difficult to comment on feasibility. But I would imagine this requires an extra mem2mem pass and can't be easily integrated into cuIO readers.

@jdye64
Copy link
Contributor Author

jdye64 commented Aug 9, 2019

You are correct about the extra copy. Given the nature of Kafka and how it operates I really don't see a way around that. However this would be a big improvement over what we currently have which is 3 different memory to memory copies that are slowing down the entire pipeline so seems like even if its 1 extra its still better than what we have.

Also this PR didn't have any "code" I more just wanted to make sure I was following the standards everyone is looking for when it comes to adding another external project to cmake

@mjsamoht
Copy link
Contributor

mjsamoht commented Aug 9, 2019

So Kafka messages have a header and payload? And all we need to do is assemble the payloads into a buffer containing the CSV/JSON/... data?

@jdye64
Copy link
Contributor Author

jdye64 commented Aug 10, 2019

Correct. There is a C struct that defines a message rd_message_t to be exact. That defines length of payload, offset in Kafka topic, etc. However all we need to do in take that payload and parse it via CSV/JSON/ .... or whichever readers we choose to allow to have the kafka functionality.

The payload is held in a local C buffer so it would be a copy from that memory buffer to our desired memory location. That memory buffer would be local to the cuDF process with this approach as well.

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 15, 2019
@harrism
Copy link
Member

harrism commented Sep 26, 2019

Let's tackle this in 0.11 @jdye64 you have already taken on the cmake changes needed to incorporate librdkafka. I learned this week from @mjsamoht that cuIO already has the datasource class: https://github.com/rapidsai/cudf/blob/952a5a20edbe8920195beb7effb69af44672ec51/cpp/src/io/utilities/datasource.hpp

I guess the idea is that this already provides a base class that can be implemented for different data sources (as I understand it). Perhaps this can be extended to read from kafka payloads?

@jdye64
Copy link
Contributor Author

jdye64 commented Dec 11, 2019

I have since opened a Feature request at #3405 which has an active PR. I think we can close this QST and instead use that feature issue and PR.

@jdye64 jdye64 closed this as completed Dec 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants