[QST] - Propose directly invoking librdkafka from libcudf #2473
Comments
What is the format of Kafka messages? Is it CSV/JSON, etc.? i.e., can we just use them as a data source for the existing file readers?
Kafka messages are arbitrary bytes. It's common to see delimited text, JSON, and Avro in Kafka. So when users fill their Kafka topics with those formats, yes, we can use cuDF's existing readers. The problem is that existing Kafka Python APIs pull individual messages, lowering throughput to the point that the GPU is underutilized. Kafka can supply messages faster and more efficiently via librdkafka, but we lack a means of getting them into cuDF, hence @jdye64's proposal.
@harrism Sure, I'll start getting something together now.
@harrism I made an initial WIP commit that does nothing more than add the librdkafka library to the CMake project. Just wanted to get your thoughts before continuing any further with the implementation. Thanks!
IIUC, the ask here is to add support to our cuIO CSV/JSON/... readers to accept Kafka messages as input. Is that correct? I don't know how we get from Kafka messages to CSV or JSON, so it's difficult to comment on feasibility. But I would imagine this requires an extra mem2mem pass and can't be easily integrated into the cuIO readers.
You are correct about the extra copy. Given the nature of Kafka and how it operates, I really don't see a way around that. However, this would be a big improvement over what we currently have, which is 3 different memory-to-memory copies that slow down the entire pipeline, so even if it's 1 extra copy it's still better than what we have. Also, this PR didn't have any "code"; I mostly just wanted to make sure I was following the standards everyone is looking for when it comes to adding another external project to CMake.
So Kafka messages have a header and payload? And all we need to do is assemble the payloads into a buffer containing the CSV/JSON/... data?
Correct. There is a C struct that defines a message, rd_kafka_message_t to be exact, which defines the length of the payload, the offset in the Kafka topic, etc. However, all we need to do is take that payload and parse it via CSV/JSON/... or whichever readers we choose to give the Kafka functionality. The payload is held in a local C buffer, so it would be a copy from that memory buffer to our desired memory location. That memory buffer would be local to the cuDF process with this approach as well.
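
For illustration, here is a minimal sketch (not cuDF code) of how a batch of messages could be pulled with the librdkafka C API and assembled into a single host buffer for an existing reader to parse. The function name, batch/timeout parameters, and the newline delimiter are assumptions made for this example.

```cpp
// Hedged sketch: consume a batch of Kafka messages with the librdkafka C API
// and concatenate their payloads into one host buffer that an existing
// CSV/JSON reader could then parse. Assumes a consumer `rk` that has already
// been created with rd_kafka_new(RD_KAFKA_CONSUMER, ...) and subscribed to a topic.
#include <librdkafka/rdkafka.h>
#include <string>

std::string read_kafka_batch(rd_kafka_t* rk, int batch_size, int timeout_ms) {
  std::string buffer;  // payloads assembled back-to-back; the delimiter depends on the format
  for (int i = 0; i < batch_size; ++i) {
    rd_kafka_message_t* msg = rd_kafka_consumer_poll(rk, timeout_ms);
    if (msg == nullptr) break;                 // no message arrived within the timeout
    if (msg->err == RD_KAFKA_RESP_ERR_NO_ERROR) {
      buffer.append(static_cast<char const*>(msg->payload), msg->len);
      buffer.push_back('\n');                  // assuming newline-delimited text/JSON records
    }
    rd_kafka_message_destroy(msg);             // release the message back to librdkafka
  }
  return buffer;  // the single extra mem2mem copy discussed above
}
```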
Let's tackle this in 0.11. @jdye64, you have already taken on the CMake changes needed to incorporate librdkafka. I learned this week from @mjsamoht that cuIO already has a base class that can be implemented for different data sources (as I understand it). Perhaps this can be extended to read from Kafka payloads?
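
For context, here is a purely hypothetical sketch of what a Kafka-backed data source might look like if such a base class were extended. The `datasource` interface shown (size()/host_read()) is illustrative only, not the exact cuIO API.

```cpp
// Hypothetical sketch: a data source that serves Kafka payloads already
// assembled into host memory (e.g. by the batching helper shown earlier).
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

class datasource {  // stand-in for the cuIO base class mentioned above
 public:
  virtual ~datasource() = default;
  virtual size_t size() const = 0;
  virtual size_t host_read(size_t offset, size_t length, uint8_t* dst) = 0;
};

class kafka_datasource : public datasource {
 public:
  // `payloads` would be filled from the consumed Kafka messages.
  explicit kafka_datasource(std::string payloads) : buffer_(std::move(payloads)) {}

  size_t size() const override { return buffer_.size(); }

  size_t host_read(size_t offset, size_t length, uint8_t* dst) override {
    size_t const n = std::min(length, buffer_.size() - offset);
    std::memcpy(dst, buffer_.data() + offset, n);  // copy into the reader's staging buffer
    return n;
  }

 private:
  std::string buffer_;  // assembled Kafka payloads held in host memory
};
```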
I have since opened a feature request at #3405, which has an active PR. I think we can close this QST and instead use that feature issue and PR.
What is your question?
There is currently a substantial bottleneck when reading messages from Kafka -> librdkafka (the C Kafka library) -> python-confluent-kafka (the Python Kafka library that wraps librdkafka) -> Python -> and then to cuDF.
In a nutshell, the slowest part is the creation of the PyObjects that are returned from the librdkafka C library back to Python. In the case of cuDF, this bottleneck could be completely avoided, as we simply receive those PyObjects and turn around to immediately place them on the GPU as a cuDF DataFrame.
I wanted to discuss the possibility of simply overloading the cudf.from_csv(), from_json(), etc. functions to allow libcudf to read directly from librdkafka and circumvent the Python library completely. This would greatly speed up our streaming strategy and allow larger datasets to be placed directly on the GPU without the creation of PyObjects.
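
To make the proposed flow concrete, here is a hedged sketch of the kind of libcudf-side entry point being described. `read_kafka_batch` is the helper from the earlier sketch, while `parsed_table` and `parse_csv_host_buffer` are placeholders, not real libcudf names.

```cpp
// Hypothetical end-to-end flow: libcudf consumes the Kafka batch itself (no
// per-message PyObjects) and hands the assembled host buffer to an existing
// reader. Names that are not librdkafka's are placeholders for this sketch.
#include <librdkafka/rdkafka.h>
#include <cstddef>
#include <memory>
#include <string>

struct parsed_table {};  // placeholder for the reader's device-backed output

// From the earlier sketch: drains up to `batch_size` messages into one buffer.
std::string read_kafka_batch(rd_kafka_t* rk, int batch_size, int timeout_ms);

// Stand-in for whichever cuIO reader ends up parsing the host buffer.
std::unique_ptr<parsed_table> parse_csv_host_buffer(char const* data, std::size_t size);

std::unique_ptr<parsed_table> read_csv_from_kafka(rd_kafka_t* consumer,
                                                  int batch_size,
                                                  int timeout_ms) {
  std::string host_buffer = read_kafka_batch(consumer, batch_size, timeout_ms);
  // The only extra mem2mem pass is this host buffer; Python never touches
  // the individual messages.
  return parse_csv_host_buffer(host_buffer.data(), host_buffer.size());
}
```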
I have attached 2 rough whiteboard screenshots (sorry for the terrible light glare). Curious what others' thoughts are on this? The librdkafka library is quite robust and should be simple to integrate into libcudf.