[WIP] Incorporate librdkafka into cudf CMake #2519
Conversation
There is no code in this PR; rather, it is just to validate with the larger group that this is indeed the proper way we want to do this.

Closes #2473

@harrism This is just a WIP PR that simply adds the librdkafka library to the current cudf build. I was hoping you could validate that what I am doing here is okay. If it is then I will continue on with the implementation details.
Codecov Report
@@             Coverage Diff              @@
##           branch-0.10     #2519      +/-   ##
================================================
-  Coverage       83.36%    82.99%    -0.37%
================================================
   Files              58        58
   Lines            8577      8735      +158
================================================
+  Hits             7150      7250      +100
-  Misses           1427      1485       +58
================================================
Continue to review full report at Codecov.
Moving to 0.10 since this is WIP and 0.9 is already in burndown.
The CMake looks reasonable. @kkraus14, agree? I think you should add the code that uses librdkafka to this PR, not a separate PR.
I will add that I have not linked zlib or lz4 yet, which would probably also be good to add, but I just wanted a baseline first.
Is there a reason we're building librdkafka from source instead of just depending on an install via conda or other means? I would rather treat it similarly to how we treat ZLIB in the CMake, for example, and then depend on https://anaconda.org/conda-forge/librdkafka in the conda environments/recipes.
@kkraus14 I debated both. I don't really care either way, but here is why I was thinking of a CMake build from source; there really aren't that many differences between the two.
I'm fine with either way; I just wanted to point out why I chose this route.
But what if a user installs confluent-kafka-python for usage elsewhere, which would then install a separate librdkafka with a potentially different version, and then it becomes a race as to which symbols are loaded first, no?
I opted to use the CMake build-from-source approach.
I wanted to share some performance metrics. Before diving in head first I wanted to prove that this implementation would be an order of magnitude faster than directly using confluent-kafka-python, to make sure it is worth the trouble. I have a dataset of around 23,000,000 haproxy log lines (in JSON format) in a Kafka broker with a single partition, to reduce variables. I implemented a Python function in a Jupyter cell that consumes messages from Kafka in batches of 500,000 and then passes those messages to cudf.read_json() to create a DataFrame. Likewise, I created another Jupyter cell where I invoked my revised cudf.read_json(), which talks to Kafka directly and consumes the messages straight into the DataFrame. To be clear, the runtimes below do include the time to create the DataFrame. This is important because the slowdown before was in the PyObject creation on receiving messages from librdkafka.
I have included a Jupyter notebook to show my exact implementation for these functions.
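For context, the confluent-kafka-python baseline looked roughly like the sketch below. All names (broker address, topic, group id, the read_batch helper) are illustrative assumptions, not taken from the notebook.

```python
# Baseline approach: poll messages one at a time with confluent-kafka-python,
# join them on the Python side, then parse with cudf.read_json().
from io import StringIO

import cudf
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "cudf-benchmark",           # illustrative group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["haproxy-logs"])        # assumed topic name


def read_batch(batch_size=500_000):
    """Consume one batch of JSON lines and build a cudf DataFrame."""
    lines = []
    while len(lines) < batch_size:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # Every message becomes its own PyObject here -- this is the
        # per-message overhead the PR is trying to avoid.
        lines.append(msg.value().decode("utf-8"))
    # One more large PyObject for the joined payload, handed to libcudf.
    return cudf.read_json(StringIO("\n".join(lines)), lines=True)
```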
@@ -66,6 +66,7 @@ build/
cpp/build/
cpp/include/cudf/ipc_generated/*.h
cpp/thirdparty/googletest/
cpp/cmake-build-debug
This folder is specific to you. Please use cpp/Debug, which is already .gitignored.
@@ -17,6 +17,7 @@
#include <cudf/cudf.h>
#include <cudf/legacy/table.hpp>
#include <utilities/error_utils.hpp>
#include <librdkafka/rdkafkacpp.h>
Unused include file?
I wouldn't want to merge this until the functionality is actually used. Otherwise it's an unnecessary dependency. So I would extend this PR to actually implement the functionality.
I agree with you. I intended this to be a collaboration/discussion PR and a chance to POC some improvements, and I believe closing it once the discussion is complete and opening more succinct PRs is best. Since the performance results were pleasing, I was trying to work out the best place to actually put the Kafka consumer logic, and was hoping others had some feedback on that. As mentioned, I currently place it directly in the read_json and read_csv functions, which is likely not best. Of course it does make sense to use those functions, since most messages consumed will be in CSV or JSON format anyway, so there is no need to rewrite that logic. One thought was to add the logic to datasource.cpp/hpp and overload read_csv, read_json, and others to pass in Kafka configurations; a hypothetical sketch of that API follows below. In my mind this would behave much like how the datasource either reads from a memory buffer or a file today, except now there would be a Kafka option. Once the batch of messages was read, execution would continue as it does now to create the actual Table and then the DataFrame. Thoughts on that?
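To make that concrete, here is a purely hypothetical sketch of what such an overloaded reader might look like from Python. None of these Kafka parameters exist in cudf; they are only meant to illustrate the datasource idea.

```python
import cudf

# Hypothetical: read_json consumes a batch straight from Kafka via the
# datasource layer instead of a file path or memory buffer. Every
# Kafka-related parameter below is illustrative, not an existing cudf argument.
df = cudf.read_json(
    kafka_configs={
        "bootstrap.servers": "localhost:9092",
        "group.id": "cudf-consumer",
    },
    topic="haproxy-logs",  # hypothetical: topic to consume from
    batch_size=500_000,    # hypothetical: messages per DataFrame
    lines=True,
)
```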
I'd be curious to see a profile like snakeviz or nsys to see where all the time was being spent. This wouldn't be the first time that Python was doing something silly and making it slow.
I did have one but don't have it on hand any longer. The core slowdown was the PyObject creation for each individual message received from Kafka. So if you had a batch size of 300,000 messages it would make 300,000 + 1 PyObjects, return those to cudf, the user would join them into, say, a comma-delimited string, and then hand that right back to libcudf to make the DataFrame. So the hop back to Python really didn't serve a purpose.
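For anyone wanting to reproduce such a profile, the standard cProfile + snakeviz workflow would look something like this (a sketch, assuming the read_batch helper from the baseline sketch above):

```python
import cProfile

# Profile one consume-and-parse batch and dump the stats to a file.
cProfile.run("read_batch()", "consume.prof")

# Then visualize the hotspots (per-message PyObject creation should dominate):
#   $ snakeviz consume.prof
```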
Ah, okay, makes sense. PyObject creation is the same problem we have with implementing a performant transpose. Out of curiosity, why does Python need a separate object for each message? Is there not some way to aggregate all the messages into a single object?
So that was actually what I tried first. It did speed things up, but the speedup was minimal. I am by no means an expert on PyObjects, but it seemed that the single large concatenated "string" PyObject took almost as long as making a large number of smaller ones.
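That aggregation attempt presumably looked something like the following sketch, using confluent-kafka-python's batched consume(). It is illustrative only (the thread mentions a comma-delimited join; newline-delimited is used here so the lines=True parse is valid), and even with batching, each message is still its own Python object before the final join.

```python
from io import BytesIO

# Reuses `consumer` and cudf from the earlier baseline sketch.
# consume() fetches up to num_messages in one call, but each message is
# still a separate Python object; only the final payload ends up as a
# single bytes PyObject.
msgs = consumer.consume(num_messages=500_000, timeout=5.0)
payload = b"\n".join(m.value() for m in msgs if m.error() is None)
df = cudf.read_json(BytesIO(payload), lines=True)
```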
Also, we don't own the Python library confluent-kafka-python, and their community wasn't too excited about having that concatenation logic in their codebase since it is sort of a niche case.
As for where the logic should go, you might want to start a conversation with @mjsamoht @j-ieong @OlivierNV @vuule.
Extending datasource indeed seems like the right place to put this.
@jdye64 can you drive this to completion in 0.11?
I'm going to close this as it is taken care of in #3504 now. Thanks for all the feedback here everyone!