
[WIP] Incorporate librdkafka into cudf CMake #2519

Closed
wants to merge 2 commits into from

Conversation

jdye64
Contributor

@jdye64 jdye64 commented Aug 9, 2019

There is no code in this PR; it exists just to validate with the larger group that this is indeed the proper way we want to do this.

Closes #2473

@harrism This is just a WIP PR that simply adds the librdkafka library to the current cudf build. I was hoping you could validate that what I am doing here is OK. If it is, then I will continue with the implementation details.
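As context for reviewers, the build-from-source approach under discussion looks roughly like the ExternalProject sketch below. The pinned tag, install prefix, and options shown are illustrative assumptions, not the PR's actual diff.

```cmake
# Sketch only: build librdkafka as an external project inside the cudf
# CMake build. Tag, paths, and options are hypothetical.
include(ExternalProject)

ExternalProject_Add(librdkafka
  GIT_REPOSITORY https://github.com/edenhill/librdkafka.git
  GIT_TAG        v1.1.0   # pin a specific release between cudf releases
  PREFIX         ${CMAKE_BINARY_DIR}/librdkafka
  CMAKE_ARGS     -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/librdkafka/install
                 -DRDKAFKA_BUILD_STATIC=ON)

# Downstream targets then point at the pinned install, not a system package.
set(RDKAFKA_ROOT ${CMAKE_BINARY_DIR}/librdkafka/install)
include_directories(${RDKAFKA_ROOT}/include)
link_directories(${RDKAFKA_ROOT}/lib)
```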

… in this commit but rather just to validate with the larger group that this is indeed the proper way we want to do this
@jdye64 jdye64 requested a review from a team as a code owner August 9, 2019 17:25
@jdye64 jdye64 requested a review from a team as a code owner August 9, 2019 17:41
@codecov

codecov bot commented Aug 9, 2019

Codecov Report

Merging #2519 into branch-0.10 will decrease coverage by 0.36%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.10    #2519      +/-   ##
===============================================
- Coverage        83.36%   82.99%   -0.37%     
===============================================
  Files               58       58              
  Lines             8577     8735     +158     
===============================================
+ Hits              7150     7250     +100     
- Misses            1427     1485      +58
| Impacted Files | Coverage Δ |
| --- | --- |
| python/cudf/cudf/utils/cudautils.py | 51.52% <0%> (+3.73%) ⬆️ |

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b6e91f...547dd63. Read the comment docs.

@harrism harrism changed the title [WIP] Incorporated librdkafka into the cudf CMake project. There is no code… [WIP] Incorporate librdkafka into cudf CMake Aug 11, 2019
@harrism harrism changed the base branch from branch-0.9 to branch-0.10 August 11, 2019 22:24
@harrism
Member

harrism commented Aug 11, 2019

Moving to 0.10 since this is WIP and 0.9 is already in burn down.

@harrism
Member

harrism commented Aug 11, 2019

The CMake looks reasonable. @kkraus14 agree?

I think you should add the code that uses librdkafka to this PR, not a separate PR.

@harrism harrism added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Aug 11, 2019
@jdye64
Contributor Author

jdye64 commented Aug 11, 2019

I will add that I did not link zlib or lz4 yet, which it would probably be best to also add, but I just wanted a baseline first.

@kkraus14
Collaborator

Is there a reason we're building librdkafka from source instead of just depending on an install via conda or other means? I would rather treat it similarly to how we treat ZLIB in the cmake for example and then depend on https://anaconda.org/conda-forge/librdkafka in the conda environments / recipes.
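The alternative described here would look roughly like the sketch below: locate a librdkafka that conda (or the system) already provides, the way the build treats ZLIB. Variable names and hint paths are illustrative.

```cmake
# Sketch only: depend on an already-installed librdkafka
# (e.g. conda install -c conda-forge librdkafka).
find_path(RDKAFKA_INCLUDE_DIR librdkafka/rdkafkacpp.h
  HINTS $ENV{CONDA_PREFIX}/include)
find_library(RDKAFKA_LIBRARY rdkafka++
  HINTS $ENV{CONDA_PREFIX}/lib)

if(NOT RDKAFKA_INCLUDE_DIR OR NOT RDKAFKA_LIBRARY)
  message(FATAL_ERROR
    "librdkafka not found; install it via conda-forge or your package manager")
endif()
```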

@jdye64
Contributor Author

jdye64 commented Aug 12, 2019

@kkraus14 I debated both. I don't really care either way, but here is why I was leaning toward a CMake build from source. It's really not that different.

  1. It's a small and quick build, so the added build time is minimal.
  2. It lets us use a very specific version rather than whatever a package manager installs, so we control which version we pull from Git between releases, or as needed to test new features, without relying on the Anaconda packages, which are out of our control. I see lots of other projects struggle with this given the rapid change between Kafka releases and version requirements.
  3. It saves the end user from performing another conda install, which means less documentation and a simpler installation process.
  4. We were already doing this with Apache Arrow.
  5. If we needed to make librdkafka-specific performance enhancements, we could pull from a specific branch or possibly even fork in the future.

I'm fine with either way; I just wanted to point out why I chose this route.

@kkraus14
Collaborator

@kkraus14 I debated both. I don't really care either way, but here is why I was leaning toward a CMake build from source. It's really not that different.

  1. It's a small and quick build, so the added build time is minimal.
  2. It lets us use a very specific version rather than whatever a package manager installs, so we control which version we pull from Git between releases, or as needed to test new features, without relying on the Anaconda packages, which are out of our control. I see lots of other projects struggle with this given the rapid change between Kafka releases and version requirements.
  3. It saves the end user from performing another conda install, which means less documentation and a simpler installation process.

I'm fine with either way; I just wanted to point out why I chose this route.

But what if a user installs confluent-kafka-python for use elsewhere? That would install a separate librdkafka with a potentially different version, and then it becomes a race as to which symbols are loaded first, no?

@jdye64
Contributor Author

jdye64 commented Aug 12, 2019

I opted to use CMake's NO_DEFAULT_PATH option, which should prevent that when locating the librdkafka library and headers.
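For reference, the NO_DEFAULT_PATH pattern restricts find_path/find_library to the hint directories listed explicitly, so a second librdkafka installed elsewhere (for example by confluent-kafka-python or conda) is never picked up at link time. A rough sketch, with illustrative paths:

```cmake
# Sketch only: restrict the search to the in-build install of librdkafka.
find_path(RDKAFKA_INCLUDE_DIR
  NAMES librdkafka/rdkafkacpp.h
  HINTS ${CMAKE_BINARY_DIR}/librdkafka/install/include
  NO_DEFAULT_PATH)   # ignore CMAKE_PREFIX_PATH, system dirs, etc.

find_library(RDKAFKA_LIBRARY
  NAMES rdkafka++
  HINTS ${CMAKE_BINARY_DIR}/librdkafka/install/lib
  NO_DEFAULT_PATH)
```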

cb4228a#diff-e06d4dfe6ac984f1aed235fa6161fcc3R49

@jdye64
Contributor Author

jdye64 commented Aug 27, 2019

I wanted to share some performance metrics. Before diving in head first, I wanted to prove that this implementation would indeed be an order of magnitude faster than directly using confluent-kafka-python, to ensure it's worth the trouble.

I have a dataset of around 23,000,000 haproxy log lines (in JSON format) in a Kafka broker with a single partition, to reduce variables. I implemented a Jupyter cell with a Python function that consumes messages from Kafka in batches and then passes those messages to the cudf.read_json() method to create a DataFrame. Likewise, I created another Jupyter cell that invokes my revised cudf.read_json() method, which talks to Kafka directly and consumes the messages straight into the DataFrame. To be clear, the runtimes below do include the time to make the DataFrame. This is important because the slowdown before was in the PyObject creation on receiving messages from librdkafka.

| Avg. Message Size | Batch Size | Python Runtime (sec) | C++ Runtime (sec) | C++ Speedup |
| --- | --- | --- | --- | --- |
| 20kb | 10,000 | 8.36 | 0.59 | 14x |
| 20kb | 100,000 | 8.69 | 0.68 | 13x |
| 20kb | 200,000 | 8.96 | 0.76 | 12x |
| 20kb | 300,000 | 9.35 | 0.84 | 12x |
  • librdkafka has a maximum batch size of 999,999, hence that is as high as you can go. It is standard in streaming to max out around 300K per batch.
  • It was interesting to note that increasing the batch size does not slow the runtime down by much.

I have included a Jupyter notebook to show my exact implementation for these functions.

Kafka_Benchmarking.ipynb.zip
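For context, the Python-side baseline in the notebook looks roughly like the sketch below. `FakeConsumer` stands in for a confluent-kafka-python `Consumer`, and the stdlib `json` parser stands in for `cudf.read_json()`, purely so the sketch is self-contained and runnable; none of these names come from the actual notebook.

```python
import json

class FakeConsumer:
    """In-memory stand-in for a confluent-kafka-python Consumer
    (hypothetical; real code would poll a broker)."""
    def __init__(self, payloads):
        self._payloads = list(payloads)

    def consume(self, n):
        """Return up to n message payloads, like Consumer.consume()."""
        batch, self._payloads = self._payloads[:n], self._payloads[n:]
        return batch

def python_path(consumer, batch_size):
    """The baseline being benchmarked: pull a batch of messages into
    Python, join them into one newline-delimited JSON buffer, then parse.
    (The real notebook hands the joined buffer to cudf.read_json().)"""
    msgs = consumer.consume(batch_size)
    buf = b"\n".join(msgs)
    return [json.loads(line) for line in buf.splitlines()]

consumer = FakeConsumer([b'{"status": 200}', b'{"status": 404}'])
records = python_path(consumer, batch_size=2)
assert records == [{"status": 200}, {"status": 404}]
```

The C++ path measured above skips the per-message hop into Python entirely: librdkafka delivers the batch and the reader parses it in place.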

@@ -66,6 +66,7 @@ build/
cpp/build/
cpp/include/cudf/ipc_generated/*.h
cpp/thirdparty/googletest/
cpp/cmake-build-debug
Member


This folder is specific to you. Please use cpp/Debug, which is already .gitignored.

@@ -17,6 +17,7 @@
#include <cudf/cudf.h>
#include <cudf/legacy/table.hpp>
#include <utilities/error_utils.hpp>
#include <librdkafka/rdkafkacpp.h>
Member


Unused include file?

Suggested change
#include <librdkafka/rdkafkacpp.h>

@harrism
Member

harrism commented Aug 28, 2019

I wouldn't want to merge this until the functionality is actually used. Otherwise it's an unnecessary dependency. So I would extend this PR to actually implement the functionality.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

I agree with you. I intended this as a collaboration/discussion PR and a chance to POC some improvements. I believe the best path is to close it once the discussion is complete and open more succinct PRs.

Since the performance results were pleasing, I was trying to work out the best place to put the Kafka consumer logic, and I was hoping others had some feedback on that. As mentioned, I currently place it directly in the read_json and read_csv functions, which is likely not best. Of course it does make sense to use those functions, since most messages consumed will be in CSV or JSON format anyway, so there is no need to rewrite that logic.

One thought was to add the logic to datasource.c/hpp and overload read_csv, read_json, and others to pass in Kafka configurations. In my mind this would behave much like how the datasource reads from either a memory buffer or a file today, except now there would be a Kafka option. Once a batch of messages was read, execution would continue as it does now to create the actual Table and then the DataFrame. Thoughts on that?
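To make that concrete, here is a language-agnostic sketch of the idea, written in Python even though the real datasource lives in libcudf C++. All names here (`Datasource`, `host_read`, `KafkaSource`, `read_json_like`) are hypothetical illustrations, not cudf's actual interface.

```python
import json
from abc import ABC, abstractmethod

class Datasource(ABC):
    """Hypothetical model of a datasource: anything that can hand the
    readers one contiguous buffer of bytes."""
    @abstractmethod
    def host_read(self) -> bytes: ...

class BufferSource(Datasource):
    """Today's case: the buffer already exists in memory (or a file)."""
    def __init__(self, buf: bytes):
        self._buf = buf
    def host_read(self) -> bytes:
        return self._buf

class KafkaSource(Datasource):
    """The proposed addition: consume a batch of messages and expose
    them as one buffer, so readers run unchanged downstream."""
    def __init__(self, consumer, batch_size, delimiter=b"\n"):
        self._consumer = consumer
        self._batch_size = batch_size
        self._delimiter = delimiter
    def host_read(self) -> bytes:
        msgs = self._consumer.consume(self._batch_size)
        return self._delimiter.join(msgs)

def read_json_like(source: Datasource):
    """Stand-in for a reader: it only ever sees a buffer, regardless of
    whether it came from a file, memory, or a Kafka batch."""
    return [json.loads(line) for line in source.host_read().splitlines()]
```

The design point is that read_json/read_csv stay untouched; only the datasource layer learns about Kafka.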

@jrhemstad
Contributor

I'd be curious to see a profile, like snakeviz or nsys, to see where all the time was being spent. This wouldn't be the first time Python was doing something silly and making things slow.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

I did have one but don't have it on hand any longer. The core slowdown was the PyObject creation for each individual message received from Kafka. So with a batch size of 300,000 messages it would make 300,000 + 1 PyObjects, return those to cudf, the user would join them into, say, a comma-delimited string, and then return it right back to libcudf to make the DataFrame. So the hop back to Python really didn't serve a purpose.
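A tiny sketch of that redundant round trip, with hypothetical names: the per-message path materializes one Python object per message only to join them and hand the bytes straight back, so it produces exactly the same buffer the C++ side could have built itself.

```python
def per_message_path(raw_messages):
    """What the thread describes: one PyObject per Kafka message,
    joined in Python, then handed straight back to libcudf."""
    py_objects = [bytes(m) for m in raw_messages]  # N PyObjects allocated
    joined = b",".join(py_objects)                 # user joins them...
    return joined                                  # ...and returns to C++

def direct_path(raw_messages):
    """What the C++ consumer can do instead: build the buffer once,
    never surfacing individual messages to Python."""
    return b",".join(raw_messages)

msgs = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']
assert per_message_path(msgs) == direct_path(msgs)  # identical bytes
```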

@jrhemstad
Contributor

I did have one but don't have it on hand any longer. The core slowdown was the PyObject creation for each individual message received from Kafka. So with a batch size of 300,000 messages it would make 300,000 + 1 PyObjects, return those to cudf, the user would join them into, say, a comma-delimited string, and then return it right back to libcudf to make the DataFrame. So the hop back to Python really didn't serve a purpose.

Ah, okay, makes sense. PyObject creation is the same problem we have with implementing a performant transpose.

Out of curiosity, why does Python need a separate object for each message? Is there not some way to aggregate all the messages into a single object?

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

So that was actually what I tried first. It did speed things up, but the speedup was minimal. I am by no means an expert on PyObjects, but it seemed that the single large concatenated "string" PyObject took almost as long to create as a large number of smaller ones.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

Also, we don't own the Python library confluent-kafka-python, and their community wasn't too excited about having that concatenation logic in their codebase, since it is something of a niche case.

@harrism
Member

harrism commented Aug 29, 2019

As for where the logic should go, you might want to start a conversation with @mjsamoht @j-ieong @OlivierNV @vuule.

@OlivierNV
Contributor

Extending datasource indeed seems like the right place to put this.

@harrism
Member

harrism commented Sep 26, 2019

@jdye64 can you drive this to completion in 0.11?

@jdye64
Contributor Author

jdye64 commented Dec 12, 2019

I'm going to close this as it is taken care of in #3504 now. Thanks for all the feedback here everyone!

@jdye64 jdye64 closed this Dec 12, 2019
@jdye64 jdye64 deleted the embeddedkafka branch December 12, 2019 23:05
Labels
2 - In Progress Currently a work in progress cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
5 participants