
[WIP] Incorporate librdkafka into cudf CMake #2519

Closed
wants to merge 2 commits into from

Conversation

jdye64
Contributor

@jdye64 jdye64 commented Aug 9, 2019

There is no code in this PR; it exists just to validate with the larger group that this is indeed the proper way we want to do this.

Closes #2473

@harrism This is just a WIP PR that simply adds the librdkafka library to the current cudf build. I was hoping you could validate that what I am doing here is OK. If it is, then I will continue with the implementation details.
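As context for reviewers, the build-from-source approach under discussion looks roughly like the ExternalProject sketch below. The pinned tag, install prefix, and options shown are illustrative assumptions, not the PR's actual diff.

```cmake
# Sketch only: build librdkafka as an external project inside the cudf
# CMake build. Tag, paths, and options are hypothetical.
include(ExternalProject)

ExternalProject_Add(librdkafka
  GIT_REPOSITORY https://github.com/edenhill/librdkafka.git
  GIT_TAG        v1.1.0   # pin a specific release between cudf releases
  PREFIX         ${CMAKE_BINARY_DIR}/librdkafka
  CMAKE_ARGS     -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/librdkafka/install
                 -DRDKAFKA_BUILD_STATIC=ON)

# Downstream targets then point at the pinned install, not a system package.
set(RDKAFKA_ROOT ${CMAKE_BINARY_DIR}/librdkafka/install)
include_directories(${RDKAFKA_ROOT}/include)
link_directories(${RDKAFKA_ROOT}/lib)
```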

… in this commit but rather just to validate with the larger group that this is indeed the proper way we want to do this
@jdye64 jdye64 requested a review from a team as a code owner August 9, 2019 17:25
@jdye64 jdye64 requested a review from a team as a code owner August 9, 2019 17:41
@codecov

codecov bot commented Aug 9, 2019

Codecov Report

Merging #2519 into branch-0.10 will decrease coverage by 0.36%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.10    #2519      +/-   ##
===============================================
- Coverage        83.36%   82.99%   -0.37%     
===============================================
  Files               58       58              
  Lines             8577     8735     +158     
===============================================
+ Hits              7150     7250     +100     
- Misses            1427     1485      +58
| Impacted Files | Coverage Δ |
| --- | --- |
| python/cudf/cudf/utils/cudautils.py | 51.52% <0%> (+3.73%) ⬆️ |

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b6e91f...547dd63. Read the comment docs.

@harrism harrism changed the title [WIP] Incorporated librdkafka into the cudf CMake project. There is no code… [WIP] Incorporate librdkafka into cudf CMake Aug 11, 2019
@harrism harrism changed the base branch from branch-0.9 to branch-0.10 August 11, 2019 22:24
@harrism
Member

harrism commented Aug 11, 2019

Moving to 0.10 since this is WIP and 0.9 is already in burn down.

@harrism
Member

harrism commented Aug 11, 2019

The CMake looks reasonable. @kkraus14 agree?

I think you should add the code that uses librdkafka to this PR, not a separate PR.

@harrism harrism added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Aug 11, 2019
@jdye64
Contributor Author

jdye64 commented Aug 11, 2019

I will add that I did not link zlib or lz4 yet, which it would probably be best to also add, but I just wanted a baseline first.

@kkraus14
Collaborator

Is there a reason we're building librdkafka from source instead of just depending on an install via conda or other means? I would rather treat it similarly to how we treat ZLIB in the cmake for example and then depend on https://anaconda.org/conda-forge/librdkafka in the conda environments / recipes.
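The alternative described here would look roughly like the sketch below: locate a librdkafka that conda (or the system) already provides, the way the build treats ZLIB. Variable names and hint paths are illustrative.

```cmake
# Sketch only: depend on an already-installed librdkafka
# (e.g. conda install -c conda-forge librdkafka).
find_path(RDKAFKA_INCLUDE_DIR librdkafka/rdkafkacpp.h
  HINTS $ENV{CONDA_PREFIX}/include)
find_library(RDKAFKA_LIBRARY rdkafka++
  HINTS $ENV{CONDA_PREFIX}/lib)

if(NOT RDKAFKA_INCLUDE_DIR OR NOT RDKAFKA_LIBRARY)
  message(FATAL_ERROR
    "librdkafka not found; install it via conda-forge or your package manager")
endif()
```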

@jdye64
Contributor Author

jdye64 commented Aug 12, 2019

@kkraus14 I debated both. I don't really care either way, but here is why I was leaning toward a CMake build from source. It's really not that different.

  1. It's a small and quick build, so the added build time is minimal.
  2. It lets us use a very specific version rather than whatever a package manager installs, so we control which version we pull from Git between releases, or as needed to test new features, without relying on the Anaconda packages, which are out of our control. I see lots of other projects struggle with this given the rapid change between Kafka releases and version requirements.
  3. It saves the end user from performing another conda install, which means less documentation and a simpler installation process.
  4. We were already doing this with Apache Arrow.
  5. If we needed to make librdkafka-specific performance enhancements, we could pull from a specific branch or possibly even fork in the future.

I'm fine with either way; I just wanted to point out why I chose this route.

@kkraus14
Collaborator

@kkraus14 I debated both. I don't really care either way, but here is why I was leaning toward a CMake build from source. It's really not that different.

  1. It's a small and quick build, so the added build time is minimal.
  2. It lets us use a very specific version rather than whatever a package manager installs, so we control which version we pull from Git between releases, or as needed to test new features, without relying on the Anaconda packages, which are out of our control. I see lots of other projects struggle with this given the rapid change between Kafka releases and version requirements.
  3. It saves the end user from performing another conda install, which means less documentation and a simpler installation process.

I'm fine with either way; I just wanted to point out why I chose this route.

But what if a user installs confluent-kafka-python for use elsewhere? That would install a separate librdkafka with a potentially different version, and then it becomes a race as to which symbols are loaded first, no?

@jdye64
Contributor Author

jdye64 commented Aug 12, 2019

I opted to use CMake's NO_DEFAULT_PATH option, which should prevent that when locating the librdkafka library and headers.
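For reference, the NO_DEFAULT_PATH pattern restricts find_path/find_library to the hint directories listed explicitly, so a second librdkafka installed elsewhere (for example by confluent-kafka-python or conda) is never picked up at link time. A rough sketch, with illustrative paths:

```cmake
# Sketch only: restrict the search to the in-build install of librdkafka.
find_path(RDKAFKA_INCLUDE_DIR
  NAMES librdkafka/rdkafkacpp.h
  HINTS ${CMAKE_BINARY_DIR}/librdkafka/install/include
  NO_DEFAULT_PATH)   # ignore CMAKE_PREFIX_PATH, system dirs, etc.

find_library(RDKAFKA_LIBRARY
  NAMES rdkafka++
  HINTS ${CMAKE_BINARY_DIR}/librdkafka/install/lib
  NO_DEFAULT_PATH)
```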

cb4228a#diff-e06d4dfe6ac984f1aed235fa6161fcc3R49

@jdye64
Contributor Author

jdye64 commented Aug 27, 2019

I wanted to share some performance metrics. Before diving in head first, I wanted to prove that this implementation would indeed be an order of magnitude faster than directly using confluent-kafka-python, to ensure it's worth the trouble.

I have a dataset of around 23,000,000 haproxy log lines (in JSON format) in a Kafka broker with a single partition, to reduce variables. I implemented a Jupyter cell with a Python function that consumes messages from Kafka in batches and then passes those messages to the cudf.read_json() method to create a DataFrame. Likewise, I created another Jupyter cell that invokes my revised cudf.read_json() method, which talks to Kafka directly and consumes the messages straight into the DataFrame. To be clear, the runtimes below do include the time to make the DataFrame. This is important because the slowdown before was in the PyObject creation on receiving messages from librdkafka.

| Avg. Message Size | Batch Size | Python Runtime (sec) | C++ Runtime (sec) | C++ Speedup |
| --- | --- | --- | --- | --- |
| 20kb | 10,000 | 8.36 | 0.59 | 14x |
| 20kb | 100,000 | 8.69 | 0.68 | 13x |
| 20kb | 200,000 | 8.96 | 0.76 | 12x |
| 20kb | 300,000 | 9.35 | 0.84 | 12x |
  • librdkafka has a maximum batch size of 999,999, hence that is as high as you can go. It is standard in streaming to max out around 300K per batch.
  • It was interesting to note that increasing the batch size does not slow the runtime down by much.

I have included a Jupyter notebook to show my exact implementation for these functions.

Kafka_Benchmarking.ipynb.zip
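For context, the Python-side baseline in the notebook looks roughly like the sketch below. `FakeConsumer` stands in for a confluent-kafka-python `Consumer`, and the stdlib `json` parser stands in for `cudf.read_json()`, purely so the sketch is self-contained and runnable; none of these names come from the actual notebook.

```python
import json

class FakeConsumer:
    """In-memory stand-in for a confluent-kafka-python Consumer
    (hypothetical; real code would poll a broker)."""
    def __init__(self, payloads):
        self._payloads = list(payloads)

    def consume(self, n):
        """Return up to n message payloads, like Consumer.consume()."""
        batch, self._payloads = self._payloads[:n], self._payloads[n:]
        return batch

def python_path(consumer, batch_size):
    """The baseline being benchmarked: pull a batch of messages into
    Python, join them into one newline-delimited JSON buffer, then parse.
    (The real notebook hands the joined buffer to cudf.read_json().)"""
    msgs = consumer.consume(batch_size)
    buf = b"\n".join(msgs)
    return [json.loads(line) for line in buf.splitlines()]

consumer = FakeConsumer([b'{"status": 200}', b'{"status": 404}'])
records = python_path(consumer, batch_size=2)
assert records == [{"status": 200}, {"status": 404}]
```

The C++ path measured above skips the per-message hop into Python entirely: librdkafka delivers the batch and the reader parses it in place.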

@@ -66,6 +66,7 @@ build/
cpp/build/
cpp/include/cudf/ipc_generated/*.h
cpp/thirdparty/googletest/
cpp/cmake-build-debug
Member


This folder is specific to you. Please use cpp/Debug, which is already .gitignored.

@@ -17,6 +17,7 @@
#include <cudf/cudf.h>
#include <cudf/legacy/table.hpp>
#include <utilities/error_utils.hpp>
#include <librdkafka/rdkafkacpp.h>
Member


Unused include file?

Suggested change
#include <librdkafka/rdkafkacpp.h>

@harrism
Member

harrism commented Aug 28, 2019

I wouldn't want to merge this until the functionality is actually used. Otherwise it's an unnecessary dependency. So I would extend this PR to actually implement the functionality.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

I agree with you. I intended this as a collaboration/discussion PR and a chance to POC some improvements. I believe the best path is to close it once the discussion is complete and open more succinct PRs.

Since the performance results were pleasing, I was trying to work out the best place to put the Kafka consumer logic, and I was hoping others had some feedback on that. As mentioned, I currently place it directly in the read_json and read_csv functions, which is likely not best. Of course it does make sense to use those functions, since most messages consumed will be in CSV or JSON format anyway, so there is no need to rewrite that logic.

One thought was to add the logic to datasource.c/hpp and overload read_csv, read_json, and others to pass in Kafka configurations. In my mind this would behave much like how the datasource reads from either a memory buffer or a file today, except now there would be a Kafka option. Once a batch of messages was read, execution would continue as it does now to create the actual Table and then the DataFrame. Thoughts on that?
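To make that concrete, here is a language-agnostic sketch of the idea, written in Python even though the real datasource lives in libcudf C++. All names here (`Datasource`, `host_read`, `KafkaSource`, `read_json_like`) are hypothetical illustrations, not cudf's actual interface.

```python
import json
from abc import ABC, abstractmethod

class Datasource(ABC):
    """Hypothetical model of a datasource: anything that can hand the
    readers one contiguous buffer of bytes."""
    @abstractmethod
    def host_read(self) -> bytes: ...

class BufferSource(Datasource):
    """Today's case: the buffer already exists in memory (or a file)."""
    def __init__(self, buf: bytes):
        self._buf = buf
    def host_read(self) -> bytes:
        return self._buf

class KafkaSource(Datasource):
    """The proposed addition: consume a batch of messages and expose
    them as one buffer, so readers run unchanged downstream."""
    def __init__(self, consumer, batch_size, delimiter=b"\n"):
        self._consumer = consumer
        self._batch_size = batch_size
        self._delimiter = delimiter
    def host_read(self) -> bytes:
        msgs = self._consumer.consume(self._batch_size)
        return self._delimiter.join(msgs)

def read_json_like(source: Datasource):
    """Stand-in for a reader: it only ever sees a buffer, regardless of
    whether it came from a file, memory, or a Kafka batch."""
    return [json.loads(line) for line in source.host_read().splitlines()]
```

The design point is that read_json/read_csv stay untouched; only the datasource layer learns about Kafka.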

@jrhemstad
Contributor

I'd be curious to see a profile, like snakeviz or nsys, to see where all the time was being spent. This wouldn't be the first time Python was doing something silly and making things slow.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

I did have one but don't have it on hand any longer. The core slowdown was the PyObject creation for each individual message received from Kafka. So with a batch size of 300,000 messages it would make 300,000 + 1 PyObjects, return those to cudf, the user would join them into, say, a comma-delimited string, and then return it right back to libcudf to make the DataFrame. So the hop back to Python really didn't serve a purpose.
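A tiny sketch of that redundant round trip, with hypothetical names: the per-message path materializes one Python object per message only to join them and hand the bytes straight back, so it produces exactly the same buffer the C++ side could have built itself.

```python
def per_message_path(raw_messages):
    """What the thread describes: one PyObject per Kafka message,
    joined in Python, then handed straight back to libcudf."""
    py_objects = [bytes(m) for m in raw_messages]  # N PyObjects allocated
    joined = b",".join(py_objects)                 # user joins them...
    return joined                                  # ...and returns to C++

def direct_path(raw_messages):
    """What the C++ consumer can do instead: build the buffer once,
    never surfacing individual messages to Python."""
    return b",".join(raw_messages)

msgs = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']
assert per_message_path(msgs) == direct_path(msgs)  # identical bytes
```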

@jrhemstad
Contributor

I did have one but don't have it on hand any longer. The core slowdown was the PyObject creation for each individual message received from Kafka. So with a batch size of 300,000 messages it would make 300,000 + 1 PyObjects, return those to cudf, the user would join them into, say, a comma-delimited string, and then return it right back to libcudf to make the DataFrame. So the hop back to Python really didn't serve a purpose.

Ah, okay, makes sense. PyObject creation is the same problem we have with implementing a performant transpose.

Out of curiosity, why does Python need a separate object for each message? Is there not some way to aggregate all the messages into a single object?

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

So that was actually what I tried first. It did speed things up, but the speedup was minimal. I am by no means an expert on PyObjects, but it seemed that the single large concatenated "string" PyObject took almost as long to create as a large number of smaller ones.

@jdye64
Contributor Author

jdye64 commented Aug 28, 2019

Also, we don't own the Python library confluent-kafka-python, and their community wasn't too excited about having that concatenation logic in their codebase, since it is something of a niche case.

@harrism
Member

harrism commented Aug 29, 2019

As for where the logic should go, you might want to start a conversation with @mjsamoht @j-ieong @OlivierNV @vuule.

@OlivierNV
Contributor

Extending datasource indeed seems like the right place to put this.

@harrism
Member

harrism commented Sep 26, 2019

@jdye64 can you drive this to completion in 0.11?

@jdye64
Contributor Author

jdye64 commented Dec 12, 2019

I'm going to close this as it is taken care of in #3504 now. Thanks for all the feedback here everyone!

@jdye64 jdye64 closed this Dec 12, 2019
@jdye64 jdye64 deleted the embeddedkafka branch December 12, 2019 23:05
Labels
2 - In Progress Currently a work in progress cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
5 participants