Add Kinesis KCL Source #434
49d6126 to 67bb52a
I'm not entirely happy with the checkpoint part so far. Since we're leaving the My approach has been: let the user check whether the So, the docs will encourage using a supervision strategy to handle these failures. WDYT? |
Good point. Given the potential for failed checkpoints, would it make sense to implement the "v2" IRecordProcessor interface? I suppose the additional sequence info passed to the |
Thanks for the link! I'll definitely have a look at the v2 and I'll migrate to it, but unfortunately it seems that the race conditions can still occur. I guess it's not a big deal, we'll just need to capture potential failures. I'll update the PR with these changes later today. |
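Editor's note: the idea of capturing checkpoint failures, rather than letting them go unnoticed, can be sketched in plain Java. This is a minimal illustration and not the PR's code. `Checkpointer`, `tryCheckpoint`, and the bounded retry are hypothetical names and choices introduced here; the real checkpoint call and its exception types are not shown in this thread.

```java
import java.util.Optional;

final class SafeCheckpointSketch {

    /** Hypothetical stand-in for whatever performs the real checkpoint call. */
    @FunctionalInterface
    interface Checkpointer {
        void checkpoint(String sequenceNumber) throws Exception;
    }

    /**
     * Attempts the checkpoint up to maxAttempts times.
     * Returns an empty Optional on success, otherwise the last failure,
     * so the caller (for example a stream supervision strategy) decides what to do.
     */
    static Optional<Exception> tryCheckpoint(Checkpointer cp, String sequenceNumber, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                cp.checkpoint(sequenceNumber);
                return Optional.empty(); // success
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        return Optional.of(last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated checkpointer that fails on the first two attempts.
        Checkpointer flaky = seq -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated checkpoint failure");
            }
        };
        Optional<Exception> failure = tryCheckpoint(flaky, "seq-42", 5);
        System.out.println(failure.isPresent()
                ? "failed: " + failure.get().getMessage()
                : "checkpointed after " + calls[0] + " attempts");
    }
}
```

Returning the failure as a value keeps the decision (retry, resume, or stop the stream) with the caller, which is in the spirit of the supervision-strategy suggestion above.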
0757128 to e0648c9
Changes applied. @jaymell I'd love to get your feedback for the current code. Do you miss something from the current interface? I mean, parameters and |
@aserrallerios So far it looks great. I haven't yet actually started working on integrating this with our current application -- should have more meaningful feedback once I do so, hopefully this week or next. Thanks again for sharing the PR. |
Hi, Thanks. |
The Source based on the Kinesis SDK (not KCL) is already merged, so you can use it right now. The publisher Flow is on its way to being merged. This PR still needs to be finished (tests and more reviews from maintainers are missing), and I cannot work on it right now (I'm on vacation), so I wouldn't expect it to be merged before mid-September or October. |
@aserrallerios How do I checkpoint with the Source based on the Kinesis SDK? I saw your newly added class KinesisWorkerCheckpointSettings(1000, 10.seconds). Does this mean the existing Source does not support checkpointing? Checkpointing is critical: otherwise, when an app restarts, it will start processing from records up to 24 hours old, given a setting of ShardIteratorType.TRIM_HORIZON. |
The Kinesis SDK doesn't offer any means of checkpointing record sequences, as far as I know. You need to handle checkpoints manually, for example, using an external database. |
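Editor's note: a minimal sketch of what manual checkpointing could look like, assuming the last processed sequence number is stored per shard. `CheckpointStore`, `StartPosition`, and `resumeFrom` are hypothetical names, and the in-memory map merely stands in for an external database. `TRIM_HORIZON` and `AFTER_SEQUENCE_NUMBER` are Kinesis shard iterator types, used here as plain strings to keep the example free of SDK dependencies.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class ManualCheckpointSketch {

    /** In-memory stand-in for an external database keyed by shard id (hypothetical). */
    static final class CheckpointStore {
        private final Map<String, String> lastSequenceByShard = new ConcurrentHashMap<>();

        void save(String shardId, String sequenceNumber) {
            lastSequenceByShard.put(shardId, sequenceNumber);
        }

        Optional<String> load(String shardId) {
            return Optional.ofNullable(lastSequenceByShard.get(shardId));
        }
    }

    /** Where to start reading a shard: iterator type plus an optional sequence number. */
    static final class StartPosition {
        final String iteratorType;
        final Optional<String> sequenceNumber;

        StartPosition(String iteratorType, Optional<String> sequenceNumber) {
            this.iteratorType = iteratorType;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** Resume after the stored checkpoint if one exists, else from the oldest record. */
    static StartPosition resumeFrom(CheckpointStore store, String shardId) {
        return store.load(shardId)
                .map(seq -> new StartPosition("AFTER_SEQUENCE_NUMBER", Optional.of(seq)))
                .orElseGet(() -> new StartPosition("TRIM_HORIZON", Optional.empty()));
    }

    public static void main(String[] args) {
        CheckpointStore store = new CheckpointStore();
        String shardId = "shardId-000000000000";

        // First start: nothing stored yet, so reading begins at the oldest record.
        System.out.println(resumeFrom(store, shardId).iteratorType);

        // After processing a record, persist its sequence number (made-up value).
        store.save(shardId, "1234567890");

        // Restart: reading resumes right after the stored sequence number.
        StartPosition restart = resumeFrom(store, shardId);
        System.out.println(restart.iteratorType + " " + restart.sequenceNumber.get());
    }
}
```

With a durable store in place of the map, a restarted application would avoid reprocessing the whole retention window.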
@aserrallerios Got it, thanks. I am trying your early implementation. The following line throws a NullPointerException, as the worker is still null after the constructor is invoked.
|
Thanks @cwei-bgl! I'll work on it this weekend. |
I've fixed the NPE, improved some code, added javadsl methods and added the test outline (pending implementation). TODO:
About the implementation itself, what do you guys prefer:
The current implementation is the first option, but it's easy to change. The implications are more "philosophical" if you ask me. Following the inversion of control principle, the second option would be the way to go, but forcing the user to do stuff with the materialized value seems very error-prone and unfriendly to new users. WDYT? |
An argument against using a materialized value of type So the current implementation ( |
What's the status on this PR? Seems to be ready to review at the very least? |
This can be reviewed, but I plan to rebase it onto this branch: It's easier to finish documentation and stuff over that branch, so I've been working locally in that direction. I'll try to keep this PR updated too, in case early reviewers want to help (much appreciated). Another option is to use that PR for both publisher and worker. I'll ask there; let's see what the community thinks. |
ee85afd to ecc79fe
Ready to be reviewed. Both implementation and documentation are "complete". Just some hard-to-test tests are missing. @etspaceman: I think I'll keep this separate from the Kinesis publisher PR; I'd rather keep things simple and separate. The community seems to be having a tough time reviewing all the pending stuff :( |
351f599 to 9e1cac1
I guess the build is failing due to Cassandra tests, unrelated to this PR. Can anyone confirm? |
I think it could be the instability tracked by #527 |
I think now it failed for Kinesis as well. |
Can you please point me to where the problem is? I can only see the C* test error:
|
There was an issue with the Cassandra tests, which is now fixed in master. Please rebase onto master and push the rebased commits to this PR branch. This will kick off CI validation with the Cassandra problem fixed. |
9e1cac1 to bd82289
Rebased to fix conflicts. |
I'm afraid we may not be able to merge this because of the Amazon License the client pulls in. |
I'm not an expert on licenses and stuff, so keep me posted, because the PR can be reworked into something more generic that doesn't include the KCL dependency. Obviously it'd require extra work from the final user and the result wouldn't be as satisfactory... |
@johanandren Can you indicate what parts of the license are concerning for this merge? |
"The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc." is scary to me - IANAL, but I'm pretty sure that means the derivative work (which in this case would be Alpakka) cannot be Apache-licensed anymore. |
Some interesting discussion on the topic in https://issues.apache.org/jira/browse/LEGAL-198 (especially https://issues.apache.org/jira/browse/LEGAL-198?focusedCommentId=15618136#comment-15618136). They decided it's OK for them since they only distribute that module as source, but that frankly seems questionable, and doesn't obviously apply to us since we also distribute binaries. |
Can we leave that dependency as "provided" and tell the final user to add it to his/her project? |
@johanandren @raboof - Any thoughts on @aserrallerios's mention here? |
I'm not sure we dare to do that, but will check with "those who know better"™ |
Sorry for the delay, but we have now checked with legal, and it is not possible to include this in Alpakka because of the dependency license. Alternative options are:
Sorry about this, but not much else we can do. |
What do you guys think about making a generic Otherwise, can anyone think of a better suited repository for this PR? |
Have a look at localstack's license. Can we do something similar? |
What a bummer. Is this dead in the water, or is it still an option to leave the KCL dependency as "provided"? For publishing the connector yourself, it seems like it would be a matter of forking this entire Kinesis connector, right? I'm not sure what the easiest path is here. Honestly, I don't understand why code that uses the KCL library would be considered derivative. You're not actually creating a different (derivative) version of KCL itself. |
Seems that the connector should reside in its own repo as @johanandren suggests. |
I'll try to publish it for myself until a better place is found: https://github.com/aserrallerios/kcl-akka-stream Are you ok with it? Any other place you think it could fit? |
Go for it 👍 |
Thank you @aserrallerios for putting effort into this Akka streams connector and offering it in your own repo, with #777 we'll link to it from the external connectors listing. |
So, I know this is an old PR, but it's worth noting that the KCL has updated its license to Apache 2.0. Might be worth reconsidering this module as an official Alpakka module: https://github.com/awslabs/amazon-kinesis-client/releases/tag/v1.10.0 |
Thank you for letting us know. With the Apache 2 license, we would be able to make it part of Alpakka. /cc @aserrallerios |
That's awesome! Should I work on updating this very PR then? |
Prefer a new PR, please. |
Kinesis Source using the KCL library. Tests are pending; early reviews are very welcome.
The motivation behind this is that it will allow Alpakka users to automatically handle sharding among clients of Kinesis streams, simplify checkpointing of record sequence numbers using DynamoDB, and potentially use the KPL library for higher throughput (with local dynamic buffering and batching).