Reconsider arroyo's interface #9
Comments
Some context that should clarify a few design constraints. This is not meant to push back against a redesign.
This is because the Consumer is not a simple wrapper over a Kafka consumer. There are a few implementations. I will reference some from here and some from python as they are not all fully reimplemented yet.
Can this all be done via composition, by swapping the backend (in-memory vs. Kafka) inside the consumer class? Yes, but that backend would have to implement a common trait, so we would just move the trait somewhere else. Fundamentally, there is more than one consumer implementation (a functional requirement), and they are supposed to be indistinguishable to the logic that uses the consumer.
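To make the trade-off concrete, here is a minimal sketch of the composition approach, using only the standard library. The names `Backend`, `InMemoryBackend`, and `Consumer::run` are hypothetical and are not arroyo's actual API. Note that the trait does not disappear; it moves from the consumer to the backend.

```rust
use std::collections::VecDeque;

/// Hypothetical backend abstraction (not arroyo's real API):
/// anything that can hand out the next payload.
trait Backend {
    type Payload;
    /// Returns the next payload, or `None` when nothing is left.
    fn poll(&mut self) -> Option<Self::Payload>;
}

/// In-memory backend, usable in unit tests without a broker.
struct InMemoryBackend<T> {
    queue: VecDeque<T>,
}

impl<T> InMemoryBackend<T> {
    fn new(items: Vec<T>) -> Self {
        Self { queue: items.into() }
    }
}

impl<T> Backend for InMemoryBackend<T> {
    type Payload = T;

    fn poll(&mut self) -> Option<T> {
        self.queue.pop_front()
    }
}

/// The consumer is a concrete struct that is composed with a backend.
/// A Kafka-backed `Backend` would plug in the same way.
struct Consumer<B: Backend> {
    backend: B,
}

impl<B: Backend> Consumer<B> {
    fn new(backend: B) -> Self {
        Self { backend }
    }

    /// Drains the backend, passing each payload to `handle`.
    /// Returns the number of payloads processed.
    fn run<F: FnMut(B::Payload)>(&mut self, mut handle: F) -> usize {
        let mut processed = 0;
        while let Some(payload) = self.backend.poll() {
            handle(payload);
            processed += 1;
        }
        processed
    }
}
```

The calling logic only talks to `Consumer`, so it cannot tell which backend is underneath, which is the indistinguishability requirement described above.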
In high-level consumers like the synchronized consumer, the message is not a Kafka message but an interval. If we dropped high-level consumers altogether we could avoid this, but there would have to be a replacement for the functionality that we use, and that replacement should not make rebalancing details visible to the ProcessingStrategy.
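As an illustration of why the payload type is left open, here is a small sketch in which the same strategy code handles either raw bytes or an interval. The `Message`, `Interval`, and `Collector` types and their fields are hypothetical and only approximate the idea; they are not taken from arroyo.

```rust
/// Hypothetical message envelope, generic over what it carries.
#[derive(Debug, PartialEq)]
struct Message<T> {
    partition: u32,
    payload: T,
}

/// Hypothetical payload of a high-level consumer: a range of offsets
/// rather than a single Kafka record. The fields are assumptions.
#[derive(Debug, PartialEq)]
struct Interval {
    lo: u64,
    hi: u64,
}

/// Simplified strategy trait, generic over the payload type.
trait ProcessingStrategy<T> {
    fn submit(&mut self, message: Message<T>);
}

/// A strategy that just records what it was given.
struct Collector<T> {
    seen: Vec<Message<T>>,
}

impl<T> ProcessingStrategy<T> for Collector<T> {
    fn submit(&mut self, message: Message<T>) {
        self.seen.push(message);
    }
}
```

A `Collector<Vec<u8>>` and a `Collector<Interval>` share the same implementation; only the type parameter changes.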
Thank you, all that context is helpful. In terms of what arroyo is for, I'm getting the following:
Questions:
Food for thought
rdkafka has a StreamConsumer, which is a high-level consumer implementation. Can we build on top of the StreamConsumer to make our lives easier?
We want to deprecate it for subscriptions. There are other scenarios (like the post process forwarder) where we do not have an alternative as of now, even though it does not use the arroyo implementation but a previous one. We either have to replace the current implementation with the arroyo synchronized consumer or redesign it entirely, and there is no plan for the latter at present.
That is possible with the Rust rdkafka. It is not possible in the original Python arroyo implementation, as the Python confluent library offers no abstract way to simulate a consumer, at least if you want to retain type hints. And you cannot just monkey patch the Kafka implementation, because it is not Python code but C code exposed as a Python class. So I would be OK with doing this in this implementation, though it cannot be done in the original one.
The application developer does not see that happening. They only know about two events. First, the strategy is created with a new state (if it is stateful). Second, the strategy is terminated, and thus they have the responsibility of cleaning up the state, which can include committing offsets, flushing producers, writing to a DB, sending a smoke signal to somebody, etc. The application does not know which partitions have been assigned or revoked. It does not know whether it is seeing consumer termination or rebalancing, whether Kafka simply took partitions away and reassigned them, whether the rebalancing is eager or cooperative, or how, when, and why to reset offsets. So the application only cares about its state: how to initialize it and how to close it in a consistent way.
The user should not plug into rebalancing. That would require knowing all the possible types of rebalancing and interpreting partition revocation and assignment separately. Furthermore, the application would have to deal with consumer termination separately. The application does not need to deal with this complexity.
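The two-event lifecycle described above can be sketched as follows. This is a simplified, hypothetical model: the trait names echo arroyo's `ProcessingStrategy` and factory, but the signatures, the `Event` enum, the `Batcher` strategy, and the `drive` loop are assumptions made for illustration, not the library's real code.

```rust
/// Simplified strategy: it only sees messages and its own termination.
trait ProcessingStrategy {
    fn submit(&mut self, payload: &str);
    /// Called once on termination; returns whatever state was flushed.
    fn close(&mut self) -> Vec<String>;
}

/// Builds a strategy with fresh state.
trait ProcessingStrategyFactory {
    fn create(&self) -> Box<dyn ProcessingStrategy>;
}

/// Example stateful strategy: buffers payloads until it is closed.
struct Batcher {
    buffer: Vec<String>,
}

impl ProcessingStrategy for Batcher {
    fn submit(&mut self, payload: &str) {
        self.buffer.push(payload.to_string());
    }

    fn close(&mut self) -> Vec<String> {
        std::mem::take(&mut self.buffer)
    }
}

struct BatcherFactory;

impl ProcessingStrategyFactory for BatcherFactory {
    fn create(&self) -> Box<dyn ProcessingStrategy> {
        Box::new(Batcher { buffer: Vec::new() })
    }
}

/// Events only the library sees; the strategy never receives these.
enum Event<'a> {
    Assigned,
    Message(&'a str),
    Revoked,
}

/// Library-side loop: translates rebalancing events into
/// "create a strategy" and "close the strategy".
fn drive(
    factory: &dyn ProcessingStrategyFactory,
    events: &[Event<'_>],
) -> Vec<Vec<String>> {
    let mut strategy: Option<Box<dyn ProcessingStrategy>> = None;
    let mut flushed = Vec::new();
    for event in events {
        match event {
            Event::Assigned => {
                // Close any previous strategy before starting with fresh state.
                if let Some(mut old) = strategy.take() {
                    flushed.push(old.close());
                }
                strategy = Some(factory.create());
            }
            Event::Message(payload) => {
                if let Some(s) = strategy.as_mut() {
                    s.submit(payload);
                }
            }
            Event::Revoked => {
                if let Some(mut old) = strategy.take() {
                    flushed.push(old.close());
                }
            }
        }
    }
    flushed
}
```

`Batcher` never learns which partitions moved or why; it is created, fed messages, and closed. The same loop would call `close` on consumer termination, so the application has a single cleanup path.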
In one sentence: simplify the development of streaming applications by providing reusable patterns and hiding the pitfalls that come with Kafka. This includes making it possible to unit test them.
Messaging/streaming applications tend to follow some common patterns (a lot of theory here: https://www.enterpriseintegrationpatterns.com/patterns/messaging/), so it is quite useful to have implementations of these higher-level primitives (rdkafka itself has some in the streaming consumer).
Preface
I do not use arroyo or work with kafka as part of my job, except when something goes wrong with our consumers. However, I would like to better understand how our kafka infrastructure works and make it user friendly enough that someone with no knowledge of kafka can write a high performance consumer. This issue questions arroyo's design but also my own understanding of it.
Problem
As arroyo is being re-implemented in another language, it is a worthy exercise to look at the way it is built and see if things can be done to improve code readability and user experience. Taking messages from kafka and putting them somewhere else is a key part of sentry and we want teams who know very little about kafka to write consumers for it. How can we get there?
Goals
Proposals
Clear boundaries on what arroyo is for
Building a kafka consumer has many complexities; we should be explicit about which of them are to be handled by arroyo and which of them the user of arroyo is supposed to worry about. We should be able to do this without referencing any of the trait/class names that we have introduced as part of arroyo. As a plus, this exercise builds the case for using arroyo in the first place.
This may be just a documentation issue; however, being clear about this can drive a better UX for contributors and users. This is the most important thing to be clear about; everything else is not nearly as important.
Composition over inheritance
arroyo-python makes heavy use of inheritance, and arroyo-rust makes heavy use of traits. Too much ability to customize can make the library more unapproachable and easier to misuse. Example:
What is the value gained from making a consumer a trait? Arroyo is a kafka consumer library. This goes back into being clear what arroyo is for. Why not just make an arroyo consumer a wrapper object around the rdkafka one?
Why would we process anything except kafka messages?
The use of the factory (AFAIK) is to reset the state of the strategy on partition revocation. Maybe a reset function on the strategy would be enough without introducing another layer of indirection?
Do the right thing by default
Seeing this comment in the ProcessingStrategy trait:
the trait requires asynchronous behavior. Maybe it would be better to make this an async function? This way the code could simply await when it does a blocking operation, making the consumer code clearer.