Proposal: Implement windowing semantics #99
This would be awesome! We don't have use cases for that at the moment, but I'm happy to discuss it. So, would you like to give it a try? :-)
I'd like to give it a try, but first I would like to have the design really clear.
The thing is that there seems to be no way to swap the context implementation; it is hardcoded in the processor here: https://sourcegraph.com/github.com/lovoo/goka/-/blob/processor.go#L643:9 What is the best way you see for someone to create a custom context? Pass it as an option? Or maybe create something like a context builder?
I think both are valid solutions, but I like the option better.
That's true. The thing is that in Kafka Streams nobody is supposed to use the state store changelog topics; they exist only for restoring the state store. But Goka explicitly allows other pieces to use the group table's changelog topic, which I think is also totally valid.
Nice! I think it should be easy to wrap the callbacks in `goka.DefineGroup`:

```go
g := goka.DefineGroup(
	tb.Input(topic, codec, callback), // callback will get wrapped
	tb.Persist(otherCodec),           // otherCodec will get wrapped
)
```

And the wrapper would do something like this:
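A minimal sketch of what the `tb` wrapper could do, assuming a hypothetical `tbContext` type (kept consistent with the longer example further down in this thread):

```go
// hypothetical specialized context wrapping goka.Context
type tbContext struct {
	goka.Context
	buffered []interface{} // messages buffered for the current key
}

func newContext(ctx goka.Context) *tbContext { return &tbContext{Context: ctx} }

// persist writes buffered messages and the user value back to the group table.
func (c *tbContext) persist() { /* encode a bucketValue and call SetValue */ }

// wrap turns a timebucket callback into a plain goka.ProcessCallback.
func wrap(cb func(ctx *tbContext, m interface{})) goka.ProcessCallback {
	return func(ctx goka.Context, m interface{}) {
		tbctx := newContext(ctx)
		tbctx.buffered = append(tbctx.buffered, m) // buffer the message
		cb(tbctx, m)                               // run the user callback
		tbctx.persist()                            // store buffer + value
	}
}
```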
A similar idea applies for persist. What do you think?
I guess that was unclear. I'll write a longer example later today.
Ok, so the idea was to create a package, e.g. `timebucket`. The table values could be defined like this:

```go
// bucket for the messages of one stream
type bucket struct {
	messages []interface{}
}

// wrapper around the user value. This is the actual value stored in the group table.
type bucketValue struct {
	// buckets for the streams that are buffered
	buckets map[goka.Stream]bucket
	// actual value of the user
	value interface{}
}

// codec of codecs: keeps codecs of buffered streams and table
type bucketCodec struct {
	stream map[goka.Stream]goka.Codec
	value  goka.Codec
}

func (bc *bucketCodec) Encode(value interface{}) (data []byte, err error) {
	// serialize a bucketValue using the codecs in bc
	return nil, nil
}

func (bc *bucketCodec) Decode(data []byte) (value interface{}, err error) {
	// deserialize a bucketValue using the codecs in bc
	return new(bucketValue), nil
}
```

The context would be a specialized context containing whatever functions are needed to get messages from the buckets. To process input streams with timebuckets, the user would implement callbacks with the signature `func(ctx timebucket.Context, m interface{})`, and the input edges would be built like this:
```go
// goka.Edge for buffered streams
func Input(topic goka.Stream, codec goka.Codec, cb goka.ProcessCallback, retrieve RetrieveTimeFunc) goka.Edge {
	return &inputStream{topic, codec, cb, retrieve}
}

type inputStream struct {
	topic    goka.Stream
	codec    goka.Codec
	cb       goka.ProcessCallback
	retrieve RetrieveTimeFunc
}

func (i *inputStream) String() string    { return "..." }
func (i *inputStream) Topic() string     { return string(i.topic) }
func (i *inputStream) Codec() goka.Codec { return i.codec }

// Functions with this signature can retrieve a timestamp from a message
type RetrieveTimeFunc func(message interface{}) (time.Time, error)
```
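For illustration, a `RetrieveTimeFunc` for a message type that carries its own timestamp could look like this (`ClickEvent` is a made-up type, not from this proposal):

```go
type ClickEvent struct {
	UserID string
	Time   time.Time
}

// clickTime satisfies RetrieveTimeFunc for ClickEvent messages.
func clickTime(message interface{}) (time.Time, error) {
	click, ok := message.(*ClickEvent)
	if !ok {
		return time.Time{}, fmt.Errorf("unexpected message type %T", message)
	}
	return click.Time, nil
}
```

It would then be passed as the last argument of the `Input` edge builder above.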
The persistency edge would be something like this:

```go
// goka.Edge for wrapped value
func Persist(codec goka.Codec, windowLength time.Duration) goka.Edge {
	return &table{codec: codec, windowLength: windowLength}
}

type table struct {
	topic        goka.Stream
	codec        goka.Codec
	windowLength time.Duration
}
```

When defining a group graph, we can wrap the callbacks and codecs and provide a normal `goka.GroupGraph`:

```go
// timebucket.DefineGraph
func DefineGraph(group goka.Group, edges ...goka.Edge) *goka.GroupGraph {
	var (
		t = findPersist(edges)      // persistency edge
		i = findTimeBucketed(edges) // timebucketed inputStreams
		o = removeFrom(edges, t, i) // remaining edges
		e []goka.Edge               // result
	)

	// 1. replace codec of persistency edge
	c := &bucketCodec{
		stream: make(map[goka.Stream]goka.Codec),
		value:  t.Codec(),
	}
	for _, edge := range i {
		c.stream[goka.Stream(edge.Topic())] = edge.Codec()
	}
	e = append(e, goka.Persist(c))

	// 2. wrap callbacks in timebucketed inputStreams
	for _, in := range i {
		in := in // capture loop variable for the closure
		// callback with the timebucket context
		cb := func(ctx goka.Context, m interface{}) {
			// create timebucket Context
			tbctx := newContext(ctx)
			// save message in bucket if necessary
			ts, err := in.retrieve(m)
			if err == nil && decide(ts, t.windowLength) {
				tbctx.bucket.Add(m)
			}
			// call user callback func(ctx timebucket.Context, m interface{})
			in.cb(tbctx, m)
			// if messages are in buffer and SetValue not called, call it now
			tbctx.persist()
		}
		e = append(e, goka.Input(goka.Stream(in.Topic()), in.Codec(), cb))
	}

	// 3. put everything together
	return goka.DefineGroup(group, append(e, o...)...)
}
```

I just don't know how … There are definitely other ways to inject that into goka. Let me know if you have other ideas.
Just read the thread, and from a high level it seems pretty reasonable, although conceptually there's something I don't like about the idea of duplicating core functions (goka.Stream etc.) specifically for doing windowing. I'll play a little more with Kafka Streams to see how it works internally, because so far I've only been reading the code without actually using it. On the other hand, there's an idea that comes to mind that should be taken into account: TTL for the records. Windows should have an expiry time (maybe optional) after which they go away. It may be something similar to Kafka's compaction mechanism. Kafka Streams uses RocksDB as a state store, and RocksDB has TTL built in, although I don't know whether they use it for windowing at all.
We'd not be duplicating core functions. The functions I suggested should be seen as "builders" which wrap the original code with additional functionality. But other solutions are also possible. And yes, it would be great to learn from your experience with Kafka Streams. Just the TTL for records sounds complex in the context of Goka: of course we can expire entries in the buckets when events arrive for the respective keys, but expiring entries in buckets that receive no messages will be harder.
I could see it being something along these lines (from a very high-level perspective):
Things to take into account:
A possible user experience could be something like this:
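Purely as an illustration of the shape such an API could take (the `windowed` package and all of its names are hypothetical, not existing goka API):

```go
// count clicks per user in 1-minute tumbling windows, keeping 24h of history
g := goka.DefineGroup("clicks-per-minute",
	windowed.Input("clicks", new(codec.String),
		func(ctx windowed.Context, msg interface{}) {
			var count int
			if v := ctx.Value(); v != nil { // value of the current window bucket
				count = v.(int)
			}
			ctx.SetValue(count + 1) // stored under key+window, not just key
		},
		windowed.Tumbling(time.Minute),   // window size
		windowed.Retention(24*time.Hour), // expiry for old windows
	),
	windowed.Persist(new(codec.Int)),
)
```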
This is very similar to how Kafka Streams works (without taking into account different window types and more advanced features). What do you think?
That sounds really nice! It is simpler than my proposal above and should also solve your issue with joining two streams, right? Do you have a suggestion for how to implement the cleanup process? Sorry for the late response.
@db7 In Kafka Streams, time is advanced only when new messages come in. So basically it could be something like this when we process a new message:
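A rough sketch of that bookkeeping (the types are illustrative, not goka API): stream time only moves forward, and expiry is evaluated whenever a message arrives:

```go
type windowState struct {
	StreamTime time.Time                 // highest event time seen so far
	Buckets    map[time.Time]interface{} // window start -> aggregate
}

// observe advances stream time, expires old windows, and returns the
// window bucket the message's timestamp falls into.
func (s *windowState) observe(ts time.Time, size, retention time.Duration) time.Time {
	if ts.After(s.StreamTime) {
		s.StreamTime = ts // time advances only on newer messages
	}
	// drop windows that fell out of the retention period
	for start := range s.Buckets {
		if s.StreamTime.Sub(start) > retention {
			delete(s.Buckets, start)
		}
	}
	return ts.Truncate(size)
}
```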
@burdiyan that sounds easy to implement. As far as I can see, all these changes can be implemented on top of the existing edges; we just have to be careful about which edges can be combined together. We were planning at some point to come up with an `operators` library built on top of the edges. For example, a mapper currently has to be written by hand:

```go
goka.DefineGroup(group,
	goka.Input(inTopic, inCodec, func(ctx goka.Context, m interface{}) {
		result := mapFunction(m)
		ctx.Emit(outTopic, ctx.Key(), result)
	}),
	goka.Output(outTopic, outCodec),
)
```
Instead of doing that by hand, the operators library would provide a GroupGraph builder like this:

```go
operators.Map(inTopic, inCodec, outTopic, outCodec,
	func(m interface{}) interface{} { return mapFunction(m) })
```

The same thing can be done for other patterns. A window operator could look like this:

```go
operators.Window("count-clicks-5m", new(codec.Int), 5*time.Minute, 24*time.Hour,
	operators.WindowInput("add-clicks", new(codec.String),
		func(value interface{}, message interface{}) interface{} {
			var count int
			if v := value; v != nil {
				count = v.(int)
			}
			count++
			return count
		}),
	operators.WindowInput("sub-clicks", new(codec.String),
		func(value interface{}, message interface{}) interface{} {
			var count int
			if v := value; v != nil {
				count = v.(int)
			}
			count--
			return count
		}),
)
```

The advantage of building such patterns on top of the edge primitives is that we have type safety when combining the edges, and we can restrict or extend the callback interface, tailoring it to the use case. I would prefer that to pushing this into the core. Also, instead of variadic edges, a builder could be provided:

```go
operators.Window(group, codec, time, time).
	AddInput(topic, codec, function).
	AddInput(...).
	Build()
```
@db7 Is there any update on this? I'd love to add time-based windowing to my application but am unsure how to implement it myself :(
@andrewmunro currently the easiest approach I would recommend is to use a group table and create a record there that manages all the windowing. For example, you could have something like:
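A sketch of what such a record could look like (the names are made up, not from this thread):

```go
// one record per key in the group table, managing its own windows
type WindowedState struct {
	// window start (unix seconds, truncated to the window size) -> aggregate
	Windows map[int64]float64
}

func (s *WindowedState) Add(ts time.Time, amount float64, size time.Duration) {
	if s.Windows == nil {
		s.Windows = make(map[int64]float64)
	}
	start := ts.Truncate(size).Unix()
	s.Windows[start] += amount
}
```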
Then on each incoming record from your input topic, you would retrieve that record, update the relevant windows, and store it back. Let me know if the above makes sense to you; I'm not sure I'm explaining myself clearly enough. IMO, something like this is currently the only way to implement windowing semantics without modifying Goka itself. BTW, @andrewmunro could you describe in a bit more detail the use case for the windowing semantics you need in your application?
@burdiyan That kind of makes sense. So here's an example: say I have an incoming stream of transactions from customers, and I want to aggregate the amount of money a customer has spent in different time windows, such as this week, this month, this year (maybe too much data to store). My input model would look like this:

```go
type Transaction struct {
	CustomerID string
	Amount     float64
}
```

And my group model looks like this:

```go
type Customer struct {
	ID               string
	AmountTransacted float64
}
```

And finally, my aggregator looks like this:

```go
g := goka.DefineGroup(
	group,
	goka.Input(topic, new(transaction.Codec), func(ctx goka.Context, msg interface{}) {
		t := msg.(*transaction.Transaction)
		c, ok := ctx.Value().(*customer.Customer)
		if !ok { // first transaction for this key
			c = new(customer.Customer)
		}
		c.AmountTransacted += t.Amount
		ctx.SetValue(c)
	}),
	goka.Persist(new(customer.Codec)),
)
```

Essentially what I want is to add an `AmountTransactedThisMonth` field that aggregates only within the current month:

```go
type Customer struct {
	ID                        string
	AmountTransacted          float64
	AmountTransactedThisMonth float64
}
```
@andrewmunro you'll also need to keep track of which month you are currently aggregating, using a timestamp, and then reset the monthly aggregation once the month is over. Something like this:

```go
func(ctx goka.Context, msg interface{}) {
	t := msg.(*transaction.Transaction)
	c, ok := ctx.Value().(*customer.Customer)
	if !ok { // first transaction for this key
		c = new(customer.Customer)
	}
	c.AmountTransacted += t.Amount
	// reset amount if month is over
	// (assumes Customer also has a Timestamp time.Time field)
	if t.Timestamp.Month() != c.Timestamp.Month() {
		c.Timestamp = t.Timestamp
		c.AmountTransactedThisMonth = 0
	}
	c.AmountTransactedThisMonth += t.Amount
	ctx.SetValue(c)
})
```

Also note that you don't need to keep the customer ID in your state, since it will be the same as the key.
@db7 I thought about this, but won't any views accessing a customer's state potentially be incorrect until that customer makes another transaction?
Do you mean when a month is over and the customer object has not yet been updated? You can check what the current month is when a service reads the data from the view. What we usually do is add methods to the type stored in the group table. Something like this:

```go
func (c *Customer) GetAmountThisMonth() float64 {
	if c.Timestamp.Month() != time.Now().Month() {
		return 0.0
	}
	return c.AmountTransactedThisMonth
}
```
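A consumer of the view would then call the method instead of reading the raw field, e.g. (assuming a goka view over the group table and a hypothetical key):

```go
val, err := view.Get("customer-42")
if err == nil && val != nil {
	c := val.(*customer.Customer)
	fmt.Println(c.GetAmountThisMonth()) // 0.0 if the stored month is stale
}
```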
Hi @db7 |
Hi @keisar, I haven't been working on this for a while; I guess @frairon would be able to help you better. If I understand your problem, every 15 minutes you need to create a message for every key of a table. The only idea I have at the moment is the following: in your main (not in a ProcessCallback), you create a ticker that iterates over the table every 15 minutes and sends a message for each key with an emitter. AFAIR you can iterate directly on top of the processor, but it may be better to create another program with a view of the table and iterate on the view.
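A sketch of that second program, using goka's View, Iterator, and Emitter (the stream name "window-ticks" and the codecs are placeholders):

```go
view, err := goka.NewView(brokers, goka.GroupTable(group), new(tableCodec))
if err != nil {
	log.Fatal(err)
}
go func() {
	if err := view.Run(context.Background()); err != nil {
		log.Fatal(err)
	}
}()

emitter, err := goka.NewEmitter(brokers, "window-ticks", new(codec.String))
if err != nil {
	log.Fatal(err)
}
defer emitter.Finish()

ticker := time.NewTicker(15 * time.Minute)
defer ticker.Stop()
for range ticker.C {
	it, err := view.Iterator()
	if err != nil {
		log.Println(err)
		continue
	}
	for it.Next() {
		// one small message per key; the processor consuming this
		// stream can then close/flush the window for that key
		if err := emitter.EmitSync(it.Key(), "tick"); err != nil {
			log.Println(err)
		}
	}
	it.Release()
}
```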
Thank you @db7 |
Hi @keisar, |
Hi @frairon |
Hi @frairon
Hopefully with this approach the transmitter can still scale, and the notifier doesn't do much (every 15 minutes it sends small messages for each window), so it won't be an issue scale-wise.
@keisar We've implemented similar windowing semantics in a goka processor by relying on a ticker and …
It would be great if Goka could natively provide primitives for windowed processing. It's a pretty complex topic, but it seems really needed in many stream-processing workloads.
Kafka Streams handles this by providing timestamp extractors to identify the actual time of a message, and by modifying the record keys stored in the state store, augmenting them with the window bucket.
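Conceptually, the key augmentation amounts to something like this (a sketch, not Kafka Streams' actual binary key format):

```go
// windowKey combines the record key with the start of the window the
// timestamp falls into, so each window gets its own store entry.
func windowKey(key string, ts time.Time, windowSize time.Duration) string {
	start := ts.Truncate(windowSize)
	return fmt.Sprintf("%s@%d", key, start.Unix())
}
```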
Another approach that may help users implement their own windowing would be for goka.Context to allow getting and setting arbitrary values in the storage. I really don't like it conceptually, but it gives more freedom and is less complicated to implement.
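That idea would amount to extending the context interface roughly like this (hypothetical, not part of goka):

```go
// StorageContext extends goka.Context with access to arbitrary keys of
// the processor's local storage, not just the current message's key.
type StorageContext interface {
	goka.Context
	Lookup(key string) (interface{}, error)
	Store(key string, value interface{}) error
}
```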
I'm sure you have been thinking about windowing in general. Do you have any clear idea of how it could be built into Goka?