Skip to content
This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Performance of List semigroup in state aggregation is slow #689

Closed
pankajroark opened this issue Sep 19, 2016 · 5 comments
Closed

Performance of List semigroup in state aggregation is slow #689

pankajroark opened this issue Sep 19, 2016 · 5 comments

Comments

@pankajroark
Copy link
Contributor

Summingbird online uses lists for storing state of each key in cache.
Ultimately the value that it sums (at least in SyncSummingQueue) looks like: Map[Key, (Seq[InputState], Value)]
Thus doing plus on list(the implementation of seq that summingbird uses in Summer and FinalFlatMap) semigroup is very common and is in the critical path of summingbird online. The performance of this semigroup is very slow because we do
semigroup.plus(existingValue, newValue)
while boils down to
oldList ++ List(newValue)

oldList is typically the bigger one. So the cost of above addition is proportional to the size of bigger list. Thus if we accumulate one value at a time in the list the cost is O(n^2) rather than O(n).

I've seen this show up in YourKit profiles.

Aggregating in batch doesn't help because the top level value being summed is a Map[Key, (List[State], Value)] and GenericMapMonoid doesn't do a sumOption on underlying values in the map when doing sumOption. So ultimately plus gets called on the list of state.

One option is to change the data structure used for storing list of state, we could use a vector for example.

Another possible solution is to somehow provide a different implementation for the List semigroup used for combining state lists. Changing the order should fix it for example. List(newValue) ++ oldList would be fast and for state the order doesn't matter much. We would acknowledge newer tuples first but that shouldn't matter much, afaik nothing relies on acking order. Potentially it could result in tuples timing out but typically timeouts are pretty large for the order in a single batch to matter.

Looking to hear thoughts on this.

@johnynek
Copy link
Collaborator

See batched:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Batched.scala

Wrapping with that type will force the use of sumOption and can address the
concerns here I think (fixes the O(N^2) for sure). Try and see if that
fixes it in your kit.
On Mon, Sep 19, 2016 at 11:16 Pankaj Gupta [email protected] wrote:

Summingbird online uses lists for storing state of each key in cache.
Ultimately the value that it sums (at least in SyncSummingQueue) looks
like: Map[Key, (Seq[InputState], Value)]
Thus doing plus on list(the implementation of seq that summingbird uses in
Summer and FinalFlatMap) semigroup is very common and is in the critical
path of summingbird online. The performance of this semigroup is very slow
because we do
semigroup.plus(existingValue, newValue)
while boils down to
oldList ++ List(newValue)

oldList is typically the bigger one. So the cost of above addition is
proportional to the size of bigger list. Thus if we accumulate one value at
a time in the list the cost is O(n^2) rather than O(n).

I've seen this show up in YourKit profiles.

Aggregating in batch doesn't help because the top level value being summed
is a Map[Key, (List[State], Value)] and GenericMapMonoid doesn't do a
sumOption on underlying values in the map when doing sumOption. So
ultimately plus gets called on the list of state.

One option is to change the data structure used for storing list of state,
we could use a vector for example.

Another possible solution is to somehow provide a different implementation
for the List semigroup used for combining state lists. Changing the order
should fix it for example. List(newValue) ++ oldList would be fast and for
state the order doesn't matter much. We would acknowledge newer tuples
first but that shouldn't matter much, afaik nothing relies on acking order.
Potentially it could result in tuples timing out but typically timeouts are
pretty large for the order in a single batch to matter.

Looking to hear thoughts on this.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#689, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEJdtsKeYTtv-gpNeXUssb3DuKsuQCKks5qrtGLgaJpZM4KAzIv
.

@johnynek
Copy link
Collaborator

johnynek commented Dec 24, 2016 via email

@johnynek
Copy link
Collaborator

johnynek commented Dec 24, 2016 via email

@pankajroark
Copy link
Contributor Author

pankajroark commented Dec 25, 2016 via email

@pankajroark
Copy link
Contributor Author

We ended up replacing list with chain to fix this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants