Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Split the state_group_cache in two #3726

Merged
merged 9 commits into from
Aug 21, 2018
Merged

Conversation

ara4n
Copy link
Member

@ara4n ara4n commented Aug 20, 2018

One half contains normal state events; the other contains member events.

The idea is that the lazyloading common case of wanting a subset of the members events but the entirety of the other events can be accomplished efficiently without having to make DictionaryCache aware of these sort of complicated semantics.

one half contains normal state events; the other contains member events.
the idea is that the lazyloading common case of wanting a subset of the members cache
but the entirety of the other cache can be accomplished efficiently without
having to make DictionaryCache aware of these sort of complicated semantics
@ara4n ara4n changed the title WIP: split the state_group_cache in two Split the state_group_cache in two Aug 21, 2018
@ara4n
Copy link
Member Author

ara4n commented Aug 21, 2018

no longer WIP; @richvdh ptal.

if members is True:
clause_to_args.append(("AND type = ?", (EventTypes.Member, )))
elif members is False:
clause_to_args.append(("AND type <> ?", (EventTypes.Member, )))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so suppose I call _get_state_groups_from_groups([group_id], types=[('m.room.topic', '')], members=False).

I think this means we will do two separate selects:

SELECT {...} WHERE state_group IN (...) AND type='m.room.topic' AND state_key='';
SELECT {...} WHERE state_group IN (...) AND type<>'m.room.member';

... which doesn't sound right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, you're right - i'd forgotten (again) that the postgres branch implements the clauses as a series of selects rather than concatenating them.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, having pored over this for some time, I still think it's confusing, but I'll concede that I don't have any other bright ideas right now. Some (hopefully small) thoughts below.

Your tests are failing, possibly because of the whole ? vs %s thing.

)
self._state_group_members_cache = DictionaryCache(
"*stateGroupMembersCache*",
500000 * get_cache_factor_for("stateGroupMembersCache")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need to remember to tune this when we deploy it.

@@ -61,7 +61,12 @@ def __init__(self, db_conn, hs):
super(StateGroupWorkerStore, self).__init__(db_conn, hs)

self._state_group_cache = DictionaryCache(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be useful to have a block comment somewhere (and here seems like a reasonable place) which explains in detail what's going on here, and what the two caches do.

@@ -284,6 +289,9 @@ def _get_state_groups_from_groups(self, groups, types):
types (Iterable[str, str|None]|None): list of 2-tuples of the form
(`type`, `state_key`), where a `state_key` of `None` matches all
state_keys for the `type`. If None, all types are returned.
members (Boolean|None): whether we are limiting this to return just
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool|None. Also the description could do with being a bit clearer. How about:

"If not None, then, in addition to any filtering implied by types, the results are also filtered to only include member events (if True), or to exclude member events (if False)"

@@ -358,12 +366,18 @@ def _get_state_groups_from_groups_txn(
# empty where clause with no extra args.
clause_to_args = [("", [])]

additional_clause = ""
if members is True:
additional_clause = "AND type = '?'" % EventTypes.Member
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't use % interpolation on a ? ... I guess you mean %s here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: generally we prefer % (foo, ) over % foo to avoid unpleasant surprises when foo turns out to be a tuple.

Copy link
Member

@richvdh richvdh Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we, instead of building additional_clause, do sql += " AND type = '%s' % (EventTypes.Member,) (and put it at line 350)? I think I'd find that clearer than messing with additional_clause.

Deferred[dict[int, dict[(type, state_key), EventBase]]]
a dictionary mapping from state group to state dictionary.
"""
if types is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding the multiple calls to _get_state_for_groups_using_cache quite hard to reason about, and the special-case of checking for filtered_types == [EventTypes.Member] is a bit sad (apart from the case of longer lists including EventTypes.member, what if filtered_types is actually a tuple or a set? I know the docstring claims it should be a list but I can't see any reason for that, and the failure mode will be subtle if someone gets it wrong).

How about:

        if types is not None:
            non_member_types = [t for t in types if t[0] != EventTypes.Member]

            if filtered_types is not None and EventTypes.Member not in filtered_types:
                # we want all of the membership events
                member_types = None
            else:
                member_types = [t for t in types if t[0] == EventTypes.Member]

        else:
            non_member_types = None
            member_types = None

        non_member_state = yield self._get_state_for_groups_using_cache(
            groups, self._state_group_cache, non_member_types, filtered_types,
        )
        member_state = yield self._get_state_for_groups_using_cache(
            groups, self._state_group_members_cache, member_types, None,
        )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is certainly much terser, but ironically i personally find it much more cryptic and hard to reason about, whereas the simple symmetricity of "if we're lazy-loading members, split the query intelligently. if we're filtering on types, split the query naively. otherwise, split the query without filtering" felt clearer by spelling out the flows we care about and following the same pattern for each branch.

However, I don't have strong feelings, and it's nice that this handles the whole "what if filtered_types is a longer list that contains EventTypes.Member" scenario better (which I'd considered, but fell through to the naive handler). So i've gone with it (with a comment to explain the somewhat magical 'None' in the final line).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the param was "we're lazy-loading members", I'd agree with you. But for better or worse, it's not.

@@ -725,7 +810,7 @@ def _get_state_for_groups(self, groups, types=None, filtered_types=None):
types_to_fetch = types

group_to_state_dict = yield self._get_state_groups_from_groups(
missing_groups, types_to_fetch
missing_groups, types_to_fetch, cache == self._state_group_members_cache,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit horrid and feels a bit backwards.

Can you pass a bool param (use_members_cache?) into this function instead of cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i follow. passing the right cache object into _get_state_for_groups_using_cache seems fine to me, and simplifies all the other cache references to just be cache.get etc? And _get_state_groups_from_groups is already taking the bool|None to say whether it should be limiting to members or non-members or not. Or is the problem just the inlining the boolean expression?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this was unclear.

I'm suggesting:

rather than pass a cache into _get_state_for_groups_using_cache, pass a bool which is True to use _state_group_members_cache and False to use _state_group_cache. Then on the first line (or wherever) of _get_state_for_groups_using_cache, do

cache = self._state_group_members_cache if use_members_cache else self._state_group_cache

and then on this line here you can just use use_members_cache instead of going back to a bool.

It just felt odd to be going back to a bool here. Though I don't feel that strongly about it so if you want to leave it alone that's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that's how i interpreted, but am unsure that a cryptic bool flying around the place is better than saying "use this cache please", even though it does mean we end up with the slightly backwards comparison here back to a bool. i'd rather leave it as is if you're borderline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@ara4n
Copy link
Member Author

ara4n commented Aug 21, 2018

@richvdh ptal; have incorporated all changes other than failing to understand why you object to cache == self._state_group_members_cache for determining which cache is being used. i'm also failing to see which bit is confusing in general - is it the splitting of the cache? or the clarity of the code?

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

# and the other for tracking member_events. This means that lazy loading
# queries can be made in a cache-friendly manner by querying both caches
# separately and then merging the result. So for the example above, you
# would query the members cache for a specific subset of state types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "subset of state keys" ?

# event IDs for the state types in a given state group to avoid hammering
# on the state_group* tables.
#
# The point using a DictionaryCache is that it can cache a subset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The point of using"

@richvdh
Copy link
Member

richvdh commented Aug 21, 2018

i'm also failing to see which bit is confusing in general - is it the splitting of the cache?

Don't take my confusion as a criticism of the PR specifically; it's the general complexity of the code as a whole, and a lot of it is down to the fact it's building on a codebase that is already somewhat confusing (as I think we've both agreed) and adding new dimensions of complexity to it.

That said there are things that I wish we weren't doing, though as I said I can't think of plausible better solutions:

  • the fact that we are basically special-casing members
  • the fact that _get_state_groups_from_groups takes a tri-state
  • the fact that said tri-state overlaps in a not-entirely-obvious-way with types
  • the fact that other bits of code have yet a third way of filtering keys called filtered_types (and _get_state_for_groups_using_cache has all three in one function, yay!)
  • _get_state_for_groups's magic for splitting the queries among the two caches is magic. (I do think it's better now you've adopted my rewrite, and I don't think that's just NIH syndrome)

@ara4n ara4n merged commit bb81e78 into develop Aug 21, 2018
@hawkowl hawkowl deleted the matthew/split_state_group_cache branch September 20, 2018 14:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants