-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Conversation
one half contains normal state events; the other contains member events. the idea is that the lazyloading common case of wanting a subset of the members cache but the entirety of the other cache can be accomplished efficiently without having to make DictionaryCache aware of these sort of complicated semantics
no longer WIP; @richvdh ptal. |
synapse/storage/state.py
Outdated
if members is True: | ||
clause_to_args.append(("AND type = ?", (EventTypes.Member, ))) | ||
elif members is False: | ||
clause_to_args.append(("AND type <> ?", (EventTypes.Member, ))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so suppose I call _get_state_groups_from_groups([group_id], types=[('m.room.topic', '')], members=False)
.
I think this means we will do two separate selects:
SELECT {...} WHERE state_group IN (...) AND type='m.room.topic' AND state_key='';
SELECT {...} WHERE state_group IN (...) AND type<>'m.room.member';
... which doesn't sound right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, you're right - i'd forgotten (again) that the postgres branch implements the clauses as a series of selects rather than concatenating them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, having pored over this for some time, I still think it's confusing, but I'll concede that I don't have any other bright ideas right now. Some (hopefully small) thoughts below.
Your tests are failing, possibly because of the whole ? vs %s thing.
) | ||
self._state_group_members_cache = DictionaryCache( | ||
"*stateGroupMembersCache*", | ||
500000 * get_cache_factor_for("stateGroupMembersCache") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll need to remember to tune this when we deploy it.
synapse/storage/state.py
Outdated
@@ -61,7 +61,12 @@ def __init__(self, db_conn, hs): | |||
super(StateGroupWorkerStore, self).__init__(db_conn, hs) | |||
|
|||
self._state_group_cache = DictionaryCache( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be useful to have a block comment somewhere (and here seems like a reasonable place) which explains in detail what's going on here, and what the two caches do.
synapse/storage/state.py
Outdated
@@ -284,6 +289,9 @@ def _get_state_groups_from_groups(self, groups, types): | |||
types (Iterable[str, str|None]|None): list of 2-tuples of the form | |||
(`type`, `state_key`), where a `state_key` of `None` matches all | |||
state_keys for the `type`. If None, all types are returned. | |||
members (Boolean|None): whether we are limiting this to return just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool|None
. Also the description could do with being a bit clearer. How about:
"If not None, then, in addition to any filtering implied by types
, the results are also filtered to only include member events (if True), or to exclude member events (if False)"
synapse/storage/state.py
Outdated
@@ -358,12 +366,18 @@ def _get_state_groups_from_groups_txn( | |||
# empty where clause with no extra args. | |||
clause_to_args = [("", [])] | |||
|
|||
additional_clause = "" | |||
if members is True: | |||
additional_clause = "AND type = '?'" % EventTypes.Member |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't use % interpolation on a ? ... I guess you mean %s
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: generally we prefer % (foo, )
over % foo
to avoid unpleasant surprises when foo turns out to be a tuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we, instead of building additional_clause, do sql += " AND type = '%s' % (EventTypes.Member,)
(and put it at line 350)? I think I'd find that clearer than messing with additional_clause
.
Deferred[dict[int, dict[(type, state_key), EventBase]]] | ||
a dictionary mapping from state group to state dictionary. | ||
""" | ||
if types is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding the multiple calls to _get_state_for_groups_using_cache
quite hard to reason about, and the special-case of checking for filtered_types == [EventTypes.Member]
is a bit sad (apart from the case of longer lists including EventTypes.member
, what if filtered_types is actually a tuple or a set? I know the docstring claims it should be a list but I can't see any reason for that, and the failure mode will be subtle if someone gets it wrong).
How about:
if types is not None:
non_member_types = [t for t in types if t[0] != EventTypes.Member]
if filtered_types is not None and EventTypes.Member not in filtered_types:
# we want all of the membership events
member_types = None
else:
member_types = [t for t in types if t[0] == EventTypes.Member]
else:
non_member_types = None
member_types = None
non_member_state = yield self._get_state_for_groups_using_cache(
groups, self._state_group_cache, non_member_types, filtered_types,
)
member_state = yield self._get_state_for_groups_using_cache(
groups, self._state_group_members_cache, member_types, None,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is certainly much terser, but ironically i personally find it much more cryptic and hard to reason about, whereas the simple symmetricity of "if we're lazy-loading members, split the query intelligently. if we're filtering on types, split the query naively. otherwise, split the query without filtering" felt clearer by spelling out the flows we care about and following the same pattern for each branch.
However, I don't have strong feelings, and it's nice that this handles the whole "what if filtered_types is a longer list that contains EventTypes.Member" scenario better (which I'd considered, but fell through to the naive handler). So i've gone with it (with a comment to explain the somewhat magical 'None' in the final line).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the param was "we're lazy-loading members", I'd agree with you. But for better or worse, it's not.
@@ -725,7 +810,7 @@ def _get_state_for_groups(self, groups, types=None, filtered_types=None): | |||
types_to_fetch = types | |||
|
|||
group_to_state_dict = yield self._get_state_groups_from_groups( | |||
missing_groups, types_to_fetch | |||
missing_groups, types_to_fetch, cache == self._state_group_members_cache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit horrid and feels a bit backwards.
Can you pass a bool param (use_members_cache
?) into this function instead of cache
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i follow. passing the right cache object into _get_state_for_groups_using_cache
seems fine to me, and simplifies all the other cache references to just be cache.get
etc? And _get_state_groups_from_groups
is already taking the bool|None
to say whether it should be limiting to members or non-members or not. Or is the problem just the inlining the boolean expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this was unclear.
I'm suggesting:
rather than pass a cache
into _get_state_for_groups_using_cache
, pass a bool which is True to use _state_group_members_cache
and False to use _state_group_cache
. Then on the first line (or wherever) of _get_state_for_groups_using_cache
, do
cache = self._state_group_members_cache if use_members_cache else self._state_group_cache
and then on this line here you can just use use_members_cache
instead of going back to a bool.
It just felt odd to be going back to a bool here. Though I don't feel that strongly about it so if you want to leave it alone that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, that's how i interpreted, but am unsure that a cryptic bool flying around the place is better than saying "use this cache please", even though it does mean we end up with the slightly backwards comparison here back to a bool. i'd rather leave it as is if you're borderline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
@richvdh ptal; have incorporated all changes other than failing to understand why you object to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
synapse/storage/state.py
Outdated
# and the other for tracking member_events. This means that lazy loading | ||
# queries can be made in a cache-friendly manner by querying both caches | ||
# separately and then merging the result. So for the example above, you | ||
# would query the members cache for a specific subset of state types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be "subset of state keys" ?
synapse/storage/state.py
Outdated
# event IDs for the state types in a given state group to avoid hammering | ||
# on the state_group* tables. | ||
# | ||
# The point using a DictionaryCache is that it can cache a subset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The point of using"
Don't take my confusion as a criticism of the PR specifically; it's the general complexity of the code as a whole, and a lot of it is down to the fact it's building on a codebase that is already somewhat confusing (as I think we've both agreed) and adding new dimensions of complexity to it. That said there are things that I wish we weren't doing, though as I said I can't think of plausible better solutions:
|
One half contains normal state events; the other contains member events.
The idea is that the lazyloading common case of wanting a subset of the members events but the entirety of the other events can be accomplished efficiently without having to make DictionaryCache aware of these sort of complicated semantics.