
long-running /search query on matrix.org after client disconnected #14290

Closed
anoadragon453 opened this issue Oct 25, 2022 · 5 comments
Labels
A-Database: DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
A-Message-Search: Searching messages
A-Performance: Performance, both client-facing and admin-facing
O-Uncommon: Most users are unlikely to come across this or unexpected workflow
T-Other: Questions, user support, anything else.

Comments

@anoadragon453
Member

anoadragon453 commented Oct 25, 2022

Discovered a long-running SearchRestServlet request (a text search across all rooms) on matrix.org for a user who is only in 21 rooms, though perhaps one of those rooms is large.

Logs:

$ grep "POST-2984873" homeserver.log
2730070:2022-10-25 07:30:34,357 - synapse.handlers.search - 142 - INFO - POST-2984873 - Search batch properties: None, None, None
2730071:2022-10-25 07:30:34,358 - synapse.handlers.search - 149 - INFO - POST-2984873 - Search content: {'search_categories': {'room_events': {'search_term': 'update', 'filter': {'limit': 10}, 'order_by': 'recent', 'event_context': {'before_limit': 1, 'after_limit': 1, 'include_profile': True}}}}
2755696:2022-10-25 07:33:34,346 - synapse.http.site - 372 - INFO - POST-2984873 - Connection from client lost before response was sent
4542688:2022-10-25 11:06:10,238 - synapse.handlers.search - 327 - INFO - POST-2984873 - Found 0 events to return
4542689:2022-10-25 11:06:10,239 - synapse.http.server - 744 - WARNING - POST-2984873 - Not sending response to request <XForwardedForRequest at 0x7f979c09ae50 method='POST' uri='/_matrix/client/r0/search' clientproto='HTTP/1.1' site='8080'>, already disconnected.
4542690:2022-10-25 11:06:10,239 - synapse.access.http.8080 - 460 - INFO - POST-2984873 - 194.126.177.75 - 8080 - {@[redacted]:matrix.org} Processed request: 12935.892sec/-12755.893sec (0.004sec, 0.001sec) (0.003sec/12935.878sec/8) 0B 200! "POST /_matrix/client/r0/search HTTP/1.1" "[redacted]" [0 dbevts]

Note that the request was processing for ~3.5 hours, and there was a spike in DB traffic right at the end as well. This looks like a good candidate endpoint to make cancellable.
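
As an illustration of what "cancellable" means in practice, here is a minimal sketch in plain Twisted (not Synapse's actual request machinery; the resource name, port and 60-second delay are made up): when request.notifyFinish() errbacks because the client disconnected, the pending work is cancelled instead of being left to run for hours.

from twisted.internet import defer, reactor
from twisted.web.resource import Resource
from twisted.web.server import NOT_DONE_YET, Site


class SlowSearch(Resource):
    """Toy stand-in for a slow /search endpoint."""

    isLeaf = True

    def render_POST(self, request):
        # Pretend this is the expensive search; here it just waits 60s.
        d = defer.Deferred()
        delayed = reactor.callLater(60, d.callback, b'{"results": []}')
        d.addCallback(lambda body: (request.write(body), request.finish()))
        # Swallow the CancelledError we generate below.
        d.addErrback(lambda f: f.trap(defer.CancelledError))

        def on_disconnect(_failure):
            # Client went away: stop the pending work instead of letting it
            # run to completion.
            if delayed.active():
                delayed.cancel()
            d.cancel()

        # notifyFinish() errbacks if the connection is lost before the
        # response has been sent.
        request.notifyFinish().addErrback(on_disconnect)
        return NOT_DONE_YET


if __name__ == "__main__":
    reactor.listenTCP(8080, Site(SlowSearch()))
    reactor.run()

Synapse's own machinery is more involved, since the request-handling Deferred has to be cancelled without interrupting things like in-flight DB transactions, which is what the rest of this thread discusses.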

anoadragon453 added the A-Performance, T-Other, A-Database and O-Uncommon labels on Oct 25, 2022
MadLittleMods added the A-Message-Search label on Oct 25, 2022
@squahtx
Contributor

squahtx commented Oct 31, 2022

Do we know whether the 3.5 hours was all in a single big DB transaction or lots of smaller transactions?
Cancellation would help with the latter, but not the former.
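
To illustrate the distinction, here is a toy model (not Synapse code; the timings are made up): a blocking query that is already executing cannot be interrupted from Python, so cancellation can only be observed at transaction boundaries. Many small transactions provide plenty of such boundaries; one big transaction provides none until the very end.

import threading
import time

cancelled = threading.Event()  # set when the client disconnects


def run_transaction(seconds: float) -> None:
    # Stand-in for a blocking SQL query running on a DB threadpool thread:
    # once started, it runs to completion regardless of cancellation.
    time.sleep(seconds)


def one_big_transaction() -> None:
    run_transaction(5.0)  # a cancellation arriving mid-way changes nothing


def many_small_transactions() -> None:
    for _ in range(10):
        if cancelled.is_set():  # noticed at the next transaction boundary
            return
        run_transaction(0.5)


def demo(work) -> float:
    cancelled.clear()
    threading.Timer(1.0, cancelled.set).start()  # "client disconnects" at t=1s
    start = time.monotonic()
    work()
    return time.monotonic() - start


print(f"one big transaction:     finished after {demo(one_big_transaction):.1f}s")
print(f"many small transactions: finished after {demo(many_small_transactions):.1f}s")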

@anoadragon453
Member Author

Looking at SearchStore.search_msgs, I believe the bulk of the time is spent in a single query, unfortunately:

async def search_msgs(
    self, room_ids: Collection[str], search_term: str, keys: Iterable[str]
) -> JsonDict:
    """Performs a full text search over events with given keys.

    Args:
        room_ids: List of room ids to search in
        search_term: Search term to search for
        keys: List of keys to search in, currently supports
            "content.body", "content.name", "content.topic"

    Returns:
        Dictionary of results
    """
    clauses = []

    args: List[Any] = []

    # Make sure we don't explode because the person is in too many rooms.
    # We filter the results below regardless.
    if len(room_ids) < 500:
        clause, args = make_in_list_sql_clause(
            self.database_engine, "room_id", room_ids
        )
        clauses = [clause]

    local_clauses = []
    for key in keys:
        local_clauses.append("key = ?")
        args.append(key)

    clauses.append("(%s)" % (" OR ".join(local_clauses),))

    count_args = args
    count_clauses = clauses

    if isinstance(self.database_engine, PostgresEngine):
        search_query = search_term
        tsquery_func = self.database_engine.tsquery_func
        sql = f"""
        SELECT ts_rank_cd(vector, {tsquery_func}('english', ?)) AS rank,
        room_id, event_id
        FROM event_search
        WHERE vector @@ {tsquery_func}('english', ?)
        """
        args = [search_query, search_query] + args

        count_sql = f"""
        SELECT room_id, count(*) as count FROM event_search
        WHERE vector @@ {tsquery_func}('english', ?)
        """
        count_args = [search_query] + count_args
    elif isinstance(self.database_engine, Sqlite3Engine):
        search_query = _parse_query_for_sqlite(search_term)

        sql = """
        SELECT rank(matchinfo(event_search)) as rank, room_id, event_id
        FROM event_search
        WHERE value MATCH ?
        """
        args = [search_query] + args

        count_sql = """
        SELECT room_id, count(*) as count FROM event_search
        WHERE value MATCH ?
        """
        count_args = [search_query] + count_args
    else:
        # This should be unreachable.
        raise Exception("Unrecognized database engine")

    for clause in clauses:
        sql += " AND " + clause

    for clause in count_clauses:
        count_sql += " AND " + clause

    # We add an arbitrary limit here to ensure we don't try to pull the
    # entire table from the database.
    sql += " ORDER BY rank DESC LIMIT 500"

    results = await self.db_pool.execute(
        "search_msgs", self.db_pool.cursor_to_dict, sql, *args
    )

    results = list(filter(lambda row: row["room_id"] in room_ids, results))

    # We set redact_behaviour to block here to prevent redacted events being returned in
    # search results (which is a data leak)
    events = await self.get_events_as_list(  # type: ignore[attr-defined]
        [r["event_id"] for r in results],
        redact_behaviour=EventRedactBehaviour.block,
    )

    event_map = {ev.event_id: ev for ev in events}

    highlights = None
    if isinstance(self.database_engine, PostgresEngine):
        highlights = await self._find_highlights_in_postgres(
            search_query, events, tsquery_func
        )

    count_sql += " GROUP BY room_id"

    count_results = await self.db_pool.execute(
        "search_rooms_count", self.db_pool.cursor_to_dict, count_sql, *count_args
    )

    count = sum(row["count"] for row in count_results if row["room_id"] in room_ids)

    return {
        "results": [
            {"event": event_map[r["event_id"]], "rank": r["rank"]}
            for r in results
            if r["event_id"] in event_map
        ],
        "highlights": highlights,
        "count": count,
    }
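
For reference, a rough reconstruction (a sketch, not taken from the logs; it assumes Postgres, the default search keys, and that plainto_tsquery is what tsquery_func resolves to, which depends on the Postgres version) of the single statement built above for this request:

tsquery_func = "plainto_tsquery"  # assumption: depends on the Postgres version
sql = f"""
SELECT ts_rank_cd(vector, {tsquery_func}('english', ?)) AS rank,
       room_id, event_id
FROM event_search
WHERE vector @@ {tsquery_func}('english', ?)
 AND room_id IN (?, ..., ?)           -- the user's 21 rooms (21 < 500)
 AND (key = ? OR key = ? OR key = ?)  -- content.body, content.name, content.topic
ORDER BY rank DESC LIMIT 500
"""
print(sql)

So the room filter is applied in SQL here, but the full-text match and ranking still run as one statement.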

@DMRobertson
Contributor

If I was executing a long-running query in psql I could press ^C to interrupt it. Does psycopg2 offer a similar mechanism that we could trigger on cancellation?

@squahtx
Contributor

squahtx commented Nov 2, 2022

If I was executing a long-running query in psql I could press ^C to interrupt it. Does psycopg2 offer a similar mechanism that we could trigger on cancellation?

There is, according to this Stack Overflow answer (https://stackoverflow.com/a/2275123):

You can cancel a query by calling the pg_cancel_backend(pid) PostgreSQL function in a separate connection.

You can know the PID of the backend to cancel from the connection.get_backend_pid() method of psycopg2 (available from version 2.0.8).

Plumbing everything together is the challenging part. The cancellation has to travel from the main thread to the DB threadpool thread running the transaction, and there's Twisted code in the way.
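
A minimal sketch of that mechanism, outside of Synapse entirely (the DSN and the slow query are placeholders): note the backend PID of the connection running the slow query, then cancel it from a second connection with pg_cancel_backend().

import threading
import time

import psycopg2
import psycopg2.extensions

DSN = "dbname=synapse user=synapse"  # placeholder connection string

conn = psycopg2.connect(DSN)
pid = conn.get_backend_pid()


def run_slow_query() -> None:
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_sleep(600)")  # stand-in for the slow search
    except psycopg2.extensions.QueryCanceledError:
        print("query was cancelled")


worker = threading.Thread(target=run_slow_query)
worker.start()
time.sleep(1)  # let the query get going

# From a *second* connection, ask Postgres to cancel that backend's query.
admin_conn = psycopg2.connect(DSN)
with admin_conn.cursor() as cur:
    cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
admin_conn.close()

worker.join()
conn.close()

psycopg2 also exposes connection.cancel(), which sends the same cancel request for that connection's current command without needing a second connection; the hard part, as noted above, is routing the cancellation through Twisted to the right threadpool thread.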

@squahtx
Contributor

squahtx commented Nov 17, 2022

I suspect that @erikjohnston's proposed fix in #14402 has resolved this issue, so I shall close it.

We'll reopen the issue if we see an excessively slow /search again.
