Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add documentation about the user directory search algorithm #16320

Merged
merged 9 commits into from
Sep 26, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/16320.doc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Improve documentation of the user directory search algorithm.
134 changes: 109 additions & 25 deletions docs/user_directory.md
Original file line number Diff line number Diff line change
@@ -1,49 +1,133 @@
User Directory API Implementation
=================================
# User Directory API Implementation

The user directory is currently maintained based on the 'visible' users
on this particular server - i.e. ones which your account shares a room with, or
who are present in a publicly viewable room present on the server.
The user directory is maintained based on the 'visible' users of a homeserver -
i.e. ones which are local to the server and ones which any local user shares a
clokep marked this conversation as resolved.
Show resolved Hide resolved
room with.

The directory info is stored in various tables, which can (typically after
DB corruption) get stale or out of sync. If this happens, for now the
The directory info is stored in various tables, which can sometimes get out of
sync (although this is considered a bug). If this happens, for now the
solution to fix it is to use the [admin API](usage/administration/admin_api/background_updates.md#run)
and execute the job `regenerate_directory`. This should then start a background task to
flush the current tables and regenerate the directory.
flush the current tables and regenerate the directory. Depending on the size
of your homeserver (number of users and rooms) this can take a while.

Data model
----------
## Data model

There are five relevant tables that collectively form the "user directory".
Three of them track a master list of all the users we could search for.
The last two (collectively called the "search tables") track who can
see who.
Three of them track a list of all known users. The last two (collectively called
the "search tables") track which users are visible to each other.

From all of these tables we exclude three types of local user:
- support users
- appservice users
- deactivated users

* `user_directory`. This contains the user_id, display name and avatar we'll
return when you search the directory.
- Because there's only one directory entry per user, it's important that we only
ever put publicly visible names here. Otherwise we might leak a private

- support users
- appservice users
- deactivated users

A description of each table follows:

* `user_directory`. This contains the user ID, display name and avatar of each user.
- Because there is only one directory entry per user, it is important that it
only contain publicly visible information. Otherwise, this will leak the
nickname or avatar used in a private room.
- Indexed on rooms. Indexed on users.

DMRobertson marked this conversation as resolved.
Show resolved Hide resolved
* `user_directory_search`. To be joined to `user_directory`. It contains an extra
column that enables full text search based on user ids and display names.
Different schemas for SQLite and Postgres with different code paths to match.
column that enables full text search based on user IDs and display names.
Different schemas for SQLite and Postgres are used.
- Indexed on the full text search data. Indexed on users.

Comment on lines 34 to 37
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: I always found it strange that this is a separate table. I guess it's to limit the sqlite vs postgres differences to a small table rather than a larger one?

Copy link
Member Author

@clokep clokep Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because you enable "full text search" on a table, but the database engines then backs that by multiple tables that are hidden from you.

* `user_directory_stream_pos`. When the initial background update to populate
the directory is complete, we record a stream position here. This indicates
that synapse should now listen for room changes and incrementally update
the directory where necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* `users_in_public_rooms`. Contains associations between users and the public rooms they're in.
Used to determine which users are in public rooms and should be publicly visible in the directory.
* `users_in_public_rooms`. Contains associations between users and the public
rooms they're in. Used to determine which users are in public rooms and should
be publicly visible in the directory. Both local and remote users are tracked.

* `users_who_share_private_rooms`. Rows are triples `(L, M, room id)` where `L`
is a local user and `M` is a local or remote user. `L` and `M` should be
different, but this isn't enforced by a constraint.

Note that if two local users share a room then there will be two entries:
`(user1, user2, !room_id)` and `(user2, user1, !room_id)`.

## Configuration options

The exact way user search works can be tweaked via some server-level
[configuration options](usage/configuration/config_documentation.md#user_directory).

The information is not repeated here, but the options are mentioned below.

## Search algorithm

If `search_all_users` is `false`, then results are limited to users who:

clokep marked this conversation as resolved.
Show resolved Hide resolved
1. Are found in the `users_in_public_rooms` table, or
2. Are found in the `users_who_share_private_rooms` where `L` is the requesting
user and `M` is the search result.

Otherwise, if `search_all_users` is `true`, no such limits are placed and all
users known to the server (matching the search query) will be returned.

By default, locked users are not returned. If `show_locked_users` is `true` then
no filtering on the locked status of a user is done.

DMRobertson marked this conversation as resolved.
Show resolved Hide resolved
The user provided search term is lowercased and normalized using [NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization),
this treats the string as case-insensitive, canonicalizes different forms of the
same text, and maps some "roughly equivalent" characters together.

The search term is then split into words:

* If [ICU](https://en.wikipedia.org/wiki/International_Components_for_Unicode) is
available, then the system's [default locale](https://unicode-org.github.io/icu/userguide/locale/#default-locales)
will be used to break the search term into words. (See the
[installation instructions](setup/installation.md) for how to install ICU.)
* If unavailable, then runs of ASCII characters, numbers, underscores, and hypens
are considered words.

The queries for PostgreSQL and SQLite are detailed below, by their overall goal
is to find matching users, preferring users who are "real" (e.g. not bots,
not deactivated). It is assumed that real users will have an display name and
avatar set.

### PostgreSQL

The above words are then transformed into two queries:

1. "exact" which matches the parsed words exactly (using [`to_tsquery`](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES));
2. "prefix" which matches the parsed words as prefixes (using `to_tsquery`).

Results are composed of all rows in the `user_directory_search` table whose information
matches one (or both) of these queries. Results are ordered by calculating a weighted
score for each result, higher scores are returned first:

* 4x if a user ID exists.
* 1.2x if the user has a display name set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, when would this not be the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same question. I think it is not-nullable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> \d user_directory_search
        Table "user_directory_search"
 Column  |   Type   | Collation | Nullable | Default 
---------+----------+-----------+----------+---------
 user_id | text     |           | not null | 
 vector  | tsvector |           |          | 
Indexes:
    "user_directory_search_fts_idx" gin (vector)
    "user_directory_search_user_idx" UNIQUE, btree (user_id)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this actually means the initial weight is 4, not 1? 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively, though it won't make much difference: sorting by val and 4 * val should give the same results1!

Footnotes

  1. except that the former is less prone to overflow, I guess

* 1.2x if the user has an avatar set.
* 3x by the full text search results using the [`ts_rank_cd` function](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING)
clokep marked this conversation as resolved.
Show resolved Hide resolved
against the "exact" search query; this has four variables with the following weightings:
clokep marked this conversation as resolved.
Show resolved Hide resolved
* `D`: 0.1 for the user ID's domain
clokep marked this conversation as resolved.
Show resolved Hide resolved
* `C`: 0.1 for unused
* `B`: 0.9 for the user's display name (or an empty string if it is not set)
* `A`: 0.1 for the user ID's localpart
* 1x by the full text search results using the `ts_rank_cd` function against the
"prefix" search query. (Using the same weightings as above.)
* If `prefer_local_users` is `true`, then 2x if the user is local to the homeserver.

Note that `ts_rank_cd` returns a weight between 0 and 1. The initial weighting of
all results is 1.

### SQLite

Results are composed of all rows in the `user_directory_search` whose information
matches the query. Results are ordered by the following information, with each
subsequent column used as a tiebreaker, for each result:

1. By the [`rank`](https://www.sqlite.org/windowfunctions.html#built_in_window_functions)
of the full text search results using the [`matchinfo` function](https://www.sqlite.org/fts3.html#matchinfo). Higher
ranks are returned first.
2. If `prefer_local_users` is `true`, then users local to the homeserver are
returned first.
3. Users with a display name set are returned first.
4. Users with an avatar set are returned first.
Loading