Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full-text search using SearchVector #8430

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Netacci
Copy link

@Netacci Netacci commented Jan 16, 2025

Bugzilla link
Work in progress document

Before

The search functionality allowed users to search only by author email address or ID. For example:

Endpoint: http://localhost:8000/api/project/try/push/[email protected]
Result: Returns all revisions authored by [email protected]

After

The search functionality has been enhanced to support more advanced use cases. Users can now search by additional attributes like bug_numbers, summary, etc., using a unified search parameter (search).

The implementation leverages Django's SearchVector to perform full-text searches across multiple fields dynamically

The search combines revision, author, and comments fields

Endpoint: http://localhost:8000/api/project/try/push/?search=1906541
Result: Returns results matching the query across relevant fields such as bug_numbers, summary, author, and revisions.

17.01.2025_07.48.28_REC.mp4

treeherder/model/models.py Outdated Show resolved Hide resolved
treeherder/model/models.py Outdated Show resolved Hide resolved
treeherder/model/models.py Outdated Show resolved Hide resolved
treeherder/model/models.py Outdated Show resolved Hide resolved
treeherder/webapp/api/push.py Outdated Show resolved Hide resolved
treeherder/webapp/api/push.py Outdated Show resolved Hide resolved
treeherder/webapp/api/push.py Outdated Show resolved Hide resolved
treeherder/webapp/api/push.py Outdated Show resolved Hide resolved
@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch 3 times, most recently from a1e1520 to ed99452 Compare January 24, 2025 12:33
@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch from 6bab3df to 186cabc Compare January 27, 2025 14:58
@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch from 186cabc to f6af36a Compare January 27, 2025 15:30
treeherder/model/models.py Show resolved Hide resolved
treeherder/webapp/api/push.py Outdated Show resolved Hide resolved
@Netacci Netacci changed the title WIP-Set full text search using search vector Full-text search using SearchVector Jan 29, 2025
@Netacci Netacci requested a review from Archaeopteryx January 29, 2025 18:55
@Netacci Netacci marked this pull request as ready for review January 29, 2025 18:56
@julienw
Copy link
Contributor

julienw commented Jan 30, 2025

FTR here are the SQL queries generated from the current code for search=1916016 (I added command: ["postgres", "-c", "log_statement=all"] to the configuration for postgres in docker-compose.yml:

SELECT "repository"."id", "repository"."repository_group_id", "repository"."name", "repository"."dvcs_type", "repository"."url", "repository"."branch", "repository"."codebase", "repository"."description", "repository"."active_status", "repository"."life_cycle_order", "repository"."performance_alerts_enabled", "repository"."expire_performance_data", "repository"."is_try_repo", "repository"."tc_root_url" FROM "repository" WHERE "repository"."name" = 'mozilla-central' LIMIT 21
SELECT "repository"."id", "repository"."repository_group_id", "repository"."name", "repository"."dvcs_type", "repository"."url", "repository"."branch", "repository"."codebase", "repository"."description", "repository"."active_status", "repository"."life_cycle_order", "repository"."performance_alerts_enabled", "repository"."expire_performance_data", "repository"."is_try_repo", "repository"."tc_root_url" FROM "repository" WHERE "repository"."name" = 'mozilla-central' LIMIT 21
SELECT "push"."id", "push"."repository_id", "push"."revision", "push"."author", "push"."time", "repository"."id", "repository"."repository_group_id", "repository"."name", "repository"."dvcs_type", "repository"."url", "repository"."branch", "repository"."codebase", "repository"."description", "repository"."active_status", "repository"."life_cycle_order", "repository"."performance_alerts_enabled", "repository"."expire_performance_data", "repository"."is_try_repo", "repository"."tc_root_url" FROM "push" INNER JOIN "repository" ON ("push"."repository_id" = "repository"."id") WHERE ("push"."repository_id" = 1 AND "push"."id" IN (SELECT "subquery"."push_id" FROM (SELECT DISTINCT U0."push_id", U1."time" FROM "commit" U0 INNER JOIN "push" U1 ON (U0."push_id" = U1."id") WHERE (U1."repository_id" = 1 AND to_tsvector('english'::regconfig, COALESCE(U0."revision", '') || ' ' || COALESCE(U0."author", '') || ' ' || COALESCE(U0."comments", '')) @@ (plainto_tsquery('english'::regconfig, '1916016'))) ORDER BY U1."time" DESC LIMIT 200) subquery)) ORDER BY "push"."time" DESC LIMIT 10
SELECT "commit"."id", "commit"."push_id", "commit"."revision", "commit"."author", "commit"."comments" FROM "commit" WHERE "commit"."push_id" IN (34)
SELECT "commit"."id", "commit"."push_id", "commit"."revision", "commit"."author", "commit"."comments" FROM "commit" WHERE "commit"."push_id" = 34 ORDER BY "commit"."id" DESC LIMIT 20

I think only the 3rd one comes from this patch, let me reformat it:

SELECT "push"."id", "push"."repository_id", "push"."revision", "push"."author", "push"."time", "repository"."id", "repository"."repository_group_id", "repository"."name", "repository"."dvcs_type", "repository"."url", "repository"."branch", "repository"."codebase", "repository"."description", "repository"."active_status", "repository"."life_cycle_order", "repository"."performance_alerts_enabled", "repository"."expire_performance_data", "repository"."is_try_repo", "repository"."tc_root_url"
FROM "push"
INNER JOIN "repository" ON ("push"."repository_id" = "repository"."id")
WHERE ("push"."repository_id" = 1 AND "push"."id" IN (
  SELECT "subquery"."push_id"
  FROM (
    SELECT DISTINCT U0."push_id", U1."time"
    FROM "commit" U0
    INNER JOIN "push" U1 ON (U0."push_id" = U1."id")
    WHERE (
      U1."repository_id" = 1 AND
      to_tsvector('english'::regconfig, COALESCE(U0."revision", '') || ' ' || COALESCE(U0."author", '') || ' ' || COALESCE(U0."comments", '')) @@ (plainto_tsquery('english'::regconfig, '1916016'))
    )
    ORDER BY U1."time" DESC LIMIT 200
  ) subquery)
)
ORDER BY "push"."time" DESC
LIMIT 10

which doesn't look too bad. I only wonder about the 3rd nested subquery but I guess that's unavoidable with this strategy.

Still I asked chatgpt about simplifying it, here is what I got:

SELECT DISTINCT ON (p.id) 
    p.id, 
    p.repository_id, 
    p.revision, 
    p.author, 
    p.time, 
    r.repository_group_id, 
    r.name, 
    r.dvcs_type, 
    r.url, 
    r.branch, 
    r.codebase, 
    r.description, 
    r.active_status, 
    r.life_cycle_order, 
    r.performance_alerts_enabled, 
    r.expire_performance_data, 
    r.is_try_repo, 
    r.tc_root_url
FROM push p
INNER JOIN repository r ON p.repository_id = r.id
INNER JOIN commit c ON c.push_id = p.id
WHERE 
    p.repository_id = 1 
    AND to_tsvector('english'::regconfig, COALESCE(c.revision, '') || ' ' || COALESCE(c.author, '') || ' ' || COALESCE(c.comments, '')) 
        @@ plainto_tsquery('english'::regconfig, '1916016')
ORDER BY p.time DESC
LIMIT 10;

Interstingly it still uses the Gin index according to explain (we couldn't make it work with just one query before)! And is much faster (about 30%).
Now it's not clear to me how to achieve it with the Django API, @Archaeopteryx would you be able to help?

Also I'd be totally fine landing the patch as it is now (especially with the new tests, the index, etc) and tuning the query afterwards.

(Note: It could be interesting to see where the 2 last requests come from but that's out of scope for this patch.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants