
REST API Performance on package requests #167

Closed
costrouc opened this issue Sep 23, 2021 · 2 comments
Labels
needs: discussion 💬 This item needs team-level discussion before scoping type: enhancement 💅🏼
Comments

@costrouc (Member) commented Sep 23, 2021

When requesting /api/v1/package/?distinct_on=namespace&distinct_on=name&distinct_on=version, the query takes around 1-2 seconds to complete. As this project scales to more users, this is untenable, and we need to investigate how to speed up the query. Two immediate solutions are apparent to me. The two are also orthogonal, so it could be worthwhile to implement both.

  • The CondaPackage table is not well normalized. Many packages share the same name and description. Realistically, the table should be split into CondaChannel, CondaPackage(channel_id, name), and CondaPackageBuild(package_id, version, build, ...).

  • A caching layer for database queries. The /api/v1/package/ route is extremely cacheable. The cache could easily be evicted for a given channel based on that channel's last-updated datetime in the database.
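To make the first bullet concrete, here is a minimal sketch of what the normalized schema could look like. It uses sqlite3 from the Python standard library purely for illustration (the real database is PostgreSQL), and the UNIQUE constraints and exact column split are assumptions, not a finalized design:

```python
import sqlite3

# Sketch of the proposed normalization: shared metadata (name, description)
# lives once per package in conda_package, while per-build fields move into
# conda_package_build. Constraint choices here are assumptions.
schema = """
CREATE TABLE conda_channel (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE conda_package (
    id          INTEGER PRIMARY KEY,
    channel_id  INTEGER NOT NULL REFERENCES conda_channel(id),
    name        TEXT NOT NULL,
    summary     TEXT,
    description TEXT,
    UNIQUE (channel_id, name)
);
CREATE TABLE conda_package_build (
    id           INTEGER PRIMARY KEY,
    package_id   INTEGER NOT NULL REFERENCES conda_package(id),
    version      TEXT NOT NULL,
    build        TEXT NOT NULL,
    build_number INTEGER,
    size         INTEGER,
    md5          TEXT,
    sha256       TEXT,
    UNIQUE (package_id, version, build)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# The duplicated name/description is now stored once, with one row per build.
conn.execute("INSERT INTO conda_channel (id, name) VALUES (1, 'main')")
conn.execute(
    "INSERT INTO conda_package (id, channel_id, name, description) "
    "VALUES (1, 1, 'numpy', 'Array computing')"
)
conn.executemany(
    "INSERT INTO conda_package_build (package_id, version, build) VALUES (?, ?, ?)",
    [(1, "1.21.0", "py39_0"), (1, "1.21.0", "py38_0"), (1, "1.21.1", "py39_0")],
)
```

With this split, three builds of numpy cost one conda_package row instead of three copies of the name and description.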

@costrouc costrouc added type: enhancement 💅🏼 needs: discussion 💬 This item needs team-level discussion before scoping labels Sep 23, 2021
@costrouc costrouc added this to the Release 0.4.0 milestone Sep 23, 2021
@pierrotsmnrd (Contributor) commented

Hi @costrouc

I've dug into this issue and have some interesting results to share.

First, I retrieved the SQL query that SQLAlchemy generates for the API call /api/v1/package/?distinct_on=namespace&distinct_on=name&distinct_on=version:

SELECT DISTINCT ON (conda_package.name, conda_package.version) conda_package.id AS conda_package_id, 
                    conda_package.channel_id AS conda_package_channel_id, 
                    conda_package.build AS conda_package_build, 
                    conda_package.build_number AS conda_package_build_number, 
                    conda_package.constrains AS conda_package_constrains, 
                    conda_package.depends AS conda_package_depends, 
                    conda_package.license AS conda_package_license,
                    conda_package.license_family AS conda_package_license_family,
                    conda_package.md5 AS conda_package_md5, 
                    conda_package.name AS conda_package_name, 
                    conda_package.sha256 AS conda_package_sha256, 
                    conda_package.size AS conda_package_size, 
                    conda_package.subdir AS conda_package_subdir, 
                    conda_package.timestamp AS conda_package_timestamp,
                    conda_package.version AS conda_package_version, 
                    conda_package.summary AS conda_package_summary, 
                    conda_package.description AS conda_package_description 

FROM conda_package 
JOIN conda_channel ON conda_channel.id = conda_package.channel_id 
ORDER BY conda_package.name ASC, 
          conda_package.version ASC, 
          conda_channel.name ASC, 
          conda_package.name ASC, 
          conda_package.version ASC, 
          conda_package.build ASC 

LIMIT 30 OFFSET 0

I then connected to the PostgreSQL server, ran the query to check the result, and asked for the query plan with EXPLAIN ANALYZE.

Here's the result:

                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=116931.38..116967.82 rows=30 width=730) (actual time=479.430..498.843 rows=30 loops=1)
   ->  Unique  (cost=116931.38..153809.94 rows=30361 width=730) (actual time=459.574..478.983 rows=30 loops=1)
         ->  Gather Merge  (cost=116931.38..152291.88 rows=303611 width=730) (actual time=459.572..478.956 rows=67 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Sort  (cost=115931.35..116247.62 rows=126505 width=730) (actual time=422.409..422.448 rows=84 loops=3)
                     Sort Key: conda_package.name, conda_package.version, conda_channel.name, conda_package.build
                     Sort Method: external merge  Disk: 63928kB
                     Worker 0:  Sort Method: external merge  Disk: 54776kB
                     Worker 1:  Sort Method: external merge  Disk: 55400kB
                     ->  Hash Join  (cost=35.42..23051.79 rows=126505 width=730) (actual time=6.657..54.081 rows=101204 loops=3)
                           Hash Cond: (conda_package.channel_id = conda_channel.id)
                           ->  Parallel Seq Scan on conda_package  (cost=0.00..22683.05 rows=126505 width=698) (actual time=0.021..14.696 rows=101204 loops=3)
                           ->  Hash  (cost=21.30..21.30 rows=1130 width=36) (actual time=6.510..6.511 rows=2 loops=3)
                                 Buckets: 2048  Batches: 1  Memory Usage: 17kB
                                 ->  Seq Scan on conda_channel  (cost=0.00..21.30 rows=1130 width=36) (actual time=6.504..6.505 rows=2 loops=3)
 Planning Time: 0.185 ms
 JIT:
   Functions: 36
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 7.603 ms, Inlining 0.000 ms, Optimization 2.458 ms, Emission 33.857 ms, Total 43.917 ms
 Execution Time: 517.956 ms
(22 rows)

We can see that the execution time is 517 ms. My reading of the query plan is that DISTINCT ON is what makes the query and its computations so slow, and my intuition was that there must be a better way: what we're doing with DISTINCT ON is really a GROUP BY on package name and package version, keeping the row with the minimum id in each group (the first one, ordered by id ascending).

So I reworked the raw SQL query to produce the same result in a more optimized way. Here it is:

SELECT  packages_id.conda_package_id,
        conda_package.channel_id AS conda_package_channel_id, 
        conda_package.build AS conda_package_build, 
        conda_package.build_number AS conda_package_build_number, 
        conda_package.constrains AS conda_package_constrains, 
        conda_package.name AS conda_package_name, 
        conda_package.size AS conda_package_size, 
        conda_package.subdir AS conda_package_subdir, 
        conda_package.timestamp AS conda_package_timestamp,
        conda_package.version AS conda_package_version
FROM (  SELECT MIN(conda_package.id) as conda_package_id
        FROM conda_package 
        GROUP BY conda_package.name, conda_package.version
        ORDER BY  conda_package.name ASC, 
                conda_package.version ASC 
        LIMIT 30 OFFSET 0
 ) as packages_id
JOIN conda_package ON conda_package.id = packages_id.conda_package_id
JOIN conda_channel ON conda_channel.id = conda_package.channel_id 
ORDER BY conda_channel.name ASC, 
          conda_package.name ASC, 
          conda_package.version ASC, 
          conda_package.build ASC

The subquery does all the work: it groups on the right columns, keeps the minimum package id, and applies the LIMIT. Applying the LIMIT inside the subquery is crucial. The outer query then just uses those ids to 'decorate' the rows with the other fields we're interested in.
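The behavior of the subquery can be demonstrated on a toy dataset. This is an illustrative sketch with made-up rows, using SQLite instead of PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conda_package "
    "(id INTEGER PRIMARY KEY, name TEXT, version TEXT, build TEXT)"
)
conn.executemany(
    "INSERT INTO conda_package (id, name, version, build) VALUES (?, ?, ?, ?)",
    [
        (1, "numpy", "1.21.0", "py39_0"),
        (2, "numpy", "1.21.0", "py38_0"),   # same (name, version), higher id
        (3, "numpy", "1.21.1", "py39_0"),
        (4, "pandas", "1.3.0", "py39_0"),
    ],
)

# The subquery keeps one id per (name, version) group -- the minimum --
# and applies LIMIT/OFFSET there, so the outer join only ever touches
# one page of rows.
rows = conn.execute("""
    SELECT p.id, p.name, p.version, p.build
    FROM (
        SELECT MIN(id) AS conda_package_id
        FROM conda_package
        GROUP BY name, version
        ORDER BY name ASC, version ASC
        LIMIT 30 OFFSET 0
    ) AS packages_id
    JOIN conda_package p ON p.id = packages_id.conda_package_id
    ORDER BY p.name, p.version, p.build
""").fetchall()
print(rows)  # one row per distinct (name, version): ids 1, 3, 4
```

Row id 2 is dropped because id 1 shares its (name, version) and has the smaller id, matching what DISTINCT ON returned.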

The EXPLAIN ANALYZE shows a much better execution time (about a third of the original):

                                                                                           QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=27462.92..27463.00 rows=30 width=107) (actual time=176.562..182.518 rows=30 loops=1)
   Sort Key: conda_channel.name, conda_package.name, conda_package.version, conda_package.build
   Sort Method: quicksort  Memory: 31kB
   ->  Nested Loop  (cost=27196.41..27462.19 rows=30 width=107) (actual time=176.449..182.499 rows=30 loops=1)
         ->  Nested Loop  (cost=27196.26..27457.08 rows=30 width=75) (actual time=176.443..182.472 rows=30 loops=1)
               ->  Limit  (cost=27195.83..27203.58 rows=30 width=20) (actual time=176.424..182.404 rows=30 loops=1)
                     ->  Finalize GroupAggregate  (cost=27195.83..35039.59 rows=30361 width=20) (actual time=176.423..182.401 rows=30 loops=1)
                           Group Key: conda_package_1.name, conda_package_1.version
                           ->  Gather Merge  (cost=27195.83..34280.56 rows=60722 width=20) (actual time=176.417..182.386 rows=31 loops=1)
                                 Workers Planned: 2
                                 Workers Launched: 2
                                 ->  Sort  (cost=26195.81..26271.71 rows=30361 width=20) (actual time=136.315..136.368 rows=751 loops=3)
                                       Sort Key: conda_package_1.name, conda_package_1.version
                                       Sort Method: external merge  Disk: 1600kB
                                       Worker 0:  Sort Method: quicksort  Memory: 3535kB
                                       Worker 1:  Sort Method: quicksort  Memory: 3535kB
                                       ->  Partial HashAggregate  (cost=23631.83..23935.44 rows=30361 width=20) (actual time=58.225..69.216 rows=39023 loops=3)
                                             Group Key: conda_package_1.name, conda_package_1.version
                                             Batches: 5  Memory Usage: 4145kB  Disk Usage: 3392kB
                                             Worker 0:  Batches: 5  Memory Usage: 4145kB  Disk Usage: 720kB
                                             Worker 1:  Batches: 5  Memory Usage: 4145kB  Disk Usage: 744kB
                                             ->  Parallel Seq Scan on conda_package conda_package_1  (cost=0.00..22683.05 rows=126505 width=20) (actual time=0.214..15.183 rows=101204 loops=3)
               ->  Index Scan using conda_package_pkey on conda_package  (cost=0.42..8.44 rows=1 width=75) (actual time=0.002..0.002 rows=1 loops=30)
                     Index Cond: (id = (min(conda_package_1.id)))
         ->  Index Scan using conda_channel_pkey on conda_channel  (cost=0.15..0.17 rows=1 width=36) (actual time=0.001..0.001 rows=1 loops=30)
               Index Cond: (id = conda_package.channel_id)
 Planning Time: 0.435 ms
 Execution Time: 184.956 ms
(28 rows)

Now the question is: how do we turn this query back into SQLAlchemy statements?
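One possible translation, sketched with SQLAlchemy Core; the actual application uses ORM models, so this is an approximation, and the table definitions are trimmed to a few columns:

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, func, select,
)

metadata = MetaData()

# Trimmed Core table definitions -- the real models carry more columns.
conda_channel = Table(
    "conda_channel", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)
conda_package = Table(
    "conda_package", metadata,
    Column("id", Integer, primary_key=True),
    Column("channel_id", Integer),
    Column("name", String),
    Column("version", String),
    Column("build", String),
)

# Inner query: one MIN(id) per (name, version), with LIMIT applied here.
packages_id = (
    select(func.min(conda_package.c.id).label("conda_package_id"))
    .group_by(conda_package.c.name, conda_package.c.version)
    .order_by(conda_package.c.name.asc(), conda_package.c.version.asc())
    .limit(30)
    .offset(0)
    .subquery("packages_id")
)

# Outer query: join back on id to 'decorate' with the remaining columns.
query = (
    select(conda_package)
    .join(packages_id, conda_package.c.id == packages_id.c.conda_package_id)
    .join(conda_channel, conda_channel.c.id == conda_package.c.channel_id)
    .order_by(
        conda_channel.c.name.asc(),
        conda_package.c.name.asc(),
        conda_package.c.version.asc(),
        conda_package.c.build.asc(),
    )
)

print(query)  # shows the generated SQL, mirroring the hand-written query
```

This targets the SQLAlchemy 1.4+ select() style; wiring it into the existing query-builder code that handles distinct_on parameters is a separate step.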

This approach will remain useful once the tables' normalization has been improved, and a caching layer is indeed still needed as well.
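As a rough illustration of the eviction idea from the original issue, here is a sketch of a cache keyed on a channel's last-updated datetime; all names here are hypothetical, not an existing API:

```python
# Hypothetical sketch: cache the /api/v1/package/ response per
# (channel, query) pair and treat an entry as stale as soon as the
# channel's last-updated datetime in the database moves forward.
class ChannelQueryCache:
    def __init__(self):
        # (channel, query_key) -> (channel_updated_at, cached_result)
        self._cache = {}

    def get(self, channel, query_key, channel_updated_at):
        entry = self._cache.get((channel, query_key))
        if entry and entry[0] == channel_updated_at:
            return entry[1]  # still fresh: channel hasn't changed since caching
        return None          # miss, or channel was updated since -> recompute

    def put(self, channel, query_key, channel_updated_at, result):
        self._cache[(channel, query_key)] = (channel_updated_at, result)
```

The point of keying on the stored datetime is that no explicit invalidation hook is needed: a channel update naturally makes every cached query for that channel miss on the next request.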

@costrouc (Member, Author) commented

Thanks @pierrotsmnrd, this is very helpful. I'm not knowledgeable enough to say why this is faster, but it does seem to produce a roughly 2.8x improvement. I think this is yet another way we should speed up this particular query.

I see Parallel Seq Scan on conda_package conda_package_1 (cost=0.00..22683.05 rows=126505 width=20) (actual time=0.214..15.183 rows=101204 loops=3) in both query plans, and ideally we should design the tables so that a sequential scan is not necessary. This suggests that, as things stand, query time will keep growing with the number of rows.
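One direction worth exploring, sketched here with SQLite purely for illustration: a composite index covering (name, version, id) would let the GROUP BY / MIN be answered from an index scan instead of a full scan plus sort. On PostgreSQL the analogous plan would be an index-only scan, subject to visibility-map caveats, so the actual gain needs measuring there:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conda_package (id INTEGER PRIMARY KEY, name TEXT, version TEXT)"
)
# Hypothetical composite index: it contains every column the grouped
# MIN(id) query touches, so the query can be answered from the index alone.
conn.execute(
    "CREATE INDEX ix_conda_package_name_version "
    "ON conda_package (name, version, id)"
)
conn.executemany(
    "INSERT INTO conda_package (id, name, version) VALUES (?, ?, ?)",
    [(1, "numpy", "1.21.0"), (2, "numpy", "1.21.0"), (3, "pandas", "1.3.0")],
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT MIN(id) FROM conda_package GROUP BY name, version"
).fetchall()
print(plan)  # plan should report a scan using the covering index
```

Whether this removes the sequential scan on the real data set is an open question, which is partly why better normalization (smaller rows, fewer duplicates) still looks like the more robust fix.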
