REST API Performance on package requests #167
Hi @costrouc, I have dug into this issue, and I have some interesting results to share. First, I retrieved the SQL query generated by SQLAlchemy for the API call:

SELECT DISTINCT ON (conda_package.name, conda_package.version) conda_package.id AS conda_package_id,
conda_package.channel_id AS conda_package_channel_id,
conda_package.build AS conda_package_build,
conda_package.build_number AS conda_package_build_number,
conda_package.constrains AS conda_package_constrains,
conda_package.depends AS conda_package_depends,
conda_package.license AS conda_package_license,
conda_package.license_family AS conda_package_license_family,
conda_package.md5 AS conda_package_md5,
conda_package.name AS conda_package_name,
conda_package.sha256 AS conda_package_sha256,
conda_package.size AS conda_package_size,
conda_package.subdir AS conda_package_subdir,
conda_package.timestamp AS conda_package_timestamp,
conda_package.version AS conda_package_version,
conda_package.summary AS conda_package_summary,
conda_package.description AS conda_package_description
FROM conda_package
JOIN conda_channel ON conda_channel.id = conda_package.channel_id
ORDER BY conda_package.name ASC,
conda_package.version ASC,
conda_channel.name ASC,
conda_package.name ASC,
conda_package.version ASC,
conda_package.build ASC
LIMIT 30 OFFSET 0

I then connected to the PostgreSQL server, ran this query to check the result, and then asked for the query plan with EXPLAIN ANALYZE. Here's the result:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=116931.38..116967.82 rows=30 width=730) (actual time=479.430..498.843 rows=30 loops=1)
-> Unique (cost=116931.38..153809.94 rows=30361 width=730) (actual time=459.574..478.983 rows=30 loops=1)
-> Gather Merge (cost=116931.38..152291.88 rows=303611 width=730) (actual time=459.572..478.956 rows=67 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=115931.35..116247.62 rows=126505 width=730) (actual time=422.409..422.448 rows=84 loops=3)
Sort Key: conda_package.name, conda_package.version, conda_channel.name, conda_package.build
Sort Method: external merge Disk: 63928kB
Worker 0: Sort Method: external merge Disk: 54776kB
Worker 1: Sort Method: external merge Disk: 55400kB
-> Hash Join (cost=35.42..23051.79 rows=126505 width=730) (actual time=6.657..54.081 rows=101204 loops=3)
Hash Cond: (conda_package.channel_id = conda_channel.id)
-> Parallel Seq Scan on conda_package (cost=0.00..22683.05 rows=126505 width=698) (actual time=0.021..14.696 rows=101204 loops=3)
-> Hash (cost=21.30..21.30 rows=1130 width=36) (actual time=6.510..6.511 rows=2 loops=3)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on conda_channel (cost=0.00..21.30 rows=1130 width=36) (actual time=6.504..6.505 rows=2 loops=3)
Planning Time: 0.185 ms
JIT:
Functions: 36
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 7.603 ms, Inlining 0.000 ms, Optimization 2.458 ms, Emission 33.857 ms, Total 43.917 ms
Execution Time: 517.956 ms
(22 rows)
We can see that the execution time is around 518 ms. So I started reworking the raw SQL query to obtain the same result in a more optimized way. Here it is:

SELECT packages_id.conda_package_id,
conda_package.channel_id AS conda_package_channel_id,
conda_package.build AS conda_package_build,
conda_package.build_number AS conda_package_build_number,
conda_package.constrains AS conda_package_constrains,
conda_package.name AS conda_package_name,
conda_package.size AS conda_package_size,
conda_package.subdir AS conda_package_subdir,
conda_package.timestamp AS conda_package_timestamp,
conda_package.version AS conda_package_version
FROM ( SELECT MIN(conda_package.id) as conda_package_id
FROM conda_package
GROUP BY conda_package.name, conda_package.version
ORDER BY conda_package.name ASC,
conda_package.version ASC
LIMIT 30 OFFSET 0
) as packages_id
JOIN conda_package ON conda_package.id = packages_id.conda_package_id
JOIN conda_channel ON conda_channel.id = conda_package.channel_id
ORDER BY conda_channel.name ASC,
conda_package.name ASC,
conda_package.version ASC,
conda_package.build ASC

The subquery does all the logic: it groups on the right columns, keeps the minimum package id, and applies the LIMIT. Applying the LIMIT there, inside the subquery, is what matters. The outer query then just uses the package ids to 'decorate' the rows with the other fields we're interested in. EXPLAIN ANALYZE shows a much better execution time (divided by roughly 3):

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=27462.92..27463.00 rows=30 width=107) (actual time=176.562..182.518 rows=30 loops=1)
Sort Key: conda_channel.name, conda_package.name, conda_package.version, conda_package.build
Sort Method: quicksort Memory: 31kB
-> Nested Loop (cost=27196.41..27462.19 rows=30 width=107) (actual time=176.449..182.499 rows=30 loops=1)
-> Nested Loop (cost=27196.26..27457.08 rows=30 width=75) (actual time=176.443..182.472 rows=30 loops=1)
-> Limit (cost=27195.83..27203.58 rows=30 width=20) (actual time=176.424..182.404 rows=30 loops=1)
-> Finalize GroupAggregate (cost=27195.83..35039.59 rows=30361 width=20) (actual time=176.423..182.401 rows=30 loops=1)
Group Key: conda_package_1.name, conda_package_1.version
-> Gather Merge (cost=27195.83..34280.56 rows=60722 width=20) (actual time=176.417..182.386 rows=31 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=26195.81..26271.71 rows=30361 width=20) (actual time=136.315..136.368 rows=751 loops=3)
Sort Key: conda_package_1.name, conda_package_1.version
Sort Method: external merge Disk: 1600kB
Worker 0: Sort Method: quicksort Memory: 3535kB
Worker 1: Sort Method: quicksort Memory: 3535kB
-> Partial HashAggregate (cost=23631.83..23935.44 rows=30361 width=20) (actual time=58.225..69.216 rows=39023 loops=3)
Group Key: conda_package_1.name, conda_package_1.version
Batches: 5 Memory Usage: 4145kB Disk Usage: 3392kB
Worker 0: Batches: 5 Memory Usage: 4145kB Disk Usage: 720kB
Worker 1: Batches: 5 Memory Usage: 4145kB Disk Usage: 744kB
-> Parallel Seq Scan on conda_package conda_package_1 (cost=0.00..22683.05 rows=126505 width=20) (actual time=0.214..15.183 rows=101204 loops=3)
-> Index Scan using conda_package_pkey on conda_package (cost=0.42..8.44 rows=1 width=75) (actual time=0.002..0.002 rows=1 loops=30)
Index Cond: (id = (min(conda_package_1.id)))
-> Index Scan using conda_channel_pkey on conda_channel (cost=0.15..0.17 rows=1 width=36) (actual time=0.001..0.001 rows=1 loops=30)
Index Cond: (id = conda_package.channel_id)
Planning Time: 0.435 ms
Execution Time: 184.956 ms
(28 rows)

Now the question is: how do we turn this query back into SQLAlchemy statements? This will also be useful once the tables' normalization has been improved, and having a cache remains necessary in any case. |
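The hand-optimized SQL above could be translated back into SQLAlchemy roughly as follows. This is a minimal sketch: the CondaChannel and CondaPackage models below are simplified stand-ins for the real conda-store ORM classes (their actual names and columns may differ), and only the columns needed for the query are declared.

```python
# Sketch: the MIN(id)-per-(name, version) subquery from the raw SQL,
# rebuilt with the SQLAlchemy Query API. Models are simplified
# stand-ins for the real conda-store ORM classes.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CondaChannel(Base):
    __tablename__ = "conda_channel"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class CondaPackage(Base):
    __tablename__ = "conda_package"
    id = Column(Integer, primary_key=True)
    channel_id = Column(Integer, ForeignKey("conda_channel.id"))
    name = Column(String)
    version = Column(String)
    build = Column(String)

engine = create_engine("sqlite://")  # in-memory DB just for the sketch
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Inner query: one MIN(id) per (name, version), with the LIMIT pushed
# down into the subquery -- the key point of the optimization.
min_ids = (
    session.query(func.min(CondaPackage.id).label("conda_package_id"))
    .group_by(CondaPackage.name, CondaPackage.version)
    .order_by(CondaPackage.name.asc(), CondaPackage.version.asc())
    .limit(30)
    .offset(0)
    .subquery()
)

# Outer query: join back on id to 'decorate' the 30 selected rows
# with the remaining columns, then apply the final ordering.
packages = (
    session.query(CondaPackage)
    .join(min_ids, CondaPackage.id == min_ids.c.conda_package_id)
    .join(CondaChannel, CondaChannel.id == CondaPackage.channel_id)
    .order_by(
        CondaChannel.name.asc(),
        CondaPackage.name.asc(),
        CondaPackage.version.asc(),
        CondaPackage.build.asc(),
    )
)
```

Because the query object is lazy, the generated SQL can be inspected with `print(packages)` to confirm it matches the hand-written version before wiring it into the API route.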
Thanks @pierrotsmnrd, this is very helpful. I'm not knowledgeable enough to say why this is faster, but it does seem to produce a 2.5x improvement. I think this is yet another way we should speed up this particular query. I see |
When requesting
/api/v1/package/?distinct_on=namespace&distinct_on=name&distinct_on=version
the query takes around 1-2 seconds to complete. As this project scales to more users this is untenable. Investigation is needed into how to speed up the query. To me, two immediate solutions are apparent. The solutions are orthogonal, so it could be good to have both.

1. The CondaPackage table is not normalized well. Many packages have the same name and description. Realistically the table should be split into CondaChannel, CondaPackage(channel_id, name), and CondaPackageBuild(package_id, version, build, ...).

2. A caching layer for database queries. The /api/v1/package/ route is extremely cacheable. The cache could easily be evicted for a given channel based on the last-updated datetime in the database.
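The eviction rule described above (cache per channel, invalidated whenever the channel's last-updated datetime changes) can be sketched as follows. This is a minimal illustration, not the project's implementation; cached_packages and fetch_packages are hypothetical names.

```python
# Sketch: a per-channel cache keyed on the channel's last-updated
# timestamp. A newer timestamp automatically bypasses (and replaces)
# the stale entry, so no explicit eviction pass is needed.
from datetime import datetime

_cache = {}  # channel name -> (last_updated, cached result)

def cached_packages(channel, last_updated, fetch_packages):
    """Return cached results unless the channel changed since caching.

    fetch_packages is a hypothetical callable standing in for the
    expensive /api/v1/package/ database query.
    """
    entry = _cache.get(channel)
    if entry is not None and entry[0] == last_updated:
        return entry[1]  # cache hit: channel unchanged since last fetch
    result = fetch_packages(channel)  # cache miss: run the expensive query
    _cache[channel] = (last_updated, result)
    return result
```

In practice the channel's last-updated datetime would itself come from a cheap database lookup, which is still far less work than re-running the full DISTINCT ON query on every request.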