Implement PEP 503 for the simple repository. #506
Conversation
Thanks for the major updates to the README. I'm going to add some notes to specific changes here, but the overarching theme of my notes will be "small steps please". Given the nature of PyPI and the simple index, I'm hesitant to have too much change in any given update to core components. I'm positive that adding
```diff
@@ -1036,22 +1036,21 @@ def simple_body(self, path):
             raise NotFound, path + " does not have any releases"

         html = []
-        html.append("""<html><head><title>Links for %s</title><meta name="api-version" value="2" /></head>"""
+        html.append("""<!DOCTYPE html><html><head><title>Links for %s</title></head>"""
```
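For context, here is a hypothetical sketch of a PEP 503 simple-index page builder including the `data-requires-python` attribute this PR adds. This is not the actual `simple_body()` implementation; the function and parameter names are mine, and `html.escape` stands in for Python 2's `cgi.escape(quote=True)`.

```python
# Hypothetical sketch, NOT the pypi-legacy simple_body() code.
from html import escape  # stand-in for Python 2's cgi.escape

def simple_page(project, files):
    # `files`: list of (filename, url, requires_python) tuples; the
    # requires_python entry may be None when a release does not declare it.
    parts = ['<html><head><title>Links for %s</title>'
             '<meta name="api-version" value="2" /></head>' % project,
             '<body><h1>Links for %s</h1>' % project]
    for filename, url, requires_python in files:
        attr = ''
        if requires_python:
            # Escape with quote=True so an embedded double quote cannot
            # terminate the attribute value early.
            attr = ' data-requires-python="%s"' % escape(requires_python,
                                                         quote=True)
        parts.append('<a href="%s"%s>%s</a><br/>' % (url, attr, filename))
    parts.append('</body></html>')
    return '\n'.join(parts)
```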
I'm not sure if the `meta` tag here is used; can we leave it until its value is understood?
The `!DOCTYPE` addition is sound; however, I'd like to stage it in a separate PR if possible.
@dstufft the `api-version` meta node was added in be17fea#diff-19fadc30e1b17100568adbd8c6c3cc13; do you have any information on its use?
Removed DOCTYPE and readded meta attributes.
Just saw #505 and it helped color some of the decisions made here. Regarding select performance, I did want to point out that things get significantly slower when working with packages that have a longer release history, though I have no suggestions on improvement. Running on a local copy of the production release_files/releases tables:

```
pypi=# explain analyze select filename, releases.requires_python, md5_digest
       from release_files inner join releases
         on release_files.name=releases.name
        and release_files.version=releases.version
       where release_files.name='setuptools';
                                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.85..178.97 rows=50 width=61) (actual time=0.068..4.953 rows=632 loops=1)
   ->  Index Scan using releases_pkey on releases  (cost=0.42..60.60 rows=14 width=19) (actual time=0.034..0.219 rows=202 loops=1)
         Index Cond: (name = 'setuptools'::text)
   ->  Index Scan using release_files_name_version_idx on release_files  (cost=0.42..8.45 rows=1 width=78) (actual time=0.019..0.022 rows=3 loops=202)
         Index Cond: ((name = 'setuptools'::text) AND (version = releases.version))
 Planning time: 5.652 ms
 Execution time: 5.038 ms
(7 rows)
```

```
pypi=# explain analyze select filename, python_version, md5_digest
       from release_files where name='setuptools';
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on release_files  (cost=17.02..2132.36 rows=593 width=66) (actual time=0.274..0.982 rows=632 loops=1)
   Recheck Cond: (name = 'setuptools'::text)
   Heap Blocks: exact=628
   ->  Bitmap Index Scan on release_files_name_version_idx  (cost=0.00..16.87 rows=593 width=0) (actual time=0.176..0.176 rows=632 loops=1)
         Index Cond: (name = 'setuptools'::text)
 Planning time: 0.119 ms
 Execution time: 1.039 ms
(7 rows)
```
I'd really like to have @dstufft take a pass here, as this is the most significant change to
```python
if href.startswith('http://cheeseshop.python.org/pypi') or \
   href.startswith('http://pypi.python.org/pypi') or \
   href.startswith('http://www.python.org/pypi'):
    # Suppress URLs that point to us
    continue
if rel:
```
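As an aside, the three chained `startswith()` calls in that snippet could be expressed as a single call, since `str.startswith` accepts a tuple of prefixes. A sketch (the helper name is mine, not from the pypi-legacy codebase):

```python
# Illustration only: collapse the chained `or` conditions into one call.
SELF_PREFIXES = (
    'http://cheeseshop.python.org/pypi',
    'http://pypi.python.org/pypi',
    'http://www.python.org/pypi',
)

def points_to_pypi(href):
    # True for URLs the simple page should suppress because they
    # point back at PyPI itself.
    return href.startswith(SELF_PREFIXES)
```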
If I remember correctly, removing the `rel="internal"` tag will break older versions of pip. I'm not opposed, but it is asking for pitchforks.
Reintroduced.
I'd like to propose that we break out the following ancillary changes into a separate PR that can be deployed after a lengthy "warning" period. We can put it on the status page as a scheduled maintenance to alert people that it's changing ahead of time, and have somewhere to point when they get angry and flashmob us.
No problem, I completely understand. I'll take care of sending incremental updates.
One of the changes possible would be to actually update the

Anyway, now that I have that working, it should be easy to pull that out in smaller chunks. I have to work a bit on other stuff this week, but I can try to spend some cycles.
This would not be the end of the world. Our data model is a little weird anyway; technically all metadata is per file (right now) and it's entirely possible to have wildly different metadata for different files. We just sort of pretend this doesn't exist, and use the metadata for one of the files as all of the metadata for a particular release. This was done by basically splitting the metadata along a semi-arbitrary line and saying "this metadata is likely to be different per-release, and this metadata is likely to be different per-file".

So there are two sorts of ways to handle this; one is that it's just denormalizing the release-centric data onto

Changes to the DB have to go through Warehouse
806bfed to 09b4978
I will trust you on this one. I honestly doubt people should/would bother to have various files with different requires-python, and I think that per-release makes sense; or at least it is easier to migrate to per-file later if we decide, or at least simpler than deciding that per-file was a mistake and trying to gather it all back at the release level.
Sure, just tell me if you want to make the changes, or if the performance hit on Postgres because of the join is acceptable. Can you also benchmark the following?

```sql
explain analyze select filename, requires_python, md5_digest
from (select requires_python, version from releases where name='setuptools') as s1,
     (select filename, version, md5_digest from release_files where name='setuptools') as s2
where s2.version=s1.version
```

I don't get the ~200 nested loops I see on your benchmark, as I don't have the full DB, but this query might be quicker, as the cross product should be smaller if we filter by name first; though I don't know how much the query planner can optimize. Or maybe even the following, which doesn't make the join explicit:

```sql
explain analyse select filename, releases.requires_python, md5_digest
from release_files, releases
where release_files.version=releases.version
and release_files.name='setuptools'
and releases.name='setuptools'
```

Thanks!
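The Postgres plans can't be reproduced here, but the row-for-row equivalence of the explicit-join form and the subquery form can at least be sanity-checked on toy data. A sketch using an in-memory SQLite stand-in (the table shapes mirror the discussion; the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases (name TEXT, version TEXT, requires_python TEXT);
    CREATE TABLE release_files (name TEXT, version TEXT,
                                filename TEXT, md5_digest TEXT);
    INSERT INTO releases VALUES
        ('setuptools', '1.0', NULL),
        ('setuptools', '2.0', '>=2.6');
    INSERT INTO release_files VALUES
        ('setuptools', '1.0', 'setuptools-1.0.tar.gz', 'aaa'),
        ('setuptools', '2.0', 'setuptools-2.0.tar.gz', 'bbb'),
        ('setuptools', '2.0', 'setuptools-2.0-py2-none-any.whl', 'ccc');
""")

# Shape of the query used in the benchmark above.
explicit_join = conn.execute("""
    SELECT filename, releases.requires_python, md5_digest
    FROM release_files JOIN releases
      ON release_files.name = releases.name
     AND release_files.version = releases.version
    WHERE release_files.name = 'setuptools'
    ORDER BY filename""").fetchall()

# Shape of the proposed filter-by-name-first rewrite.
subquery_join = conn.execute("""
    SELECT filename, requires_python, md5_digest
    FROM (SELECT requires_python, version FROM releases
          WHERE name = 'setuptools') AS s1,
         (SELECT filename, version, md5_digest FROM release_files
          WHERE name = 'setuptools') AS s2
    WHERE s2.version = s1.version
    ORDER BY filename""").fetchall()
```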
How often are the releases and release_files production tables updated? It seems like I could make a materialized view that contains the above join and refresh it on inserts/updates of release / release_files?
```
Quick development setup
-----------------------

Make sure you read http://wiki.python.org/moin/CheeseShopDev#DevelopmentEnvironmentHints
```
Can you remove/update this line? It links to a page that states that it is immediately out of date and that people should come back here to learn what to do. Infinite loops are sad.
…files

Assuming there are far fewer updates of the `release` and `release_file` tables than reads, we take the overhead of building this materialized view only rarely. Building it concurrently is slow but does not block reads, so the building cost might be invisible. The more efficient alternatives would be:

- To make this an actual table and update only the affected row, but that seems much more complicated and prone to errors and confusion for maintenance, as this table should not be manipulated directly. It would not be that hard, though, since Postgres 9.5 supports UPSERTs, and it would be easy to convert back and forth between a virtual table and a real table without actually changing Python code.
- To duplicate the `requires_python` column on both `releases` and `release_files`, which seems redundant and might lead to inconsistencies.
- To transfer `requires_python` to `release_files`, though it is also queried from `releases`, so we might still have the overhead of a join, just in the other direction, and it will require careful checking of the legacy-pypi codebase.

Related: pypi/legacy#506
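SQLite has no materialized views, so the real Postgres mechanism (`REFRESH MATERIALIZED VIEW CONCURRENTLY`) can't be demonstrated directly, but the "actual table kept in sync" alternative can be sketched with a trigger. Table and trigger names here are simplified stand-ins, not the production schema; note the sketch also shows the inconsistency hazard, since the trigger only fires on file inserts, not on later edits to `releases.requires_python`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases (name TEXT, version TEXT, requires_python TEXT);
    CREATE TABLE release_files (name TEXT, version TEXT, filename TEXT);

    -- Stand-in for the proposed materialized view: an ordinary table
    -- refreshed by a trigger on release_files inserts.  Updates to
    -- releases.requires_python after the fact would NOT be reflected,
    -- which is exactly the maintenance risk discussed above.
    CREATE TABLE release_files_mv (name TEXT, version TEXT,
                                   filename TEXT, requires_python TEXT);
    CREATE TRIGGER refresh_mv AFTER INSERT ON release_files
    BEGIN
        INSERT INTO release_files_mv
        SELECT new.name, new.version, new.filename,
               (SELECT requires_python FROM releases
                WHERE name = new.name AND version = new.version);
    END;
""")

conn.execute("INSERT INTO releases VALUES ('demo', '1.0', '>=3.3')")
conn.execute("INSERT INTO release_files VALUES "
             "('demo', '1.0', 'demo-1.0.tar.gz')")
row = conn.execute(
    "SELECT filename, requires_python FROM release_files_mv").fetchone()
```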
Rebased and updated to work with pypi/warehouse#1448 (once accepted).
Just checked "Allow edits from maintainers" and gave push rights to my branch to @michaelpacer.
This exposes the data-requires-python field when the package registers it.
cgi.escape() does not escape single quotes, so we need to make sure any arguments passed through cgi.escape() are wrapped in double quotes. Using cgi.escape(quote=True) escapes double quotes. Though this is unlikely to occur in a python_requires value, we want to protect against double-quote-based injections. The HTML formatting was separated into its own private function so that it could be unit tested.
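For concreteness: Python 2's `cgi.escape` is gone from modern Python (`html.escape` replaced it, and unlike `cgi.escape` it also escapes single quotes). The sketch below re-creates `cgi.escape`'s semantics purely for illustration; the helper is mine, not the PR's code.

```python
from html import escape  # modern replacement for Python 2's cgi.escape

def legacy_cgi_escape(s, quote=False):
    # Minimal re-creation of Python 2's cgi.escape() semantics:
    # escapes &, < and >; escapes double quotes only when quote=True;
    # never escapes single quotes.
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

value = '>=2.6,!="3.0"'
# Without quote=True the embedded double quotes would survive and could
# close a double-quoted HTML attribute early.
escaped = legacy_cgi_escape(value, quote=True)
```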
I took out the HTML formatting parts of the

Would you prefer me to include a test to go with these updates, or to leave this as is?
I opened #535, which resurrects some of the testing and passes on Travis.
If the time-to-merge of this solution would be extended because of needing to wait for tests, I would support not including tests and simply merging as is. At the least, it supports escaping all of the information that is passed now. Previously the

I especially would prefer not to write tests if there is no long-term intention of maintaining a testing suite for this repo. I understand not wanting to start supporting a testing suite, given that the intention is to deprecate this when it becomes feasible. But until Warehouse lives, I also understand wanting to keep the cheese shop resilient. So I support whatever the maintainers decide, in either direction.
Hello there! I hope that @dstufft has found a new position that gives him some time to still work on open source, and that he can still work remotely from home! I'm pretty sure he can't discuss it while contracts are being signed. Just giving my encouragement.

I made a couple of PRs to PyPI legacy; this one adds limited functionality for Python 3 only packages, so I was wondering if there is any hope to get that soonish, and if there is anything I can do to help. I've tried as well to simplify and document how to run PyPI legacy locally, and can do more cleanup work if this is what is needed for things to move forward.

There are a couple of libraries out there that wish to release Python 3 only versions and hope not to use gross hacks or break Python 2 users' workflows. Of course we need pip 8.2 to be out, but it would be neat to be able to at least get that into the wild and test it a bit. Thanks!
Thanks @dstufft for releasing pip 9! 🍰
Hello there! Anything we can do here?
I think that will automatically deploy in ~15 minutes or so. If not, then I'll go looking to see why not.
Thanks a lot, really appreciate it. I just tested on TestPyPI with py3onlytest; it does not seem to show the data attributes. I've tried with the most recent stable setuptools (and twine).
Hmm, none of those has a requires_python in the DB:
So either we're not properly capturing that value when it is sent, or twine/setuptools is not actually sending it, or both.
Is that on regular PyPI or TestPyPI?
Sorry, this one is regular PyPI. It's also on TestPyPI, but with
@Carreau It's not deployed to regular PyPI yet. The flit page was cached, though; I purged it and it shows up now. This suggests the problem is in twine or setuptools (or both), and not PyPI.
Good, thanks a lot. I'll investigate both and report back. Thanks!
So
Though the

Thanks.
Hum, so the only thing I can see is that the DB triggers are not triggered correctly. As the main page pulls the

I'm trying to re-run Warehouse/PyPI-legacy locally but seem to have issues with
@Carreau That seems to be the issue, yes:
Ok, thanks for confirming. I'm unsure where this could be coming from. I'm trying to replicate locally; fighting with the legacy codebase to have it running without S3 and Redis.
Sorry, the previous comment was in a wrong window. But I think I know what is happening, at least for me locally, when uploading through the legacy PyPI interface. It seems when calling

I'm still unsure why on TestPyPI it does seem to work, at least in the web UI. I'm giving up for today.
Thanks both for getting this in place; I can see the
Implement PEP 503 data-requires-python for the simple repository.

This exposes the data-requires-python field when the package registers it.
This also expands the README with some basic instructions on how to connect
pypi-legacy to a running Warehouse setup.

The SQL seems to be ~2x to 5x slower, but likely still a reasonable time.

SQL before:

SQL after:

Likely longer for projects with more releases.