Skip to content
This repository has been archived by the owner on May 30, 2020. It is now read-only.

Implement pep 503 for simple repository. #506

Merged
merged 4 commits into from
Nov 30, 2016
Merged

Conversation

Carreau
Copy link
Contributor

@Carreau Carreau commented Aug 14, 2016

Implement pep 503 data-requires for simple repository.

This exposes the data-requires-python field when the packages registers it.

This also expand the readme with some basic instruction on how to connect
pypi-legacy to a running warehouse setup.

SQL seem to be ~2x to 5x slower, but likely still reasonable time

SQL Before :

    explain analyse select filename, python_version, md5_digest from release_files where name='ipython_sql'

    Index Scan using release_files_name_idx on release_files  (cost=0.29..15.60 rows=3 width=66) (actual time=0.028..0.028 rows=0 loops=1)
      Index Cond: (name = 'ipython_sql'::text)
    Planning time: 0.224 ms
    Execution time: 0.058 ms

SQL After :

    explain analyse select filename, releases.requires_python, md5_digest
    from release_files
    inner join releases
        on release_files.version=releases.version
        and release_files.name=releases.name
    where release_files.name='ipython_sql'

    Nested Loop  (cost=4.60..31.74 rows=1 width=61) (actual time=0.013..0.013 rows=0 loops=1)
      Join Filter: (release_files.version = releases.version)
      ->  Index Scan using release_files_name_idx on release_files  (cost=0.29..15.60 rows=3 width=77) (actual time=0.012..0.012 rows=0 loops=1)
            Index Cond: (name = 'ipython_sql'::text)
      ->  Materialize  (cost=4.31..16.01 rows=3 width=18) (never executed)
            ->  Bitmap Heap Scan on releases  (cost=4.31..16.00 rows=3 width=18) (never executed)
                  Recheck Cond: (name = 'ipython_sql'::text)
                  ->  Bitmap Index Scan on release_name_idx  (cost=0.00..4.31 rows=3 width=0) (never executed)
                        Index Cond: (name = 'ipython_sql'::text)
    Planning time: 0.668 ms
    Execution time: 0.092 ms

Likely longer for project with more releases.

@ewdurbin
Copy link
Member

@Carreau

thanks for the major updates to the README.

I'm going to add some notes to specific changes here, but the overarching theme of my notes will be "small steps please". Given the nature of PyPI and the simple index, I'm hesitant to have too much change in any given update to core components. I'm positive that adding data-requires-python will
brick some people's parsers/scrapers and minimizing change will help us support them better, since we can point them to PEP 503 and this change to understand what's gone on. Multiple changes at once are likely confuse and bewilder people, instigating more pitchforks and burdening support.

@@ -1036,22 +1036,21 @@ def simple_body(self, path):
raise NotFound, path + " does not have any releases"

html = []
html.append("""<html><head><title>Links for %s</title><meta name="api-version" value="2" /></head>"""
html.append("""<!DOCTYPE html><html><head><title>Links for %s</title></head>"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the meta tag here is used, can we leave it until it's value is understood?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The !DOCTYPE addition is sound, however I'd like to stage it in a separate PR if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dstufft the api-version meta node was added in be17fea#diff-19fadc30e1b17100568adbd8c6c3cc13, do you have any information on it's use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed DOCTYPE and reeadded meta attributes.

@ewdurbin
Copy link
Member

ewdurbin commented Aug 15, 2016

just saw #505 and it helped color some of the decisions made here.

regarding select performance, I did want to point out that things get significantly slower when working with packages that have a longer release history. though i have no suggestions on improvement:

running on a local copy of the production release_files/releases tables:

pypi=# explain analyze select filename, releases.requires_python, md5_digest from release_files inner join releases on release_files.name=releases.name and release_files.version=releases.version where release_files.name='setuptools';
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.85..178.97 rows=50 width=61) (actual time=0.068..4.953 rows=632 loops=1)
   ->  Index Scan using releases_pkey on releases  (cost=0.42..60.60 rows=14 width=19) (actual time=0.034..0.219 rows=202 loops=1)
         Index Cond: (name = 'setuptools'::text)
   ->  Index Scan using release_files_name_version_idx on release_files  (cost=0.42..8.45 rows=1 width=78) (actual time=0.019..0.022 rows=3 loops=202)
         Index Cond: ((name = 'setuptools'::text) AND (version = releases.version))
 Planning time: 5.652 ms
 Execution time: 5.038 ms
(7 rows)

pypi=# explain analyze select filename, python_version, md5_digest from release_files where name='setuptools';
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on release_files  (cost=17.02..2132.36 rows=593 width=66) (actual time=0.274..0.982 rows=632 loops=1)
   Recheck Cond: (name = 'setuptools'::text)
   Heap Blocks: exact=628
   ->  Bitmap Index Scan on release_files_name_version_idx  (cost=0.00..16.87 rows=593 width=0) (actual time=0.176..0.176 rows=632 loops=1)
         Index Cond: (name = 'setuptools'::text)
 Planning time: 0.119 ms
 Execution time: 1.039 ms
(7 rows)

@ewdurbin
Copy link
Member

I'd really like to have @dstufft take a pass here, as this is the most significant change to /simple we've implemented in PyPI-Legacy in the past... well a long time.

if href.startswith('http://cheeseshop.python.org/pypi') or \
href.startswith('http://pypi.python.org/pypi') or \
href.startswith('http://www.python.org/pypi'):
# Suppress URLs that point to us
continue
if rel:
Copy link
Member

@ewdurbin ewdurbin Aug 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, removing rel="internal" tag will break older versions of pip. I'm not opposed, but it is asking for pitchforks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reintroduced.

@ewdurbin
Copy link
Member

ewdurbin commented Aug 15, 2016

I'd like to propose that we break out the following ancillary changes into a separate PR that can be deployed after a lengthy "warning" period. We can put it in the status page as a scheduled maintenance to alert people that it's changing ahead of time and have somewhere to point when they get angry and flashmob us.

  • Addition of the !DOCTYPE element
  • Removal of the api-version meta tag
  • Removal of the rel="internal" attribute on package links

@Carreau
Copy link
Contributor Author

Carreau commented Aug 15, 2016

Multiple changes at once are likely confuse and bewilder people, instigating more pitchforks and burdening support.

No problem, I completely understand. I'll take care of sending incremental updates.

regarding select performance, I did want to point out that things get significantly slower when working with packages that have a longer release history. though i have no suggestions on improvement

One of the changes possible would be to actually update the releases_files table and duplicate the requires_python information into another column.

Anyway, now that I have that working, it should be easy to pull that out in smaller chunks.

I have to work a bit on other stuff this week but I can try to spend some cycles.

@dstufft
Copy link
Member

dstufft commented Aug 15, 2016

One of the changes possible would be to actually update the releases_files table and duplicate the requires_python information into another column.

This would not be the end of the world. Our data model is a little weird anyways, techincally all metadata is per file (right now) and it's entirely possible to have wildly different metadata for different files. We just sort of pretend this doesn't exist, and use the metadata for one of the files as all of the metadata for a particular release. This was done by basically splitting the metadata with a semi arbitrary line and saying "this metadata is likely to be different per-release, and this metadata is likely to be different per-file".

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

Changes to the DB have to go through Wareh

@Carreau Carreau force-pushed the pep-503 branch 2 times, most recently from 806bfed to 09b4978 Compare August 19, 2016 03:05
@Carreau
Copy link
Contributor Author

Carreau commented Aug 19, 2016

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

I will trust you on this one. i honestly doubt people should/would bother to have various files with different requires-python; and I think that per release make sens or at least it is easier to migrate to per-file later if we decide; or at least simpler than deciding that per-file was a mistake and try to gather all at one back at a release level.

Changes to the DB have to go through Warehouse

Sure, just tell me if you want to make the changes or if the performance hit on Postgres because of the Join is acceptable.

Can you also benchmark the following ?

explain analyze select filename, requires_python, md5_digest
from (select requires_python, version from releases where name='setuptools') as s1,
     (select filename, version, md5_digest from release_files where name='setuptools') as s2
where s2.version=s1.version

I don't get the ~200 nested loops I see on your benchmark as I don't have the full db, but this query might be quicker as the cross product should be smaller if we filter by name first; though I don't know how much the query-planner can optimize...

Or maybe even the following that don't make the join explicit :

explain analyse select filename, releases.requires_python, md5_digest
from release_files,releases
where release_files.version=releases.version 
        and release_files.name='setuptools'
        and releases.name='setuptools'

Thanks !

@Carreau
Copy link
Contributor Author

Carreau commented Aug 23, 2016

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

How often are each releases and release_files productiondb updated ? It seem like I could make a materialized view that contain the above join and refresh it on insert/updates of each release / release_files ?

Quick development setup
-----------------------

Make sure you read http://wiki.python.org/moin/CheeseShopDev#DevelopmentEnvironmentHints
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove/update this line it links to a page that states that it is immediately out of date and that people should come back here to learn what to do. Infinite loops are sad.

Carreau added a commit to Carreau/warehouse that referenced this pull request Aug 31, 2016
…files

Assuming there are much less updates of the `release` and `release_file`
table than reads then we take the overhead of building this materialized
view only rarely.

Building it concurrently is slow but does not block reads so the
building cost might be invisible.

The more efficient alternative would be :

   - To make this an actual table and update only the affected row but
     seem much more complicated and prone to errors and confusions for
     maintenance as this table should not be manipulated directly. But
     would not be that hard since Posgres 9.5 supports UPSERTs.

     It would be easy to convert back-and from a virtual-table to a real
     table without actually changing Python code.

   - To duplicate the `requires_python` column on both `releases` and
     `release_files` which seem redundant and might lead to
     inconsistencies.

   - Transfer `requires_python` to `release_files`, though it is also
     queried from `releases` for we might still have the overhead of
     join but in the other direction and will require careful checking
     of the legacy-pypi codebase.

Related: pypi/legacy#506
@Carreau
Copy link
Contributor Author

Carreau commented Sep 16, 2016

Rebase and updated to work with pypi/warehouse#1448 (once accepted)

@Carreau
Copy link
Contributor Author

Carreau commented Oct 5, 2016

Just checked "Allow edits from maintainers" and gave pushed right to my branch to @michaelpacer .

Carreau and others added 2 commits October 5, 2016 16:03
end
This exposes the data-requires-python field when the packages registers it.
cgi.escape() does not escape single quotes, so we need to make sure any
arguments parsed with cgi.escape() are wrapped with double quotes.

Using cgi.escape(quote=True) escapes double quotes. Though this is
unlikely to occur in a python_requires value, we want to protect against
double quote based sql injections.

The html parsing was separated into its own private function so that it
could be unit tested.
@mpacer
Copy link
Contributor

mpacer commented Oct 6, 2016

I took out the html formatting parts of the simple_body() function and put them in a private function outisde of the main class so that they can be more easily tested without needing to instantiate the full object.

Would you prefer me to include a test to go with these updates or to leave this as is?

@Carreau Carreau mentioned this pull request Oct 6, 2016
@Carreau
Copy link
Contributor Author

Carreau commented Oct 6, 2016

I opened #535 which resurect some of the testing and pass on travis.

@mpacer
Copy link
Contributor

mpacer commented Oct 6, 2016

If the time-to-merge of this solution would be extended because of needing to wait for tests, I would support not including tests and simply merging as is.

At the least, it supports escaping all of the information that is passed now. Previously the rel argument was not escaped (but just passed in as a raw string), and so an injection could have occurred fairly easily if bad data had been included.

I especially would prefer not to write tests if there is no long term intention of maintaining a testing suite for this repo. I understand not wanting to start supporting a testing suite given that the intention is to deprecate this when it becomes feasible. But until the warehouse lives, I also understand wanting to keep the cheese shop resilient.

So I support whatever the maintainers decide in either direction.

@Carreau
Copy link
Contributor Author

Carreau commented Nov 1, 2016

Hello there !

I hope that @dstufft has found a new position that give him some time to still work on Open Source and that he can still work remotely from home ! I'm pretty sure he can't discuss it while contract are being signed. Just giving my encouragement.

I made a couple of PRs to PyPI legacy this one that adds limited functionality for Python 3 only packages, so I was wondering if there is any hope to get that soonish, and if there is anything I can do to help.

I've tried as well to simplify and document how to run PyPI legacy locally and can do more cleanup work if this is what is needed for things to move forward. There are a couple of library out there that wish to release Python 3 only version and hope not to use gross hack or break Python 2 user workflow.

Of course we need pip 8.2 to be out, but that would be neat to be able to at least get that into the wild on test it a bit.

Thanks !

@Carreau
Copy link
Contributor Author

Carreau commented Nov 2, 2016

Of course we need pip 8.2 to be out, but that would be neat to be able to at least get that into the wild on test it a bit.

Thanks @dstufft for releasing pip 9 ! 🍰

@Carreau
Copy link
Contributor Author

Carreau commented Nov 23, 2016

Hello there ! Anything we can do here ?

@dstufft dstufft merged commit 12b6e3a into pypi:master Nov 30, 2016
@dstufft
Copy link
Member

dstufft commented Nov 30, 2016

I think that will automatically deploy in ~15 minutes or so. If not then I'll go looking to see why not.

@Carreau Carreau deleted the pep-503 branch November 30, 2016 18:25
@Carreau
Copy link
Contributor Author

Carreau commented Nov 30, 2016

Thanks a lot, really appreciate it.

I just tested on testpypi with py3onlytest, it does not seem to show the data-attributes.

I've tried with the most recent stable setuptools (and twine).

@dstufft
Copy link
Member

dstufft commented Nov 30, 2016

Hmm, none of those has a requires_python in the DB:

warehouse-test::DATABASE=> SELECT name, version, filename, requires_python FROM release_files WHERE name = 'py3onlytest';
    name     | version |             filename             | requires_python
-------------+---------+----------------------------------+-----------------
 py3onlytest | 0.1     | py3onlytest-0.1.tar.gz           |
 py3onlytest | 0.1     | py3onlytest-0.1-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2.tar.gz           |
(4 rows)

warehouse-test::DATABASE=> SELECT name, requires_python FROM releases WHERE name = 'py3onlytest';
    name     | requires_python
-------------+-----------------
 py3onlytest |
 py3onlytest |
(2 rows)

So either we're not properly capturing that value when it is sent, or twine/setuptools is not actually sending it, or both.

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

I'm confused... can run the same for flit? as it seem to have that set :
screen shot 2016-11-30 at 19 17 59

But it does not have the data in the /simple/flit page.

I'll try to dig out where the upload path is and what could go wrong there.

@dstufft
Copy link
Member

dstufft commented Dec 1, 2016

Is that on regular PyPI or TestPyPI?

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

Is that on regular PyPI or TestPyPI?

Sorry this one is regular pypi. It's also on testpypi, but with Requires Python: 3 not >=3. Though the one on tespypi does not show data attributes either

@dstufft
Copy link
Member

dstufft commented Dec 1, 2016

@Carreau It's not deployed to regular PyPI yet. The flit page was cached though, I purged it and it shows up now. This suggests the problem is in twine or setuptools or both and not PyPI though.

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

@Carreau It's not deployed to regular PyPI yet. The flit page was cached though, I purged it and it shows up now. This suggests the problem is in twine or setuptools or both and not PyPI though.

Good, thanks a lot. I'll investigate both and report back. Thanks !

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

So

  1. in setuptools I was using requires_python instead of python_requires. I'm astonished nothing complained.

  2. The listing in testPyPI shows Python Requires : >=3

Though the /simple/ page still does not have the data attribute. I might be missing something. Will have a look tomorrow.

Thanks.

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

Hum, so the only thing I can see is that the db-trigger are not triggered correctly.

As the main page pull the Requires-Python from the releases table and is correct, and the listing page is pulling things from the releases_file the only thing I can see leading to this a DB inconsistency.

I'm trying to re-run the Warehouse/PyPI-legacy locally but seem to have issues with make initdb for now.

@dstufft
Copy link
Member

dstufft commented Dec 1, 2016

@Carreau That seems to be the issue yes:

warehouse-test::DATABASE=> SELECT name, version, filename, requires_python FROM release_files WHERE name = 'py3onlytest';
    name     | version |             filename             | requires_python
-------------+---------+----------------------------------+-----------------
 py3onlytest | 0.1     | py3onlytest-0.1.tar.gz           |
 py3onlytest | 0.1     | py3onlytest-0.1-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2.tar.gz           |
 py3onlytest | 0.3     | py3onlytest-0.3-py3-none-any.whl |
 py3onlytest | 0.3     | py3onlytest-0.3.tar.gz           |
 py3onlytest | 0.4     | py3onlytest-0.4-py3-none-any.whl |
 py3onlytest | 0.5     | py3onlytest-0.5-py3-none-any.whl |
 py3onlytest | 0.5     | py3onlytest-0.5.tar.gz           |
(9 rows)

warehouse-test::DATABASE=> SELECT name, version, requires_python FROM releases WHERE name = 'py3onlytest';
    name     | version | requires_python
-------------+---------+-----------------
 py3onlytest | 0.4     | >=3,<5
 py3onlytest | 0.5     | >=3
 py3onlytest | 0.3     | >=3,<5
 py3onlytest | 0.2     |
 py3onlytest | 0.1     |
(5 rows)

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

Ok, thanks for confirming. I'm unsure where this could be coming from. I'm trying to replicate locally. Fighting with legacy codebase to have it running without S3 and reddis.

@Carreau
Copy link
Contributor Author

Carreau commented Dec 1, 2016

Sorry previous comment was in a wrong window. But I think I know what is happening at least for me locally.

When uploading through the legacy pypi interface. It seem when calling python setup.py register is does create an entry in release but does not set requires_python. Thus when twine uploads files, these twine does send the python_requires that have no effect.

I'm still unsure of why on testpipy it does seem to work at least in the webUI.

I'm giving up for today

@takluyver
Copy link

Thanks both for getting this in place; I can see the data-requires-python on flit's simple index page now, and with support in pip, it should be working :-)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants