Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for rdflib 5.0.0 #161

Merged
merged 5 commits into from
Dec 12, 2020
Merged

Add support for rdflib 5.0.0 #161

merged 5 commits into from
Dec 12, 2020

Conversation

jayaddison
Copy link
Contributor

This changeset adds support for rdflib 5.0.0 to the extruct library.

Users of extruct currently often encounter issue #131 (module not found) when they upgrade to rdflib 5.0.0; this is due to the removal of the pyRdfa parser plugin as noted here.

Fortunately, the standalone pyRdfa3 library added support to register as an rdflib plugin in RDFLib/pyrdfa3#26.

What this means is that when pyRdfa3 is installed in the Python environment, it should restore RDFA parsing functionality for users of rdflib 5.0.0 or greater, and resolve the ModuleNotFoundError.

The currently-published PyPi package pyRdfa3, version 3.5.3, includes this functionality via rdf.plugins.parser in entry_points.txt

NB: This release isn't currently tagged in the source repo, but I think pyRdfa3 v3.5.3 is from around commit RDFLib/pyrdfa3@1562b3f or RDFLib/pyrdfa3@b623cdd based on looking at some diffs.

NB: A note of caution for review/usage is that full Python3 support has been added to pyRdfa3 from version 4.0.0 (in RDFLib/pyrdfa3#34) -- but from inspection of the Python-code-related changes, extruct does not appear to depend at all on the affected Python methods (notably copyErrors and rdf_from_sources which are both provided for CGI-related context in the top-level processURI method).

May fix #131 (alternative approach).

@jayaddison
Copy link
Contributor Author

⏸️ Sigh, after all that, this upgrade might not be so simple after all.

When using tox to run tests, module resolution/plugin registration for pyRdfa3 seemed to work fine. But when running py.test tests the ModuleNotFoundError re-occurs.

This corresponds to an upstream pyRdfa change in commit 9c893ecd92c5dfcf7ab8f441eecda787ba13bef6
@jayaddison
Copy link
Contributor Author

⏯️ Ok, this is in progress again; the module resolution error was due to the fact that I'd failed to update some module imports within the library, and there was also an upstream change regarding the lang XML attribute that had to be reflected in the test expectations.

Tests pass under Python2.7 but I'm still seeing a test failure under Python3:

FAILED tests/test_rdfa.py::TestRDFa::test_wikipedia_xhtml_rdfa_no_prefix - AssertionError: Lists differ: ...

@jayaddison
Copy link
Contributor Author

jayaddison commented Dec 8, 2020

The remaining issue with the test_wikipedia_xhtml_rdfa_no_prefix test is that the default xmlns attribute is serialized in a different order between Python 2.7 and Python 3.x environments.

Python 3.x's minidom implementation retains the user-supplied ordering for element attributes (python/cpython#10219).

Python 2.7 sorts the attributes and as a result they meet the current extruct library test expectations.

XMLNS attributes are added by pyRdfa in this block of code which assigns prefixed namespaces first, and then adds the default (unprefixed) namespace last.

When Python 2.7 sorts these attribute keys it brings xmlns (the default namespace) entry ahead of any prefixed (xmlns:, which sorts lexically after xmlns) namespace entries.

Python 3.x retains the attribute ordering and leaves the default xmlns namespace attribute at the end of the list.

Possible paths forward
To resolve the behaviour so that extruct tests pass, various options are available:

  • Update the test expectations so that they do not care about attribute order
  • Update the library so that the Python 3.x implementation sorts the attributes, maintaining consistency with 2.7 and the existing tests
  • Update pyRdfa to modify the order in which namespace attributes are assigned, and then update this library once that change is available

…ribute ordering differences are not considered to be test failures
@jayaddison
Copy link
Contributor Author

Ok, the latest commit pushes a change for which I'm able to get tests to pass for both Python 2.7 and Python 3.9. Unfortunately since Travis builds may not be running at the moment, this may not be apparent from the pull request as-is.

The approach taken was to add a workaround in the test suite, to ignore differences in attribute ordering. This was achieved by using lxml's canonicalize method.

The rationale behind this is:

  • It is not easily possible to make Python 2.7 retain attribute ordering without messy monkey-patching or other questionable approaches
  • It does not seem clear that adding re-ordering of attributes in pyRdfa -- even if user-configurable -- would be a 'correct' fix
  • Retaining attribute ordering in Python 3.7 seems desirable: having scrapers that return values in the format they were published can be considered a feature in many situations
  • Adding a workaround in a test suite that can be removed later seems relatively innocuous

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @jayaddison , I agree with your call re python 2.7 tests, and the changes look great! Let me have another look at it, but so far I don't see any issues.

@@ -27,15 +27,15 @@
],
"http://purl.org/rss/1.0/modules/content/encoded": [
{
"@value": "<p xml:lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:dc=\"http://purl.org/dc/terms/\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\" xmlns:og=\"http://ogp.me/ns#\" xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\" xmlns:sioc=\"http://rdfs.org/sioc/ns#\" xmlns:sioct=\"http://rdfs.org/sioc/types#\" xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\">Op deze vernieuwde website kunt u enkele van mijn projecten vinden, tevens kunt u lessen downloaden die ik heb gemaakt.</p>\n\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the commit RDFLib/pyrdfa3@9c893ec referenced in 98f2205

@jayaddison
Copy link
Contributor Author

Thanks @lopuhin!

@lopuhin
Copy link
Member

lopuhin commented Dec 12, 2020

thanks for the PR @jayaddison 👍

The changes look good to me.

There are two issues, both unrelated to this PR:

  • There are some package version combinations allowed by our setup.py which lead to test failures (e.g. mf2py==1.0.5, and also old rdflib-jsonld with new rdflib), both in master and in this PR. It's probably not practical to enumerate all working combinations, but maybe we could at least have one combination which is known to work (in requirements.txt) and also check that latest versions allowed by setup.py work as well. This is obviously a topic for another PR, and not a blocker here.
  • We don't have working builds at the moment, we'll probably need to migrate off Travis to fix this - again this is unrelated to this PR but I'd like to have working builds before we merge anything. Sorry for this issue.

So let us try to fix the builds first.

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests passed in #165 , merging, thanks again @jayaddison

@lopuhin lopuhin merged commit bf103f5 into scrapinghub:master Dec 12, 2020
@lopuhin
Copy link
Member

lopuhin commented Dec 12, 2020

Btw this PR also fixes tests for python 3.8 and 3.9 (which were broken in master).

@jayaddison jayaddison deleted the dependencies/rdflib-5.0.0 branch December 12, 2020 11:43
@jayaddison
Copy link
Contributor Author

Hi @lopuhin - not urgent, but I was wondering whether you know when there might be an updated release of extruct? (I'd quite like to upgrade some libraries that have versioning pins in place as a result of the rdflib 5.0.0 compatibility issue)

@lopuhin
Copy link
Member

lopuhin commented Dec 22, 2020

hi @jayaddison good point, let me work on it this week - the only blocker is moving the release to github actions from travis, should not take long.

@jayaddison
Copy link
Contributor Author

@lopuhin That'd be great, but take your time; I didn't realize that'd require setup in CI. Thanks for the update.

@lopuhin
Copy link
Member

lopuhin commented Dec 28, 2020

@jayaddison it's released now, thanks again for your contributions 👍

@jayaddison
Copy link
Contributor Author

Brilliant, thanks again @lopuhin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ModuleNotFoundError: No module named 'rdflib.plugins.parsers.pyRdfa' for RDFLib v5.0.0
2 participants