feat: sort longturtle blank nodes #2997

Merged
merged 2 commits into from
Nov 29, 2024

@edmondchuc (Contributor) commented Nov 28, 2024

Summary of changes

Fixes #1890 - Sorting Turtle output?

This change makes Turtle data serialized with the longturtle serializer easier to diff with git. Previously, blank nodes were not sorted, so round-tripping a file through RDFLib's turtle parser and longturtle serializer could reorder blank node objects between runs, making it difficult to pick out real changes in a git diff.

This PR fixes the above by sorting object values that are blank nodes. Each such blank node is sorted by taking its concise bounded description (CBD) graph, serializing it as longturtle, and comparing the resulting strings.
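The idea can be illustrated with a small standalone sketch. It is not the code from this PR and does not use RDFLib: triples are modelled as plain string tuples, the helper names (`cbd`, `cbd_sort_key`, `sorted_bnode_objects`) are hypothetical, the CBD is simplified to "outgoing triples, followed through blank node objects", and masking blank node labels with `[]` is an assumption standing in for how inlined blank nodes appear in the serialized text.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def is_bnode(term: str) -> bool:
    """Blank nodes are written as '_:label' in this toy model."""
    return term.startswith("_:")


def cbd(triples: List[Triple], node: str) -> List[Triple]:
    """Collect the triples describing `node`, following blank node objects."""
    seen: Set[str] = set()
    stack = [node]
    described: List[Triple] = []
    while stack:
        current = stack.pop()
        if current in seen:  # guard against cycles
            continue
        seen.add(current)
        for s, p, o in triples:
            if s == current:
                described.append((s, p, o))
                if is_bnode(o):
                    stack.append(o)
    return described


def cbd_sort_key(triples: List[Triple], node: str) -> str:
    """Build a label-independent string for the node's description."""

    def mask(term: str) -> str:
        # Assumption: hide blank node labels so the key depends on content only.
        return "[]" if is_bnode(term) else term

    lines = sorted(f"{mask(s)} {p} {mask(o)}" for s, p, o in cbd(triples, node))
    return "\n".join(lines)


def sorted_bnode_objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return the blank node objects of (subject, predicate), ordered by CBD string."""
    objects = [
        o for s, p, o in triples if s == subject and p == predicate and is_bnode(o)
    ]
    return sorted(objects, key=lambda n: cbd_sort_key(triples, n))


graph = [
    ("ex:doc", "ex:author", "_:b0"),
    ("ex:doc", "ex:author", "_:b1"),
    ("_:b0", "ex:name", '"John"'),
    ("_:b1", "ex:name", '"Jane"'),
]
print(sorted_bnode_objects(graph, "ex:doc", "ex:author"))
```

Because the key is built from what each blank node says rather than from its label, swapping the labels `_:b0` and `_:b1` in the input still puts the "Jane" description first.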

This adds an additional cost to the longturtle serializer, but I think the cost is worth it if we are after a deterministic output with blank nodes.

Note that this relies on the data having been parsed with the turtle parser, and on how that parser assigns blank node identifiers. For data added directly through the graph object, an identical serialization cannot be guaranteed. Guaranteeing that would require implementing RDF Canonicalization, so that the same blank nodes always receive the same identifiers in the graph.

Once RDF Canonicalization is implemented, sorting by the blank node identifier directly will be enough to guarantee deterministic serialization, and the expensive CBD sorting can be removed.
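To show why stable identifiers would make the sort trivial, here is a deliberately simplified sketch. It is not the RDF Canonicalization algorithm and not RDFLib code: the hypothetical `content_label` helper just hashes a blank node's outgoing triples, which means two blank nodes with identical descriptions collapse to the same label and cycles are handled crudely. It only demonstrates that once labels depend on content, a plain sort of the triples is deterministic.

```python
import hashlib
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]


def content_label(
    triples: List[Triple], node: str, _path: FrozenSet[str] = frozenset()
) -> str:
    """Derive a label for a blank node from its outgoing triples only."""
    if node in _path:  # crude cycle guard, an assumption of this sketch
        return "_:cycle"
    parts = []
    for s, p, o in triples:
        if s == node:
            if o.startswith("_:"):
                # Nested blank nodes contribute their own content-based label.
                o = content_label(triples, o, _path | {node})
            parts.append(f"{p} {o}")
    digest = hashlib.sha256("\n".join(sorted(parts)).encode("utf-8")).hexdigest()
    return "_:c" + digest[:12]


def relabel(triples: List[Triple]) -> List[Triple]:
    """Replace every blank node label with its content-based label, then sort."""

    def fix(term: str) -> str:
        return content_label(triples, term) if term.startswith("_:") else term

    return sorted((fix(s), p, fix(o)) for s, p, o in triples)


# The same data with blank node labels assigned in a different order.
first = [
    ("ex:doc", "ex:author", "_:b0"),
    ("ex:doc", "ex:author", "_:b1"),
    ("_:b0", "ex:name", '"John"'),
    ("_:b1", "ex:name", '"Jane"'),
]
second = [
    ("ex:doc", "ex:author", "_:b0"),
    ("ex:doc", "ex:author", "_:b1"),
    ("_:b0", "ex:name", '"Jane"'),
    ("_:b1", "ex:name", '"John"'),
]
print(relabel(first) == relabel(second))
```

With labels like these in place, sorting by identifier alone gives the same order on every run, which is the property the CBD string sort approximates at a higher cost.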

Update: it looks like top-level blank nodes with no inbound relationships don't get sorted. I think this makes sense, since the sort is only applied to blank nodes in the object position. If we cared about sorting blank nodes in the subject position, we could apply the same kind of sort, based on the text serialization of the CBD, to the subject blank nodes.

    for subject in subjects_list:
        if self.isDone(subject):
            continue
        if firstTime:
            firstTime = False
        if self.statement(subject) and not firstTime:
            self.write("\n")

Fixes #2767 - Bug in longturtle serialization

This PR also fixes the missing trailing whitespace in the special case described by @mschiedon:

"The longturtle serializer fails to emit a whitespace separator between a predicate and a list of objects if one of these objects is a blank node (and the blank node cannot be 'inlined', i.e. is used more than once)."

This bug was previously fixed in PR #2700, but the fix was inadvertently reverted in PR #2731 when I was trying to satisfy a ruff linting rule in the test case. To get around the ruff linting rule, I've now moved the expected result of the test case into a text file.
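The file-based approach can be sketched roughly as follows. This is an illustration only, not the test added in this PR: the helper name `check_against_file`, the file name, and the sample text are all made up, and the sample is a stand-in rather than real longturtle output.

```python
import tempfile
from pathlib import Path


def check_against_file(actual: str, expected_path: Path) -> bool:
    """Compare serializer output with an expected result stored on disk."""
    expected = expected_path.read_text(encoding="utf-8")
    return actual == expected


# Stand-in text, not real longturtle output. Its first line ends with a
# space, which a trailing-whitespace lint rule could flag if the expected
# text were pasted into the test as a multi-line string literal.
sample = "ex:s ex:p \n    _:b1 .\n"

with tempfile.TemporaryDirectory() as tmp:
    expected_path = Path(tmp) / "expected.txt"
    expected_path.write_text(sample, encoding="utf-8")
    same = check_against_file(sample, expected_path)
    # Dropping the trailing space must be detected as a difference.
    stripped = check_against_file(sample.replace(" \n", "\n"), expected_path)

print(same, stripped)
```

Keeping the expected text in a data file means the significant whitespace lives outside the linted Python source, so a formatter or lint fix cannot silently remove it again.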

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests and type checking passes.
  • If the change adds new features or changes the RDFLib public API:
    • Created an issue to discuss the change and get in-principle agreement.
    • Considered adding an example in ./examples.
  • If the change has a potential impact on users of this project:
    • Added or updated tests that fail without the change.
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@edmondchuc
Contributor Author

@ajnelson-nist would be good to see if you can test this with some of the data you work with and confirm that this PR fixes the blank node sorting issues in your workflows.

@coveralls
coveralls commented Nov 28, 2024

Coverage Status

coverage: 90.276% (+0.03%) from 90.247%
when pulling 4ef7733 on edmond/longturtle-sort
into 08dd4b7 on main.

Member

@nicholascar nicholascar left a comment

Awesome stuff @edmondchuc!

@nicholascar nicholascar merged commit 28a6190 into main Nov 29, 2024
22 checks passed
@nicholascar nicholascar deleted the edmond/longturtle-sort branch November 29, 2024 03:38
edmondchuc added a commit that referenced this pull request Jan 15, 2025
* feat: sort longturtle blank nodes in the object position by their cbd string

* fix: #2767
edmondchuc added a commit that referenced this pull request Jan 16, 2025
* feat: sort longturtle blank nodes in the object position by their cbd string

* fix: #2767
nicholascar added a commit that referenced this pull request Jan 16, 2025
* 7.1.1 post release (#2953)

* Fix Black formatting in ./admin/get_merged_prs.py (#2954)

* build(deps-dev): bump ruff from 0.7.0 to 0.7.1 (#2955)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.7.0 to 0.7.1.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.7.0...0.7.1)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ashley Sommer <[email protected]>

* Fix defined namespace warnings (#2964)

* Fix defined namespace warnings

Current docs-generation tests are polluted by lots of warnings that occur when Sphinx tries to read various parts of DefinedNamespace.

* Fix tests that no longer need incorrect exceptions handled.

* fix black formatting in test file

* Undo typing changes, so this works on current pre-3.9 branch

* better handling for any/all double-underscore properties

* Don't include __slots__ in dir().

* test: earl test passing

* Annotate Serializer.serialize and descendants (#2970)

This patch aligns the type signatures on `Serializer` subclasses,
including renaming the arbitrary-keywords dictionary to always be
`**kwargs`.  This is in part to prepare for the possibility of adding
`*args` as a positional-argument delimiter.

References:
* #1890 (comment)

Signed-off-by: Alex Nelson <[email protected]>

* build(deps): bump orjson from 3.10.10 to 3.10.11 (#2966)

Bumps [orjson](https://github.com/ijl/orjson) from 3.10.10 to 3.10.11.
- [Release notes](https://github.com/ijl/orjson/releases)
- [Changelog](https://github.com/ijl/orjson/blob/master/CHANGELOG.md)
- [Commits](ijl/orjson@3.10.10...3.10.11)

---
updated-dependencies:
- dependency-name: orjson
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.7.1 to 0.7.2 (#2969)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.7.1 to 0.7.2.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.7.1...0.7.2)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.7.2 to 0.7.3 (#2979)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.7.2 to 0.7.3.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.7.2...0.7.3)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.7.3 to 0.8.0 (#2994)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.7.3 to 0.8.0.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.7.3...0.8.0)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump orjson from 3.10.11 to 3.10.12 (#2991)

Bumps [orjson](https://github.com/ijl/orjson) from 3.10.11 to 3.10.12.
- [Release notes](https://github.com/ijl/orjson/releases)
- [Changelog](https://github.com/ijl/orjson/blob/master/CHANGELOG.md)
- [Commits](ijl/orjson@3.10.11...3.10.12)

---
updated-dependencies:
- dependency-name: orjson
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* added Node as an exported name from the root package location. Updated linting commands section in the developer section to use ruff check. (#2981)

* build(deps-dev): bump wheel from 0.45.0 to 0.45.1 (#2992)

Bumps [wheel](https://github.com/pypa/wheel) from 0.45.0 to 0.45.1.
- [Release notes](https://github.com/pypa/wheel/releases)
- [Changelog](https://github.com/pypa/wheel/blob/main/docs/news.rst)
- [Commits](pypa/wheel@0.45.0...0.45.1)

---
updated-dependencies:
- dependency-name: wheel
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Car <[email protected]>

* feat: sort longturtle blank nodes (#2997)

* feat: sort longturtle blank nodes in the object position by their cbd string

* fix: #2767

* build(deps-dev): bump pytest from 8.3.3 to 8.3.4 (#2999)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.3.3 to 8.3.4.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@8.3.3...8.3.4)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump poetry from 1.8.4 to 1.8.5 (#3001)

Bumps [poetry](https://github.com/python-poetry/poetry) from 1.8.4 to 1.8.5.
- [Release notes](https://github.com/python-poetry/poetry/releases)
- [Changelog](https://github.com/python-poetry/poetry/blob/1.8.5/CHANGELOG.md)
- [Commits](python-poetry/poetry@1.8.4...1.8.5)

---
updated-dependencies:
- dependency-name: poetry
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.8.0 to 0.8.2 (#3003)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.8.0 to 0.8.2.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.8.0...0.8.2)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.8.2 to 0.8.3 (#3010)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.8.2 to 0.8.3.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.8.2...0.8.3)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump berkeleydb from 18.1.11 to 18.1.12 (#3009)

Bumps [berkeleydb](https://www.jcea.es/programacion/pybsddb.htm) from 18.1.11 to 18.1.12.

---
updated-dependencies:
- dependency-name: berkeleydb
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# Conflicts:
#	poetry.lock

* build(deps): bump orjson from 3.10.12 to 3.10.13 (#3018)

Bumps [orjson](https://github.com/ijl/orjson) from 3.10.12 to 3.10.13.
- [Release notes](https://github.com/ijl/orjson/releases)
- [Changelog](https://github.com/ijl/orjson/blob/master/CHANGELOG.md)
- [Commits](ijl/orjson@3.10.12...3.10.13)

---
updated-dependencies:
- dependency-name: orjson
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ruff from 0.8.4 to 0.8.6 (#3025)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.8.4 to 0.8.6.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.8.4...0.8.6)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: deterministic longturtle serialisation using RDF canonicalization + n-triples sort (#3008)

* feat: use the RGDA1 canonicalization algorithm + lexical n-triples sort to produce deterministic longturtle serialisation

* chore: normalise usage of format

* chore: apply black

* fix: double up of semicolons when subject is a blank node

* fix: lint

* jsonld: Do not merge nodes with different invalid URIs (#3011)

When parsing JSON-LD with invalid URIs in the `@id`, the
`generalized_rdf: True` option allows parsing these nodes as blank nodes
instead of outright rejecting the document.

However, all nodes with invalid URIs were mapped to the same blank node,
resulting in incorrect data. For example, without this patch, the new test
fails with:

```
AssertionError: Expected:
@prefix schema: <https://schema.org/> .

<https://example.org/root-object> schema:author [ schema:familyName "Doe" ;
            schema:givenName "Jane" ;
            schema:name "Jane Doe" ],
        [ schema:familyName "Doe" ;
            schema:givenName "John" ;
            schema:name "John Doe" ] .

Got:
@prefix schema: <https://schema.org/> .

<https://example.org/root-object> schema:author <> .

<> schema:familyName "Doe" ;
    schema:givenName "Jane",
        "John" ;
    schema:name "Jane Doe",
        "John Doe" .
```

* Fixed incorrect ASK behaviour for dataset with one element (#2989)

* Pass base uri to serializer when writing to file. (#2977)

Co-authored-by: Nicholas Car <[email protected]>

* Dataset documentation improvements (#3012)

* example printout improvements

* added BN graph creation

* updated tests var names & added one subtest

* typos & improved formatting

* updated Graph & Dataset docco

* typo fix

* fix code-in-comment syntax

* fix code-in-comment syntax 2

* fix code-in-comment syntax - ellipses

* fix code-in-comment syntax - sort print loop output

* blacked

* ruff fixes

* Poetry 2.0.0 pyproject.toml file

* move to PEP621 (Poetry 2.0.0) pyproject.toml

* require poetry 2.0.0

* require poetry 2.0.0

* add in requirement for poetry-plugin-export

* change from --sync to sync command

* further pyproject.toml format updates

* add poetry plugin to requirements-poetry.in

* fix pre-commit poetry version to 2.0.0

* remove testing artifact

* update license to 2025

* add me to contributors

* remove outdated --check arg

* typo

* test add back in precommit args

* test remove precommit args

* match ruff version to pre-commit autoupdate PR #3026; add back in --check

* re-remove --check

* add David to CONTRIBUTORS

* ruff in pyproject.toml to match pre-commit

* updates for David's comments

* fix Dataset docc ReST formatting

* remove ConjunctiveGraph example; add Dataset example; add JSON-LS serialization example

* Add RDFLib Path to SHACL path utility and corresponding tests (#2990)

* shacl path parser: Add additional test case

* shacl utilities: Add new SHACL path building utility with corresponding tests

---------

Co-authored-by: Nicholas Car <[email protected]>
# Conflicts:
#	rdflib/extras/shacl.py

* fix: typing and import issues

* fix: line length as int

* fix: ruff version conflict

* fix: berkeleydb pin to 18.1.10 for python 3.8 compatibility

* 3a not 2a

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Alex Nelson <[email protected]>
Co-authored-by: Nicholas Car <[email protected]>
Co-authored-by: Ashley Sommer <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alex Nelson <[email protected]>
Co-authored-by: joecrowleygaia <[email protected]>
Co-authored-by: Val Lorentz <[email protected]>
Co-authored-by: jcbiddle <[email protected]>
Co-authored-by: Sander Van Dooren <[email protected]>
Co-authored-by: Nicholas Car <[email protected]>
Co-authored-by: Matt Goldberg <[email protected]>
Development

Successfully merging this pull request may close these issues.

Bug in longturtle serialization (#2767)
Sorting Turtle output? (#1890)
3 participants