Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore scheme and trailing slashes in meta_schemes key #822

Closed
wants to merge 2 commits into from

Conversation

lcvisser
Copy link

This PR intends to resolve #569 by ignoring the scheme (http, https) and any trailing slashes in the $schema uri.

@codecov
Copy link

codecov bot commented Jul 13, 2021

Codecov Report

Merging #822 (20ce1d8) into main (61382d7) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head 20ce1d8 differs from pull request most recent head a4d6fb9. Consider uploading reports for the commit a4d6fb9 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main     python-jsonschema/jsonschema#822      +/-   ##
==========================================
+ Coverage   96.08%   96.12%   +0.03%     
==========================================
  Files          18       18              
  Lines        2706     2733      +27     
  Branches      308      308              
==========================================
+ Hits         2600     2627      +27     
  Misses         86       86              
  Partials       20       20              
Impacted Files Coverage Δ
jsonschema/_utils.py 89.81% <100.00%> (+0.29%) ⬆️
jsonschema/tests/test_validators.py 98.64% <100.00%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61382d7...a4d6fb9. Read the comment docs.

@lcvisser
Copy link
Author

lcvisser commented Aug 3, 2021

Should I do something else?

@Julian
Copy link
Member

Julian commented Aug 3, 2021

Hey! Sorry for not responding here -- the key thing we need to verify is whether this change is actually fully correct under the spec.

Specifically, $ids in the spec are, as far as I vaguely recall, meant to be identifiers, nothing more -- and the URI specifications leave how you treat trailing slashes up to the individual use case -- so whether https://example.com/foo and https://example.com/foo/ are indeed the same is up to the spec I believe.

I think what we need to do here is ask this question upstream, in perhaps the slack or discussion board.

EDIT: to clarify a bit further:

  • making the URI we use precisely match the $id in each official metaschema is an easy immediately mergeable change.
  • making the URI we accept liberal to empty fragments or trailing slashes is not immediately acceptable but if we confirm that's specified behavior it sounds very reasonable to me

@lcvisser
Copy link
Author

lcvisser commented Aug 3, 2021

Thanks for the reply.

I understand the motivation to stick to the spec, but the original issue reported a URI with trailing slash and hash and my motivation has a trailing hash in it, so there is a practical aspect to consider also.

If it turns out to be against spec to have trailing slashes/hashes, would it be acceptable to have something like a uri-strictness flag or something, to deal with stubborn reality?

@Julian
Copy link
Member

Julian commented Aug 3, 2021

If it's against the spec, it'd seem quite easy to do external from the validation process. I.e. def canonicalize_urls(schema) -> schema which just walks all the $id fields in a provided dict and removes trailing slashes and empty fragments and returns the resulting normalized schema.

I could possibly be convinced that'd be worth adding somewhere (there's some precedence in wanting this kind of thing in e.g. python-jsonschema/referencing#2 or #252) but yeah not as a flag.

@lcvisser
Copy link
Author

lcvisser commented Aug 3, 2021

I asked on their Slack channel; I'll reproduce the response here for completeness:

Stripping/ignoring the scheme is definitely against spec (specifically RFC 3986). It happens to be safe (most of the time) to assume that "https" URIs are referencing the same URIs as their HTTPS counterparts.
But a) that's not guaranteed in all cases, and b) there are more schemes than just http/https. I've seen people do goofy things with schemes. Best not to make assumptions about those things.

IMO it would only be valid to accept other forms for those URIs if the evaluator actually fetched that document via the network with that URI and parsed it to be sure it’s the same thing. it should not simply assume that a URI that looks similar is referencing the same metaschema as the official URI.
..and not all evaluators support fetching documents via the network (nor are they required to)

Yeah, plus the standard metaschema URIs (including the ones in their http forms) are "well known", so most implementations I've seen don't even need to fetch them over the network.
So if you're worried about making insecure http requests, it's generally not an issue for the official meta schemas (since implementation libraries often ship with the metaschemas pre-fetched for your convenience.)
Not a guarantee though, so double check any implementation you're using.

I'll think about your proposal for a canonize_uri() utility-like function. Would it be better to make a new MR for that or reuse this one?

@Julian
Copy link
Member

Julian commented Aug 3, 2021

Did you ask specifically about trailing fragments too?

Ignoring schemes I was sure wasn't going to fly, so unsurprised there, but those seemed more likely to be allowed. Is an empty fragment considered the same URL as fragmentless?

As for the helper function, either way is fine with me on reusing this one or creating a new one -- whatever is easiest for you.

And thanks!

@lcvisser
Copy link
Author

lcvisser commented Aug 3, 2021

Sorry, should have included my question also:

Hi; the JSON-schema Python package has an open issue (#569) regarding what should be accepted for $schema . json-schema.org uses e.g. http://json-schema.org/draft-07/schema#, but "in the wild" there may be https instead of http and/or trailing slash before the # (i.e. http://json-schema.org/draft-07/schema/# ) and/or a missing # at the end, and combinations thereof. Would it be against spec to be liberal in what to accept for $schema? In particular: this pull request proposes to use a canonical URI, stripping scheme and empty fragments - would that violate anything? Seems a safe change to me, but better safe than sorry and double check.

I don't know enough about this, to be honest. It seems that the safest way forward is to only accept the format that is in the schema http://json-schema.org/draft-07/schema# and then provide a utility function that liberally "converts" to that. The doctoring of that utility could provide the necessary warnings and caveats as put forward in the Slack channel.

@Julian Julian added the Waiting for Author A PR waiting for author changes or issue waiting on reporter feedback label Aug 9, 2021
@lcvisser
Copy link
Author

lcvisser commented Sep 12, 2021

Sorry for the lack of update.

The longer I thought about it, the more I got convinced I was getting myself into a pickle. I agree that the least-intrusive way to solve this is to add an utility function that allows the user to override a particular $schema lookup. The rationale being that the provider of the schema should stick to the standard, hence silently eating schemes or trailing slashes is indeed wrong, as you argued.

I've updated the PR with the utility function. Here is an MWE. The first attempt fails because the $schema URI does not resolve (note https scheme and trailing slash):

import jsonschema


schema = {
    "$schema": "https://json-schema.org/draft-04/schema/#",
    "definitions": {
        "my_type": {
            "type": "number",
            "maximum": 10,
            "exclusiveMaximum": True
        }
    },
    "properties": {
        "my_number": {
            "$ref": "#/definitions/my_type"
        }
    }
}

instance = {
    "$schema": "https://json-schema.org/draft-04/schema/#",
    "my_number": 7
}


# Fails
try:
    jsonschema.validate(instance, schema)
except jsonschema.exceptions.SchemaError as exc:
    print("failed")

# Succeeds
jsonschema.validators.set_validator_for(schema, jsonschema.validators.Draft4Validator)
jsonschema.validate(instance, schema)
print("success")

I will fix the merge conflict and add unit tests together with other comments you may have.

@lcvisser
Copy link
Author

Any feedback @Julian ?

@Julian
Copy link
Member

Julian commented Sep 18, 2021

Apologies, I am inordinately busy, even busier than usual, as academic semester is starting -- I'll try to get you some feedback but it likely won't be before middle of the week or early next. Sorry -- please do ping me if you have to wait any longer.

@lcvisser
Copy link
Author

Apologies, I am inordinately busy, even busier than usual, as academic semester is starting -- I'll try to get you some feedback but it likely won't be before middle of the week or early next. Sorry -- please do ping me if you have to wait any longer.

No worries! Just wanted to make sure that it was noticed.

@Julian
Copy link
Member

Julian commented Sep 30, 2021

@lcvisser -- looking at this briefly, my honest reaction is -- is this really any better than just recommending people do schema["$schema"] = Draft4Validator.META_SCHEMA["$id"]? To me it seems obviously not personally, but willing to hear what you think.

@lcvisser
Copy link
Author

@Julian Personally, I think the utility functions is more user friendly, as it is discoverable and documented via the docstring. But it's a personal preference, so if you'd like the code base as minimal as possible, than that's a valid argument too of course. In that case, I'd propose to reject and close this MR (and arguably also the linked issue).

@Julian
Copy link
Member

Julian commented Sep 30, 2021

I think for now will close -- I appreciate your effort and research. We can revisit in the future if needed.

@Julian Julian closed this Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Author A PR waiting for author changes or issue waiting on reporter feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make $schema lookup more liberal for empty fragment URLs
2 participants