XML namespaces stripped when referenced #85
Comments
Unfortunately the spec used to inform the transformation doesn't handle namespaces, so we're on our own to figure out how to best handle this. I would think the structure of your last example there should be: {
"a:foo": {
"@xmlns:a": "http://www.w3.org/1999/xhtml",
"#text": "bar"
}
} However, this is technically a breaking change, as the structure of the transformed data would change if an element had a namespace, when previously it was a simple key/value pair.
This might be the way forward as it would keep backwards compatibility, but still support this use case. I can also note that in 2.x of I still need to update tests and add the flag, but I pushed up https://github.com/Blacksmoke16/oq/compare/xml-namespaces that includes the logic that will be behind the flag. With this code your examples now produce this, with the JSON version being what I included earlier: ./bin/oq -i xml -o xml <<< '<?xml version="1.0"?><a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>
</root> |
Thanks for the quick turnaround! :)
Perfect! So too is the output from your branch. I'm using |
What's the issue? |
During the build of
I'm not sure if it matters, but I see loads of the repeated warning below as the tests are run, which seem fairly critical:
I'm using Nix on macOS. Occasionally build differences in the macOS ecosystem make Nix packages fail, and I don't think it has quite as many eyes on it as Linux (or first class Nix) users enjoy. I intend to file a bug at some point, but they might know about it already since the package is marked as "broken" - at least for macOS users. Apologies if this derails this particular issue. |
I was able to build a binary from the $ oq -i xml <<< '<?xml version="1.0"?><foo xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:a="http://www.w3.org/1999/xhtml">bar</foo>'
{
":foo": {
"@xmlns:": "urn:oasis:names:tc:SAML:2.0:metadata",
"#text": "bar"
}
} To get this to happen I had to include an unprefixed namespace. The result is that the prefixed namespace is still stripped, and now the element names are prefixed with a... blank namespace? In this case it is If I leave off the prefixed namespace, the output is the same: $ oq -i xml <<< '<?xml version="1.0"?><foo xmlns="urn:oasis:names:tc:SAML:2.0:metadata">bar</foo>'
{
":foo": {
"@xmlns:": "urn:oasis:names:tc:SAML:2.0:metadata",
"#text": "bar"
}
} |
Yea that's prob your best bet, I don't really know anything about nix.
Ok, so I think these are easy fixes. I pushed up another commit that should resolve these if you want to try again. |
I've been playing around with it today. I'm running into trouble with my transformation script and I haven't been able to create a simple reproduction - so it might just be my script at this point. Tomorrow I plan on figuring out what the intended behavior of XML namespaces should actually be so I can make an intelligent ask ;) I really appreciate all of the effort and the quick turnaround! |
Inspired by some libraries that handle XPath and by reading the XML spec itself, I have some suggestions to make for Here’s a fact list:
XPath libraries handle this by having the consumer establish namespaces that are important to the consumer, and assign a prefix to it. The consumer expresses “Give me what is on this path” but the namespace is essential to that expression. For example, these are two different nodes: <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b">
<a:foo>
herp
</a:foo>
<b:foo>
derp
</b:foo>
</root> Both of these oq -i xml --xmlns "b=https://b" '.["b:foo"] .["#text"]' <<< '
<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b">
<a:foo>
herp
</a:foo>
<b:foo>
derp
</b:foo>
</root>
' And the result should be:
Additionally, the same <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
<a:foo>
herp
</a:foo>
<foo>
derp
</foo>
</root> These are the same documents because the last The contrived oq -i xml --xmlns "c=https://b" '.["c:foo"] .["#text"]' <<< "$document" Here the prefix is changed, but both documents above will satisfy the query and produce an output of “derp”. This works because the When For example, it should be an acceptable transformation to start with this: <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
<a:foo>
herp
</a:foo>
<foo>
<bar xmlns="https://c">
<baz xmlns="https://d">
</baz>
</bar>
</foo>
</root> And wind up with this: <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b" xmlns:c="https://c" xmlns:d="https://d">
<a:foo>
herp
</a:foo>
<b:foo>
<c:bar>
<d:baz>
</d:baz>
</c:bar>
</b:foo>
</root> Where all of the default prefixes have become explicit prefixes. Clearing the default namespace is done by setting the default namespace to <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
<a:foo>
herp
</a:foo>
<foo>
<bar xmlns="https://c">
<baz xmlns="https://d">
<qux xmlns="" />
</baz>
</bar>
</foo>
</root> And wind up with this: <?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b" xmlns:c="https://c" xmlns:d="https://d">
<a:foo>
herp
</a:foo>
<b:foo>
<c:bar>
<d:baz>
<qux />
</d:baz>
</c:bar>
</b:foo>
</root> The Uniqueness of Attributes section of the spec I’m less confident about, but after reading it a few times I can only ascertain that it means to show that attributes don’t inherit default namespaces. I know this is a lot to digest! I appreciate the time spent so far. I’m happy to help sort things out if there’s some confusion here. It took me some time to really grok it - I didn’t realize XML could be so complicated. |
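The consumer-side prefix binding described above can be sketched with Python's standard `xml.etree.ElementTree`, which takes a prefix-to-URI mapping when searching. This is only an illustration of the idea, not how oq works: the helper name `foo_text` and the alias `c` are made up for the example, and the documents are trimmed versions of the ones in this comment.

```python
import xml.etree.ElementTree as ET

# Same content, written once with an explicit "b" prefix and once with a
# default namespace. Both bind the second <foo> to https://b.
PREFIXED = """<root xmlns:a="https://a" xmlns:b="https://b">
  <a:foo>herp</a:foo>
  <b:foo>derp</b:foo>
</root>"""

DEFAULTED = """<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>herp</a:foo>
  <foo>derp</foo>
</root>"""


def foo_text(xml_text, aliases):
    """Return the text of the child matched by 'c:foo' (hypothetical helper).

    `aliases` maps the consumer's own prefixes to namespace URIs; the
    prefixes used inside the document itself play no part in the lookup.
    """
    root = ET.fromstring(xml_text)
    node = root.find("c:foo", aliases)
    return None if node is None else node.text.strip()


# The consumer picks "c" for https://b, a prefix neither document uses.
aliases = {"c": "https://b"}
print(foo_text(PREFIXED, aliases))
print(foo_text(DEFAULTED, aliases))
```

Both calls resolve to the element in the `https://b` namespace, which is the behaviour the `--xmlns "c=https://b"` suggestion is asking for.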
@LoganBarnett I guess the one question I have is, are you suggesting |
@Blacksmoke16 Not From my perspective, the implementation hoops Earlier you mentioned Some context for fun, feel free to ignore since I tend to write a lot: Prior to this, we were using |
Ok good :) I see what I did tho. I mistook the filter to be additional args to that option 🙈.
Yea I think ideally this behavior of producing XML that is semantically equal to what was input when going to/from I did manage to fix an additional issue related to some of your latest examples if you want to rebuild and try again. Apparently the methods that are available to get a node's namespace also include the parent's namespaces as well. So your 2nd to last example with a, b, c, and d namespaces is transformed into: {
"root": {
"@xmlns:a": "https://a",
"@xmlns": "https://b",
"a:foo": {
"@xmlns:a": "https://a",
"@xmlns": "https://b",
"#text": "herp"
},
"foo": {
"@xmlns:a": "https://a",
"@xmlns": "https://b",
"bar": {
"@xmlns": "https://c",
"@xmlns:a": "https://a",
"baz": {
"@xmlns": "https://d",
"@xmlns:a": "https://a"
}
}
}
}
} Which I suppose is semantically equal, just a bit more verbose? 🤷. Going to look into if I just need to bind a diff function to get ones only defined on a given node.
Glad to hear, it's a nice lang ;).
The |
Ok, I was able to monkeypatch this in: class ::XML::Node
def node_namespaces : Array(Namespace)
namespaces = [] of Namespace
return namespaces unless (ns = @node.value.ns_def)
while ns
namespaces << Namespace.new(document, ns)
ns = ns.value.next
end
namespaces
end
end Which makes it now represented like: {
"root": {
"@xmlns:a": "https://a",
"@xmlns": "https://b",
"a:foo": {
"#text": "herp"
},
"foo": {
"bar": {
"@xmlns": "https://c",
"baz": {
"@xmlns": "https://d"
}
}
}
}
} Which looks to be exactly what we'd expect yea? |
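For comparison, the difference between namespaces declared on a node and namespaces merely in scope can be observed with Python's standard library. This is a rough sketch, not a port of the Crystal patch above: `declared_namespaces` is a name invented for the example, and it relies on `iterparse` emitting `start-ns` events just before the `start` event of the element that carries the declaration.

```python
import io
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>herp</a:foo>
  <foo>
    <bar xmlns="https://c">
      <baz xmlns="https://d"></baz>
    </bar>
  </foo>
</root>"""


def declared_namespaces(xml_text):
    """List (qualified tag, {prefix: uri}) pairs in document order.

    Only declarations written on the element itself are recorded;
    declarations inherited from ancestors are left out. The default
    namespace is reported under the empty-string prefix.
    """
    result = []
    pending = []  # declarations seen since the last element start
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            pending.append(payload)  # payload is a (prefix, uri) tuple
        else:
            result.append((payload.tag, dict(pending)))
            pending = []
    return result


for tag, declared in declared_namespaces(DOC):
    print(tag, declared)
```

Only `root`, `bar` and `baz` report declarations here, mirroring the trimmed JSON output above.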
@LoganBarnett I created #89 that implements the actual I also extracted the bug related to element prefixes being dropped and added specs for the current behavior as part of #88 and #90. I think that part is fine to release as part of |
Thanks so much for iterating with me on this! :) Alright, I got a workable Here's a quick test:
I would expect this or something semantically similar:
I get:
This is the help I see with
Which leaves me with the impression that this is a toggle flag and not necessarily a flag where I can provide a namespace. The error makes me think the argument I think I'm passing to the bin/oq -i xml -o xml --xml-root '' --xmlns '.["a:foo"]' <<< \
'<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
oq error: Error in attribute Here's a couple of other things I tried: bin/oq -i xml -o xml --xml-root '' --xmlns '.foo' <<< \
'<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
<?xml version="1.0" encoding="UTF-8"?>
bin/oq -i xml -o xml --xml-root '' --xmlns '.["a:foo"]' <<< \
'<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
oq error: Error in attribute One of the tricky things I found in the XML namespace stuff is that the prefix ( Based on your specs and examples above I think we're in a really good spot. I think the only thing I'm seeing now is the need to address into the document with a query that's namespace aware. |
@LoganBarnett I think I don't fully follow what purpose that mapping would have. Like what does I also think you found a diff bug. For example: echo $'{"foo":"bar"}' | oq .foo
"bar"
echo $'{"foo":"bar"}' | oq .["foo"]
jq: error: foo/0 is not defined at <top-level>, line 1:
.[foo]
jq: 1 compile error Can deff get that fixed. Seems the quotes aren't making it to EDIT: NVM, this isn't a bug, just a case where you need to quote the filter. I.e. |
Apologies if I sound repetitious here - this took me some time to grok and even more time to figure out how to convey it without sounding like a long winded spec. The namespace is the URL, which has a very static semantic in XML. If I am using elements that are bound to the namespace Here's some documents, which are all semantically similar: <?xml version="1.0" ?>
<foo xmlns="https://foo-namespace">
bar
</foo> <?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
bar
</f:foo> <?xml version="1.0" ?>
<foo:foo xmlns:foo="https://foo-namespace">
bar
</foo:foo> However this one is not a node of the namespace <?xml version="1.0" ?>
<foo>
bar
</foo> Additionally, I could use the same prefix but a different namespace and semantically the documents are still different. The document below is not semantically the same as this post's first examples. Its prefix is the same, but the prefix is irrelevant and all that matters is the namespace. <?xml version="1.0" ?>
<f:foo xmlns:f="https://bar-namespace">
bar
</f:foo> XPath query tools (like Suppose we have this document, and we want to get the contents of the <?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
<f:bar>
baz
</f:bar>
</f:foo> We need some mechanism or notation to indicate that we're looking for a Using our document directly above, we could use this oq -i xml -o xml --xml-root '' --xmlns 'a=https://foo-namespace' '.["a:foo"] .["a:bar"]' < foo-bar.xml Now the curve-ball, which introduces the need to support multiple namespaces. Suppose <?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
<b:bar xmlns:b="https://bar-namespace">
baz
</b:bar>
</f:foo> Now we need an additional namespace to annotate our query with. Here's where the suggestion for multiple oq -i xml -o xml --xml-root '' --xmlns 'a=https://foo-namespace' --xmlns 'b=https://bar-namespace' '.["a:foo"] .["b:bar"]' < foo-bar-independent.xml This way our query has a way to align the namespaces (which are just URLs). Does that kind of make sense? Feelings of revulsion aside ;) IMO this is very convoluted. In the wild I'm seeing varied styles of annotating the namespace, so I have reason to believe using some random prefix is quite common, and life seems like it would be much easier if we just didn't have them :\ |
I'll also add that declaring a default namespace also eliminates the need for a prefix entirely. So not only is the prefix completely arbitrary but it might not even be there. |
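A quick way to check the claim that the prefix is arbitrary (or absent) is to look at how a namespace-aware parser names the element. The sketch below uses Python's `xml.etree.ElementTree`, which reports tags as `{namespace}local`; the helper name `qualified_name` is an assumption for the example, and the documents are condensed from the ones in the earlier comment.

```python
import xml.etree.ElementTree as ET

# Three spellings that bind <foo> to the same namespace.
SAME_NAMESPACE = [
    '<foo xmlns="https://foo-namespace">bar</foo>',
    '<f:foo xmlns:f="https://foo-namespace">bar</f:foo>',
    '<foo:foo xmlns:foo="https://foo-namespace">bar</foo:foo>',
]

# No namespace at all, and the same prefix bound to a different namespace.
NO_NAMESPACE = "<foo>bar</foo>"
OTHER_NAMESPACE = '<f:foo xmlns:f="https://bar-namespace">bar</f:foo>'


def qualified_name(xml_text):
    """Return the root element's name as '{namespace}local' (hypothetical helper)."""
    return ET.fromstring(xml_text).tag


for doc in SAME_NAMESPACE:
    print(qualified_name(doc))
print(qualified_name(NO_NAMESPACE))
print(qualified_name(OTHER_NAMESPACE))
```

The first three documents yield the same qualified name, while the last two differ from them and from each other, even though one of them reuses the `f` prefix.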
@LoganBarnett Thanks for the explanation. I see what you want now. If you know the structure of the XML document (as you would need to in order to setup the mappings), why couldn't you just do like: ./bin/oq -i xml -o xml --no-prolog --xml-root '' --xmlns '.["f:foo"] | .["b:bar"] | .["#text"]' < foo-bar.xml |
The problem is that semantically correct documents may not use prefixes Then we could really have a |
I think I've been going in circles with my explanation re: prefixes and namespaces a bit. A prefix could be likened to a variable and the namespace the variable's value. The variable (prefix) could be named anything, and in some cases isn't present at all. The value (namespace) is all we truly care about, and the variable (prefix) is just internal machinations that help us make sense of the code. The mapping we could provide EDIT: Specify "it" in regards to an ideal vs practical implementation. |
@LoganBarnett Ohh ok now i understand where the mapping fits in. The gist of it is you need a way to agnostically navigate the tree based on the namespace href no matter what the prefix is, if any. I.e. normalizing semantically equivalent elements into a standardized one that's easier to query. I'll have to play around what that implementation would look like. Might actually be pretty easy. |
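The normalization idea summarized above might look roughly like the following in Python. This is a sketch of the concept only, not oq's implementation: `alias_name`, `to_aliased_dict` and `normalize` are invented names, the alias map is keyed by namespace URI as an assumption, and repeated sibling names, attributes and mixed content are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def alias_name(tag, aliases):
    """Rewrite '{uri}local' as 'alias:local' using a uri -> alias map."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        if uri in aliases:
            return f"{aliases[uri]}:{local}"
    return tag  # no namespace, or no alias configured: leave untouched


def to_aliased_dict(element, aliases):
    """Convert an element to nested dicts keyed by aliased names.

    Simplification: siblings sharing a name overwrite each other.
    """
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return {alias_name(c.tag, aliases): to_aliased_dict(c, aliases) for c in children}


def normalize(xml_text, aliases):
    root = ET.fromstring(xml_text)
    return {alias_name(root.tag, aliases): to_aliased_dict(root, aliases)}


ALIASES = {"https://foo-namespace": "a", "https://bar-namespace": "b"}

PREFIXED = (
    '<f:foo xmlns:f="https://foo-namespace">'
    '<b:bar xmlns:b="https://bar-namespace">baz</b:bar></f:foo>'
)
DEFAULTED = (
    '<foo xmlns="https://foo-namespace">'
    '<bar xmlns="https://bar-namespace">baz</bar></foo>'
)

print(normalize(PREFIXED, ALIASES))
print(normalize(DEFAULTED, ALIASES))
```

Both inputs normalize to the same structure, so a single filter such as `.["a:foo"] | .["b:bar"]` could address either document.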
I think we've achieved a mind meld! :D
Sweet! You've had loads of patience through this. Thanks so much :) I'm about to wrap up for the day but I think I could contribute some additional specs at the very least. |
@LoganBarnett I pushed up |
@Blacksmoke16 I wasn't able to get to this today due to some frantic work needs. I should be able to try it out tomorrow. Thanks! |
Apologies for my absence. I managed to do a quick test with the build. The namespaces are preserved nicely! I think my query is correct on the second example with $ bin/oq -i xml -o xml --xml-root '' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<a:foo xmlns:a="https://foo-namespace">
<b:bar xmlns:b="https://bar-namespace">
baz
</b:bar>
</a:foo>
$ bin/oq -i xml -o xml --xml-root '' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
oq error: Error in attribute Thoughts? This is the document I'm working with. The output from the first example is perfect. Thank you! <?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
<b:bar xmlns:b="https://bar-namespace">
baz
</b:bar>
</f:foo> |
@LoganBarnett Glad to hear! Regarding {
"@xmlns:a": "https://foo-namespace",
"b:bar": {
"@xmlns:b": "https://bar-namespace",
"#text": "\n baz\n "
}
} Notice there is the |
@Blacksmoke16 I did the |
@LoganBarnett The problem is To solve this I'd either have to produce a better error when this happens, or like maybe keep track when you're in the root context to know if those attributes should be skipped. 🤷 EDIT: XML requires there be a root element, so in this case by excluding the root element you're causing it to try and generate invalid XML. A clearer error would prob be the better solution. |
@Blacksmoke16 sorry, work got nuts there for a bit. This makes sense to me, I think. I get that XML (and thus Using the same document as before, without $ bin/oq -i xml -o xml --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="https://foo-namespace">
<b:bar xmlns:b="https://bar-namespace">
baz
</b:bar>
</root> If I make $ bin/oq -i xml -o xml --xml-root 'a:foo' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<a:foo xmlns:a="https://foo-namespace">
<b:bar xmlns:b="https://bar-namespace">
baz
</b:bar>
</a:foo> I think this works in a variety of cases - one of mine is that I'm sorting some sibling nodes in a document. I think I have it broken into two queries - one to do the sort and the other to assign it back. Doing both was problematic. I think in this case here I'd just We should be good here! Thanks so much :) Would you be okay with me contributing a separate documentation pull request for this new functionality? |
@LoganBarnett No problem!
If you want sure, probably wouldn't hurt to just add it in the README. |
If I have a root element which declares an xmlns prefix, that prefix is stripped from the elements in the document. This is causing some trouble in some document transformations I'm doing, since the validators which consume the transforms care very much about these namespaces.
Additionally, the namespace declaration itself doesn't survive.
Here is an example in which the prefix is preserved:
Inspecting the JSON output reveals that it is similarly preserved:
If I add an xmlns declaration for that prefix, the prefix is stripped from the output and the xmlns attribute itself is also removed.
The JSON output mirrors the behavior:
I admit my knowledge about deeper XML validation and transformation is limited - I might be missing something here. However I think there's a lot of value in oq making XML transformations without necessarily changing the rest of the document (within reason - bad XML is bad XML and it's not reasonable to support that). If this behavior is intentional, perhaps we could have some kind of flag in which we could disable it?
Thanks for all the work on oq! It's been an invaluable tool at my workplace :)