Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect_namespaces: Different result since 1.5.6 #885

Closed
wolfgangw opened this issue Apr 9, 2013 · 8 comments
Closed

collect_namespaces: Different result since 1.5.6 #885

wolfgangw opened this issue Apr 9, 2013 · 8 comments

Comments

@wolfgangw
Copy link
Contributor

Nokogiri::XML::Document#collect_namespaces will return different results with version 1.5.5 and versions since 1.5.6 respectively.

By the way: the sample below, while abstract and reduced, reflects real-world instance documents, not merely an academic exercise.

Consider this document with multiple default namespace scopes:

<?xml version="1.0"?>
<foo xmlns="foo_space">
  <bar>
    <barchild xmlns="bar_space">
      barchild says: I'm not always here.
      When I am I don't want to clutter
      root node's namespace definitions.
      Also I don't want to be cluttered
      myself -- no prefix for me please.
    </barchild>
  </bar>
</foo>

1.5.5:

irb> [Nokogiri::VERSION, Nokogiri::LIBXML_VERSION]
=> ["1.5.5", "2.7.8"]
irb> Nokogiri::XML(open 'multi-default-ns.xml').collect_namespaces
=> {"xmlns"=>"foo_space"}

1.5.6 and ever since:

irb> [Nokogiri::VERSION, Nokogiri::LIBXML_VERSION]
=> ["1.5.6", "2.7.8"]
irb> Nokogiri::XML(open 'multi-default-ns.xml').collect_namespaces
=> {"xmlns"=>"bar_space"}

This on ruby 2.0.0-p0. Reproducible with ruby 1.9.3-p194 / libxml 2.8.0.

$ diff nokogiri-1.5.5/lib/nokogiri/xml/document.rb nokogiri-1.5.6/lib/nokogiri/xml/document.rb
152,154c152,155
<       # Note this is a very expensive operation in current implementation, as it
<       # traverses the entire graph, and also has to bring each node across the
<       # libxml bridge into a ruby object.

---
>       # Note that this method does an xpath lookup for nodes with
>       # namespaces, and as a result the order may be dependent on the
>       # implementation of the underlying XML library.
>       #
156,158c157,160
<         ns = {}
<         traverse { |j| ns.merge!(j.namespaces) }
<         ns

---
>         xpath("//namespace::*").inject({}) do |hash, ns|
>           hash[["xmlns",ns.prefix].compact.join(":")] = ns.href if ns.prefix != "xml"
>           hash
>         end

Apart from the changed behaviour: Facing a document like the sample above (perfectly valid XML, I believe) a namespaces hash with attributes as keys won't fly, right? A list would. And break a million systems along the way. Essentially I'm looking for the proper mechanism to process these documents. Any hint appreciated.

By the way, the comment Non-prefixed default namespaces (as in “xmlns=”) are not included in the hash. on Nokogiri's::XML::Document#collect_namespaces needs updating.

@wolfgangw
Copy link
Contributor Author

See this real-world example with multiple default namespace declarations (Signer and Signature nodes).

@flavorjones
Copy link
Member

Thanks for pointing this out. As mentioned in the relevant commit (c13f9a5) Nokogiri is relying on the underlying XML library to provide the namespaces.

In this case, it's unclear to me what the "right" behavior is, since the return value is (as you point out) a hash and both namespaces cannot be returned.

Rather than classifying this as a bug, I think the answer is that #collect_namespaces should return an array of tuples; but I'm not going to make that kind of a breaking change without bumping the major version, so maybe an alternative method? I'm open to discussion here.

@wolfgangw
Copy link
Contributor Author

Yes, I don't see how you could touch #collect_namespaces without breaking a lot of dishes. Only that what's been served on those dishes is not correct for all cases. My code naively depended on 1.5.5 behaviour and fell flat on its nose after c13f9a5. Lesson learned. Interesting, though, that the change didn't raise much, if any, concern.

Can I suggest #collect_all_namespaces (returns [[node, prefix, href], [...]]) without being flagellated to pieces?

@wolfgangw
Copy link
Contributor Author

Using these in my code (dcp_inspect) now: Nokogiri::XML::Document#collect_all_namespaces_href_keys and Nokogiri::XML::Document#collect_all_namespaces_prefix_keys.

@leejarvis
Copy link
Member

@wolfgangw #collect_all_namespaces sounds fine to me. Could you draft this up?

@wolfgangw
Copy link
Contributor Author

Sure, define draft up? Where and how?

@leejarvis
Copy link
Member

@wolfgangw If you could make the changes you propose and submit a pull request, we can continue discussion over implementation there. This issue can be closed in favour.

@flavorjones
Copy link
Member

I'm going to close this, as it's not clear how we can fix this without changing the return value of collect_namespaces, which is sadly misdesigned given this edge case. I've captured this problem in the ROADMAP.md for future consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants