Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri.parse accepts Pathname object as argument, but only work with small files then. #1821

Closed
wants to merge 4 commits into from

Conversation

phokz
Copy link

@phokz phokz commented Nov 28, 2018

I hit this yesterday wit code like
doc = Nokogiri.parse(Rails.root.join('config','file.xml'))

when the file.xml was smaller than about 4k bytes, it worked as a charm.
With larger files, weird behavior was happening. E.g. resulting parsed document
was containing duplicate records.

Then I found this:
p = Pathname.new('/etc/passwd')
[p.read(10), p.read(10)]
=> ["root:x:0:0", "root:x:0:0"]

Nokogiri accepts either string or whatever responds to 'read' method.
Pathname objects respond to read, but they do not keep state or file opened.
Nokogiri reads input in chunks, calling read(SOME_CHUNK_SIZE) possibly several times, based on file size. Pathname#read allways reads from the beginning of file.

With correct alignment of last element before SOME_CHUNK_SIZE boundary, this could be possibly
used to DoS attack that causes OOM (demo attached in commit d90bf40 in my fork).

In PR there are failing tests and proposed fix. It might not be the best approach to fix this.
Maybe other solution would be just to raise an exception when object in string_or_io has none of well known ancestors like String, File, StringIO ...

@flavorjones
Copy link
Member

Thank you for submitting this! I will take a look as soon as I can.

@flavorjones
Copy link
Member

@phokz Thanks again for submitting this. I'm looking at the implementation for Pathname#read, and yeah, it's invoking IO::read and yeah, that is absolutely going to be an infinite loop every time.

The fix you're suggesting in this PR, though, has one drawback that I think can be avoided: it reads the entire document into memory, which for large docs can be a performance penalty (memory usage). Instead what I'd like to do is open a File object and use that instead.

I'll submit my changes as a separate PR and then we can chat about the advantages. Either way, I'd like to credit you in the CHANGELOG and use your tests, so you'll have a commit in git history as well.

@flavorjones
Copy link
Member

There are also a few other places in the code where we're calling #read and I'd like to add some test coverage for those and address this problem there as well.

@flavorjones
Copy link
Member

Hmm. Thinking about this a bit more, and this feels like we're adding a new feature, which is to allow Pathname arguments to the various parse methods. This opens the door to supporting any object's #read method, when really we just want to support IO objects.

@tenderlove I'm curious what you think about this. I ❤️ ducktyping, but here Pathname is not obeying the IO#read contract and so I'm inclined to change how Nokogiri's determines its behavior from checking responds_to?(:read) to checking kind_of?(IO).

If we do that, then #read_memory will get invoked on the Pathname instead of #read_io, and that will lead to an exception getting raised like TypeError (no implicit conversion of Pathname into String) which is super-clear to the developer calling it.

@phokz
Copy link
Author

phokz commented Jan 10, 2019

Hi, thank you for diving into this issue. My solution was a quick hack. I quite like checking kind_of?(IO) way.

Best regards
Josef

@flavorjones
Copy link
Member

@phokz Thanks, I think I will schedule a fix for v1.11.0.

@tenderlove
Copy link
Member

I think we may want to check for to_io rather than read. I hit a similar issue in Stackprof and solved it by checking whether the thing responds to to_io or not.

@flavorjones
Copy link
Member

Interestingly, StringIO does not implement #to_io nor is it a kind_of?(IO) ...

flavorjones added a commit that referenced this pull request Apr 7, 2019
because implementing read like this can result in nondeterministic
behavior. see related #1821 and #1888.
@flavorjones
Copy link
Member

flavorjones commented Sep 10, 2020

See additional context from user help request at https://gitter.im/sparklemotion/nokogiri?at=5f5a485bf969413294d84734

hello = Pathname("hello.xml")

puts hello.read
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
</project>
# => nil

require 'nokogiri'
# => true

doc = Nokogiri::XML(hello)

puts doc
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
</project>
<?xml version="1.0" encoding="UTF-8"?>
# => nil

@doriantaylor
Copy link
Contributor

Crashing through the wall from #2110; if I was gonna do this I'd explicitly check is_a? Pathname and then have it .expand_path.open. I also lament that StringIO does not inherit IO.

flavorjones pushed a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
flavorjones added a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
- no need to resolve Pathname in Nokogiri.parse
- added test coverage for XML::Document and HTML::Document#parse
- moved resolution of Pathname to after `url` is set in HTML::Document#parse
- open files with implict "r" because "b" assumes ASCII-8BIT encoding

Part of sparklemotion#2110, sparklemotion#1821
flavorjones added a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
@flavorjones
Copy link
Member

flavorjones commented Nov 12, 2020

Closing this PR in favor of #2111. @phokz you'll get a shout-out in the CHANGELOG!

flavorjones pushed a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
flavorjones added a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
- no need to resolve Pathname in Nokogiri.parse
- added test coverage for XML::Document and HTML::Document#parse
- moved resolution of Pathname to after `url` is set in HTML::Document#parse
- open files with implict "r" because "b" assumes ASCII-8BIT encoding

Part of sparklemotion#2110, sparklemotion#1821
flavorjones added a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
flavorjones added a commit to doriantaylor/nokogiri that referenced this pull request Nov 12, 2020
flavorjones added a commit that referenced this pull request Nov 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants