Migrate Nokogiri::XML::Document to the TypedData API #2807

Merged

etiennebarrie
Contributor

Ref: #2806

Paired with @byroot. First we created a rb_data_type_t and then used it with TypedData_Get_Struct everywhere we're unwrapping an xmlDoc.
Then we added a limited version of dsize, which counts all the children nodes of the document recursively.

This allows ObjectSpace.memsize_of to be closer to reality. Before the second commit:

$ ruby -rbundler/setup -rnokogiri -robjspace -e 'puts ObjectSpace.memsize_of(Nokogiri::XML(File.read("test/files/po.xml")))'
88

After the second commit:

$ ruby -rbundler/setup -rnokogiri -robjspace -e 'puts ObjectSpace.memsize_of(Nokogiri::XML(File.read("test/files/po.xml")))'
9144

We're missing other members of the struct, and we're not sure which ones would be the most important for getting a correct size, but this is a good start.
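
To make the "counts all the children nodes recursively" part concrete, here is a minimal Ruby model of that kind of walk. It is only a sketch: ToyNode, NODE_OVERHEAD and toy_memsize are hypothetical names for illustration, not Nokogiri or libxml2 APIs, and the real dsize callback is C code walking the libxml2 structs.

```ruby
# Toy stand-in for a DOM node (hypothetical; not libxml2's xmlNode).
ToyNode = Struct.new(:content, :children)

# Assumed fixed per-node cost in bytes, chosen only for illustration.
NODE_OVERHEAD = 120

# Recursively add up this node's cost and the cost of every descendant.
def toy_memsize(node)
  own = NODE_OVERHEAD + node.content.bytesize
  node.children.reduce(own) { |total, child| total + toy_memsize(child) }
end

leaf_a = ToyNode.new("ab", [])
leaf_b = ToyNode.new("xyz", [])
root   = ToyNode.new("", [leaf_a, leaf_b])

puts toy_memsize(root) # prints 365
```

Anything the walk does not visit (the "other members" mentioned above) is simply missing from the total, which is why the reported number is a lower bound.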

@etiennebarrie
Contributor Author

Oh, I forgot to mention one difference from #2806: since we unwrap xmlDoc structs outside of ext/nokogiri/xml_document.c, we had to declare noko_xml_document_data_type as extern in ext/nokogiri/nokogiri.h and couldn't make it static in ext/nokogiri/xml_document.c. That makes it an external symbol, which is why we prefixed it with noko_.

Also, memsize and memsize_node can be made static; I'll push an amended commit.

@etiennebarrie etiennebarrie force-pushed the xml-document-typed-data branch from 63b18c3 to 2e25ff1 March 3, 2023 13:33
@flavorjones
Member

flavorjones commented Mar 3, 2023

Thanks for starting this!

One note: I don't think we need to implement memsize, Nokogiri forces libxml2 to use ruby_xmalloc and friends for memory allocation: https://github.com/sparklemotion/nokogiri/blob/main/ext/nokogiri/nokogiri.c#L203

@byroot
Contributor

byroot commented Mar 3, 2023

> I don't think we need to implement memsize, Nokogiri forces libxml2 to use ruby_xmalloc and friends for memory allocation

That's unrelated.

Using ruby_xmalloc lets the GC know how much memory was allocated overall, so that it can feed its heuristics for deciding when to trigger. But it doesn't know which object the memory was allocated for.

The dsize function is only used through ObjectSpace.memsize_of and ObjectSpace.dump to estimate the size of an object, and is useful for memory profilers like heap-profiler, memory_profiler etc. It's to answer the question: "How much memory would I free if I got rid of this object".
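
For reference, both entry points can be exercised from plain Ruby on a built-in object. This is a small sketch: reported_sizes is a hypothetical helper, and the exact numbers depend on the Ruby version and platform, so none are shown.

```ruby
require "objspace"
require "json"

# Collect the size that each profiling entry point reports for `obj`.
def reported_sizes(obj)
  {
    memsize_of:   ObjectSpace.memsize_of(obj),
    # ObjectSpace.dump returns a JSON string by default; it carries a
    # "memsize" field alongside the other details of the object.
    dump_memsize: JSON.parse(ObjectSpace.dump(obj))["memsize"],
  }
end

big = "x" * 10_000 # large enough that the bytes live outside the object slot
p reported_sizes(big)
```

For a typed data object such as the wrapped document, the same two calls end up in the dsize callback of its rb_data_type_t.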

@byroot
Contributor

byroot commented Mar 3, 2023

Also IMHO this PR is good to go. The memsize function could maybe be made more accurate, but it's already a good start.

@flavorjones
Member

@byroot Thanks for the additional context, I'm not familiar at all with dsize. Just to confirm my understanding, it's not called at all during normal GC runs? Only if you explicitly call those profiling methods?

Any reason we didn't implement dsize on XML::Node instead of here? The memory model here is pretty complex (Nodes hold references to the parent Document, for example) so I want to make sure we're returning useful information.

@byroot
Contributor

byroot commented Mar 3, 2023

> Just to confirm my understanding, it's not called at all during normal GC runs? Only if you explicitly call those profiling methods?

Yup. That's right.

> Any reason we didn't implement dsize on XML::Node instead of here?

That's a good question. The reason is you want to avoid double accounting, so if you have multiple objects that ultimately point to the same native structures, you don't want to report the same memory regions twice.

So given that XML::Node can't exist without an XML::Document, and that not all XML::Node of a document are always allocated, it seemed natural to "blame" the XML::Document for the whole DOM, and to consider that XML::Node don't actually hold any memory, they simply point to the Document memory.

So when XML::Node is converted, its dsize function should return 0. Hope my explanation makes sense.
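
The accounting rule described above can be sketched as a toy Ruby model. ToyDocument, ToyNodeHandle and total_reported are hypothetical names; in the real extension the sizes come from the dsize callbacks of the C data types.

```ruby
# Owns the whole native tree, so it reports all of its bytes.
class ToyDocument
  def initialize(native_bytes)
    @native_bytes = native_bytes
  end

  def memsize
    @native_bytes
  end
end

# A handle into memory owned by the document; it reports nothing itself.
class ToyNodeHandle
  def initialize(document)
    @document = document
  end

  def memsize
    0
  end
end

# What a profiler would see when summing sizes over all live objects.
def total_reported(objects)
  objects.sum(&:memsize)
end

doc     = ToyDocument.new(9144)
handles = Array.new(3) { ToyNodeHandle.new(doc) }
puts total_reported([doc, *handles]) # prints 9144
```

If the handles reported their share of the tree as well, the same bytes would show up twice in the total.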

@flavorjones
Member

That makes a lot of sense, thanks.

Please note that XML::Node was already converted last year as part of #2578 / #2579. Do we need to implement XML::Node's dsize, or is the absence of that method an implicit 0?

@byroot
Contributor

byroot commented Mar 3, 2023

Urk:

[BUG] object allocation during garbage collection phase

-- C level backtrace information -------------------------------------------
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(0x7fb325302639) [0x7fb325302639]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(0x7fb325302dbb) [0x7fb325302dbb]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(rb_bug+0xeb) [0x7fb3250ff0d9]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(0x7fb3250ff77a) [0x7fb3250ff77a]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(0x7fb325104377) [0x7fb325104377]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(rb_ary_dup+0x22) [0x7fb32510ea52]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(0x7fb3252ee1a7) [0x7fb3252ee1a7]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(rb_vm_exec+0x1d2) [0x7fb3252efcf2]
/lib/x86_64-linux-gnu/libruby-2.7.so.2.7(rb_yield+0x271) [0x7fb3252fcf81]

I'm not 100% sure, but that looks like a Ruby bug, nokogiri is nowhere to be seen in the backtrace. cc @peterzhu2118, what do you think?

@byroot
Contributor

byroot commented Mar 3, 2023

Full crash for posterity: https://gist.github.com/byroot/150db7e3eb5900da552dd795791fca90

I see it's ruby 2.7.0-p0, it may make sense to test on latest 2.7 instead.

@flavorjones
Member

I get that crash with ruby 3.2.0 as well.

@byroot
Contributor

byroot commented Mar 3, 2023

On other branches?

@flavorjones
Member

Here's the stack walkback:

(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7542859 in __GI_abort () at abort.c:79
#2  0x00007ffff7a30dd7 in die () at error.c:800
#3  rb_bug (fmt=fmt@entry=0x7ffff7e04dd8 "object allocation during garbage collection phase") at error.c:800
#4  0x00007ffff7a3199a in newobj_slowpath (size_pool_idx=0, wb_protected=1, cr=0x55555555dde0, 
    objspace=0x55555555d0f0, flags=5, klass=140737283550600) at gc.c:2812
#5  newobj_slowpath_wb_protected (klass=140737283550600, flags=5, objspace=0x55555555d0f0, cr=0x55555555dde0, 
    size_pool_idx=0) at gc.c:2843
#6  0x00007ffff7b086c4 in newobj_of0 (alloc_size=25, cr=<optimized out>, wb_protected=1, flags=5, 
    klass=140737283550600) at gc.c:2885
#7  newobj_of (alloc_size=25, wb_protected=1, v3=0, v2=0, v1=0, flags=5, klass=140737283550600) at gc.c:2896
#8  rb_wb_protected_newobj_of (klass=klass@entry=140737283550600, flags=flags@entry=5, size=size@entry=25)
    at gc.c:2918
#9  0x00007ffff7c45e80 in str_alloc_embed (capa=1, klass=140737283550600) at string.c:926
#10 str_new0 (klass=140737283550600, ptr=0x0, len=0, termlen=1) at string.c:926
#11 0x00007ffff7c4b4df in str_new_static (encindex=0, len=0, ptr=0x0, klass=140737283550600)
    at ./include/ruby/internal/encoding/encoding.h:450
#12 rb_enc_str_new_static (ptr=ptr@entry=0x0, len=len@entry=0, enc=enc@entry=0x0) at string.c:1072
#13 0x00007ffff7ad63a4 in warn_vsprintf (args=0x7fffffffa0f0, 
    fmt=0x7ffff7e04f08 "realloc during GC detected, this could cause crashes if it triggers another GC", line=34, 
    file=0x7ffff6d6f228 "<internal:gc>", enc=0x0) at error.c:399
#14 warning_string (enc=0x0, 
    fmt=0x7ffff7e04f08 "realloc during GC detected, this could cause crashes if it triggers another GC", 
    args=0x7fffffffa0f0) at error.c:399
#15 0x00007ffff7a30cf3 in rb_warn (
    fmt=fmt@entry=0x7ffff7e04f08 "realloc during GC detected, this could cause crashes if it triggers another GC")
    at error.c:414
#16 0x00007ffff7a31a1f in objspace_xrealloc (objspace=0x55555555d0f0, ptr=0x55555697b200, new_size=3, 
    old_size=<optimized out>) at gc.c:12310
#17 0x00007ffff21f1e5c in xmlStrncat (len=1, add=0x555556a82ed0 "x", cur=0x55555697b200 "x") at xmlstring.c:460
#18 xmlStrncat (cur=0x55555697b200 "x", add=0x555556a82ed0 "x", len=1) at xmlstring.c:446
#19 0x00007ffff21a7b14 in xmlNodeAddContentLen (len=<optimized out>, content=0x555556a82ed0 "x", cur=0x555556aa1060)
    at tree.c:5938
#20 xmlNodeAddContentLen (cur=0x555556aa1060, content=0x555556a82ed0 "x", len=<optimized out>) at tree.c:5896
#21 0x00007ffff21a78ac in xmlAddChild (parent=0x55555685ad60, parent@entry=0x555556aa1060, cur=0x5555567ba110)
    at tree.c:3446
#22 0x00007ffff2166c08 in dealloc_node_i2 (key=0x5555567ba110, doc=0x555556aa1060, node=<optimized out>)
    at ../../../../ext/nokogiri/xml_document.c:20
#23 dealloc_node_i (key=key@entry=93825011523856, node=<optimized out>, doc=doc@entry=93825012182368)
    at ../../../../ext/nokogiri/xml_document.c:29
#24 0x00007ffff7c3d6b6 in apply_functor (_=0, d=<synthetic pointer>, v=<optimized out>, k=93825011523856)
    at st.c:1574
#25 st_general_foreach (check_p=0, arg=<synthetic pointer>, replace=0x0, func=<optimized out>, tab=0x5555567ba110)
    at st.c:1484
#26 rb_st_foreach (tab=tab@entry=0x5555569a5740, func=func@entry=0x7ffff2166ba0 <dealloc_node_i>, 
    arg=arg@entry=93825012182368) at st.c:1581
#27 0x00007ffff21668e7 in dealloc (data=0x55555685ad60) at ../../../../ext/nokogiri/xml_document.c:72
#28 0x00007ffff7afc1e9 in obj_free (objspace=0x55555555d0f0, obj=140737245210600) at gc.c:3594
#29 0x00007ffff7afd4b9 in gc_sweep_plane (heap=<optimized out>, ctx=<synthetic pointer>, bitset=43993350013709, 
    p=140737245210600, objspace=<optimized out>) at gc.c:5584
#30 gc_sweep_page (heap=<optimized out>, ctx=<synthetic pointer>, objspace=<optimized out>) at gc.c:5669
#31 gc_sweep_step (objspace=<optimized out>, size_pool=0x55555555d118, heap=<optimized out>) at gc.c:5980
#32 0x00007ffff7afef7e in gc_sweep_rest (objspace=<optimized out>) at gc.c:6046
#33 gc_sweep (objspace=0x55555555d0f0) at gc.c:6203
#34 0x00007ffff7b04e49 in gc_marks (full_mark=<optimized out>, objspace=0x55555555d0f0) at gc.c:8716
#35 gc_start (objspace=0x55555555d0f0, reason=<optimized out>) at gc.c:9547
#36 0x00007ffff7b06624 in garbage_collect (reason=107520, objspace=0x55555555d0f0) at gc.c:9428

@byroot
Contributor

byroot commented Mar 3, 2023

> is the absence of that method an implicit 0?

Yes, the absence is an implicit 0.

@byroot
Contributor

byroot commented Mar 3, 2023

Oh nice:

"realloc during GC detected, this could cause crashes if it triggers another GC",

So that was the original issue, and while trying to print that warning, it allocated and caused an actual crash 🤣

@byroot
Contributor

byroot commented Mar 3, 2023

Ok, my understanding of that stack trace is:

  • We free an XML::Document via dealloc (in xml_document.c)
  • We call dealloc_node_i2, which somehow calls xmlAddChild
  • That xmlAddChild (why do we add a child when freeing???) does a realloc, and since you configured libxml2 to use Ruby's allocation functions, the GC complains
  • And then we hit that Ruby bug, because it should only have been a warning, but it turns into a crash.

This is very intriguing, I'll dig a bit more. Just an observation: this crash wouldn't happen if libxml2 were using regular malloc (not saying it should).

@peterzhu2118
Contributor

peterzhu2118 commented Mar 3, 2023

I added that feature in ruby/ruby#6921, it's probably not a good idea to use rb_warn and I should use something else that doesn't allocate....

@flavorjones
Member

flavorjones commented Mar 3, 2023

Just give me a bit of time to dig into what's going on, thanks. The memory model here is complicated and a bit fragile, unfortunately.

@byroot
Contributor

byroot commented Mar 3, 2023

Ok, so the culprit is here:

xmlAddChild((xmlNodePtr)doc, node);

I'm pretty sure the issue already exists on main. I don't know the libxml2 API well enough to tell why a node is added and not removed, though.

> Just give me a bit of time to dig into what's going on

Yeah, no rush at all.

@casperisfine

Seems like the origin of this addChild is dafc963

The reason is that these nodes are no longer in the document, but I think they could just be freed directly with xmlFreeNode().

@flavorjones
Member

What's happening here is that two text nodes in a row are being "deallocated" in dealloc_node_i2, and libxml2 is merging them, leading to dangling pointers. This is probably due to subtle ordering changes introduced by the TypedData changeover.
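
A toy Ruby model of the merging behaviour may help here. It only mimics the rule that appending a text node right after another text node folds the new content into the existing node. ToyText, ToyElement and toy_add_child are hypothetical names, and the real work happens in libxml2's C code (the xmlAddChild, xmlNodeAddContentLen and xmlStrncat frames in the backtrace above).

```ruby
ToyText    = Struct.new(:content)
ToyElement = Struct.new(:name)

# Append `node` to `children`, merging adjacent text nodes.
# Returns the node that actually ends up in the list.
def toy_add_child(children, node)
  last = children.last
  if node.is_a?(ToyText) && last.is_a?(ToyText)
    # The existing node's buffer has to grow to hold the extra content
    # (the realloc in the C version), and `node` itself is discarded.
    last.content += node.content
    return last
  end
  children << node
  node
end

children = []
first  = toy_add_child(children, ToyText.new("x"))
second = toy_add_child(children, ToyText.new("x"))

puts children.length         # prints 1
puts children.first.content  # prints xx
puts second.equal?(first)    # prints true
```

In the C version the discarded node is freed, so anything still holding its address is left with a dangling pointer.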

Let's sit on this for a bit and do the rest of the classes. I need to think about how to work around this.

@flavorjones
Member

Ah, @casperisfine suggested removing the FREE_IMMEDIATELY and I think that's working. Will throw a commit onto this PR.

@flavorjones flavorjones force-pushed the xml-document-typed-data branch from 2e25ff1 to 7216317 March 3, 2023 21:32
@flavorjones
Member

OK, I've rebased onto origin/main and also removed the FREE_IMMEDIATELY flag. Let's see what CI says.

This is a tricky one to explain, but in summary, when dealloc_node_i2
parents unparented nodes, libxml2 may merge two adjacent text nodes,
which causes additional memory allocations during GC.

If the FREE_IMMEDIATELY flag is set, this will generate warnings. But
generating those warnings during GC will lead to a segfault for
reasons I haven't dug into yet.

Anyway, let's leave this flag off for now.
@flavorjones flavorjones force-pushed the xml-document-typed-data branch from 7216317 to 0804380 March 3, 2023 21:34
@flavorjones
Member

I've added a commit to improve the fidelity of the memsize calculation, and added some test coverage.

@flavorjones flavorjones force-pushed the xml-document-typed-data branch from 9ea11fc to b026f60 March 4, 2023 21:43
and include test coverage for it
@flavorjones flavorjones force-pushed the xml-document-typed-data branch from b026f60 to 6b23461 March 5, 2023 02:29
@flavorjones
Member

Thanks, @etiennebarrie and @casperisfine !

@flavorjones flavorjones merged commit cca6353 into sparklemotion:main Mar 5, 2023
flavorjones added a commit that referenced this pull request Mar 7, 2023
See #2808

Note that I chose to make a new public function,
`noko_xml_node_set_unwrap`, to allow other files to unwrap node
sets. This differs from the tactic adopted in #2807 / cb1557d which
was to make public the data type.

We should probably decide which is the better approach and
standardize on it.