-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove .unwrap()
in detach()
#35
Comments
Detatch panics in three situations:
If detach went wrong, you must have a malformed tree. Could you point out which line of the detach method causes panic? I would also like a debug print of the tree you're using, in order to understand what is going wrong. If you provide us with more information, we could try to understand what part of the source code is failing. Removing the unwrap isn't a solution, because if detatch fails, it means that there's something very wrong going on with the tree, and we must fix it. Some invariants must not be respected for this to happen. |
@ronanM Could you point out which line of the detach method causes panic? Until you don't I am not able to help you. |
I've deleted all my code. It was a function in a web crawler that removes all HTML tags that do not contain URLs. Versions used:
My program crawls lots of websites all over the world. The buggy My function with a test Ok: #[cfg(test)]
mod scrapper_test {
use cow_utils::CowUtils;
use ego_tree::NodeMut;
use scraper::{Html, Node};
fn strip_html(parsed_html: &mut Html) -> bool {
fn depth_first_and_child_last_to_first(node_mut: &mut ego_tree::NodeMut<Node>, depth: usize) -> bool {
// --- Recursive depth first (last child to first child, to detach without problem with NodeId).
if let Some(mut current_child_id) = node_mut.last_child().map(|v| v.id()) {
loop {
let Some(mut current_child) = node_mut.tree().get_mut(current_child_id) else {
break;
};
let prev_node_id_opt = current_child.prev_sibling().map(|v| v.id());
if !depth_first_and_child_last_to_first(&mut current_child, depth + 1) {
return false;
}
if let Some(prev) = prev_node_id_opt {
current_child_id = prev;
} else {
break;
}
}
}
let node = node_mut.value();
println!("{}🟡 {:?}", " ".repeat(depth), &node);
match node {
Node::Element(e) => {
let elt_name_lower = e.name.local.cow_to_ascii_lowercase();
let is_one_of_the_most_used_tags_without_url_attribute =
["div", "li", "span", "p"].contains(&elt_name_lower.as_ref());
if is_one_of_the_most_used_tags_without_url_attribute {
e.attrs.clear()
} else {
let mut clean_element = |elt_name, atts_to_keep: &[&str]| {
let found_elt = elt_name == elt_name_lower;
if found_elt {
e.attrs.retain(|att_name, att_value| {
let att_name_lower = att_name.local.cow_to_ascii_lowercase();
atts_to_keep.contains(&att_name_lower.as_ref()) && !att_value.starts_with("data:")
});
}
found_elt
};
// https://stackoverflow.com/questions/2725156/complete-list-of-html-tag-attributes-which-have-a-url-value
let attributes_already_cleaned = clean_element("a", &["href"])
|| clean_element("img", &["src", "longdesc", "usemap"])
|| clean_element("link", &["href"])
|| clean_element("script", &["src"])
|| clean_element("base", &["href"])
|| clean_element("html", &["manifest"])
|| clean_element("body", &["background"])
|| clean_element("form", &["action"])
|| clean_element("frame", &["src", "longdesc"])
|| clean_element("iframe", &["src", "longdesc"])
|| clean_element("audio", &["src"])
|| clean_element("video", &["src", "poster"]);
if !attributes_already_cleaned {
e.attrs.clear()
}
}
// println!("⭕ {:?}", &e);
if e.attrs.is_empty() {
if node_mut.has_children() {
if depth > 2 {
reattach_children_to_parent_then_delete_current_node(node_mut).is_some()
} else {
true
}
} else {
checked_detach(node_mut)
}
} else {
true
}
}
Node::Text(t) => checked_detach(node_mut),
Node::Comment(c) => checked_detach(node_mut),
_ => true,
}
}
fn reattach_children_to_parent_then_delete_current_node(node_mut: &mut NodeMut<Node>) -> Option<()> {
let mut attach_children_to_their_grand_parent = || -> Option<()> {
let parent_id = node_mut.parent()?.id();
let mut current_child_id = node_mut.last_child()?.id();
let tree = node_mut.tree();
loop {
let mut parent = tree.get_mut(parent_id)?;
parent.prepend_id(current_child_id);
let prev_id_opt = tree.get_mut(current_child_id)?.prev_sibling().map(|v| v.id());
match prev_id_opt {
Some(prev_id) => current_child_id = prev_id,
None => return Some(()),
}
}
};
attach_children_to_their_grand_parent()?;
checked_detach(node_mut).then_some(())
}
fn checked_detach(node_mut: &mut NodeMut<Node>) -> bool {
// /!\ Check precondition because `NodeMut::detach()` is not safe. /!\
let is_good_precondition = (|| {
let mut parent = node_mut.parent()?;
let first_child_id = parent.first_child().map(|v| v.id())?;
let last_child_id = parent.last_child().map(|v| v.id())?;
let id = node_mut.id();
if first_child_id != last_child_id {
if first_child_id == id {
node_mut.next_sibling()?;
} else if last_child_id == id {
node_mut.prev_sibling()?;
}
}
Some(())
})()
.is_some();
if is_good_precondition {
node_mut.detach();
} else {
println!("💥 🔥 🧨 Bugged detach 🧨 💥 🔥");
}
is_good_precondition
}
depth_first_and_child_last_to_first(&mut parsed_html.tree.root_mut(), 0)
}
#[test]
fn strip_html_test_current_test() {
let mut html = scraper::html::Html::parse_document(
r#"
<html att=html>
<head att=head>
<!-- comment head 1-->
<meta content="httrack" name="keywords">
<style type="text/css">css rules</style>
<script language="js">javascript</script>
</head>
<body att=1>
<div body=1>
<div body=11>
<div body=111>some text</div>
<div body=112>some <br> br <!--comment div 112--> </div>
some text
</div>
<div body=12>
<div body=121></div>
<div body=122></div>
</div>
</div>
<div body=2>
<span><a href="deep-link"></a></span>
</div>
<!-- comment body 1-->
<other att=1></other>
<form action="url" att=a>
<!-- comment form 1-->
<other att=2></other>
<div>
<div>
<img src="url">
</div>
</div>
<other att=3></other>
</form>
<other att=4></other>
</body>
</html>
"#,
);
let is_good_stripping = strip_html(&mut html);
assert!(is_good_stripping);
assert_eq!(
html.html(),
r#"<html><body><a href="deep-link"></a><form action="url"><img src="url"></form></body></html>"#
)
}
} |
This is too much code for a reproducer and not actually a reproducer without an input that triggers a panic. One bug I think I see that might be triggered by your code is that |
In a web crawler, when I strip the HTML tree, sometimes
detach()
panic on the.unwrap()
.The text was updated successfully, but these errors were encountered: