Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove .unwrap() in detach() #35

Closed
ronanM opened this issue Sep 22, 2024 · 4 comments · Fixed by #40
Closed

Remove .unwrap() in detach() #35

ronanM opened this issue Sep 22, 2024 · 4 comments · Fixed by #40
Assignees
Labels

Comments

@ronanM
Copy link

ronanM commented Sep 22, 2024

In a web crawler, when I strip the HTML tree, sometimes detach() panic on the .unwrap().

@LoZack19
Copy link
Contributor

LoZack19 commented Oct 9, 2024

Detatch panics in three situations:

  1. If its parent doesn't have any children (impossible)
  2. If it's the first and not only child of its parent, but it doesn't have a next sibling (impossible)
  3. If it's the last and not only child of its parent, but it doesn't have a previous sibling (impossible)

If detach went wrong, you must have a malformed tree. Could you point out which line of the detach method causes panic? I would also like a debug print of the tree you're using, in order to understand what is going wrong. If you provide us with more information, we could try to understand what part of the source code is failing.

Removing the unwrap isn't a solution, because if detatch fails, it means that there's something very wrong going on with the tree, and we must fix it. Some invariants must not be respected for this to happen.

@LoZack19 LoZack19 added the bug label Oct 9, 2024
@LoZack19 LoZack19 self-assigned this Oct 9, 2024
@LoZack19
Copy link
Contributor

@ronanM Could you point out which line of the detach method causes panic? Until you don't I am not able to help you.

@ronanM
Copy link
Author

ronanM commented Nov 28, 2024

I've deleted all my code. It was a function in a web crawler that removes all HTML tags that do not contain URLs.

Versions used:

scraper = "0.20.0"
html5ever = "0.27.0"
ego-tree = "0.6.3"

My program crawls lots of websites all over the world. The buggy detach() function was rare.

My function with a test Ok:

#[cfg(test)]
mod scrapper_test {
  use cow_utils::CowUtils;
  use ego_tree::NodeMut;
  use scraper::{Html, Node};

  fn strip_html(parsed_html: &mut Html) -> bool {
    fn depth_first_and_child_last_to_first(node_mut: &mut ego_tree::NodeMut<Node>, depth: usize) -> bool {
      // --- Recursive depth first (last child to first child, to detach without problem with NodeId).
      if let Some(mut current_child_id) = node_mut.last_child().map(|v| v.id()) {
        loop {
          let Some(mut current_child) = node_mut.tree().get_mut(current_child_id) else {
            break;
          };
          let prev_node_id_opt = current_child.prev_sibling().map(|v| v.id());

          if !depth_first_and_child_last_to_first(&mut current_child, depth + 1) {
            return false;
          }

          if let Some(prev) = prev_node_id_opt {
            current_child_id = prev;
          } else {
            break;
          }
        }
      }

      let node = node_mut.value();
      println!("{}🟡 {:?}", "  ".repeat(depth), &node);

      match node {
        Node::Element(e) => {
          let elt_name_lower = e.name.local.cow_to_ascii_lowercase();
          let is_one_of_the_most_used_tags_without_url_attribute =
            ["div", "li", "span", "p"].contains(&elt_name_lower.as_ref());

          if is_one_of_the_most_used_tags_without_url_attribute {
            e.attrs.clear()
          } else {
            let mut clean_element = |elt_name, atts_to_keep: &[&str]| {
              let found_elt = elt_name == elt_name_lower;

              if found_elt {
                e.attrs.retain(|att_name, att_value| {
                  let att_name_lower = att_name.local.cow_to_ascii_lowercase();
                  atts_to_keep.contains(&att_name_lower.as_ref()) && !att_value.starts_with("data:")
                });
              }

              found_elt
            };

            // https://stackoverflow.com/questions/2725156/complete-list-of-html-tag-attributes-which-have-a-url-value
            let attributes_already_cleaned = clean_element("a", &["href"])
              || clean_element("img", &["src", "longdesc", "usemap"])
              || clean_element("link", &["href"])
              || clean_element("script", &["src"])
              || clean_element("base", &["href"])
              || clean_element("html", &["manifest"])
              || clean_element("body", &["background"])
              || clean_element("form", &["action"])
              || clean_element("frame", &["src", "longdesc"])
              || clean_element("iframe", &["src", "longdesc"])
              || clean_element("audio", &["src"])
              || clean_element("video", &["src", "poster"]);

            if !attributes_already_cleaned {
              e.attrs.clear()
            }
          }

          // println!("⭕ {:?}", &e);

          if e.attrs.is_empty() {
            if node_mut.has_children() {
              if depth > 2 {
                reattach_children_to_parent_then_delete_current_node(node_mut).is_some()
              } else {
                true
              }
            } else {
              checked_detach(node_mut)
            }
          } else {
            true
          }
        }

        Node::Text(t) => checked_detach(node_mut),

        Node::Comment(c) => checked_detach(node_mut),

        _ => true,
      }
    }

    fn reattach_children_to_parent_then_delete_current_node(node_mut: &mut NodeMut<Node>) -> Option<()> {
      let mut attach_children_to_their_grand_parent = || -> Option<()> {
        let parent_id = node_mut.parent()?.id();
        let mut current_child_id = node_mut.last_child()?.id();
        let tree = node_mut.tree();

        loop {
          let mut parent = tree.get_mut(parent_id)?;
          parent.prepend_id(current_child_id);

          let prev_id_opt = tree.get_mut(current_child_id)?.prev_sibling().map(|v| v.id());
          match prev_id_opt {
            Some(prev_id) => current_child_id = prev_id,
            None => return Some(()),
          }
        }
      };

      attach_children_to_their_grand_parent()?;

      checked_detach(node_mut).then_some(())
    }

    fn checked_detach(node_mut: &mut NodeMut<Node>) -> bool {
      // /!\ Check precondition because `NodeMut::detach()` is not safe. /!\
      let is_good_precondition = (|| {
        let mut parent = node_mut.parent()?;
        let first_child_id = parent.first_child().map(|v| v.id())?;
        let last_child_id = parent.last_child().map(|v| v.id())?;
        let id = node_mut.id();

        if first_child_id != last_child_id {
          if first_child_id == id {
            node_mut.next_sibling()?;
          } else if last_child_id == id {
            node_mut.prev_sibling()?;
          }
        }
        Some(())
      })()
      .is_some();

      if is_good_precondition {
        node_mut.detach();
      } else {
        println!("💥 🔥 🧨 Bugged detach 🧨 💥 🔥");
      }
      is_good_precondition
    }

    depth_first_and_child_last_to_first(&mut parsed_html.tree.root_mut(), 0)
  }

  #[test]
  fn strip_html_test_current_test() {
    let mut html = scraper::html::Html::parse_document(
      r#"
    <html att=html>
      <head att=head>
        <!-- comment head 1-->
        <meta content="httrack" name="keywords">
        <style type="text/css">css rules</style>
        <script language="js">javascript</script>
      </head>
      <body att=1>
        <div body=1>
          <div body=11>
            <div body=111>some text</div>
            <div body=112>some <br> br  <!--comment div 112--> </div>
            some text
          </div>
          <div body=12>
            <div body=121></div>
            <div body=122></div>
          </div>
        </div>
        <div body=2>
          <span><a href="deep-link"></a></span>
        </div>
        <!-- comment body 1-->
        <other att=1></other>
        <form action="url" att=a>
          <!-- comment form 1-->
          <other att=2></other>
          <div>
            <div>
              <img src="url">
            </div>
          </div>
          <other att=3></other>
        </form>
        <other att=4></other>
      </body>
    </html>
    "#,
    );

    let is_good_stripping = strip_html(&mut html);
    assert!(is_good_stripping);

    assert_eq!(
      html.html(),
      r#"<html><body><a href="deep-link"></a><form action="url"><img src="url"></form></body></html>"#
    )
  }
}

@adamreichold
Copy link
Member

This is too much code for a reproducer and not actually a reproducer without an input that triggers a panic.

One bug I think I see that might be triggered by your code is that prepend_id does not appear to prevent adding a node to itself as a child which could produce a malformed tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants