Attribute names including `;` cannot be converted to W3C DOM; subsequent XPath query may be incorrect #2244

aianta · 2024-12-09T23:23:17Z

Simple reproducer below; the attached HTML text file is the document in question. I'm using jsoup version 1.18.3.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import static org.junit.jupiter.api.Assertions.assertTrue;

public class SelectXpathTest {

    // Absolute XPath recorded for the target <input> element.
    private static final String xpath = "/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]";

    @Test
    public void test() throws IOException {
        // Read and parse the attached document.
        String html = new String(Files.readAllBytes(Path.of("src/test/resources/debug.html.txt")));
        Document document = Jsoup.parse(html);

        int foundElements = document.selectXpath(xpath).size();
        System.out.println("Found %s elements".formatted(foundElements));

        // assertTrue does not depend on the JVM's -ea flag, unlike the assert keyword.
        assertTrue(foundElements > 0);
    }

}

debug.html.txt

I have evidence that the XPath provided should be resolvable: I can open the attached HTML in Firefox and, with the following JS helper defined,

function getElementByXpath(path) {
    return document.evaluate(path, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}

run the following command in the console:

getElementByXpath("/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]")

to get:

<input id="assignment_text_entry" type="checkbox" value="1" name="online_submission_types[online_text_entry]" aria-label="Online Submission Type - Text Entry" style="">

Initial debugging shows that the last element JSoup is able to retrieve along the xpath is:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]

When I retrieve the children for /html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1], I get 7 children, then:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[1] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[2] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[3] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[4] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5] -> Nope...
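The step-by-step narrowing above can be automated. The following is a minimal sketch, assuming the XPath consists only of simple `/`-separated steps with no `/` inside predicates; the class and method names are hypothetical. It only produces the cumulative prefixes, which could then each be passed to `document.selectXpath(...)` to find the first step that returns nothing.

```java
import java.util.ArrayList;
import java.util.List;

public class XPathPrefixes {

    // Splits an absolute XPath such as /html/body/div[3] into its cumulative prefixes.
    // Assumption: steps are separated by '/' and no predicate contains a '/'.
    public static List<String> prefixes(String absoluteXpath) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String step : absoluteXpath.split("/")) {
            if (step.isEmpty()) {
                continue; // skip the empty token before the leading '/'
            }
            current.append('/').append(step);
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String xpath = "/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]";
        for (String prefix : prefixes(xpath)) {
            // With jsoup on the classpath, each prefix could be checked with
            // document.selectXpath(prefix).isEmpty() to locate the failing step.
            System.out.println(prefix);
        }
    }
}
```

Checking the prefixes from shortest to longest stops at the first unresolvable step, which is how the `div[5]` boundary above was found by hand.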


jhy · 2024-12-11T00:36:48Z

Hi,

In a case like this my first thought is that you have a different DOM tree in Firefox vs jsoup. Given the javascript requirement and sources on the page, I expect that is what is causing the tree difference. Remember that jsoup parses HTML, and does not execute javascript.

Using the Firefox debugger to generate or execute the XPath will only show you the results from Firefox's DOM post javascript execution.

My suggestion would be to simplify your query, and to use Try.jsoup to iteratively validate it. For example, that input can be found simply with either
CSS: #assignment_text_entry or
XPath: //*[@id='assignment_text_entry'].

aianta · 2024-12-11T03:47:03Z

Hi,

I removed all resources specified in the head and the problem still persists. The only JS I'm adding to the page is the function I'm using to eval the xpath, and I'm doing that through the browser console after the file is loaded (as a local file, not from a webserver, so all resource links should be broken anyways).

debug.html.txt

There should not be any difference in the dom tree between firefox and JSoup's input. Additionally, the JSoup element @ /html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1] shows that it has children, but after the first few, the rest seem to be un-xpathable.

If these elements were added by JS, my understanding is that JSoup wouldn't have included them in the parsed tree at all.

Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.

jhy · 2024-12-11T04:46:32Z

> Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.

Interesting, can you expand on what the use case is?

I think the issue is stemming from this HTML (line 417):

<div id="gpa-scale-dialog" title="&quot;What" is="" gpa="" scale="" grading?&quot;="" style="display: none;">

That's in a div after div[4]. Note that there are attribute names like grading?&quot; in that tag; perhaps that is interrupting jsoup's conversion from the jsoup DOM to the W3C DOM, which we use for the XPath selection.

If I swap that for <div id="gpa-scale-dialog">, then your original xpath query works.

jhy · 2024-12-11T04:57:46Z

Here's a simpler reduction (see this try.jsoup):

<div id=0>

<div id=1></div>
<div id=2 grading;=""> </div>
<div id=3 grading?&quot;=""> 

</div>

XPath selector //div only finds 0 and 1. The ; in the attribute name is causing the drop out.
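To illustrate why such names are a problem for a W3C DOM, here is a small sketch that uses only the JDK's built-in DOM implementation. It is not jsoup's conversion code, and the class and method names are hypothetical; it only shows that `setAttribute` rejects names that are not valid XML names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeNameCheck {

    // Returns true if the JDK's W3C DOM implementation accepts the attribute name.
    public static boolean domAcceptsAttributeName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            div.setAttribute(name, ""); // throws DOMException for invalid XML names
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The last two names are taken from the reduction above.
        for (String name : new String[] {"id", "grading;", "grading?&quot;"}) {
            System.out.println(name + " -> " + domAcceptsAttributeName(name));
        }
    }
}
```

A converter that copies attributes one by one has to either skip or rewrite such names; if the exception escapes instead, the rest of the tree may never be copied, which would be consistent with the later divs dropping out of the XPath results.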

jhy · 2024-12-11T05:23:59Z

Thanks, fixed! Validated that your original xpath and HTML pass.

aianta · 2024-12-11T16:43:08Z

Huge! Thank you so much 😊

> > Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.
>
> Interesting, can you expand on what the use case is?

I'd love to, I hope to come back to this comment and drop a link to a paper once I'm done.

But the short version is I'm a grad student working on a system to extract the 'meanings' of button clicks that were captured from user interaction traces. The traces contain these xpaths, and then I process the elements/html context near those elements to determine what task/purpose that button click/input interaction was trying to accomplish.

When I wrote the software producing the traces, it felt simpler to have it compute absolute xpaths for elements since the same logic could work for any element on any page.

Now that I think about it, I guess I could have made some attempts at generating shorter xpath expressions that resolve uniquely to a given element. I think the Selenium IDE has a feature like that, where it proposes a range of locators (XPath or CSS queries) for selecting arbitrary elements on the page. Might add it to the back burner of ideas to explore.
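As a rough sketch of that idea (not the trace software's actual logic, and with hypothetical class and method names), the following JDK-only example computes an indexed absolute XPath for a W3C DOM element and a shorter id-based locator when an id is present. It assumes ids are unique within the document and skips ids containing a single quote.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathLocators {

    // Absolute path: one indexed step per ancestor, e.g. /html[1]/body[1]/div[2].
    public static String absoluteXPath(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            int index = 1; // XPath positions are 1-based
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    // Shorter locator: prefer an id lookup, otherwise fall back to the absolute path.
    // Assumption: ids are unique in the document.
    public static String shortXPath(Element el) {
        String id = el.getAttribute("id");
        if (!id.isEmpty() && !id.contains("'")) {
            return "//*[@id='" + id + "']";
        }
        return absoluteXPath(el);
    }

    // Parses a well-formed XML snippet into a W3C DOM document.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div/><div><input id='a'/><input/></div></body></html>");
        Element first = (Element) doc.getElementsByTagName("input").item(0);
        Element second = (Element) doc.getElementsByTagName("input").item(1);
        System.out.println(shortXPath(first));
        System.out.println(absoluteXPath(second));
    }
}
```

The absolute form works for any element, which matches the reasoning above; the id-based form is shorter and less sensitive to changes elsewhere in the tree, but only applies when the element carries an id.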

But yeah, I've seen some pretty gnarly attribute abuse with some of the html I'm working with. I'll have to double check my own processing pipelines aren't causing it somewhere, because that specific element was definitely the title attribute value leaking its contents all over the place.

Either way, thanks again for patching this obscure edge case.

jhy · 2024-12-11T22:59:12Z

Sounds cool, will be interested to see the paper.

jsoup has a method Element#cssSelector() which will generate a unique CSS selector from the ID, or parent + class combo.

jhy closed this as completed Dec 11, 2024

jhy added the not-a-bug This issue is not a bug; it is working as per spec label Dec 11, 2024

jhy reopened this Dec 11, 2024

jhy removed the not-a-bug This issue is not a bug; it is working as per spec label Dec 11, 2024

jhy changed the title ~~document.selectXpath() fails to retrieve element.~~ Attribute names including ; cannot be converted to W3C DOM; subsequent XPath query may be incorrect Dec 11, 2024

jhy closed this as completed in d67c994 Dec 11, 2024

jhy added this to the 1.19.1 milestone Dec 11, 2024

jhy added the fixed label Dec 11, 2024
