Attribute names including ; cannot be converted to W3C DOM; subsequent XPath query may be incorrect #2244

Closed
aianta opened this issue Dec 9, 2024 · 7 comments

aianta commented Dec 9, 2024

Simple reproducer below; the attached HTML text file is the document in question. Using jsoup version 1.18.3.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import static org.junit.jupiter.api.Assertions.assertTrue;

public class SelectXpathTest {

    @Test
    public void test() throws IOException {

        // Read the attached document as UTF-8 and parse it with jsoup (no JavaScript is executed).
        String html = Files.readString(Path.of("src/test/resources/debug.html.txt"));
        Document document = Jsoup.parse(html);
        int foundElements = document.selectXpath(xpath).size();
        System.out.println("Found %s elements".formatted(foundElements));

        // A bare `assert` is skipped unless the JVM runs with -ea, so use a JUnit assertion.
        assertTrue(foundElements > 0);

    }

    // Absolute XPath recorded for the target element.
    private static final String xpath = "/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]";

}

debug.html.txt

I have evidence that the xpath provided should be resolvable, since I am able to take the attached html, open it in firefox and then using the following JS:

function getElementByXpath(path) {
    return document.evaluate(path, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}

run the following command in the console:

getElementByXpath("/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]")

to get:

<input id="assignment_text_entry" type="checkbox" value="1" name="online_submission_types[online_text_entry]" aria-label="Online Submission Type - Text Entry" style="">

Initial debugging shows that the last element JSoup is able to retrieve along the xpath is:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]

When I retrieve the children for /html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1], I get 7 children, then:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[1] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[2] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[3] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[4] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5] -> Nope...

jhy (Owner) commented Dec 11, 2024

Hi,

In a case like this my first thought is that you have a different DOM tree in Firefox vs jsoup. Given the javascript requirement and sources on the page, I expect that is what is causing the tree difference. Remember that jsoup parses HTML, and does not execute javascript.

Using the Firefox debugger to generate or execute the XPath will only show you the results from Firefox's DOM post javascript execution.

My suggestion would be to simplify your query, and to use Try.jsoup to iteratively validate it. For example, that input can be found simply with either
CSS: #assignment_text_entry or
XPath: //*[@id='assignment_text_entry'].
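
To illustrate the id-based lookup suggested above, here is a minimal sketch that uses only the JDK's built-in W3C DOM and XPath classes rather than jsoup. The `MARKUP` string is a hypothetical, well-formed stand-in for the real page (not the attached file), and the class and helper names are made up for the example. With jsoup, the same XPath expressions would be passed to `document.selectXpath(...)` as in the reproducer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IdLookupSketch {

    // Hypothetical, well-formed stand-in for the real page (not the attached file).
    static final String MARKUP =
        "<html><body><form><div><label>"
        + "<input id=\"other\"/>"
        + "<input id=\"assignment_text_entry\" type=\"checkbox\"/>"
        + "</label></div></form></body></html>";

    // Parses a well-formed markup string into a W3C DOM document.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression and returns all matching nodes.
    static NodeList select(Document doc, String xpath) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(MARKUP);
        // Short query: depends only on the id, not on the shape of the tree.
        NodeList byId = select(doc, "//*[@id='assignment_text_entry']");
        // Absolute query: breaks as soon as any ancestor step differs between parsers.
        NodeList byPath = select(doc, "/html/body/form/div/label/input[2]");
        System.out.println("by id: " + byId.getLength() + ", by path: " + byPath.getLength());
        System.out.println("same node: " + byId.item(0).isSameNode(byPath.item(0)));
    }
}
```

The id-based query stays valid even if the two parsers disagree about some ancestor, which is why it is a useful first check when an absolute path stops matching.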

jhy closed this as completed Dec 11, 2024
jhy added the not-a-bug label ("This issue is not a bug; it is working as per spec") Dec 11, 2024

aianta (Author) commented Dec 11, 2024

Hi,

I removed all resources specified in the head and the problem still persists. The only JS I'm adding to the page is the function I'm using to eval the xpath, and I'm doing that through the browser console after the file is loaded (as a local file, not from a webserver, so all resource links should be broken anyways).

debug.html.txt

There should not be any difference in the dom tree between firefox and JSoup's input. Additionally, the JSoup element @ /html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1] shows that it has children, but after the first few, the rest seem to be un-xpathable.

If these elements were added by JS, my understanding is that JSoup wouldn't have included them in the parsed tree at all.

Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.

jhy (Owner) commented Dec 11, 2024

> Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.

Interesting, can you expand on what the use case is?

I think the issue is stemming from this HTML (line 417):

<div id="gpa-scale-dialog" title="&quot;What" is="" gpa="" scale="" grading?&quot;="" style="display: none;"> 

That's in a div after div[4]. Note that there are attribute names like grading?&quot; perhaps that is interrupting jsoup's conversion from the jsoup DOM to the W3C DOM, which we use for the XPath selection.

If I swap that for <div id="gpa-scale-dialog">, then your original xpath query works.
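
As a hedged illustration of why such attribute names are a problem for a W3C DOM, the sketch below probes the JDK's own DOM implementation directly. It does not reproduce jsoup's converter; `AttrNameProbe` and `domAccepts` are hypothetical names introduced for this example. The assumption is that the DOM runs with its default strict error checking, in which case `setAttribute` throws a `DOMException` for names that are not valid XML names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrNameProbe {

    /** Returns true if the JDK's W3C DOM accepts the given attribute name. */
    static boolean domAccepts(String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            // Throws DOMException if the name is not a valid XML name.
            div.setAttribute(attrName, "");
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute names taken from the HTML quoted above, plus an ordinary one.
        String[] names = {"title", "is", "grading?&quot;"};
        for (String name : names) {
            System.out.println(name + " accepted: " + domAccepts(name));
        }
    }
}
```

If a converter lets that exception interrupt the copy of a subtree, everything after the offending element can go missing from the W3C document, which would match the symptom of later siblings being unreachable by XPath.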

jhy reopened this Dec 11, 2024
jhy removed the not-a-bug label ("This issue is not a bug; it is working as per spec") Dec 11, 2024

jhy (Owner) commented Dec 11, 2024

Here's a simpler reduction (see this try.jsoup):

<div id=0>

<div id=1></div>
<div id=2 grading;=""> </div>
<div id=3 grading?&quot;=""> 

</div>

XPath selector //div only finds 0 and 1. The ; in the attribute name is causing the drop out.
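
One possible way to avoid the drop-out, sketched here with JDK classes only, is to strip characters that are not valid in XML names before setting the attribute on the W3C element. This is an assumption about a workable approach, not a description of jsoup's code or of the eventual fix; `AttrNameSanitizer`, `toXmlSafeName`, and `countDivs` are hypothetical names, and the allowed character set is a deliberately conservative ASCII subset of what XML permits.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrNameSanitizer {

    /** Hypothetical helper: keeps only a conservative ASCII subset of XML name characters. */
    static String toXmlSafeName(String name) {
        String cleaned = name.replaceAll("[^A-Za-z0-9_:.-]", "");
        // An XML name must not be empty or start with a digit, '.' or '-'.
        if (cleaned.isEmpty() || !cleaned.matches("[A-Za-z_:].*")) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    /** Builds one div per attribute name (sanitized) and counts the //div matches. */
    static int countDivs(String... attrNames) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element body = doc.createElement("body");
            doc.appendChild(body);
            for (String attrName : attrNames) {
                Element div = doc.createElement("div");
                div.setAttribute(toXmlSafeName(attrName), "");
                body.appendChild(div);
            }
            Double count = (Double) XPathFactory.newInstance().newXPath()
                .evaluate("count(//div)", doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute names from the reduction above.
        System.out.println(toXmlSafeName("grading;"));
        System.out.println(toXmlSafeName("grading?&quot;"));
        System.out.println(countDivs("id", "grading;", "grading?&quot;"));
    }
}
```

A trade-off of this approach is that distinct source names can collapse to the same sanitized name (for example `grading;` and `grading`), so a real converter would need to decide whether to overwrite, skip, or rename in that case.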

jhy changed the title from "document.selectXpath() fails to retrieve element." to "Attribute names including ; cannot be converted to W3C DOM; subsequent XPath query may be incorrect" Dec 11, 2024
jhy closed this as completed in d67c994 Dec 11, 2024
jhy added this to the 1.19.1 milestone Dec 11, 2024
jhy added the fixed label Dec 11, 2024

jhy (Owner) commented Dec 11, 2024

Thanks, fixed! Validated that your original xpath and HTML pass.

aianta (Author) commented Dec 11, 2024

Huge! Thank you so much 😊

> > Unfortunately, I don't know anything other than the absolute xpath of the element a priori for my use case, so I cannot use any more direct xpath expression.
>
> Interesting, can you expand on what the use case is?

I'd love to. I hope to come back to this comment and drop a link to a paper once I'm done.

But the short version is I'm a grad student working on a system to extract the 'meanings' of button clicks that were captured from user interaction traces. The traces contain these xpaths, and then I process the elements/html context near those elements to determine what task/purpose that button click/input interaction was trying to accomplish.

When I wrote the software producing the traces, it felt simpler to have it compute absolute xpaths for elements since the same logic could work for any element on any page.

Now that I think about it, I guess I could have made some attempts at generating shorter xpath expressions that resolve uniquely to a given element. I think the Selenium IDE has a feature like that where it proposes a range of locators xpath/CSS query for selecting arbitrary elements on the page. Might add it to the back burner of ideas to explore.
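
A rough sketch of that idea, using only the JDK's W3C DOM and XPath classes: walk up from the element, and stop as soon as an ancestor (or the element itself) has an `id`, anchoring the expression there; otherwise fall back to the full positional path. The names `ShortXPathSketch`, `shortXPath`, and `positionAmongSameTag` are hypothetical, and the sketch assumes ids are unique within the document, which real-world HTML does not always guarantee.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ShortXPathSketch {

    /** 1-based position of the element among preceding siblings with the same tag name. */
    static int positionAmongSameTag(Element e) {
        int pos = 1;
        for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s instanceof Element && ((Element) s).getTagName().equals(e.getTagName())) {
                pos++;
            }
        }
        return pos;
    }

    /** Builds an XPath anchored at the nearest id, or an absolute positional path if none exists. */
    static String shortXPath(Element el) {
        StringBuilder path = new StringBuilder();
        Node node = el;
        while (node instanceof Element) {
            Element e = (Element) node;
            String id = e.getAttribute("id");
            // Assumption: ids are unique; ids containing a quote are skipped to keep the literal valid.
            if (!id.isEmpty() && id.indexOf('\'') < 0) {
                return "//*[@id='" + id + "']" + path;
            }
            path.insert(0, "/" + e.getTagName() + "[" + positionAmongSameTag(e) + "]");
            node = e.getParentNode();
        }
        return path.toString();
    }

    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static Node selectOne(Document doc, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup, chosen only to exercise both branches.
        Document doc = parse("<html><body><div id=\"box\"><span/><span><input/></span></div>"
            + "<p><b/></p></body></html>");
        Element input = (Element) doc.getElementsByTagName("input").item(0);
        String xpath = shortXPath(input);
        System.out.println(xpath);
        System.out.println(selectOne(doc, xpath).isSameNode(input));
    }
}
```

Anchoring at an id keeps the expression short and makes it independent of everything above the anchor, so a parser disagreement higher in the tree would no longer break the lookup.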

But yeah, I've seen some pretty gnarly attribute abuse in some of the HTML I'm working with. I'll have to double-check that my own processing pipelines aren't causing it somewhere, because that specific element was definitely the title attribute value leaking its contents all over the place.

Either way, thanks again for patching this obscure edge case.

jhy (Owner) commented Dec 11, 2024

Sounds cool, will be interested to see the paper.

jsoup has a method Element#cssSelector() which will generate a unique CSS selector from the ID, or parent + class combo.
