Attribute names including ; cannot be converted to W3C DOM; subsequent XPath query may be incorrect
#2244
Comments
Hi, in a case like this my first thought is that you have a different DOM tree in Firefox vs jsoup. Given the JavaScript requirement and sources on the page, I expect that is what is causing the tree difference. Remember that jsoup parses HTML and does not execute JavaScript. Using the Firefox debugger to generate or execute the XPath will only show you the results from Firefox's DOM after JavaScript execution. My suggestion would be to simplify your query and to use Try.jsoup to iteratively validate it. For example, that input can be found simply with either
Hi, I removed all resources specified in the head and the problem still persists. The only JS I'm adding to the page is the function I'm using to eval the XPath, and I'm doing that through the browser console after the file is loaded (as a local file, not from a webserver, so all resource links should be broken anyway). There should not be any difference in the DOM tree between Firefox and JSoup's input. Additionally, the JSoup element @ If these elements were added by JS, my understanding is that JSoup wouldn't have included them in the parsed tree at all. Unfortunately, I don't know anything other than the absolute XPath of the element a priori for my use case, so I cannot use any more direct XPath expression.
Interesting, can you expand on what the use case is? I think the issue is stemming from this HTML (line 417):
<div id="gpa-scale-dialog" title=""What" is="" gpa="" scale="" grading?"="" style="display: none;">
That's in a If I swap that for
Here's a simpler reduction (see this try.jsoup):
<div id=0>
<div id=1></div>
<div id=2 grading;=""> </div>
<div id=3 grading?"="">
</div>
XPath selector
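The reason attribute names such as `grading;` or `grading?"` are a problem can be illustrated without jsoup at all. The following is a minimal sketch using only the JDK's built-in W3C DOM implementation; the class name `AttributeNameCheck` and the helper `isAcceptedByW3cDom` are hypothetical names chosen for this illustration and are not part of jsoup. It assumes the default JDK parser implementation, which validates attribute names against the XML name rules.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeNameCheck {
    /**
     * Hypothetical helper: returns true if the JDK's W3C DOM accepts
     * the given attribute name, false if it rejects it.
     */
    public static boolean isAcceptedByW3cDom(String attributeName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No W3C DOM implementation available", e);
        }
        Element div = doc.createElement("div");
        try {
            // HTML parsers tolerate almost any attribute name, but the W3C DOM
            // only allows names that are valid XML names.
            div.setAttribute(attributeName, "");
            return true;
        } catch (DOMException e) {
            // Typically INVALID_CHARACTER_ERR for names like "grading;"
            return false;
        }
    }

    public static void main(String[] args) {
        // Attribute names taken from the reduction above
        for (String name : new String[] {"id", "grading;", "grading?\""}) {
            System.out.println(name + " -> accepted: " + isAcceptedByW3cDom(name));
        }
    }
}
```

An HTML-to-W3C converter that hits such a name has to skip or rewrite the attribute somehow, which is where a conversion can go wrong and leave the resulting tree different from the HTML tree.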
Thanks, fixed! Validated that your original XPath and HTML pass.
Huge! Thank you so much 😊
I'd love to, I hope to come back to this comment and drop a link to a paper once I'm done. But the short version is I'm a grad student working on a system to extract the 'meanings' of button clicks that were captured from user interaction traces. The traces contain these XPaths, and then I process the elements/HTML context near those elements to determine what task/purpose that button click/input interaction was trying to accomplish. When I wrote the software producing the traces, it felt simpler to have it compute absolute XPaths for elements, since the same logic could work for any element on any page. Now that I think about it, I guess I could have made some attempts at generating shorter XPath expressions that resolve uniquely to a given element. I think the Selenium IDE has a feature like that, where it proposes a range of locators (XPath/CSS query) for selecting arbitrary elements on the page. Might add it to the back burner of ideas to explore. But yeah, I've seen some pretty gnarly attribute abuse with some of the HTML I'm working with. I'll have to double check my own processing pipelines aren't causing it somewhere, because that specific element was definitely the title attribute value leaking its contents all over the place. Either way, thanks again for patching this obscure edge case.
Sounds cool, will be interested to see the paper. jsoup has a method, Element#cssSelector(), which will generate a unique CSS selector from the ID, or from a parent + class combo.
Simple reproducer; the attached HTML text file is the document in question. Using JSoup version 1.18.3.
debug.html.txt
I have evidence that the XPath provided should be resolvable, since I am able to take the attached HTML, open it in Firefox, and then, using the following JS:
run the following command in the console:
getElementByXpath("/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]")
to get:
<input id="assignment_text_entry" type="checkbox" value="1" name="online_submission_types[online_text_entry]" aria-label="Online Submission Type - Text Entry" style="">
Initial debugging shows that the last element JSoup is able to retrieve along the xpath is:
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]
When I retrieve the children for
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]
, I get 7 children, then:
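The debugging approach described above (walking the absolute XPath one step at a time and noting the last prefix that still resolves) can be sketched roughly as follows. This is an illustration under stated assumptions, not the reporter's actual code: it runs against a W3C DOM built from a small well-formed XML string using only the JDK, the names `XPathPrefixProbe`, `parseXml`, and `longestResolvablePrefix` are hypothetical, and the path is split naively on `/`, so it only handles simple steps such as `div[3]` with no slashes inside predicates.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathPrefixProbe {
    /** Hypothetical helper: parses a well-formed XML string into a W3C DOM document. */
    public static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    /**
     * Returns the longest leading portion of an absolute XPath that still
     * selects a node, or an empty string if even the first step fails.
     * Assumes simple steps only (no '/' inside predicates).
     */
    public static String longestResolvablePrefix(Document doc, String absoluteXPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String[] steps = absoluteXPath.substring(1).split("/");
        StringBuilder prefix = new StringBuilder();
        String lastGood = "";
        for (String step : steps) {
            prefix.append('/').append(step);
            try {
                Node node = (Node) xpath.evaluate(prefix.toString(), doc, XPathConstants.NODE);
                if (node == null) {
                    break; // this step no longer resolves; keep the previous prefix
                }
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Invalid XPath: " + prefix, e);
            }
            lastGood = prefix.toString();
        }
        return lastGood;
    }

    public static void main(String[] args) {
        Document doc = parseXml("<html><body><div><form/></div><div><p/></div></body></html>");
        // The second div has no form child, so resolution stops there.
        System.out.println(longestResolvablePrefix(doc, "/html/body/div[2]/form/input"));
        // prints "/html/body/div[2]"
    }
}
```

Comparing the children found under the last resolvable prefix with what the browser reports for the same node is then a quick way to spot where the two trees diverge.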