Source position for attributes #1933

jrfaller · 2023-04-06T13:52:46Z

Hi all,

First, thanks a lot for this excellent library!

I am working on an open-source tool that diffs tree-based languages such as HTML or XML (http://github.com/GumTreeDiff/gumtree), and I wanted to use JSoup to parse HTML and XML files since it is one of the very few parsers I know that has precise source position for the XML/HTML elements. However, I have remarked that this information is not present for the attributes, and it prevents me from fully diffing these files, as attribute name or value modifications are frequent.

Would it be possible to add source location information for attribute keys and values? Or do you have an idea for a workaround?

Thanks in advance.

Cheers!

jurgenvinju · 2023-04-07T20:19:56Z

I have a different use case for the same functionality. Also a fan of JSoup!

We're generating LSP servers for DSLs and some of these have an XML syntax. Check out usethesource/rascal-language-servers .

With accurate locations for each attribute and value we can more easily generate useful features like reference resolution, also inside string attributes that are parsed further.

jhy · 2023-11-17T07:28:56Z

Thanks for the feedback and info on your use cases. And sorry for the very late reply!

I think this is a great idea and look forward to adding it.

Question for you both (and for any other folks using the source ranges). Currently the implementation only provides the source range for explicitly inserted tags. Implicitly created tags (like e.g. head or body) will have a -1 start / end range. Are you using value for a specific reason? I am thinking that it would be better to give that the range of whichever tag triggered that node's creation.

jrfaller · 2023-11-17T07:59:10Z

Hi @jhy and thanks for your feedback!

I was not aware of the behavior you described, thanks for pointing it out.

In our use case, we use Jsoup as a black box to parse XML files, with all the tags already present (and where we absolutely do not want to have nodes not in the document). Ideally, we would need precise positions for all the syntactic elements inside the file (elements, attribute names, attribute values). From what I understand there are some cases where nodes are dynamically inserted by Jsoup that are not present in the text file being parsed? I think this is beyond the scope of my use case or the one of @jurgenvinju if this is the case.

I am not sure how to handle these elements though since they are not in the text file it could be nice to have a source range that conveys this information. However, for debugging purposes, it could be also useful to pinpoint the places responsible for the creation of the element. So I have mixed feelings about this one (IIUC)!

Cheers!

jhy · 2023-11-17T08:29:36Z

Thanks for the (fast!) reply. Definitely makes sense. Using the XML Parser, you won't have any implicitly created elements -- that's only for the HTML parser. So it's not directly applicable to your use case.

I think if there's a clean way to have the SourceRange be able to tell if it was explicit or implicit, we'll be in a good spot. And similarly for closing nodes.

jrfaller · 2023-11-17T08:34:10Z

Ha yes right we have this problem for closing nodes when someone uses the <foo/> syntax. So this would be nice to know whenever we hit an element that is not materialized in the textual syntax! Maybe this could be a property on the nodes like isImplicit(), avoiding from the beginning to search for source information for those nodes?

jurgenvinju · 2023-11-17T14:06:51Z

Hi! cool that we are making progress on this topic. Our use case is that of "high fidelity" parsing, so we want to introduce as little "noise" in the process as possible. The image of what we have in the AST model should be as close as possible to that what the user is looking at in their editor.

additional nodes or attributes that are inserted into the trees to canonicalize the model are "noise" to us. So it's important they are marked as such, like @jrfaller indicated, or that we can chose not to include them at all.
closing nodes that are not there: that's great! As long as the closing range (offset+length of the entire node) does not extend over into the range of the next node. Especially if final spaces or newlines of the contents of the node with a missing closing tag are attributed to the "right node" using the offset/length, we are very very happy.

jhy · 2023-11-20T06:09:42Z

Thanks, that aligns to what I was thinking too. I have refactored and improved how source ranges are tracked in #2056. It would be great if you could install a snapshot version of 1.17.1 and have a try and see if there are any gaps. (Not including attribute ranges yet though).

someone uses the <foo/> syntax

That will now include a source range and an end range that is the same and are equal; pos 0 through 6.

Maybe this could be a property on the nodes like isImplicit()

Yes, added that. The source range and/or the end source will be tracked() and isImplicit(). And the pos/endPos will be the same. I.e. it's a zero-width range for the start or for the end. In HTML that might be an implicitly inserted <head>. In XML, that might be implicitly closed tags. E.g. <foo><bar></foo>. Bar will be implicitly closed with the pos before </foo>.

Especially if final spaces or newlines of the contents of the node with a missing closing tag are attributed to the "right node" using the offset/length

<foo>\n - the \n is a TextNode and will have a range coming directly after and non-overlapping with the foo.

jhy · 2023-11-20T10:42:30Z

I have implemented a first draft of attribute tracking in master...attr-track

It's a bit of a different approach than the node tracking because attributes are built differently. They aren't tokens, but rather sub tokens of tag tokens. So the tracking has to be done internally.

jhy · 2023-11-21T05:44:06Z

OK, I've merged the complete implementation. If you get a chance to review this in a snapshot before release, I'd value your feedback! (Well, I'll still value it afterwards too :)

jrfaller · 2023-11-24T07:25:24Z

Wow amazing! Thanks a lot!

Is there a nightly repository so that I can try it out?

Cheers!

jurgenvinju · 2023-11-24T08:51:58Z

Ooo Nice! I'm going to give this a try early next week. I'll post a link to a PR here.

jhy · 2023-11-27T03:22:08Z

@jrfaller there are instructions here on how to build and install a snapshot version. There is no pre-built snapshot.

But, I just released this in 1.17.1 so you can use that directly.

jurgenvinju · 2023-11-27T14:18:21Z

Here's my PR on the Rascal project. It works! But... I have a missing Range on the attributes on the top node.

usethesource/rascal#1893

For testing purposes I was parsing the pom.xml file of the rascal project itself:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.rascalmpl</groupId>
    <artifactId>rascal</artifactId>
    <version>0.35.0-RC2-SNAPSHOT</version>
    <packaging>jar</packaging>
  ... SNIP!

The project node has an xsi:schemaLocation attribute.

Here's some output of the readXML function that will print the filename only if the attribute sources ranges are -1:

rascal>import IO;
ok
rascal>import lang::xml::IO;
ok
rascal>iprintln(readXML(|cwd:///pom.xml|, trackOrigins=true), lineLimit=-1)
"#document"(
  "project"(
    "modelversion"(
      "4.0.0",
      src=|cwd:///pom.xml|(209,14,<2,4>,<2,18>)),
    "groupid"(
      "org.rascalmpl",
      src=|cwd:///pom.xml|(249,9,<4,4>,<4,13>)),
    "artifactid"(
      "rascal",
      src=|cwd:///pom.xml|(286,12,<5,4>,<5,16>)),
    "version"(
      "0.35.0-RC2-SNAPSHOT",
      src=|cwd:///pom.xml|(322,9,<6,4>,<6,13>)),
    "packaging"(
      "jar",
      src=|cwd:///pom.xml|(365,11,<7,4>,<7,15>)),
    "scm"(
      "developerconnection"(
        "scm:git:ssh://[email protected]/usethesource/rascal.git",
        src=|cwd:///pom.xml|(411,21,<10,8>,<10,29>)),
      "tag"(
        "SNAPSHOT",
        src=|cwd:///pom.xml|(515,5,<11,8>,<11,13>)),
      src=|cwd:///pom.xml|(397,5,<9,4>,<9,9>)),
    "repositories"(
      ... SNIP!
      "dependency"(
        "groupid"(
          "com.ibm.icu",
          src=|cwd:///pom.xml|(18901,9,<441,12>,<441,21>)),
        "artifactid"(
          "icu4j",
          src=|cwd:///pom.xml|(18944,12,<442,12>,<442,24>)),
        "version"(
          "69.1",
          src=|cwd:///pom.xml|(18987,9,<443,12>,<443,21>)),
        src=|cwd:///pom.xml|(18876,12,<440,8>,<440,20>)),
      src=|cwd:///pom.xml|(15755,14,<351,4>,<351,18>)),
    src=|cwd:///pom.xml|(0,204,<1,0>,<1,204>),
    srcs=("schemalocation":|cwd:///pom.xml|),
    schemalocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"),
  src=|cwd:///pom.xml|)

Here we see the srcs map saying: ("schemalocation":|cwd:///pom.xml|), which is caused by the AttributeRange.valueRange() having a begin offset of -1. I tracked this with the Java debugger. Is this supposed to happen? Or is the top-node somehow different?

jurgenvinju · 2023-11-27T14:20:18Z

There are other nodes where it does work:


"transformer"(
                    "manifestentries"(
                      "name"(
                        "rascal",
                        src=|cwd:///pom.xml|(12223,6,<284,40>,<284,46>)),
                      "main-class"(
                        "org.rascalmpl.shell.RascalShell",
                        src=|cwd:///pom.xml|(12283,12,<285,40>,<285,52>)),
                      "specification-version"(
                        "${project.version}",
                        src=|cwd:///pom.xml|(12380,23,<286,40>,<286,63>)),
                      "specification-vendor"(
                        "http://www.usethesource.io",
                        src=|cwd:///pom.xml|(12486,22,<287,40>,<287,62>)),
                      src=|cwd:///pom.xml|(12165,17,<283,36>,<283,53>)),
                    src=|cwd:///pom.xml|(12030,98,<282,32>,<282,130>),
                    srcs=("implementation":|cwd:///pom.xml|(12059,67,<282,61>,<282,128>)),
                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"),

See here the srcs map has the exact position of the value that goes with implementation.

jurgenvinju · 2023-11-27T14:21:47Z

My code for reference:

 private ISourceLocation attrToLoc(Attribute a, ISourceLocation file) {
        Range range = a.sourceRange().valueRange();

        if (range.start().pos() < 0) {
            // this is strange
            return file;
        }

        return vf.sourceLocation(file, 
            range.start().pos(), 
            range.end().pos() - range.start().pos(),
            range.start().lineNumber(),
            range.end().lineNumber(),
            range.start().columnNumber() - 1,
            range.end().columnNumber() - 1
        );
    }

jrfaller · 2023-11-27T20:00:56Z

Amazing! Got initial support for XML in GumTree with attributes thanks to you! here for reference : GumTreeDiff/gumtree@6584227

I will report If there is any strange behavior.

Screenshot:

jhy · 2023-11-28T01:12:07Z

Thanks for the feedback from both of you, appreciated and great to see it in use, in such a variety of tools.

@jurgenvinju I have fixed the issue you saw, in #2067. The issue was not with it being the first element exactly, but due to a mixed-case attribute name not being normalized (your configuration of the XML parser is to lowercase normalize attribute names).

You might like to validate that with a snapshot install and let me know of any other issues.

jurgenvinju · 2023-11-28T11:07:22Z

I tested with 03df8ce of jsoup; unfortunately I see no improvement yet. The same attribute mix-case, seems to be missing it's position. If you want I'll dig a little deeper to see if there are other names in the map.

jhy · 2023-11-28T11:25:50Z

Hmm - yes please if you can provide a small code sample of your parser settings and the input, and which attribute is missing, that would help.

jurgenvinju · 2023-11-28T11:59:20Z

parsing this file: https://raw.githubusercontent.com/usethesource/rascal/main/pom.xml
this is how we parse:

 try (InputStream reader = URIResolverRegistry.getInstance().getInputStream(loc)) {
            Parser xmlParser = Parser.xmlParser()
                .settings(new ParseSettings(false, false))
                .setTrackPosition(true)
                ;
            
            Document doc = Jsoup.parse(reader, charset, uri, xmlParser);
            ...

This is the cause in JSoup why I have an -1 on the schemalocation attribute:

public Range.AttributeRange sourceRange() {
       if (parent == null) return Range.AttributeRange.UntrackedAttr;
       return parent.sourceRange(key);
   }

The parent is equal to null in my case. This is in the Attribute class on line 144.

jurgenvinju · 2023-11-28T12:20:34Z

I believe it is caused by this code, that does not clone the parent when the node's attribute name is normalized:

 private static Attribute removeNamespace(Attribute a, Attributes otherAttributes, boolean fullyQualify) {
        if (fullyQualify) {
            return a;
        }
        
        String key = a.getKey();
        int index = key.indexOf(":");

        if (index == -1) {
            return a;
        }

        String newKey = key.substring(index+1);

        if (otherAttributes.hasKey(newKey)) {
            // keep disambiguation if necessary
            return a;
        }

        return new Attribute(newKey, a.getValue()); // here the parent is not cloned
    }

That is in our own code! So I have to find a way to do that.

jurgenvinju · 2023-11-28T12:49:20Z

Ok. This is the analysis. It all works fine on the JSoup side now; with a minor remark:

I used Attribute.clone() and then Attribute.setKey(newKey) to cut away namespace prefixes.
I also used new Attribute(key, value) without the parent
both uses of the Attribute class lead to missing lookups in the source map data-structures.
1. After the first, the source lookup will be one the new name, while the old name is still in the map
2. After the second there is not map at all to look things into.
@jhy maybe setKey could be considered to also update the keys in the source map

However, this is not necessary. It will work like this because I changed my code to not use the about methods anymore. Thanks!

jurgenvinju · 2023-11-28T13:58:34Z

Ok. Now I'm up with the next issue:

This is while running this code:

 public Range.AttributeRange sourceRange(String key) {
        if (!hasKey(key)) return UntrackedAttr;
        //noinspection unchecked
        Map<String, Range.AttributeRange> ranges = (Map<String, Range.AttributeRange>) userData(AttrRangeKey);
        if (ranges == null) return Range.AttributeRange.UntrackedAttr;
        Range.AttributeRange range = ranges.get(key);
        return range != null ? range : Range.AttributeRange.UntrackedAttr;
    }

So I am looking at the normalized key schemalocation in the Attribute's key, while the parent has a source range lookup map with the unnormalized key schemaLocation in it.

jurgenvinju · 2023-11-28T14:01:45Z

Perhaps this is also solved if we let setKey update the underlying lookup maps?

jurgenvinju · 2023-11-28T15:03:15Z

Just for completeness sake: if I use new ParseSettings(true, true) instead, we're good. So I'm testing with those settings for now, so there is no rush :-)

jhy · 2023-11-29T00:49:26Z

So I am looking at the normalized key schemalocation in the Attribute's key, while the parent has a source range lookup map with the unnormalized key schemaLocation in it.

I am a bit confused by this. This is fixed by #2067 and you're testing after those two commits, right?

    @Test void preserveCaseOff() {
        String xml = "<el Id=1'>One</el>";
        Document doc = Jsoup.parse(xml, Parser.xmlParser()
             .setTrackPosition(true).settings(new ParseSettings(false, false)));
        Element el = doc.expectFirst("el");
        for (Attribute attribute : el.attributes()) {
            System.out.println(attribute);
            System.out.println(attribute.sourceRange());
            System.out.println(attribute.getKey());
        }
    }

Gives

id="1'"
1,5:4-1,7:6=1,8:7-1,10:9
id

The attribute key and the sourceRange key are both normalized per the parser settings. And with default XML settings (preserve case), we get:

Id="1'"
1,5:4-1,7:6=1,8:7-1,10:9
Id

I think I'm missing something, can you give a small code snippet that shows the problem? It should work correctly regardless of the preserve case setting.

Having setKey update the sourceRange definitely makes sense, will add that.

I also want to make a better accessor for an Attribute - right now you can only get it via the iterator. (The Attribute is never held - the Element holds an Attributes which contains arrays for keys and values. This is to keep the DOM's memory use as low as possible in routine use. The Attribute is instantiated with a link back to the Attributes during the iterator). Perhaps add it as .attribute(key) to Element and/or Attributes.

jurgenvinju · 2023-11-29T12:51:31Z

I am a bit confused by this. This is fixed by #2067 and you're testing after those two commits, right?

Me too! :-) I cloned the latest version which is after those two commits indeed.

I will try and reduce this to a small code snippet; probably discovering my own mistakes by doing so. Thanks for your patience.

jhy added the improvement label Nov 17, 2023

jhy mentioned this issue Nov 20, 2023

Improved accuracy of tracked source positions #2056

Merged

jhy self-assigned this Nov 20, 2023

jhy added this to the 1.17.1 milestone Nov 20, 2023

jhy mentioned this issue Nov 21, 2023

Tracks the source range of attributes (name and value) when source tracking is on #2057

Merged

jhy linked a pull request Nov 21, 2023 that will close this issue

Tracks the source range of attributes (name and value) when source tracking is on #2057

Merged

jhy closed this as completed in #2057 Nov 21, 2023

jhy mentioned this issue Nov 28, 2023

Attribute source ranges are not normalized correctly #2067

Closed

jurgenvinju mentioned this issue Nov 28, 2023

xml attribute locations usethesource/rascal#1893

Draft

This was referenced Nov 29, 2023

Provide an Attribute getter in Element, Attributes #2069

Closed

Changing an attributes key should maintain its source range #2070

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source position for attributes #1933

Source position for attributes #1933

jrfaller commented Apr 6, 2023

jurgenvinju commented Apr 7, 2023 •

edited

Loading

jhy commented Nov 17, 2023

jrfaller commented Nov 17, 2023

jhy commented Nov 17, 2023

jrfaller commented Nov 17, 2023 •

edited

Loading

jurgenvinju commented Nov 17, 2023

jhy commented Nov 20, 2023

jhy commented Nov 20, 2023

jhy commented Nov 21, 2023

jrfaller commented Nov 24, 2023

jurgenvinju commented Nov 24, 2023

jhy commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jrfaller commented Nov 27, 2023 •

edited

Loading

jhy commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023

jhy commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023

jhy commented Nov 29, 2023 •

edited

Loading

jurgenvinju commented Nov 29, 2023

Source position for attributes #1933

Source position for attributes #1933

Comments

jrfaller commented Apr 6, 2023

jurgenvinju commented Apr 7, 2023 • edited Loading

jhy commented Nov 17, 2023

jrfaller commented Nov 17, 2023

jhy commented Nov 17, 2023

jrfaller commented Nov 17, 2023 • edited Loading

jurgenvinju commented Nov 17, 2023

jhy commented Nov 20, 2023

jhy commented Nov 20, 2023

jhy commented Nov 21, 2023

jrfaller commented Nov 24, 2023

jurgenvinju commented Nov 24, 2023

jhy commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jurgenvinju commented Nov 27, 2023

jrfaller commented Nov 27, 2023 • edited Loading

jhy commented Nov 28, 2023 • edited Loading

jurgenvinju commented Nov 28, 2023

jhy commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023 • edited Loading

jurgenvinju commented Nov 28, 2023 • edited Loading

jurgenvinju commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023 • edited Loading

jurgenvinju commented Nov 28, 2023

jurgenvinju commented Nov 28, 2023

jhy commented Nov 29, 2023 • edited Loading

jurgenvinju commented Nov 29, 2023

jurgenvinju commented Apr 7, 2023 •

edited

Loading

jrfaller commented Nov 17, 2023 •

edited

Loading

jrfaller commented Nov 27, 2023 •

edited

Loading

jhy commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jurgenvinju commented Nov 28, 2023 •

edited

Loading

jhy commented Nov 29, 2023 •

edited

Loading