Non-sequential parsing of TEI elements #234

RobertoRDT · 2024-01-27T09:29:40Z

At the time of writing, EVT 3 parsing of TEI elements is sequential, element by element, which makes it problematic to refer to elements that are at a later position than the currently parsed element.
For example, to avoid the problem of data redundancy in TEI encoding for authorial philology, a solution based on the copyof attribute would be very effective:

in layer 0 we have <mod change="layer-0" xml:id="MS_12_23">original text</mod>
in layer 1 this text is deleted: <mod change="layer-1"><del copyof="#MS_12_23"/></mod>

This would also be a very useful mechanism in many other situations, e.g. to put together text based on portions of other separate documents and other stand-off use cases. However, since EVT 3 proceeds sequentially, when it encounters <del copyof="#MS_12_23"/> it would not yet have parsed the contents of <mod change="layer-0" xml:id="MS_12_23"> (authorial philology encoding proceeds in reverse chronological order, most recent text first) and thus that node would not yet exist.
Another currently impossible use case: retrieving the content of a footnote in the <back>, or of a bibliographic entry also in the <back>.

Possible solutions:

second round of parsing with only an actual reference to the external element, which, although it would increase the start-up time if the functionality were used intensively, would not have a particular impact on performance;
on-demand parsing of the requested node not yet encountered with a copy of the same in its own data structure (and thus duplication + memory costs).
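
To make the first option concrete, here is a minimal sketch of a two-pass approach: one pass indexes every node by xml:id, and a second pass follows copyof pointers through that index. The SimpleNode shape and the indexById and resolveText helpers are assumptions made for illustration only; they are not part of EVT 3, whose XMLElement and model types differ.

```typescript
// Hypothetical, simplified node shape; EVT's real XMLElement and model types differ.
interface SimpleNode {
  tag: string;
  attrs: Record<string, string | undefined>;
  text?: string;
  children: SimpleNode[];
}

// Pass 1: walk the whole tree once and index every node carrying an xml:id.
function indexById(
  root: SimpleNode,
  index: Map<string, SimpleNode> = new Map<string, SimpleNode>(),
): Map<string, SimpleNode> {
  const id = root.attrs['xml:id'];
  if (id !== undefined) {
    index.set(id, root);
  }
  root.children.forEach((child) => indexById(child, index));
  return index;
}

// Pass 2: collect text, following copyof pointers through the index.
// The `seen` set guards against circular copyof chains.
function resolveText(
  node: SimpleNode,
  index: Map<string, SimpleNode>,
  seen: Set<string> = new Set<string>(),
): string {
  const pointer = node.attrs['copyof'];
  if (pointer !== undefined) {
    const targetId = pointer.replace(/^#/, '');
    const target = index.get(targetId);
    if (target === undefined || seen.has(targetId)) {
      return ''; // dangling or circular reference: nothing to copy
    }
    return resolveText(target, index, new Set(seen).add(targetId));
  }
  const own = node.text ?? '';
  return own + node.children.map((child) => resolveText(child, index, seen)).join('');
}

// The example from this issue: the most recent layer comes first and points forward.
const layer1: SimpleNode = {
  tag: 'mod',
  attrs: { change: 'layer-1' },
  children: [{ tag: 'del', attrs: { copyof: '#MS_12_23' }, children: [] }],
};
const layer0: SimpleNode = {
  tag: 'mod',
  attrs: { change: 'layer-0', 'xml:id': 'MS_12_23' },
  text: 'original text',
  children: [],
};
const sampleDoc: SimpleNode = { tag: 'body', attrs: {}, children: [layer1, layer0] };

const idIndex = indexById(sampleDoc);
const layer1Text = resolveText(layer1, idIndex);
```

In this sketch the index stores references to nodes rather than copies, so the extra cost is the first walk over the tree plus one map entry per identified element.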


laurelled · 2024-05-13T14:33:11Z

I've had the time to analyse the complexity of implementing a second round of parsing, based on the current situation.
For those who lack the time to read thoroughly: in short, my opinion is that implementing it would require a huge refactoring effort on the existing code. We should discuss together which road to take.

I'll be redundant for the sake of clarity.
Currently, most of the parsing is done by specific classes that implement the Parser interface, which requires them to implement the parse method. I'll paste one of the classes here for reference:

@xmlParser('sic', SicParser)
export class SicParser extends EmptyParser implements Parser<XMLElement> {
    attributeParser = createParser(AttributeParser, this.genericParse);
    parse(xml: XMLElement): Sic {
        const attributes = this.attributeParser.parse(xml);
        const { type } = attributes;

        return {
            type: Sic,
            sicType: type || '',
            class: getClass(xml),
            content: parseChildren(xml, this.genericParse),
            attributes,
        };
    }
}

There's also that @xmlParser('sic', SicParser) decorator, which comes in pretty handy because it maps the tag name to its parser. The combination of the interface implementation and the mapping is great for a generic double-pass parsing. In fact, we would just need to get the tag name, retrieve the corresponding parser class and call the parse method, which every one of them has.
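
A minimal sketch of that dispatch idea, assuming a plain Map in place of EVT's actual decorator machinery. SketchElement, SketchParser, registerParser and parseByTagName are hypothetical names used only for illustration:

```typescript
// Hypothetical minimal shapes; not EVT's actual Parser interface or decorator implementation.
interface SketchElement {
  tagName: string;
  textContent: string;
}

interface SketchResult {
  type: string;
  text: string;
}

interface SketchParser {
  parse(xml: SketchElement): SketchResult;
}

// Registry that maps a tag name to the parser responsible for it.
const parserRegistry = new Map<string, SketchParser>();

function registerParser(tagName: string, parser: SketchParser): void {
  parserRegistry.set(tagName, parser);
}

// Fallback used when no specific parser is registered for a tag.
const fallbackParser: SketchParser = {
  parse: (xml) => ({ type: 'element', text: xml.textContent }),
};

registerParser('sic', {
  parse: (xml) => ({ type: 'sic', text: xml.textContent }),
});

// Generic dispatch: look up the parser by tag name and call its parse method.
function parseByTagName(xml: SketchElement): SketchResult {
  const parser = parserRegistry.get(xml.tagName) ?? fallbackParser;
  return parser.parse(xml);
}

const parsedSic = parseByTagName({ tagName: 'sic', textContent: 'teh' });
const parsedOther = parseByTagName({ tagName: 'foo', textContent: 'bar' });
```

A second pass could reuse the same lookup, which is why tags that are not registered under their real tag name are the complication.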

However, some parsing is not done following this method. For example:

Sometimes the string associated with the parser is not an actual tag name, for example in named-entity-parser.ts:

@xmlParser('evt-named-entities-list-parser', NamedEntitiesListParser)
export class NamedEntitiesListParser extends EmptyParser implements Parser<XMLElement> {
...
}

That adds complexity that must be taken care of while scanning through the document. Maybe this could be easily fixed by also adding the tag.

Some tags are not parsed through a parser class that implements the Parser interface and, thus, the parse method. For example, witness-parser.service.ts:

export class WitnessesParserService {
  private witListTagName = 'listWit';
  private witTagName = 'witness';
  private witNameAttr = 'type="siglum"';
  private groupTagName = 'head';
  private attributeParser = createParser(AttributeParser, parse);

  constructor(
    private genericParserService: GenericParserService,
  ) {
  }

  public parseWitnessesData(document: XMLElement): Witnesses {
    const lists = Array.from(document.querySelectorAll<XMLElement>(this.witListTagName));

    return {
      witnesses: this.parseWitnessesList(lists),
      groups: this.parseWitnessesGroups(lists),
    };
  }

  private parseWitnessesList(lists: XMLElement[]) {
    const parsedList = lists.filter((list) => !isNestedInElem(list, list.tagName))
      .map((list) => this.parseWitnesses(list))
      .reduce((x, y) => x.concat(y), []);

    return parsedList;
  }

It also doesn't include the mapping between listWit and this class. structure-xml-parser.service.ts, which is much more complicated, also works this way.

Refactoring these cases is not straightforward and would require thorough discussions with other devs, and from my understanding right now it's not really a good time for that. We should still discuss how to proceed, as it's a pretty important feature for the stand-off apparatus implementation.

RobertoRDT · 2024-05-14T14:36:27Z

[A summary of all subsequent comments so that they aren't lost in Slack]

RobertoRDT
As you noticed, this is pretty crucial for the stand-off markup processing needed to support DEPA, so we should try to find a (perhaps not perfect) solution right now.

Lorenzo Bafunno
So, the main topics that would start a discussion would be:

Is the double round of parsing worth the refactoring of the existing code? The refactoring could also introduce new bugs or unexpected behaviour.
Are there any other options?

@Davide I. Cucurnia got around it by duplicating some elements; the implementation can be found in analogue-parser.ts

RobertoRDT
Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?

Because, if I got it right, the problem is that when I have a list in the <back> and something connected to that list in the <body>, when EVT parses the <body> it still lacks the relevant info available in the <back>. Sorry if this doesn't make much sense 😅

@Davide I. Cucurnia
Yes, as briefly addressed weeks ago, the other solution is an on-demand parsing of the required (= referred) element when needed, similar to what's done in the analogue and source parsers. Memory-wise, however, a copy of the referred element is still needed in order to show it in the page, so we cannot avoid it. The double-round parsing solution would also inflate the boot time of the app...

Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?
Yes, this solution sounds similar to what is currently implemented in the analogue and source parsers, which is the on-demand parsing.
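
For readers unfamiliar with the on-demand approach, here is a minimal sketch under stated assumptions. It does not reproduce the analogue or source parsers; RawElement, ParsedRef and OnDemandResolver are hypothetical names. The referred element is parsed the first time a pointer to it is met, and the parsed copy is cached so that later references reuse it:

```typescript
// Hypothetical shapes for illustration; EVT's analogue and source parsers are not reproduced here.
interface RawElement {
  id: string;
  tagName: string;
  text: string;
}

interface ParsedRef {
  type: string;
  id: string;
  content: string;
}

class OnDemandResolver {
  private readonly rawById: Map<string, RawElement>;
  private readonly cache = new Map<string, ParsedRef>();
  parseCount = 0; // exposed only to make the caching visible

  constructor(rawById: Map<string, RawElement>) {
    this.rawById = rawById;
  }

  // Parse the referred element the first time it is requested, then reuse the result.
  resolve(pointer: string): ParsedRef | undefined {
    const id = pointer.startsWith('#') ? pointer.slice(1) : pointer;
    const cached = this.cache.get(id);
    if (cached !== undefined) {
      return cached;
    }
    const raw = this.rawById.get(id);
    if (raw === undefined) {
      return undefined; // the pointer does not match any known element
    }
    this.parseCount += 1;
    const parsed: ParsedRef = { type: raw.tagName, id: raw.id, content: raw.text };
    this.cache.set(id, parsed);
    return parsed;
  }
}

const backElements = new Map<string, RawElement>([
  ['note1', { id: 'note1', tagName: 'note', text: 'A footnote stored in the back matter.' }],
]);
const resolver = new OnDemandResolver(backElements);
const firstLookup = resolver.resolve('#note1');
const secondLookup = resolver.resolve('#note1');
```

The cache is where the memory cost mentioned above shows up: each referred element that is actually used ends up stored once in parsed form.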

RobertoRDT
Would it be safe to say that if I put everything in the <standOff> element before <text> there would be little need to copy the referred elements in memory? As an interim solution to give us time to look for a proper fix.

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html

As you can see in this example <standOff> can precede the main <text>: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html#index-egXML-d54e131382

Lorenzo Bafunno
Thank you for the link, I didn't understand at first. Still, I don't think that would change the situation. A copy of the referred element is still needed. The problem is that when a parser class parses the XML/TEI code, it's not provided with any context about what has been parsed previously, nor does it know what is yet to be parsed.

I'll try to provide an example

RobertoRDT
OK, that means I got it wrong, because I thought the problem arises when the referred element is further down in the TEI document.

Lorenzo Bafunno
So, currently, from my limited understanding (Davide I. Cucurnia, correct me if I'm wrong), every time we want to extract information needed for a component, we follow these steps:

we pass the whole document to a parsing service, which narrows the scope to only those elements we need to extract.
the service calls the parser to retrieve information from a tag, or a series of them.
we store the result of the whole process in a data structure, most likely an array.

Ideally, we could retrieve data from both the <body> and the <back> and merge it together in one structure. That's no problem (easier said than done). But that creates duplication. An example would be this:

<body>
 <w xml:id="w1">hello</w>
 <w xml:id="w2">world</w>
</body>
<back>
 <app from="w1"> <rdg wit="#c">hallo</rdg> </app>
</back>

Let's imagine we store a combination of <w> and the corresponding app and let's call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up "useless" memory.

Sorry if it took me that long but I needed time to make it clear, hope it makes sense 😅

Andrew Forsberg

Let’s imagine we store a combination of <w> and the corresponding app and let’s call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up “useless” memory
@Lorenzo Bafunno

— Apologies in advance, as this might be an overly naive suggestion, but just in case — would an initial parse of <back> to create a unique set of from values help at all? The latter could be used with a hasBackRef() helper function. Then, during the full first (and only) parse a quick and almost free check on that set would identify whether there were elements in <back> that needed to be taken into account.
(nb: I haven’t had a chance to check @Davide I. Cucurnia’s analogue-parser.ts yet. That’s next 🙂)

ajf-ajf · 2024-05-16T05:35:11Z

Thanks @Lorenzo Bafunno (@laurelled), for clarifying so many details on Tuesday. I’m still not proposing this as a ‘sure fire’ solution, but it might work as an interim patch of sorts. The idea is:

Identify the app node and scan it first.
Create a simple set that contains only the unique from attribute values.
When parsing a node in the document, use a helper check function to test whether the string exists.
a. If it does, collect and parse whatever’s needed from the app node; or
b. If it doesn’t, onwards and upwards, we’re done here.

It’s not great, and it means another sort-of global/config type variable hanging around. On the other hand, as an interim solution — it’s only a set containing a few strings. The check against it should be very cheap in terms of resources.
Anyhow, that’s the basic idea. Any thoughts?
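
A minimal sketch of the idea, using the <w> / <app> example from earlier in this thread. The BackApp shape and the collectBackRefs helper are assumptions made for illustration; hasBackRef is the helper named above, with a signature that is only a guess:

```typescript
// Hypothetical shape; <app> entries are reduced to the fields used in this thread's example.
interface BackApp {
  from: string;       // value of the from attribute, e.g. "w1" or "#w1"
  readings: string[]; // text of the <rdg> children
}

// Accept pointers with or without a leading '#'.
const normalizeId = (pointer: string): string => pointer.replace(/^#/, '');

// Pre-scan: collect the unique from targets found in <back>.
function collectBackRefs(apps: BackApp[]): Set<string> {
  return new Set(apps.map((app) => normalizeId(app.from)));
}

// Cheap membership test used while parsing the body.
function hasBackRef(backRefs: Set<string>, xmlId: string): boolean {
  return backRefs.has(normalizeId(xmlId));
}

const backApps: BackApp[] = [{ from: 'w1', readings: ['hallo'] }];
const backRefs = collectBackRefs(backApps);

// During the single main parse, only ids found in the set trigger a lookup in <back>.
const bodyWordIds = ['w1', 'w2'];
const wordsWithApparatus = bodyWordIds.filter((id) => hasBackRef(backRefs, id));
```

Set membership is a constant-time check on average, which is why the per-node cost during the main parse stays small.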

laurelled · 2024-06-10T12:54:38Z

As I have found over these weeks, Andrew's solution is easy to implement. However, another problem has arisen: how the parsing and the subsequent visualization are done.

I'll paste the information found in the parsing wiki:

After reading the source file indicated in the proper configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change when we will add the support for critical edition and pageless editions). A page is identified as the list of XML elements included between a <pb/> and the next one (or between a <pb/> and the end of the node containing the main text, which is the <body> in the case of the last page).
Each page is represented in the EVT Model as a Page:
interface Page {
  id: string;
  label: string;
  originalContent: OriginalEncodingNodeType[];
  parsedContent: Array<ParseResult<GenericElement>>;
}
The content of each page is therefore represented as an array of objects retrieved by parsing the original XML nodes.
After parsing the structure, for each page identified, we then proceed to parse all the child nodes, by calling the parse method of the GenericParserService. Parsers are defined in a map that associates a parser with each supported tagName. This map is retrieved by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used. Tags and parsers are grouped by the TEI module they belong to.
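
As an illustration of the page model described in the wiki excerpt, here is a minimal sketch that groups a flat list of nodes into pages at each <pb/>. FlatNode, SketchPage and splitIntoPages are hypothetical; the real structure parser works on XML nodes and is more involved:

```typescript
// Hypothetical flat node list; the real EVT structure parser works on XML nodes.
interface FlatNode {
  tagName: string;
  attrs: Record<string, string | undefined>;
  text?: string;
}

interface SketchPage {
  id: string;
  originalContent: FlatNode[];
}

// Group nodes into pages: each <pb/> opens a new page that runs until the next <pb/>
// or the end of the list. Nodes before the first <pb/> are skipped in this sketch.
function splitIntoPages(nodes: FlatNode[]): SketchPage[] {
  const pages: SketchPage[] = [];
  let current: SketchPage | undefined;
  for (const node of nodes) {
    if (node.tagName === 'pb') {
      // Fall back to a generated id when the <pb/> has no xml:id.
      current = { id: node.attrs['xml:id'] ?? `page_${pages.length + 1}`, originalContent: [] };
      pages.push(current);
    } else if (current !== undefined) {
      current.originalContent.push(node);
    }
  }
  return pages;
}

const samplePages = splitIntoPages([
  { tagName: 'pb', attrs: { 'xml:id': 'fol_1r' } },
  { tagName: 'p', attrs: {}, text: 'First page text' },
  { tagName: 'pb', attrs: {} },
  { tagName: 'p', attrs: {}, text: 'Second page text' },
  { tagName: 'p', attrs: {}, text: 'More text' },
]);
```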

That's great, except for the fact that each resulting type is associated with a component through the ContentViewerComponent in a very general way:

This is a dynamic component that takes a ParsedElement as input and establishes which component to use for displaying this data based on the type indicated in the type property.
This type is used to manage the component register, to be accessed for dynamic compilation, and also the type of data that the component in question receives as input

The content viewer takes the Page parsedContent attribute and, for each element of the array, associates it with a component and displays it. But with DEPA this way of handling things causes problems, because there can be apparatuses that overlap each other. They (i) wouldn't be a separate entity and (ii) I can't surround them with a new parent tag without ruining the XML encoding.

The only solution I can think of is creating a fake surrounding tag to handle it in a specific component, "imitating" the way EVT2 handled it.
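
A rough sketch of what such a fake surrounding element could look like if it were created on the parsed content instead of in the XML. ParsedItem, RangeApp, wrapRange and the SyntheticAppWrapper type are all hypothetical, and this version only handles a single range; overlapping apparatuses would still need a separate strategy:

```typescript
// Hypothetical shapes; EVT's parsedContent entries and DEPA model are richer than this.
interface ParsedItem {
  type: string;
  id?: string;
  text?: string;
  content?: ParsedItem[];
}

interface RangeApp {
  from: string; // xml:id of the first covered element
  to: string;   // xml:id of the last covered element
}

// Replace the items spanning from..to with one synthetic wrapper item, leaving the
// source XML untouched. Returns the input unchanged if the range cannot be located.
function wrapRange(items: ParsedItem[], app: RangeApp): ParsedItem[] {
  const start = items.findIndex((item) => item.id === app.from);
  const end = items.findIndex((item) => item.id === app.to);
  if (start === -1 || end === -1 || end < start) {
    return items;
  }
  const wrapper: ParsedItem = {
    type: 'SyntheticAppWrapper',
    content: items.slice(start, end + 1),
  };
  return [...items.slice(0, start), wrapper, ...items.slice(end + 1)];
}

const flatContent: ParsedItem[] = [
  { type: 'Word', id: 'w1', text: 'hello' },
  { type: 'Word', id: 'w2', text: 'world' },
  { type: 'Word', id: 'w3', text: 'again' },
];
const wrappedContent = wrapRange(flatContent, { from: 'w1', to: 'w2' });
```

Because the wrapper has its own type value, a dedicated component could be registered for it in the same way as for any other parsed element.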

RenatoCaenaro · 2024-11-04T20:37:35Z

For DEPA (#239) we are solving the problem with an in-memory dictionary of the selected elements () that can be queried when parsing the document. We can think of a similar strategy for these attributes as well.
I will discuss this with my team in the coming days.

RobertoRDT added the bug and enhancement labels Jan 27, 2024

RobertoRDT added this to the 1.0 milestone Jan 27, 2024

laurelled mentioned this issue Jul 3, 2024

Parser's current state blocks the development of Double End Point Attachment (DEPA) #239

Open
