Non-sequential parsing of TEI elements #234

Open
RobertoRDT opened this issue Jan 27, 2024 · 5 comments

@RobertoRDT
Member

At the time of writing, EVT 3 parsing of TEI elements is sequential, element by element, which makes it problematic to refer to elements that are at a later position than the currently parsed element.
For example, to avoid data redundancy in TEI encoding for authorial philology, a solution based on the copyOf attribute would be very effective:

  • in layer 0 we have <mod change="layer-0" xml:id="MS_12_23">original text</mod>
  • in layer 1 this text is deleted: <mod change="layer-1"><del copyOf="#MS_12_23"/></mod>

This would also be a very useful mechanism in many other situations, e.g. to assemble text from portions of other separate documents, and in other stand-off use cases. However, since EVT 3 proceeds sequentially, when it encounters <del copyOf="#MS_12_23"/> it will not yet have parsed the contents of <mod change="layer-0" xml:id="MS_12_23"> (authorial philology encoding proceeds in reverse chronological order, most recent text first), and thus that node will not yet exist.
Another currently impossible use case: retrieving the content of a footnote in the <back>, or of a bibliographic entry also in the <back>.

Possible solutions:

  1. a second round of parsing that stores only a reference to the external element; this would increase start-up time if the functionality were used intensively, but should not otherwise have a significant impact on performance;
  2. on-demand parsing of the requested node when it has not yet been encountered, storing a copy of it in the requesting element's own data structure (and thus duplication + memory costs).
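
As a rough illustration of option 1, the sketch below indexes every node carrying an xml:id during the first pass and resolves copyOf pointers in a second pass. The node types (RawNode, ParsedNode) and function names are hypothetical simplifications, not EVT's actual model or API.

```typescript
// Hypothetical, simplified node model; EVT's real XMLElement and parsed types are richer.
interface RawNode {
  tag: string;
  attrs: Record<string, string | undefined>;
  text?: string;
  children?: RawNode[];
}

interface ParsedNode {
  tag: string;
  id: string | undefined;
  copyOf: string | undefined; // target id, without the leading '#'
  text: string;
  children: ParsedNode[];
}

// Pass 1: parse sequentially, recording every node that carries an xml:id.
function firstPass(raw: RawNode, index: Map<string, ParsedNode>): ParsedNode {
  const parsed: ParsedNode = {
    tag: raw.tag,
    id: raw.attrs['xml:id'],
    copyOf: raw.attrs['copyOf']?.replace(/^#/, ''),
    text: raw.text ?? '',
    children: (raw.children ?? []).map((child) => firstPass(child, index)),
  };
  if (parsed.id) {
    index.set(parsed.id, parsed);
  }
  return parsed;
}

// Pass 2: walk the parsed tree and fill in nodes that point elsewhere.
// Only a reference to the target's content is stored, not a deep copy.
// Assumption: a node never points at one of its own ancestors (no cycle check here).
function secondPass(node: ParsedNode, index: Map<string, ParsedNode>): void {
  if (node.copyOf) {
    const target = index.get(node.copyOf);
    if (target) {
      node.text = target.text;
      node.children = target.children;
    }
  }
  node.children.forEach((child) => secondPass(child, index));
}

function parseTwoPass(root: RawNode): ParsedNode {
  const index = new Map<string, ParsedNode>();
  const tree = firstPass(root, index);
  secondPass(tree, index);
  return tree;
}
```

With this shape, the position of the target in the document no longer matters, because every pointer is resolved only after the whole tree has been indexed.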
@RobertoRDT RobertoRDT added bug Something isn't working enhancement New feature or request labels Jan 27, 2024
@RobertoRDT RobertoRDT added this to the 1.0 milestone Jan 27, 2024
@laurelled
Contributor

I've had time to analyse the complexity of implementing a second round of parsing, based on the current situation.
For those who lack the time to read this thoroughly: in short, my opinion is that implementing it would require a huge refactoring effort on the existing code. We should discuss together which road to take.

I'll be redundant for the sake of clarity.
Currently, most of the parsing is done by specific classes that implement the Parser interface, which requires them to implement the parse method. I'll paste one of the classes here for reference:

@xmlParser('sic', SicParser)
export class SicParser extends EmptyParser implements Parser<XMLElement> {
    attributeParser = createParser(AttributeParser, this.genericParse);
    parse(xml: XMLElement): Sic {
        const attributes = this.attributeParser.parse(xml);
        const { type } = attributes;

        return {
            type: Sic,
            sicType: type || '',
            class: getClass(xml),
            content: parseChildren(xml, this.genericParse),
            attributes,
        };
    }
}

There's also that @xmlParser('sic', SicParser), which comes in pretty handy because it maps the tag name to its parser. The combination of the interface implementation and the mapping is great for a generic double-pass parsing. In fact, we would just need to get the tag name, retrieve the corresponding parser class and call the parse method, which all of them have.
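
A minimal sketch of that lookup, assuming a registry keyed by tag name. The names here (registerParser, parseByTagName, XmlLike, SicLikeParser) are hypothetical stand-ins; a plain function is used in place of the @xmlParser decorator to keep the example self-contained, and it is not meant to mirror EVT's actual implementation.

```typescript
// Hypothetical stand-ins for EVT's XMLElement, Parser interface and parsed types.
interface XmlLike {
  tagName: string;
  textContent: string;
}

interface ParsedElement {
  type: string;
  content: string;
}

interface TagParser {
  parse(xml: XmlLike): ParsedElement;
}

// Maps a tag name to the class that knows how to parse it.
const parserRegistry = new Map<string, new () => TagParser>();

// Plain-function equivalent of what a decorator such as @xmlParser might record.
function registerParser(tagName: string, parserClass: new () => TagParser): void {
  parserRegistry.set(tagName, parserClass);
}

class SicLikeParser implements TagParser {
  parse(xml: XmlLike): ParsedElement {
    return { type: 'Sic', content: xml.textContent };
  }
}
registerParser('sic', SicLikeParser);

// Generic entry point: get the tag name, retrieve the parser class, call parse.
// Returns undefined when no parser is registered for the tag.
function parseByTagName(xml: XmlLike): ParsedElement | undefined {
  const ParserClass = parserRegistry.get(xml.tagName);
  return ParserClass ? new ParserClass().parse(xml) : undefined;
}
```

A second pass could call the same entry point for any element it needs to revisit, provided that element's tag is registered.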

However, some parsing is not done following this method. For example:

  • Sometimes the association between the string and the parser is not an actual tag name, for example in the named-entity-parser.ts:
@xmlParser('evt-named-entities-list-parser', NamedEntitiesListParser)
export class NamedEntitiesListParser extends EmptyParser implements Parser<XMLElement> {
...
}

That adds complexity that has to be handled while scanning through the document. Maybe this could be easily fixed by also adding the tag

  • Some tags are not parsed through a parser class which implements the Parser interface and, thus, the parse method. For example, witness-parser.service.ts:
export class WitnessesParserService {
  private witListTagName = 'listWit';
  private witTagName = 'witness';
  private witNameAttr = 'type="siglum"';
  private groupTagName = 'head';
  private attributeParser = createParser(AttributeParser, parse);

  constructor(
    private genericParserService: GenericParserService,
  ) {
  }

  public parseWitnessesData(document: XMLElement): Witnesses {
    const lists = Array.from(document.querySelectorAll<XMLElement>(this.witListTagName));

    return {
      witnesses: this.parseWitnessesList(lists),
      groups: this.parseWitnessesGroups(lists),
    };
  }

  private parseWitnessesList(lists: XMLElement[]) {
    const parsedList = lists.filter((list) => !isNestedInElem(list, list.tagName))
      .map((list) => this.parseWitnesses(list))
      .reduce((x, y) => x.concat(y), []);

    return parsedList;
  }

It also doesn't include the mapping between listWit and this class. structure-xml-parser.service.ts, which is much more complicated, also works this way.

Refactoring these cases is not straightforward and would require thorough discussions with other devs, and from my understanding right now it's not really a good time for that. We should still discuss how to proceed, as it's a pretty important feature for the stand-off apparatus implementation.

@RobertoRDT
Member Author

RobertoRDT commented May 14, 2024

[A summary of all subsequent comments so that they aren't lost in Slack]

RobertoRDT
As you noticed, this is pretty crucial for the stand-off markup processing needed to support DEPA, so we should try to find a (perhaps not perfect) solution right now.

Lorenzo Bafunno
So, the main topics that would start a discussion would be:

  • Is the double round of parsing worth the refactoring of the existing code? It would also risk introducing new bugs or unexpected behaviour
  • Are there any other options?

@Davide I. Cucurnia got around it by duplicating some elements; the implementation can be found in analogue-parser.ts

RobertoRDT
Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?

Because, if I got it right, the problem is that when I have a list in the <back> and something connected to that list in the <body>, when EVT parses the <body> it still lacks the relevant info available in the <back>. Sorry if this doesn't make much sense 😅

@Davide I. Cucurnia
Yes, as briefly discussed a few weeks ago, the other solution is on-demand parsing of the required (i.e. referred) element when needed, similar to what's done in the analogue and source parsers. Memory-wise, however, a copy of the referred element is still needed in order to show it on the page, so we cannot avoid it. The double-round parsing solution would also inflate the boot time of the app...

Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?
Yes, this solution sounds similar to what is currently implemented in the analogue and source parsers, namely on-demand parsing.
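
To make the on-demand idea concrete, here is a minimal sketch that parses a referred element only the first time it is requested and caches the result. OnDemandResolver and its types are hypothetical; this is not the code in analogue-parser.ts, only an illustration of the general technique under the assumption that source nodes can be looked up by xml:id.

```typescript
// Hypothetical, simplified shapes for an unparsed source node and its parsed form.
interface SourceNode {
  id: string;
  tag: string;
  text: string;
}

interface ParsedRef {
  id: string;
  tag: string;
  content: string;
}

class OnDemandResolver {
  private cache = new Map<string, ParsedRef>();
  private nodesById: Map<string, SourceNode>;
  // Exposed only so the sketch can show how often real parsing happens.
  parseCount = 0;

  constructor(nodesById: Map<string, SourceNode>) {
    this.nodesById = nodesById;
  }

  // Accepts a pointer such as '#note1' and returns the parsed target, if any.
  resolve(pointer: string): ParsedRef | undefined {
    const id = pointer.replace(/^#/, '');
    const cached = this.cache.get(id);
    if (cached) {
      return cached;
    }
    const source = this.nodesById.get(id);
    if (!source) {
      return undefined;
    }
    // Stand-in for the real parsing work: done once per target id.
    this.parseCount += 1;
    const parsed: ParsedRef = { id, tag: source.tag, content: source.text.trim() };
    this.cache.set(id, parsed);
    return parsed;
  }
}
```

The cache still holds one parsed copy per referred element, which matches the memory cost mentioned above, but repeated references to the same target do not trigger repeated parsing.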

RobertoRDT
Would it be safe to say that if I put everything in the <standOff> element before <text> there would be little need to copy the referred elements in memory? As an interim solution to give us time to look for a proper fix.

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html

As you can see in this example <standOff> can precede the main <text>: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html#index-egXML-d54e131382

Lorenzo Bafunno
Thank you for the link, I didn't understand at first. Still, I don't think that would change the situation. A copy of the referred element is still needed. The problem is that when a parser class parses the XML/TEI code, it's not provided with any context about what has been parsed previously, nor does it know what is yet to be parsed.

I'll try to provide an example

RobertoRDT
OK, that means I got it wrong, because I thought the problem only arises when the referred element comes later in the TEI document.

Lorenzo Bafunno
So, currently, from my limited understanding (Davide I. Cucurnia, correct me if I'm wrong), every time we want to extract information needed for a component, we follow these steps:

  • we pass the whole document to a parsing service, which narrows the scope to only those elements we need to extract.
  • the service calls the parser to retrieve information from a tag, or a series of them.
  • we store the result of the whole process in a data structure, most likely an array.

Ideally, we could retrieve data from both the <body> and the <back> and merge it together in one structure. That's no problem (easier said than done). But that creates duplication. An example would be this:

<body>
 <w xml:id="w1">hello</w>
 <w xml:id="w2">world</w>
</body>
<back>
 <app from="w1"> <rdg wit="#c">hallo</rdg> </app>
</back>

Let's imagine we store a combination of <w> and the corresponding app and let's call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up "useless" memory.
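
One way to limit that redundancy, sketched here under assumptions, is for structure A to hold references to the objects already produced by the <app> parser rather than fresh copies. All types and function names below (ParsedApp, ParsedWord, WordWithApps, indexAppsByTarget, linkWords) are hypothetical and only illustrate the idea.

```typescript
// Hypothetical parsed shapes; EVT's real apparatus and word models differ.
interface ParsedApp {
  from: string; // pointer to the target word, with or without a leading '#'
  readings: { wit: string; text: string }[];
}

interface ParsedWord {
  id: string;
  text: string;
}

// "Structure A": a word plus the apparatus entries that point at it.
interface WordWithApps {
  word: ParsedWord;
  apps: ParsedApp[]; // references to already parsed objects, not copies
}

// Groups the already parsed <app> objects by the id they point to.
function indexAppsByTarget(apps: ParsedApp[]): Map<string, ParsedApp[]> {
  const byTarget = new Map<string, ParsedApp[]>();
  for (const app of apps) {
    const key = app.from.replace(/^#/, '');
    const bucket = byTarget.get(key) ?? [];
    bucket.push(app);
    byTarget.set(key, bucket);
  }
  return byTarget;
}

// Builds structure A for every word; words without apparatus get an empty list.
function linkWords(words: ParsedWord[], byTarget: Map<string, ParsedApp[]>): WordWithApps[] {
  return words.map((word) => ({ word, apps: byTarget.get(word.id) ?? [] }));
}
```

Because the linked entries are the same object instances, the extra memory is roughly one index entry per target rather than a second copy of each apparatus entry.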

Sorry if it took me that long but I needed time to make it clear, hope it makes sense 😅

Andrew Forsberg

Let’s imagine we store a combination of <w> and the corresponding app and let’s call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up “useless” memory
@Lorenzo Bafunno

— Apologies in advance, as this might be an overly naive suggestion, but just in case — would an initial parse of <back> to create a unique set of from values help at all? The latter could be used with a hasBackRef() helper function. Then, during the full first (and only) parse a quick and almost free check on that set would identify whether there were elements in <back> that needed to be taken into account.
(nb: I haven’t had a chance to check @Davide I. Cucurnia’s analogue-parser.ts yet. That’s next 🙂)

@ajf-ajf
Collaborator

ajf-ajf commented May 16, 2024

Thanks @Lorenzo Bafunno (@laurelled), for clarifying so many details on Tuesday. I’m still not proposing this as a ‘sure fire’ solution, but it might work as an interim patch of sorts. The idea is:

  1. Identify the app node and scan it first.
  2. Create a simple set that contains only unique from variable strings.
  3. When parsing a node in the document, use a helper check function to test whether the string exists.
    a. If it does, collect and parse whatever’s needed from the app node; or
    b. If it doesn’t, onwards and upwards, we’re done here.

It’s not great, and it means another sort-of global/config type variable hanging around. On the other hand, as an interim solution — it’s only a set containing a few strings. The check against it should be very cheap in terms of resources.
Anyhow, that’s the basic idea. Any thoughts?
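
The steps above can be sketched roughly as follows. The hasBackRef name comes from the earlier comment; everything else (AppLike, collectBackRefs, makeHasBackRef) is a hypothetical illustration that assumes the from values are plain ids, optionally prefixed with '#'.

```typescript
// Hypothetical minimal view of an <app> element: only its from attribute matters here.
interface AppLike {
  from: string;
}

// Steps 1 and 2: scan the apparatus entries once and keep only the unique target ids.
function collectBackRefs(apps: AppLike[]): Set<string> {
  return new Set(apps.map((app) => app.from.replace(/^#/, '')));
}

// Step 3: build the helper used while parsing each node of the document.
function makeHasBackRef(refs: Set<string>): (xmlId: string | undefined) => boolean {
  return (xmlId) => xmlId !== undefined && refs.has(xmlId);
}
```

During the main parse, a node would only trigger a lookup in the apparatus when hasBackRef returns true for its xml:id, so nodes without back references cost a single set membership check.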

@laurelled
Contributor

laurelled commented Jun 10, 2024

As I have found over the past few weeks, Andrew's solution is easy to implement. However, another problem has arisen, which concerns how the parsing and the subsequent visualization are done.

I'll paste the information found in the parsing wiki:

After reading the source file indicated in the proper configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change when we will add the support for critical edition and pageless editions). A page is identified as the list of XML elements included between a <pb/> and the next one (or between a <pb/> and the end of the node containing the main text, which is the <body> in the case of the last page).
Each page is represented in the EVT Model as a Page:

interface Page {
  id: string;
  label: string;
  originalContent: OriginalEncodingNodeType[];
  parsedContent: Array<ParseResult<GenericElement>>;
}

The content of each page is therefore represented as an array of objects retrieved by parsing the original XML nodes.
After parsing the structure, for each page identified, we then proceed to parse all the child nodes, by calling the parse method of the GenericParserService. Parsers are defined in a map that associates a parser with each supported tagName. This map is retrieved by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used. Tags and parsers are grouped by the TEI module they belong to.
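
As an illustration of the page definition quoted above (the elements between one <pb/> and the next, or the end of the content), here is a minimal sketch over a flat list of nodes. FlatNode, PageSketch and splitIntoPages are hypothetical; the real structure parser works on the XML tree and handles many more cases.

```typescript
// Hypothetical flat node model; the real parser walks an XML tree instead.
interface FlatNode {
  tag: string;
  attrs: Record<string, string | undefined>;
  text?: string;
}

// Reduced version of the Page interface: only the unparsed content is kept.
interface PageSketch {
  id: string;
  label: string;
  originalContent: FlatNode[];
}

// Each <pb/> opens a new page; following nodes belong to it until the next <pb/>.
function splitIntoPages(nodes: FlatNode[]): PageSketch[] {
  const pages: PageSketch[] = [];
  let current: PageSketch | undefined;
  for (const node of nodes) {
    if (node.tag === 'pb') {
      // Fall back to a generated id and to the id as label when attributes are missing.
      const id = node.attrs['xml:id'] ?? `page_${pages.length + 1}`;
      current = { id, label: node.attrs['n'] ?? id, originalContent: [] };
      pages.push(current);
    } else if (current) {
      current.originalContent.push(node);
    }
    // Nodes that appear before the first <pb/> are ignored in this sketch.
  }
  return pages;
}
```

Each page's originalContent would then be handed to the generic parsing function to produce its parsedContent.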

That's great, except for the fact that each resulting type is associated with a component through the ContentViewerComponent in a very general way:

This is a dynamic component that takes a ParsedElement as input and establishes which component to use for displaying this data based on the type indicated in the type property.
This type is used to manage the component register, to be accessed for dynamic compilation, and also the type of data that the component in question receives as input

The content viewer takes the Page parsedContent attribute and, for each element of the array, associates it with a component and visualizes it. But with DEPA this way of handling things causes problems, because there can be apparatuses that overlap each other. They (i) wouldn't be a separate entity and (ii) I can't surround them with a new parent tag without ruining the XML encoding.

The only solution I can think of is creating a fake surrounding tag to handle it in a specific component, "imitating" the way EVT2 handled it.

@RenatoCaenaro
Collaborator

For DEPA (#239) we are solving the problem with an in-memory dictionary of the selected elements () that can be queried when parsing the document. We can think of a similar strategy for these attributes as well.
I will discuss this with my team in the coming days.
