Parser's current state blocks the development of Double End Point Attachment (DEPA) #239

laurelled · 2024-07-03T08:42:00Z

The EVT3 parser module is too dependent on the structure of the edition and relies heavily on the intrinsic XML nesting to work properly. See #234 for further analysis of how the current parser works and of its problems; here I'll mainly stick to DEPA-related issues.
First of all, I'll paste the information found in the parsing wiki:

After reading the source file indicated in the proper configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change when we add support for critical editions and pageless editions). A page is identified as the list of XML elements included between a <pb/> and the next one (or, in the case of the last page, between a <pb/> and the end of the node containing the main text, which is the <body>).
Each page is represented in the EVT Model as a Page:
interface Page {
   id: string;
   label: string;
   originalContent: OriginalEncodingNodeType[];
   parsedContent: Array<ParseResult<GenericElement>>;
}
The content of each page is therefore represented as an array of objects retrieved by parsing the original XML nodes.
After parsing the structure, for each page identified, we then proceed to parse all the child nodes, by calling the parse method of the GenericParserService.
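To make the page rule concrete, here is a minimal sketch of the splitting step. It is not EVT's actual code: `FlatNode`, `PageSketch`, and `splitIntoPages` are hypothetical names, the nodes are plain objects rather than DOM nodes, and it assumes every <pb/> is a direct child of <body>, which is exactly the kind of nesting assumption this issue is about.

```typescript
// Hypothetical, simplified stand-in for an XML node (EVT works on real DOM nodes).
interface FlatNode {
  tagName: string;
  attributes: { [name: string]: string | undefined };
  text?: string;
}

// Reduced version of the Page interface above (no parsedContent yet).
interface PageSketch {
  id: string;
  label: string;
  originalContent: FlatNode[];
}

// Each <pb/> opens a new page; every following node belongs to that page
// until the next <pb/> or the end of the list.
function splitIntoPages(bodyChildren: FlatNode[]): PageSketch[] {
  const pages: PageSketch[] = [];
  let current: PageSketch | undefined;
  for (const node of bodyChildren) {
    if (node.tagName === 'pb') {
      // Assumption: fall back to a generated id when xml:id is missing.
      const id = node.attributes['xml:id'] ?? `page_${pages.length + 1}`;
      current = { id, label: node.attributes['n'] ?? id, originalContent: [] };
      pages.push(current);
    } else if (current) {
      current.originalContent.push(node);
    }
    // Nodes that appear before the first <pb/> are ignored in this sketch.
  }
  return pages;
}

const examplePages = splitIntoPages([
  { tagName: 'pb', attributes: { 'xml:id': 'p1', n: '1r' } },
  { tagName: 'p', attributes: {}, text: 'first page text' },
  { tagName: 'pb', attributes: { 'xml:id': 'p2' } },
  { tagName: 'p', attributes: {}, text: 'second page text' },
]);
console.log(examplePages.map((page) => `${page.label}: ${page.originalContent.length} node(s)`));
```

A <pb/> nested inside a <p> would never be seen by this loop, so the page boundary would be missed.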

Parsers are defined in a map that associates a parser with each supported tagName. This map is retrieved by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used. Tags and parsers are grouped by the TEI module they belong to.
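The lookup described here boils down to a map with a fallback. The sketch below only illustrates that pattern: `XmlNodeSketch`, `ParsedSketch`, `parserMap`, `parseNode`, and the two registered tags are assumptions for the example, not EVT's real registry or types.

```typescript
// Hypothetical, simplified types; EVT's real parsers and models are richer.
interface XmlNodeSketch {
  tagName: string;
  text: string;
}

interface ParsedSketch {
  type: string;
  content: string;
}

type ParserFn = (node: XmlNodeSketch) => ParsedSketch;

// Fallback parser: adds no logic, it just wraps the node as a generic element.
const elementParser: ParserFn = (node) => ({ type: 'GenericElement', content: node.text });

// tagName -> parser. These entries are illustrative only.
const parserMap = new Map<string, ParserFn>([
  ['bibl', (node) => ({ type: 'BibliographicEntry', content: node.text })],
  ['note', (node) => ({ type: 'Note', content: node.text })],
]);

// Choose the parser from the tag name alone; unknown tags get the fallback.
function parseNode(node: XmlNodeSketch): ParsedSketch {
  const parser = parserMap.get(node.tagName) ?? elementParser;
  return parser(node);
}

console.log(parseNode({ tagName: 'bibl', text: 'Some reference' }).type);
console.log(parseNode({ tagName: 'unknownTag', text: 'plain' }).type);
```

Note that the only key is the tag name of a single node, so nothing in this lookup can express a range that starts in one element and ends in another.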

That's great, except for the fact that each resulting type is associated with a component through the ContentViewerComponent:

This is a dynamic component that takes a ParsedElement as input and establishes which component to use for displaying this data based on the type indicated in the type property.
This type is used to manage the component register, to be accessed for dynamic compilation, and also the type of data that the component in question receives as input

And the problem is that the parsing and the visualization are too dependent on XML tags. The content viewer takes the Page parsedContent attribute and, for each element of the array (that is, an XML tag that has been parsed and contains information only about its own scope and the nested ones), the ContentViewer associates it with a specific component (e.g. the <bibl> tag is associated with the BibliographyComponent) and visualizes it. But with DEPA it's a real mess, because its major strength is using anchors as delimiters. These anchors can be siblings, or even be children of different tags. Apparatuses can also overlay each other. Therefore, it is hard to think of a forced solution where we would surround the affected scope with a fake tag just to visualize it correctly, especially with overlaying apparatuses.
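A framework-free sketch of that one-element-one-component dispatch, just to show where the coupling sits. In EVT these are Angular components; here `ParsedElementSketch`, `componentRegister`, `renderElement`, and `renderPage` are hypothetical names and the renderers simply return strings.

```typescript
// Hypothetical stand-in for a parsed element; only `type` drives the dispatch.
interface ParsedElementSketch {
  type: string;
  text: string;
}

type RenderFn = (data: ParsedElementSketch) => string;

// type -> renderer register. The entries are illustrative only.
const componentRegister = new Map<string, RenderFn>([
  ['BibliographicEntry', (data) => `[bibliography] ${data.text}`],
  ['Word', (data) => `[word] ${data.text}`],
]);

// Each element is dispatched on its own: the renderer never sees its siblings.
function renderElement(data: ParsedElementSketch): string {
  const render = componentRegister.get(data.type);
  return render ? render(data) : data.text;
}

// A page is rendered by mapping over parsedContent, one element at a time.
function renderPage(parsedContent: ParsedElementSketch[]): string[] {
  return parsedContent.map(renderElement);
}

console.log(renderPage([
  { type: 'Word', text: 'this' },
  { type: 'Word', text: 'is' },
  { type: 'BibliographicEntry', text: 'Some reference' },
]));
```

Because `renderElement` receives a single element, an apparatus that spans from one sibling to another has no natural place in this pipeline.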


laurelled · 2024-07-04T15:46:06Z

My idea was to enclose the <w> tags where the overlay occurs, so that it can be handled in a separate component.

<w id="1">this</w>
<w id="2">is</w>
<w id="3">some text</w>
...
<app from="1" to="2"/>
<app from="2" to="3"/>

Then we could simply enclose it in an <evt-depa> tag and send the whole thing to a visualization component (ApparatusDepa):

<evt-depa>
  <w id="1">this</w>
  <evt-depa>
    <w id="2">is</w>
    <w id="3">some text</w>
  </evt-depa>
</evt-depa>

and the component would look up where the overlay happens and handle it. For multiple overlays, my idea was to nest an <evt-depa> within an <evt-depa>.

But if there are apparatuses that overlay sequentially, the virtual tag solution doesn't work. For example, suppose we have this situation:

<w id="1">this</w>
<w id="2">is</w>
<w id="3">some text</w>
<w id="4">test</w>
...
<app from="1" to="2"/>
<app from="2" to="3"/>
<app from="3" to="4"/>

My idea was to nest two <evt-depa> tags, but it doesn't work:

<evt-depa>
  <w id="1">this</w>
  <evt-depa> <-- this is the starting tag of the second app!? (from="2" to="3")
  <w id="2">is</w>
</evt-depa> <-- this is the closing tag of the first app!? (from="1" to="2")
  <evt-depa>
  <w id="3">some text</w>
  </evt-depa> <-- this is the closing tag of the second app!? (from="2" to="3")
  <w id="4">test</w>
 </evt-depa> <-- this is the closing tag of the third app!? (from="3" to="4")
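
The confusion in the markup above comes from ranges that cross: they share a word, but neither contains the other, so no set of properly nested wrapper tags can represent each of them exactly. Here is a minimal sketch of a check for that condition. `AppRange` and `findCrossingPairs` are hypothetical names, and it assumes `from` and `to` point at <w> ids and that both ends are inclusive.

```typescript
// Hypothetical model of a DEPA <app/>: it covers the words from `from` to `to`, inclusive.
interface AppRange {
  from: string;
  to: string;
}

// Return the index pairs of apps whose ranges cross: they share at least one
// word, but neither range fully contains the other.
function findCrossingPairs(wordIds: string[], apps: AppRange[]): Array<[number, number]> {
  // Map each word id to its position in document order.
  const position = new Map<string, number>();
  wordIds.forEach((id, index) => position.set(id, index));

  const spans = apps.map((app) => {
    const start = position.get(app.from);
    const end = position.get(app.to);
    if (start === undefined || end === undefined || start > end) {
      throw new Error(`Invalid range: from=${app.from} to=${app.to}`);
    }
    return { start, end };
  });

  const crossing: Array<[number, number]> = [];
  for (let i = 0; i < spans.length; i++) {
    for (let j = i + 1; j < spans.length; j++) {
      const a = spans[i];
      const b = spans[j];
      const overlap = a.start <= b.end && b.start <= a.end;
      const aContainsB = a.start <= b.start && b.end <= a.end;
      const bContainsA = b.start <= a.start && a.end <= b.end;
      if (overlap && !aContainsB && !bContainsA) {
        crossing.push([i, j]);
      }
    }
  }
  return crossing;
}

// The three sequentially overlaying apps from the example above.
const crossingPairs = findCrossingPairs(['1', '2', '3', '4'], [
  { from: '1', to: '2' },
  { from: '2', to: '3' },
  { from: '3', to: '4' },
]);
console.log(crossingPairs); // [[0, 1], [1, 2]]
```

The first and second app cross at word 2, and the second and third cross at word 3. Under this definition the two-app case crosses as well; the single <evt-depa> wrapper gets away with it only because the component itself looks up where the overlay happens.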

RenatoCaenaro · 2024-11-04T20:22:56Z

A new parser strategy will be committed at the beginning of 2025 to solve this problem.

RenatoCaenaro mentioned this issue Nov 4, 2024

Non-sequential parsing of TEI elements #234

Open
