What's the citation model for DTS Resource ? #101

Closed
PonteIneptique opened this issue May 17, 2018 · 23 comments
Labels
Collection Endpoint Issues that deal with the Collection Endpoint

Comments

@PonteIneptique
Member

PonteIneptique commented May 17, 2018

In our earlier discussion, we talked about including tei:refsDecl in the metadata of a Resource in the Collection API.
My examples covered it roughly this way:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "tei:refsDecl": [
        {
            "tei:matchPattern":  "(\\w+)",
            "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])",
            "@type": "poem"
        },
        {
            "tei:matchPattern":  "(\\w+)\\.(\\w+)",
            "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']//tei:l[@n='$2'])",
            "@type": "line"
        }
    ]
}

The resulting object would have these properties:

  • it can be cited by "poem" or by "line" (because of the type)
  • passages should always match one of the matchPattern values
  • the replacementPattern should lead to the beginning (in the case of milestones?) or to the container(s) of the element

I am opening this issue because we went over this very quickly in our talks and agreed on it only in general terms.

I have the following questions:

  1. Should we make the level of citation explicit? For example, the second citation is at depth 2 (lines within a poem). The match pattern alone cannot capture this. In the CapiTainS draft guidelines, I used the attribute tei:corresp for that, but maybe we should use something like dts:depth?
  2. Should we make tei:replacementPattern optional? Or is there anything in there that we feel goes too far? (Note that at least the citation structure is important for CTS compatibility.)
  3. I also think we should move @type to tei:type in the examples.
  4. We could also make things more complicated (but more straightforward) and allow people to build a "graph" of the citation system (property names were chosen to be expressive for the example):
[{
   "dts:citation_id": "1",
   "tei:matchPattern":  "(\\w+)",
   "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])",
   "@type": "poem"
},
{
   "dts:citation_id": "2",
   "dts:citation_parent": "1",
   "tei:matchPattern":  "(\\w+)\\.(\\w+)",
   "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']//tei:l[@n='$2'])",
   "@type": "line"
}]

Note that while some of this might seem like too much, each point is a partial response to a real problem a parser faces in understanding the structure of the text.

@PonteIneptique added the Implementation Detail and Collection Endpoint labels May 17, 2018
@PonteIneptique
Member Author

PonteIneptique commented May 17, 2018

I'd actually add that I would recommend moving the namespace to "http://www.tei-c.org/ns/1.0#" instead of "http://www.tei-c.org/ns/1.0", otherwise prefix expansion produces "http://www.tei-c.org/ns/1.0matchPattern", and that's awful.

Or "http://www.tei-c.org/ns/1.0/" btw. Both are good to me.
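
A minimal sketch of why the trailing character matters, assuming the prefix is expanded by plain string concatenation (as JSON-LD does for compact IRIs). The helper name `expand` is illustrative only.

```python
def expand(prefix_iri: str, local_name: str) -> str:
    """Expand a prefixed name by concatenating the prefix IRI and the local name."""
    return prefix_iri + local_name


# The three namespace IRIs mentioned in this comment.
CANDIDATES = [
    "http://www.tei-c.org/ns/1.0",
    "http://www.tei-c.org/ns/1.0#",
    "http://www.tei-c.org/ns/1.0/",
]

for namespace in CANDIDATES:
    # Only the first candidate glues "1.0" and "matchPattern" together.
    print(expand(namespace, "matchPattern"))
```

With "#" or "/" at the end, the expanded IRI keeps the property name separate from the version number.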

PonteIneptique added a commit to Capitains/MyCapytain that referenced this issue May 17, 2018
@balmas
Contributor

balmas commented May 17, 2018

  1. I think it might be good to be explicit about level, but the @corresp attribute doesn't really seem appropriate for that to me. Using dts:level might be better, if it validates. But I think we could also infer this from the location in the refsDecl list. Either way would probably be ok.

  2. If tei:replacementPattern is optional then the matchPattern seems a little meaningless to me.

  3. agree (assuming it validates)

  4. I don't understand where this would be declared.

@PonteIneptique
Member Author

I updated number 4 so that it's clearer.

My answer here is mostly aimed at your point 1: we actually can't infer it, because people may have a complex citation tree rather than a citation "line". Most CTS texts have a wonderful "book->poem->line" structure, but what about

  • book
    • poem
      • stanza
        • line
    • paragraph
      • segment

Here, the CTS model would most probably fail. You can have different match patterns (let's say poems are numbered while paragraphs match [a-zA-Z]+ as a regexp), and you would still not be able to infer the level of the citation. We could with a wonderful CTS "DTSized" object, because the dot . means hierarchy, but for any other text with a complex tree we would be powerless to understand the relationship between citation nodes.
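
A rough sketch of the point, assuming the tree above is declared as a flat list with explicit parent links. The property names `id` and `parent` are hypothetical and not part of any agreed DTS vocabulary. It shows that the position in the list does not give the depth, while the parent links do.

```python
# Hypothetical flat declaration of the book/poem/stanza/line and
# book/paragraph/segment tree; "id" and "parent" are illustrative names.
CITATIONS = [
    {"id": "book", "parent": None},
    {"id": "poem", "parent": "book"},
    {"id": "stanza", "parent": "poem"},
    {"id": "line", "parent": "stanza"},
    {"id": "paragraph", "parent": "book"},
    {"id": "segment", "parent": "paragraph"},
]


def citation_depths(citations):
    """Return the depth of every citation node, counting the root as 1."""
    parents = {c["id"]: c["parent"] for c in citations}

    def depth(node_id):
        parent = parents[node_id]
        return 1 if parent is None else 1 + depth(parent)

    return {node_id: depth(node_id) for node_id in parents}


DEPTHS = citation_depths(CITATIONS)
for position, citation in enumerate(CITATIONS, start=1):
    # "paragraph" sits at list position 5 but at depth 2.
    print(position, citation["id"], DEPTHS[citation["id"]])
```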

@emmamorlock

emmamorlock commented May 18, 2018

My 2 cents:

  • Don't you think a "paragraph" could contain anything and not just [a-zA-Z]+?
  • in question 4, isn't the example less a graph than a straightforward hierarchy (declared via "dts:citation_parent")?
  • what I have is:
    • div
      - ab with mixed content and two types of milestones:
      - lb (with @n)
      - milestones (with @unit and @corresp)
  • NB: the @corresp is essential to establish a relation with the corresponding abstract textual units that are declared in msContents/msItem...

@PonteIneptique
Member Author

Quick answers:

  • That was only an example to show that line identifiers could be numbers while paragraph identifiers could be letters. Just showing that we might have this kind of diversity.
  • Technically, a hierarchy is a graph, but I don't think that is the question. Yes, I definitely gave a simple example in point 4, but #101 (comment) shows that we might have more complex ones.
  • Noted. Unfortunately, I have not seen any attribute in TEI that could cover the depth or type of a citation scheme, and this is also an issue for the future CapiTainS guidelines.

@hcayless
Contributor

I'm confused about what this is meant to achieve (possibly I just haven't had enough coffee yet). Canonical References in TEI allow you to construct a custom URI referencing system, which is fine and good. But I'm missing the point of them here. Shouldn't the Reference API just tell you what sorts of references you can have? Why should the Collection API bother telling you how they're constructed?

@balmas
Contributor

balmas commented May 18, 2018

And @hcayless's comment makes me realize I misunderstood the point of this issue. I thought we were talking about the TEI refsDecl structure ... I clearly had either not enough or too much coffee myself at that point :-)

To respond to Hugh's point, I could see it being useful for the DTS API to make this information available, for purposes of a chain of provenance or reproducibility.

To reframe my answers to the above in the correct context:

  1. dts:depth makes the most sense to me here, in the context of the DTS API.
  2. if I am correct that the point of this is for reproducibility, then I think replacementPattern should be present.
  3. Does using tei:type make too many assumptions about the textual markup? What if the citation doesn't correspond to something that was identified that way?
  4. The graph approach is tempting, but I'm a little worried it would increase the complexity of implementation

@PonteIneptique
Member Author

The issue with the Reference API is that it throws references at you. But, for example, one of the very common things I do with CTS APIs is: Retrieve Text Metadata -> Retrieve all References at the Deepest Level (thanks to the Text Metadata) -> Retrieve passages based on those results.

Right now, our system cannot support this kind of workflow because we have nowhere to state how the references of the text are structured.
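
The three-step workflow could look roughly like the sketch below. Everything in it is a stand-in: the `citation_levels` metadata field, the three helper functions, and the placeholder passages are hypothetical, and a real client would call the API instead of reading from dictionaries.

```python
# Stand-in data: a hypothetical "citation_levels" metadata field and
# placeholder passages keyed by their reference.
METADATA = {"citation_levels": ["poem", "line"]}
PASSAGES = {
    "1.1": "<text of 1.1>",
    "1.2": "<text of 1.2>",
    "2.1": "<text of 2.1>",
}


def get_metadata():
    """Step 1: retrieve the text metadata (here, a local dictionary)."""
    return METADATA


def get_references(level):
    """Step 2: list references at a given depth (dots separate levels here)."""
    return [ref for ref in PASSAGES if ref.count(".") + 1 == level]


def get_passage(ref):
    """Step 3: retrieve one passage by reference."""
    return PASSAGES[ref]


def fetch_deepest_passages():
    """Chain the three steps; the deepest level comes from the metadata."""
    deepest = len(get_metadata()["citation_levels"])
    return {ref: get_passage(ref) for ref in get_references(deepest)}


print(fetch_deepest_passages())
```

Step 2 is only possible because the metadata says how many levels exist, which is the information this issue asks for.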

@hcayless
Contributor

Ok. I see the point of that use case, but I don't see yet how having the Collection API give you TEI cRefPatterns helps. Maybe I'm being dense. I see the problem, but I don't see how this is a solution.

Wouldn't it be better to come up with some declarative representation of the available levels and how citations to them are constructed? Put another way, I see the point of the matchPattern, but not the replacementPattern. As a client, I don't care how you're getting the chunk of text I want, and I wouldn't care unless I wanted to grab the document and do it myself.

What about something like:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "dts:citeStructure": [
        {
            "dts:citePattern":  "(\\w+)",
            "dts:level": 1,
            "label": "poem"
        },
        {
            "dts:citePattern":  "(\\w+)\\.(\\w+)",
            "dts:level": 2,
            "label": "line"
        }
    ]
}

Seems like IRI templates might be better for this than regex patterns though...
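
A small sketch of how a client might use such a declaration, assuming each `dts:citePattern` is a regular expression that must match the whole reference. The function name `describe` is illustrative only.

```python
import re

# The citeStructure from the example above, as Python data.
CITE_STRUCTURE = [
    {"dts:citePattern": r"(\w+)", "dts:level": 1, "label": "poem"},
    {"dts:citePattern": r"(\w+)\.(\w+)", "dts:level": 2, "label": "line"},
]


def describe(reference):
    """Return (level, label) for the first pattern matching the whole reference."""
    for entry in CITE_STRUCTURE:
        if re.fullmatch(entry["dts:citePattern"], reference):
            return entry["dts:level"], entry["label"]
    return None  # the reference fits none of the declared patterns


for ref in ("1", "1.2", "1.2.3"):
    print(ref, describe(ref))
```

The client learns the level and label of a reference without knowing anything about how the server extracts the passage.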

@PonteIneptique
Member Author

PonteIneptique commented May 18, 2018 via email

@hcayless
Contributor

I probably failed to properly think through the implications when we talked about it, but now I think it's better to just tell the client how citations are constructed than to give it implementation details it can't really use.

@PonteIneptique
Member Author

I am not completely certain about the use of match pattern and replacement pattern (whatever the namespace or implementation is). On the other hand, having information about the "citation graph" structure, and metadata about it, seems important to me as well :) I think we have an agreement here, right?

@jonathanrobie
Contributor

jonathanrobie commented Jun 7, 2018

How about a URI template along these lines:

 {
  "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='{&n}'])"
 }

@PonteIneptique
Member Author

PonteIneptique commented Jun 7, 2018

Recursion and graph description of the citation scheme:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "dts:citeStructure": [
        {
            "label": "poem",
            "dts:citeStructure": [
                {
                    "label": "line"
                }
            ]
        }
    ]
}
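
A minimal sketch of how a client could walk this nested form, assuming each node carries a `label` and an optional nested `dts:citeStructure`. The function name `cite_paths` is illustrative only.

```python
# The nested declaration from the example above, without the elided fields.
RESOURCE = {
    "@id": "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "dts:citeStructure": [
        {"label": "poem", "dts:citeStructure": [{"label": "line"}]},
    ],
}


def cite_paths(structures, prefix=()):
    """Yield the label path of every citation node, parents before children."""
    for node in structures:
        path = prefix + (node["label"],)
        yield path
        # Recurse into the children, if any are declared.
        yield from cite_paths(node.get("dts:citeStructure", []), path)


PATHS = list(cite_paths(RESOURCE["dts:citeStructure"]))
print(PATHS)
print("maximum depth:", max(len(path) for path in PATHS))
```

The depth of a node is simply the length of its path, so no separate level property is needed in this form.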

@jonathanrobie
Contributor

jonathanrobie commented Jun 7, 2018

I need to understand the requirements and use case better. If I am in a client, what is the sequence of steps I am taking when I encounter this data, and what do I want to do with it? I assume we have to be able to handle any kind of reference the same way, supporting CTS and other references that may be quite different.

Are you looking for a way to describe the citation structure for a given resource? What do you want the client to do with it?

A set of use cases written down in this issue would be helpful.

@PonteIneptique
Member Author

    "dts:citeStructure": [
        {
            "dts:level": 1,
            "label": ["poem", "section"]
        },
        {
            "dts:level": 2,
            "label": ["line", "paragraph"]
        }
    ]

@PonteIneptique
Member Author

Three simple use cases :

  • As a presenting app, I want to be able to make general decisions about how the text should be shown to the client depending on its structure, i.e. if a text is book-poem|chapter-line|paragraph, I want to show the text by poem|chapter, so at level 2.
  • As a collection curator, I want to be able to specify the structure of my text (which is just another piece of metadata).
  • As a corpus researcher, I want to be able to know where cuts in my narratives may occur, i.e. where co-occurrence of words is irrelevant across passage boundaries (the last word of poem 1 is not a relevant co-occurrence of the first word of poem 2).
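
The third use case can be sketched as follows, assuming passages have already been retrieved at the chosen citation level and tokenised. The function name and the toy tokens are placeholders, not real data.

```python
def adjacent_pairs(passages):
    """Collect pairs of neighbouring words, never pairing across two passages."""
    pairs = []
    for words in passages:
        # zip stops at the end of each passage, so boundaries are respected.
        pairs.extend(zip(words, words[1:]))
    return pairs


# Toy tokens standing in for two poems.
POEMS = [["a", "b", "c"], ["d", "e"]]
print(adjacent_pairs(POEMS))
```

The pair made of the last word of the first poem and the first word of the second never appears, which is only possible when the client knows where the passage boundaries are.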

@mromanello

mromanello commented Jun 8, 2018

I'd like to add a further use case, coming from a citation matching perspective, which derives directly from what I'm doing with the CTS API via CapiTainS resolvers to build HuCit, a knowledge base of classical texts and citable text units.

  • As a citation matching system, I want to retrieve information about text structures from a DTS collection. Knowing how many hierarchical levels a given text has, and what they are, is useful information that can be exploited when resolving ambiguous references.

I give a concrete example of this use case on p. 108 of my PhD dissertation:

[Image: screenshot dated 2018-06-08]

@hcayless
Contributor

I still have some misgivings about this. The example I mentioned in our last meeting was Ovid's Tristia, where you have a general structure of book, poem, line, but Book 2 is a single, almost 600-line poem. You'll note if you go to Book 2 in Perseus, that it doesn't bother to chunk it the way it does (e.g.) the Aeneid Book 1 (despite their similar length).

I understand wanting to tell a client what the levels are, but I'd want to be able to do that in a useful way. As an API client, if I were deciding how to chunk things, I could certainly do it on Book / poem for most of the Tristia, but I'd want to (maybe) do it on Book / 20-30 lines for Book 2.
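
A rough sketch of the kind of client-side chunking meant here, assuming the client knows how many lines a poem has. The function name, the default chunk size of 25, and the line count in the example are placeholders, not figures taken from any edition.

```python
def chunk_lines(line_count, chunk_size=25):
    """Split lines 1..line_count into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size - 1, line_count))
        for start in range(1, line_count + 1, chunk_size)
    ]


# A made-up 60-line poem cut into chunks of at most 25 lines.
print(chunk_lines(60))
```

A client could fall back to this for a single very long poem and keep poem-level display everywhere else, which is why a list of levels alone may not be enough to decide.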

@PonteIneptique
Member Author

PonteIneptique commented Jun 21, 2018

This is becoming more and more complicated, right? :)
One option would be to allow several schemes to be listed:

    "dts:citeStructure": [
        {"@value": ["book", "poem", "line"]},
        {"@value": ["book", "line"]}
    ]

But it would definitely start to make things complicated if you have more than, say, 3 or 4 different schemes. Again, if we want to have full details, maybe this would be up to the Navigation endpoint?

@PonteIneptique
Member Author

The option I back for next week is #101 (comment)

@PonteIneptique
Member Author

Action item: open a pull request based on the comment above, with citeDepth added on top of it?

@PonteIneptique
Member Author

Fixed in #104
