linebreaks in json conversion #668

Open
jameshowison opened this issue Sep 14, 2020 · 10 comments
@jameshowison
Contributor

We pushed the chunks over to tagworks and they pointed out that they aren't very readable, due to line breaks (or rather the lack thereof).

What was done before to fix this was to add a linebreak after these fields:

'{http://www.tei-c.org/ns/1.0}head',
'{http://www.tei-c.org/ns/1.0}div',
'{http://www.tei-c.org/ns/1.0}p',
'{http://www.tei-c.org/ns/1.0}figDesc',
'{http://www.tei-c.org/ns/1.0}note',

and ensure a space around the ref tag.
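A minimal sketch of that earlier fix, assuming the TEI is parsed with Python's standard `xml.etree.ElementTree`. The function name `tei_to_readable_text` is hypothetical and this is not the code from the repository; the space handling around `ref` is deliberately naive (it adds a space whenever one is missing, even before punctuation).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Block-level elements that should be followed by a line break.
BLOCK_TAGS = {f"{{{TEI_NS}}}{name}" for name in ("head", "div", "p", "figDesc", "note")}
REF_TAG = f"{{{TEI_NS}}}ref"


def tei_to_readable_text(element):
    """Flatten a TEI element to plain text with line breaks after block elements."""
    parts = []

    def walk(node):
        # Ensure a space before <ref> if the text so far does not end with whitespace.
        if node.tag == REF_TAG and parts and not parts[-1][-1].isspace():
            parts.append(" ")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            tail = child.tail or ""
            # Ensure a space after <ref> if the following text does not start with one.
            if child.tag == REF_TAG and tail and not tail[0].isspace():
                parts.append(" ")
            if tail:
                parts.append(tail)
        if node.tag in BLOCK_TAGS:
            parts.append("\n")

    walk(element)
    return "".join(parts)
```

Note that this changes the character content relative to the XML, so any offsets computed on the output would not line up with the source without a correction step.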

@kermitt2 do you think we could add that to the json conversion? If we have the last sentence of a paragraph as the first sentence of a chunk then that'll look a bit strange (as it will be a single sentence, then a break, then a new sentence). Seems like this could be done in https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/corpus2JSON.py ? or in https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/TEI2LossyJSON.py ? But I'm not sure how that would affect enrichJSON.py

Pretty good example of where json can't encode (or at least doesn't naturally lend itself to encoding) semantics that are relevant for people reading the text. I appreciate the note at the top of https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/TEI2LossyJSON.py !

@jameshowison
Contributor Author

For readability, perhaps two linebreaks are needed rather than one (to create a visual distinction between elements)?

@kermitt2
Member

Adding line breaks in the XML is indeed a good thing, because the idea of XML is to be readable by humans too!

However, adding space around the ref modifies the paragraph content, and normally we don't do that for inline elements. I am not sure about the impact, if it's only for visualization, no problem, but the offset for user annotations would need to be corrected then I guess?

I observed some <ref> and <rs> with an apparently missing space before or after. I should double-check that a space is not being lost in the process, but it is likely that way in the PDF.

For JSON, we keep track of the section (with the section title), but indeed the "paragraph break" is lost in the sentence segmented format. I guess there are two options:

  • I add a field to keep track of the paragraph id or rank, so that if the id changes we know that we have a new paragraph and we can "render" the json appropriately (the rendering is left to the GUI application)
  • I add empty sentences "text":"" (2 for two linebreaks) so the actual sentence text is not touched?
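The two options can be sketched side by side as follows. This is only an illustration under assumptions: the input shape (a list of paragraphs, each a list of sentence strings), the function names, and the field name `paragraph_rank` are all hypothetical here.

```python
def flatten_with_ranks(paragraphs):
    """Option 1: tag every sentence with the rank of the paragraph it came from."""
    return [
        {"text": sentence, "paragraph_rank": rank}
        for rank, sentences in enumerate(paragraphs)
        for sentence in sentences
    ]


def flatten_with_break_markers(paragraphs, n_breaks=2):
    """Option 2: separate paragraphs with empty-text items, leaving sentences untouched."""
    items = []
    for index, sentences in enumerate(paragraphs):
        if index > 0:
            items.extend({"text": ""} for _ in range(n_breaks))
        items.extend({"text": sentence} for sentence in sentences)
    return items
```

In both cases the sentence text itself is unchanged, so offsets inside a sentence stay valid; the difference is whether the consumer infers the break from a changed rank or from the extra items.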

@jameshowison
Contributor Author

Hmmm. What we are trying to do here is to have readable text-only text, but to maintain the connection to the source material, so we can align metadata using character offsets. It's definitely trickier than it looks. I'm thinking of this as a "reversible decorator" which can change the text display, but which we can remove, adjusting offsets when doing so.
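One way to picture the "reversible decorator" idea is a list of recorded insertions that can be applied for display and then used to map offsets back. The sketch below is an assumption about how that could work, not existing project code; `decorate` and `to_original_offset` are hypothetical names.

```python
def decorate(text, insertions):
    """Insert strings into text for display.

    insertions is a list of (original_position, inserted_string) pairs,
    sorted by original_position.
    """
    out, last = [], 0
    for pos, extra in insertions:
        out.append(text[last:pos])
        out.append(extra)
        last = pos
    out.append(text[last:])
    return "".join(out)


def to_original_offset(decorated_offset, insertions):
    """Map an offset in the decorated text back to the undecorated text."""
    shift = 0  # total length of insertions that lie before the offset
    for pos, extra in insertions:
        start = pos + shift  # where this insertion begins in the decorated text
        if decorated_offset <= start:
            break
        if decorated_offset < start + len(extra):
            return pos  # offset falls inside inserted text: snap to the insertion point
        shift += len(extra)
    return decorated_offset - shift
```

For example, with `text = "see(Fig. 3c)now"` and `insertions = [(3, " "), (12, " ")]`, the decorated text is `"see (Fig. 3c) now"`, and the decorated span 4 to 13 maps back to the original span 3 to 12.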

And we have to decide when to make these changes. We can only make them when we know the markup, but we drop markup (lossy json indeed!).

Inline markup: Could we check for missing spaces around <ref> (etc) as we create the json? As long as the enrich step respects those changes to offsets, will that be ok? I'm not entirely sure of the path back from the offset json to the XML (i.e. I know I can map back from the offsets against the chunks received from tagworks to the entity_span against the sentence level json, but can we map back from the json to XML?).

Section markup: Paragraph and section. I see what you mean about the section. My rendering code can keep track of that and inject it into the chunks when the section changes. If you include a paragraph id in each json object, I could handle that the same way.

I'm going to see if I can get the section headers injected, and that code should be reusable with the paragraph ids.

@jameshowison
Contributor Author

jameshowison commented Sep 16, 2020

Ok, I played around a bit with that and at least for the section titles, the way my chunking code is set up makes injecting the section difficult.

@kermitt2 could you take a look at adding a text item into the json list for each section and paragraph break? The section would have the text value of the section plus \n\n. The paragraph text would just be \n\n. I'm not sure if we ever have entities in section headers, but it's possible, I guess, so this would allow for that.

But mostly it is just easier because the offsets don't have to be adjusted within sentences and the chunking code will continue to work.

@kermitt2
Member

I made a first update of the JSON files introducing section and paragraph ranks. When the section rank increases, we move to a new section; when the paragraph rank increases, we have a new paragraph. Normally, looking at these values makes it possible to render paragraph and section breaks as wished, the advantage being that we don't touch the text and we don't change the JSON structure.

For example, new paragraph:

{
            "text": "As proof-of-concept, we demonstrate that our methodology can be applied towards designing effective TALEs for any given DNA target (i.e. the yeast-one-hybrid experiments) as well as for genome-wide phenotype screens in yeast.",
            "section": "Discussion",
            "paragraph_rank": 19,
            "section_rank": 5
        },
        {
            "text": "We show experimentally that the standard TALE design algorithm can fail to predict RVDs that bind to a desired target. ",
            "section": "Discussion",
            "paragraph_rank": 20,
            "section_rank": 5
        }

... and new section:

{
            "text": "Again, both clones were able to effectively induce the expression of PDR3 and PDR5 (Fig. 3c).",
            "section": "Results",
            "paragraph_rank": 18,
            "section_rank": 4,
            "ref_spans": [
                {
                    "start": 83,
                    "end": 91,
                    "type": "figure",
                    "ref_id": "fig_2",
                    "text": "(Fig. 3c"
                }
            ]
        },
        {
            "text": "Here, we establish a methodology for the assembly of complete and biased TALE-based libraries. ",
            "section": "Discussion",
            "paragraph_rank": 19,
            "section_rank": 5
        }
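A consumer could render breaks from these ranks roughly as follows. This is a sketch under assumptions: `render_body` is a hypothetical name, the input is taken to be a list of sentence objects shaped like the examples above, and trailing whitespace is stripped only in the rendered output, not in the stored text.

```python
def render_body(items):
    """Join sentence objects into readable text, breaking where the ranks change."""
    pieces = []
    prev_section = prev_paragraph = None
    for item in items:
        section = item.get("section_rank")
        paragraph = item.get("paragraph_rank")
        if section != prev_section:
            # New section: blank line, section title, blank line.
            sep = "\n\n" + item.get("section", "") + "\n\n"
        elif paragraph != prev_paragraph:
            # New paragraph within the same section.
            sep = "\n\n"
        else:
            # Same paragraph: sentences are separated by a single space.
            sep = " "
        pieces.append(sep + item["text"].strip())
        prev_section, prev_paragraph = section, paragraph
    return "".join(pieces).lstrip()
```

Because the breaks exist only in the rendered string, the `text` fields and any spans recorded against them are left as they were.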

The change you propose would require changing the current design of the JSON, because we would need to type the JSON elements in the body_text array (as "section" or as a paragraph's "sentence"). For the paragraph break, I don't really see how to do it. I think if we really need this (if it's not possible to use the paragraph/section ranks), I would rather simply append one \n to the text at the end of a paragraph and \n\n to the text at the end of a section. It's not beautiful but it does not impact the offsets (it's just an "append"). For reinjecting the annotations back into the XML (the "back from the json to XML" path, which is one of the goals), I would simply ignore the \n.

What do you think?

@jameshowison
Contributor Author

jameshowison commented Sep 18, 2020 via email

@kermitt2
Member

If we can avoid adding extra \n in the text field, I think this is very good. We will limit the risks of future problems when managing the new annotation offsets and when aligning back to the XML (none of these problems would be blocking I am sure, but that could be time consuming to get them right).

I've regenerated all the JSON files after updating the TEI2LossyJSON.py converter as you suggested. We now have the section titles as extra elements in the body_text array, with their own text field and their own annotation spans. We can distinguish the section titles from the paragraphs by the fact that the section title objects have no paragraph rank (they just have a section rank). The section field is also still present in the sentence/paragraph objects.

Example:

        {
            "text": "DISCUSSION",
            "section_rank": 10
        },
        {
            "text": "The prevalence of asthma has increased worldwide and this has been most strikingly observed in the industrialized countries during the last decade. ",
            "section": "DISCUSSION",
            "paragraph_rank": 12,
            "section_rank": 10
        },
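The rule for telling titles from sentences can be written as a small predicate. The sketch below assumes items shaped like the example above; the function names are hypothetical, and the "suspect" check is only a heuristic based on the observation that title objects in the example carry no `section` field.

```python
def is_section_title(item):
    """A section title has a section rank but no paragraph rank."""
    return "section_rank" in item and "paragraph_rank" not in item


def find_suspect_items(body_text):
    """Return indexes of items that look like titles but still carry a 'section' field.

    Heuristic only: such items may be sentences whose paragraph_rank is missing.
    """
    return [
        index
        for index, item in enumerate(body_text)
        if is_section_title(item) and "section" in item
    ]
```

Running the second function over a file is one quick way to spot sentence objects that lost their paragraph rank somewhere in the pipeline.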

@jameshowison
Contributor Author

Cool. I'm working with that now. Any idea why there isn't a paragraph_rank key for quite a few texts here:

data/json_goldstandard/10.1002%2Fpam.22030.json:754

@jameshowison
Contributor Author

This is looking good now, with chunk text all good. Offsets stayed functional as I only ever added linebreaks at the end of a text (for both section headers and paragraph ends). The map.csv has those flags and shows the text that was added in the textmodify column. When we get offsets back from TagWorks we will have to figure out how to adjust those, undoing the formatting. But I think that won't be too hard.
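Undoing the formatting might look something like the sketch below. It assumes a map with one row per text item that records where the item starts in the chunk and what was appended; only the `textmodify` column name comes from the description above, and the other keys and both function names are hypothetical.

```python
def build_chunk(items, modifiers):
    """Concatenate item texts, appending modifiers[i] after item i.

    Returns the chunk text and one map row per item.
    """
    pieces, rows, cursor = [], [], 0
    for index, (item, extra) in enumerate(zip(items, modifiers)):
        text = item["text"]
        rows.append(
            {"item": index, "chunk_start": cursor, "length": len(text), "textmodify": extra}
        )
        pieces.append(text + extra)
        cursor += len(text) + len(extra)
    return "".join(pieces), rows


def chunk_span_to_item_span(start, end, rows):
    """Map a [start, end) span in the chunk to (item index, start, end) in that item's text."""
    for row in rows:
        lo = row["chunk_start"]
        hi = lo + row["length"]
        if lo <= start and end <= hi:
            return row["item"], start - lo, end - lo
    return None  # the span crosses an item boundary or touches appended text
```

Since the linebreaks are only ever appended, offsets inside an item need nothing more than subtracting that item's start position in the chunk.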

kermitt2 added a commit that referenced this issue Sep 22, 2020
kermitt2 added a commit that referenced this issue Sep 22, 2020
@kermitt2
Member

Any idea why there isn't a paragraph_rank key for quite a few texts here:
data/json_goldstandard/10.1002%2Fpam.22030.json:754

Thanks, I found an error in my code when merging the JSON with the TEI corpus. I've just committed the fix and updated the impacted JSON files. It should be good now:

        {
            "text": "In fact, research on the effects of CA-PFL on mothers finds that the largest effects on leave-taking are concentrated among the least advantaged mothers (Rossin-Slater, Ruhm, & Waldfogel, 2013).",
            "section": "RELATED LITERATURE AND HYPOTHESES",
            "paragraph_rank": 13,
            "section_rank": 2,
            "ref_spans": [
                {
                    "start": 169,
                    "end": 193,
                    "type": "bibr",
                    "ref_id": "b56",
                    "text": "Ruhm, & Waldfogel, 2013)"
                }
            ]
        },
        {
            "text": "DATA",
            "section_rank": 3
        },
        {
            "text": "We use data from the 2000 Census and the 2000 to 2013 waves of the ACS to estimate the effects of CA-PFL on fathers' leave-taking. ",
            "section": "DATA",
            "paragraph_rank": 14,
            "section_rank": 3
        },
        {
            "text": "The ACS is conducted throughout the year and samples 1 percent of the population in most years; thus, it has the major advantage of providing the large samples needed to examine leave-taking behavior among fathers. ",
            "section": "DATA",
            "paragraph_rank": 14,
            "section_rank": 3
        },
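A quick consistency check on regenerated files is to confirm that each span's offsets reproduce its recorded text. The sketch below assumes end-exclusive offsets, which is consistent with the `ref_spans` example above; `check_ref_spans` is a hypothetical helper, not part of the converter.

```python
def check_ref_spans(item):
    """Return the ref_spans whose offsets do not reproduce their recorded text."""
    text = item["text"]
    return [
        span
        for span in item.get("ref_spans", [])
        if text[span["start"]:span["end"]] != span["text"]
    ]
```

An empty list means every span in the item lines up with its sentence; items without `ref_spans` trivially pass.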
