linebreaks in json conversion #668
Comments
For readability, perhaps two linebreaks are needed rather than one (to create a visual distinction between elements?).
Adding line breaks in the XML is indeed a good thing, because the idea of XML is to be readable by humans too! However, adding spaces around the ref modifies the paragraph content, and normally we don't do that for inline elements. I am not sure about the impact: if it's only for visualization, no problem, but the offsets for user annotations would need to be corrected then, I guess? I observed some

For JSON, we keep track of the section (with the section title), but indeed the "paragraph break" is lost in the sentence-segmented format. I guess there are two options:
Hmmm. What we are trying to do here is to have readable, text-only text, but to maintain the connection to the source material, so we can align metadata using character offsets. It's definitely trickier than it looks. I'm thinking of this as a "reversible decorator" which can change the text display, but which we can remove, adjusting offsets when doing so. And we have to decide when to make these changes. We can only make them when we know the markup, but we drop markup (lossy json indeed!).

Inline markup: could we check for missing spaces around the ref (etc.) as we create the json? As long as the enrich step respects those changes to offsets, that will be OK? I'm not entirely sure of the path back from the offset json to the XML (i.e. I know I can map back from the offsets against the chunks received from tagworks to the entity_span against the sentence-level json, but can we map back from the json to the XML?).

Section markup: paragraph and section. I see what you mean about the section. My rendering code can keep track of that and inject it into the chunks when the section changes. If you include a paragraph id in each json object, I could handle that the same way. I'm going to see if I can get the section headers injected, and that code should be reusable with the paragraph ids.
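(A minimal sketch of the "reversible decorator" idea, assuming plain character offsets into the sentence text; the helper names and the sample sentence are hypothetical, not part of the softcite-dataset code.)

```python
# Sketch only: record every display-oriented insertion as (offset, string) so
# it can be undone and annotation offsets corrected afterwards.

def decorate(text, insertions):
    """Return display text with the given (original_offset, inserted_str) applied."""
    out, prev = [], 0
    for pos, ins in sorted(insertions):
        out.append(text[prev:pos])
        out.append(ins)
        prev = pos
    out.append(text[prev:])
    return "".join(out)

def to_original_offset(display_offset, insertions):
    """Map an offset in the decorated text back to the undecorated text."""
    shift = 0
    for pos, ins in sorted(insertions):
        if pos + shift + len(ins) <= display_offset:
            shift += len(ins)
        else:
            break
    return display_offset - shift

text = "Both clones induced the expression of PDR5(Fig. 3c)."
decorated = decorate(text, [(42, " ")])   # add the missing space before "(Fig. 3c)"
assert decorated == "Both clones induced the expression of PDR5 (Fig. 3c)."
assert to_original_offset(decorated.index("(Fig."), [(42, " ")]) == text.index("(Fig.")
```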
Ok, I played around a bit with that and, at least for the section titles, the way my chunking code is set up makes injecting the section difficult. @kermitt2 could you take a look at adding a

But mostly it is just easier because the offsets don't have to be adjusted within sentences, and the chunking code will continue to work.
I made a first update of the JSON files introducing section and paragraph ranks. When the section rank increases, we move to a new section; when the paragraph rank increases, we have a new paragraph. Normally, looking at these values makes it possible to render paragraph and section breaks as wished, the advantage being that we don't touch the text and we don't change the JSON structure.

For example, a new paragraph:

```json
{
"text": "As proof-of-concept, we demonstrate that our methodology can be applied towards designing effective TALEs for any given DNA target (i.e. the yeast-one-hybrid experiments) as well as for genome-wide phenotype screens in yeast.",
"section": "Discussion",
"paragraph_rank": 19,
"section_rank": 5
},
{
"text": "We show experimentally that the standard TALE design algorithm can fail to predict RVDs that bind to a desired target. ",
"section": "Discussion",
"paragraph_rank": 20,
"section_rank": 5
}
```

... and a new section:

```json
{
"text": "Again, both clones were able to effectively induce the expression of PDR3 and PDR5 (Fig. 3c).",
"section": "Results",
"paragraph_rank": 18,
"section_rank": 4,
"ref_spans": [
{
"start": 83,
"end": 91,
"type": "figure",
"ref_id": "fig_2",
"text": "(Fig. 3c"
}
]
},
{
"text": "Here, we establish a methodology for the assembly of complete and biased TALE-based libraries. ",
"section": "Discussion",
"paragraph_rank": 19,
"section_rank": 5
}
```

The change you present implies changing the current design of the JSON, because we would need to type the JSON elements in the `body_text` array (for "section" and the paragraph's "sentence"). For the paragraph break, I don't really see how to do it. I think if we really need this (if it's not possible to use the paragraph/section ranks), I would rather simply concatenate one `\n` to the text at the end of a paragraph and `\n\n` to the text at the end of a section. It's not beautiful but it does not impact the offsets (it's just an "append"). For reinjecting the annotations back into the XML (the "back from the json to XML", which is one of the goals), I would simply ignore the `\n`.

What do you think?
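(A hedged sketch of what consuming these ranks could look like on the rendering side: the `text`, `section`, `paragraph_rank`, and `section_rank` fields match the JSON above, the `body_text` key follows the array name mentioned above, and the function and file names are placeholders, not repository code.)

```python
# Sketch only: insert paragraph and section breaks by watching the ranks
# change between consecutive sentence objects, leaving every "text" value
# untouched so character offsets stay valid.
import json

def render(body_text):
    pieces = []
    prev_section = prev_paragraph = None
    for sentence in body_text:
        if prev_section is not None and sentence["section_rank"] != prev_section:
            # new section: blank line, then the section title
            pieces.append("\n\n" + sentence.get("section", "") + "\n\n")
        elif prev_paragraph is not None and sentence.get("paragraph_rank") != prev_paragraph:
            # new paragraph within the same section
            pieces.append("\n\n")
        pieces.append(sentence["text"])
        prev_section = sentence["section_rank"]
        prev_paragraph = sentence.get("paragraph_rank")
    return "".join(pieces)

with open("document.json") as f:          # hypothetical file name
    print(render(json.load(f)["body_text"]))
```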
Good points here and on the call. Seems that this structure works, together with adding `\n\n` based on these patterns. I thought a little more and I can do that addition in my chunking code, rather than adding to the text fields in json. That maintains the mapping between the json and xml better, I think. And it keeps the preparation for tagworks located in one place?

What was the final thinking about whether we can have entity_spans that address section titles? If we can (and I think we coded that way), then we'd have to have the text of section titles as `text` objects in the JSON, no?

--J
If we can avoid adding extra `\n`, better. I've regenerated all the JSON files after updating the TEI2LossyJSON.py converter as you suggested. We now have the section titles as extra elements in the `body_text` array.

Example:

```json
{
  "text": "DISCUSSION",
  "section_rank": 10
},
{
  "text": "The prevalence of asthma has increased worldwide and this has been most strikingly observed in the industrialized countries during the last decade. ",
  "section": "DISCUSSION",
  "paragraph_rank": 12,
  "section_rank": 10
},
```
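(A sketch of how chunking code could consume this: an element without a `paragraph_rank` is a section title, everything else is a sentence. The chunk-size logic here is purely illustrative, not the actual chunking code.)

```python
# Sketch only: treat body_text elements that lack a paragraph_rank as section
# titles, flushing the current chunk and starting a new one with the heading.
def build_chunks(body_text, max_sentences=5):
    chunks, current, n_sentences = [], [], 0
    for element in body_text:
        if "paragraph_rank" not in element:
            if current:
                chunks.append("".join(current))
            current, n_sentences = [element["text"] + "\n\n"], 0
        else:
            current.append(element["text"])
            n_sentences += 1
            if n_sentences >= max_sentences:
                chunks.append("".join(current))
                current, n_sentences = [], 0
    if current:
        chunks.append("".join(current))
    return chunks
```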
Cool. I'm working with that now. Any idea why there isn't a `paragraph_rank` key for quite a few texts here: data/json_goldstandard/10.1002%2Fpam.22030.json:754
This is looking good now; the chunk text all reads well. Offsets stayed functional, as I only ever added linebreaks at the end of a text (for both section headers and paragraph ends). The map.csv has those flags and shows the text that was added.
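(A rough sketch of the end-of-text-only additions described here, with an assumed two-column layout for map.csv; the real map.csv columns and the chunking details are not shown in this thread.)

```python
# Sketch only: every addition goes after the last original character of a
# "text" value, so annotation offsets into the original text never shift.
# The map.csv column names below are assumptions, not the actual file layout.
import csv
import json

def prepare_rows(body_text):
    rows = []
    for i, element in enumerate(body_text):
        nxt = body_text[i + 1] if i + 1 < len(body_text) else None
        if "paragraph_rank" not in element:
            added = "\n\n"                                   # section header line
        elif nxt is not None and nxt.get("paragraph_rank") != element["paragraph_rank"]:
            added = "\n\n"                                   # last sentence of a paragraph
        else:
            added = ""
        rows.append({"text": element["text"] + added, "added": repr(added)})
    return rows

with open("document.json") as f:                             # hypothetical input file
    rows = prepare_rows(json.load(f)["body_text"])

with open("map.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "added"])
    writer.writeheader()
    writer.writerows(rows)
```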
Thanks, I found an error in my code when merging the JSON with the TEI corpus. I've just committed the fix and updated the impacted JSON files. It should be good now:

```json
{
"text": "In fact, research on the effects of CA-PFL on mothers finds that the largest effects on leave-taking are concentrated among the least advantaged mothers (Rossin-Slater, Ruhm, & Waldfogel, 2013).",
"section": "RELATED LITERATURE AND HYPOTHESES",
"paragraph_rank": 13,
"section_rank": 2,
"ref_spans": [
{
"start": 169,
"end": 193,
"type": "bibr",
"ref_id": "b56",
"text": "Ruhm, & Waldfogel, 2013)"
}
]
},
{
"text": "DATA",
"section_rank": 3
},
{
"text": "We use data from the 2000 Census and the 2000 to 2013 waves of the ACS to estimate the effects of CA-PFL on fathers' leave-taking. ",
"section": "DATA",
"paragraph_rank": 14,
"section_rank": 3
},
{
"text": "The ACS is conducted throughout the year and samples 1 percent of the population in most years; thus, it has the major advantage of providing the large samples needed to examine leave-taking behavior among fathers. ",
"section": "DATA",
"paragraph_rank": 14,
"section_rank": 3
},
```
We pushed the chunks over to tagworks and they pointed out that they aren't very readable, due to line breaks (or rather the lack thereof).

What was done before to fix this was to add a linebreak after these fields:

and ensure a space around the `ref` tag. @kermitt2 do you think we could add that to the json conversion? If we have the last sentence of a paragraph as the first sentence of a chunk, then that'll look a bit strange (as it will be a single sentence, then a break, then a new sentence). Seems like this could be done in https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/corpus2JSON.py, or in https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/TEI2LossyJSON.py? But I'm not sure how that would affect enrichJSON.py.
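(For reference, a self-contained illustration of the "space around the `ref` tag" part, not the actual converter logic: when flattening a TEI `<p>`, pad the `<ref>` text with spaces where the surrounding text doesn't already provide them. As noted in the comments, this changes the paragraph content, so annotation offsets would need correcting.)

```python
# Illustration only: flatten a TEI <p> while making sure each <ref> is
# preceded by a space, and followed by one if the next character is a letter
# or digit.
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

def flatten_paragraph(p):
    parts = [p.text or ""]
    for child in p:
        text = "".join(child.itertext())
        if child.tag == TEI + "ref":
            if parts[-1] and not parts[-1].endswith(" "):
                parts.append(" ")                 # missing space before the ref
            parts.append(text)
            tail = child.tail or ""
            if tail and tail[0].isalnum():
                parts.append(" ")                 # missing space after the ref
            parts.append(tail)
        else:
            parts.append(text)
            parts.append(child.tail or "")
    return "".join(parts)

xml = ('<p xmlns="http://www.tei-c.org/ns/1.0">Both clones induced PDR5'
       '<ref type="figure" target="#fig_2">(Fig. 3c)</ref>.</p>')
print(flatten_paragraph(etree.fromstring(xml)))
# -> Both clones induced PDR5 (Fig. 3c).
```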
Pretty good example of where json can't encode (or at least doesn't naturally lend itself to encoding) semantics that are relevant for people reading the text. I appreciate the note at the top of https://github.com/howisonlab/softcite-dataset/blob/master/code/corpus/TEI2LossyJSON.py !