Support external annotation files to allow selective loading and avoid memory issues #21
Hi @ngawangtrinley! Thanks for your interest in STAM! I'm glad you find the model interesting and useful for your project, so I'd love to support you in this. It's precisely real use cases like yours that have to drive STAM development forward (we're a young software project).

When using STAM JSON there is already a way to externalize things over multiple stand-off files, as the resource texts themselves as well as the annotation datasets can be kept in separate files. So you can have a plain text file for each of your large texts, and an independent annotation dataset file for each of the 15 annotation sets. In the STAM JSON serialisation of the annotation store, these are then referenced via the `@include` mechanism.

The current option to have the actual annotations split over multiple files is to write to multiple annotation store files (which may or may not reference the same resources and datasets via `@include`). I have, however, not yet implemented a mechanism to conveniently split an existing store into multiple ones, but I can certainly implement this fairly easily. The reverse is already in place: you can load (merge) multiple annotation store files into a single annotation store at run time. Note that you should never work with multiple annotation stores in memory at run-time, but you can load from multiple files into one, effectively holding only the subset you need in memory. The caveat is that reserialisation to these split files is something that still needs to be implemented.

I hope this answers your question, let me know if you encounter any problems.
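To make the merge-at-runtime idea concrete, here is a plain-Python sketch of the semantics only (the dict layout is invented for illustration; it is neither the stam API nor the STAM JSON format): resources and datasets shared by several store files are kept once, while annotations accumulate.

```python
def merge_stores(*stores):
    """Merge several simplified 'annotation store' dicts into one
    in-memory store: resources/datasets are deduplicated by id,
    annotations from all stores accumulate."""
    merged = {"resources": {}, "datasets": {}, "annotations": []}
    for store in stores:
        merged["resources"].update(store.get("resources", {}))
        merged["datasets"].update(store.get("datasets", {}))
        merged["annotations"].extend(store.get("annotations", []))
    return merged

# Two store files that reference the same text resource:
store1 = {"resources": {"text1": "large text..."},
          "datasets": {"pos": {}},
          "annotations": [{"id": "A1"}]}
store2 = {"resources": {"text1": "large text..."},
          "datasets": {"toc": {}},
          "annotations": [{"id": "A2"}]}

merged = merge_stores(store1, store2)
# the shared resource is held only once; both annotations are present
```

The point of the sketch is that merging is cheap and deduplicating, but not reversible: once merged, nothing records which store file each annotation came from.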
This was not exposed yet in the Python API, but only in the Rust API, slightly related to annotation/stam#21
The `@include` mechanism is perfect. I was going to ask about other cases but it seems that this mechanism will solve them too. We'll test it next week and get back to you. :) I'm pretty sure we will need the easy way to split stores as part of our annotation update setup but we're not there yet. We will let you know if it's not yet implemented when we reach that point. More next week...
@proycon, I would like to ask how to store an AnnotationDataSet separately in a JSON file. We would also like to not include the base file when storing the different AnnotationDataSets separately. I referred to the documentation here, and we code fully in Python.
This wasn't clearly propagated to the Python binding yet. Ref: annotation/stam#21
@tenzin3 Here's an example of how to store a dataset and text resource in separate files when constructing data from scratch with Python:

```python
from stam import *

store = AnnotationStore(id="test", config={ "use_include": True })
resource = store.add_resource(id="testres", filename="test.txt")
dataset = store.add_dataset(id="testdataset", filename="testdataset.dataset.stam.json")
dataset.add_key("pos")
data = dataset.add_data("pos", "noun", "D1")
store.annotate(id="A1",
               target=Selector.textselector(resource, Offset.simple(6, 11)),
               data=data)
store.set_filename("test.store.stam.json")
store.save()
```

Note that you need stam 0.8.3 for this (just released), so you may need to upgrade first.

I hope this answers your question.
@proycon Thank you for your response, that is what we needed in our project.
@proycon some of our text resources are quite big so we've been thinking about splitting them into multiple chunks. Is it possible to do cross-file annotations? For instance, a chapter annotation for a chapter spanning 3 text resources.
Yes, that is no problem. You can accomplish that by using a CompositeSelector with three TextSelectors under it, each pointing to a text selection in a different resource.

Of course, all three chunks need to be loaded into an AnnotationStore for such an annotation to work. But it would allow you to have different annotation stores for different kinds of annotations, associating the appropriate chunks with each.

A disadvantage may only be that you lose the full absolute coordinate space over the text as a whole. The chunks would become the base unit, each starting at 0.
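Since each chunk restarts at offset 0, translating between the absolute coordinate space of the whole text and per-chunk coordinates takes a small helper. This is a sketch and not part of stam; the function name is made up for the example:

```python
def absolute_to_chunk(offset: int, chunk_lengths: list[int]) -> tuple[int, int]:
    """Map an absolute character offset over the concatenated text
    to (chunk_index, local_offset), where each chunk starts at 0."""
    for index, length in enumerate(chunk_lengths):
        if offset < length:
            return index, offset
        offset -= length
    raise ValueError("offset lies beyond the end of the text")

# With chunks of 100 and 50 characters, absolute offset 120
# falls at local offset 20 in the second chunk:
# absolute_to_chunk(120, [100, 50]) == (1, 20)
```

The inverse direction (chunk plus local offset back to an absolute offset) is just a prefix sum over the preceding chunk lengths.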
@proycon, I wanted to ask how you stored the AnnotationDataSet ("testdataset.dataset.stam.json") separately, as I have seen you load it in the line below.

Currently I am facing an error loading the file 'annotation_store.json'.

annotation_store.json:

Root_Segment-e8a.json:
@tenzin3 Can you show me the code where you created this file?
@proycon, yes. After this, I save the annotation store. I have not been successful in saving the AnnotationDataSet separately in a JSON file so far.
Try this, in line with the example I gave earlier:

```python
new_ann_store = AnnotationStore(id="IC17A18B7", config={ "use_include": True })
ann_dataset = new_ann_store.add_dataset(id="root_commentary_6a0", filename="Root_Segment-e8a.json")
new_ann_store.set_filename("annotation_store.json")
new_ann_store.save()
```

That should give you an `annotation_store.json` that references `Root_Segment-e8a.json` via `@include`.
@proycon, oh yes, I tried it and it is saving the AnnotationDataSet separately in JSON. However, I would like to save the AnnotationDataSet to a JSON file only after adding the annotations to the dataset.

Our plan is to store each separate AnnotationDataSet with its annotations in individual JSON files. We will then have a main file, `annotation_store.json`, which will load only the AnnotationDataSet that we need at the moment.

Would it be possible to save the AnnotationDataSet object itself to a JSON file? Any guidance you can provide would be greatly appreciated.
On Fri Jul 12, 2024 at 1:19 PM CEST, Tenzin Tsundue wrote:
@proycon, Oo yes i tried it and it is saving AnnotationDataset
separately in json However, I would like to save the AnnotationDataset
to a JSON file only after adding the annotations to the dataset.
Our plan is to store each separate AnnotationDataset with its
annotations in individual JSON files. We will then have a main file,
```annotation_store.json```, which will load only the
AnnotationDataset that we need at the moment.
Would it be possible to save the AnnotationDataset object itself to a
JSON file? Any guidance you can provide would be greatly appreciated.
I feel there may be a bit of misunderstanding regarding the STAM model
here, because that's what you have now: the AnnotationDataSet object in
a JSON file by itself. An `AnnotationDataSet` contains `AnnotationData`
and `DataKey`s, it never contains the actual annotations (only the data
of those annotations). The annotations themselves are always in an
`AnnotationStore`.
So I think what you may want here is multiple annotation stores that
reference multiple annotation data sets; say AnnotationStore1 uses
AnnotationDataSet1, AnnotationDataSet2, TextResource1, TextResource2 and
AnnotationStore2 uses AnnotationDataSet1 and AnnotationDataSet3, and
TextResource2 and TextResource3.
You can have multiple annotation stores reference the same
annotationdataset(s) and textresource(s) as long as you're careful to
never work with multiple stores in memory or on disk at the same time.
STAM would also allow you to easily merge AnnotationStore1 and
AnnotationStore2 in such cases, in case you want to jointly query
annotations in both, yet keep them separated on disk.
Thank you for the clarification. Here are a couple of follow-ups:
- how can we merge two stores? The python API doesn't seem to have this feature yet.
Once you have an `AnnotationStore` instance, you can just keep calling `from_file()`
on it to merge in additional stores: https://stam-python.readthedocs.io/en/latest/autoapi/stam/index.html#stam.AnnotationStore.from_file
Do be careful when you reserialize a merged store with `save()`, because that
would output all merged annotations to the first loaded store only! That
is, the merge is not reversible.
I did also implement some initial splitting functionality now that
does do the reverse: Given an annotation store, and a selection of either
what you want to keep or delete, it deletes things from the store. It's
still a bit experimental (needs further testing and optimisation).
It's available in stam-python v0.8.4. I recommend an upgrade either way
because there is also an important fix in
`AnnotationStore.add_dataset()`
for merging.
- when merging AnnotationStore1, AnnotationStore2 and AnnotationStore3, each with TextResource1 in the @include statement, does TextResource1 get loaded 3 times, once with each store?
No, the resource will only be loaded once. This also goes for the
annotation datasets and any keys and data in there.
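Purely as an illustration of that load-once behaviour (a stand-in, not stam's actual internals), a cache keyed by filename reads each shared stand-off file a single time regardless of how many stores reference it:

```python
class FileCache:
    """Stand-in demonstrating load-once semantics for shared stand-off files."""
    def __init__(self):
        self._loaded = {}
        self.load_count = 0

    def load(self, filename):
        # Only the first reference triggers an actual load;
        # later references return the cached copy.
        if filename not in self._loaded:
            self.load_count += 1
            self._loaded[filename] = f"<contents of {filename}>"
        return self._loaded[filename]

cache = FileCache()
# Three merged stores all reference the same resource via @include:
for _ in range(3):
    cache.load("TextResource1.txt")
# cache.load_count is 1: the file was read exactly once
```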
@proycon, thank you for the explanation. One of our main goals is to store translation pairs, such as Tibetan-English translation pairs. We want to store the Tibetan and English annotations separately, and also store the mapping annotations that link the two languages in a separate file. Although we have successfully created mapping annotations using a composite selector, with the stam module the Tibetan and English text annotations end up together in the store. Is there a way to store only the mapping annotations in an annotation store? We want to keep the annotation files separate for the following reasons:

Looking at the picture below: when we need annotations for the Tibetan files, we want to load only those, and the same goes for the English files. When we want the alignment of the Tibetan and English translations, we wish to load the alignment annotation files together with the Tibetan and English sentence annotation files.
In a team discussion with @eroux, an idea was proposed to keep the translation alignment file separate from STAM. Instead, we could use our own abstraction for the alignment file and integrate the other annotations into STAM. What are your thoughts on this approach?
Thanks for including the nice picture, that makes it a lot easier for me to understand your use case.

One little sidenote about this first one though: JSON in general isn't a very compact serialisation format, so it can get verbose.

Now onto the actual issue: keeping the alignments separated from the rest in a store of their own. A possible solution is to include copies of the relevant resources and annotations in the alignment store, though that implies duplication. If you then load everything from 1, 2 and 3 into one store you still have easy access to everything at once.

Though of course possible, I'd be more inclined (and a bit biased probably) to keep everything within STAM rather than introduce a separate abstraction alongside it.

Another possible solution, more on my side of things, is to reconsider whether the STAM model should allow annotations in one store to reference other stores.

What do you think? I hope this gives some options to resolve your question.
@proycon , First, I would like to thank you for your deep thoughts on all our doubts and ideas. Yes, the STAM JSON is indeed very verbose, but we are keen on using STAM due to its diverse features regarding annotations and its speed optimization, being built on Rust.
We did think of this solution before, and as you mentioned, it would lead to duplication of resource files and annotations, which we are very much trying to avoid.
That would be very helpful, especially since we are working with many Tibetan religious texts and their translations. If the STAM model allows dependencies on other stores, that flexibility would help keep the Tibetan-English alignment file separate. It could also help with higher-order annotations such as sentence and word annotations: if a user needs sentence annotations, we could load only the sentence annotation file; if they need word annotations, then we could load both files. We have this kind of plan for the future, so STAM adapting to these needs would be very helpful. If possible, we want to integrate all our data in the STAM format and benefit from its potential.
Yes, I agree that extending STAM here would probably be the best way to go forward. I do have to carefully consider all the ramifications and then do the implementation, so that will take some time.

I'm already thinking out loud here: the main challenge is the fact that different annotations in a single store (because in memory it is always a single one you work with at any time; in this case you'd load 1+2+3, otherwise you can't make the alignments) would need to be serialized to different STAM JSON files. This relates to the split functionality I already implemented this month, but goes further than it currently does and places some extra demands on things, like keeping track of which annotations come from which store files. It may be the most sensible solution and provide a lot of extra flexibility. I'll open a new separate issue for it and link it from here.
@proycon, Thank you for considering this solution and for your willingness to extend STAM to meet our needs.
@proycon thanks a lot for considering this change to the model. Annotation selectors pointing to other stores are a must for applications dealing with literature. It looks like most religious corpora more or less follow the Bible's verse/chapter/book model, in which most annotations target verses. Allowing stores to link to other stores will make it possible to work at different levels of abstraction: use cases only dealing with verses and chapters shouldn't need to deal with character offsets. And most importantly, we don't want to have 100 copies of the verse offsets when we have 100 stores dealing with verses and higher levels of abstraction.

The reason we split the 3 GitHub repos is to facilitate the offset update process whenever a text resource is updated. If we keep a copy of the text resource in repo 3, we end up with two versions of the text resource and it becomes very complex to handle versioning. What we have in mind is to require that annotations featuring text selectors must always be located with the text resource (annotations at higher levels of abstraction, such as repo 3, can be in different locations). This is just to ensure that each time we change the text resource and trigger an offset update, it is done for all relevant stores at the same time.
@proycon, can you elaborate on the reason you mentioned that we can't load multiple annotation stores in memory at the same time?

```python
english_words = AnnotationStore(...)
french_words = AnnotationStore(...)
## added dataset, resources, set filename (for each annotation store)
for word in words:
    if is_english_word(word):
        english_words.annotate(.....)
    elif is_french_word(word):
        french_words.annotate(.....)
english_words.save()
french_words.save()
```

Will writing code like the above be any problem? For our project, we ideally aim to first store all the annotations in memory (there will be more than one annotation store) and then, after performing validation, write everything out at once at the end.
@proycon , can you elaborate the reason you mentioned that we cant
load multiple annotation stores in memory at the same time.
As long as you keep them strictly independent of one another, it's okay.
But as soon as you pass for instance a data key that belongs to one
store to another, you will get errors or unexpected results.
Another risk is if any of the multiple loaded stores shares the same substore,
that would then be duplicated in memory and upon serialisation you might
overwrite the other (based on whoever happens to serialise first).
```python
english_words = AnnotationStore(...)
french_words = AnnotationStore(...)
## added dataset , resources , setfilename (for annotationstore)
for word in words:
if is_english_word(word):
english_words.annotate(.....)
elif is_french_word(word):
french_words.annotate(.....)
english_words.save()
french_words.save()
```
Writing code as above, will be any problem? For our project, ideally
we aim to store all the annotations first in memory(there will be more
than one annotation store) and then after performing validation, then
we wish to create at once in final.
I assume `words` here is not a STAM instance yet, so then it's independent and
okay yes.
We're working on PechaData, a multilingual Buddhist corpus project in collaboration with bdrc.io and pecha.org. As a format, STAM is a dream for our project, and we're starting to build our project on top of it, with a mechanism to update annotation coordinates when the base text is updated.

However, our dataset includes many large texts (>10 MB .txt) featuring multiple annotation layers that are often larger than the initial text file, and we are concerned about performance issues when we have to load all the annotations into memory even when we only need a couple of sets of annotations (e.g. we have a file with 15 annotation sets, including POS tags and dependencies, but we only need the text and the annotations for the table of contents).

Have you considered externalizing annotations in separate files, like the .ann files of brat, or do you have another solution for loading annotations selectively? We thought about patching STAM to find a solution, but we would much prefer a solution coming from the creators.
Thanks a lot for your work!