Support external annotation files to allow selective loading and avoid memory issues #21
Hi @ngawangtrinley! Thanks for your interest in STAM! I'm glad you find the model interesting and useful for your project, so I'd love to support you in this. It's precisely real use cases like yours that have to drive STAM development forward (we're a young software project).

When using STAM JSON there is already a way to externalize things over multiple stand-off files, as the resource texts themselves as well as the annotation datasets can be kept in separate files. So you can have a plain text file for each of your large texts, and an independent annotation dataset file for each of the 15 annotation sets. In the STAM JSON serialisation of the annotation store, these are then referenced via the `@include` mechanism.

The current option to have the actual annotations split over multiple files is to write to multiple annotation store files (which may or may not reference the same resources and datasets via `@include`). I have, however, not yet implemented a mechanism to conveniently split an existing store into multiple ones, but I can certainly implement this fairly easily. The reverse is already in place: you can load (merge) multiple annotation store files into a single annotation store at run time. Note that you should never work with multiple annotation stores in memory at run-time, but you can load from multiple files into one, effectively holding only the subset you need in memory. The caveat is that reserialisation to these split files is something that still needs to be implemented.

I hope this answers your question, let me know if you encounter any problems.
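To make the merge-at-runtime idea concrete, here is a plain-Python sketch of the semantics only (the dict layout is invented for illustration; it is neither the stam API nor the STAM JSON format): resources and datasets shared by several store files are kept once, while annotations accumulate.

```python
def merge_stores(*stores):
    """Merge several simplified 'annotation store' dicts into one
    in-memory store: resources/datasets are deduplicated by id,
    annotations from all stores accumulate."""
    merged = {"resources": {}, "datasets": {}, "annotations": []}
    for store in stores:
        merged["resources"].update(store.get("resources", {}))
        merged["datasets"].update(store.get("datasets", {}))
        merged["annotations"].extend(store.get("annotations", []))
    return merged

# Two store files that reference the same text resource:
store1 = {"resources": {"text1": "large text..."},
          "datasets": {"pos": {}},
          "annotations": [{"id": "A1"}]}
store2 = {"resources": {"text1": "large text..."},
          "datasets": {"toc": {}},
          "annotations": [{"id": "A2"}]}

merged = merge_stores(store1, store2)
# the shared resource is held only once; both annotations are present
```

The point of the sketch is that merging is cheap and deduplicating, but not reversible: once merged, nothing records which store file each annotation came from.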
This was not exposed yet in the Python API, but only in the Rust API, slightly related to annotation/stam#21
The `@include` mechanism is perfect. I was going to ask about other cases but it seems that this mechanism will solve them too. We'll test it next week and get back to you. :) I'm pretty sure we will need the easy way to split stores as part of our annotation update setup but we're not there yet. We will let you know if it's not yet implemented when we reach that point. More next week...
@proycon, I would like to ask how to store an AnnotationDataSet separately in a JSON file. We would also like to not include the base file when storing the different AnnotationDataSets separately. I referred to the documentation here, and we code fully in Python.
This wasn't clearly propagated to the Python binding yet. Ref: annotation/stam#21
@tenzin3 Here's an example of how to store a dataset and text resource in separate files when constructing data from scratch with Python:

```python
from stam import *

store = AnnotationStore(id="test", config={ "use_include": True })
resource = store.add_resource(id="testres", filename="test.txt")
dataset = store.add_dataset(id="testdataset", filename="testdataset.dataset.stam.json")
dataset.add_key("pos")
data = dataset.add_data("pos", "noun", "D1")
store.annotate(id="A1",
               target=Selector.textselector(resource, Offset.simple(6, 11)),
               data=data)
store.set_filename("test.store.stam.json")
store.save()
```

Note that you need stam 0.8.3 for this (just released), so you may need to upgrade first.

I hope this answers your question.
@proycon Thank you for your response, that is what we needed in our project.
@proycon some of our text resources are quite big so we've been thinking about splitting them into multiple chunks. Is it possible to do cross-file annotations? For instance, a chapter annotation for a chapter spanning 3 text resources.
Yes, that is no problem. You can accomplish that by using a CompositeSelector with three TextSelectors under it, each pointing to a text selection in a different resource.

Of course, all three chunks need to be loaded into an AnnotationStore for such an annotation to work. But it would allow you to have different annotation stores for different kinds of annotations, associating the appropriate chunks with each.

A disadvantage may only be that you lose the full absolute coordinate space over the text as a whole. The chunks would become the base unit, each starting at 0.
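Since each chunk restarts at offset 0, translating between the absolute coordinate space of the whole text and per-chunk coordinates takes a small helper. This is a sketch and not part of stam; the function name is made up for the example:

```python
def absolute_to_chunk(offset: int, chunk_lengths: list[int]) -> tuple[int, int]:
    """Map an absolute character offset over the concatenated text
    to (chunk_index, local_offset), where each chunk starts at 0."""
    for index, length in enumerate(chunk_lengths):
        if offset < length:
            return index, offset
        offset -= length
    raise ValueError("offset lies beyond the end of the text")

# With chunks of 100 and 50 characters, absolute offset 120
# falls at local offset 20 in the second chunk:
# absolute_to_chunk(120, [100, 50]) == (1, 20)
```

The inverse direction (chunk plus local offset back to an absolute offset) is just a prefix sum over the preceding chunk lengths.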
@proycon, I wanted to ask how you stored the AnnotationDataSet ("testdataset.dataset.stam.json") separately, as I have seen you load it in the line below.

Currently I am facing an error loading the file 'annotation_store.json'.

annotation_store.json:

Root_Segment-e8a.json:
@tenzin3 Can you show me the code where you created this file?
@proycon, yes. After this, I save the annotation store. I have not been successful in saving the AnnotationDataSet separately in a JSON file so far.
Try this, in line with the example I gave earlier:

```python
new_ann_store = AnnotationStore(id="IC17A18B7", config={ "use_include": True })
ann_dataset = new_ann_store.add_dataset(id="root_commentary_6a0", filename="Root_Segment-e8a.json")
new_ann_store.set_filename("annotation_store.json")
new_ann_store.save()
```

That should give you an `annotation_store.json` that references `Root_Segment-e8a.json` via `@include`.
@proycon, oh yes, I tried it and it is saving the AnnotationDataSet separately in JSON. However, I would like to save the AnnotationDataSet to a JSON file only after adding the annotations to the dataset.

Our plan is to store each separate AnnotationDataSet with its annotations in individual JSON files. We will then have a main file, `annotation_store.json`, which will load only the AnnotationDataSet that we need at the moment.

Would it be possible to save the AnnotationDataSet object itself to a JSON file? Any guidance you can provide would be greatly appreciated.
On Fri Jul 12, 2024 at 1:19 PM CEST, Tenzin Tsundue wrote:
@proycon, Oo yes i tried it and it is saving AnnotationDataset
separately in json However, I would like to save the AnnotationDataset
to a JSON file only after adding the annotations to the dataset.
Our plan is to store each separate AnnotationDataset with its
annotations in individual JSON files. We will then have a main file,
```annotation_store.json```, which will load only the
AnnotationDataset that we need at the moment.
Would it be possible to save the AnnotationDataset object itself to a
JSON file? Any guidance you can provide would be greatly appreciated.
I feel there may be a bit of misunderstanding regarding the STAM model
here, because that's what you have now: the AnnotationDataSet object in
a JSON file by itself. An `AnnotationDataSet` contains `AnnotationData`
and `DataKey`s, it never contains the actual annotations (only the data
of those annotations). The annotations themselves are always in an
`AnnotationStore`.
So I think what you may want here is multiple annotation stores that
reference multiple annotation data sets; say AnnotationStore1 uses
AnnotationDataSet1, AnnotationDataSet2, TextResource1, TextResource2 and
AnnotationStore2 uses AnnotationDataSet1 and AnnotationDataSet3, and
TextResource2 and TextResource3.
You can have multiple annotation stores reference the same
annotationdataset(s) and textresource(s) as long as you're careful to
never work with multiple stores in memory or on disk at the same time.
STAM would also allow you to easily merge AnnotationStore1 and
AnnotationStore2 in such cases, in case you want to jointly query
annotations in both, yet keep them separated on disk.
Thank you for the clarification. Here are a couple of follow-ups:
- how can we merge two stores? The python API doesn't seem to have this feature yet.
Once you have an `AnnotationStore` instance, you can just keep calling `from_file()`
on it to merge in additional stores: https://stam-python.readthedocs.io/en/latest/autoapi/stam/index.html#stam.AnnotationStore.from_file
Do be careful when you reserialize a merged store with `save()`, because that
would output all merged annotations to the first loaded store only! That
is, the merge is not reversible.
I did also implement some initial splitting functionality now that
does do the reverse: Given an annotation store, and a selection of either
what you want to keep or delete, it deletes things from the store. It's
still a bit experimental (needs further testing and optimisation).
It's available in stam-python v0.8.4. I recommend an upgrade either way
because there is also an important fix in
`AnnotationStore.add_dataset()`
for merging.
- when merging AnnotationStore1, AnnotationStore2 and AnnotationStore3, each with TextResource1 in the @include statement, does TextResource1 get loaded 3 times, once with each store?
No, the resource will only be loaded once. This also goes for the
annotation datasets and any keys and data in there.
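Purely as an illustration of that load-once behaviour (a stand-in, not stam's actual internals), a cache keyed by filename reads each shared stand-off file a single time regardless of how many stores reference it:

```python
class FileCache:
    """Stand-in demonstrating load-once semantics for shared stand-off files."""
    def __init__(self):
        self._loaded = {}
        self.load_count = 0

    def load(self, filename):
        # Only the first reference triggers an actual load;
        # later references return the cached copy.
        if filename not in self._loaded:
            self.load_count += 1
            self._loaded[filename] = f"<contents of {filename}>"
        return self._loaded[filename]

cache = FileCache()
# Three merged stores all reference the same resource via @include:
for _ in range(3):
    cache.load("TextResource1.txt")
# cache.load_count is 1: the file was read exactly once
```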
@proycon, thank you for the explanation. One of our main goals is to store translation pairs, such as Tibetan-English translation pairs. We want to store the Tibetan and English annotations separately, and also store the mapping annotations that link the two languages in a separate file. Although we have successfully created mapping annotations using a composite selector, with the stam module the Tibetan and English text annotations end up together in the store. Is there a way to store only the mapping annotations in an annotation store? We want to keep the annotation files separate for the following reasons:

Looking at the picture below: when we need annotations for the Tibetan files, we want to load only those, and the same goes for the English files. When we want the alignment of the Tibetan and English translations, we wish to load the alignment annotation files together with the Tibetan and English sentence annotation files.
In a team discussion with @eroux, an idea was proposed to keep the translation alignment file separate from STAM. Instead, we could use our own abstraction for the alignment file and integrate the other annotations into STAM. What are your thoughts on this approach?
Thanks for including the nice picture, that makes it a lot easier for me to understand your use case.

One little sidenote about this first one though: JSON in general isn't a very compact serialisation format, so it can get verbose.

Now onto the actual issue: keeping the alignments separated from the rest in a store of their own. A possible solution is to include copies of the relevant resources and annotations in the alignment store, though that implies duplication. If you then load everything from 1, 2 and 3 into one store you still have easy access to everything at once.

Though of course possible, I'd be more inclined (and a bit biased probably) to keep everything within STAM rather than introduce a separate abstraction alongside it.

Another possible solution, more on my side of things, is to reconsider whether the STAM model should allow annotations in one store to reference other stores.

What do you think? I hope this gives some options to resolve your question.
@proycon , First, I would like to thank you for your deep thoughts on all our doubts and ideas. Yes, the STAM JSON is indeed very verbose, but we are keen on using STAM due to its diverse features regarding annotations and its speed optimization, being built on Rust.
We did think of this solution before, and as you mentioned, it would lead to duplication of resource files and annotations, which we are very much trying to avoid.
That would be very helpful, especially since we are working with many Tibetan religious texts and their translations. If the STAM model allows dependencies on other stores, that flexibility would help keep the Tibetan-English alignment file separate. It could also help with higher-order annotations such as sentence and word annotations: if a user needs sentence annotations, we could load only the sentence annotation file; if they need word annotations, then we could load both files. We have this kind of plan for the future, so STAM adapting to these needs would be very helpful. If possible, we want to integrate all our data in the STAM format and benefit from its potential.
Yes, I agree that extending STAM here would probably be the best way to go forward. I do have to carefully consider all the ramifications and then do the implementation, so that will take some time.

I'm already thinking out loud here: the main challenge is the fact that different annotations in a single store (because in memory it is always a single one you work with at any time; in this case you'd load 1+2+3, otherwise you can't make the alignments) would need to be serialized to different STAM JSON files. This relates to the split functionality I already implemented this month, but goes further than it currently does and places some extra demands on things, like keeping track of which annotations come from which store files. It may be the most sensible solution and provide a lot of extra flexibility. I'll open a new separate issue for it and link it from here.
@proycon, Thank you for considering this solution and for your willingness to extend STAM to meet our needs.
@proycon thanks a lot for considering this change to the model. Annotation selectors pointing to other stores are a must for applications dealing with literature. It looks like most religious corpora more or less follow the Bible's verse/chapter/book model, in which most annotations target verses. Allowing stores to link to other stores will make it possible to work at different levels of abstraction: use cases only dealing with verses and chapters shouldn't need to deal with character offsets. And most importantly, we don't want to have 100 copies of the verse offsets when we have 100 stores dealing with verses and higher levels of abstraction.

The reason we split the 3 GitHub repos is to facilitate the offset update process whenever a text resource is updated. If we keep a copy of the text resource in repo 3, we end up with two versions of the text resource and it becomes very complex to handle versioning. What we have in mind is to require that annotations featuring text selectors must always be located with the text resource (annotations at higher levels of abstraction, such as repo 3, can be in different locations). This is just to ensure that each time we change the text resource and trigger an offset update, it is done for all relevant stores at the same time.
@proycon, can you elaborate on the reason you mentioned that we can't load multiple annotation stores in memory at the same time?

```python
english_words = AnnotationStore(...)
french_words = AnnotationStore(...)
## added dataset, resources, set filename (for each annotation store)
for word in words:
    if is_english_word(word):
        english_words.annotate(.....)
    elif is_french_word(word):
        french_words.annotate(.....)
english_words.save()
french_words.save()
```

Will writing code like the above be any problem? For our project, we ideally aim to first store all the annotations in memory (there will be more than one annotation store) and then, after performing validation, write everything out at once at the end.
@proycon , can you elaborate the reason you mentioned that we cant
load multiple annotation stores in memory at the same time.
As long as you keep them strictly independent of one another, it's okay.
But as soon as you pass for instance a data key that belongs to one
store to another, you will get errors or unexpected results.
Another risk is if any of the multiple loaded stores shares the same substore,
that would then be duplicated in memory and upon serialisation you might
overwrite the other (based on whoever happens to serialise first).
```python
english_words = AnnotationStore(...)
french_words = AnnotationStore(...)
## added dataset , resources , setfilename (for annotationstore)
for word in words:
if is_english_word(word):
english_words.annotate(.....)
elif is_french_word(word):
french_words.annotate(.....)
english_words.save()
french_words.save()
```
Writing code as above, will be any problem? For our project, ideally
we aim to store all the annotations first in memory(there will be more
than one annotation store) and then after performing validation, then
we wish to create at once in final.
I assume `words` here is not a STAM instance yet, so then it's independent and
okay yes.
We're working on PechaData, a multilingual Buddhist corpus project in collaboration with bdrc.io and pecha.org. As a format, STAM is a dream for our project, and we're starting to build our project on top of it, with a mechanism to update annotation coordinates when the base text is updated.

However, our dataset includes many large texts (>10 MB .txt) featuring multiple annotation layers that are often larger than the initial text file, and we are concerned about performance issues when we have to load all the annotations into memory even when we only need a couple of sets of annotations (e.g. we have a file with 15 annotation sets, including POS tags and dependencies, but we only need the text and the annotations for the table of contents).

Have you considered externalizing annotations in separate files, like the .ann files of brat, or do you have another solution for loading annotations selectively? We thought about patching STAM to find a solution, but we would much prefer a solution coming from the creators.
Thanks a lot for your work!