Hierarchical data structure in .eln or not #98

Open
SteffenBrinckmann opened this issue Nov 19, 2024 · 7 comments

Comments

@SteffenBrinckmann
Collaborator

(based on discussion around PR #95)
Should the .eln file allow a hierarchical structure of folders and files, or should it stay flat? Most ELNs seem to use a rather flat structure and do not allow arbitrarily nested structures, which can have directories or files at the top level.

Hence there seem to be two alternatives:
A) The .eln has a rather flat structure of two levels (./ and the items). ELNs that have a deep hierarchy flatten it on export to .eln and deepen it again on import from .eln (a possible deepening algorithm is given below).

B) The .eln has an arbitrarily deep hierarchy. On import, ELNs that do not support deep hierarchies flatten the content of the .eln.

The deepening algorithm can be based on set operations. Let A be the set of ids at './' and B the set of all ids in all hasParts. The top-level items are A - B. From there, use the given hasPart information to construct the individual trees. Items of B that are not in A are supplementary.
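
The deepening step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not an agreed implementation: it assumes the `@graph` of `ro-crate-metadata.json` is already loaded as a list of dicts, that A is read from the `hasPart` of `./`, and that B is collected from the `hasPart` of every node except `./` (otherwise A - B would always be empty). The names `deepen`, `build_tree`, and `example_graph` are hypothetical.

```python
ROOT_ID = "./"


def ids_of(value):
    """Normalise a hasPart value (missing, one object, or a list) to a list of @id strings."""
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]
    return [v["@id"] for v in value]


def deepen(graph):
    """Return (top_level_ids, children_by_id, supplementary_ids) for a flat @graph list."""
    nodes = {n["@id"]: n for n in graph}
    # A: ids listed at './' (assumption: taken from the root's hasPart)
    a = set(ids_of(nodes[ROOT_ID].get("hasPart")))
    # hasPart of every node except the root
    children = {
        node_id: ids_of(node.get("hasPart"))
        for node_id, node in nodes.items()
        if node_id != ROOT_ID
    }
    # B: all ids that appear in any of those hasPart lists
    b = {child for parts in children.values() for child in parts}
    return sorted(a - b), children, sorted(b - a)


def build_tree(node_id, children, seen=frozenset()):
    """Build a nested dict of descendants for one item, raising on cycles."""
    if node_id in seen:
        raise ValueError(f"cycle detected at {node_id}")
    seen = seen | {node_id}
    return {c: build_tree(c, children, seen) for c in children.get(node_id, [])}


# Hypothetical flat export: './a/b/' is listed at './' and also nested under './a/'.
example_graph = [
    {"@id": "./", "hasPart": [{"@id": "./a/"}, {"@id": "./a/b/"}, {"@id": "./c/"}]},
    {"@id": "./a/", "hasPart": [{"@id": "./a/b/"}]},
    {"@id": "./a/b/", "hasPart": [{"@id": "./a/b/file.txt"}]},
    {"@id": "./c/"},
    {"@id": "./a/b/file.txt"},
]

top, children, extra = deepen(example_graph)
trees = {item: build_tree(item, children) for item in top}
```

In this example, `top` contains `./a/` and `./c/`, and `./a/b/file.txt` ends up in `extra` because it is referenced by a hasPart but not listed at `./`.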

@SteffenBrinckmann
Collaborator Author

@NicolasCARPi @nicobrandt @FlorianRhiem Any opinion / preference?
[I would prefer a "decision" such that I know whether to change the export / import algorithm]

@FlorianRhiem
Contributor

So far, SampleDB doesn't follow either alternative, though it aligns roughly with A. The structure for a SampleDB export is roughly like this:

  • ./ the root data entity, which has as parts
    • ./users/1, users are always part of the root data entity and do not have parts themselves
    • ./objects/1, importable objects are always part of the root data entity, and have as parts
      • ./objects/1/files.json, a JSON file containing more SampleDB-internal information on the files associated with ./objects/1
      • ./objects/1/files/0/example.txt, a file for ./objects/1
      • ./objects/1/comments.json, a JSON file containing more SampleDB-internal information on the comments associated with ./objects/1
      • ./objects/1/versions/0, a version of ./objects/1, with 0 being the first version, etc., which has as parts:
        • ./objects/1/versions/0/data.json, a JSON file containing the SampleDB-internal data for ./objects/1/versions/0
        • ./objects/1/versions/0/schema.json, a JSON file containing the SampleDB-internal schema for ./objects/1/versions/0

The key distinction from a deeply nested approach is that only the Dataset nodes which are parts of ./ should be imported, in this case ./objects/1. It then has parts for files, and an importer that can handle the additional information may use it, but those parts are not considered importable objects and should not be shown as standalone datasets.

I think it's fine to have nodes which represent directories and aren't really importable objects/Datasets, and to have nodes that are deeply nested but are really importable objects/Datasets, as long as there's a well-defined way to find all the Dataset nodes that should be considered as "importable". Sure, they may use mentions to reference other Datasets, but they should not be parts of a whole that would be incomplete on their own.
So far, I worked under the assumption that being part of ./ meant that a Dataset had that status. If that changes, we need an alternative. If that doesn't change, PASTA should include the nested Datasets that should be imported in ./.
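
The assumption that "being part of ./" marks a Dataset as importable could be sketched like this. It is a minimal illustration, not SampleDB's actual import code: the function name `importable_datasets` and the example graph are hypothetical, and the `@type` values in the example are assumed for illustration only.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def importable_datasets(graph, root_id="./"):
    """Return the @ids of Dataset nodes that are direct parts of the root data entity."""
    nodes = {n["@id"]: n for n in graph}
    result = []
    for ref in as_list(nodes[root_id].get("hasPart")):
        node = nodes.get(ref["@id"])
        if node is not None and "Dataset" in as_list(node.get("@type")):
            result.append(node["@id"])
    return result


# Hypothetical graph loosely following the layout described above.
example_graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "./users/1"}, {"@id": "./objects/1"}]},
    {"@id": "./users/1", "@type": "Person"},  # type assumed for illustration
    {"@id": "./objects/1", "@type": "Dataset",
     "hasPart": [{"@id": "./objects/1/versions/0"}, {"@id": "./objects/1/files.json"}]},
    {"@id": "./objects/1/versions/0", "@type": "Dataset"},
    {"@id": "./objects/1/files.json", "@type": "File"},
]

importable = importable_datasets(example_graph)
```

Here only `./objects/1` is returned: `./objects/1/versions/0` is a Dataset too, but it is a part of `./objects/1` rather than of `./`, so it is not treated as a standalone importable object.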

One way would be to add a custom attribute, or to create a DataCatalog node listing the importable Dataset nodes. Though I think a way that would be consistent with the behavior of more general tools dealing with RO-Crates would be ideal, rather than a custom solution.

@salexan2001 Are you aware of how this is handled by RO-Crates more generally?

@nicobrandt
Contributor

The exported data structure in Kadi4Mat currently looks like the following:

<RO-Crate root>/
  |   ro-crate-metadata.json
  |   folder1/
  |    | file1
  |    | file2
  |    | ...
  |   folder2/
  |    | file1
  |    | file2
  |    | ...
  |   ...

Whether there are one or multiple "folders" depends on whether a single record (the basic data/metadata containers in Kadi4Mat) or a collection of multiple records is exported. This is also why we have two examples in this repo, one for each resource type. The general structure is the same though.

That being said, we also support collection hierarchies in Kadi4Mat. However, the export currently only goes one level deep. We haven't decided yet on how to deal with this in the future. Basically, we could either keep the hierarchy flat (maybe with some additional metadata if someone really wants to recreate the hierarchy), or actually make use of "sub-folders". In the latter case, intermediate folders (collections) would not contain any files though.

Regarding the import, we have so far focused on the flat structure shown above. Importing nested structures would in principle be possible for us, but probably not without limitations or some information loss, independent of whether we flatten everything.


All in all, I don't have strong opinions about this, as long as we can agree on something. In general, I suggest keeping our spec a bit more strict than the RO-Crate spec though to make our life a bit easier. I also suggest discussing this in the next meeting, rather than only a couple of people deciding right now :P

@simontaurus
Contributor

simontaurus commented Nov 21, 2024

I don't think we have to / should make a decision here.
The RO-Crate dataset structure is by definition at least a directed acyclic graph or even a tree (it would make sense to restrict it to a tree), and it's up to the target system whether to build internally a single nested entity, multiple linked flat entities, or anything in between. A conventional file system has to nest, while e.g. in OpenSemanticLab we can just replicate the graph.
A target application may choose to skip nesting path elements if they are just folders without additional metadata.
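
Whether a given export actually restricts the hasPart structure to a tree could be checked with something like the following sketch. It assumes the `@graph` is available as a list of dicts and only inspects nodes reachable from `./`; the function name `tree_violations` and both example graphs are hypothetical.

```python
def ids_of(value):
    """Normalise a hasPart value (missing, one object, or a list) to a list of @id strings."""
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]
    return [v["@id"] for v in value]


def tree_violations(graph, root_id="./"):
    """Return ids reached a second time via hasPart; an empty list means a tree."""
    nodes = {n["@id"]: n for n in graph}
    parents = {}          # child id -> first parent found
    violations = []
    visited = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        for child in ids_of(nodes.get(current, {}).get("hasPart")):
            if child == root_id or child in parents:
                # second parent (DAG) or a link back to an ancestor (cycle)
                violations.append(child)
            else:
                parents[child] = current
                stack.append(child)
    return violations


# Hypothetical strictly nested export: every node has exactly one parent.
strict_tree = [
    {"@id": "./", "hasPart": [{"@id": "./a/"}]},
    {"@id": "./a/", "hasPart": [{"@id": "./a/b/"}]},
    {"@id": "./a/b/"},
]

# Hypothetical export that lists './a/b/' both at './' and under './a/'.
flat_plus_nested = [
    {"@id": "./", "hasPart": [{"@id": "./a/"}, {"@id": "./a/b/"}]},
    {"@id": "./a/", "hasPart": [{"@id": "./a/b/"}]},
    {"@id": "./a/b/"},
]

tree_result = tree_violations(strict_tree)
dag_result = tree_violations(flat_plus_nested)
```

Note that an export which lists every item at `./` and also keeps nested hasPart links is a directed acyclic graph but not a tree under this check, so a tree restriction and a "everything at root" rule would need to be reconciled.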

@SteffenBrinckmann
Collaborator Author

I agree that we might want to discuss this during a meeting, along with #69.
I also agree to make the .eln-spec "more strict to make life easier". Also, the spec should be easy to understand, without quirky rules.
What I do not yet agree with is the concept of "importable". If content is important, it should be in the .eln; otherwise it should not. If we understand .eln as the intersection of all the functionality/features of all the ELNs, it means that not all functionality of an ELN can be saved in .eln. Along those lines, it makes sense to strip content when exporting into .eln, as PASTA would do if we decide on this issue.

@NicolasCARPi
Contributor

> PASTA should include the nested Datasets that should be imported in ./.

For me that's the important bit: things that must be imported must be mentioned in ./. In the PASTA .eln there are Datasets with hasPart, which means we need to process them recursively. Keeping it flat, at root level, seems a better choice.
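
The recursive processing mentioned here might look roughly like the sketch below, which collects every Dataset reachable from `./` through nested hasPart links so that an exporter could list them all at root level. It is a minimal sketch under assumptions: the `@graph` is a list of dicts, and the name `nested_datasets` and the example graph are hypothetical.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def nested_datasets(graph, root_id="./"):
    """Return @ids of all Dataset nodes reachable from the root via hasPart, depth first."""
    nodes = {n["@id"]: n for n in graph}
    found = []
    seen = {root_id}

    def visit(node_id):
        for ref in as_list(nodes.get(node_id, {}).get("hasPart")):
            child = ref["@id"]
            if child in seen:          # already handled; also guards against cycles
                continue
            seen.add(child)
            if "Dataset" in as_list(nodes.get(child, {}).get("@type")):
                found.append(child)
            visit(child)

    visit(root_id)
    return found


# Hypothetical nested export with one Dataset inside another.
example_graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "./a/"}]},
    {"@id": "./a/", "@type": "Dataset",
     "hasPart": [{"@id": "./a/b/"}, {"@id": "./a/file.txt"}]},
    {"@id": "./a/b/", "@type": "Dataset"},
    {"@id": "./a/file.txt", "@type": "File"},
]

flat_ids = nested_datasets(example_graph)
# A flattening exporter could then write these ids into the hasPart of './'.
flat_has_part = [{"@id": i} for i in flat_ids]
```

With a flat export, an importer only needs to read the hasPart of `./` and can skip this traversal entirely, which is the simplification argued for above.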

The RO-Crate spec allows for nested Datasets, but as @nicobrandt said, having stricter rules will make our life easier. And having all the importable bits in the hasPart of ./ sounds like a good plan.

@salexan2001
Contributor

So, to summarize, what are the issues that need to be discussed now (for me preferably in one of the next meetings)?

  • Do we want to restrict the allowed folder-hierarchy inside the ELN file format? (If yes, we should have a look at existing approaches for standardization, like: https://www.nature.com/articles/s41467-024-52446-8)
  • How do we document the separation between mandatory (e.g. must be imported) and optional parts/files of the ELN file?

In my opinion, an important goal of the ELN file format should be to maximize compatibility/interoperability between the different ELNs, so I think having a strict and simple specification (while maintaining full compatibility with RO-Crate) is desirable.

Btw.: Are there already efforts to create something like standardized ELN file format libraries for different languages used by the ELNs? Or is there basically a new import/export implementation in each ELN?


6 participants