Hierarchical data structure in .eln or not #98

SteffenBrinckmann · 2024-11-19T12:06:48Z

(based on discussion around PR #95)
Should the .eln file allow for a hierarchical structure of folders and files or keep it flat? It seems most ELNs use a rather flat structure and do not allow arbitrarily nested structures in which directories or files can sit at the top level.

Hence there seem to be two alternatives:
A) The .eln has a rather flat structure of two levels (./ and the items). ELNs that have a deep hierarchy flatten it on export to .eln and deepen it again on import from .eln (a possible deepening algorithm is given below).

B) The .eln has an arbitrarily deep hierarchy. On import, ELNs that do not support deep hierarchies flatten the content of the .eln.

The deepening algorithm can be based on set operations. Let A be the set of ids at './' and B the set of all ids of all hasParts. The top-level items are A - B. From there, use the given hasPart information to construct the individual trees. Items of B that are not in A are supplementary.
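The deepening step described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the graph is the `@graph` list of an RO-Crate metadata file, `hasPart` entries are `{"@id": ...}` references, and B is collected from every node except the root (otherwise A - B would always be empty). The function names `part_ids` and `deepen` are made up for this example.

```python
from typing import Dict, List, Set, Tuple


def part_ids(node: dict) -> List[str]:
    """Return the @id values listed in a node's hasPart (empty if absent)."""
    parts = node.get("hasPart", [])
    if isinstance(parts, dict):  # tolerate a single reference instead of a list
        parts = [parts]
    return [p["@id"] for p in parts]


def deepen(graph: List[dict], root_id: str = "./") -> Tuple[Dict[str, dict], Set[str]]:
    """Rebuild nested trees from a flat .eln graph.

    Returns (trees, supplementary): trees maps each top-level id to a nested
    dict of its children; supplementary holds ids in B that are not in A.
    """
    nodes = {n["@id"]: n for n in graph}

    a = set(part_ids(nodes[root_id]))  # A: ids listed at './'
    b: Set[str] = set()                # B: ids in hasPart of all other nodes (assumption)
    for node_id, node in nodes.items():
        if node_id != root_id:
            b.update(part_ids(node))

    top_level = a - b
    supplementary = b - a

    def build(node_id: str, seen: frozenset) -> dict:
        if node_id in seen:  # guard against accidental cycles
            return {}
        children = part_ids(nodes.get(node_id, {}))
        return {c: build(c, seen | {node_id}) for c in children}

    trees = {t: build(t, frozenset()) for t in sorted(top_level)}
    return trees, supplementary
```

Calling `deepen(metadata["@graph"])` on a parsed `ro-crate-metadata.json` would then give the reconstructed hierarchy plus the ids that only appear in nested hasPart lists.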

SteffenBrinckmann · 2024-11-20T20:01:38Z

@NicolasCARPi @nicobrandt @FlorianRhiem Any opinion / preference?
[I would prefer a "decision" such that I know whether to change the export / import algorithm]

FlorianRhiem · 2024-11-21T07:54:22Z

So far, SampleDB doesn't follow either alternative, though it aligns roughly with A. The structure for a SampleDB export is roughly like this:

./ the root data entity, which has as parts
- ./users/1, users are always part of the root data entity and do not have parts themselves
- ./objects/1, importable objects are always part of the root data entity, and have as parts
  - ./objects/1/files.json, a JSON file containing more SampleDB-internal information on the files associated with ./objects/1
  - ./objects/1/files/0/example.txt, a file for ./objects/1
  - ./objects/1/comments.json, a JSON file containing more SampleDB-internal information on the comments associated with ./objects/1
  - ./objects/1/versions/0, a version of ./objects/1, with 0 being the first version, etc., which has as parts:
    - ./objects/1/versions/0/data.json, a JSON file containing the SampleDB-internal data for ./objects/1/versions/0
    - ./objects/1/versions/0/schema.json, a JSON file containing the SampleDB-internal schema for ./objects/1/versions/0

The key distinction from a deeply nested approach is that only the Dataset nodes which are parts of ./ should be imported, in this case ./objects/1. It then has parts for files, and an importer that can handle the additional information may use it, but those parts are not considered importable objects and should not be shown as standalone datasets.

I think it's fine to have nodes which represent directories and aren't really importable objects/Datasets, and to have nodes that are deeply nested but are really importable objects/Datasets, as long as there's a well-defined way to find all the Dataset nodes that should be considered as "importable". Sure, they may use mentions to reference other Datasets, but they should not be parts of a whole that would be incomplete on their own.
So far, I worked under the assumption that being part of ./ meant that a Dataset had that status. If that changes, we need an alternative. If that doesn't change, PASTA should include the nested Datasets that should be imported in ./.
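As a rough illustration of the assumption described above (a Dataset is importable exactly when it is a direct part of ./), an importer could select candidates like this. The helper names are hypothetical, and the sketch assumes the usual RO-Crate layout where `hasPart` holds `{"@id": ...}` references and `@type` is a string or a list of strings.

```python
from typing import List


def is_dataset(node: dict) -> bool:
    """True if the node's @type is, or contains, 'Dataset'."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Dataset" in types


def importable_datasets(graph: List[dict], root_id: str = "./") -> List[str]:
    """Return ids of Dataset nodes that are direct parts of the root entity."""
    nodes = {n["@id"]: n for n in graph}
    result = []
    for part in nodes[root_id].get("hasPart", []):
        node = nodes.get(part["@id"])
        # Non-Dataset parts (e.g. users) and nested Datasets are ignored here.
        if node is not None and is_dataset(node):
            result.append(part["@id"])
    return result
```

With this rule, nested Datasets such as ./objects/1/versions/0 are never returned, because they are only reachable through ./objects/1.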

One way would be to add a custom attribute, or to create a DataCatalog node listing the importable Dataset nodes. Though I think a way that would be consistent with the behavior of more general tools dealing with RO-Crates would be ideal, rather than a custom solution.

@salexan2001 Are you aware of how this is handled by RO-Crates more generally?

nicobrandt · 2024-11-21T08:38:07Z

The exported data structure in Kadi4Mat currently looks like the following:

<RO-Crate root>/
  |   ro-crate-metadata.json
  |   folder1/
  |    | file1
  |    | file2
  |    | ...
  |   folder2/
  |    | file1
  |    | file2
  |    | ...
  |   ...

Whether there are one or multiple "folders" depends on whether a single record (the basic data/metadata containers in Kadi4Mat) or a collection of multiple records is exported. This is also why we have two examples in this repo, one for each resource type. The general structure is the same though.

That being said, we also support collection hierarchies in Kadi4Mat. However, the export currently only goes one level deep. We haven't decided yet on how to deal with this in the future. Basically, we could either keep the hierarchy flat (maybe with some additional metadata if someone really wants to recreate the hierarchy), or actually make use of "sub-folders". In the latter case, intermediate folders (collections) would not contain any files though.

Regarding the import, we have so far focused on the flat structure shown above. Importing nested structures would in principle be possible for us, but probably not without limitations/some information loss, independent of whether we flatten everything.

All in all, I don't have strong opinions about this, as long as we can agree on something. In general, I suggest keeping our spec a bit more strict than the RO-Crate spec though to make our life a bit easier. I also suggest discussing this in the next meeting, rather than only a couple of people deciding right now :P

simontaurus · 2024-11-21T11:00:35Z

I don't think we have to / should make a decision here.
The RO-Crate dataset structure is by definition at least a directed acyclic graph or even a tree (it would make sense to restrict it to a tree), and it's up to the target system whether to build internally a single nested entity, multiple linked flat entities or anything in between. A conventional file system has to nest, while e.g. in OpenSemanticLab we can just replicate the graph.
A target application may choose to skip nesting path elements if they are just folders without additional metadata.
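One possible reading of "folders without additional metadata" is a Dataset node that carries nothing beyond `@id`, `@type` and `hasPart`. Under that assumption (which is mine, not part of any spec), and assuming the hasPart structure is a tree, a target application could skip such path elements like this. All names are hypothetical.

```python
from typing import Dict, List

# Keys that describe structure only; anything else counts as real metadata.
STRUCTURAL_KEYS = {"@id", "@type", "hasPart"}


def is_plain_folder(node: dict) -> bool:
    """True for a Dataset node with no keys besides the structural ones."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Dataset" in types and set(node) <= STRUCTURAL_KEYS


def effective_children(nodes: Dict[str, dict], node_id: str) -> List[str]:
    """List the parts of node_id, replacing plain folders by their contents.

    Assumes a tree (no cycles); ids missing from `nodes` are kept as leaves.
    """
    result = []
    for part in nodes.get(node_id, {}).get("hasPart", []):
        child = nodes.get(part["@id"])
        if child is not None and is_plain_folder(child):
            result.extend(effective_children(nodes, part["@id"]))
        else:
            result.append(part["@id"])
    return result
```

A system that replicates the graph as-is would simply not call this and keep every node.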

SteffenBrinckmann · 2024-11-22T07:40:04Z

I agree that we might want to discuss this during a meeting, along with #69.
I also agree to make the .eln-spec "more strict to make life easier". Also, the spec should be easy to understand, without quirky rules.
What I do not yet agree with is the concept of "importable". If content is important, it should be in the .eln; otherwise not. If we understand .eln as the intersection of all the functionality/features of all the ELNs, it means that not all functionality of an ELN can be saved in .eln. Along those lines, it makes sense to strip content when exporting into .eln, as PASTA would do if we decide on this issue.

NicolasCARPi · 2024-11-26T10:43:56Z

> PASTA should include the nested Datasets that should be imported in ./.

For me that's the important bit: things that must be imported must be mentioned in ./. In PASTA .eln there are Datasets with hasPart, which means we need to recursively process them. Keeping it flat, at root level, seems a better choice.

The RO-Crate spec allows for nested Datasets, but as @nicobrandt said, having stricter rules will make our life easier. And having all the importable bits in the hasPart of ./ sounds like a good plan.
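For an exporter that currently nests Datasets, listing everything importable in the hasPart of ./ could look roughly like the sketch below. It assumes that every Dataset reachable through hasPart is meant to be importable, which may not hold for every ELN, and the function name is invented for this example.

```python
from collections import deque
from typing import List


def root_level_dataset_ids(graph: List[dict], root_id: str = "./") -> List[str]:
    """Return the ids the root's hasPart should list after flattening.

    Keeps the existing root parts and appends every nested Dataset found by
    walking hasPart breadth-first. The input graph is not modified.
    """
    nodes = {n["@id"]: n for n in graph}

    def is_dataset(node: dict) -> bool:
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        return "Dataset" in types

    listed = [p["@id"] for p in nodes[root_id].get("hasPart", [])]
    seen = set(listed)
    queue = deque(listed)
    while queue:
        current = nodes.get(queue.popleft(), {})
        for part in current.get("hasPart", []):
            part_id = part["@id"]
            if part_id in seen:
                continue
            seen.add(part_id)
            queue.append(part_id)
            if is_dataset(nodes.get(part_id, {})):
                listed.append(part_id)  # hoist nested Dataset to root level
    return listed
```

The exporter would then write `[{"@id": i} for i in root_level_dataset_ids(graph)]` as the hasPart of ./, while each Dataset keeps its own hasPart so the hierarchy can still be recovered on import.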

salexan2001 · 2024-12-02T09:36:44Z

So, to summarize, what are the issues that need to be discussed now (for me preferably in one of the next meetings)?

- Do we want to restrict the allowed folder hierarchy inside the ELN file format? (If yes, we should have a look at existing approaches for standardization, like: https://www.nature.com/articles/s41467-024-52446-8)
- How do we document the separation between mandatory (e.g. must be imported) and optional parts/files of the ELN file?

In my opinion an important goal of the ELN file format should be to maximize compatibility/interoperability between the different ELNs, so I think having a strict and simple specification (while maintaining full compatibility with RO-Crate) is desirable.

Btw.: Are there already efforts to create something like standardized ELN file format libraries for different languages used by the ELNs? Or is there basically a new import/export implementation in each ELN?

This was referenced Nov 19, 2024
- Flattend graph and added other changes of consortium meeting nov. 2024 #95 (Open)
- Update tests #97 (Merged)
