-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Rewrite inheritance principle #946
[FIX] Rewrite inheritance principle #946
Conversation
pinging @yarikoptic and @VisLab as they both have expressed interest in this issue in the past |
src/02-common-principles.md
Outdated
Any metadata file (such as `.json`, `.bvec` or `.tsv`) MAY be defined at any | ||
directory level. For any given data file, any metadata file at that directory | ||
level or higher that does not include any entities absent from the name of the | ||
data file and possesses the same suffix are applicable to that data file. Such | ||
files are loaded from the top of the directory hierarchy downwards, such that | ||
values from the top level are inherited by all data files at lower levels to | ||
which it is applicable unless overridden by a value for the same key present | ||
in another metadata file at a lower level (though it is RECOMMENDED to minimise | ||
the extent of such overrides). There is no notion of "unsetting" a | ||
key/value pair. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any metadata file (such as `.json`, `.bvec` or `.tsv`) MAY be defined at any | |
directory level. For any given data file, any metadata file at that directory | |
level or higher that does not include any entities absent from the name of the | |
data file and possesses the same suffix are applicable to that data file. Such | |
files are loaded from the top of the directory hierarchy downwards, such that | |
values from the top level are inherited by all data files at lower levels to | |
which it is applicable unless overridden by a value for the same key present | |
in another metadata file at a lower level (though it is RECOMMENDED to minimise | |
the extent of such overrides). There is no notion of "unsetting" a | |
key/value pair. | |
- Any metadata file (such as `.json`, `.bvec` or `.tsv`) MAY be defined at any directory level. | |
- For a given data file, any metadata file at that directory level or higher | |
is applicable to that data file if: | |
- the metadata and the data filenames possess the same suffix, | |
- the metadata filename does not include any entity absent from the data filename. | |
- Such files are loaded from the top of the directory hierarchy downwards, | |
such that values from the top level are inherited by all data files | |
at lower levels to which it is applicable unless overridden | |
by a value for the same key present in another metadata file at a lower level | |
(though it is RECOMMENDED to minimise the extent of such overrides). | |
- There is no notion of "unsetting" a key/value pair. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an improvement. However, there is still some confusion because the notion of a 'key' makes sense for json but I don't think it makes sense for tsv files. I have no idea how to apply this to bvec files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd contemplated bullet-pointing this myself; or potentially even enumerating so that datasets violating specific requirements / software code can be cross-referenced more precisely?
@VisLab: Good catch. There's going to need to be a hard split between JSON and other metadata files. For .bvec
/ .bval
/ .tsv
, it I think needs to be simply selection of whichever applicable file is lowest in the directory tree; anything above it gets ignored, there's no reason to describe the contents of such as being "overwritten" by the content of files lower down. Only edge case that comes to mind would be .tsv
files that happen to contain exactly the same columns (including their titles), but even then it's hard to justify choosing between a concatenation and a replacement. Are there any other metadata types that are worth being familiar with for the sake of getting the language / logic of the inheritance principle right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea about using an ordered list and referring to those points in the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So am I understanding that for JSON the resolution is top-down, with top (closest to the root) having precedence.
For the others it is bottom up, with the bottom (farthest from the root) having precedence?
The notions of "lowest" and "highest" level in the tree isn't necessarily clear. Does lowest mean farthest from the root?
I honestly thought only JSON files got inherited in the sense that there could be multiple applicable ones for the same suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only case of "official" of TSV inheritance we have around is having a single task-foo_events.tsv
in the root of the directory in case all subjects / runs had the same design:
see from the BIDS examples
╰─⠠⠵ ls -l */*task*tsv
-rw-rw-r-- 1 remi remi 143 Nov 10 11:33 ds114/task-covertverbgeneration_events.tsv
-rw-rw-r-- 1 remi remi 280 Nov 10 11:33 ds114/task-fingerfootlips_events.tsv
-rw-rw-r-- 1 remi remi 143 Nov 10 11:33 ds114/task-overtverbgeneration_events.tsv
-rw-rw-r-- 1 remi remi 127 Nov 10 11:33 ds114/task-overtwordrepetition_events.tsv
-rw-rw-r-- 1 remi remi 1054 Nov 10 11:33 eeg_ds000117/task-facerecognition_channels.tsv
-rw-rw-r-- 1 remi remi 1807 Nov 10 08:42 synthetic/task-nback_events.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to expand the discussion of events inheritance. Inherited events.tsv files
are needed at several levels.
For example, because BIDS forces multi-modality data to be separated into separate directories, simultaneously recorded MEG/EEG or fMRI/EEG will have the same event files for each run.
In the following example from hed-examples, EEG and MEG were recorded simultaneously, but separated into separate modality directories.
ds_003654s:
task-FacePerception_events.json
sub-02:
sub-002_task-FacePerception_events.json
sub-002_task-FacePerception_run-1_events.tsv
eeg:
sub-002_task-FacePerception_run-1_eeg.set
meg:
sub-002_task-FacePerception_run-1_meg.fif
. . .
Right now we are assuming that multiple events.json
files can be applicable to a given events.tsv
files and that the keys are resolved from directory root down in the JSON files. I think this is the right behavior for JSON files, but it needs to be clarified in the specification.
We have only made a single events.tsv
applicable to any given recording, but that events.tsv
file could appear anywhere in the directory hierarchy.
Events are a special case, and I think it might be feasible/useful to allow multiple events files at different levels to apply to a given run. The events are the union of the rows/join of columns. This might be too complicated.
The issue also comes up with derivatives. Here the derivatives might have their own set of events. If the events from the raw data also apply do they need to always be copied into the events files? There was some discussion of this in the past, but there didn't seem to be a clear resolution or consensus on this when the discussion occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have a general question about inheritance. Is the matching of entity components and their names case sensitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The notions of "lowest" and "highest" level in the tree isn't necessarily clear. Does lowest mean farthest from the root?
I always have to cross-reference elsewhere in the spec to remind myself which is which. Does anyone know if there's a robust reference-able definition somewhere? Or does this need to be defined upfront in the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the matching of entity components and their names case sensitive?
I think the case sensitivity issues was solved in #858
Co-authored-by: Remi Gau <[email protected]>
- Change primary rules from dot points to enumerations, and reference rules by number where applicable. - Move from subsequent text into this list rules relating to improper placement of metadata files within the directory structure, and not permitting multiple applicable files at one level of the hierarchy. - Add "corollaries" section incorporating text relating to interpretation and consequences of the rules, separating them from subsequent examples.
As the inheritance principle makes reference to tabular files and JSON key-value dictionaries, but there are no such references in the opposite direction, these files should be defined prior to introduction of the Inheritance Principle.
Lots of changes (not expecting this to be a quick PR). I think that it is by necessity transitioning from the original text being along the lines of "here's how you should do this thing" to what is really required here for the sake of stringent enforcement and validation being "here's the rules", with the downstream consequences of such expressed in a more user-friendly way separate to the rules themselves (what's currently called "corollaries" here could conceivably be expanded quite a lot). Note that while @Remi-Gau's change proposal above was merged, I've explicitly unresolved the conversation as there is a lot of content there unrelated to that specific proposal that have not been resolved. Would suggest branching separate chains of thought into separate comment threads tied to the relevant lines; this one is invariably going to be messy, best to mitigate it where we can. |
Having a metadata file applicable to multiple data files can occur not only at one level of the directory hierarchy, but additionally to many data files at lower levels of the hierarchy.
I have reviewed the commits and the bullet points make things much clearer. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like it, well done. IMHO should be taken out of the draft! ;)
Left some minor comments
1. There MUST NOT be multiple metadata files applicable to a data file at one level | ||
of the directory hierarchy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: it is such a nice concise rule which helps to avoid ambiguity and the workaround in Example 3 is quite cute ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH -- I remembered what bothered me in my original #102 and outlined in Edit 1 there. Citing an example from there:
I placed myself into a corner with an example of having e.g.
sub-1_task-task1_run-1_bold.json and sub-1_task-task1_acq-X_run-1_bold.json
per subject (should be ok), and then trying to aggregate over them while retaining also _acq- if defined.
and there are many other entities/use-cases which would fall into similar situation. What I am thinking is to add a clarification as subitem here or a separate rule:
- A metadata file at the data files level of hierarchy MUST not be considered for inheritance if there is a matching in entities data file.
This would make sub-1_task-task1_run-1_bold.json
not a to be considered for providing metadata to enrich sub-1_task-task1_acq-X_run-1_bold.json
, and examples below would stay valid (we might want to add an example for this rule though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is such a nice concise rule which helps to avoid ambiguity and the workaround in Example 3 is quite cute ;)
Well, actually (from first post):
Full disclosure: as per #259, I would like to pursue the prospect of modifying the inheritance principle itself, not just the description of such; specifically removal of the preclusion of having multiple applicable JSONs at one level of the hierarchy.
😬
The proposed changes here do at least I think demonstrate what would need to change in order to facilitate such. Any text I propose for such will need to be very clear for both users and software creators exactly what is permitted vs. not permitted and in what order JSONs should be loaded. But as I said in the original post, this is not a change in descriptive language but a change in prescriptive permissible structure, and so IMO necessitates either a minor or major version change rather than just a patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Given the currently loose formulation of the principle, depending on the changes to it, I think it might be feasible to get it done within minor revision. So it would be great to see this PR be finalized/merged soon.
Resolves bids-standard#946 with bids-standard#962. Co-authored-by: Yaroslav Halchenko <[email protected]>
8549e00
to
e91cb98
Compare
e91cb98
to
24e6489
Compare
24e6489
to
01aa790
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wonderful work, thanks a lot! I left a few comments
src/02-common-principles.md
Outdated
When reading image `sub-01/func/sub-01_task-rest_acq-default_bold.nii.gz`, only | ||
metadata file `task-rest_bold.json` is read; file | ||
`sub-01/func/sub-01_task-rest_acq-longtr_bold.json` is inapplicable as it contains | ||
entity "`acq-longtr`" that is absent from the image path (rule 2.3). When reading image |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rule 2.c, see my comment above.
I see that below this would be "5.b.ii" instead of 5.2.2, so maybe we need to discuss my suggestion - as 5.b.ii is kind of ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised by 7bc7ad5, but not yet resolving in case the enumeration formatting needs to be discussed further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it again, I don't find it ugly anymore :-) Thanks for making the change.
Co-authored-by: Stefan Appelhoff <[email protected]>
8e808a1
to
7bc7ad5
Compare
The rule regarding the fact that it is not possible to "unset" a key-value pair from a JSON file from higher in the filesystem hierarchy is here moved to the "corollaries" section. This is because this behaviour is a natural consequence of loading consecutive JSON files using a simple merge operation, and the absence of an equivalent to Python' "None" in the JSON specification.
Invested parties please also see Lestropie#3, which proposes an alteration to the contents of this PR. |
@Lestropie is there anything you wanted to discuss or get into this PR? Or should we do a final review? |
@sappelhoff I'm done tinkering, content as it currently stands. Will have a go at #259 in a separate PR as discussed. " |
That issue was solved in
awesome, let's try to get a final approval by a bunch of people and merge this! |
There is a problem with the bullet point numbering when I do the view file on GitHub. The first set of bullet points uses 2 (with subcategories i, ii, and iii), but the text refers to 2 (a, b, c). Similarly in the corollaries, rule three refers to "per rule 5.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Two minor issues.
|
||
1. It is permissible for a single metadata file to be applicable to multiple data | ||
files at that level of the hierarchy or below. Where such metadata content is consistent | ||
across multiple data files, it is RECOMMENDED to store metadata in this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RECOMMENDED
This is a best practice written in the language of a validatable rule. Some prefer this approach, others prefer to have their sidecars per-data-file. I think we should not make this recommendation, as validation will be tricky and annoying to users who would prefer the alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with taking out the recommended here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a best practice written in the language of a validatable rule.
-
This seems to extend beyond the RFC2119 definition. Is this an established BIDS-specific interpretation that is documented somewhere?
-
Would the same criticism not also apply to 5.2?
For JSON files, key-values are loaded from files from the top of the directory hierarchy downwards, such that key-values from the top level are inherited by all data files at lower levels to which it is applicable unless overridden by a value for the same key present in another metadata file at a lower level (though it is RECOMMENDED to minimise the extent of such overrides).
Issuing a validator warning regarding the presence of individual key-value overrides may be unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an established BIDS-specific interpretation that is documented somewhere?
AFAIK this is not documented anywhere, but among maintainers we have often discussed that RECOMMENDED corresponds to a "warning" level in the validator, and MUST corresponds to an "error" level in the validator. The main point of this consideration is to not design a specification that we cannot reasonably validate.
Having that said, there are many cases in the spec where we RECOMMEND or require (MUST), where validation is currently not happening and might be difficult to implement --- as you correctly point out in your second point.
Personally I am fine with having some recommendations that we cannot "warn" about (see especially 5.b here, but maybe also the point raised by Chris.).
I think the most important correspondence here is going to be not with a GitHub rendering, but with the website / PDF renderings. I couldn't find an existing example; perhaps if someone could duplicate this branch on the main repo, then set up ReadTheDocs to render that branch, we could check what the current behaviour is there? |
Co-authored-by: Stefan Appelhoff <[email protected]>
Co-authored-by: Stefan Appelhoff <[email protected]>
Thanks for leading this effort @Lestropie! I think it's a huge improvement in terms of clarity. |
Relates to discussion in bids-standard#259, and was also raised in bids-standard#946. Instead of enforcing a unique order of metadata file inheritance via filesystem hierarchy only, the logic behind the inheritance principle is generalised to permit multiple files per directory whilst still ensuring that the order of metadata file loading is still unique.
Draft PR in search of feedback.
This change relates to #102, #259, #482; am escalating due to re-invigoration of BEP016, in particular bids-standard/bids-bep016#32.
IMO, there are multiple issues with the current description of the inheritance principle:
Given the breadth of these issues I think a re-write is warranted. I am completely open to proposals for modifications or proposal of outright alternatives, but the ball needs to start rolling somewhere.
Full disclosure: as per #259, I would like to pursue the prospect of modifying the inheritance principle itself, not just the description of such; specifically removal of the preclusion of having multiple applicable JSONs at one level of the hierarchy. However here I am focusing solely on modifying the text describing the principle as it currently stands, which would be applicable to a patch update. Changing the principle itself should IMO be deferred to a minor or potentially major update; in addition, the isolation of that specific aspect and modification thereof should be somewhat facilitated by the clarification of the current state of the principle following these changes.