Add sample entity and samples.tsv file #779

Closed
mariehbourget opened this issue Apr 16, 2021 · 29 comments

Comments

@mariehbourget
Collaborator

Context and motivation

Hi BIDS community!

As part of the development of the Microscopy BEP (BEP031), we want to add a new sample entity to BIDS. This sample entity was introduced in order to distinguish different tissue samples from the same subject.

The sample entity may also be used by the Animal Ephys BEP (BEP032 @SylvainTakerkart) and could benefit other modalities as well.

This issue aims to start a discussion about the details of the sample entity between the 2 BEP groups and with the BIDS community. It will also facilitate the breaking down of BEPs in smaller modules by adding the sample entity as a separate PR.

Definition of the sample entity

To ensure compatibility with other BIDS modalities, the subject entity should correspond to the participant (e.g. a human, a mouse, etc.). To identify multiple tissue samples from the same subject, we define the sample entity in BEP031 as:

A tissue sample, volume or slice pertaining to a subject.

It is positioned after the optional session entity in the filename:

sub-<label>[_ses-<label>]_sample-<label>_<modality_suffix>.<ext>
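As a rough illustration of this naming pattern, the sketch below parses the entities out of a filename with a regular expression. The pattern, the function name `parse_sample_filename`, and the assumption that labels are purely alphanumeric are illustrative and not part of the BEP text.

```python
import re

# Hypothetical pattern for the proposed naming scheme. Labels are assumed
# to be alphanumeric; the ses- entity is optional, sample- is required.
SAMPLE_FILENAME = re.compile(
    r"^sub-(?P<sub>[a-zA-Z0-9]+)"
    r"(?:_ses-(?P<ses>[a-zA-Z0-9]+))?"
    r"_sample-(?P<sample>[a-zA-Z0-9]+)"
    r"_(?P<suffix>[a-zA-Z0-9]+)"
    r"\.(?P<ext>[a-zA-Z0-9.]+)$"
)


def parse_sample_filename(name):
    """Return the entities found in a filename, or None if it does not match."""
    match = SAMPLE_FILENAME.match(name)
    return match.groupdict() if match else None


print(parse_sample_filename("sub-rat1_sample-data1_SEM.png"))
print(parse_sample_filename("sub-rat1_SEM.png"))  # no sample entity
```

A filename without the sample entity is rejected here only because this sketch treats `sample` as mandatory for this kind of file.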

samples.tsv file

In BEP031, a samples.tsv file was added at the root of the dataset along with participants.tsv.

The samples.tsv file would have 2 required columns:

  • sample_id: corresponding to sample-<label> of the filename
  • participant_id: corresponding to sub-<label> of the filename

Another column sample_type was also suggested as required:

We should also discuss if (and how) we want to encode an additional identifier when a sample is derived from another sample (e.g., a slice is derived from a block of tissue).

participants.tsv file

As part of the subject vs. sample definitions, we would also like to add 2 columns to the participants.tsv file:

  1. species: string corresponding to the Binomial species name from NCBI Taxonomy, required when different from “Homo sapiens”
    We think species should be in participants.tsv and not samples.tsv as it is an attribute of the subject and not the sample.

  2. pathology: required when different from “Healthy”
    In that case, pathology could be in either participants.tsv or samples.tsv as appropriate (e.g. healthy and non-healthy biopsy samples from the same subject).

Examples

File hierarchy and naming:

├── dataset_description.json
├── participants.json
├── participants.tsv
├── samples.json
├── samples.tsv
├── sub-rat1
│   └── microscopy
│       ├── sub-rat1_sample-data1_SEM.json
│       └── sub-rat1_sample-data1_SEM.png
├── sub-rat2
│   └── microscopy
│       ├── sub-rat2_sample-data5_SEM.json
│       └── sub-rat2_sample-data5_SEM.png
├── sub-rat3
│   └── microscopy
│       ├── sub-rat3_sample-data10_SEM.json
│       ├── sub-rat3_sample-data10_SEM.png
│       ├── sub-rat3_sample-data11_SEM.json
│       ├── sub-rat3_sample-data11_SEM.png
│       ├── sub-rat3_sample-data9_SEM.json
│       └── sub-rat3_sample-data9_SEM.png

participants.tsv:

participant_id  species
sub-rat1        Rattus norvegicus
sub-rat2        Rattus norvegicus
sub-rat3        Rattus norvegicus

samples.tsv:

sample_id      participant_id  sample_type
sample-data1   sub-rat1        tissue
sample-data5   sub-rat2        tissue
sample-data9   sub-rat3        tissue
sample-data10  sub-rat3        tissue
sample-data11  sub-rat3        tissue
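
One way to check that the example is self-consistent is to compare the subject/sample pairs found in the filenames with the rows of samples.tsv. This is only a sketch: the helper names and the filename pattern are assumptions, and the TSV content is the example above written with explicit tabs.

```python
import csv
import io
import re

# Contents of the example samples.tsv (tab-separated).
SAMPLES_TSV = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-data1\tsub-rat1\ttissue\n"
    "sample-data5\tsub-rat2\ttissue\n"
    "sample-data9\tsub-rat3\ttissue\n"
    "sample-data10\tsub-rat3\ttissue\n"
    "sample-data11\tsub-rat3\ttissue\n"
)

# Image files from the example hierarchy.
FILENAMES = [
    "sub-rat1_sample-data1_SEM.png",
    "sub-rat2_sample-data5_SEM.png",
    "sub-rat3_sample-data10_SEM.png",
    "sub-rat3_sample-data11_SEM.png",
    "sub-rat3_sample-data9_SEM.png",
]

# Assumed pattern: captures the sub- and sample- entities, skipping an
# optional ses- entity in between.
ENTITIES = re.compile(
    r"^(sub-[a-zA-Z0-9]+)(?:_ses-[a-zA-Z0-9]+)?_(sample-[a-zA-Z0-9]+)_"
)


def declared_pairs(tsv_text):
    """Set of (participant_id, sample_id) pairs listed in samples.tsv."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["participant_id"], row["sample_id"]) for row in reader}


def pairs_in_filenames(filenames):
    """Set of (participant_id, sample_id) pairs that appear in filenames."""
    pairs = set()
    for name in filenames:
        match = ENTITIES.match(name)
        if match:
            pairs.add((match.group(1), match.group(2)))
    return pairs


def undeclared_pairs(filenames, tsv_text):
    """Pairs used in filenames that have no row in samples.tsv."""
    return sorted(pairs_in_filenames(filenames) - declared_pairs(tsv_text))


print(undeclared_pairs(FILENAMES, SAMPLES_TSV))
```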
@SylvainTakerkart

thanks @mariehbourget for this! we clearly need something of this type for BEP032 but haven't had a look at it precisely yet... we will take care of this in the next few weeks with @JuliaSprenger and others...

@effigies
Collaborator

  • Species: 👍
  • Pathology: This might not translate well to human subject populations, where I think "diagnosis" or similar would be considered more neutral. Is there a common term we can settle on that wouldn't be too awkward for either group, or will there need to be separate terms for animal tissues and human subjects?
  • Sample entity: As this is written, it seems to require that sample is unique throughout the dataset, so that if sample-1 exists in sub-01, there cannot be a sub-02_sample-1. This is an unusual requirement compared to existing entities. Could you elaborate on the motivation?

@dorahermes
Member

How does pathology/diagnosis overlap with phenotype?

@satra
Collaborator

satra commented Apr 19, 2021

regarding pathology, this should be annotated at the sample level, in that one may have a tumor sample in one case versus a non-tumor location in the same participant. thus pathology goes with sample rather than participant. indeed for human and potentially other species there should be a Dx column for diagnosis. note that this could also vary by date and could therefore be in the sessions.tsv rather than participants.tsv. more generally there should be a conversation about inheritance of properties from participants to sample, when all samples share those properties.

also the same samples could be used in multiple sessions. hence having a mechanism to consolidate that would be necessary, and hence samples are similar to participants in that sense. in many cases, samples, rather than participants are often the primary entity in studies. and in keeping with bids it was discussed in the subgroups that a dataset should connect a sample to a participant even if the participant details are unknown.

btw, samples could also be used in human MR scans (e.g., left/right hemi ex vivo, brainstem, etc.), and hence samples should be considered as a generic concept in bids, rather than specialized just for microscopy/ephys.

@satra
Collaborator

satra commented Apr 19, 2021

How does pathology/diagnosis overlap with phenotype?

@dorahermes - one example use case could be something like a diagnosis column that says Major Depressive Disorder (or ICD10 code), but the diagnosis itself could have been attached to a phenotype file(s) (e.g., KSADS, HAMD, etc.) or simply a clinical evaluation which may not have a phenotypic assessment in many cases.

@mariehbourget
Collaborator Author

  • Sample entity: As this is written, it seems to require that sample is unique throughout the dataset, so that if sample-1 exists in sub-01, there cannot be a sub-02_sample-1. This is an unusual requirement compared to existing entities. Could you elaborate on the motivation?

The intention is not to require "unique" sample_id across a dataset. We think people should be able to have the same sample_id for two different subjects as you are suggesting. In samples.tsv, that would give something like this:

sample_id  participant_id  sample_type
sample-1   sub-1           tissue
sample-1   sub-2           tissue
sample-1   sub-3           tissue
sample-2   sub-3           tissue
sample-3   sub-3           tissue

So the "unique" identifier is the combination of sample_id and participant_id, and not sample_id alone.
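
To make the composite-key idea concrete, the sketch below counts duplicate keys in the table above, first using `sample_id` alone and then using the (`sample_id`, `participant_id`) pair. The function name `duplicate_keys` is hypothetical; the rows are the ones from the example.

```python
from collections import Counter

# Rows of the example samples.tsv above.
ROWS = [
    {"sample_id": "sample-1", "participant_id": "sub-1", "sample_type": "tissue"},
    {"sample_id": "sample-1", "participant_id": "sub-2", "sample_type": "tissue"},
    {"sample_id": "sample-1", "participant_id": "sub-3", "sample_type": "tissue"},
    {"sample_id": "sample-2", "participant_id": "sub-3", "sample_type": "tissue"},
    {"sample_id": "sample-3", "participant_id": "sub-3", "sample_type": "tissue"},
]


def duplicate_keys(rows, columns):
    """Return the key tuples (built from `columns`) that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# sample_id alone is ambiguous in this table...
print(duplicate_keys(ROWS, ["sample_id"]))
# ...but the (sample_id, participant_id) pair identifies each row.
print(duplicate_keys(ROWS, ["sample_id", "participant_id"]))
```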

@SylvainTakerkart

So the "unique" identifier is the combination of sample_id and participant_id, and not sample_id alone.

this sounds ok to me! we should just give an example that is a bit more telling than just "sample-1", "sample-2" to be immediately understandable just by looking at it... question: what do experimenters use as user-friendly ids for their samples?

@SylvainTakerkart

We should also discuss if (and how) we want to encode an additional identifier when a sample is derived from another sample (e.g., a slice is derived from a block of tissue).

we had a discussion in our last BEP32 meeting about the possibility of adding several entities ('sample', but also 'slice' and 'tissue')... I don't want to deviate from the goal of this thread, but maybe we should have this discussion globally here? I mean, asking ourselves how many entities should be added and which ones? or whether adding just the 'sample' entity and dealing with everything else through the 'sample_type' can cover all the targeted use cases? with this latter solution, indeed, the quoted question (i.e. "how do we encode the fact that a slice is derived from a block of tissue") should be addressed!

@Remi-Gau
Collaborator

small detail: although it is just an example / suggestion, the current specification mentions "group" as one of the columns in participants.tsv, so if "pathology/diagnosis" finds its way into participants.tsv, this example might need to be amended or clarified, otherwise this could lead to some confusion.

@effigies
Collaborator

If sample labels can be reused across subjects, I think we can do the following:

  1. Drop the participant_id column.
  2. Follow the inheritance principle.

If the sample labels are the same across subjects, a global samples.tsv would provide the information needed. If they vary across subjects, then a set of sub-<label>/sub-<label>_samples.tsv files can be created.

regarding pathology, this should be annotated at the sample level in that one may have a tumor sample in one case versus on non-tumor location in the same participant.

I think @satra's suggestion here is good, and that making pathology a column in samples.tsv would resolve the concerns I had above.

Indeed for human and potentially other species there should be a Dx column for diagnosis. note that this could also vary by date and could therefore be in the sessions.tsv rather than participants.tsv.

Yes, I think diagnosis as a session-level variable makes sense. As an aside, I don't think we have a principle that says how to do session-level variables for single-session studies that omit the ses-<label>/ directory, but that would be worth clarifying if we add variables that are useful in single-session contexts.

@mariehbourget
Collaborator Author

@SylvainTakerkart

we had discussion in our last BEP32 meeting about the possibility of adding several entities ('sample', but also 'slice' and 'tissue')...

We had similar discussions in BEP031 for other additional entities. The way we have handled this so far is based on what entities are needed to distinguish between 2 different files of the same subject. For example, metadata like “sample_type” (primary cell, tissue, etc.) is a unique attribute of the sample itself and would not change for a given subject_sample pair. In those cases, we think the information would be best encoded in metadata and not in the filename.

how do we encode the fact that a slice is derived from a block of tissue

I would suggest adding a derived_from column in samples.tsv to cover this. Ex: sample-X from sub-1 is a block of tissue that is imaged. Then sample-X is cut into slices named sample-x1, sample-x2, sample-x3 by the experimenter, and each slice is imaged. The link between the samples could be in samples.tsv as:

sample_id  participant_id  sample_type  derived_from
sample-X   sub-1           tissue       n/a
sample-x1  sub-1           tissue       sample-X
sample-x2  sub-1           tissue       sample-X
sample-x3  sub-1           tissue       sample-X
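
As a sketch of how a derived_from column could be consumed, the code below walks from a sample back to its root sample. It assumes that derived_from always points to a sample of the same participant and that "n/a" (represented as `None`) marks a sample with no parent; both are assumptions for illustration, as is the function name `ancestry`.

```python
# (participant_id, sample_id) -> derived_from, taken from the table above.
# "n/a" is represented as None.
SAMPLES = {
    ("sub-1", "sample-X"): None,
    ("sub-1", "sample-x1"): "sample-X",
    ("sub-1", "sample-x2"): "sample-X",
    ("sub-1", "sample-x3"): "sample-X",
}


def ancestry(participant_id, sample_id, samples):
    """Return the chain of sample ids from `sample_id` back to its root."""
    chain = []
    seen = set()
    current = sample_id
    while current is not None:
        if current in seen:
            # A sample cannot be derived from itself, directly or indirectly.
            raise ValueError(f"cycle in derived_from at {current}")
        seen.add(current)
        chain.append(current)
        current = samples[(participant_id, current)]
    return chain


print(ancestry("sub-1", "sample-x2", SAMPLES))
```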

@effigies

If sample labels can be reused across subjects, I think we can do the following:

  1. Drop the participant_id column.
  2. Follow the inheritance principle.

If the sample labels are the same across subjects, a global samples.tsv would provide the information needed.

I’m not sure I understand you on this.
Ex: 2 subjects (sub-1 and sub-2) have a sample named sample-1. However, the metadata of the sample-1 from sub-1 is not necessarily the same as for the sample-1 from sub-2. I don’t understand the utility of a global file without the participant_id column, as it would not make the distinction between the two.

@jgrethe
Contributor

jgrethe commented Apr 21, 2021

@mariehbourget in the SPARC Dataset Structure we also include a "derived_from" (i.e. wasDerivedFromSample) in the samples metadata file: https://docs.google.com/presentation/d/1EQPn1FmANpPsFt3CguU-JOQVMMlJsNXluQAK_gb2qVg/edit#slide=id.p9

@effigies
Collaborator

@mariehbourget

Ex: 2 subjects (sub-1 and sub-2) have a sample named sample-1. However, the metadata of the sample-1 from sub-1 is not necessarily the same as for the sample-1 from sub-2. I don’t understand the utility of a global file without the participant_id column, as it would not make the distinction between the two.

If the metadata for sample-1 is the same across subjects, it can be placed in a global file. If it differs, it can be placed in sub-1/sub-1_samples.tsv and sub-2/sub-2_samples.tsv.
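
A minimal sketch of this lookup, assuming that a row in a subject-level file simply replaces the row with the same sample label in the global file. The in-memory dictionaries stand in for the TSV files, and the metadata values and the function name `lookup_sample` are made up for illustration.

```python
# Stand-in for a global samples.tsv: sample label -> metadata.
GLOBAL_SAMPLES = {
    "sample-1": {"sample_type": "tissue"},
}

# Stand-in for sub-<label>/sub-<label>_samples.tsv files:
# participant -> sample label -> metadata.
SUBJECT_SAMPLES = {
    "sub-2": {"sample-1": {"sample_type": "primary cell"}},
}


def lookup_sample(participant_id, sample_id):
    """Prefer the subject-level row; fall back to the global row."""
    subject_rows = SUBJECT_SAMPLES.get(participant_id, {})
    if sample_id in subject_rows:
        return subject_rows[sample_id]
    return GLOBAL_SAMPLES.get(sample_id)


print(lookup_sample("sub-1", "sample-1"))  # global row applies
print(lookup_sample("sub-2", "sample-1"))  # subject-level row applies
```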

@SylvainTakerkart

I think it'd be great to hear from @tgbugs here... if we manage to handle all this consistently across BEP31, BEP32 and SPARC, that'd be fantastic to facilitate future inter-operability... (as was just said in the BEP31 meeting ;) )

@effigies
Collaborator

From discussions at the meeting, I think the global samples.tsv might be compelling for this use case. My concerns are primarily aesthetic, preferring the file location to match the objects being described, but if doing it that way would require everybody to reconstruct the global table in software, it's not worth forcing.

@tgbugs

tgbugs commented Apr 23, 2021

Here is my write-up with an overview of the problem space, a potential model, and a review of the trade-offs that I see for BIDS based on my experience implementing and maintaining the SDS and its validation pipelines. I'm also dropping this in INCF/neuroscience-data-structure#9.

https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org

If you have targeted questions or comments you can leave them on this commit.
SciCrunch/sparc-curation@c5968b9

@tgbugs

tgbugs commented Apr 23, 2021

@effigies your concerns about forcing the reconstruction of the global table are well founded and I discuss the trade-offs in detail.

@jcohenadad
Contributor

Thank you everyone for your comments, suggestions and feedback!

@tgbugs thank you very much for your insightful comment in https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org. I am responding here so that the discussion stays centralized within a single issue thread (otherwise it is difficult to keep a clear history of the discussion).

A few considerations following previous discussions and comments:

  1. Regarding the subject entity: We are strongly in favor of keeping the current BIDS definition of subject i.e. a person or animal participating in the study. This is important to ensure compatibility between BIDS modalities (ex: a study with both microscopy and MRI where the subject must refer to the same organism)

  2. Regarding “collective participants” such as “populations” or “pool”: This may go beyond the scope of this discussion. Moreover, based on our multiple meetings with the different groups who have given us feedback and exposed typical use case scenarios, this scenario currently falls in the 20% of the 80/20 BIDS principle. It seems reasonable to think that the label of the sub-<label> or sample-<label> could be used to describe this particular case (e.g. sub-pool01)

  3. Regarding different “experimental group” (e.g. control and treatment): group is already mentioned as an example field for this purpose in participants.tsv in the current stable version of the BIDS standard (1.6.0).

  4. Regarding the sample entity: Our proposition adds a single new entity sample to the file name to describe any sample_type (tissue, primary cell, etc). The sample_type is described in samples.tsv as it is an attribute of the sample and not necessary to distinguish between two files of the same sample. The advantage is to have a simple structure where each file has a unique identifier (e.g.: sub-1_sample-1). Because this prefix is in the file name, there would be no ambiguity between files if 2 subjects have the same sample numbers (e.g. sub-1_sample-1 and sub-2_sample-1).

  5. Regarding file system: All files from the same subject remain in the sub-XX folder rather than in additional nested folders. This structure avoids the added complexity and pitfalls of a nested folder structure mentioned by @tgbugs.

  6. Regarding metadata files: The proposed solution with the sample entity requires only one additional file (samples.tsv) at the root of the dataset. The column participant_id is common to both participants.tsv and samples.tsv to join the tables. It seems reasonable to use two different files to distinguish between attributes of the subject and attributes of the sample.
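
As an illustration of point 6, the two tables can be joined on participant_id so that each sample row also carries the attributes of its subject. This is a sketch only: the helper names are hypothetical and the TSV content is a shortened version of the example at the top of the thread.

```python
import csv
import io

PARTICIPANTS_TSV = (
    "participant_id\tspecies\n"
    "sub-rat1\tRattus norvegicus\n"
    "sub-rat3\tRattus norvegicus\n"
)

SAMPLES_TSV = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-data1\tsub-rat1\ttissue\n"
    "sample-data9\tsub-rat3\ttissue\n"
    "sample-data10\tsub-rat3\ttissue\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_participant(samples, participants):
    """Attach each subject's attributes to the rows of its samples."""
    by_id = {row["participant_id"]: row for row in participants}
    # Sample columns are written last so they win on any name clash.
    return [{**by_id.get(s["participant_id"], {}), **s} for s in samples]


joined = join_on_participant(read_tsv(SAMPLES_TSV), read_tsv(PARTICIPANTS_TSV))
for row in joined:
    print(row["participant_id"], row["sample_id"], row["species"])
```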

In short, we suggest that:

  1. A specific entity in the filename should only be used when there is a need to distinguish two files from the same subject and sample (e.g. for microscopy: session, stain, chunk, run)
  2. participants.tsv should be reserved for subject attributes (e.g. age, sex, species, diagnosis)
  3. samples.tsv should be reserved for sample attributes (e.g. sample_type, derived_from, pathology)

@tgbugs

tgbugs commented May 3, 2021

Hi @jcohenadad, thanks for taking a look. Here are my thoughts.

  1. With respect to compatibility between modalities, the only thing
    that matters in that case is the distinct identity of the participant,
    not distinct identifier type.

    The underlying conceptual types for the entities referred to by the
    identifiers remain distinct (organismal subjects are not biological
    samples). The type of the identifier in how it defines a namespace
    for uniquely identifying distinct individuals is different from that
    conceptual type. I am suggesting to extend the sub- identifier type
    to be used to name anything in a BIDS dataset that has data about it.
    This is consistent with how sub- is used in BIDS.

    The underlying conceptual type of course must be retained, it couldn't
    be otherwise, the key is to deconflate the conceptual type from the
    identifier type. This means making the conceptual participant type
    explicit in the schema rather than implicit in the identifier type.

    As such, my suggestion does not prevent the ability to distinguish
    metadata for participants subjected to different modalities, because
    it is about the type of the identifier, not the identity of the
    identifier. If I have sub-1 that was subjected to both microscopy and
    MRI, then both have the same identifier because there was only one
    participant that had measurements made on it [cue the Far Side "It's a
    mammoth" cartoon]. If there was a sample derived from sub-1 that was
    subjected to microscopy then I would simply call it sub-2. The
    identity of the organismal subject and the sample subject are thus
    differentiated, without adding complexity to the model by forcing them
    to have different identifier types.

    I can understand the desire to force the conceptual types (e.g. of
    subject vs sample) onto the identifiers, but this doesn't actually buy
    anything and only increases complexity, because sub- can already
    distinguish sub-1 from sub-2 in the modality use case. Yes, we can
    also distinguish sub-1 from sam-1, but why add that complexity?

  2. The suggestion to use sub-pool01 will not work unless the meaning
    of sub- is extended in the way I propose, because the metadata
    requirements for pools cannot be enforced correctly unless there is a
    way to distinguish between collective and atomic participants
    independent of their identifier type prefix.

    Even if this is a 20% use case, it is one that must be considered when
    designing the 80% case because of the fundamental differences in what
    can be required in the metadata of atomic vs collective entities.

    If BIDS cannot distinguish between those cases, then it will become an
    issue down the line, because there will be datasets e.g. in SPARC that
    we will not be able to convert to BIDS in a way where we can correctly
    enforce the structure of the metadata. For consortia looking to adopt
    a standard that they can use across all datasets, this will cause BIDS
    to continue to be unattractive.

  3. Yep. That section was included for completeness and for other
    audiences as BIDS already does this.

  4. My suggestion is that rather than adding this complexity to only
    samples, that it be added to all participants. Eventually it will be
    needed for everything, and if it is only applied to samples then there
    will be duplicated effort and duplicate schemas down the line.

    Furthermore and more importantly, I strongly warn against allowing
    sample identifiers to be reused across subjects. We just made a change
    to prevent this in the SPARC data structure because of the numerous
    implementation, usability, and ambiguity issues that it causes.

    Sample IDs should at the very least be unique per dataset. Speaking
    from experience, allowing non-unique sample ids that must be composed
    with a subject identifier to form a primary key is a bad idea. I'm
    happy to elaborate on the years of headaches that it has caused.

  5. I should note that nesting files with sam-x in their name
    encounters the exact same issue as nesting folders and thus the
    problem remains.

    In theory this could be mitigated to some extent by forcing users to
    always include the subject id from which a sample was derived but as I
    mention above, creating composite primary keys from subject sample
    pairs is not a good idea. The problems that it causes down the line
    are simply not worth the trade-off of being able to detect that a
    sample file has been put in the wrong subject folder. Further, what
    happens in cases with multiply derived samples? sub-1-sam-1-sam-1?

    There are many other lurking issues here and I suggest avoiding them
    entirely.

  6. It is reasonable to use two different files to represent two
    different schema. However, the problem is that there are going to be
    more than two files in the future if the principle is "a new file for
    every new schema and/or every new participant type."

    Having a sparse tabular schema is also a reasonable way to solve this
    problem, which is significantly more attractive because BIDS accepts
    JSON as a metadata format, where the sparseness is not an issue. It
    also has the benefit of only requiring the modification of the schema
    to add a new set of fields for a new conceptual participant type,
    rather than the addition of a whole new file. Some back of the napkin
    math suggests that creating a new table per participant type will wind
    up with BIDS eventually having well over a dozen different files, one
    for each of the participant types that I enumerate.

    While adding additional files to avoid sparseness may seem reasonable
    if there is only a single new table, it does not seem reasonable if
    there will eventually be multiple new files and tables.

    In a sense splitting samples.tsv into its own file is trying to solve
    a user interface problem in the data model. I suggest that BIDS not
    try to solve the user interface problem here and avoid multiplying
    specialized metadata files.

With regard to the suggestions.

  1. This is reasonable, with the note that I strongly discourage
    allowing sample identifiers to be reused across subjects.
  2. 2 and 3 combined will lead to multiplication of specialized top
    level metadata files over time. It would be better to have a sparse
    schema that is only in participants.tsv.

Also re: INCF/neuroscience-data-structure#9

@jcohenadad
Contributor

Hi @tgbugs, thank you for the clarifications.

This discussion is touching on some of the core decisions made by the BIDS community. It would be great if some of the BIDS maintainers/steering could chip in as well @effigies @robertoostenveld.

I am suggesting to extend the sub- identifier type to be used to name anything in a BIDS dataset that has data about it. This is consistent with how sub- is used in BIDS.

If there was a sample derived from sub-1 that was subjected to microscopy then I would simply call it sub-2. The identity of the organismal subject and the sample subject are thus differentiated, without adding complexity to the model

We were advised by the BIDS steering group (@robertoostenveld) to not extend the definition of the subject entity. In that regard, the subject definition of "a person or animal participating in the study" seems important to the BIDS community to preserve consistency across modalities.

We’ve tried different configurations in the early development of the microscopy BEP and we agree that adding many different entities to describe different use cases adds undue complexity to the model. Therefore, we proposed the sample entity which would be used for the different specimen types you mentioned. Our idea is not to add an entity to every possible case but to use sample as the entity for different sample_type such as “whole organ”, “tissue”, “cells”, etc.

The advantages are that it retains the definition of subject without adding multiple layers of complexity to the scheme, and covers all atomic participant types.

The suggestion to use sub-pool01 will not work unless the meaning of sub- is extended in the way I propose, because the metadata requirements for pools cannot be enforced correctly unless there is a way to distinguish between collective and atomic participants independent of their identifier type prefix. Even if this is a 20% use case, it is one that must be considered when designing the 80% case because of the fundamental differences in what can be required in the metadata of atomic vs collective entities.

As far as we know, the current BIDS specification does not cover explicitly “collective” participants, hence the suggestion to name the subject with the pool name in the absence of standardization. Again, it may be out of scope for the current issue. With that being said, I will let the BIDS community chip in if there are plans for that in the future.

I strongly warn against allowing sample identifiers to be reused across subjects. We just made a change to prevent this in the SPARC data structure because of the numerous implementation, usability, and ambiguity issues that it causes. Sample IDs should at the very least be unique per dataset.

I should note that nesting files with sam-x in their name encounters the exact same issue as nesting folders and thus the problem remains. In theory this could be mitigated to some extent by forcing users to always include the subject id from which a sample was derived but as I mention above, creating composite primary keys from subject sample pairs is not a good idea. The problems that it causes down the line are simply not worth the trade-off of being able to detect that a sample file has been put in the wrong subject folder.

We understand your concerns in cases where the subject from whom the sample is derived would not be explicit. In this BEP, the derivation of a sample from a subject is enforced in the filename itself. An individual file will always have both the subject_id and the sample_id within its name. So the composite sub-sample key is not only present in the metadata but also corresponds directly to a unique filename (nested or not, misplaced or not).

From an experimental point of view, it also makes sense for people to name their samples the way they want for the same subject without having to take into account sample_id from previous acquisitions on other subjects. In addition, it is usual in BIDS to deal with the same key-value pair across a dataset with other entities such as session. Enforcing a unique sample_id would be an unusual requirement. Again, I would appreciate it if someone from the core BIDS team could chip in (@effigies), as I am uncomfortable speaking as a spokesperson for BIDS strategic decisions.

Further, what happens in cases with multiply derived samples? sub-1-sam-1-sam-1?

This was addressed earlier in the thread, where we suggested adding a derived_from column in the samples.tsv file. There is always only one instance of sample-<label> in the filename.

It is reasonable to use two different files to represent two different schema. However, the problem is that there are going to be more than two files in the future if the principle is "a new file for every new schema and/or every new participant type." [...] Some back of the napkin math suggests that creating a new table per participant type will wind up with BIDS eventually having well over a dozen different files one for each of the participant types that I enumerate.

As mentioned earlier, the addition of the sample entity and samples.tsv file would already cover new participant types (at least “atomic” ones), so we are not worried about dozens of files being added in the future.

@tgbugs

tgbugs commented May 7, 2021

Hi @jcohenadad, I'll leave some thoughts while awaiting responses from others. Since the discussion has strayed into cross BEP and core BIDS territory, this is understandable.

to not extend the definition of the subject entity

Absolutely. However, I wonder if that suggestion was made in a context where identifier type and conceptual type were conflated. Retaining the definition of subject while extending the scope of sub- should be possible by adding (initially) a separate definition for sample and a subject type or participant type column to participants.tsv. This is a unifying and regularizing generalization of the rather awkward sample type (I say this also having the same awkward sample type in the SDS schema that I maintain).
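
To picture what a single sparse table with a participant type column might look like, here is a hypothetical sketch. The column name `participant_type`, the per-type required fields, and the example records are all assumptions made for illustration; none of them come from BIDS or the SDS.

```python
# Hypothetical sparse participants table: every entity gets a sub- identifier,
# and fields that do not apply to a participant type are simply absent.
PARTICIPANTS = [
    {
        "participant_id": "sub-1",
        "participant_type": "subject",
        "species": "Rattus norvegicus",
    },
    {
        "participant_id": "sub-2",
        "participant_type": "sample",
        "sample_type": "tissue",
        "derived_from": "sub-1",
    },
]

# Assumed per-type requirements; a real schema would define these.
REQUIRED = {
    "subject": {"species"},
    "sample": {"sample_type", "derived_from"},
}


def missing_fields(record):
    """Return required fields for the record's type that are not present."""
    required = REQUIRED[record["participant_type"]]
    return sorted(required - set(record))


for record in PARTICIPANTS:
    print(record["participant_id"], missing_fields(record))
```

In this layout, adding a new participant type means adding an entry to the schema rather than a new top-level file, which is the trade-off being argued for above.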

From an experimental point of view, it also makes sense for people to name their samples the way they want for the same subject without having to take into account sample_id from previous acquisitions on other subjects

However, from a data sharing point of view, they probably should take that into account. There are countless sample-1s in the world, and having to carry around a composite primary key of dataset-id, subject-id, and sample-id without any way to reduce them to a single unique identifier for the individual participant seems like it will induce complexity in any future implementations of BIDS. Furthermore, it complicates communication about samples because unique sample ids cannot be generated without the subject id to qualify them unless all communicating parties agree on the convention for converting composite primary keys into unique ids.

Relevant to a later point, the generalization of this reasoning is that participant-1-participant-1-participant-1-participant-1 should be allowed as an identifier because each prior participant is distinct and it should be up to the experimenter how to identify their participants. Part of the current BEP tries to deal with this by forcing sample ids to be unique if they were derived from another sample, however the isDerivedFrom relationship applies with domain and range subject and sample in addition to sample and sample, so the lack of enforcement of unique identifiers for samples derived from subjects is inconsistent with respect to the isDerivedFrom relationship.

Enforcing a unique sample_id would be an unusual requirement.

But according to the proposal this is in fact already required for samples derived from other samples.

This was addressed earlier in the thread where we suggested to add a derived_from column in the samples.tsv file. There is always only one instance of sample-<label> in the filename.

There are many cases where samples and not subjects are shipped from one lab to another and then from the shipped samples further samples are derived. That is to say, there are labs for which someone else's sample is their subject.

If we were to apply the logic articulated above for subjects, the experimentalists should likewise not have to care about the fact they derived one sample from another, so long as they keep track of which sample they derived it from and thus that sub-1-sam-1-sam-1 should be allowed (re: infinitely nested participant-1).

Requiring different practices for identifier generation due to an arbitrary distinction between subject and sample (is a cadaver a sample?) seems like a design flaw. The restriction that only sample ids must be unique and enforcing that only on derived samples but not on samples derived directly from subjects* would significantly complicate the underlying data model and ontology.

* This isn't actually the requirement, it is more that all transitively derived samples from the same subject have to have unique identifiers. This gets extremely messy if you start deriving samples from populations of subjects because now the samples probably have to be uniquely identified up to the population not the subject, so the generalization of the uniqueness would require further specification (and thus complexity) in the future to correctly deal with such cases.
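As a rough illustration of the footnote, the sketch below walks `derived_from` links inside a single subject. The `derived_from` column is the one suggested earlier in the thread. The table contents, the use of `n/a` for "derived directly from the subject", and the helper names are assumptions made for this example, not something the BEP defines.

```python
import csv
import io

# Hypothetical samples.tsv content (tab-separated).
SAMPLES_TSV = (
    "sample_id\tparticipant_id\tderived_from\n"
    "sample-01\tsub-01\tn/a\n"
    "sample-02\tsub-01\tsample-01\n"
    "sample-03\tsub-01\tsample-02\n"
    "sample-01\tsub-02\tn/a\n"
)


def load_samples(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def lineage(rows, participant_id, sample_id):
    """Follow derived_from links within one participant back to the root."""
    by_key = {(r["participant_id"], r["sample_id"]): r for r in rows}
    chain = [sample_id]
    parent = by_key[(participant_id, sample_id)]["derived_from"]
    while parent != "n/a":
        if parent in chain:
            raise ValueError(f"cycle in derived_from at {parent}")
        chain.append(parent)
        parent = by_key[(participant_id, parent)]["derived_from"]
    return chain


def duplicate_keys(rows):
    """Return (participant_id, sample_id) pairs that appear more than once."""
    seen, dupes = set(), []
    for r in rows:
        key = (r["participant_id"], r["sample_id"])
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes


rows = load_samples(SAMPLES_TSV)
print(lineage(rows, "sub-01", "sample-03"))
print(duplicate_keys(rows))
```

Here the lookup only resolves because labels are unique within each subject. Samples derived from a population of subjects would need a wider key, which is the extra complexity the footnote refers to.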

@robertoostenveld
Collaborator

there are presently 4 explicit generic levels over which the acquisition of "data" can be iterated. I won't summarize the definitions here, but they can be found on https://bids-specification.readthedocs.io/en/stable/02-common-principles.html

  • there can be multiple subjects
  • there can be multiple sessions
  • there can be multiple scans
  • there can be multiple runs

The first three are represented in the directory hierarchy. For all of these 4 there is metadata that is also part of the filename to distinguish them.
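For readers less familiar with how these levels show up in filenames, here is a small sketch that splits a name into its `key-label` entities. The example filename (including its `img` suffix and `ext` extension) and the `parse_entities` helper are hypothetical. The position of `sample` after `ses` follows the pattern given at the top of this issue.

```python
import re

# A key-label pair such as "sub-01"; alphanumeric labels are assumed here.
ENTITY_RE = re.compile(r"(?P<key>[a-zA-Z]+)-(?P<label>[a-zA-Z0-9]+)")


def parse_entities(filename):
    """Split a BIDS-style filename into an entity dict and a suffix."""
    stem = filename.split(".", 1)[0]
    parts = stem.split("_")
    entities = {}
    for part in parts:
        match = ENTITY_RE.fullmatch(part)
        if match:
            entities[match["key"]] = match["label"]
    # The last chunk is the suffix when it is not itself a key-label pair.
    suffix = parts[-1] if "-" not in parts[-1] else None
    return entities, suffix


entities, suffix = parse_entities("sub-01_ses-02_sample-A_run-1_img.ext")
print(entities)
print(suffix)
```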

There are also multiple domain specific levels over which the acquisition of "data" can be iterated. For example over multiple voxels in fMRI, or multiple channels in EEG, or multiple timepoints (in either type of data). For MRI there can also be multiple echoes, or multiple contrast enhancing agents, or tracers.

The idea from @tgbugs to "extend the sub- identifier type to be used to name anything in a BIDS dataset that has data about it" leads to the question: why would you not extend the meaning of session, or scan, or run instead?

Or should one be allowed to do sub-ses1scan1run1 and use only sub- for whatever thing that repeats? Changing the entities that represent iterations of data acquisition would be technically possible, but breaks the meaning of those entities and hence would better fit BIDS 2.x (considering semantic versioning, and hence 1.x and 2.x being incompatible).

@yarikoptic
Collaborator

it might be worth splitting this issue into two (or three)

  • addition of _sample-<label> entity (IMHO quite straightforward)
  • ability to re-order levels of iteration (much larger issue, but could be made backward compatible so IMHO no need to wait for 2.x)
  • (providing the entity(ies) summary file(s) such as participants.tsv (aka subjects), scans.tsv, samples.tsv etc -- also could be backward compatible)

The last two also relate to the "stimuli BEP" wannabe issue (see e.g. #751 (comment); I also generalize "similarly"). IMHO they are orthogonal to the first one (the "samples" entity) and interrelated with each other, since with reordering you would get top level ".tsv/.json".

As for the last one -- we could gain ".tsv/.json" even without any reordering: at large we already have it somewhat implied by the inheritance principle, and hence could have scans.json (#789). Even sessions.tsv/.json at top level could be useful (e.g. to provide characteristics for "preoperative", "postoperative" sessions etc.) independent of the top "iteration" level (currently fixed to sub).

@tgbugs

tgbugs commented May 13, 2021

@robertoostenveld thank you very much.

I think that BIDS 2.X is probably the right venue for my suggestions, given the constraints on 1.X. In that context I only have one suggestion for this thread, which is to require that sample identifiers be unique per dataset, not per subject.
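The difference between the two uniqueness scopes can be sketched as two checks over `(participant_id, sample_id)` pairs. The rows and function names below are hypothetical and only meant to show what each rule would accept.

```python
# Hypothetical (participant_id, sample_id) pairs from a samples.tsv file.
rows = [
    ("sub-01", "sample-1"),
    ("sub-02", "sample-1"),
    ("sub-02", "sample-2"),
]


def unique_per_subject(pairs):
    """True if no (participant_id, sample_id) pair is repeated."""
    return len(set(pairs)) == len(pairs)


def unique_per_dataset(pairs):
    """True if no sample_id is repeated anywhere in the dataset."""
    labels = [sample for _, sample in pairs]
    return len(set(labels)) == len(labels)


print(unique_per_subject(rows))
print(unique_per_dataset(rows))
```

The example rows pass the per-subject check but fail the per-dataset one, because `sample-1` is reused under two subjects.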

why would you not extend the meaning of session, or scan, or run instead?

The only reason would be if there was a required metadata structure that was associated with some experimental process that could not be captured at one of those levels, or if there were more levels that were required. Otherwise the only reason would be because someone doesn't like the naming of the three levels.

In SPARC we have called the abstraction of those three into a single term performance or protocol execution variously. It corresponds to the performance of a protocol, aka the carrying out of some experimental process. The distinction between session, scan, and run has to do with the particular nesting of repeated structure that is common to many MRI experiments, and which is shared with a variety of other modalities beyond MRI.

For the most part these don't need to be extended because they are distinct only in how they are named and in that they support 3 levels of repeated structure. There might be some experimental designs that need slightly more expressivity, or that might need/want to associate slightly different metadata with a particular repeated process, in which case the abstracted solution might help.

@yarikoptic I think the 3 can be broken up as you suggest, with a note that there is an interaction between _sample-<label> and participants.tsv depending on what uniqueness constraints are required.

@SylvainTakerkart

I'm also in favor of addressing these issues step by step, this would fit the needs of development of BEP32 (which are strongly overlapping with the ones of BEP31, if not strictly identical)! and the first step (addition of the sample entity) will already allow us to move forward!

what's the next step?

@yarikoptic
Collaborator

I would think a PR for "addition of _sample- entity" with the opening of this PR. Point to this issue for further info on discussion etc.

I would also file a separate issue (or better even a PR) suggesting additional (RECOMMENDED) columns to participants.tsv/.json.

@mariehbourget
Collaborator Author

Thank you everyone for your feedback!
As suggested by @yarikoptic and as discussed in today’s BEP031 meeting, we will move forward with separate PRs, starting with the “Addition of sample entity”.

@mariehbourget
Collaborator Author

Hi everyone! The first PR (#812) for the addition of the sample entity is now open.

@Remi-Gau
Collaborator

Remi-Gau commented Feb 8, 2022

Closing this since #816 is now merged
