Include isa-xlsx for ARC-specification 1.2 #76
Merge ISA-Tab specification with ISA-XLSX changes
Please add information akin to ~"Any other column MAY be used in ISA-XLSX, but is not converted to ISA-JSON"
ISA-XLSX.md
Outdated
## Inputs and Outputs

Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<InputNodeType>]`.
> They MUST be formatted in the pattern `Input [<InputNodeType>]`.

"They MUST be formatted in the pattern `Input [<InputNodeType>]` and `Output [<OutputNodeType>]`."?
Are `InputNodeType`s defined?
I think they are listed opaquely in the next sentences:
"A Source MUST be indicated with the node type `Source Name`", for example, indicates that `Source Name` is an input node type.
I agree that this should be more obvious. There should be a sentence right after that lists the allowed input and output node types.
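To make the header pattern under discussion concrete, here is a minimal Python sketch of how tooling might parse `Input [<InputNodeType>]` and `Output [<OutputNodeType>]` headers. The set of node types is an assumption collected from the names mentioned in this diff, not a list confirmed by the spec, and `parse_io_header` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Matches "Input [<type>]" or "Output [<type>]" as quoted in the diff.
HEADER_PATTERN = re.compile(r"^(Input|Output) \[(.+)\]$")

# ASSUMPTION: node types gathered from the spec text under review;
# the authoritative list may differ.
NODE_TYPES = {
    "Source Name",
    "Sample Name",
    "Extract Name",
    "Labeled Extract Name",
    "Image File",
    "Raw Data File",
    "Derived Data File",
}


def parse_io_header(header: str) -> Tuple[str, str]:
    """Split a column header into (direction, node type) or raise ValueError."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"not an Input/Output header: {header!r}")
    direction, node_type = match.groups()
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node_type!r}")
    # The diff states that Sources MUST not be used as Output nodes.
    if direction == "Output" and node_type == "Source Name":
        raise ValueError("a Source cannot be an Output node")
    return direction, node_type


print(parse_io_header("Input [Sample Name]"))  # ('Input', 'Sample Name')
```

Listing the allowed types in one place, as suggested above, is what makes a check like this possible at all.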
@kMutagene @Brilator
Thanks a lot for your input. Can you check if my last commit clarifies it sufficiently?
ISA-XLSX.md
Outdated
Each annotation table sheet MUST contain an `Input` and an `Output` column, which denote the Input and Output node of the `Process` node respectively. They MUST be formatted in the pattern `Input [<InputNodeType>]`.

A `Source` MUST be indicated with the node type `Source Name`. `Sources` MUST not be used as `Output` nodes.
Does this also imply that there MUST be a `Source Name` (somewhere in the ARC), i.e. a `Sample Name` MUST not exist without a Source towards that Sample?
Or did I misunderstand that isa tables are changed towards allowing `Input [Sample Name]`?
Yes, exactly. With this change, we want to make the annotation more explicit.
Instead of
-----Sheet1-----|------Sheet2------|
Source -> Sample = Source -> Sample
we now have
---Sheet1---|---Sheet2---|
Source -> Sample -> Sample
The second entity in the example above was in actuality a `Sample`, but ambiguously annotated as a `Source` in the second sheet. Now instead, we annotate it as what it is, a `Sample` in both sheets.
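The chain above can be sketched in Python as follows. This is only an illustration: the row layout, the sample names, and the helpers `build_edges` and `trace` are hypothetical, and it assumes that nodes with the same name in different sheets are the same entity.

```python
from collections import defaultdict

# Hypothetical rows: (input type, input name, output type, output name).
sheet1 = [("Source Name", "plant_1", "Sample Name", "leaf_1")]
sheet2 = [("Sample Name", "leaf_1", "Sample Name", "leaf_1_ground")]


def build_edges(*sheets):
    """Collect input -> output links, joining sheets on shared node names."""
    edges = defaultdict(list)
    for sheet in sheets:
        for _in_type, in_name, _out_type, out_name in sheet:
            edges[in_name].append(out_name)
    return dict(edges)


def trace(edges, start):
    """Follow the first outgoing link from each node until the chain ends."""
    path = [start]
    seen = {start}
    while path[-1] in edges:
        nxt = edges[path[-1]][0]
        if nxt in seen:  # guard against accidental cycles
            break
        path.append(nxt)
        seen.add(nxt)
    return path


print(trace(build_edges(sheet1, sheet2), "plant_1"))
# ['plant_1', 'leaf_1', 'leaf_1_ground']
```

Because `leaf_1` is annotated as a `Sample Name` in both sheets, the join needs no special case for a Sample that is re-labelled as a Source.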
Perfect, but this only answers half the question.
Can there be a sheet in any isa.study or isa.assay starting with a Sample, that does not have a source?
I would say yes, and therefore not add any constraint.
To my understanding, `Source` is only a further specification of a `Sample`.
An `Labeled Extract Material` MUST be indicated with the node type `Labeled Extract Name`.

`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.
I'm confused about those categories of inputs and outputs (afaik they come from ISA). They don't really represent what's happening in the lab. a) Basically all of those would typically be called "sample". b) "extract" or "labeled extract" are just two types of possible outputs of a laboratory workflow, while other output types (a sample prepared for microscopy, ground powder, seeds, etc.) are not categorized.
So I don't really get why these exist.
Agree 100%, but that's just what is given by ISA. I guess `Sample` and `Data` would probably be sufficient tbh.
But with ISA compatibility (isa-json as interface) as a top priority, IMO sticking to the given terminology is our best bet.
> I guess Sample and Data would probably be sufficient tbh.

If that proves to be the case based on what people actually use, there is no problem here, I think. There is nothing wrong with the spec providing more nodes than are practically useful; you would not expect users to read the spec anyway. However, if the spec focuses on full compatibility, these nodes must still be contained.
Right, fair enough.
Yup, good point
ISA-XLSX.md
Outdated
`Source Names`, `Sample Names`, `Extract Names` and `Labeled Extract Names` MUST be unique across an ARC. If two of these entities with the same name exist in the same ARC, they are considered the same entity.

`Image File`, `Raw Data File` or `Derived Data File` node types MUST correspond to a relevant `Data` node to provide names or URIs of file locations.
Similar to the "extract" types, I do not understand how an `Image File` is different from other `Raw Data File`s.
ISA-XLSX.md
Outdated
For detail on ISA framework terminology, please read the [ISA Abstract Model specification](https://isa-specs.readthedocs.io/en/latest/isamodel.html).

This document describes the ISA Abstract Model reference implementation specified in the ISA-XLSX format. The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet document. Conceptually, using the terminology of the Spreadsheet ML specification [ISO/IEC 29500-1](https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml#:~:text=The%20XLSX%20format%20uses%20the,a%20rectangular%20grid%20of%20cells.), the document comprises one or more worksheets in a workbook. Every worksheet MUST contain one table object storing the metadata. Comments or auxiliary information MAY be stored alongside with table objects in a worksheet.
> Every worksheet MUST contain one table object storing the metadata

- Means there can be no worksheet without metadata (e.g. random notes)
- metadata = ISA metadata ≈ experimental metadata?
Will cut this file.
I have this below:

> Sheets described in this specification MUST follow one of the two given formats:
>
> - `Top-level metadata sheets` for listing top-level metadata
> - `Annotation Table sheets` for describing experimental workflows
>
> Sheets which do not follow any of these two formats are considered additional payload and are ignored in this specification.
ISA-XLSX.md
Outdated
ISA-XLSX uses three types of files to capture the experimental metadata:
- Investigation file
- Study file
- Assay file (with associated data files)
> - Assay file (with associated data files)

- dataset files?
- Protocol files are also "associated" and can be referenced in `Protocol REF`
- Analogous for "Study file" above: associated resources and protocols
Will cut the part after Assay file
ISA-XLSX.md
Outdated
The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files; one assay file may be registered in different studies.

In order to facilitate identification of ISA-XLSX component files, specific naming patterns MUST follow:
MUST be followed?
Agree
ISA-XLSX.md
Outdated
For maximal portability file names SHOULD contain only ASCII characters not excluded
already (that is `A-Za-z0-9._!#$%&+,;=@^(){}'[]` - we exclude space as many utilities
do not accept spaces in file paths): non-English alphabetic characters cannot be guaranteed
to be supported in all locales. It would be good practice to avoid the shell metacharacters
> would be good practice

`is recommended`? `is good practice`? `should`?
I'll pick `is recommended`.
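For tooling, the portable character set quoted in the diff could be checked roughly as in the following Python sketch. The helper name `is_portable_file_name` is hypothetical, and the character class is transcribed literally from the quoted text, so a literal hyphen is not accepted here; whether the spec intends to allow it is an open assumption.

```python
import re

# Character class transcribed from the diff: A-Za-z0-9._!#$%&+,;=@^(){}'[]
# ASSUMPTION: "-" only denotes ranges in the quoted set, so a literal
# hyphen is rejected by this sketch.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9._!#$%&+,;=@^(){}'\[\]]+")


def is_portable_file_name(name: str) -> bool:
    """Return True if every character of a file name is in the portable set."""
    return PORTABLE_NAME.fullmatch(name) is not None


print(is_portable_file_name("isa.assay.xlsx"))     # True
print(is_portable_file_name("my assay.xlsx"))      # False (contains a space)
print(is_portable_file_name("größe.study.xlsx"))   # False (non-ASCII letters)
```

The check applies to a single file name, not to a full path, since `/` is not part of the quoted set.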
ISA-XLSX.md
Outdated
The `Investigation file` fulfils four needs:

1. to declare key entities, such as factors, protocols, which may be referenced in the other files
2. to track provenance of the terminologies (controlled vocabularies or ontologies) there are used, where applicable
> there are used,

`of the used terminologies`
Agree
ISA-XLSX.md
Outdated
## INVESTIGATION section

This section is organized in several subsections, described in detail below. The Investigation section provides a
flexible mechanism for grouping two or more Study files where required. When only one Study is created, the values in
> When only one Study is created, the values in this section SHOULD be left empty and the relevant metadata values recorded in the Study section only.

Why?
ISA-XLSX.md
Outdated
`Protocol Description` columns MAY be used to specify the description of the `Protocol` node implemented by the `Process` node. Per Annotation Table sheet there MUST be at most one `Protocol Description` column. The value MUST be free text.

`Protocol Uri` columns MAY be used to specify the uri of the `Protocol` node implemented by the `Process` node. Per Annotation Table sheet there MUST be at most one `Protocol Uri` column. The value MUST be free text.
This means I can reference an (external) protocol from e.g. a protocol database?

> The value MUST be free text.

I would assume that it MUST be a URI.
More important question here is: do we want such columns (`Protocol URI`, `Protocol Description`, etc.) inside an annotation table sheet?
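If these columns stay, the two constraints touched on in this thread could be checked along the lines of this Python sketch. Both helper names are hypothetical. The "at most one column" rule is taken from the diff; requiring a URI scheme is only the stricter reading suggested above, not something the current text (free text) demands.

```python
from collections import Counter
from urllib.parse import urlparse

# Headers that the diff allows at most once per Annotation Table sheet.
SINGLETON_HEADERS = ("Protocol Description", "Protocol Uri")


def duplicated_singletons(headers):
    """Return the singleton headers that occur more than once in a sheet."""
    counts = Counter(headers)
    return [h for h in SINGLETON_HEADERS if counts[h] > 1]


def has_uri_scheme(value: str) -> bool:
    """ASSUMPTION: treat a value as a URI only if it carries a scheme."""
    return bool(urlparse(value).scheme)


headers = ["Input [Source Name]", "Protocol Uri", "Protocol Uri"]
print(duplicated_singletons(headers))  # ['Protocol Uri']
print(has_uri_scheme("https://example.org/protocols/42"))  # True
print(has_uri_scheme("see lab notebook"))  # False
```

A scheme check is deliberately loose: it would reject a plain relative path to a protocol file inside the ARC, which may or may not be desired.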
ISA-XLSX.md
Outdated
## Factors

A `Factor` is an independent variable manipulated by an experimentalist with the intention to affect biological systems in a way that can be measured by an assay. This field holds the actual data for the `Factor` named between the
square brackets (as declared in the `Study Factors` section of a top-level metadata sheet) so MUST match; for example, `Factor [compound]`. The value MUST be free text, numeric, or an [`Ontology Annotation`](#ontology-annotations).
> so MUST match; for example,

`so MUST match, for example,`
ISA-XLSX.md
Outdated
## Parameters

`Parameters` are all additional information about the experimental setup, that do not fall under the aforementioned 3 categories. It is formatted in the pattern `Parameter [<category term>]`. The value MUST be free text, numeric, or an [`Ontology Annotation`](#ontology-annotations).
Unify first sentences.
- `Characteristics`
- A `Factor`
- A `Component`
- `Parameters`
ISA-XLSX.md
Outdated
## Others

Columns whose headers do not follow any of the formats described above are considered additional payload and are ignored in this specification.
> are ignored

`are out of the scope / not affected in this spec`

| | |
|---------------------|--------------------------------------|
| ASSAY | |
| Assay File Name | assays/Proteomics/isa.assay.xlsx |
I'm wondering whether this should be relative in the study and assay files. Or basically relative in all.
So in `isa.investigation.xlsx` it would be `assays/Proteomics/isa.assay.xlsx`.
And in the top-level metadata `assays/Proteomics/isa.assay.xlsx`:`isa_assay` it should be `isa.assay.xlsx`?
Relative to top-level root is my current modus operandi for tooling.
I will try to incorporate specifications about this in the next patch (1.2.1)/minor (1.3.0) release.
I think this is something for the `arc specification` though. In ISA-XLSX it could be done in any other way, if used outside of the ARC context.
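The two conventions discussed here can be compared with a small Python sketch. The ARC root `/data/my_arc` and both helper names are hypothetical; the point is only that a root-relative value and a file-relative value can resolve to the same location.

```python
from pathlib import PurePosixPath

ARC_ROOT = "/data/my_arc"  # hypothetical ARC location


def resolve_from_root(arc_root: str, recorded: str) -> PurePosixPath:
    """Convention A: the recorded value is relative to the ARC root."""
    return PurePosixPath(arc_root) / recorded


def resolve_from_file(referencing_file: str, recorded: str) -> PurePosixPath:
    """Convention B: the recorded value is relative to the referencing file's folder."""
    return PurePosixPath(referencing_file).parent / recorded


a = resolve_from_root(ARC_ROOT, "assays/Proteomics/isa.assay.xlsx")
b = resolve_from_file(f"{ARC_ROOT}/assays/Proteomics/isa.assay.xlsx", "isa.assay.xlsx")
print(a == b)  # True
```

Either convention works for tooling as long as the spec states which one applies in which file.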
This PR introduces the `ISA-XLSX` specification into the ARC specification.

Until now, the ARC specification mentioned ISA-Tab as a reference for the implementation of the experimental metadata files.
Contrary to this, many differences between the ISA-Tab specification and our tool implementations accumulated. Therefore I propose here the ISA-XLSX specification.
closes #73
closes #71