Identify nomenclature that unambiguously describes the components of matrix-api objects #11
Comments
I find the current nomenclature ( I'd thought the names were switched, since |
I also stumbled across the naming when first reading it in the Google Doc proposal and made a comment. Months later, I personally got used to it and can't think of a better alternative. I do think it'd be helpful to mention the naming convention translation ( Here comes a suggestion along these lines: #17 |
I still feel like using group for singular and dataset for plural can be confusing. Especially when HDF5 and zarr, which one might refer to in the same breath, define |
I'd be very much open for a different naming choice, but I also think that would require a more in-depth decision doc. For instance, I'd feel "super-group" bears more potential for confusion in the future. |
@ivirshup do you think this would be mitigated in practice when schema specifications + group naming is applied in implementations of the schema? I feel they'll make things more concrete. I'm open to changing the name, but I think Alex's changes in #17 mitigate most of the confusion and wonder if we're in an area of diminishing returns. As an example, the TileDB implementation we're working on has the following structure. This example models a 10x multiome.
|
I think part of the issue is me working with very concrete examples (hdf5 and zarr APIs) where groups contain datasets. The context switch here keeps tripping me up. I'm not sure I've had a day where I've thought about this and not mixed them up at least once. I don't think this is a huge sticking point for me, but I think we're going to have some confused conversations if the current |
I understand Isaac's discomfort with the naming choice and have given it more thought to come up with a potentially better choice. To that end, I'll try to offer an additional perspective on the current specification. What we are discussing here is the naming choice of two "data structures" giving rise to two "data container objects" and their serialized counterparts. Here comes a table with a few naming suggestions. Please note that the new random abbreviation SLAM is just meant to ditch all connotations with existing terms and could be any acronym for that sake.
To jump to my conclusion for selecting names: My strongest inclination is that I'd suggest replacing it with another term. My favorite would be a new term like
I'd also suggest replacing I'll leave it with this first take on it. I'd love to hear more in-depth perspectives on naming rationales for the two data structures! And I'd encourage us to still consider the current names For more background, for anyone who still wants to read on: Here comes an effort to define a SLAM1:
And here comes an effort to define "A collection of related SLAMs."4:
Footnotes
|
This is something I struggled with in the FOM schema as well. Maybe the term "matrix_group" could refer to a set of matrices that are related and "dataset" could be a collection of matrix groups. If the matrices within a sc_group can have different numbers of features, then it might be better to call it an "obs_group". The only other generic possibility that I can think of is "subset" or "modality_subset" since each sc_group is a unique combination of modalities, observations, and features. |
Thanks very much for the clarity about the challenge, @ivirshup, and the in-depth decomposition of this question, @falexwolf. We've discussed and favor the replacement of Our discussions lead us to propose two additional requirements on the suffix, since we're going to say it a million times: it should be pronounceable and spellable. Our concern with I'm interested to know what level of confidence @falexwolf and @ivirshup have that "matrix" is not overly limiting. If it's not, then We're also interested in ideas for alternative acronyms, and agree we should let this issue germinate a bit before making the change. |
Sorry, I didn't refresh to see @falexwolf's last post before adding my own. I also like I don't really have a strong opinion, but I do wonder whether it is worth aligning the name with FOM working group. The FOM (feature observation matrix) acronym was thought up quickly for a grant last summer to get the working group going. I asked during our first call if anyone had other thoughts but didn't get much of a response. |
I strongly agree with @joshua-d-campbell's point on aligning FOM vs. To @joshua-d-campbell's point about "simple": I think people might (and I know did) construct matrices by joining, for instance, RNA & ATAC measurements into one contingent table. To me that's then no longer a simple matrix in this sense:
I'd discourage storing "such a complicated object" over a "simple one" as users would have a hard time understanding it. I agree with @joshua-d-campbell that we don't need to have "simple" in the name. A specification in the docs that discourages storage of intermediate-stage-processed RNA & ATAC could address the above concern!1 I think that having or having not "simple" in the name is more an aesthetic consideration. To @ambrosejcarr's point of whether the term "matrix" is limiting. My thinking was as follows: Yes, a "data matrix" is an array of dimension 2 ("tensor of degree 2"), with index-identifications observations ⨉ features. There may be measurements that generate data in meaningful higher-dimensional structures that could be worth absorbing in a higher-dimensional array. The only candidate I can presently think of is spatial data. But to take away the conclusion: I think also spatial data should be stored in annotated-matrix form within the scope of the present data structure specification. Below comes why I think so. I'm not an expert, but as far as I know, there are two common ways of representing spatial information:
We see things got a little complicated in 2! I generally wonder whether one shouldn't favor 1 as a first "go-to" stored representation, in particular as this also seems what OME now considers (ome/ngff#64). There are many ways of putting measurement data on a grid through reshaping, interpolating, adjusting resolutions, etc. These are highly non-trivial ML-related comp-workflows. Most of them won't fit into the metadata model that we're discussing here. If people do use imaging data, I'd encourage them to store point measurements that are annotated with spatial information as an intermediate representation after a "folder of files". This would fit the present scope of the present repository and be compatible with the "annotated matrix" layout. Reshaped and further-processed tensor/array like representations of the same data could then be dealt with ML data infrastructure. I'm pretty sure there should be solutions for this but would have to investigate. This whole discussion is related to how much the present format should be seen as a "canonical intermediate format" for omics data as opposed to a format that can also absorb all potential downstream representations. Given the reasons above (with really the dominant reason being the ability to build powerful metadata-schema specification around a matrix), I think it should be considered the former: Upstream (fastqs, folders of images, etc.) and downstream (highly-processed tensor-like data) formats that are non-annotated-matrix like should be handled elsewhere. One should maybe stress this: Restricting ourselves to "annotated matrices" does not mean any information loss. It means that in some cases some nd-array structure that could be absorbed within the substructure of a flat "primary molecular/measurement dimension" would have to be flattened. But in my mind,
Sorry, I hope this didn't end up being too convoluted. Please let me know if it is and I'll try to streamline the writeup. 😅 Footnotes
|
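To make the "point measurements annotated with spatial information" idea from the comment above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration only: the `AnnotatedMatrix` class, the `observations_in_window` helper, and the `spot_id`/`x`/`y`/`feature_id` annotation names are hypothetical and not part of any proposed specification. The point is only that spatial coordinates can live as per-observation annotations next to an ordinary observations × features matrix.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedMatrix:
    """Hypothetical minimal annotated matrix: observations x features."""

    X: list    # one row per observation, one column per feature
    obs: dict  # annotation name -> list with one entry per observation
    var: dict  # annotation name -> list with one entry per feature

    def __post_init__(self):
        n_obs = len(self.X)
        n_var = len(self.X[0]) if self.X else 0
        if any(len(row) != n_var for row in self.X):
            raise ValueError("X must be rectangular")
        if any(len(col) != n_obs for col in self.obs.values()):
            raise ValueError("obs annotations must align with rows of X")
        if any(len(col) != n_var for col in self.var.values()):
            raise ValueError("var annotations must align with columns of X")


def observations_in_window(m, x_range, y_range):
    """Return row indices whose (x, y) annotation falls inside the window."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        i
        for i, (x, y) in enumerate(zip(m.obs["x"], m.obs["y"]))
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    ]


# Three made-up spots measured for two made-up features.
spots = AnnotatedMatrix(
    X=[[5, 0], [2, 7], [0, 3]],
    obs={"spot_id": ["s0", "s1", "s2"], "x": [0.0, 1.5, 4.0], "y": [0.0, 2.0, 4.5]},
    var={"feature_id": ["geneA", "geneB"]},
)
nearby = observations_in_window(spots, x_range=(0.0, 2.0), y_range=(0.0, 2.5))
```

A spatial query then reduces to filtering on observation annotations, and the matrix itself stays two-dimensional.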
One case here is genomics data. 3 dimensional arrays used by For spatial data having pixel and point level representations, I agree these have fairly different use cases. If these representations are in scope, a BED file like representation for ATAC data would be in scope as well. I do think it's important that data where point and read level information is associated with an annotated matrix is compatible here. But my current thinking is that the solution there is essentially to keep their obs x var annotated matrices compatible with |
@falexwolf this is a great restatement of CZI and TileDB's goals for this project. I think we should port the bolded sections to the readme.
@ivirshup The sgkit example is interesting, and a good test case. ... I think we could support these data. You could imagine flattening "allele" and "variant" dimensions to "variant" and annotate each variant with the allele it associates with to ensure no information loss. This is a very useful alignment discussion. I also think we have our answer: "Matrix" is not overly limiting. Based on this, I think the current proposal is as follows (But please let me know if I've interpreted your comments incorrectly!)
I also agree with @joshua-d-campbell and @falexwolf 's proposal to align FOM vs. matrix-api vs. the data structure name. Once we've aligned on a name, I favor changing both the working group and the specification proposal here. |
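For readers who want to see the flattening idea from the comment above spelled out, here is a rough sketch in plain Python. It assumes a toy nested list indexed as observation × variant × allele; that axis order, the function names, and the identifiers `v1`, `a1`, etc. are all made up for illustration and are not taken from sgkit or from the specification. In practice an array library would do the reshape.

```python
from itertools import product


def flatten_variant_allele(calls, variant_ids, allele_ids):
    """Flatten obs x variant x allele into obs x (variant, allele).

    Returns the 2D matrix plus one annotation record per flattened column,
    so the original variant/allele membership is not lost.
    """
    n_variants, n_alleles = len(variant_ids), len(allele_ids)
    # Column order: variants vary slowest, alleles fastest.
    var_annotations = [
        {"variant": v, "allele": a} for v, a in product(variant_ids, allele_ids)
    ]
    matrix = []
    for obs_block in calls:
        if len(obs_block) != n_variants or any(
            len(allele_row) != n_alleles for allele_row in obs_block
        ):
            raise ValueError("input does not match the given variant/allele ids")
        matrix.append([value for allele_row in obs_block for value in allele_row])
    return matrix, var_annotations


def unflatten(matrix, n_variants, n_alleles):
    """Inverse of the flattening above, rebuilt from the column layout."""
    return [
        [row[v * n_alleles:(v + 1) * n_alleles] for v in range(n_variants)]
        for row in matrix
    ]


# Two observations, two variants, two alleles (made-up values).
calls = [
    [[0, 1], [1, 1]],
    [[0, 0], [1, 0]],
]
flat, var_annotations = flatten_variant_allele(calls, ["v1", "v2"], ["a1", "a2"])
restored = unflatten(flat, n_variants=2, n_alleles=2)
```

The round trip through `unflatten` is what "no information loss" means here: the per-column annotations carry everything the third axis used to encode.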
I'm happy we seem so aligned, and good with proceeding like this! Re @ivirshup's example: Thanks for pointing it out! I agree with Ambrose that one can flatten it without information loss, and also no loss of computational efficiency. Per-variant queries & aggregations are less convenient if there is "no per-variant dimension". Adding that convenience back could be achieved through a "genomics"-accessor that talks to the I also think that that should be possible without efficiency losses as the 3d |
My impression was they like the 3d representation. Someone from sgkit could probably give more context on this. Maybe @hammer or @tomwhite could answer: Would referring to your data structure as a "matrix" be fine? For context on this thread, we're trying to figure a shared name for both a single AnnData/ SummarizedExperiment/ MatrixTable and a collection of them. IIRC From what I know:
I'm not sure this is the case. It seems that most of the annotation elements are aligned to either the IIRC Hail also does not do this (btw, Hail calls their AnnData-like object a
This would depend on the cardinality of the array, right? I generally don't think it would be that bad to allow X a third dimension. For the right cardinality, you could just think of it as a different dtype that happens to be composed of multiple values. To me, the constraint on X is that the first dimension is aligned to the observations, and the second is aligned to the variables. Not sure we need to constrain further dimensions. We would just be limiting how much you can annotate those extra dimensions. |
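A relaxed constraint of this kind, where only the first two axes of X are checked against observations and variables and any further axes are left alone, could be expressed roughly like this. The function name `x_is_aligned` and the use of a plain shape tuple are assumptions for the sake of the example, not anything from the spec.

```python
def x_is_aligned(x_shape, n_obs, n_var):
    """True if axis 0 matches obs and axis 1 matches var; later axes are free."""
    return len(x_shape) >= 2 and x_shape[0] == n_obs and x_shape[1] == n_var


# A plain obs x var matrix passes.
ok_2d = x_is_aligned((100, 2000), n_obs=100, n_var=2000)
# A trailing axis (e.g. two values per entry) is also accepted.
ok_3d = x_is_aligned((100, 2000, 2), n_obs=100, n_var=2000)
# Swapped axes are rejected.
bad_swapped = x_is_aligned((2000, 100), n_obs=100, n_var=2000)
# A 1D array cannot be aligned to both obs and var.
bad_1d = x_is_aligned((100,), n_obs=100, n_var=2000)
```

Under such a check the extra axes are allowed but carry no annotations of their own, which is the trade-off described above.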
One of the nice things about xarray and zarr is that the number of dimensions is flexible.
That's right, so we tend to use xarray nomenclature. So the whole data structure is an xarray dataset, which xarray defines as "a dict-like container of labeled arrays (DataArray objects) with aligned dimensions."
As mentioned above the individual items in an xarray dataset are called (data) arrays, but sometimes they might be referred to as matrices. |
Thanks, @tomwhite! I also agree with @ivirshup on these three paragraphs:
This would mean we would continue to build metadata standards that assume the second dimension corresponds to variables. This implies that, to take full advantage of these standards, data will need to be reshaped into that form. However, we can also allow further dimensions and postpone a solution for how to treat them in the metadata schema. Probably, this means delaying it for a few years. I think coming up with it will be a substantial challenge. Regarding the consequences for naming: if the term "matrix" feels too constraining, one could change I'd personally feel more comfortable with the narrow "matrix use case" in which users will always get exactly what they expect, and not some surprising 3rd dimension that they'll not know how to query and generally use. I appreciate that a converter for sgkit needs to be written, then, but I also think these converters will need to be written anyway. If we achieve an intuitive, well-designed canonical matrix-api for many types of biological data, this is both achievable and will bring much value. I think that the downside that this wouldn't be readily applicable to all types of biological data shouldn't let us make the mistake of broadening the scope so much that we can't precisely formulate it anymore. Hence, I'd suggest sticking with |
The working proposal for this is @joshmoore I'm interested if you have a perspective as an upstream format maintainer. |
Starting off with the general group/dataset discussion, I'd just like to 👍 @ivirshup's "it's confusing":
re: re: 2D -- if you decide it is a MUST, "tables" and "dataframes" express it better to me, though this probably is as contentious as "dataset" and "group". re: >2D (outside of my wheelhouse but...) "To me, the constraint on X is that the first dimension is aligned to the observations, and the second is aligned to the variables. Not sure we need to constrain further dimensions. #11 (comment)" is intriguing. I assume xarray-ness would suffice to make these additional, higher-dimension annotations discoverable. re: OME's table representation (#11 (comment)) -- here we are very much looking to define how to bridge with your work here, so if you decide to do something else, we would likely follow suit. cc @kevinyamauchi |
Our poll of the FOM group and other stakeholders completed today. The finalist names were FOLD (Feature Observation Layered Data), SLAM (Simple Layered Annotated Matrices), and SOMA (Stack of Matrices, Annotated). Voting indicated that SOMA was the preferred choice, and we had agreed that voting would be our decision-making approach for this decision. The following changes will result from this decision:
Follow-up work: #27 |
Circling back -- this seminal issue was in fact the foundational material behind all of last year's SOMA spec design: https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md |
This issue identifies confusion with the sc-group and sc-dataset nomenclature and contains a discussion about potential replacement names and how those names should be propagated to the broader feature-observation-matrix standardization efforts.