Passing metadata (/configuration?) through DataFrame.attrs #151

Open
ivirshup opened this issue May 7, 2023 · 0 comments

The feature

It would be really nice to have a way of associating chromosome info with the dataframe containing the ranges. I propose using pd.DataFrame.attrs to store metadata such as chromosome info and range column names.

Why

GRanges objects from Bioconductor have a @seqinfo slot that holds sequence info about the assembly being used. For example:

library(EnsDb.Hsapiens.v86)
ensdb = EnsDb.Hsapiens.v86
g = genes(ensdb)
head(g, 3)
# GRanges object with 3 ranges and 6 metadata columns:
#                   seqnames      ranges strand |         gene_id   gene_name           gene_biotype seq_coord_system      symbol                       entrezid
#                      <Rle>   <IRanges>  <Rle> |     <character> <character>            <character>      <character> <character>                         <list>
#   ENSG00000223972        1 11869-14409      + | ENSG00000223972     DDX11L1 transcribed_unproces..       chromosome     DDX11L1 100287596,100287102,727856,...
#   ENSG00000227232        1 14404-29570      - | ENSG00000227232      WASH7P unprocessed_pseudogene       chromosome      WASH7P                           <NA>
#   ENSG00000278267        1 17369-17436      - | ENSG00000278267   MIR6859-1                  miRNA       chromosome   MIR6859-1                      102466751
#   -------
#   seqinfo: 357 sequences (1 circular) from GRCh38 genome
g@seqinfo
# Seqinfo object with 357 sequences (1 circular) from GRCh38 genome:
#   seqnames seqlengths isCircular genome
#   1         248956422      FALSE GRCh38
#   10        133797422      FALSE GRCh38
#   11        135086622      FALSE GRCh38
#   12        133275309      FALSE GRCh38
#   13        114364328      FALSE GRCh38
#   ...             ...        ...    ...
#   LRG_741      231167      FALSE GRCh38
#   LRG_93        22459      FALSE GRCh38
#   MT            16569       TRUE GRCh38
#   X         156040895      FALSE GRCh38
#   Y          57227415      FALSE GRCh38

It would be nice if we could also attach this kind of information to our range dataframe for use with bioframe. This could be done by putting something equivalent to @seqinfo into the pd.DataFrame.attrs attribute. Something similar could also be done for different range column names.
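
As a rough sketch of what this could look like (not an existing bioframe feature): the attrs key name "seqinfo", the layout of the dictionary, and the seqlengths helper below are all assumptions made for illustration. The lengths and the circular flag are copied from the Seqinfo output above, and the range coordinates are copied verbatim from the R output with no coordinate-system conversion.

```python
import pandas as pd

# Hypothetical attrs key; bioframe does not define this name.
SEQINFO_KEY = "seqinfo"

# A small range dataframe using the first three genes shown above.
df = pd.DataFrame(
    {
        "chrom": ["1", "1", "1"],
        "start": [11869, 14404, 17369],
        "end": [14409, 29570, 17436],
        "strand": ["+", "-", "-"],
    }
)

# Assumed layout for a Seqinfo-like record: plain dicts keyed by sequence name.
df.attrs[SEQINFO_KEY] = {
    "genome": "GRCh38",
    "seqlengths": {"1": 248956422, "MT": 16569},
    "is_circular": {"1": False, "MT": True},
}


def seqlengths(frame):
    """Return the sequence lengths stored on the frame, or None if absent."""
    info = frame.attrs.get(SEQINFO_KEY)
    return None if info is None else info["seqlengths"]
```

A function that needs chromosome sizes could then call seqlengths(df) and fall back to an explicit argument when it returns None.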

Current use of global configuration

For column names (cols), this library already provides ways of setting values without passing them on every call (docs): a global config, or a context manager that temporarily modifies that config.
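
For readers unfamiliar with the pattern, here is a generic sketch of a global default plus a temporary override. It is not bioframe's actual implementation; the names _default_cols, get_cols, and temporary_cols are hypothetical.

```python
from contextlib import contextmanager

# Hypothetical module-level default; not bioframe's actual internals.
_default_cols = ("chrom", "start", "end")


def get_cols():
    """Return the column names currently in effect for the whole process."""
    return _default_cols


@contextmanager
def temporary_cols(cols):
    """Swap the global default inside a with-block, then restore it."""
    global _default_cols
    previous = _default_cols
    _default_cols = tuple(cols)
    try:
        yield
    finally:
        # Restore the previous default even if the block raised.
        _default_cols = previous
```

Inside `with temporary_cols(("seqnames", "start", "end")):` every call sees the overridden names, whichever dataframe it is operating on.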

I think both of these are less ergonomic than carrying the metadata with the data:

  • They require explicit code for something that could be stated once in the data and left implicit in the code.
  • They're global, so they don't allow working with different configurations at the same time.
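
By contrast, a per-dataframe lookup could let two frames with different column names be used side by side. This is only a sketch under assumptions: the "range_cols" key, resolve_cols, and interval_widths are hypothetical names, and the data is made up.

```python
import pandas as pd

DEFAULT_COLS = ("chrom", "start", "end")
COLS_KEY = "range_cols"  # hypothetical attrs key, not defined by bioframe


def resolve_cols(frame, cols=None):
    """Pick column names: explicit argument, then frame.attrs, then default."""
    if cols is not None:
        return tuple(cols)
    return tuple(frame.attrs.get(COLS_KEY, DEFAULT_COLS))


def interval_widths(frame, cols=None):
    """Compute end - start using whichever column names apply to this frame."""
    _, start, end = resolve_cols(frame, cols)
    return frame[end] - frame[start]


# Two frames with different naming conventions, used without any global state.
bed_like = pd.DataFrame({"chrom": ["1"], "start": [100], "end": [250]})
granges_like = pd.DataFrame({"seqnames": ["1"], "begin": [10], "stop": [40]})
granges_like.attrs[COLS_KEY] = ("seqnames", "begin", "stop")
```

Both interval_widths(bed_like) and interval_widths(granges_like) work in the same session, and an explicit cols argument still takes precedence.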

Downsides

pd.DataFrame.attrs

The main downside is pd.DataFrame.attrs itself:

  • It's still marked as experimental, and can change
  • It doesn't show up in the repr, so it's not obvious if anything has been added
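
A small illustration of the second point, with a hypothetical describe helper as one possible workaround. Because attrs is experimental, it also seems prudent to re-check that metadata is still present after transforming a frame rather than assume it propagated.

```python
import pandas as pd

df = pd.DataFrame({"chrom": ["1"], "start": [11869], "end": [14409]})
df.attrs["genome"] = "GRCh38"

# The default repr shows only the table, so the genome name is not visible.
print(df)


def describe(frame):
    """Return the frame's repr followed by one line per attrs entry."""
    lines = [repr(frame)]
    lines += [f"# {key}: {value!r}" for key, value in frame.attrs.items()]
    return "\n".join(lines)


print(describe(df))
```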

I would hope that usage here could influence further development of the feature.

May not work with other backends

It's not immediately obvious whether alternative dataframe backends would also support this kind of feature.

Alternatives

  • Do nothing, keep passing this metadata as is.
  • Custom class of some sort (like Bioconductor's GRanges)
    • Instead of a custom dataframe class, this could be a pandas extension array, which would be a lighter touch.
    • But this doesn't fit with the current bioframe design
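
For comparison, the custom-class alternative might look roughly like the wrapper below. RangeFrame and within_bounds are hypothetical names, and this is a plain dataclass sketch, not a pandas extension array or anything bioframe provides.

```python
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class RangeFrame:
    """Hypothetical wrapper pairing a range dataframe with sequence info."""

    df: pd.DataFrame
    seqlengths: dict = field(default_factory=dict)
    genome: Optional[str] = None

    def within_bounds(self) -> pd.Series:
        """Flag ranges whose end does not exceed their sequence length."""
        limits = self.df["chrom"].map(self.seqlengths)
        return self.df["end"] <= limits


ranges = RangeFrame(
    pd.DataFrame({"chrom": ["MT", "MT"], "start": [0, 16000], "end": [100, 17000]}),
    seqlengths={"MT": 16569},
    genome="GRCh38",
)
```

The metadata is explicit and visible here, but every function would have to accept and return the wrapper rather than a plain dataframe, which is the mismatch with the current design noted above.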