check dtypes in h5ad #15

mccalluc · 2020-02-27T01:44:41Z

Trevor notes:

@mccalluc: it would be good to check the dtypes in umap = ann_data.obsm['X_umap']. I would be surprised if the hdf5 data used unnecessarily large dtypes, but pandas defaults to float64 for numeric CSV columns. This was a headache I ran into with arrow early on.
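A minimal sketch of the check being suggested, not code from this repo. It uses a random NumPy array as a stand-in for ann_data.obsm['X_umap'] so it runs without an h5ad file; the `downcast_floats` helper is hypothetical, and whether float32 is precise enough for the UMAP coordinates is an assumption to confirm against the real data.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for ann_data.obsm['X_umap']; the real array would come from the h5ad file.
umap = np.random.default_rng(0).normal(size=(1000, 2))  # NumPy floats default to float64


def downcast_floats(arr):
    """Return arr as float32 if it is a wider float type, otherwise unchanged."""
    if np.issubdtype(arr.dtype, np.floating) and arr.dtype.itemsize > 4:
        return arr.astype(np.float32)
    return arr


umap32 = downcast_floats(umap)
print(umap.dtype, umap.nbytes)      # dtype and size in bytes before downcasting
print(umap32.dtype, umap32.nbytes)  # dtype and size in bytes after downcasting

# The pandas default mentioned above: numeric CSV columns are parsed as float64.
csv_df = pd.read_csv(io.StringIO("x,y\n0.5,1.5\n2.5,3.5\n"))
print(csv_df.dtypes)
```

With the real file, the same check would be run on the array read from ann_data.obsm['X_umap'] before writing anything out.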

manzt · 2020-02-27T01:49:32Z

I would be surprised if the numeric dtypes were huge (but good to check!). However, in my experience people forget that casting a pandas column with many repeated entries (e.g. cell type) to categorical can lead to a much lower memory footprint. When saving to arrow, I found that converting categorical columns had some nice benefits for the resulting arrow size. I found it easiest to convert these types on the pandas.DataFrame and then let pyarrow take care of mapping them to arrow-specific dtypes.
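To illustrate the categorical point, here is a rough sketch using only pandas. The column name, the cell type labels, the `categorize_repetitive` helper, and its `max_unique_ratio` threshold are all made up for the example; the pyarrow conversion step is left out so the snippet has no extra dependency.

```python
import pandas as pd

# Hypothetical data: one string column with a handful of heavily repeated labels.
labels = ["T cell", "B cell", "monocyte", "NK cell"]
df = pd.DataFrame({"cell_type": [labels[i % len(labels)] for i in range(10_000)]})


def categorize_repetitive(frame, max_unique_ratio=0.5):
    """Cast string columns with relatively few distinct values to categorical."""
    out = frame.copy()
    for col in out.columns:
        is_text = pd.api.types.is_string_dtype(out[col])
        if is_text and out[col].nunique() / len(out) < max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


small = categorize_repetitive(df)

# deep=True counts the string payloads, not just the per-row references.
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, after)
```

The converted frame can then be handed to pyarrow as described above, leaving the choice of arrow dtype to pyarrow.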

manzt · 2020-02-27T01:52:34Z

potentially useful: https://github.com/manzt/arrow-loader-demo/blob/c682ea7132830e45d7867e7d4a928fa063db6867/data/json2arrow.py#L20-L30
