Unicode strings unexpectedly transformed to byte strings upon open_dataset #1638

Closed
olgabot opened this issue Oct 18, 2017 · 7 comments · Fixed by #1648

olgabot commented Oct 18, 2017

When I first create the dataset, all the metadata is stored as unicode strings (yay!):

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    EXP_ID                        (cell) <U29 '170925_A00111_0066_AH3TKNDMXX' ...
    TAXON                         (cell) <U3 'mus' 'mus' 'mus' 'mus' 'mus' ...
    WELL_MAPPING                  (cell) <U9 'B000126' 'B000126' 'B000126' ...
    Lysis Plate Batch             (cell) <U32 '20' '20' '20' '20' '20' '20' ...
    dNTP.batch                    (cell) <U38 '457912' '457912' '457912' ...
    oligodT.order.no              (cell) <U32 '6/23/17 12757296' ...
    plate.type                    (cell) <U32 'Biorad HSP3901' ...
    preparation.site              (cell) <U32 'Biohub' 'Biohub' 'Biohub' ...
    date.prepared                 (cell) <U32 '07-06-17' '07-06-17' ...
    date.sorted                   (cell) <U6 '170707' '170707' '170707' ...
    tissue                        (cell) <U13 'Skin' 'Skin' 'Skin' 'Skin' ...
    subtissue                     (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.id                      (cell) <U13 '3_39_F' '3_39_F' '3_39_F' ...
    FACS.selection                (cell) <U52 'Multiple' 'Multiple' ...
    nozzle.size                   (cell) <U32 '100' '100' '100' '100' '100' ...
    FACS.instument                (cell) <U32 'Sony SIM1' 'Sony SIM1' ...
    Experiment ID                 (cell) <U32 'exp22' 'exp22' 'exp22' ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    Plate                         (cell) <U32 '1' '1' '1' '1' '1' '1' '1' ...
    Location                      (cell) <U32 'MACA20_3' 'MACA20_3' ...
    Comments                      (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.age                     (cell) <U1 '3' '3' '3' '3' '3' '3' '3' '3' ...
    mouse.number                  (cell) <U32 '39' '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) <U1 'F' 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
Data variables:
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...

But when I save with to_netcdf using the default arguments and then call xr.open_dataset on the same file, also with default arguments, all of them get converted to byte strings:

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) |S24 b'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) |S22 b'0610005C13Rik' ...
Data variables:
    counts                        (cell, gene) int32 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    FACS.selection                (cell) |S52 b'Multiple' b'Multiple' ...
    dNTP.batch                    (cell) |S38 b'457912' b'457912' b'457912' ...
    EXP_ID                        (cell) |S29 b'170925_A00111_0066_AH3TKNDMXX' ...
    subtissue                     (cell) |S19 b'nan' b'nan' b'nan' b'nan' ...
    oligodT.order.no              (cell) |S17 b'6/23/17 12757296' ...
    plate.type                    (cell) |S14 b'Biorad HSP3901' ...
    tissue                        (cell) |S13 b'Skin' b'Skin' b'Skin' ...
    mouse.id                      (cell) |S13 b'3_39_F' b'3_39_F' b'3_39_F' ...
    FACS.instument                (cell) |S13 b'Sony SIM1' b'Sony SIM1' ...
    Comments                      (cell) |S11 b'nan' b'nan' b'nan' b'nan' ...
    WELL_MAPPING                  (cell) |S9 b'B000126' b'B000126' ...
    date.prepared                 (cell) |S9 b'07-06-17' b'07-06-17' ...
    Location                      (cell) |S9 b'MACA20_3' b'MACA20_3' ...
    preparation.site              (cell) |S8 b'Biohub' b'Biohub' b'Biohub' ...
    date.sorted                   (cell) |S6 b'170707' b'170707' b'170707' ...
    Experiment ID                 (cell) |S6 b'exp22' b'exp22' b'exp22' ...
    TAXON                         (cell) |S3 b'mus' b'mus' b'mus' b'mus' ...
    Lysis Plate Batch             (cell) |S3 b'20' b'20' b'20' b'20' b'20' ...
    nozzle.size                   (cell) |S3 b'100' b'100' b'100' b'100' ...
    Plate                         (cell) |S3 b'1' b'1' b'1' b'1' b'1' b'1' ...
    mouse.number                  (cell) |S3 b'39' b'39' b'39' b'39' b'39' ...
    Uniquely mapped reads number  (cell) int32 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int32 1229254 730274 1075370 ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    mouse.age                     (cell) |S1 b'3' b'3' b'3' b'3' b'3' b'3' ...
    mouse.sex                     (cell) |S1 b'F' b'F' b'F' b'F' b'F' b'F' ...

So things I expect to work, like selecting on gene with ds.sel(gene="Ins1"), fail unless I pass byte strings; ds.sel(gene=b"Ins1") works just fine.
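To illustrate why the unicode label finds nothing, here is a minimal NumPy-only sketch (it does not use xarray or touch a netCDF file). The `genes` array is a made-up stand-in for the gene coordinate after the round trip, and decoding assumes the stored bytes are UTF-8:

```python
import numpy as np

# Hypothetical stand-in for the gene labels after the round trip:
# a fixed-width byte-string array (dtype |S13).
genes = np.array([b"0610005C13Rik", b"Ins1", b"Ins2"])

# In Python 3 a str never compares equal to bytes, so a unicode label matches nothing.
unicode_hits = [g for g in genes.tolist() if g == "Ins1"]
bytes_hits = [g for g in genes.tolist() if g == b"Ins1"]

# Decoding the whole array gives unicode labels again (assumes UTF-8 data).
decoded = np.char.decode(genes, "utf-8")
decoded_hits = [g for g in decoded.tolist() if g == "Ins1"]
```

Decoding after loading is only a workaround; it does not address why the strings come back as bytes in the first place.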

Do you know why this may be happening?

olgabot (Author) commented Oct 18, 2017

Also, how did all the Coordinates somehow get moved into Data variables?

shoyer (Member) commented Oct 18, 2017

Which backend are you using to save the data? Try explicitly setting engine to netcdf4, h5netcdf, or scipy. I think h5netcdf may be your best bet, but it's probably worth trying all of them. Sadly, unicode strings in Python 3 / NumPy are still quite painful.

> Also, how did all the Coordinates somehow get moved into Data variables?

This looks like a bug of some sort -- not sure how that happened!

shoyer (Member) commented Oct 21, 2017

Reading over Unidata/netcdf-c#402, it seems like we should probably copy the handling of _Encoding from netcdf4-python (Unidata/netcdf4-python#665) to our scipy interface. That would solve our problem of faithfully round-tripping data.
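As a rough illustration of the idea (not the actual code in netcdf4-python or xarray), an `_Encoding`-style round trip could look like the sketch below. Both helper names are hypothetical; the only thing taken from the discussion is that an `_Encoding` attribute records how the stored bytes should be decoded:

```python
import numpy as np

def encode_with_encoding_attr(data, encoding="utf-8"):
    """Hypothetical helper: encode unicode data and record the encoding used."""
    if data.dtype.kind != "U":
        return data, {}  # not unicode: nothing to encode, no attribute to set
    return np.char.encode(data, encoding), {"_Encoding": encoding}

def decode_with_encoding_attr(data, attrs):
    """Hypothetical helper: decode byte strings only when `_Encoding` is set."""
    encoding = attrs.get("_Encoding")
    if encoding is None or data.dtype.kind != "S":
        return data  # no declared encoding: leave the bytes untouched
    return np.char.decode(data, encoding)

# Round trip: unicode -> bytes plus attribute -> unicode.
original = np.array(["Skin", "mus"])
stored, attrs = encode_with_encoding_attr(original)
restored = decode_with_encoding_attr(stored, attrs)

# Without the attribute there is no way to know the bytes were text.
untouched = decode_with_encoding_attr(stored, {})
```

The point of the attribute is the last line: byte data with no declared encoding stays as bytes, so only arrays that were written as unicode come back as unicode.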

olgabot (Author) commented Oct 24, 2017

Thank you for looking into this! I used the default engine to save, which looks like it was netcdf4. I did pip install h5netcdf and saved again. It took longer, ~2min instead of seconds. Loading was still 110ms and all the features are objects again! Though the coordinates --> variables thing is still happening.

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
Data variables:
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Comments                      (cell) object 'nan' 'nan' 'nan' 'nan' ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    EXP_ID                        (cell) object '170925_A00111_0066_AH3TKNDMXX' ...
    Experiment ID                 (cell) object 'exp22' 'exp22' 'exp22' ...
    FACS.instument                (cell) object 'Sony SIM1' 'Sony SIM1' ...
    FACS.selection                (cell) object 'Multiple' 'Multiple' ...
    Location                      (cell) object 'MACA20_3' 'MACA20_3' ...
    Lysis Plate Batch             (cell) object '20' '20' '20' '20' '20' ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    Plate                         (cell) object '1' '1' '1' '1' '1' '1' '1' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    WELL_MAPPING                  (cell) object 'B000126' 'B000126' ...
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    dNTP.batch                    (cell) object '457912' '457912' '457912' ...
    date.prepared                 (cell) object '07-06-17' '07-06-17' ...
    date.sorted                   (cell) object '170707' '170707' '170707' ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    mouse.age                     (cell) object '3' '3' '3' '3' '3' '3' '3' ...
    mouse.id                      (cell) object '3_39_F' '3_39_F' '3_39_F' ...
    mouse.number                  (cell) object '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
    nozzle.size                   (cell) object '100' '100' '100' '100' ...
    oligodT.order.no              (cell) object '6/23/17 12757296' ...
    plate.type                    (cell) object 'Biorad HSP3901' ...
    preparation.site              (cell) object 'Biohub' 'Biohub' 'Biohub' ...
    subtissue                     (cell) object 'nan' 'nan' 'nan' 'nan' ...
    tissue                        (cell) object 'Skin' 'Skin' 'Skin' 'Skin' ...

Not sure if it matters, but one detail is that I created ~250 individual datasets (each sized at ~300 samples x 20,000 features) and then used xr.concat(datasets, dim='cell') to concatenate them because I couldn't read them all into memory at once.

shoyer (Member) commented Oct 25, 2017

Hmm. I'm not sure why h5netcdf was so much slower. I suspect the default engine you used might have been scipy, which we've noticed can be significantly faster in some cases. If you have time, I would be curious how well my branch in #1648 works using engine='scipy'.

Please file a separate issue if you can put together example code that reproduces the lost coordinate issue. I would like to dig into this.

olgabot (Author) commented Nov 1, 2017

Using v0.9.6 with engine='h5netcdf':

CPU times: user 1min, sys: 47.7 s, total: 1min 48s
Wall time: 2min 19s

Using #1648:

CPU times: user 1min 5s, sys: 54.9 s, total: 2min
Wall time: 2min 1s

olgabot (Author) commented Nov 1, 2017

Posted the lost coordinate issue here: #1680
