Unicode strings unexpectedly transformed to byte strings upon `open_dataset` #1638

olgabot · 2017-10-18T00:16:38Z

When I first create the dataset, all the metadata is stored as unicode strings (yay!):

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    EXP_ID                        (cell) <U29 '170925_A00111_0066_AH3TKNDMXX' ...
    TAXON                         (cell) <U3 'mus' 'mus' 'mus' 'mus' 'mus' ...
    WELL_MAPPING                  (cell) <U9 'B000126' 'B000126' 'B000126' ...
    Lysis Plate Batch             (cell) <U32 '20' '20' '20' '20' '20' '20' ...
    dNTP.batch                    (cell) <U38 '457912' '457912' '457912' ...
    oligodT.order.no              (cell) <U32 '6/23/17 12757296' ...
    plate.type                    (cell) <U32 'Biorad HSP3901' ...
    preparation.site              (cell) <U32 'Biohub' 'Biohub' 'Biohub' ...
    date.prepared                 (cell) <U32 '07-06-17' '07-06-17' ...
    date.sorted                   (cell) <U6 '170707' '170707' '170707' ...
    tissue                        (cell) <U13 'Skin' 'Skin' 'Skin' 'Skin' ...
    subtissue                     (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.id                      (cell) <U13 '3_39_F' '3_39_F' '3_39_F' ...
    FACS.selection                (cell) <U52 'Multiple' 'Multiple' ...
    nozzle.size                   (cell) <U32 '100' '100' '100' '100' '100' ...
    FACS.instument                (cell) <U32 'Sony SIM1' 'Sony SIM1' ...
    Experiment ID                 (cell) <U32 'exp22' 'exp22' 'exp22' ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    Plate                         (cell) <U32 '1' '1' '1' '1' '1' '1' '1' ...
    Location                      (cell) <U32 'MACA20_3' 'MACA20_3' ...
    Comments                      (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.age                     (cell) <U1 '3' '3' '3' '3' '3' '3' '3' '3' ...
    mouse.number                  (cell) <U32 '39' '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) <U1 'F' 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
Data variables:
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...

But when I save it with `to_netcdf` using the default arguments and then reopen the same file with `xr.open_dataset`, also with default arguments, all of them are converted to byte strings:

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) |S24 b'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) |S22 b'0610005C13Rik' ...
Data variables:
    counts                        (cell, gene) int32 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    FACS.selection                (cell) |S52 b'Multiple' b'Multiple' ...
    dNTP.batch                    (cell) |S38 b'457912' b'457912' b'457912' ...
    EXP_ID                        (cell) |S29 b'170925_A00111_0066_AH3TKNDMXX' ...
    subtissue                     (cell) |S19 b'nan' b'nan' b'nan' b'nan' ...
    oligodT.order.no              (cell) |S17 b'6/23/17 12757296' ...
    plate.type                    (cell) |S14 b'Biorad HSP3901' ...
    tissue                        (cell) |S13 b'Skin' b'Skin' b'Skin' ...
    mouse.id                      (cell) |S13 b'3_39_F' b'3_39_F' b'3_39_F' ...
    FACS.instument                (cell) |S13 b'Sony SIM1' b'Sony SIM1' ...
    Comments                      (cell) |S11 b'nan' b'nan' b'nan' b'nan' ...
    WELL_MAPPING                  (cell) |S9 b'B000126' b'B000126' ...
    date.prepared                 (cell) |S9 b'07-06-17' b'07-06-17' ...
    Location                      (cell) |S9 b'MACA20_3' b'MACA20_3' ...
    preparation.site              (cell) |S8 b'Biohub' b'Biohub' b'Biohub' ...
    date.sorted                   (cell) |S6 b'170707' b'170707' b'170707' ...
    Experiment ID                 (cell) |S6 b'exp22' b'exp22' b'exp22' ...
    TAXON                         (cell) |S3 b'mus' b'mus' b'mus' b'mus' ...
    Lysis Plate Batch             (cell) |S3 b'20' b'20' b'20' b'20' b'20' ...
    nozzle.size                   (cell) |S3 b'100' b'100' b'100' b'100' ...
    Plate                         (cell) |S3 b'1' b'1' b'1' b'1' b'1' b'1' ...
    mouse.number                  (cell) |S3 b'39' b'39' b'39' b'39' b'39' ...
    Uniquely mapped reads number  (cell) int32 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int32 1229254 730274 1075370 ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    mouse.age                     (cell) |S1 b'3' b'3' b'3' b'3' b'3' b'3' ...
    mouse.sex                     (cell) |S1 b'F' b'F' b'F' b'F' b'F' b'F' ...

So things I expect to work, like selecting on gene with `ds.sel(gene="Ins1")`, fail unless I pass byte strings: `ds.sel(gene=b"Ins1")` works just fine.

Do you know why this may be happening?
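As a rough illustration of why the `str` lookup misses, here is a minimal NumPy-only sketch. The gene labels are made-up stand-ins for the reloaded `gene` coordinate, and decoding with `np.char.decode` is only one possible workaround, not something the thread prescribes.

```python
import numpy as np

# Hypothetical gene labels, shaped like the reloaded coordinate (a |S byte-string dtype).
genes_bytes = np.array([b"0610005C13Rik", b"Ins1", b"Ins2"])

# In Python 3, bytes and str never compare equal, so a str lookup finds nothing.
same = (b"Ins1" == "Ins1")
str_matches = [g for g in genes_bytes.tolist() if g == "Ins1"]

# Decoding the byte array yields a unicode (<U) array, and the str lookup works again.
genes_text = np.char.decode(genes_bytes, "utf-8")
text_matches = [g for g in genes_text.tolist() if g == "Ins1"]

print(genes_bytes.dtype, "->", genes_text.dtype)
```

The same idea applies to any coordinate that comes back as bytes: the labels are intact, only their type differs.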


olgabot · 2017-10-18T00:17:25Z

Also, how did all the Coordinates somehow get moved into Data variables?

shoyer · 2017-10-18T00:54:04Z

Which backend are you using to save the data? Try explicitly setting `engine` to `netcdf4`, `h5netcdf` or `scipy`. I think `h5netcdf` may be your best bet, but it's probably worth trying all of them. Sadly, unicode strings in Python 3 / NumPy are still quite painful.
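One way to try each engine is a small round-trip loop like the sketch below. The dataset is a tiny made-up stand-in, not the one from this issue. Which engines actually run depends on what is installed, so failures are recorded rather than raised, and the resulting dtypes may differ between xarray versions.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Tiny stand-in dataset; all names and values are hypothetical.
ds = xr.Dataset(
    {"counts": (("cell", "gene"), np.zeros((2, 2), dtype="int64"))},
    coords={
        "cell": ["c1", "c2"],
        "gene": ["Ins1", "Ins2"],
        "tissue": ("cell", ["Skin", "Skin"]),
    },
)

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for engine in ("netcdf4", "h5netcdf", "scipy"):
        path = os.path.join(tmp, f"roundtrip_{engine}.nc")
        try:
            # Write and read back with the same explicitly chosen engine.
            ds.to_netcdf(path, engine=engine)
            with xr.open_dataset(path, engine=engine) as loaded:
                loaded.load()
                results[engine] = str(loaded["gene"].dtype)
        except Exception as err:  # e.g. the backend is not installed
            results[engine] = f"unavailable: {type(err).__name__}"

for engine, outcome in results.items():
    print(engine, "->", outcome)
```

Comparing the printed dtypes across engines shows which backend returns the labels as text and which returns bytes.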

> Also, how did all the Coordinates somehow get moved into Data variables?

This looks like a bug of some sort -- not sure how that happened!

shoyer · 2017-10-21T06:27:57Z

Reading over Unidata/netcdf-c#402, it seems like we should probably copy the handling of _Encoding from netcdf4-python (Unidata/netcdf4-python#665) to our scipy interface. That would solve our problem of faithfully round-tripping data.
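The idea behind an `_Encoding` attribute can be sketched without any netCDF library: the writer records the encoding next to the byte data, and the reader decodes only when that hint is present. The helper names below are hypothetical, and this is an illustration of the convention rather than the code in xarray or netcdf4-python.

```python
import numpy as np


def encode_strings(values, encoding="utf-8"):
    """Return (byte array, attrs) for unicode values, recording the encoding used."""
    data = np.char.encode(np.asarray(values, dtype=str), encoding)
    return data, {"_Encoding": encoding}


def decode_strings(data, attrs):
    """Decode back to unicode only if the writer recorded an _Encoding attribute."""
    encoding = attrs.get("_Encoding")
    if encoding is None:
        return data  # no hint: leave the raw bytes untouched
    return np.char.decode(data, encoding)


stored, attrs = encode_strings(["Ins1", "Skin", "naïve"])
restored = decode_strings(stored, attrs)  # unicode again
untouched = decode_strings(stored, {})    # stays as bytes without the hint

print(stored.dtype, restored.dtype, untouched.dtype)
```

Without the recorded attribute, a reader cannot tell text from arbitrary bytes, which is why the strings came back as bytes in the report above.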

olgabot · 2017-10-24T23:02:34Z

Thank you for looking into this! I used the default engine to save, which looks like it was netcdf4. I did pip install h5netcdf and saved again. It took longer, ~2min instead of seconds. Loading was still 110ms and all the features are objects again! Though the coordinates --> variables thing is still happening.

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
Data variables:
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Comments                      (cell) object 'nan' 'nan' 'nan' 'nan' ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    EXP_ID                        (cell) object '170925_A00111_0066_AH3TKNDMXX' ...
    Experiment ID                 (cell) object 'exp22' 'exp22' 'exp22' ...
    FACS.instument                (cell) object 'Sony SIM1' 'Sony SIM1' ...
    FACS.selection                (cell) object 'Multiple' 'Multiple' ...
    Location                      (cell) object 'MACA20_3' 'MACA20_3' ...
    Lysis Plate Batch             (cell) object '20' '20' '20' '20' '20' ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    Plate                         (cell) object '1' '1' '1' '1' '1' '1' '1' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    WELL_MAPPING                  (cell) object 'B000126' 'B000126' ...
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    dNTP.batch                    (cell) object '457912' '457912' '457912' ...
    date.prepared                 (cell) object '07-06-17' '07-06-17' ...
    date.sorted                   (cell) object '170707' '170707' '170707' ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    mouse.age                     (cell) object '3' '3' '3' '3' '3' '3' '3' ...
    mouse.id                      (cell) object '3_39_F' '3_39_F' '3_39_F' ...
    mouse.number                  (cell) object '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
    nozzle.size                   (cell) object '100' '100' '100' '100' ...
    oligodT.order.no              (cell) object '6/23/17 12757296' ...
    plate.type                    (cell) object 'Biorad HSP3901' ...
    preparation.site              (cell) object 'Biohub' 'Biohub' 'Biohub' ...
    subtissue                     (cell) object 'nan' 'nan' 'nan' 'nan' ...
    tissue                        (cell) object 'Skin' 'Skin' 'Skin' 'Skin' ...

Not sure if it matters, but one detail is that I created ~250 individual datasets (each sized at ~300 samples x 20,000 features) and then used xr.concat(datasets, dim='cell') to concatenate them because I couldn't read them all into memory at once.
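For reference, the construction described above looks roughly like the following sketch. The plate names, sizes, and the `make_plate` helper are all hypothetical, and the real datasets were far larger.

```python
import numpy as np
import xarray as xr


def make_plate(plate_id, n_cells=3, genes=("Ins1", "Ins2")):
    """Build one small per-plate dataset (all names and values are hypothetical)."""
    cells = [f"{plate_id}-cell{i}" for i in range(n_cells)]
    return xr.Dataset(
        {"counts": (("cell", "gene"), np.zeros((n_cells, len(genes)), dtype="int64"))},
        coords={
            "cell": cells,
            "gene": list(genes),
            "tissue": ("cell", ["Skin"] * n_cells),
        },
    )


# Concatenate the per-plate datasets along the shared 'cell' dimension.
plates = [make_plate(f"plate{i}") for i in range(3)]
combined = xr.concat(plates, dim="cell")

print(sorted(combined.coords))
print(combined["tissue"].dtype)
```

Checking `combined.coords` right after the concatenation, before anything is written to disk, helps separate effects of `xr.concat` from effects of the save and load round trip.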

shoyer · 2017-10-25T06:32:35Z

Hmm. I'm not sure why h5netcdf was so much slower. I suspect the default engine you used might have been scipy, which we've noticed can be significantly faster in some cases. If you have time, I would be curious how well my branch in #1648 works using engine='scipy'.

Please file a separate issue if you can put together example code that reproduces the lost coordinate issue. I would like to dig into this.
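While putting a reproduction together, a check along these lines can list which coordinates were demoted and promote them back. This is only a sketch: the dataset is made up, and `reset_coords()` is used here to simulate the reloaded file, which is an assumption about what the loaded dataset looks like rather than a reproduction of the bug.

```python
import numpy as np
import xarray as xr

# Hypothetical original dataset with one non-index coordinate ('tissue').
original = xr.Dataset(
    {"counts": (("cell", "gene"), np.zeros((2, 2), dtype="int64"))},
    coords={
        "cell": ["c1", "c2"],
        "gene": ["Ins1", "Ins2"],
        "tissue": ("cell", ["Skin", "Skin"]),
    },
)

# Stand-in for the reloaded file: reset_coords() turns non-index
# coordinates into data variables, mimicking the symptom in the report.
loaded = original.reset_coords()

# Names that were coordinates before but are not coordinates now.
lost = sorted(set(original.coords) - set(loaded.coords))

# set_coords() promotes the named data variables back to coordinates.
restored = loaded.set_coords(lost)

print("lost coordinates:", lost)
print("restored coordinates:", sorted(restored.coords))
```

In a real reproduction, `loaded` would come from `xr.open_dataset` on the saved file instead of `reset_coords()`.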

olgabot · 2017-11-01T18:33:24Z

Using v0.9.6 with engine='h5netcdf'

CPU times: user 1min, sys: 47.7 s, total: 1min 48s
Wall time: 2min 19s

Using #1648:

CPU times: user 1min 5s, sys: 54.9 s, total: 2min
Wall time: 2min 1s
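Timings like these come from IPython's `%time`. Outside a notebook, a comparable CPU versus wall-clock measurement can be sketched with the standard library. The `timed` helper and the placeholder workload are hypothetical; in practice the timed call would be the `to_netcdf` save.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # user + system CPU time of this process
    wall_start = time.perf_counter()  # elapsed wall-clock time
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, cpu, wall


# Placeholder workload standing in for the save call.
result, cpu, wall = timed(sum, range(1_000_000))
print(f"CPU {cpu:.3f}s, wall {wall:.3f}s")
```

A wall time well above the CPU time suggests the run is waiting on I/O rather than computing.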

olgabot · 2017-11-01T19:51:55Z

Posted the lost coordinate issue here: #1680

shoyer mentioned this issue Oct 21, 2017: fix to_netcdf append bug (GH1215) #1609 (Merged)

shoyer mentioned this issue Oct 23, 2017: Roundtrip unicode strings even when written as character arrays #1648 (Merged)

shoyer closed this as completed in #1648 Oct 27, 2017

kripnerl mentioned this issue Feb 3, 2021: Unicode strings unexpectedly transformed to byte strings upon open_dataset #4859 (Closed)
