decide the future of `source` #15

mikapfl · 2021-02-24T09:52:48Z

At the moment, `source` is a dimension, but we only allow a single source per dataset. That gives us the worst of all worlds: only a single source, incompatibilities when doing arithmetic with different sources, and an extra dimension, which always hurts somewhat.
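To make the incompatibility concrete, here is a minimal sketch. It assumes the data is held in plain xarray `DataArray`s with a length-one `source` dimension; the area codes and values are made up for illustration. With xarray's default inner-join alignment, adding two arrays whose `source` labels differ leaves nothing along that dimension:

```python
import xarray as xr

# Two arrays that each carry exactly one source as a length-one dimension.
# Area codes and values are illustrative only.
da1 = xr.DataArray(
    [[1.0, 2.0]],
    coords={"source": ["FAO"], "area": ["DEU", "FRA"]},
    dims=("source", "area"),
)
da2 = xr.DataArray(
    [[1.5, 1.0]],
    coords={"source": ["Andrew"], "area": ["DEU", "FRA"]},
    dims=("source", "area"),
)

# Arithmetic aligns on the "source" labels. "FAO" and "Andrew" share none,
# so the default inner join produces an empty "source" dimension.
total = da1 + da2
print(total.sizes["source"])
```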

We have three options for what to do instead:

A single source, in `attrs`.

Advantages:

One less dimension.
Direct arithmetic with different sources.
Disadvantages:
Multiple sources always have to be held in multiple datasets. When working with many sources, for example when plotting differences between sources, for loops have to be used.
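A rough sketch of what the `attrs` option could look like with plain xarray. The helper `make_array`, the dictionary `by_source`, and all values are hypothetical and only serve the illustration:

```python
import xarray as xr

AREAS = ["DEU", "FRA"]  # illustrative area codes


def make_array(source, values):
    """Hypothetical helper: one source per array, recorded in attrs."""
    return xr.DataArray(
        values,
        coords={"area": AREAS},
        dims=("area",),
        attrs={"source": source},
    )


# Multiple sources have to live in separate objects.
by_source = {
    "FAO": make_array("FAO", [1.0, 2.0]),
    "Andrew": make_array("Andrew", [1.5, 1.0]),
    "Other": make_array("Other", [0.5, 2.5]),
}

# Direct arithmetic works: attrs play no role in alignment.
diff = by_source["FAO"] - by_source["Andrew"]

# Comparing many sources against a reference needs a Python-level loop.
reference = by_source["FAO"]
diffs = {}
for name, da in by_source.items():
    if name != "FAO":
        diffs[name] = da - reference
```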

One or more sources, in `dim`

Advantages:

Select for source like for area.
Explicit arithmetic with different sources (e.g. `da1.loc[{'source': 'FAO'}] + da2.loc[{'source': 'Andrew'}]`)
Multiple sources can be held in a single dataset. Working with many sources becomes fluent.
Disadvantages:
When working with a single source, the additional dimension makes representations larger, makes tabular display in PyCharm more difficult, etc.
Explicit arithmetic with different sources (more to type in the "working with two different sources, I know what I'm doing" use case)
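For comparison, a minimal sketch of the `dim` option, again with plain xarray and made-up values. Selecting a single source turns the `source` dimension into a scalar coordinate, so the explicit arithmetic from the list above works, and a comparison of all sources against a reference becomes a single expression:

```python
import xarray as xr

# All sources held in one array; values are illustrative only.
da = xr.DataArray(
    [[1.0, 2.0], [1.5, 1.0], [0.5, 2.5]],
    coords={"source": ["FAO", "Andrew", "Other"], "area": ["DEU", "FRA"]},
    dims=("source", "area"),
)

# Select for source like for area.
fao = da.loc[{"source": "FAO"}]
andrew = da.loc[{"source": "Andrew"}]

# Explicit arithmetic: both operands only have the "area" dimension left.
# xarray drops the conflicting scalar "source" coordinate from the result.
diff = fao - andrew

# All sources against one reference, without a loop. The scalar "source"
# coordinate is dropped from the reference so it broadcasts over all sources.
reference = da.loc[{"source": "FAO"}].drop_vars("source")
diffs = da - reference
```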

Hybrid, both allowed

Advantages:

When working with a single source, all advantages of the attrs solution.
When working with multiple sources, all advantages of the dim solution.
Disadvantages:
Shared functions need to consider both cases.
Mixing Datasets using one style with Datasets using the other style leads to surprising results.
Explicit conversions necessary.
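The explicit conversions the hybrid option would require might look roughly like this. Both function names are hypothetical and the sketch uses plain xarray only; it is not taken from the project's code:

```python
import xarray as xr


def source_to_dim(da):
    """attrs style -> dim style (hypothetical helper)."""
    attrs = dict(da.attrs)
    source = attrs.pop("source")
    out = da.expand_dims(source=[source])
    out.attrs = attrs
    return out


def source_to_attrs(da):
    """dim style -> attrs style; only possible for exactly one source."""
    if da.sizes["source"] != 1:
        raise ValueError("cannot move more than one source into attrs")
    source = da["source"].item()
    out = da.squeeze("source", drop=True)
    out.attrs = {**da.attrs, "source": source}
    return out


attrs_style = xr.DataArray(
    [1.0, 2.0],
    coords={"area": ["DEU", "FRA"]},
    dims=("area",),
    attrs={"source": "FAO"},
)
dim_style = source_to_dim(attrs_style)
round_trip = source_to_attrs(dim_style)
```

Every shared function would have to call one of these (or branch on the style) before doing any work, which is the cost named in the first disadvantage.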

My gut reaction at the moment (especially now that we have `da.pr.set()` and therefore don't have to deal with `da.loc[sel] = array[..., np.newaxis]` anymore, the `np.newaxis` really pissed me off) would be to standardize on "one or more sources, in dim".
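For readers who have not run into the `np.newaxis` annoyance, a small sketch with plain xarray and made-up values. It only shows the manual pattern; `da.pr.set()` is not shown because its signature does not appear in this thread:

```python
import numpy as np
import xarray as xr

# An array with a length-one "source" dimension, initially all NaN.
da = xr.DataArray(
    np.full((3, 1), np.nan),
    coords={"area": ["DEU", "FRA", "ITA"], "source": ["FAO"]},
    dims=("area", "source"),
)

sel = {"area": ["DEU", "FRA"], "source": ["FAO"]}
new_values = np.array([1.0, 2.0])  # one value per selected area

# The list-based selection keeps the length-one "source" axis, so the target
# has shape (2, 1) and the 1-D values need a trailing axis to match it.
da.loc[sel] = new_values[..., np.newaxis]
```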

JGuetschow · 2021-02-24T10:10:13Z

I think one or more sources in dim is best. I'm a bit worried about memory use though. But we'll see.
And for displaying data, a `to_dataframe()` or `to_interchangeformat()` function would be great, so you get the table as in a CSV file without all the NaNs.

mikapfl · 2021-02-24T10:13:04Z

> And for displaying data, a `to_dataframe()` or `to_interchangeformat()` function would be great, so you get the table as in a CSV file without all the NaNs.

We have this already: `ds.to_dataframe()` 🙂
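A minimal sketch of the tabular view, using xarray's `Dataset.to_dataframe()` on made-up data. The variable name `CO2` and the values are illustrative. Dropping the all-NaN rows with pandas' `dropna` is an extra step assumed here, not something stated in the thread:

```python
import numpy as np
import xarray as xr

# One source has no value for one area, which shows up as NaN.
ds = xr.Dataset(
    {"CO2": (("source", "area"), [[1.0, 2.0], [1.5, np.nan]])},
    coords={"source": ["FAO", "Andrew"], "area": ["DEU", "FRA"]},
)

# One row per (source, area) combination, including the empty one.
full = ds.to_dataframe()

# Assumption: drop rows where every data column is NaN to get a compact table.
compact = full.dropna(how="all")
print(compact)
```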

mikapfl · 2021-02-24T10:54:49Z

More than one source is now possible.

mikapfl closed this as completed Feb 24, 2021
