decide the future of source #15

Closed
mikapfl opened this issue Feb 24, 2021 · 3 comments

Comments

@mikapfl
Member

mikapfl commented Feb 24, 2021

At the moment, we have source as a dim, but only allow a single source in a dataset. That actually gives us the worst of all worlds: only a single source, incompatibilities when doing arithmetic with different sources, and one more dimension, which always hurts somewhat.

We have three options for what to do instead:

A single source, in attrs.

Advantages:

  • One less dimension.
  • Direct arithmetic with different sources.

Disadvantages:

  • Multiple sources always have to be held in multiple datasets. When working with many sources, for example when plotting differences between sources, for loops have to be used.
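A minimal sketch of what the attrs option could look like with plain xarray. The helper `make_da`, the values, and the years are made up for illustration; only the source names come from this thread.

```python
import numpy as np
import xarray as xr


def make_da(source, values):
    # Hypothetical helper: one DataArray per source, source name kept in attrs.
    return xr.DataArray(
        np.asarray(values, dtype=float),
        dims=("time",),
        coords={"time": [2000, 2001, 2002]},
        attrs={"source": source},
    )


# Each source lives in its own object, so we need a container around them.
by_source = {
    "FAO": make_da("FAO", [1.0, 2.0, 3.0]),
    "Andrew": make_da("Andrew", [1.5, 2.5, 2.0]),
}

# Direct arithmetic works: both arrays have exactly the same dims and coords.
total = by_source["FAO"] + by_source["Andrew"]

# Comparing many sources means looping over the separate objects.
reference = by_source["FAO"]
differences = {
    name: da - reference for name, da in by_source.items() if name != "FAO"
}
```

The loop (here a dict comprehension) is the price for dropping the dimension: nothing in xarray knows that the separate arrays belong together.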

One or more sources, in dim

Advantages:

  • Select for source like for area.
  • Explicit arithmetic with different sources (e.g. da1.loc[{'source': 'FAO'}] + da2.loc[{'source': 'Andrew'}])
  • Multiple sources can be held in a single dataset. Working with many sources becomes fluent.

Disadvantages:

  • When working with a single source, the additional dimension makes representations larger, makes tabular display in PyCharm more difficult, etc.
  • Explicit arithmetic with different sources (more to type in the "working with two different sources, I know what I'm doing" use case)
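A rough sketch of the dim option, again with plain xarray and made-up numbers. It assumes a single array holding both sources; the variable names are illustrative.

```python
import numpy as np
import xarray as xr

# Both sources in one array, with "source" as a regular dimension.
da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [1.5, 2.5, 2.0]]),
    dims=("source", "time"),
    coords={"source": ["FAO", "Andrew"], "time": [2000, 2001, 2002]},
)

# Select for source just like for any other dimension.
fao = da.loc[{"source": "FAO"}]
andrew = da.loc[{"source": "Andrew"}]

# Explicit arithmetic: the source has to be picked on both sides first.
total = fao + andrew

# Differences of every source to a reference, without a for loop.
diff = da - da.loc[{"source": "FAO"}]
```

The last line is the "fluent" part: the reference is broadcast along `source`, so the result still has one row per source.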

Hybrid, both allowed

Advantages:

  • When working with one source, all advantages of the attrs solution.
  • When working with multiple sources, all advantages of the dim solution.

Disadvantages:

  • Shared functions need to consider both cases.
  • Mixing Datasets using one style with Datasets using the other style leads to surprising results.
  • Explicit conversions necessary.
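To make the "explicit conversions" point concrete, here is a sketch of what converting between the two styles might involve. `attrs_to_dim` and `dim_to_attrs` are hypothetical helpers written for this example, not existing functions of this project.

```python
import xarray as xr


def attrs_to_dim(da):
    # Hypothetical: move the source from attrs into a length-1 dimension.
    out = da.expand_dims(source=[da.attrs["source"]])
    out.attrs = {k: v for k, v in da.attrs.items() if k != "source"}
    return out


def dim_to_attrs(da):
    # Hypothetical: only possible if exactly one source is present.
    if da.sizes["source"] != 1:
        raise ValueError("cannot move more than one source into attrs")
    source = str(da["source"].values[0])
    out = da.squeeze("source", drop=True)
    out.attrs = {**da.attrs, "source": source}
    return out


da = xr.DataArray(
    [1.0, 2.0, 3.0],
    dims=("time",),
    coords={"time": [2000, 2001, 2002]},
    attrs={"source": "FAO"},
)
with_dim = attrs_to_dim(da)
back = dim_to_attrs(with_dim)
```

Every shared function would either have to call something like this up front or branch on which style it was given, which is the maintenance cost listed above.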

My gut reaction at the moment (especially now that we have da.pr.set() and therefore don't have to deal with da.loc[sel] = array[..., np.newaxis] anymore, the np.newaxis really pissed me off) would be to standardize on "one or more sources, in dim".
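For context, a small sketch of the `np.newaxis` pattern mentioned above, using plain xarray only. The area codes, the dimension order, and the values are assumptions made for the example; `da.pr.set()` is not shown because its signature is not given in this thread.

```python
import numpy as np
import xarray as xr

# A single source stored as a trailing length-1 dimension.
da = xr.DataArray(
    np.full((2, 3, 1), np.nan),
    dims=("area", "time", "source"),
    coords={
        "area": ["COL", "ARG"],
        "time": [2000, 2001, 2002],
        "source": ["FAO"],
    },
)

values = np.array([1.0, 2.0, 3.0])  # shape (3,), one value per time step

# The selected block has shape (3, 1) because of the source dimension,
# so the values need an extra trailing axis to line up with it.
da.loc[{"area": "COL"}] = values[..., np.newaxis]
```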

@JGuetschow
Contributor

I think one or more sources in dim is best. I'm a bit worried about memory use though. But we'll see.
And for displaying data, a to_dataframe() or to_interchangeformat() function would be great, so you get the table as in a CSV file without all the NaNs.

@mikapfl
Member Author

mikapfl commented Feb 24, 2021

> And for displaying data, a to_dataframe() or to_interchangeformat() function would be great, so you get the table as in a CSV file without all the NaNs.

We have this already: ds.to_dataframe() 🙂
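A short sketch of how this could look, assuming `ds.to_dataframe()` here is xarray's `Dataset.to_dataframe`. As far as I know that method keeps the NaN rows, so dropping them is shown as a separate pandas step. The variable name `CO2` and all values are made up.

```python
import numpy as np
import xarray as xr

# Two sources that cover different years, so the combined array is sparse.
ds = xr.Dataset(
    {
        "CO2": (
            ("source", "time"),
            np.array([[1.0, np.nan, 3.0], [np.nan, 2.5, np.nan]]),
        )
    },
    coords={"source": ["FAO", "Andrew"], "time": [2000, 2001, 2002]},
)

# One row per (source, time) combination, including the empty ones.
df = ds.to_dataframe()

# Drop rows where every data column is NaN to get a CSV-like table.
table = df.dropna(how="all")
```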

@mikapfl
Member Author

mikapfl commented Feb 24, 2021

More than one source is now possible.

mikapfl closed this as completed Feb 24, 2021