Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a few concerns about metadata methods #52

Closed
ExpandingMan opened this issue Sep 27, 2022 · 8 comments · Fixed by #53
Closed

a few concerns about metadata methods #52

ExpandingMan opened this issue Sep 27, 2022 · 8 comments · Fixed by #53

Comments

@ExpandingMan
Copy link

ExpandingMan commented Sep 27, 2022

I've started implementing metadata and colmetadata for Parquet2.jl. I have a few thoughts, sorry for not bringing this up when this was being discussed but there was a lot of conversation and I tuned out at some point.

  • There is currently no way as part of the API to fetch with a default such as in Base.get. In many situations this means there is no way of fetching data without at least 2 lookups.
  • There isn't a clean way in the API of fetching all metadata, one would have to do something like Dict(k=>metadata(x, k) for k \in metadatakeys(x)) which seems a bit awkward, especially considering that in many cases the object is probably just sitting there in the first place and shouldn't have to be reconstructed.
  • I'm not sure if this is a problem, but I thought I'd point out that Tables.jl supports cases where the relationship between an object and its colmetadata is more complicated than this API suggests. For example, in Parquet2 a Dataset is a table that has columns which are concatenations of sub-columns which belong ti sub-tables (which are also Tables.jl tables) called RowGroup. It's therefore not possible to define colmetadata on Dataset because it would be ambiguous which column metadata should be used (or whether it would be appropriate to merge them). This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what colmetadata should do.
  • Defining ArgumentError fallbacks seems a bit dubious. These clearly should be MethodError if there is no reasonable fallback. The most obvious consequence of this is that error handling routines might catch a wrong error here. Nothing else immediately comes to mind, though I do vaguely remember somebody writing a blog post at some point describing why this pattern leads to trouble... I'd also be a little worried about it making method ambiguity cases worse.

I realize that opening this issue might seem like more of an annoyance than anything else since the ship has sailed and now we'd have to deal with breakage. However there might still be room to add a few methods such as, perhaps

metadata(x)
metadata(x, k, default)
metadata!(x, k, default)
metadata!(f, x, k)

Perhaps it's already fine for packages to include these but in that case perhaps it should be documented.

@ExpandingMan
Copy link
Author

I know this is abusing the interface, but very little since Int is unnecessarily restrictive. Case in point about defining the methods

  Expression: Parquet2.colmetadata(ds[1], :ints, "mac") ≡ nothing
  MethodError: colmetadata(::Parquet2.RowGroup{Parquet2.FileManager{FilePathsBase.PosixPath}}, ::Symbol, ::String) is ambiguous. Candidates:
    colmetadata(::T, ::Symbol, ::AbstractString; style) where T in DataAPI at /home/expandingman/.julia/packages/DataAPI/pEtqn/src/DataAPI.jl:378
    colmetadata(rg::Parquet2.RowGroup, col::Union{Integer, Symbol}, k::AbstractString; style) in Parquet2 at /home/expandingman/.julia/dev/Parquet2/src/schema.jl:1308
  Possible fix, define
    colmetadata(::T, ::Symbol, ::AbstractString) where T<:Parquet2.RowGroup

@ExpandingMan
Copy link
Author

This is implemented in Parquet2.jl. I wanted to point it out before I tag to give you guys a chance to object to my extra methods.

See here, here and here.

@bkamins
Copy link
Member

bkamins commented Sep 28, 2022

currently no way as part of the API to fetch with a default such as in Base.get.

We considered this. As you have commented it is currently doable by double lookup. Therefore we decided that this functionality can be added later in a non-breaking way if users report that this is needed.

If we add it I would propose that this is handled by the default keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default should be returned.

We can add it now if we want. @nalimilan - what do you think?

There isn't a clean way in the API of fetching all metadata

This is the same reason as above. It can be added later in a non-breaking way if it is needed. We can define metadata(obj; style::Bool) and colmetadata(obj, col; style::Bool) as functions which are required to return readable iterators of key => value pairs or key => (value, style) pairs (depending on the style kwarg). In particular it can be just returning the object from the parent, but does not have to be.
Also colmetadata(obj; style::Bool) can be added that returns a readable iterator of col => colmetadata(obj, col; style=style) vaules.

We can add it now if we want. @nalimilan - what do you think?

This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what colmetadata should do.

Tables.jl defines:

columns = Tables.columns(x)
column = Tables.getcolumn(columns, col)

the metadata returned by colmetadata should be metadata that corresponds to the returned column since this is what Tables.jl user sees. Now, indeed it is a hard choice what to do if internally package creates the object returned in Tables.getcolumn(columns, col) by merging different columns that can have different metadata (as this is what happens in Parquet if I understand correctly - right?). However, I think that the decision what to do in this case should be made on package level. User should see what metadata is reasonably assigned to Tables.getcolumn(columns, col). I think, as you commented, that doing metadata merging should be done in this case on package level (a natural way would be to return only the metdata that is defined in the same way in all chunks)

Defining ArgumentError fallbacks seems a bit dubious. These clearly should be MethodError if there is no reasonable fallback.

Maybe you are right. Defining the methods in DataAPI.jl indeed leads to method ambiguities if methods in a package implementing the interface are not implemented carefully enough (as you have noticed, I had the same problem in DataFrames.jl).

Fortunately, following semver rules changing exception type is not considered breaking. Therefore, if we decide to remove the default implementations in the next release this will not be a breaking change. However, I would say that this is the most urgent decision to be made, because while this is non breaking still it is better to change the exception type sooner than later if we want to go this way.

Again, @nalimilan - what do you think?

metadata!(x, k, default)
metadata!(f, x, k)

I am not clear what you propose here (these are methods for setting metadata as they have ! suffix) - was this a typo?

I know this is abusing the interface, but very little since Int is unnecessarily restrictive.

Int is required, because Tables.jl requires Int only and does not allow any Integer (e.g. it does not allow Bool).
However, I guess, when we remove the default implementations, as you propose above, then these ambiguities will be resolved without having to add extra methods.


In summary:

  • I think adding default is OK, but I would add it as a keyword argument; the decision to be made is if we add it to DataAPI.jl now, or we add it later
  • I think adding methods for retrieving all metadata using metadata(obj; style), colmetadata(obj, col; style), and colmetadata(obj; style) are OK and they should be required to return readable iterators (the point is that it should not return mutable internal data structures as returning them can corrupt the contents if user by mistake mutates them). We just need to decide if in this case we want to allow passing style kwarg or not (we can add this option later); again - the question is if we want to add it to DataAPI.jl now or later.
  • I think changing ArgumentError to MethodError by un-defining default methods is acceptable. It will make defining methods in packages easier by avoiding dispatch ambiguities.

But let us wait a bit for others to comment.

@nalimilan
Copy link
Member

I'm fine with dropping fallback methods if they create ambiguities. (I think there's now a mechanism to print a custom message for certain MethodErrors, though that's probably overkill here and could be added later anyway.)

Other additions seem fine, we can add it as soon as somebody needs them (not sure whether @ExpandingMan is in that position or not).

metadata!(x, k, default)
metadata!(f, x, k)

I am not clear what you propose here (these are methods for setting metadata as they have ! suffix) - was this a typo?

IIUC these would be equivalent to get!. And indeed if we want to support the do syntax f has to be the first positional argument, not a keyword argument.

(I'm still kind of afraid that we are reinventing the AbstractDict API and that it would have been simpler to require implementations to expose an AbstractDict type to users instead, but well...)

@bkamins
Copy link
Member

bkamins commented Sep 28, 2022

The problem with AbstractDict API is that it is not well documented and thus difficult to implement correctly. I would not add these get!-equivalent methods for now (I do not think they are very useful in metadata context). I am OK with convenient get-equivalent. Also I would not support the do syntax.

@bkamins
Copy link
Member

bkamins commented Sep 28, 2022

@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.

@ExpandingMan
Copy link
Author

Alright, so it sounds like we are mostly in agreement concerning the points I made in my OP. I suppose there's nothing to be done in the general case about my 3rd point about the relationship between tables and columns not always being so starightforward. Indeed, it makes things pretty ugly in Parquet2.jl but I don't see what the alternative would be.

@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.

I completely agree with this approach and it is arguably the standard approach to creating interfaces in Julia, but there weren't any implemented fallback methods whatsoever so I was a bit puzzled about what the intentions were. Perhaps they simply were not added yet.

If we add it I would propose that this is handled by the default keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default should be returned.

That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case). So, if you were to e.g. choose nothing, should nothing always be the default? Then how do you get back the error-throwing behavior?

I am more convinced than ever that the methods should be removed so that it gives MethodError rather than ArgumentError. Furthermore, as I understand it, the reason that there are e.g. Base.getindex methods that take fully-specified Int arguments is because Base also provides a bunch of generic methods so you want it to pick the Int methods when possible. In this case, we don't have a bunch of generic methods, so choosing Int for the methods in DataAPI is inappropriate. I suppose this is a moot point if they are removed.

@bkamins
Copy link
Member

bkamins commented Sep 29, 2022

That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case).

That is why I have written in #53 that I have thought how to do it as I had exactly the same first reaction 😄. The solution would be to define a special type struct DataAPI.NoDefaultMetadata end that would be used as a default and then would throw an error. I.e. the default would be default=DataAPI.NoDefaultMetadata().

@nalimilan - do you think positional or keyword argument should be preferred for default value.
Pros of default as positional: it is consistent with dictionary API and a bit shorter to type
Pros of default as keyword: it is more explicit in the code what the passed argument means (especially for readers that are not using metadata API often - which is likely the case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants