refactor package to remove sharp edges for schema authors/users (particularly `@row` / `Row`) and improve API #54
Conversation
I'm currently working on a branch to upgrade Onda to the intended Legolas v0.5, and I'm running into an area of Legolas' design that is really tempting to try to improve before merging this. I don't have time to write up a full explanation of the problem space now (will do so later), but suffice to say I'm tempted to reintroduce a struct-generation mechanism kind of like the old one. I'm still unclear on how/whether the field types for such a mechanism should be allowed to be parameterized, or whether this notion is actually 1:1 w/ schema declaration. I'll think on it and follow up with a better breakdown of the problem space / motivations later.
Alrighty then! I jotted down a bunch of thoughts in an attempt to break down the problem/motivations as promised. It's a bit too much content for a single comment, so I've dumped the write-up in a gist here: https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e I'm going to forge ahead and refocus this PR towards implementing the proposal put forward in that gist, and will flip this into draft mode in the meantime. My hope is to get everything updated and mergeable by EOW.
Took a little bit longer than expected, but I've now merged #55 into this PR, which implements the aforementioned proposal. I'm going to try to get CI green, then I'll flip this PR back out of draft mode. My hope is to get this merged/tagged on Thursday, modulo any major concerns brought up in review.
I think this is great! I think this is a better system than the already-quite-useful `@row` one. I do have fears about a long tail of upgrading old tables, and more specific fears about Arrow serialization still not being tested thoroughly (e.g. w/ multiple type parameters, w/ union type constraints, etc.), and some more specific comments below.
Dev ergonomics 1: Revise
Redefining a schema was actually super nice when developing. It made Legolas `Row`s like Revise-able structs, where you could do `@row("x@1", x::Int)` and then later realize you actually need a `y` field too, so you'd just put `@row("x@1", x::Int, y::Float64)` in the REPL and everything "just works" (since the type is the same and all the info is in the methods, which get invalidated).

I get that we don't want to be redefining schemas like this in package code, and don't want to be overwriting existing definitions, etc., but... if there's some way to preserve the development ergonomics, it would be super nice. However, I'm not sure how that could actually be implemented now that we are really defining individual structs...
Dev ergonomics 2: what was created by this macro
I think one issue with `@version` and `@schema` is that it's not really visible what happened when you run one of them, i.e. what did it generate? `@row` was similar, but it returned something, so you had some feedback that "OK, now I have this object and I can use it". These macros return `nothing`, so you don't get a nudge as to what was created and what you can use. (Obviously it's in the docs/tour, but it's still kind of nice to have that automatic feedback.)
In particular, the connection between `@version("example.foo@1", ...)` and `FooV1` being created feels too loose (in terms of code clarity, not actual semantics).
One option is that `@version` could return, say, `FooV1` (the type). But I worry this would encourage folks to do `const MyFoo = @version(...)` like we do with `@row`, and that would be kind of an anti-pattern here (since it's probably better to use the `FooV1` name rather than introduce another one, unless there is a good reason to).
Another semi-weird option would be to try to introspect whether the macro is running in the REPL (vs. in a package/precompilation/etc.) and, if so, print something like "FooV1 type has been created". I think that could go a long way toward making the package more approachable, even if it's a bit odd.
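As a minimal sketch of that idea (the helper name here is hypothetical, not anything Legolas actually defines), the macro expansion could consult `Base.isinteractive()`, which is `true` in the REPL and `false` during precompilation:

```julia
# Hypothetical helper that a macro expansion could call to give REPL
# feedback without polluting precompilation output; `isinteractive()`
# is true in interactive sessions (e.g. the REPL) and false otherwise.
function announce_generated_type(name::Symbol)
    isinteractive() && @info "`$name` type has been created"
    return nothing
end

# e.g. the expansion of `@version FooV1 begin ... end` could end with:
announce_generated_type(:FooV1)
```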
Or we could somehow try to get `FooV1` into the written code itself, like writing the macro as `@version(FooV1, ...)` and using `@schema` to reverse the connection, meaning that `@version(FooV1, ...)` errors unless an `@schema` has been declared so that `FooV1` can be associated to the `example.foo@1` schema. In other words, you wouldn't have any more freedom in choosing how to spell the type (which is good), but you would have to write the type name there, so then you know it exists, and when it shows up later in the code you know where it came from.
So that would look like:

```julia
@schema "example.foo" Foo

@version FooV1 begin
    a::Real
    b::String
end
```
Yeah, this does look prettier and clearer from a certain angle. The cons AFAICT: ... I think the huge pro, though, is the greppability improvement: even if you don't know what's going on with the macros, you can still grep for the type name. I think as part of this change we should also maybe just change ... let the repaint of the bikeshed begin 😁
Yeah, makes sense. I'm not sure how to cleanly support this, but I would be down if somebody wants to make a suggestion on how to do so. Also worth noting that, AFAICT, it likely wouldn't be a breaking change if we figured out how to solve this in a subsequent release.
...could always poke the relevant upstream issue (Revise + structs) folks for this one? JuliaLang/julia#40399 via timholy/Revise.jl#18 (comment)
Okay, bikeshed repainted. Retagging @ericphanson for one last review pass 🙏 If there are no major concerns, my plan is to merge by EOD tomorrow and tag the v0.5 release as stated in the OP.
Looks good to me. I would suggest we also document some update guidance & best practices, maybe in other PRs (I know you said everyone will get help upgrading but I think it still helps to have some written guidance). E.g., questions I have would be:
- when should we be making a schema extension vs. using custom columns (given constructors drop custom cols)?
  - Should custom cols only be used when the column is entirely dynamic, so it's not really possible to write a schema unless we generate it on-the-fly (which sounds like a very bad idea), or are custom columns more broadly applicable?
- since we can only check type constraints (not value constraints) without the constructors, whose responsibility is it to do that construction? The writer, the reader, or both?
  - I am guessing the reader, or maybe more specifically the caller of a function requiring that vetting, since you shouldn't really trust that the writer did it (who knows what they did)
  - if it is the reader/caller, then this also means it's possible to serialize out custom columns (otherwise they would be dropped by constructing-before-writing), and you can read them in as long as you don't do that constructing
- is it more OK to constrain function signatures by type now than before? E.g. if you want to express that you need a constructed-and-therefore-validated type as input
- is it more OK to dispatch on record types now than before? Since your package "owns" the type rather than Legolas now, and they seem to have greater semantic meaning now
  - this is kinda the same as the previous question, but IMO relying on dispatch for functionality is semantically different from e.g. having a function w/ a single method w/ a type constraint
also, unrelatedly, if we want table-level validation, and we are in the validate-by-construction framework, does that mean we need tabular types, not just row types?
i'm not sure there's a hard rule, so i'd just say "when it's useful to do so" (which i know is annoying 😁 but maybe we can figure out a more thoughtful rule at some point). one's use of schema-aware tools/systems may incentivize this in different directions
For the reader, it's all about trust; for the writer, it's all about goals. A big part of this is whether you're reading from / writing to a schema-on-read system or a (trusted) schema-on-write system. For example, if I'm a reader that is reading from a trusted schema-on-write system, i'd probably want to avoid revalidating unless i'm forced to by e.g. using a tool that is unaware/agnostic of the provenance of its inputs, and thus forces its own validation. i have some design notes written down on this topic that i'd like to put into a design docs section of the docs.
Yup, it is; but with the caveat that doing so also means biting the bullet that your code is now non-generic, and thus could "poison" otherwise generic code that callers might want to compose it with, which, in the worst case, could result in costly/burdensome conversion gymnastics (but in some cases, that might be okay).
Yeah, but maybe less so than the former one lol. Basically it's okay to do these things if you understand/accept the consequences. i personally think a more duck-typed style is more useful/less frustrating in a lot of scenarios, given the plethora of types that a valid table/row value can take on. IMO the holy grail would be true structural typing support, but that might be a bit of a lofty non-goal in julia land
either that, or have tabular validation be treated separately. i actually explored introducing a table type in this PR a little bit before going with the record type approach
agreed, i'll file an issue for this
I pulled on the thread of `Legolas.@row` to resolve #49, and accidentally unravelled the knitted sweater that was Legolas' API.

To elaborate: the overwhelming number of changes included in this single PR result from a development process that kind of resembles "`x` should be added/removed; hmm, adding/removing `x` immediately renders it obvious that `y` needs to be added/removed, or else the API will be inconsistent or have an obvious hole". Now that I know what the final puzzle looks like all put together, it'd be technically possible to break this up into more incremental PRs, but many of them would still each be breaking, and I think it'd actually worsen the churn... but even if it didn't, we don't really have the bandwidth to draw out landing these changes 😅

A lot of this PR's changes stem from the design mini-investigation undertaken here.
Note that this PR supersedes #52, but is based atop @OTDE's commits from that branch to ensure they also receive contribution assignment for this PR once it lands on `main`.

changelog
I think the best way to onboard into this PR is probably to go through its new `tour.jl`, but here is a summary of the changes:
- rm'd `Legolas.Row` in favor of the record types described in this document (see the workflow sketch after this list)
- `Legolas.Schema` -> `Legolas.SchemaVersion`, and associated name changes to more properly refer to "schema versions" where such changes are warranted. Also, the `schema_` prefix has been dropped from many of the methods that were defined against this type.
- rm'd methods against `::Type{<:Schema}` in favor of methods against `::SchemaVersion` (for similar reasons to why Base methods generally prefer `::Val{n}` over `::Type{Val{n}}`).
- more disciplined/uniform API for handling schema identifiers, including replacing `Legolas.Schema(id::AbstractString)` with `first(parse_identifier(id))`
- `Legolas.@row` has been replaced with `Legolas.@schema` (which simply declares a schema) and `Legolas.@version` (which declares a new version of a schema, and generates all associated record type definitions)
- general overhaul of `tour.jl` and the docs to both reflect the new changes and to hopefully provide a smoother / more thorough on-ramp for new users
- more comprehensive introspection/validation utilities: `required_fields`, `declaration`, `declared`, `find_violation`, `complies_with`, `accepted_field_type`
- `validate(table, schema)` -> `validate(Tables.schema(table), schema)`. This works better/more generically with `find_violation`/`complies_with`, and just yields more control to the caller in general IMO. Furthermore, this makes it harder for callers to accidentally (and incorrectly) assume that `Legolas.validate` validates row values.
- better-tailored error messages for `Legolas.read`/`Legolas.write` now that those methods don't need to punt to the removed `validate` method
- The new Legolas-generated record types are a bit "dumber" than the former `Legolas.Row`, and it's more likely that row tables constructed via these record types will have directly inferrable `Tables.Schema`s. Thus, we've removed a small bit of Legolas' schema-guessing code in a manner that removes a bit of responsibility from `Legolas.read`/`Legolas.write` and makes them more efficient. The tradeoff is that these functions now have a higher expectation that callers will provide them with tables whose `Tables.Schema`s are inferrable via `Tables.schema`.
- expected properties of schema version declarations are now better validated/enforced:
  - referencing an unknown schema now surfaces an `UnknownSchemaError` (#51)
  - Previously, one could redeclare e.g. `foo@1` with an additional required field compared to its original declaration, which would de facto cause `Legolas.validate` to require the field for `foo@1`'s child schemas. Since there was little enforcement of actual parent/child field compatibility in the first place, it was possible to redefine schema versions in a manner that rendered them incompatible with their preexisting children. Now that we seek to better enforce parent/child schema version compatibility, we should strive to disallow this kind of accidental invalidation. We achieve this by bluntly disallowing the alteration of preexisting schema version declarations via redeclaration entirely, because a) there's not currently a mechanism for validating parent declarations against preexisting child declarations, and b) the ability to alter preexisting schema version declarations is of questionable utility in the first place.
- resolves "`Legolas.Row` type aliases impede ergonomic deprecation cycles" (#49)
- obviates "throw better error messages for defensive implicit `AbstractRecord` `convert` failures" (#50)
- resolves "Add function that returns a Schema's cols" (#41)
- resolves "Columns of extension schemas should go after columns of parent schemas" (#12)
- more comprehensive unit tests (though I'm sure there are more test cases we should add over time)
- bumps the minimum supported Julia version to 1.6, which is the latest LTS minor version, just to avoid the maintenance overhead of dealing with Julia 1.3 compatibility. If any user actually requires Julia 1.3 support, please speak up and I'll reconsider.
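To make the new surface area concrete, here's a minimal workflow sketch assembled from the items above; the macro spelling matches the adopted proposal, while details like the keyword-argument record constructor and the exact `complies_with` call shape follow the tour and should be treated as illustrative rather than definitive:

```julia
using Legolas, Tables
using Legolas: @schema, @version

# Declare the schema, then declare version 1 of it; `@version` generates
# the `FooV1` record type (together, these replace the old `@row`).
@schema "example.foo" Foo

@version FooV1 begin
    a::Real
    b::String
end

# Construct a record; field types are checked at construction time.
record = FooV1(; a = 1.5, b = "hello")

# Obtain the `SchemaVersion` from its identifier string, per the new API.
sv = first(Legolas.parse_identifier("example.foo@1"))

# Validate a table's `Tables.Schema` against the schema version, or ask
# for a yes/no answer via the new introspection utilities.
Legolas.validate(Tables.schema([record]), sv)
Legolas.complies_with(Tables.schema([record]), sv)
```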
Breaking Change Summary
- rm'd `Legolas.Row` in general. Note that, unless we add a deprecation path, the new Legolas version will not be able to automatically deserialize values that were previously serialized via the former `Legolas.Row` ArrowTypes overloads. Note, though, that this probably doesn't affect most Legolas-serialized data, unless e.g. you had a nested `Legolas.Row` value inside of a column of a table.
- `Legolas.@row` -> `Legolas.@schema` + `Legolas.@version` (see the migration sketch after this list)
- `Legolas.Schema` -> `Legolas.SchemaVersion`, and associated method name changes
- methods against `::Type{<:Schema}` -> methods against `::SchemaVersion`
- `Schema(x::AbstractString)` -> `first(parse_identifier(x))` (the previous method allowed, but did not return, parent identifiers, and was buggy to boot, allowing invalid parent identifiers)
- dropped the `schema_` prefix from `SchemaVersion` accessor methods
- `schema_qualified_string` -> `identifier`
- `validate(table, schema)` -> `validate(Tables.schema(table), schema)`
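As a rough before/after migration sketch (the old `@row` spelling follows the examples earlier in this thread; the schema and field names are purely illustrative):

```julia
using Legolas: @schema, @version

# Before (Legolas v0.4), one macro declared the schema version and
# bound a row-type alias in a single step:
#
#     const Foo = @row("example.foo@1", a::Real, b::String)
#
# After (Legolas v0.5), the declaration is split in two, and a concrete
# `FooV1` record type is generated instead of a `Legolas.Row` alias:
@schema "example.foo" Foo

@version FooV1 begin
    a::Real
    b::String
end
```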
Where to go from here
Note this PR bumps the version to v0.5, not v1.0, because there are still some obvious breaking changes to be made later:

- rm `Legolas.write_full_path` and related behavior that uses it (#31)
- support pluggable/arbitrary (de)serialization formats/targets (CSV, JSON, YAML, etc.) (#34)
- possible changes to currently-exported utilities (e.g. `gather`)

My plan is to merge this PR without implementing automated deprecation paths. I think the breakage in this release (esp. w.r.t. the `Legolas.Row` removal) is sufficiently nontrivial that any user who'd like to upgrade will probably need to actually manually upgrade, rather than fully rely on automated deprecation pathways. If it's worthwhile, we can add in depwarns for more trivial cases (e.g. simple method renames) in follow-up patch releases, but I'm reluctant to attempt more complex deprecation pathways that would have to handle a lot of edge cases in order to be correct (and thus have a higher risk of being subtly buggy).

Note that Beaconeers aren't expected to upgrade to v0.5 internally without dedicated support from the relevant team members (if you have a question about this, feel free to ping me in Beacon's internal Slack). So feel free to upgrade "early" if you want to for your use case, but keep in mind that it's not a requirement.