-
@e-lo, don't overestimate my influence here; I am only one of several folks all participating in a working group to move the data package standard forward. Decisions about what is added to the standard are based on consensus of the group. Also, note that relationships between fields were added as a recipe, but are not part of the official standard (yet). Your example above is an excellent one, and one that especially resonates with me as an avid user of public transportation and apps that rely on these feeds. Many of the use cases you list (but perhaps not all) can be handled with existing functionality, such as row constraints (of which I know you are aware). I use row constraints all the time in my work, for which they provide constraint checking and documentation that is relatively human readable. Other types of constraints can be specified and checked with a custom validation check. Neither of these is language-agnostic, however. IIUC, you are advocating for something that would be language-agnostic, more standardized, and easier to use (i.e., if it were part of the standard). IMO those are all important considerations. My only concerns would be:
In sum, it's the members of the working group that you should focus on "convincing," and the way to do that would be to propose something specific. FWIW, I totally agree with you that "We should leverage existing language in these existing implementations if possible while also making sure that validation error code outcomes are legible." I would be supportive of something that does this while addressing the two concerns listed above. I would also strongly suggest reaching out to the working group to see if someone wants to collaborate with you on developing a specific proposal, as you might get some excellent feedback and would then be well positioned to advocate for it.
-
Following @pschumm's suggestion to propose something specific, here goes a straw person!
Field Scan
Implement through language-specific code:
Implement through a "formula":
I do think that json-schema's namespaces are good - though its implementation of conditional logic is not. Note that there are a zillion other data definition languages for defining constraints, but they are mainly focused on type and shape relationships rather than logical constraints.
Implementation Strategy
Initial Examples
NOTE: this is a WIP but I'm saving incrementally. Thoughts welcome!
Field value depends on value of another field in record.
fields:
  - name: arrival_time
    type: string
    format: time
    constraints:
      required: true
      pattern: "^([0-9]{1,2}:[0-9]{2}:[0-9]{2})$"
      logical:
        - description: "stop_times.arrival_time must be provided for records where timepoint == 1"
          groupBy: record
          condition: timepoint == 1
          constraints:
            value != ""
    description: "Arrival time at a specific stop for a specific trip on a specific day."
  - name: timepoint
    type: integer
    constraints:
      required: false
      enum: [0, 1]
    description: "Indicates whether the specified arrival and departure times for a stop are strictly adhered to by the vehicle."
primaryKey: ["trip_id", "stop_sequence"]
missingValues: [""]
constraints: # table-level constraints which must evaluate to true to pass validation
  logical:
    - groupBy: record
      condition: arrival_time <= departure_time
      constraints:
        - true
      description: stop_times.arrival_time <= stop_times.departure_time
Field value depends on a group-by attribute in the table.
fields:
  - name: trip_id
    type: string
    constraints:
      required: true
    description: "Identifies a trip."
  - name: arrival_time
    type: string
    format: time
    constraints:
      required: true
      pattern: "^([0-9]{1,2}:[0-9]{2}:[0-9]{2})$"
    description: "Arrival time at a specific stop for a specific trip on a specific day."
  - name: stop_sequence
    type: integer
    constraints:
      required: true
      minimum: 0
    description: "Order of the stops for a particular trip."
primaryKey: ["trip_id", "stop_sequence"]
missingValues: [""]
constraints:
  logical:
    - description: "stop_times.arrival_time must be provided for the min and max values of stop_times.stop_sequence within a group defined by trip_id"
      groupBy: record
      condition: stop_sequence ==
        max(groupby("trip_id").where("group.trip_id"=="trip_id")) # this is getting messy. suggestions welcome
        OR
        min(groupby("trip_id").where("group.trip_id"=="trip_id"))
      constraints:
        value != ""
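To make the intent of the group-by rule concrete while the declarative syntax is still unsettled, here is a minimal Python sketch of the check itself. It is only an illustration, not a proposed implementation: the function name `missing_endpoint_arrivals` and the list-of-dicts row layout are assumptions, and `stop_sequence` is assumed to be already parsed as an integer.

```python
from collections import defaultdict


def missing_endpoint_arrivals(rows):
    """Return (trip_id, stop_sequence) for first/last stops of a trip lacking arrival_time.

    Hypothetical helper: rows is a list of dicts keyed by field name.
    """
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row["trip_id"]].append(row)

    failures = []
    for trip_id, group in by_trip.items():
        sequences = [r["stop_sequence"] for r in group]
        endpoints = {min(sequences), max(sequences)}
        for r in group:
            # An empty string or None counts as "not provided".
            if r["stop_sequence"] in endpoints and not r.get("arrival_time"):
                failures.append((trip_id, r["stop_sequence"]))
    return failures


rows = [
    {"trip_id": "A", "stop_sequence": 1, "arrival_time": "08:00:00"},
    {"trip_id": "A", "stop_sequence": 2, "arrival_time": ""},
    {"trip_id": "A", "stop_sequence": 3, "arrival_time": ""},
    {"trip_id": "B", "stop_sequence": 0, "arrival_time": "09:00:00"},
    {"trip_id": "B", "stop_sequence": 5, "arrival_time": "09:30:00"},
]
print(missing_endpoint_arrivals(rows))  # [('A', 3)]
```

Whatever syntax ends up in a schema would need to express the three pieces visible here: the grouping key, the selector within the group (min and max of `stop_sequence`), and the requirement on the selected records.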
-
Colleagues of mine have worked on this. They have written the Furthermore, there is the VTL (https://cros.ec.europa.eu/book-page/vtl) language (Validation and Transformation Language) by Eurostat. The nice thing is that it is a generic language for specifying these kinds of rules. The not-so-nice thing is that it is apparently a bit of a mess (according to colleagues), and there are various implementations but no complete one.
-
I think there are good reasons why schema languages such as XML Schema and SHACL do not support arbitrary conditions across data elements. The best thing to compare with and build on is SQL CHECK constraints. Spreadsheets also allow for validation. In any case, this issue requires defining a syntax for expressions such as
@e-lo wrote:
Yes, it's getting messy, so better not to invent our own syntax but to adopt one that someone has already cleaned up and implemented.
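As a rough illustration of the SQL CHECK comparison, the sketch below expresses the two record-level GTFS rules from this thread as CHECK constraints, using Python's built-in `sqlite3` module. The column names `arrival_secs` and `departure_secs` and the helper `accepts` are assumptions made for the example; times are assumed to be stored as seconds so that comparison is numeric.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stop_times (
        trip_id        TEXT    NOT NULL,
        stop_sequence  INTEGER NOT NULL,
        arrival_secs   INTEGER,           -- assumed: time stored as seconds
        departure_secs INTEGER,
        timepoint      INTEGER CHECK (timepoint IN (0, 1)),
        PRIMARY KEY (trip_id, stop_sequence),
        -- arrival time is required where timepoint == 1
        CHECK (COALESCE(timepoint, 0) != 1 OR arrival_secs IS NOT NULL),
        -- arrival must not be later than departure when both are given
        CHECK (arrival_secs IS NULL OR departure_secs IS NULL
               OR arrival_secs <= departure_secs)
    )
    """
)


def accepts(row):
    """Return True if the row satisfies every constraint, False otherwise."""
    try:
        conn.execute("INSERT INTO stop_times VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(accepts(("A", 1, 28800, 28860, 1)))  # True
print(accepts(("A", 2, None, None, 1)))    # False: timepoint == 1 without arrival
print(accepts(("A", 3, 29000, 28900, 0)))  # False: arrival after departure
```

CHECK expressions cover conditions within a single record. SQLite does not allow subqueries inside CHECK, so the group-by and cross-table cases discussed here would need another mechanism.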
-
As somebody who routinely uses data standards that have conditionality between fields, records, and tables, I'd really appreciate it if these conditions could be defined in a language-agnostic manner that code can consume and implement consistently.
Example Use Cases
I am using GTFS as my primary use case here because it is important to many people across the world in their daily lives - there are many other examples. These use cases vary widely in implementation complexity.
Across Fields
Field value depends on value of another field in record.
Examples:
stop_times.arrival_time must be provided for records where timepoint == 1.
stop_times.arrival_time <= stop_times.departure_time
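For illustration, these two record-level rules could be checked in language-specific code roughly as follows. This is a sketch under assumptions: `to_seconds` and `check_record` are hypothetical names, a record is a plain dict, and `timepoint` is already an integer. Times are converted to seconds because GTFS times may have a single-digit hour or exceed 24:00:00, so comparing the raw strings is unreliable.

```python
def to_seconds(hms):
    """Convert an H:MM:SS or HH:MM:SS string to seconds (hours may exceed 24)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_record(row):
    """Return a list of human-readable errors for one stop_times record."""
    errors = []
    arrival = row.get("arrival_time") or ""
    departure = row.get("departure_time") or ""

    # Rule 1: arrival_time is conditionally required.
    if row.get("timepoint") == 1 and arrival == "":
        errors.append("arrival_time is required when timepoint == 1")

    # Rule 2: only comparable when both values are present.
    if arrival and departure and to_seconds(arrival) > to_seconds(departure):
        errors.append("arrival_time must be <= departure_time")

    return errors


print(check_record({"arrival_time": "", "departure_time": "08:05:00", "timepoint": 1}))
# ['arrival_time is required when timepoint == 1']
print(check_record({"arrival_time": "9:10:00", "departure_time": "10:00:00", "timepoint": 0}))
# []
```

Returning messages that name the field and the rule is one way to keep validation outcomes legible, which is harder to do when a rule is buried in a generic schema combinator.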
Across Fields and Records
Field value depends on a group-by attribute in the table.
Example:
stop_times.arrival_time must be provided for the min and max values of stop_times.stop_sequence within a group defined by trip_id.
Field value should increase as another one does.
Example:
stop_times.arrival_time and stop_times.departure_time, if provided, should not decrease within a given trip_id when ordered by stop_times.stop_sequence.
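A rough sketch of the "should not decrease" rule in Python is given below. It rests on one interpretation, stated here as an assumption: arrival and departure times are treated as a single interleaved sequence per trip (arrival, then departure, at each stop), and each provided value is compared with the latest valid time seen so far. The names `to_seconds` and `decreasing_times` are hypothetical.

```python
from collections import defaultdict


def to_seconds(hms):
    """Convert an H:MM:SS or HH:MM:SS string to seconds (hours may exceed 24)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def decreasing_times(rows):
    """Return (trip_id, stop_sequence, field) for each time that goes backwards."""
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row["trip_id"]].append(row)

    failures = []
    for trip_id, group in by_trip.items():
        latest = None  # latest valid time seen so far in this trip
        for r in sorted(group, key=lambda r: r["stop_sequence"]):
            for field in ("arrival_time", "departure_time"):
                value = r.get(field) or ""
                if value == "":
                    continue  # the rule only applies to provided values
                seconds = to_seconds(value)
                if latest is not None and seconds < latest:
                    failures.append((trip_id, r["stop_sequence"], field))
                else:
                    latest = seconds
    return failures


rows = [
    {"trip_id": "A", "stop_sequence": 3, "arrival_time": "07:59:00", "departure_time": "08:10:00"},
    {"trip_id": "A", "stop_sequence": 1, "arrival_time": "08:00:00", "departure_time": "08:01:00"},
    {"trip_id": "A", "stop_sequence": 2, "arrival_time": "", "departure_time": ""},
]
print(decreasing_times(rows))  # [('A', 3, 'arrival_time')]
```

Compared with the record-level rules, this one needs an ordering key in addition to a grouping key, which is worth keeping in mind for any declarative syntax.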
Between Tables
Table required if another table is not provided (or has no records).
Example
calendar_dates is required if calendars is not.
Between a Table and a Field in another Table
Table forbidden if field exists in another table (that is non-null).
Example
networks.txt is conditionally forbidden if routes.network_id exists.
Between Fields in different Tables
Requirement for a record identified in another table by a foreign key.
Example
pathways.from_stop_id and pathways.to_stop_id must reference a record in stops where stops.location_type != 1
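The three cross-table cases can be sketched in Python as below, purely to show what a validator has to know for each. Everything about the shape is an assumption: `check_feed` is a hypothetical function, a feed is a dict mapping table names to lists of row dicts, the calendar table is keyed as `calendar`, an absent or empty table counts as "not provided", and `location_type` is already an integer.

```python
def check_feed(tables):
    """Return a list of error messages for cross-table rules in a feed."""
    errors = []

    # Table required if another table is not provided (or has no records).
    if not tables.get("calendar") and not tables.get("calendar_dates"):
        errors.append("calendar_dates is required when calendar is not provided")

    # Table forbidden if a field in another table is populated.
    routes = tables.get("routes", [])
    if tables.get("networks") and any(r.get("network_id") for r in routes):
        errors.append("networks is forbidden when routes.network_id is used")

    # Foreign key with a condition on the referenced record.
    allowed = {
        s["stop_id"] for s in tables.get("stops", []) if s.get("location_type") != 1
    }
    for pathway in tables.get("pathways", []):
        for field in ("from_stop_id", "to_stop_id"):
            if pathway.get(field) not in allowed:
                errors.append(
                    f"pathways.{field}={pathway.get(field)!r} must reference "
                    "a stop with location_type != 1"
                )
    return errors


feed = {
    "calendar_dates": [{"service_id": "WK", "date": "20240101"}],
    "routes": [{"route_id": "R1", "network_id": "N1"}],
    "networks": [{"network_id": "N1"}],
    "stops": [
        {"stop_id": "S1", "location_type": 0},
        {"stop_id": "STN", "location_type": 1},
    ],
    "pathways": [{"pathway_id": "P1", "from_stop_id": "S1", "to_stop_id": "STN"}],
}
for message in check_feed(feed):
    print(message)
```

Each rule here reads state from more than one resource, so a declarative version would have to live at the package level rather than inside a single table schema.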
NOTE: In many of these use cases, we can easily observe that the data model itself is far from ideal and that the conditional requirements could be met more legibly and efficiently by redefining the data model. It's nice in theory, but in practice you can't ask 10,000 transit agencies, all of their technology vendors, and the myriad consumer applications to change. Thus standards like GTFS have a pretty firm (though not completely solid) backwards-compatibility requirement, which creates some (very) awkward data modeling.
Community Need
This functionality seems to have fairly widespread demand as indicated by:
json-schema
row_constraint in Frictionless Framework
pandera (which can import frictionless schemas) and pydantic
gtfs, which is the data specification that 10,000+ transit agencies use to share their schedules with trip planning applications like Google and Apple Maps.
What Currently Happens
In the absence of a clear, code-legible but language-agnostic method to define requirements across multiple fields (indeed - actually across multiple resources/tables), these requirements are often:
The outcome of either of these is that people end up passing around datasets which have varying levels of validation done, but there is no consistent full implementation of the (markdown-defined) specification. This results in an inordinate amount of money spent on engineers' time fixing edge cases and on finger-pointing between various transit software vendors.
Considerations
relationships has been implemented into the table schema but merely indicates if a field is derived from or coupled with another field.
On relationships, @pschumm told @roll in Add pattern - Table Schema: Relationship between Fields #859 that they did not agree that specifying these constraints anywhere other than in the validation context was a priority ("at the moment"). I'd love (for us all) to convince him (or whomever it is that we need to) that it is.
Implementation Thoughts
Similar check-types have been implemented in other data definition languages and in software-specific code. We should leverage existing language in these existing implementations if possible while also making sure that validation error code outcomes are legible (e.g. the json-schema oneOf option is NOT a good example of that).