Conditional requirements across records, fields, and tables #1058

e-lo · 2024-11-07T18:14:18Z

e-lo
Nov 7, 2024

As somebody who routinely uses data standards that have conditionality between fields, records, and tables, I'd really appreciate it if those conditions could be defined in a language-agnostic manner that code can consume and implement consistently.

Example Use Cases

I am using GTFS as my primary use case here because it is important to many people across the world in their daily lives - there are many other examples. These use cases vary widely in implementation complexity.

Across Fields

Field value depends on value of another field in record.

Examples:

stop_times.arrival_time must be provided for records where timepoint == 1.
stop_times.arrival_time <= stop_times.departure_time
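
A minimal sketch of these two record-level rules in Python, to show what a validator has to do for a single record. The function names (`time_to_seconds`, `check_stop_time`) and the dict-based record shape are assumptions for illustration, not part of GTFS or Frictionless.

```python
def time_to_seconds(value: str) -> int:
    """Convert an HH:MM:SS string to seconds (hours may exceed 23, as in GTFS)."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_stop_time(record: dict) -> list:
    """Return the rule violations found in one stop_times record."""
    errors = []
    arrival = record.get("arrival_time") or None      # treat "" as missing
    departure = record.get("departure_time") or None

    # Rule 1: arrival_time is required when timepoint == 1.
    if record.get("timepoint") == 1 and arrival is None:
        errors.append("arrival_time required when timepoint == 1")

    # Rule 2: arrival_time <= departure_time, checked only when both are present.
    if arrival and departure and time_to_seconds(arrival) > time_to_seconds(departure):
        errors.append("arrival_time must be <= departure_time")
    return errors
```

Both rules only need the record itself, which is why this category is the cheapest to support.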

Across Fields and Records

Field value depends on a group-by attribute in the table.

Example: stop_times.arrival_time must be provided for the min and max values of stop_times.stop_sequence within a group defined by trip_id.

Field value should increase as another one does.

Example: stop_times.arrival_time and stop_times.departure_time, if provided, should not decrease within a given trip_id when ordered by stop_times.stop_sequence
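
The two group-level rules above could be checked roughly as follows. This is a sketch under assumptions: records are plain dicts, times are HH:MM:SS strings, and `check_trips` is a hypothetical helper, not an existing API.

```python
from collections import defaultdict


def to_seconds(value):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_trips(stop_times):
    """Return {trip_id: [messages]} for violations of the group-level rules."""
    by_trip = defaultdict(list)
    for row in stop_times:
        by_trip[row["trip_id"]].append(row)

    problems = defaultdict(list)
    for trip_id, rows in by_trip.items():
        rows.sort(key=lambda r: r["stop_sequence"])
        first, last = rows[0], rows[-1]

        # Rule A: the min and max stop_sequence of a trip need an arrival_time.
        for row in ([first] if first is last else [first, last]):
            if not row.get("arrival_time"):
                problems[trip_id].append(
                    f"stop_sequence {row['stop_sequence']}: arrival_time missing")

        # Rule B: provided times must not decrease along the trip.
        previous = None
        for row in rows:
            for field in ("arrival_time", "departure_time"):
                value = row.get(field)
                if not value:
                    continue  # missing values are skipped, not compared
                seconds = to_seconds(value)
                if previous is not None and seconds < previous:
                    problems[trip_id].append(
                        f"stop_sequence {row['stop_sequence']}: {field} decreases")
                previous = seconds
    return dict(problems)
```

Unlike the record-level rules, this needs grouping and ordering before any check can run, which is the part a declarative syntax would have to express.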

Between Tables

Table required if another table is not provided (or has no records).

Example: calendar_dates is required if calendar is not provided.

Between a Table and a Field in another Table

Table forbidden if field exists in another table (that is non-null).

Example: networks.txt is conditionally forbidden if routes.network_id exists.
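
Both table-level rules (a table required when another is absent, a table forbidden when a field is used elsewhere) can be sketched in a few lines of Python. The `tables` mapping from table name to list of records and the `check_tables` helper are assumptions for illustration; table names follow the GTFS file names without the `.txt` extension.

```python
def check_tables(tables):
    """Check table-presence rules; `tables` maps a table name to its records."""
    errors = []

    def has_rows(name):
        # A table counts as "provided" only if it exists and has records.
        return bool(tables.get(name))

    # calendar_dates is required when calendar is missing or empty.
    if not has_rows("calendar") and not has_rows("calendar_dates"):
        errors.append("calendar_dates is required when calendar is not provided")

    # networks is forbidden when any route carries a non-null network_id.
    uses_network_id = any(
        route.get("network_id") not in (None, "")
        for route in tables.get("routes", [])
    )
    if uses_network_id and "networks" in tables:
        errors.append("networks is forbidden when routes.network_id is used")
    return errors
```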

Between Fields in different Tables

Requirement for a record identified in another table by a foreign key.

Example: pathways.from_stop_id and pathways.to_stop_id must reference a record in stops where stops.location_type != 1
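
This is a foreign key with an extra condition on the referenced record. A rough Python sketch, assuming records are dicts and that `check_pathway_stops` is a hypothetical helper:

```python
def check_pathway_stops(pathways, stops):
    """Check that pathway endpoints reference stops whose location_type != 1."""
    # Index the referenced table once by its key.
    location_type = {stop["stop_id"]: stop.get("location_type") for stop in stops}

    errors = []
    for number, pathway in enumerate(pathways, start=1):
        for field in ("from_stop_id", "to_stop_id"):
            stop_id = pathway.get(field)
            if stop_id not in location_type:
                # Plain foreign-key failure.
                errors.append(f"pathways row {number}: {field}={stop_id!r} not found in stops")
            elif location_type[stop_id] == 1:
                # The key resolves, but the referenced record breaks the condition.
                errors.append(f"pathways row {number}: {field}={stop_id!r} has location_type 1")
    return errors
```

The existing foreignKeys property covers the first branch; the second branch is the conditional part this discussion is about.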

NOTE: In many of these use cases, we can easily observe that the data model itself is far from ideal and that the conditional requirements could be met more legibly and efficiently by redefining the data model. That's nice in theory, but in practice you can't ask 10,000 transit agencies, all of their technology vendors, and the myriad consumer applications to change. Thus standards like GTFS have a pretty firm (though not completely solid) backwards-compatibility requirement, which creates some (very) awkward data modeling.

Community Need

This functionality seems to have fairly widespread demand as indicated by:

Issues that have been submitted (and closed without implementation, AFAIK), including "required based on field condition" #169 and "Create Frictionless pattern to support conditional constraints" tdwg/camtrap-dp#32
Implementation of conditionality in popular data definition languages like json-schema
Implementation of row_constraint in Frictionless Framework
Implementation of row-/model-level checks in popular parsing and validation packages such as pandera (which can import frictionless schemas) and pydantic
Numerous data schemas that are documented in formats like markdown (and then again in code for validation) instead of frictionless, b/c frictionless is not currently capable of describing the schema. One of the most notable examples is GTFS, the data specification that 10,000+ transit agencies use to share their schedules with trip planning applications like Google Maps and Apple Maps.

What Currently Happens

In the absence of a clear, code-legible but language-agnostic method to define requirements across multiple fields (indeed - actually across multiple resources/tables), these requirements are often:

Ignored, creating invalid datasets that are inconsistent with the defined schema
Implemented in language-specific code with possibly different interpretations by language and their available packages.
- Usually it is implemented in code that is specific to one organization's use-case.
- Sometimes this is made available as open source code as a general resource.
- Even if it is open-source, the validation component is not individually packaged in an easily-consumable api for other purposes.
- In the case of GTFS, many organizations spent on the order of high 6-7 figures (and counting) to implement a canonical validator which is ~ 94,000 lines of code. Many organizations do not even use this canonical validator because it is gnarly to use as an internal API AND it doesn't even implement the full standard (yet) because it is gnarly to develop onto.

The outcome of either of these is that people end up passing around datasets that have had varying levels of validation done, but there is no consistent, full implementation of the (markdown-defined) specification. That results in an inordinate amount of money spent on engineers' time fixing edge cases and pointing fingers between various transit software vendors.

Considerations

The concept of relationships has been implemented into the table schema but merely indicates if a field is derived from or coupled with another field.

{ "fields": [ ],
  "relationships": [
    { "fields" : [ "country", "code"],
      "description" : "is the country code alpha-2 of",
      "link" : "coupled"
    },
    { "fields" : [ "region", "population"],
      "description" : "is the population of",
      "link" : "derived"
    }
  ]
}

In initially considering the implementation of relationships, @pschumm told @roll in "Add pattern - Table Schema: Relationship between Fields" #859 that they did not agree that specifying these constraints anywhere other than in the validation context was a priority ("at the moment"). I'd love (for us all) to convince him (or whomever it is that we need to) that it is.

Implementation Thoughts

Similar check-types have been implemented in other data definition languages and in software-specific code. We should leverage existing language in these existing implementations if possible while also making sure that validation error code outcomes are legible (e.g. the json-schema oneOf option is NOT a good example of that).

pschumm · 2024-11-08T09:09:28Z

pschumm
Nov 8, 2024
Collaborator

I'd love (for us all) to convince him (or whomever it is that we need to) that it is.

@e-lo, don't overestimate my influence here; I am only one of several folks all participating in a working group to move the data package standard forward. Decisions about what is added to the standard are based on consensus of the group.

Also, note that relationships between fields were added as a recipe, but are not part of the official standard (yet).

Your example above is an excellent one, and one that especially resonates with me as an avid user of public transportation and apps that rely on these feeds. Many of the use cases you list (but perhaps not all) can be handled with existing functionality, such as row constraints (of which I know you are aware). I use row constraints all the time in my work, for which they provide constraint checking and documentation that is relatively human readable. Other types of constraints can be specified and checked with a custom validation check. Neither of these is language-agnostic, however.

IIUC, you are advocating for something that would be language-agnostic, more standardized, and easier to use (i.e., if it were part of the standard). IMO those are all important considerations. My only concerns would be:

Making sure that anything added along these lines is not discipline or use case specific, so that it is equally intuitive and usable in the widest possible range of contexts; and
That the added value outweighs the need to keep the standard as simple and easy to use as possible, which is a big part of its appeal. Frictionless may not be the best tool in all cases, and IMO attempting to achieve that would be a mistake.

In sum, it's the members of the working group that you should focus on "convincing," and the way to do that would be to propose something specific. FWIW, I totally agree with you that "We should leverage existing language in these existing implementations if possible while also making sure that validation error code outcomes are legible." I would be supportive of something that does this while addressing the two concerns listed above. I would also strongly suggest reaching out to the working group to see if someone wants to collaborate with you on developing a specific proposal, as you might get some excellent feedback and would then be well positioned to advocate for it.

1 reply

e-lo Nov 8, 2024
Author

Awesome feedback, thank you!

e-lo · 2024-11-08T21:58:23Z

e-lo
Nov 8, 2024
Author

Following @pschumm's suggestion to propose something specific, here goes a straw person!

Field Scan

Implement through language-specific code:

pandera
pydantic
PostgreSQL
a zillion other things...

Implement through a "formula":

frictionless row constraints, which leverage simpleeval

I do think that json-schema's namespaces are good - though its implementation of conditional logic is not.

Note that there are a zillion other data definition languages for defining constraints, but they are mainly focused on type and shape relationships rather than logical constraints.

Implementation Strategy

Add constraints namespaces for data packages and tables, which can refer to one or more conditions
Allow constraints to be based on condition

Initial Examples

NOTE: this is a WIP but I'm saving incrementally. Thoughts welcome!

Field value depends on value of another field in record.

fields:
  - name: arrival_time
    type: string
    format: time
    constraints:
      required: true
      pattern: "^([0-9]{1,2}:[0-9]{2}:[0-9]{2})$"
      logical:
        - description: "stop_times.arrival_time must be provided for records where timepoint == 1"
          groupBy: record
          condition: timepoint == 1
          constraints:
            value != ""
    description: "Arrival time at a specific stop for a specific trip on a specific day."

  - name: timepoint
    type: integer
    constraints:
      required: false
      enum: [0, 1]
    description: "Indicates whether the specified arrival and departure times for a stop are strictly adhered to by the vehicle."

primaryKey: ["trip_id", "stop_sequence"]
missingValues: [""]
constraints: # table-level constraints which must evaluate to true to pass validation
    logical:
      - groupBy: record
        condition: arrival_time <= departure_time
        constraints:
          - true
        description: stop_times.arrival_time <= stop_times.departure_time
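
To make the field-level `logical` rule above concrete, here is a rough sketch of how a consumer might interpret one rule once the YAML has been parsed into a dict. Everything here is an assumption for illustration: the expressions are read as Python-style comparisons, only single comparisons are supported, `value` stands for the field the rule is attached to, and `evaluate` / `check_logical` are hypothetical names. It is not how Frictionless evaluates row constraints.

```python
import ast
import operator

# The only operators this sketch understands.
COMPARATORS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate(expression, names):
    """Evaluate one comparison such as 'timepoint == 1' against a dict of names."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return names[node.id]
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            compare = COMPARATORS.get(type(node.ops[0]))
            if compare is not None:
                return compare(visit(node.left), visit(node.comparators[0]))
        # Anything else (calls, attribute access, chained comparisons) is rejected.
        raise ValueError(f"unsupported expression: {expression!r}")

    return visit(ast.parse(expression, mode="eval"))


def check_logical(rule, field_name, record):
    """Return True if the record passes one field-level 'logical' rule."""
    # Missing values are represented as "" here, matching missingValues: [""].
    names = dict(record, value=record.get(field_name, ""))
    if not evaluate(rule["condition"], names):
        return True  # the condition does not apply to this record
    return bool(evaluate(rule["constraints"], names))
```

For example, with `rule = {"condition": "timepoint == 1", "constraints": 'value != ""'}`, a record with `timepoint` 1 and an empty `arrival_time` fails, while the same record with `timepoint` 0 passes.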

Field value depends on a group-by attribute in the table.

fields:
  - name: trip_id
    type: string
    constraints:
      required: true
    description: "Identifies a trip."
  - name: arrival_time
    type: string
    format: time
    constraints:
      required: true
      pattern: "^([0-9]{1,2}:[0-9]{2}:[0-9]{2})$"
    description: "Arrival time at a specific stop for a specific trip on a specific day."
  - name: stop_sequence
    type: integer
    constraints:
      required: true
      minimum: 0
    description: "Order of the stops for a particular trip."

primaryKey: ["trip_id", "stop_sequence"]
missingValues: [""]

constraints:
    logical:
      - description: "stop_times.arrival_time must be provided for the min and max values of stop_times.stop_sequence within a group defined by trip_id"
        groupBy: record
        # this is getting messy. suggestions welcome
        condition: >-
          stop_sequence =
          max(groupby("trip_id").where("group.trip_id"=="trip_id"))
          OR
          min(groupby("trip_id").where("group.trip_id"=="trip_id"))
        constraints:
          value != ""

0 replies

djvanderlaan · 2024-11-13T07:58:11Z

djvanderlaan
Nov 13, 2024

Colleagues of mine have worked on this. They have written the validate R-package that can import sets of rules and check if data is valid according to those rules. See: https://data-cleaning.github.io/validate/ .

Furthermore, there is the VTL (https://cros.ec.europa.eu/book-page/vtl) language (Validation and Transformation Language) by Eurostat. The nice thing is that it is a generic language for specifying these kinds of rules. The not so nice thing is that it is apparently a bit of a mess (according to colleagues) and there are various implementations but no complete implementation.

2 replies

nichtich Nov 20, 2024
Collaborator

VTL looks good, but the lack of implementations is a high barrier indeed. I only found https://github.com/Meaningful-Data/vtlengine

nichtich Nov 20, 2024
Collaborator

Speaking of implementations, I suppose XPath would be the easiest to do. All it needs is a mapping of the table structure into XML, e.g. <row field1="value" field2="value2" .../>, and an existing XPath processor.
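
A small sketch of that mapping using Python's standard library. Note the assumptions: `xml.etree.ElementTree` only supports a limited XPath subset (attribute equality predicates, no functions), so the rule is phrased as a path that selects violating rows; a full XPath processor would allow richer expressions. The `<table>` root element and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Map each record to <row field="value" .../> under a <table> root."""
    table = ET.Element("table")
    for row in rows:
        ET.SubElement(table, "row", {key: str(value) for key, value in row.items()})
    return table


# Selects rows violating "arrival_time must be provided where timepoint == 1".
# Missing values are assumed to be mapped to empty attribute values.
VIOLATION_XPATH = "./row[@timepoint='1'][@arrival_time='']"


def find_violations(rows):
    """Return the attributes of every row matched by the violation path."""
    return [dict(element.attrib) for element in rows_to_xml(rows).findall(VIOLATION_XPATH)]
```

Since XML attributes are strings, typed comparisons such as ordering two times would depend on what the chosen XPath processor supports.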

nichtich · 2024-11-14T06:06:50Z

nichtich
Nov 14, 2024
Collaborator

I think there are good reasons why schema languages such as XML Schema and SHACL do not support arbitrary conditions across data elements. The best thing to compare with and build on is SQL CHECK constraints. Spreadsheets also allow for validation. In any case, this issue requires defining a syntax for expressions such as arrival_time <= departure_time. This goes far beyond the core of the data package specification. I strongly recommend that we

either allow arbitrary strings as check rules and add a field to indicate the syntax (e.g. SQL formula, Excel formula, JavaScript, XPath, SHACL...)
or stick to one existing syntax (e.g. SQL formula)
or define a strict subset of an existing syntax
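
To illustrate the second option, here is a rough sketch of the two record-level rules from the original post written as SQL CHECK constraints and exercised through Python's built-in `sqlite3` module. The table layout and helper names are assumptions for illustration; the time comparison is a plain text comparison, so it only behaves correctly for zero-padded HH:MM:SS values.

```python
import sqlite3


def make_stop_times_table():
    """Create an in-memory stop_times table whose rules are CHECK constraints."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        """
        CREATE TABLE stop_times (
            trip_id TEXT NOT NULL,
            stop_sequence INTEGER NOT NULL,
            arrival_time TEXT,
            departure_time TEXT,
            timepoint INTEGER,
            PRIMARY KEY (trip_id, stop_sequence),
            -- arrival_time must be provided where timepoint = 1
            CHECK (timepoint IS NOT 1 OR arrival_time IS NOT NULL),
            -- a NULL on either side makes the check pass (SQL three-valued logic)
            CHECK (arrival_time <= departure_time)
        )
        """
    )
    return connection


def try_insert(connection, row):
    """Insert one row; return False if a constraint rejects it."""
    try:
        connection.execute("INSERT INTO stop_times VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

This covers rules within a single record. Rules across records or tables (such as the group-by examples) are not expressible as a plain CHECK in most SQL dialects, so a chosen subset would need to say how far it goes.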

@e-lo wrote:

stop_sequence = 
    max(groupby("trip_id").where("group.trip_id"=="trip_id")). # this is getting messy. suggestions welcome

Yes, it's getting messy, so better not to invent our own syntax but to use a syntax that someone has already cleaned up and implemented.

2 replies

e-lo Nov 20, 2024
Author

I wholeheartedly agree with the three suggested pathways to specifying constraints. Thoughts appreciated from others as to which they might prefer/detest!

nichtich Nov 20, 2024
Collaborator

I suppose it depends on the availability of standards and implementations. Pathway 1 is easiest but has little advantage over using a plain description. Pathway 3 requires the most work. I had a look at a SQL parser: it's surely doable but depends on how much you need to express. Maybe start with a review of existing languages that may be used (SQL CHECK, OpenFormula, XPath, VTL...)?
