Option in SQL Targets to coerce types based on observed record shape #1561

aaronsteers · 2023-03-29T16:55:48Z

This would be a simple-to-use option for end users who are trying to deal with "rogue" or incorrect type declarations in the upstream tap.

For instance, if the tap incorrectly defines one of its fields as an integer, but it receives a string, then we could give the user an option of auto-expanding the data type to be inclusive of the declared type and also the observed type. Since a string column can hold integers as well as strings, expanding the data type to a string type will allow the load to complete successfully.

Implementation wise, if built within the tap or mapper layer, this normally would result in a new SCHEMA message being emitted upon observance of a record that does not fit the declared schema. However, if built in the target, there's no need to emit a SCHEMA message. Instead, the Sink class per batch would expand data type negotiation to be inclusive of (1) declared type, (2) target column's already existing type, and (2) observed data type in the records. Currently this negotiation exists but it only considers the first two factors.

Note:

This probably would not be 'on' by default, since there are performance and stability reasons to not do this. However, this feature would exist an option to make sure data is landing properly.

cc @radbrt

The text was updated successfully, but these errors were encountered:

radbrt · 2023-03-29T17:29:28Z

From what I can tell, the SDK does type validation by default. An input stream where integer columns contain a string, and corresponding error message:

{"type": "SCHEMA", "stream": "badges2", "schema": {"type": "object", "properties": {"id": {"type": ["integer"]}, "user_id": {"type": ["null", "string"], "maxLength": 64}, "name": {"type": ["null", "string"], "maxLength": 64}, "class": {"type": ["null", "integer"]}, "tag_based": {"type": ["null", "boolean"]}}}, "key_properties": ["id"], "bookmark_properties": ["id"]}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387}}, "currently_syncing": "badges"}}
{"type": "ACTIVATE_VERSION", "stream": "badges2", "version": 1667747583387}
{"type": "RECORD", "stream": "badges2", "record": {"class": 2, "id": 1, "name": "e9dd95f26c8b24ab973d3f8f", "tag_based": null, "user_id": "9C"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "2", "id": 2, "name": "984fa1edcd96cc55224124c0", "tag_based": null, "user_id": "40"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "D", "id": 3, "name": "cc3453e38b2371bffcae9aa0", "tag_based": null, "user_id": "33"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387, "replication_key_value": 3}}, "currently_syncing": null}}

The error message:

2023-03-29 19:23:39,982 Target 'target-oracle' is listening for input from tap.
2023-03-29 19:23:39,982 Initializing 'target-oracle' target sink...
2023-03-29 19:23:39,983 Initializing target sink for stream 'badges2'...
Traceback (most recent call last):
  File "/Users/radbrt/miniconda3/envs/target-oracle/bin/target-oracle", line 8, in <module>
    sys.exit(TargetOracle.cli())
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 564, in cli
    target.listen(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 34, in listen
    self._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 265, in _process_lines
    counter = super()._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 81, in _process_lines
    self._process_record_message(line_dict)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 308, in _process_record_message
    sink._validate_and_parse(transformed_record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/sinks/core.py", line 302, in _validate_and_parse
    self._validator.validate(record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/jsonschema/validators.py", line 310, in validate
    raise error
jsonschema.exceptions.ValidationError: '2' is not of type 'null', 'integer'

Failed validating 'type' in schema['properties']['class']:
    {'type': ['null', 'integer']}

On instance['class']:
    '2'

I hope to add functionality, and a setting tentatively named types_dont_fail_me_now: true, that allows the target to alter the column type of the offending column.

The pragmatic approach would be to alter the column to varchar (possibly via add column/drop column), and coerce the input, but we could conceivably make a more complex ruleset of int -> bigint -> number -> varchar -> clob.

The other half of the solution must be that during the validation of the target schema (typically during subsequent runs), so that when it finds varchar in the target column it will accept the target type and keep coercing the integer (or whatever) to varchar.

Type checking brings a performance hit, and many users might prefer if the target fails when something is wrong (so it must be a configurable setting). And this would not be feasible to do during batch loads.

BuzzCutNorman · 2023-03-29T18:02:03Z

I have run into the JSON Validation errors with a couple of data type. @edgarrmondragon tracked down the root cause of the issue and created a draft PR that contains a possible way for developers to control the JSON Validation settings. Here is the link to the PR just in case it is helpful for this conversation.

feat: Support custom JSON schema validation and string format checkers in targets
#1471

stale · 2023-07-27T19:01:55Z

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

tayloramurphy · 2023-07-28T15:34:51Z

Still relevant right @edgarrmondragon ?

edgarrmondragon · 2023-07-28T15:44:29Z

@tayloramurphy yeah this'd be nice to have

aaronsteers added this to Office Hours Mar 29, 2023

aaronsteers converted this from a draft issue Mar 29, 2023

aaronsteers moved this from Up Next to Discussed in Office Hours Mar 29, 2023

stale bot added the stale label Jul 27, 2023

stale bot removed the stale label Jul 28, 2023

edgarrmondragon added SQL Support for SQL taps and targets kind/Feature New feature or request labels Jul 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option in SQL Targets to coerce types based on observed record shape #1561

Option in SQL Targets to coerce types based on observed record shape #1561

aaronsteers commented Mar 29, 2023 •

edited

Loading

radbrt commented Mar 29, 2023 •

edited

Loading

BuzzCutNorman commented Mar 29, 2023

stale bot commented Jul 27, 2023

tayloramurphy commented Jul 28, 2023

edgarrmondragon commented Jul 28, 2023

Option in SQL Targets to coerce types based on observed record shape #1561

Option in SQL Targets to coerce types based on observed record shape #1561

Comments

aaronsteers commented Mar 29, 2023 • edited Loading

radbrt commented Mar 29, 2023 • edited Loading

BuzzCutNorman commented Mar 29, 2023

stale bot commented Jul 27, 2023

tayloramurphy commented Jul 28, 2023

edgarrmondragon commented Jul 28, 2023

aaronsteers commented Mar 29, 2023 •

edited

Loading

radbrt commented Mar 29, 2023 •

edited

Loading