Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option in SQL Targets to coerce types based on observed record shape #1561

Open
aaronsteers opened this issue Mar 29, 2023 · 5 comments
Open
Labels
kind/Feature New feature or request SQL Support for SQL taps and targets

Comments

@aaronsteers
Copy link
Contributor

aaronsteers commented Mar 29, 2023

This would be a simple-to-use option for end users who are trying to deal with "rogue" or incorrect type declarations in the upstream tap.

For instance, if the tap incorrectly defines one of its fields as an integer, but it receives a string, then we could give the user an option of auto-expanding the data type to be inclusive of the declared type and also the observed type. Since a string column can hold integers as well as strings, expanding the data type to a string type will allow the load to complete successfully.

Implementation wise, if built within the tap or mapper layer, this normally would result in a new SCHEMA message being emitted upon observance of a record that does not fit the declared schema. However, if built in the target, there's no need to emit a SCHEMA message. Instead, the Sink class per batch would expand data type negotiation to be inclusive of (1) declared type, (2) target column's already existing type, and (2) observed data type in the records. Currently this negotiation exists but it only considers the first two factors.

Note:

  • This probably would not be 'on' by default, since there are performance and stability reasons to not do this. However, this feature would exist an option to make sure data is landing properly.

cc @radbrt

@aaronsteers aaronsteers converted this from a draft issue Mar 29, 2023
@aaronsteers aaronsteers moved this from Up Next to Discussed in Office Hours Mar 29, 2023
@radbrt
Copy link
Contributor

radbrt commented Mar 29, 2023

From what I can tell, the SDK does type validation by default. An input stream where integer columns contain a string, and corresponding error message:

{"type": "SCHEMA", "stream": "badges2", "schema": {"type": "object", "properties": {"id": {"type": ["integer"]}, "user_id": {"type": ["null", "string"], "maxLength": 64}, "name": {"type": ["null", "string"], "maxLength": 64}, "class": {"type": ["null", "integer"]}, "tag_based": {"type": ["null", "boolean"]}}}, "key_properties": ["id"], "bookmark_properties": ["id"]}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387}}, "currently_syncing": "badges"}}
{"type": "ACTIVATE_VERSION", "stream": "badges2", "version": 1667747583387}
{"type": "RECORD", "stream": "badges2", "record": {"class": 2, "id": 1, "name": "e9dd95f26c8b24ab973d3f8f", "tag_based": null, "user_id": "9C"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "2", "id": 2, "name": "984fa1edcd96cc55224124c0", "tag_based": null, "user_id": "40"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "D", "id": 3, "name": "cc3453e38b2371bffcae9aa0", "tag_based": null, "user_id": "33"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387, "replication_key_value": 3}}, "currently_syncing": null}}

The error message:

2023-03-29 19:23:39,982 Target 'target-oracle' is listening for input from tap.
2023-03-29 19:23:39,982 Initializing 'target-oracle' target sink...
2023-03-29 19:23:39,983 Initializing target sink for stream 'badges2'...
Traceback (most recent call last):
  File "/Users/radbrt/miniconda3/envs/target-oracle/bin/target-oracle", line 8, in <module>
    sys.exit(TargetOracle.cli())
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 564, in cli
    target.listen(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 34, in listen
    self._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 265, in _process_lines
    counter = super()._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 81, in _process_lines
    self._process_record_message(line_dict)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 308, in _process_record_message
    sink._validate_and_parse(transformed_record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/sinks/core.py", line 302, in _validate_and_parse
    self._validator.validate(record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/jsonschema/validators.py", line 310, in validate
    raise error
jsonschema.exceptions.ValidationError: '2' is not of type 'null', 'integer'

Failed validating 'type' in schema['properties']['class']:
    {'type': ['null', 'integer']}

On instance['class']:
    '2'

I hope to add functionality, and a setting tentatively named types_dont_fail_me_now: true, that allows the target to alter the column type of the offending column.

The pragmatic approach would be to alter the column to varchar (possibly via add column/drop column), and coerce the input, but we could conceivably make a more complex ruleset of int -> bigint -> number -> varchar -> clob.

The other half of the solution must be that during the validation of the target schema (typically during subsequent runs), so that when it finds varchar in the target column it will accept the target type and keep coercing the integer (or whatever) to varchar.

Type checking brings a performance hit, and many users might prefer if the target fails when something is wrong (so it must be a configurable setting). And this would not be feasible to do during batch loads.

@BuzzCutNorman
Copy link
Contributor

I have run into the JSON Validation errors with a couple of data type. @edgarrmondragon tracked down the root cause of the issue and created a draft PR that contains a possible way for developers to control the JSON Validation settings. Here is the link to the PR just in case it is helpful for this conversation.

feat: Support custom JSON schema validation and string format checkers in targets
#1471

@stale
Copy link

stale bot commented Jul 27, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 27, 2023
@tayloramurphy
Copy link
Collaborator

Still relevant right @edgarrmondragon ?

@stale stale bot removed the stale label Jul 28, 2023
@edgarrmondragon
Copy link
Collaborator

@tayloramurphy yeah this'd be nice to have

@edgarrmondragon edgarrmondragon added SQL Support for SQL taps and targets kind/Feature New feature or request labels Jul 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/Feature New feature or request SQL Support for SQL taps and targets
Projects
Archived in project
Development

No branches or pull requests

5 participants