feat: standardise per-stream config approach #1350
Comments
@aaronsteers @pnadolny13 FYI as you opened those two issues 🙂 |
@kgpayne I like this idea! One thing to consider though is how to cleanly override these settings. For me, I like to set the streams in the base plugin configuration so they're shared by all environments, then override a config setting like So the challenge I see with the approach you suggested is that if I have complex stream selection at the base level, then I need to copy/paste it into my environment config to keep the same stream config but override the I'm not sure what the solution is though, maybe having a way to treat environment select criteria as additional criteria. In a lot of implementations we had ways to manipulate selection in the catalog.json and config.json, so the config could be used in Meltano environments to achieve this goal. Here's an example of what I'm trying to describe. To keep the same selection criteria and add
On the other hand something like this would end up ignoring all the selection rules defined and overriding them with *.
|
@pnadolny13 thanks for your feedback 🙌 Just to clarify - this would be a new standard config format/feature in the
plugins:
  extractors:
  - name: tap-gitlab
    variant: meltanolabs
    pip_url: git+https://github.com/meltanolabs/tap-gitlab.git
    select:
      # rules as normal
    config:
      streams:
        "*":
          ignore: true
environments:
- name: userdev
  config:
    plugins:
      extractors:
      - name: tap-gitlab
        config:
          groups: meltano
          start_date: '2020-01-01T00:00:00Z'
          streams:
            "*":
              max_records_limit: 100
            "my_selected_stream_prefix_*":
              ignore: false
So I would expect Meltano to do a merge in the example above, with the resulting config for
config:
  groups: meltano
  start_date: '2020-01-01T00:00:00Z'
  streams:
    "*":
      ignore: true
      max_records_limit: 100
    "my_selected_stream_prefix_*":
      ignore: false
This is maybe still somewhat confusing in that It is also worth saying that implementing |
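The merge described above can be sketched as a recursive dictionary merge. This is only an illustration of the expected result, not Meltano's actual merge logic; the `deep_merge` helper is a hypothetical name introduced here.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Base plugin config and environment-level config from the example above.
base_config = {"streams": {"*": {"ignore": True}}}
env_config = {
    "groups": "meltano",
    "start_date": "2020-01-01T00:00:00Z",
    "streams": {
        "*": {"max_records_limit": 100},
        "my_selected_stream_prefix_*": {"ignore": False},
    },
}

merged = deep_merge(base_config, env_config)
print(merged["streams"])
```

Under this assumption the `"*"` entry ends up carrying both `ignore` and `max_records_limit`, which matches the merged config shown in the comment.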
@kgpayne ahh ok I misunderstood how this proposal would work, I was thinking this was a change to select itself but should have known because this is the SDK repo 🙄 , thanks for explaining more! I do think this solves the use cases that I have then. |
@kgpayne - A few things here...
First of all, I do think it's potentially valuable to allow users to override specific aspects of the tap's config at an individual stream level. But I think the preferred implementation would be (1) that the stream-level config be a subset of the tap-level config, and (2) that we do not support wildcards in this implementation.
Wildcards create a difficult-to-debug situation at best, and an impossible-to-debug situation at worst - due to the issues of precedence order and a domain of stream names that may not be known to the user at config-time. Whereas the
Explicit stream names also allow the tap to print a hard error message if the stream name does not apply to any known streams in the catalog - and print a simple warning message if rules apply to streams that are simply deselected. Config applied to a deselected stream may be valid for the tap overall but not in this invocation, whereas config that applies to streams that aren't in the catalog is almost certainly a typo.
Possible implementation
I think one reasonably simple implementation path here is to simply add a new
Examples of properties which could work at the stream level:
Examples of properties which likely would not work at the stream level:
Caveats / Implementation Challenges
Other points:
|
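The error-versus-warning behaviour described in this comment could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions: `check_stream_overrides` and its arguments are hypothetical names, not part of the SDK.

```python
import logging

logger = logging.getLogger(__name__)


def check_stream_overrides(overrides, catalog_streams, selected_streams):
    """Validate explicit per-stream config keys against the known streams.

    Raises ValueError for names missing from the catalog (likely typos) and
    logs a warning for streams that exist but are deselected in this run.
    Returns the sorted list of deselected stream names that had config.
    """
    unknown = sorted(set(overrides) - set(catalog_streams))
    if unknown:
        raise ValueError(
            f"Stream config refers to unknown stream(s): {', '.join(unknown)}"
        )
    deselected = sorted(set(overrides) - set(selected_streams))
    for name in deselected:
        logger.warning(
            "Stream config for '%s' is ignored because the stream is deselected.",
            name,
        )
    return deselected
```

A check like this is only possible with explicit stream names; with wildcards there is no way to tell a typo from a pattern that happens to match nothing.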
@aaronsteers thanks for writing this up 🙏 Your examples in particular include ones I wasn't aware of! I am on board with not supporting patterns/wildcards, ignore patterns or
Re: |
I have been running with an idea that you tossed out in Slack for a per-stream approach. These are the components I have incorporated into a few taps to get this working:
# tap.py
from singer_sdk import Tap
from singer_sdk import typing as th


class TapName(Tap):
    config_jsonschema = th.PropertiesList(
        th.Property(
            "stream_config",
            th.ArrayType(
                th.PropertiesList(
                    th.Property(
                        "stream",
                        th.StringType,
                        required=True,
                        description="Name of stream to apply a custom configuration.",
                    ),
                    th.Property(
                        "parameters",
                        th.StringType,
                        description="URL formatted parameters string to be used for stream.",
                    ),
                )
            ),
            description="Custom configuration for streams.",
        )
    ).to_dict()

# client.py
from singer_sdk.streams import RESTStream


class TapNameStream(RESTStream):
    def get_stream_config(self) -> dict:
        """Get parameters set in config."""
        config = {}
        stream_configs = self.config.get("stream_config", [])
        if not stream_configs:
            return config
        # Keep only the entries for this stream; the last matching entry wins.
        config_list = [
            conf for conf in stream_configs if conf.get("stream", "") == self.name
        ] or [None]
        config_dict = config_list[-1] or {}
        # Drop the "stream" key so only the settings themselves are returned.
        stream_config = {k: v for k, v in config_dict.items() if k != "stream"}
        return stream_config
Now I can use the config on a per-stream basis like in this example:
# client.py
from typing import Any, Dict, Optional
from urllib.parse import parse_qsl


class TapNameStream(RESTStream):
    def get_stream_params(self) -> dict:
        stream_params = self.get_stream_config().get("parameters", "")
        # Parse "?a=1&b=2" style strings into a {"a": "1", "b": "2"} dict.
        return {qry[0]: qry[1] for qry in parse_qsl(stream_params.lstrip("?"))}

    def get_url_params(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Dict[str, Any]:
        return self.get_stream_params()
This can most likely be cleaned up but otherwise works great, especially for testing during development. |
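To make the list-based lookup easier to follow outside the SDK, here is a standalone sketch of the same logic as a plain function. The stream name `issues` and the query string are made-up example values, and `stream_params` is a hypothetical helper rather than an SDK method.

```python
from urllib.parse import parse_qsl


def stream_params(config: dict, stream_name: str) -> dict:
    """Return the parsed URL parameters configured for one stream, if any."""
    matches = [
        entry
        for entry in config.get("stream_config", [])
        if entry.get("stream") == stream_name
    ]
    if not matches:
        return {}
    # As in the class-based version, the last matching entry wins.
    raw = matches[-1].get("parameters", "")
    return dict(parse_qsl(raw.lstrip("?")))


config = {
    "stream_config": [
        {"stream": "issues", "parameters": "?state=open&per_page=50"},
    ]
}

print(stream_params(config, "issues"))   # {'state': 'open', 'per_page': '50'}
print(stream_params(config, "commits"))  # {}
```

Note that `parse_qsl` returns every value as a string, so numeric parameters would need converting if the API client expects integers.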
@robby-rob-slalom - Fantastic news. Thanks for sharing this! 🚀 Any reason you chose stream configs as a list/array, rather than as a map with the stream name as map key? |
If we were going to make this part of the default built-in SDK handling, we could add an overridable Instead of what we have currently in self._config: dict = dict(tap.config) That might be replaced with something like this:
And your If the exact form of the stream-level overrides were still up in the air, or to be left up to the developer, we could keep status quo behavior by simply returning What's nice about making this built-in, is that everywhere that the cc @edgarrmondragon, @kgpayne for their thoughts as well. |
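One way to picture the overridable hook discussed here is the sketch below. It uses small stand-in classes rather than the real SDK `Tap` and `Stream`, and the method name `get_stream_config`, the `stream_config` key, and the merge behaviour are all assumptions for illustration only.

```python
class SketchTap:
    """Stand-in for a tap; only holds a config dict."""

    def __init__(self, config: dict) -> None:
        self.config = config


class SketchStream:
    """Stand-in for a stream that resolves its own config once, at init."""

    name = "base"

    def __init__(self, tap: SketchTap) -> None:
        # Instead of copying tap.config directly, route it through a hook.
        self._config = self.get_stream_config(dict(tap.config))

    def get_stream_config(self, tap_config: dict) -> dict:
        """Overridable hook: tap-level config plus this stream's overrides."""
        overrides = tap_config.get("stream_config", {}).get(self.name, {})
        merged = {k: v for k, v in tap_config.items() if k != "stream_config"}
        merged.update(overrides)
        return merged

    @property
    def config(self) -> dict:
        return self._config


class IssuesStream(SketchStream):
    name = "issues"


class CommitsStream(SketchStream):
    name = "commits"


tap = SketchTap(
    {
        "start_date": "2020-01-01",
        "stream_config": {"issues": {"start_date": "2023-01-01"}},
    }
)
print(IssuesStream(tap).config)   # {'start_date': '2023-01-01'}
print(CommitsStream(tap).config)  # {'start_date': '2020-01-01'}
```

A developer who wants the current behaviour could override the hook to return `tap_config` unchanged, so code that reads the stream's config would not need to know whether overrides are in play.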
IIRC I made it a list to get around the schema validation and having to define each stream as a key with a "parameters" property. I'm open to a better way to go about it. |
@robby-rob-slalom maybe
th.PropertiesList(
    th.Property(
        "stream_config",
        th.ObjectType(
            additional_properties=th.ObjectType(
                th.Property(
                    "parameters",
                    th.StringType,
                    description="URL formatted parameters string to be used for stream.",
                ),
            ),
        ),
        description="Custom configuration for streams.",
    ),
)
I'm still not sure how developers would handle custom stream-level settings that extend built-in ones (e.g. flattening, etc.) |
Thanks @edgarrmondragon! That reduced the
# client.py
# class TapNameStream(RESTStream):
    def get_stream_config(self) -> dict:
        """Get config for stream."""
        stream_configs = self.config.get("stream_config", {})
        return stream_configs.get(self.name, {})
|
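For anyone moving an existing config from the list shape to the map shape, the conversion could be sketched as follows. `to_stream_map` is a hypothetical helper and the stream names are example values; it keeps the "last entry wins" behaviour of the earlier list-based lookup.

```python
def to_stream_map(stream_config_list: list) -> dict:
    """Convert [{"stream": name, ...settings}] into {name: {...settings}}."""
    stream_map = {}
    for entry in stream_config_list:
        settings = {k: v for k, v in entry.items() if k != "stream"}
        # A later entry for the same stream replaces an earlier one.
        stream_map[entry["stream"]] = settings
    return stream_map


old_style = [
    {"stream": "issues", "parameters": "?state=open"},
    {"stream": "commits", "parameters": "?per_page=10"},
    {"stream": "issues", "parameters": "?state=closed"},
]

new_style = to_stream_map(old_style)
print(new_style.get("issues", {}))  # {'parameters': '?state=closed'}
print(new_style.get("tags", {}))    # {}
```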
A good partial reference implementation is in MeltanoLabs/tap-github#300 |
Feature scope
Configuration (settings parsing, validation, etc.)
Description
We have at least two open issues for features requiring per-stream configuration:
ignore glob patterns as standard tap config #1240
max_records config for testing #1333
It strikes me that we may want to standardise our approach for specifying per-stream config, and leverage the same pattern-matching functions for all per-stream config. Something like:
The above would have the following effect:
max_records_limit to 1000
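A shared pattern-matching function of the kind proposed here might be sketched with the standard library's `fnmatch` module. The rule format, the function name `resolve_stream_config`, and the "later rules override earlier ones" precedence are assumptions made for this illustration; the proposal itself does not fix them.

```python
from fnmatch import fnmatchcase


def resolve_stream_config(stream_name: str, rules: dict) -> dict:
    """Merge the settings of every glob rule that matches `stream_name`.

    Rules are applied in insertion order, so a later matching rule
    overrides keys set by an earlier one.
    """
    resolved = {}
    for pattern, settings in rules.items():
        if fnmatchcase(stream_name, pattern):
            resolved.update(settings)
    return resolved


rules = {
    "*": {"ignore": True, "max_records_limit": 1000},
    "my_selected_stream_prefix_*": {"ignore": False},
}

print(resolve_stream_config("my_selected_stream_prefix_issues", rules))
# {'ignore': False, 'max_records_limit': 1000}
print(resolve_stream_config("other_stream", rules))
# {'ignore': True, 'max_records_limit': 1000}
```

The result depends on rule order, which is the precedence concern raised in the comments about wildcards.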