
[Feature]: Add stream context to request authenticator #1028

Closed
mchudoba opened this issue Oct 3, 2022 · 6 comments

mchudoba commented Oct 3, 2022

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

I'm proposing passing the stream/partition context when creating an authenticator in REST streams. We have a use case where we would like to use partitions within the same stream run to extract data for multiple clients, each of which requires different credentials. We would like to put those credentials in the stream partition and access them when creating the authenticator object.

The SDK already supports passing the context object when building the request URL, parameters, and body, so this would add it to one more part of the request.
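For reference, these are the existing context-aware hooks on RESTStream (signatures abbreviated from memory; see the SDK source for the authoritative versions):

def get_url(self, context: dict | None) -> str:
    ...

def get_url_params(self, context: dict | None, next_page_token) -> dict:
    ...

def prepare_request_payload(self, context: dict | None, next_page_token) -> dict | None:
    ...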

Proposal

Current authenticator property:

@property
def authenticator(self) -> APIAuthenticatorBase | None:
    ...

Add/replace with:

def get_authenticator(self, context: dict | None) -> APIAuthenticatorBase | None:
    ...

Then call self.get_authenticator(context) instead of self.authenticator inside RESTStream.build_prepared_request.

Questions

If accepted, my only question would be whether to deprecate the authenticator property in favor of get_authenticator or to support both. By default, get_authenticator could simply return self.authenticator.
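For example, the backward-compatible default could be a thin wrapper over the existing property (a sketch, assuming the property is kept):

def get_authenticator(self, context: dict | None) -> APIAuthenticatorBase | None:
    # Default: ignore the context and delegate to the existing property,
    # so streams that only override `authenticator` keep working unchanged.
    return self.authenticator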

@edgarrmondragon
Collaborator

edgarrmondragon commented Oct 7, 2022

Hi @mchudoba, thanks for logging!

> We would like to put those credentials in the stream partition and access them when creating the authenticator object.

It's generally not safe to put credentials in the partition context because it ends up in a number of potentially insecure places, like the tap state dictionary and as a tag in metric logs.

> We have a use case where we would like to use partitions within the same stream run to extract data for multiple clients, which each require different credentials.

Is "within the same stream run" a hard requirement? I can think of a solution that instantiates multiple instances of the stream class with different configurations:

from singer_sdk import RESTStream, Tap


class MyStream(RESTStream):
    def __init__(self, client_creds, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each instance holds its own set of credentials.
        self.client_creds = client_creds

    @property
    def authenticator(self):
        return MyAuthenticator(self.client_creds)


class MyTap(Tap):
    def discover_streams(self):
        # One stream instance per set of client credentials in the config.
        return [
            MyStream(client_creds, self)
            for client_creds in self.config.get("client_creds", [])
        ]
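For reference, the snippet above assumes a config shape roughly like this (the client_creds key and its contents are illustrative, not an SDK convention):

config = {
    "client_creds": [
        {"client_id": "acme", "api_key": "<secret>"},
        {"client_id": "globex", "api_key": "<secret>"},
    ],
}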

Still, the proposal might be interesting for other folks too, so I've added it to our Office Hours board 🙂

@mchudoba
Author

@edgarrmondragon Thank you for the suggestion! Great point about storing secrets in the context; I didn't fully think that through. An alternative approach would be to store only a client_id in the partition context and look up the credentials in the config.

Something like this:

def get_authenticator(self, context: dict | None) -> APIAuthenticatorBase | None:
    # Only the non-secret client_id lives in the partition context; the
    # actual credentials are looked up from the tap config.
    client_id = context["client_id"]
    creds = self.config["client_creds_map"].get(client_id)
    return MyAuthenticator(creds)
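The partition context itself could then be populated from the same config via the SDK's partitions property; a minimal sketch, with client_creds_map assumed as above:

@property
def partitions(self) -> list[dict]:
    # Emit one partition per client, carrying only the non-secret ID.
    return [{"client_id": cid} for cid in self.config["client_creds_map"]]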

However, your suggestion about using multiple stream instances is really interesting and might do exactly what I need. Running within the same stream run is definitely not a hard requirement, so I'll experiment with that and see if it works for me.

How does Meltano send data from multiple instances of the same stream to its target in one stream run? Does it batch data across instances? Essentially I'm trying to send larger, less frequent batches of data to our Postgres target.

@edgarrmondragon
Collaborator

> Running within the same stream run is definitely not a hard requirement, so I'll experiment with that and see if it works for me.

👍

> How does Meltano send data from multiple instances of the same stream to its target in one stream run? Does it batch data across instances? Essentially I'm trying to send larger, less frequent batches of data to our Postgres target.

@mchudoba Each instance will emit a schema message at the start of the sync, which for most targets will trigger a drain (more or less costly, depending on the target). So if you have a lot of partitions with few records in each, the loading process could be inefficient.

In my opinion, targets should be able to detect that the schema has not changed, so they don't need to drain the current record batch. No targets are implemented like that, AFAICT.

cc @kgpayne @aaronsteers this 👆 is something that could be interesting/useful to have for targets.

@aaronsteers
Contributor

aaronsteers commented Oct 11, 2022

> In my opinion, targets should be able to detect that the schema has not changed, so they don't need to drain the current record batch. No targets are implemented like that, AFAICT.

We actually did intend to implement exactly this check (and also a check on key-properties) within the SDK for targets:

> A new sink will be created if schema is provided and if either schema or
> key_properties has changed. If so, the old sink becomes archived and held
> until the next drain_all() operation.

https://github.com/meltano/sdk/blob/main/singer_sdk/target_base.py#L131-L133

If this is not working as expected for SDK targets (meaning, an identical schema message is sent and it causes the Sink to drain), I think we would want to log that as a defect. The conditional is here:

https://github.com/meltano/sdk/blob/main/singer_sdk/target_base.py#L160-L171
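Roughly, the check described in that docstring amounts to something like the following (a paraphrased sketch, not the SDK's actual code; names like _sinks_active and add_sink are illustrative):

def get_sink(self, stream_name, schema, key_properties):
    existing_sink = self._sinks_active.get(stream_name)
    if existing_sink is None:
        # No sink yet for this stream: create one.
        return self.add_sink(stream_name, schema, key_properties)
    if existing_sink.schema != schema or existing_sink.key_properties != key_properties:
        # Schema or key properties changed: archive the old sink until the
        # next drain_all() operation, then start a fresh one.
        self._sinks_to_clear.append(self._sinks_active.pop(stream_name))
        return self.add_sink(stream_name, schema, key_properties)
    # Identical schema message: reuse the existing sink; no drain triggered.
    return existing_sink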

@aaronsteers
Contributor

Logged a related issue, because I think SDK-based taps may be noisy/inefficient for non-SDK targets:

Also, I don't think this is related, but I should call out (just in case) that we currently have an overloaded meaning of "batch", which we will resolve and/or merge in the future: BatchSink is a subclass of Sink that loads its items in bulk, while "batch messages" are a means for taps and targets to communicate by sending files. See #963.
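To illustrate the first meaning: a BatchSink subclass buffers records and loads them in bulk once a batch fills up. A minimal sketch (max_size and the context["records"] buffer follow the SDK's documented BatchSink interface, but double-check the docs):

from singer_sdk.sinks import BatchSink

class MySink(BatchSink):
    max_size = 10000  # records buffered before process_batch is called

    def process_batch(self, context: dict) -> None:
        # Bulk-load the buffered records into the destination.
        records = context["records"]
        ...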

@aaronsteers aaronsteers moved this from To Discuss to Up Next in Office Hours Oct 19, 2022
@aaronsteers aaronsteers moved this from Up Next to To Discuss in Office Hours Oct 19, 2022
@stale

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@edgarrmondragon edgarrmondragon closed this as not planned Jul 20, 2023