-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Cache SQL columns and schemas #1779
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1779 +/- ##
==========================================
+ Coverage 87.38% 87.39% +0.01%
==========================================
Files 59 59
Lines 5121 5126 +5
Branches 828 830 +2
==========================================
+ Hits 4475 4480 +5
Misses 451 451
Partials 195 195
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pnadolny13 I think caching at this level makes sense. We use functools.lru_cache
in other places in the SDK for a similar purpose, which may be a bit cleaner that implementing our own cache. Will let @edgarrmondragon take a look too 🙂
This reverts commit 668832b.
@kgpayne I picked this back up and explored if lru_cache would work. I dont think it will for this case since we're calling these methods to check if the column/schema exists and if not we create them. If we use lru_cache then in the case where a column doesnt yet exist we would get a return value of false, then we'd add that column elsewhere, and a second call to this method would also return the cached value false even though the column has since been created. My implementation tries to avoid needless calls to the DB if we already know the column/schema exists but if it doesnt exist in our cache then we fall back to calling the DB. The first time the method is called the cache is hydrated then later if a column/schema is requested that exists in the cache its returned without hitting the DB again, but if a column/schema is requested and that isnt in the cache then we go back to the DB to refresh our cache to double check in which case newly added columns would be accounted for. Does that make sense? Can you think of a better way to do it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this implementation makes sense without lru_cache. What this is doing is calling inspection if the object hasn't been processed.
for more information, see https://pre-commit.ci
…ltano/sdk into cache_sql_columns_and_schemas
for more information, see https://pre-commit.ci
Hello, As mentioned in the Target Postgres issue, we are trying to add the overwrite functionality to that loader. I am trying to use the built-in You say "... a new connector is created if the schema changes" but I am not sure if this is the case. I put loggers all over the place in the Target Postgres and the sink is initialized if the schema changes indeed, but NOT the connector, which is where the cache is stored. Indeed, if you look at the code here: class SQLTarget(Target):
def get_sink():
# This part of the code calls `add_sqlsink()`
...
def add_sqlsink():
sink = sink_class(
target=self,
stream_name=stream_name,
schema=schema,
key_properties=key_properties,
# HERE: the Sink is being initialized with the EXISTING connector
connector=self.target_connector,
) As you can see, the Sink class is initialized with the existing target connector, so the sink does not recreate it again. Am I misunderstanding the code? |
@raulbonet it looks like the methods you mentioned were added #1864 after this PR merged. Can you create a separate bug issue for this? |
These two methods get called a lot and behind the scenes sqlalchemy is running a query so it gets pretty expensive if they arent being cached. If the schema or the table isnt found in the cache then it runs the normal workflow of querying the database but if its cached it skips it. I originally implemented a version of this in MeltanoLabs/target-snowflake#57.
I know for targets these methods get called at the initialization step of the sink and a new sink and connector is created if the schema changes so theres no worry about having to invalidate the cache, especially since targets commonly alter the table columns. For taps I'm less familiar with the workflow but it seems rare and out of scope for the source tables to be updated mid sync and require us to invalidate the cache.
@edgarrmondragon @kgpayne any thoughts?
📚 Documentation preview 📚: https://meltano-sdk--1779.org.readthedocs.build/en/1779/