Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend metadata cache flush procedure to flush specific caches #10385

Merged
merged 3 commits into from
Jan 16, 2022

Conversation

aczajkowski
Copy link
Member

@aczajkowski aczajkowski commented Dec 22, 2021

This PR extends existing Hive connector procedure system.flush_metadata_cache() with optional parameters which will allow to make invocation more specific to concrete schema, table or even partition.
E.g.

USE example_catalog.some_schema;

# Clear all caches in catalog 'example_catalog' for all schema
system.flush_metadata_cache(); 

# Clear all caches for specific partition 'partition_column=partition_value'
# in table 'example_catalog.example_schema.example_table'
system.flush_metadata_cache(
  schema_name => 'example_schema',
  table_name => 'example_table',
  partition_columns => ARRAY['partition_column'],
  partition_values => ARRAY['partition_value']
);

@cla-bot cla-bot bot added the cla-signed label Dec 22, 2021
@aczajkowski aczajkowski force-pushed the acz/partitions_cache branch 5 times, most recently from db3b253 to 42dc5d7 Compare December 23, 2021 13:52
@aczajkowski aczajkowski marked this pull request as ready for review January 3, 2022 09:13
@sopel39
Copy link
Member

sopel39 commented Jan 3, 2022

@aczajkowski Could you provide a rationale for this feature? Caches are usually not something that one should micromanage. Why flushing entire cache is not sufficient?

@aczajkowski
Copy link
Member Author

@aczajkowski Could you provide a rationale for this feature? Caches are usually not something that one should micromanage. Why flushing entire cache is not sufficient?

@sopel39 This is mostly for large data sets where deltas from recent period are being updated (overwritten) since next period starts. Eg. Load events from current day (0:00 - 6:00] after six hours export and load events from (0:00 - 12:00]
to table partitioned by date (day precision) . Once day ends we no longer need to touch older partitions and there is no need to flush older cache entries. In use case I'm addressing flushing whole cache and loading it from 0 means +/- 45s lag in user query which is not acceptable.

Currently each time delta is being written we:

  • create new partition location
  • write new delta data there
  • switch partition location in Hive.

@sopel39
Copy link
Member

sopel39 commented Jan 3, 2022

I would focus on that particular partition cache usecase then and avoid other complexity for now. We could have flush_partition_cache procedure in addition to flush_metadata_cache procedure

@aczajkowski
Copy link
Member Author

I would focus on that particular partition cache usecase then and avoid other complexity for now. We could have flush_partition_cache procedure in addition to flush_metadata_cache procedure

Ok sound resonable. @findepi @wendigo @losipiuk WDYT ??

@findepi
Copy link
Member

findepi commented Jan 4, 2022

I would focus on that particular partition cache usecase then and avoid other complexity for now.

I agree we should expose as little functionality as sufficient. Every functionality adds up to maintenance cost

We could have flush_partition_cache procedure in addition to flush_metadata_cache procedure

There is no "partition cache". There is "metadata cache containing partition information", and there also is "file listing cache containing partitions' data files".

I like the idea of "overloading" flush_metadata_cache procedure, as this continues to use "flush_metadata_cache" name which is informative, and doesn't introduce controversy "flush_metadata_cache" vs "flush_partition_cache".

Since we may want to be future-proof here and keep ability to add more options to the procedure (hopefully this never happens, but butter be prepared), we could allow syntaxes

flush_metadata_cache()
flush_metadata_cache(schema_name => ..,  table_name => .., partition => ...)

while disallowing syntax

flush_metadata_cache('...', '....', ...); -- positional parameters

to do this, we could add a fake parameter as the first one:

        return new Procedure(
                "system",
                "flush_metadata_cache",
                ImmutableList.of(
                        new Procedure.Argument(
                                "$fake_first_parameter",
                                VARCHAR,
                                false,
                                "procedure should only be invoked with name parameters"),
                        new Procedure.Argument(
                                "schema_nname",
                                VARCHAR,
                                false,
                                ...
                ),

@sopel39
Copy link
Member

sopel39 commented Jan 4, 2022

There is no "partition cache". There is "metadata cache containing partition information", and there also is "file listing cache containing partitions' data files".

Do the details matter for end user? I don't think user cares what is internal cache layout. All he wants is to evict partition information from metadata cache. Thus flush_partition_cache seems more friendly name as it's more explicit (it's easier to grasp than overloaded method with some if-then documentation).

@losipiuk
Copy link
Member

As for the user interface I am on @findepi's side with it. Having a single flush_metadata_cache method with keyword arguments is nice to use and allows for easy extension of capabilities (while it is not always easy to come up with new function name which makes sense and is consistent with already defined ones).

@sopel39
Copy link
Member

sopel39 commented Jan 10, 2022

As for the user interface I am on @findepi's side with it.

So 2 vs 1. I don't have that strong opinion. Up to you. You might ask @martint too

@aczajkowski
Copy link
Member Author

@findepi @sopel39 @losipiuk I've implemented overloaded version where arguments check is done explicit in procedure implementation. Thrown exception list legal usages to avoid misleading.

docs/src/main/sphinx/connector/hive.rst Outdated Show resolved Hide resolved
@@ -136,6 +136,26 @@ public void testFlushHiveMetastoreCacheProcedureCallable()
queryRunner.execute(renamedColumnQuery);
}

@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestCachingHiveMetastoreWithQueryRunner -- this is an odd name for a test class.

It's named like this to differentiate from a unit test TestCachingHiveMetastore, and so it should be called TestCachingHiveMetastoreQueries (alas, this already exists! let's merge the classes -- as a follow-up)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we could merge them there are some differences in QueryRunner setup, but i think we can manage to adjust.

@aczajkowski aczajkowski requested a review from findepi January 12, 2022 10:57
@aczajkowski aczajkowski force-pushed the acz/partitions_cache branch 2 times, most recently from 4bac055 to 6ffcd24 Compare January 13, 2022 10:42
@aczajkowski
Copy link
Member Author

@findepi Thx for review and approval. Tests got green. Do you want some additional reviewers to approve or we could merge ?

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Some comments to test code.

@aczajkowski
Copy link
Member Author

@losipiuk applied your comments. Please let me know if updated integration test is ok now.

@losipiuk losipiuk merged commit 6529d7b into trinodb:master Jan 16, 2022
@github-actions github-actions bot added this to the 369 milestone Jan 16, 2022
@losipiuk losipiuk mentioned this pull request Jan 17, 2022
@aczajkowski aczajkowski deleted the acz/partitions_cache branch February 7, 2022 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants