Proposal for time series deletion with block storage #4274
Thanks for working on this proposal! I left a few comments. Can you take a look, please?
Thanks for addressing our feedback! I left a few more comments. Unless I'm missing something, I think the query results cache invalidation is the most critical to address.
Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented in the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation number would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently stored in an index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per-tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the one in the KV-store, or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number.
I think there's a timing issue here. When a new tombstone is created or deleted (cancel delete request), the querier will take some time before applying it (it's not instantaneous). However, the cache generation number is increased immediately, so we're going to cache query results under the new gen number, but results upon which the tombstones haven't been enforced yet.
Am I missing anything?
@pracucci - I was thinking of an alternative approach.
The purger would write new tombstones to the 3 ingesters assigned to a user and wait until at least 2 succeed. While executing a query, a querier would fetch the tombstones for a user from all ingesters in the cluster. If `tombstoneTimestamp > currentCacheGenNumber`, the querier would update the `currentCacheGenNumber` to `currentTimestamp`. I believe this would solve the timing issue. WDYT?
The purger would write new tombstones to 3 ingesters assigned to a user and wait until at least 2 succeed. While executing a query, a querier would fetch the tombstones for a user from all ingesters in the cluster. If the `tombstoneTimestamp > currentCacheGenNumber`, the querier would update the `currentCacheGenNumber` to `currentTimestamp`. I believe this would solve the timing issue. WDYT?
This requires that queriers have access to the place where the "current cache gen number" is stored, and also doesn't handle cancellation as currently proposed.
I am wondering if we can modify design:
- let's assume that frontend and purger have a way to share "current cache gen number/timestamp"
- when purger receives a new delete or cancellation request, it will update the "current cache gen number/timestamp" for the tenant. This timestamp will also be the timestamp in the tombstone. Cancellation requests to be discussed later.
- queriers know which tombstones are present from bucket index, and can include largest timestamp in the response.
- query-frontend will know the "current cache gen number/timestamp". If a response is not found in the cache, it asks queriers. Queriers will include the "largest tombstone timestamp" in their responses. The frontend will take the smallest timestamp from all responses (if the request was split into sub-requests), and use that to check if it can put the final result in the cache again.
We need to consider cancelled tombstones -- in order to make this work, we need to keep them around until at least one other tombstone request or cancellation request with newer timestamp exists. In other words, cancellation wouldn't just delete a tombstone, but create another "tombstone cancelled" file, so that queriers can include its timestamp in the response.
Thank you for the suggestion. I like this idea and think it could work. I've updated the proposal to include this. One small modification I made is that instead of using the largest tombstone timestamp, I propose to use the timestamp of when the compactor writes the tombstones to the bucket index. I think this should work the same but make the delete request cancellations a bit simpler to implement.
Thank you for your work on this proposal. I have read the latest version, and I think it's a solid design. I don't see any showstopper, and I'm looking forward to seeing this implemented! Great job! 🎉
The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request, such as the parameters passed in the request, as well as some metadata such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way, if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending`, `deleting`, and `processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created.
Updating the states will be done from the compactor. Inside the compactor, the new _DeletedSeriesCleaner_ service will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except for the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not deleted. The next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful, then there will be 2 tombstone files with the same filename but different extensions. When writing to the bucket index, the compactor will check for duplicate tombstone files with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state.
the new DeletedSeriesCleaner service will periodically check all the tombstones to see if their current state is ready to be upgraded
This could make things complicated. I think we should add tombstones discovery as part of the blocks `BlocksCleaner`, so that we only have 1 simple component responsible for keeping the bucket index updated.
It will use the tombstone with the most recently updated state and try to delete the file with the older state.
I think the logic should be implemented as a sort of tiny state machine. States can only move in a specific direction: `pending -> deleting -> processed`. Given this assumption, the later state wins over the previous ones (e.g. if you have both `pending` and `deleting` for the same tombstone, then `deleting` wins because we can't go backward from `deleting` to `pending`).
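The one-way progression could be encoded as, e.g. (illustrative names):

```go
package main

import "fmt"

// stateRank encodes the one-way progression pending -> deleting -> processed.
var stateRank = map[string]int{"pending": 0, "deleting": 1, "processed": 2}

// resolveState picks the winner when two files for the same tombstone exist:
// states never move backward, so the later state always wins.
func resolveState(a, b string) string {
	if stateRank[b] > stateRank[a] {
		return b
	}
	return a
}

func main() {
	fmt.Println(resolveState("pending", "deleting")) // deleting wins
}
```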
I've updated it so the `BlocksCleaner` is responsible for discovering and updating the bucket index. As for processing the deletions and permanently removing data, I've left it to be done in a separate service, `DeletedSeriesCleaner`, as this shouldn't be updating the bucket index. Please let me know if it would be preferable to have it separated like this or to also include the series deletion processing inside the `BlocksCleaner`.
- When a deletion request is made or cancelled, the cache generation number for the given tenant is set in the KV-store to the current timestamp.
- Inside the query frontend, the stored cache generation number will be outdated, so the cache is invalidated. The query frontend will now store the most recent cache generation number.
- When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp.
- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. In its response, each querier will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, then all the queriers loaded the bucket index after the compactor updated it with the most up-to-date tombstones. In this case, the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the responses might contain data that was not processed using the most up-to-date tombstones, and the results should not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that all the queriers have the most recent tombstones.
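The frontend-side cacheability check described above could look roughly like this (hypothetical `canCache` helper; timestamps are unix seconds):

```go
package main

import "fmt"

// canCache reports whether the frontend may store a response: every querier
// must have loaded the bucket index after the tombstones were last written,
// i.e. the minimum bucket-index write timestamp across all sub-responses
// must be newer than the tenant's cache generation number.
func canCache(cacheGen int64, querierBucketIndexTs []int64) bool {
	if len(querierBucketIndexTs) == 0 {
		return false // no querier responses, nothing to cache
	}
	min := querierBucketIndexTs[0]
	for _, ts := range querierBucketIndexTs[1:] {
		if ts < min {
			min = ts
		}
	}
	return min > cacheGen
}

func main() {
	fmt.Println(canCache(100, []int64{150, 120})) // all queriers up to date -> true
	fmt.Println(canCache(100, []int64{150, 90}))  // one stale querier -> false
}
```

Taking the minimum (rather than the maximum) is the conservative choice: one stale querier is enough to poison a merged result, so it must block caching.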
I have concerns about this whole mechanism. Looks quite complicated and fragile to me.
Given we don't want any guarantee that deleted series are immediately filtered out from query results (as soon as the tombstone is created) but some delay is accepted, we could leverage the bucket index staleness as the delay to wait before the cache generation number is increased.
we could leverage on the bucket index staleness
As of today, a query fails in the querier if the bucket index last update is older than "X" (configurable). Given tombstones are written to the bucket index by the compactor, and given a querier filters out series matching all tombstones referenced in the bucket index, we could leverage this "X" period before the cache generation number is updated.
If we update the cache generation number of a tombstone after X time it's created, we have the guarantee that after X time all queriers will apply that tombstone. If, for any reason, the bucket index is not updated within X time, then the query will fail (it's already failing today).
Looks quite complicated and fragile to me.
To me it seems quite simple: query-frontend compares current generation number (timestamp of last operation received by purger for given tenant) with timestamps from queriers, which report back timestamp of last bucket index update. If cache gen number is less than minimum (across queriers) of bucket index update timestamps, query-frontend can write response to the cache.
If we update the cache generation number of a tombstone after X time it's created, we have the guarantee that after X time all queriers will apply that tombstone. If, for any reason, the bucket index is not updated within X time, then the query will fail (it's already failing today).
What component would update cache generation number? How would this work with tombstones that were deleted?
To me it seems quite simple
I think I initially misunderstood the proposal. I understand it better now. I agree that looks simple but, correct me if I'm wrong, the query-frontend cache will be entirely skipped after a tombstone is created until queriers have updated the bucket index containing that tombstone (let's say 30m with the default config). Basically, each time you create a tombstone, for the next 30m we run without query-frontend caching for that tenant.
Basically, each time you create a tombstone, for the next 30m we run without query-frontend caching for that tenant.
Yes, that's my understanding. That doesn't sound very good. :(
After our discussion today, I've changed the cache invalidation to use the mechanism leveraging the bucket index staleness period. This should get rid of the 30 min no caching period. Please take a look and let me know any concerns. Thanks for the suggestion!
- The cache generation number will be a timestamp. It will also serve as the time of when it becomes valid and the query front-end can use it.
- The bucket index will store the cache generation number. The query front-end will periodically fetch the bucket index.
- Inside the compactor, it will load the tombstones from object store and update the bucket index accordingly. If a deletion request is made or cancelled, the compactor will discover this and increment the cache generation number in the bucket index. The cache generation number will be the current timestamp plus the max stale period of the bucket index. The compactor can discover if there have been any changes to the tombstones by comparing the newly loaded tombstones to the ones currently in the bucket index.
- The query front-end will fetch the cache generation number from the bucket index. If the current timestamp is less than the cache generation number, it will simply not do anything, as the generation number is not yet valid. If the query front-end discovers that the current time has passed the cache generation timestamp from the bucket index, then it is valid and can be used. The query front-end will compare it to the cache generation number currently stored in the front-end. If the cache generation number in the front-end is less than the one from the bucket index, then the cache is invalidated.
I'm not sure this works. Think about the case where the query-frontend is restarted. At startup it loads the bucket index for a tenant to get the cache generation number, and that number is in the future. What should it do? It can't use the cache generation number in the bucket index, but it doesn't even know which was the previous one.
The proposed approach also has another problem: if the max stale period is 1h and you create a tombstone every < 1h, then the cache is never invalidated.
I have a different proposal:
- The cache generation number in the bucket index is the current one (the one the query-frontend has to use)
- If the bucket index has no cache generation number (think about an old version of the bucket index, when you roll out this feature) then the generation number is a default `0`
- The compactor, during the blocks cleanup process, looks at all tombstones and computes the current generation number as the max timestamp across all tombstones created until "now - max stale period"
- What about cancellations? We need to invalidate the cache also after cancellations. We could change the process of how a tombstone is deleted. We don't immediately delete the tombstone from the object storage, but we switch its state to `.deleted` (adding a `deletionTime` too). This allows the compactor to also be aware of deleted tombstones and when the deletion was done. Finally, the compactor's blocks cleaner can also take care of cleaning up `.deleted` tombstones after X period of time (e.g. 10 times the max stale period).
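The generation-number computation sketched above, under the assumption that each tombstone exposes a creation/state timestamp (type and field names are illustrative):

```go
package main

import "fmt"

// tombstone carries the timestamp of its last state change, in unix seconds.
type tombstone struct {
	stateCreatedTs int64
}

// currentCacheGen computes the generation number as the max tombstone
// timestamp among tombstones older than now - maxStale; newer tombstones are
// ignored until queriers are guaranteed to have picked them up via the bucket
// index. With no (old enough) tombstones it defaults to 0.
func currentCacheGen(tombstones []tombstone, now, maxStale int64) int64 {
	var gen int64 // default 0, e.g. for old bucket index versions
	cutoff := now - maxStale
	for _, t := range tombstones {
		if t.stateCreatedTs <= cutoff && t.stateCreatedTs > gen {
			gen = t.stateCreatedTs
		}
	}
	return gen
}

func main() {
	ts := []tombstone{{stateCreatedTs: 100}, {stateCreatedTs: 950}}
	// the tombstone at 950 is within the stale window, so it doesn't count yet
	fmt.Println(currentCacheGen(ts, 1000, 300))
}
```

Because `.deleted` tombstones would keep their own state timestamp, the same function covers cancellations: a cancellation bumps the generation number exactly like a new deletion does.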
That makes a lot of sense, thanks for the suggestions! The proposal is updated now. The only difference is that instead of using a `deletionTime`, I added another field called `stateCreationTime`, which can serve the same purpose. This way it can also be used for seeing how much time has elapsed since moving into the pending state.
- In Prometheus, there is no delay.
- One way to filter out immediately is to load the tombstones during query time, but this would cause a negative performance impact.
- Adding limits to the API such as:
  - The number of deletion requests per day,
This may be non-trivial to implement. What were your thoughts on this?
The tombstones have a field with a timestamp of when the request was created. It could be possible to do something like iterate over the tombstones and count the ones that were created in the last 24 hours.
Good job on this proposal! LGTM 👍 👏
Could you run
Just ran the command. Thank you and to everyone else for all the help with this PR!
* Add proposal document (Signed-off-by: Gofman <[email protected]>, ilangofman <[email protected]>)
* Minor text modifications
* Implement requested changes to the proposal
* Fix mention of Compactor instead of purger in proposal
* Fixed wording and spelling in proposal
* Update the cache invalidation method
* Fix wording on cache invalidation section
* Minor wording additions
* Remove white-noise from text
* Remove the deleting state and change cache invalidation
* Add deleted state and update cache invalidation
* Add one word to clear things up
* update api limits section
* ran clean white noise (Signed-off-by: Alvin Lin <[email protected]>)
While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request IDs that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request ID(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletion requests listed in this field. Assuming that deletion requests are quite rare, the size of the `meta.json` files should remain small.
The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the `meta.json` `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If multiple tombstones are currently being processed for deletion and apply to a particular block, then the _DeletedSeriesCleaner_ will process them at the same time to prevent additional blocks from being created. If, after iterating through all the blocks, it doesn't find any blocks requiring deletion, then the `Pending` state is complete and the request progresses to the `Processed` state.
If the tombstone file has a `maxTime` which is in the future, when do you consider all blocks are processed? Do we need to wait until all blocks are processed and the `maxTime` is passed?
After taking a break, I'm getting back to working on the implementation of this proposal.
I was following the chunks series deletion API implementation, which I don't think supports deletes with end times in the future. However, the way you mention it here could work as well. Need to check that the end time has passed before moving the deletion request to the processed state.
If we allow deletion requests with the end times in the future, one downside is that if the deletion period end time is really far into the future, then the tombstone query time filtering will result in slower performance for a long time until the tombstone end time has passed. For example, let's say we would like to delete a certain metric 1 month into the future, then all queries will have to be filtered for that metric the entire month.
Yeah I agree. But maybe a future time window (now+24h) can be considered due to the pending period.
It is not clear how to process multiple tombstones on the same block. Two tombstones may enter the processed state at different times, so one block might be rewritten first by one tombstone, and then the other tombstone can only apply to the new block.
If there are multiple tombstones that are past the pending state, then we can rewrite a single block once using multiple tombstones at the same time. If some tombstones are in a pending state and some are processing, then it is possible the block will get rewritten once by one tombstone, and then the new block gets rewritten again once the pending tombstone begins processing.
My plan was that the compactor will process the permanent deletions at a configurable interval, `SeriesDeletionCleanupInterval`. So the deletions will not necessarily begin as soon as the pending period passes; there could be some waiting, depending on how much time you set it to. So if, let's say, you set `SeriesDeletionCleanupInterval` = 12 hours, and in the last 12 hours X tombstones went from pending to processing state, it will only rewrite each block once using the X tombstones at the same time (assuming the X tombstones apply to the block). So making multiple delete requests close to each other will require less processing than doing them far apart.
Cons:
- Might have to wait for a longer period of time for the compaction to be finished.
- This would mean the earliest time to be able to run the deletion would be once the last time from the block_ranges in the [compactor_config](https://cortexmetrics.io/docs/blocks-storage/compactor/#compactor-configuration) has passed. By default this value is 24 hours, so only once 24 hours have passed and the new compacted blocks have been created can the rewrite be safely run.
This means that if we configure a larger compaction block duration for example 7d, then we must wait for 7 days for the tombstones to be cleaned up.
From the Thanos project's perspective this will be a little bit long, as Thanos's max block is a 14d time range.
Yeah, if the block duration is longer, then with this deletion design, it will require a longer wait.
Maybe for the Thanos project the design could be expanded to perform deletions during compaction. If you have other alternatives in mind, I would be happy to discuss with you.
What this PR does:
A proposal for adding a time series deletion API for block storage. Currently still in [Draft] and not fully complete, but would really appreciate some initial feedback.
Which issue(s) this PR fixes:
A proposal for #4267
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`