
Support compaction &| deletion. #10

Open
thedodd opened this issue May 30, 2019 · 3 comments

Comments

thedodd commented May 30, 2019

Looks like there is already an issue open for compression. Awesome. In this case, I'm looking for a way to remove old entries, perhaps from a given offset and back. I see the truncation method; that is definitely useful in certain cases, especially when dealing with Raft and the like.

In the case of compaction/deletion, the use case is a log that is only intended to be kept around for some specific amount of time, or whose entries are to be deleted after some specific amount of time. E.g., keep messages around for one week; after that, remove them.

I'm happy to implement this, as I am strongly considering using this crate for a project of mine; I just wanted to open a ticket for some general discussion. Thoughts?

zowens (Owner) commented Jun 1, 2019

Thanks for filing this.

Some sort of compaction/rewrite functionality would be really useful. I think there are a couple cases here:

  1. Time-based retention (e.g. Kafka does this by keeping a time index in addition to an offset -> address index)
  2. Rewrite the log by removing certain entries (in Kafka parlance, this is a compacted topic)
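The time-based case (1) could be sketched as a retention check over per-segment metadata. This is a minimal sketch in Rust; the `SegmentMeta` type and its fields are hypothetical, not types from this crate:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical segment metadata: the segment's base offset plus the
/// timestamp of its newest entry. Names are illustrative only.
struct SegmentMeta {
    base_offset: u64,
    newest_entry_at: SystemTime,
}

/// Return the base offsets of segments whose newest entry is older than
/// `retention`, i.e. candidates for time-based deletion.
fn expired_segments(segments: &[SegmentMeta], retention: Duration, now: SystemTime) -> Vec<u64> {
    segments
        .iter()
        .filter(|s| {
            now.duration_since(s.newest_entry_at)
                .map(|age| age > retention)
                .unwrap_or(false)
        })
        .map(|s| s.base_offset)
        .collect()
}
```

Keying the check on the segment's *newest* entry guarantees a segment is only dropped once every record in it has aged out.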

When I was thinking about this a few months ago, I was envisioning a generalized indexing scheme for the time support: you could have some sort of custom index over a field like "timestamp" without having to introduce time concepts throughout the code base.
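A generalized secondary index along those lines might be sketched as a small trait. Everything here (the trait name, methods, and the `BTreeMap`-backed implementation) is a hypothetical sketch, not this crate's API:

```rust
use std::collections::BTreeMap;

/// Sketch of a generalized secondary index: entries expose some ordered
/// field (e.g. a timestamp), and the index maps that field to offsets
/// without the log itself knowing about time.
trait SecondaryIndex<K: Ord> {
    /// Record that the entry at `offset` has field value `key`.
    fn insert(&mut self, key: K, offset: u64);
    /// First offset whose field value is >= `key`, if any.
    fn lower_bound(&self, key: &K) -> Option<u64>;
}

/// In-memory implementation backed by a BTreeMap, keeping the first
/// offset seen for each key value.
struct BTreeIndex<K: Ord>(BTreeMap<K, u64>);

impl<K: Ord> SecondaryIndex<K> for BTreeIndex<K> {
    fn insert(&mut self, key: K, offset: u64) {
        self.0.entry(key).or_insert(offset);
    }
    fn lower_bound(&self, key: &K) -> Option<u64> {
        self.0.range(key..).next().map(|(_, &off)| off)
    }
}
```

With `K = u64` millisecond timestamps, `lower_bound` answers "first offset written at or after time t", which is the primitive time-based retention needs.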

For key-based compaction, one could use a key index along with custom code to actually perform the comparisons during compaction.
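A key-based compaction pass over an in-memory batch of entries could look like the following sketch; the `Entry` type and `compact` function are illustrative assumptions, not part of the crate:

```rust
use std::collections::HashMap;

/// Hypothetical log entry for illustration: a key plus a payload.
#[derive(Clone, PartialEq, Debug)]
struct Entry {
    key: String,
    value: Vec<u8>,
}

/// Rewrite a batch of entries keeping only the last value per key,
/// preserving the relative order of the surviving entries (roughly what
/// a Kafka compacted topic does). A sketch, not the crate's API.
fn compact(entries: &[Entry]) -> Vec<Entry> {
    // First pass: remember the index of the last occurrence of each key.
    let mut last: HashMap<&str, usize> = HashMap::new();
    for (i, e) in entries.iter().enumerate() {
        last.insert(&e.key, i);
    }
    // Second pass: keep an entry only if it is that last occurrence.
    entries
        .iter()
        .enumerate()
        .filter(|(i, e)| last[e.key.as_str()] == *i)
        .map(|(_, e)| e.clone())
        .collect()
}
```

A real implementation would stream segment by segment rather than hold everything in memory, but the two-pass shape (locate survivors, then rewrite) is the same.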

The other thing worth thinking through is the complexity of removing whole segments versus the complexity of fully rewriting a segment with part of the log truncated. Both pieces of functionality would be interesting, but each is worth considering on its own.

Which exact requirements do you need sooner rather than later?

norcalli commented Feb 28, 2021

Ignoring the index mapping from time-based indices to offsets, a first step would be simply to add the ability to do approximate and exact deletion.

Approximate deletion would quickly delete all segments that precede the segment containing the given lower_bound offset.

Exact deletion would actually create a new segment, truncating the segment that contains the offset so that it holds no extra records.
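The approximate flavor can be sketched over a sorted list of segment base offsets; the function and names below are hypothetical, and the exact flavor would follow the same lookup and then additionally rewrite the surviving head segment starting at `lower_bound`:

```rust
/// Approximate deletion: given segment base offsets sorted ascending,
/// keep only the segment containing `lower_bound` and everything after
/// it; everything before can be unlinked wholesale.
fn approximate_delete(segments: &[u64], lower_bound: u64) -> &[u64] {
    // Index of the segment containing `lower_bound`: the last base
    // offset <= lower_bound. If lower_bound precedes all segments,
    // nothing is deleted.
    let keep_from = match segments.binary_search(&lower_bound) {
        Ok(i) => i,
        Err(i) => i.saturating_sub(1),
    };
    &segments[keep_from..]
}
```

Dropping whole segments is just file deletion, which is why the approximate form is cheap; the exact form pays for a copy of the one boundary segment.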

These two methods would likely exist regardless of additional features like time-based indices. I have no idea how Kafka implements its retention, but these approaches seemed obvious to me when I was imagining how to implement it, since it's an important feature for me. I will likely add it in a fork; if these seem like reasonable methods, I can put them into a PR.

zowens (Owner) commented Mar 6, 2021

@norcalli Agreed, those seem like reasonable approaches. Feel free to PR it!
