[Chunk Data Pack Pruner] Add Block Iterator #6858

Open
wants to merge 12 commits into
base: master
from

zhangchiqing
Member

This PR adds a height-based block iterator, which iterates over blocks by height. It does not iterate over siblings of finalized blocks; that will be handled later by a view-based block iterator.

}

// BlockIterator is an interface for iterating over blocks
type BlockIterator interface {
Member Author

The BlockIterator interface can be implemented as either a height-based iterator or a view-based iterator.

The block iterator can be used not only by the chunk data pack pruner, but also in the future to implement a protocol state pruner.

The height-based iterator is easy to implement. However, it can't guarantee that all data is pruned, since it doesn't iterate over unfinalized blocks. The view-based iterator can guarantee that all blocks are pruned, but it's more complicated to implement.

In this PR, I implement the height-based iterator first. For chunk data packs, it's OK to prune only by height. For protocol state, however, it's better to prune by view to ensure a more thorough pruning.
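
To make the coverage difference concrete, here is a toy sketch in Go. Everything in it is hypothetical: the `block` struct, the sample chain, and the `idsByHeight` / `idsByView` helpers are stand-ins invented for illustration and are not flow-go types. It assumes a height index that only contains finalized blocks, while every proposed block occupies its own view.

```go
package main

import "fmt"

// block is a toy stand-in for a block header (not a flow-go type).
type block struct {
	id        string
	height    uint64
	view      uint64
	finalized bool
}

// sampleBlocks returns a small chain in which B is an unfinalized
// sibling of C: both sit at height 2, but only C was finalized.
func sampleBlocks() []block {
	return []block{
		{id: "A", height: 1, view: 1, finalized: true},
		{id: "B", height: 2, view: 2, finalized: false},
		{id: "C", height: 2, view: 3, finalized: true},
		{id: "D", height: 3, view: 4, finalized: true},
	}
}

// idsByHeight walks a finalized-height index, so it sees at most one
// block per height and never reaches unfinalized siblings.
func idsByHeight(blocks []block, start, end uint64) []string {
	var ids []string
	for h := start; h <= end; h++ {
		for _, b := range blocks {
			if b.finalized && b.height == h {
				ids = append(ids, b.id)
			}
		}
	}
	return ids
}

// idsByView walks every view in the range and therefore also reaches
// blocks that were proposed but never finalized.
func idsByView(blocks []block, start, end uint64) []string {
	var ids []string
	for v := start; v <= end; v++ {
		for _, b := range blocks {
			if b.view == v {
				ids = append(ids, b.id)
			}
		}
	}
	return ids
}

func main() {
	blocks := sampleBlocks()
	fmt.Println("by height:", idsByHeight(blocks, 1, 3))
	fmt.Println("by view:", idsByView(blocks, 1, 4))
}
```

In this toy chain the height walk visits A, C and D and never sees B, while the view walk visits all four blocks.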

jobCreator IteratorJobCreator
}

func NewIteratorFactory(
Member Author

Once the interfaces in the arguments are implemented, the logic to create the BlockIterator can be reused. That's why I put this function here along with the interface definitions, so that it's clear how the interfaces will be used to create the block iterator.

Contributor

Why not just call this function NewBlockIterator and return a BlockIterator?

Member Author

There will be two NewBlockIterator implementations: NewHeightBasedBlockIterator and NewViewBasedBlockIterator. Both need to implement progress initialization and job creation with a range of heights / views. This logic is the same for both, so extracting the iteration factory lets us reuse it.

In other words, there will be one IterationFactory and many different BlockIterator creators.
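
A minimal sketch of the "one factory, many iterator creators" idea follows. It is not the code from this PR: `IteratorFactory`, `NewIteratorFunc`, `labelIterator`, the batch size, and the hard-coded progress values are all assumptions made for illustration, and block IDs are plain strings. The point is only that the range computation lives in one place while the iteration strategy is plugged in.

```go
package main

import (
	"errors"
	"fmt"
)

// IterationRange is an inclusive range of heights or views.
type IterationRange struct {
	Start, End uint64
}

// BlockIterator follows the Next signature quoted in this PR, with
// string IDs as a stand-in for flow.Identifier.
type BlockIterator interface {
	Next() (blockID string, hasNext bool, err error)
}

// NewIteratorFunc is a hypothetical constructor type, one per strategy.
type NewIteratorFunc func(r IterationRange) BlockIterator

// IteratorFactory holds the logic shared by all strategies: read the
// saved progress, cap the range at the latest index, build an iterator.
type IteratorFactory struct {
	readNext    func() (uint64, error) // stand-in for persisted progress
	latest      func() (uint64, error) // latest height or view available
	batchSize   uint64
	newIterator NewIteratorFunc
}

func (f *IteratorFactory) Create() (BlockIterator, error) {
	if f.batchSize == 0 {
		return nil, errors.New("batch size must be positive")
	}
	start, err := f.readNext()
	if err != nil {
		return nil, fmt.Errorf("could not read progress: %w", err)
	}
	latest, err := f.latest()
	if err != nil {
		return nil, fmt.Errorf("could not read latest index: %w", err)
	}
	if start > latest {
		return nil, errors.New("no new blocks to iterate")
	}
	end := start + f.batchSize - 1
	if end > latest {
		end = latest
	}
	return f.newIterator(IterationRange{Start: start, End: end}), nil
}

// labelIterator fabricates IDs such as "height-5"; a real iterator
// would look them up in storage.
type labelIterator struct {
	label     string
	next, end uint64
}

func (it *labelIterator) Next() (string, bool, error) {
	if it.next > it.end {
		return "", false, nil
	}
	id := fmt.Sprintf("%s-%d", it.label, it.next)
	it.next++
	return id, true, nil
}

func newHeightIterator(r IterationRange) BlockIterator {
	return &labelIterator{label: "height", next: r.Start, end: r.End}
}

func newViewIterator(r IterationRange) BlockIterator {
	return &labelIterator{label: "view", next: r.Start, end: r.End}
}

// newFactory wires the shared logic to one strategy, using fixed
// sample values for the progress (5) and the latest index (20).
func newFactory(newIterator NewIteratorFunc) *IteratorFactory {
	return &IteratorFactory{
		readNext:    func() (uint64, error) { return 5, nil },
		latest:      func() (uint64, error) { return 20, nil },
		batchSize:   10,
		newIterator: newIterator,
	}
}

func main() {
	for _, ctor := range []NewIteratorFunc{newHeightIterator, newViewIterator} {
		it, err := newFactory(ctor).Create()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		first, _, _ := it.Next()
		fmt.Println("first block:", first)
	}
}
```

Only the constructor passed to `newFactory` differs between the two strategies; the progress and range handling is written once.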

@zhangchiqing zhangchiqing changed the base branch from leo/db-ops-dbstore to master January 9, 2025 17:25
@zhangchiqing zhangchiqing changed the base branch from master to leo/db-ops-dbstore January 10, 2025 17:18
@zhangchiqing zhangchiqing marked this pull request as draft January 10, 2025 18:39
Base automatically changed from leo/db-ops-dbstore to master January 13, 2025 19:55
@zhangchiqing zhangchiqing force-pushed the leo/cdp-prune-block-iterator branch from b688ac2 to 0af09d6 Compare January 15, 2025 16:21
@zhangchiqing zhangchiqing marked this pull request as ready for review January 15, 2025 16:23
@zhangchiqing zhangchiqing requested a review from a team as a code owner January 15, 2025 16:23
@codecov-commenter
codecov-commenter commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 41.08%. Comparing base (b740fc0) to head (a57ac00).
Report is 65 commits behind head on master.

Files with missing lines | Patch % | Lines
module/block_iterator.go | 0.00% | 21 Missing ⚠️
module/block_iterator/height_based/iterator.go | 75.00% | 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6858      +/-   ##
==========================================
- Coverage   41.11%   41.08%   -0.04%     
==========================================
  Files        2116     2120       +4     
  Lines      185749   185895     +146     
==========================================
- Hits        76378    76373       -5     
- Misses     102954   103116     +162     
+ Partials     6417     6406      -11     
Flag | Coverage Δ
unittests | 41.08% <40.00%> (-0.04%) ⬇️


Contributor

@peterargue peterargue left a comment

added a couple small comments, but otherwise this looks good.

func (b *HeightIterator) Next() (flow.Identifier, bool, error) {
// exit when the context is done
select {
case <-b.ctx.Done():
Contributor

why do we need context here? since it's only used for this check and only at the beginning of the function call, it seems like we should make the check the caller's responsibility

"github.com/onflow/flow-go/storage"
)

type HeightIterator struct {
Contributor

This is not concurrency safe. should it be? if not, can you add a warning to the godoc

Member Author

yeah, suppose we should have only one iterator for a task.

require.NoError(t, err)

// iterate through all blocks
visited := make(map[flow.Identifier]struct{})
Contributor

can you make this a slice instead of a map so the verification step can check that they were also visited in the correct order?

Comment on lines 52 to 56
// verify we don't iterate too many blocks
count++
if count > len(bs) {
t.Fatal("visited too many blocks")
}
Contributor

if you use a slice, you can omit this and just compare the final length to len(bs)
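
A rough sketch of the slice-based verification, outside of any test framework: `nextFunc`, `iterateSlice`, `collectVisited`, and `checkVisited` are hypothetical helpers, and IDs are plain strings. Collecting into a slice lets a single comparison cover both the count and the order.

```go
package main

import (
	"fmt"
	"slices"
)

// nextFunc is a stand-in for the iterator's Next method.
type nextFunc func() (blockID string, hasNext bool, err error)

// iterateSlice returns a nextFunc that yields ids in order.
func iterateSlice(ids []string) nextFunc {
	i := 0
	return func() (string, bool, error) {
		if i >= len(ids) {
			return "", false, nil
		}
		id := ids[i]
		i++
		return id, true, nil
	}
}

// collectVisited records every visited block in visiting order.
func collectVisited(next nextFunc) ([]string, error) {
	var visited []string
	for {
		id, hasNext, err := next()
		if err != nil {
			return nil, err
		}
		if !hasNext {
			return visited, nil
		}
		visited = append(visited, id)
	}
}

// checkVisited compares the final length first, then the order.
func checkVisited(visited, expected []string) error {
	if len(visited) != len(expected) {
		return fmt.Errorf("visited %d blocks, expected %d", len(visited), len(expected))
	}
	if !slices.Equal(visited, expected) {
		return fmt.Errorf("visited %v, expected order %v", visited, expected)
	}
	return nil
}

func main() {
	expected := []string{"b1", "b2", "b3"}
	visited, err := collectVisited(iterateSlice(expected))
	if err != nil {
		fmt.Println("iteration failed:", err)
		return
	}
	fmt.Println("check result:", checkVisited(visited, expected))
}
```

In a real test the two checks would likely collapse into a single `require.Equal(t, expected, visited)`.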

// if the iteration is interrupted (e.g. by a restart), the iterator can be
// resumed from the last checkpoint, which might result in the same block being
// iterated again.
Next() (blockID flow.Identifier, hasNext bool, exception error)
Contributor

what do you think about making this a go iterator? Instead of Next() it could be

func (b *HeightIterator) Range() iter.Seq2[flow.Identifier, error] {
	return func(yield func(flow.Identifier, error) bool) {
		for b.nextHeight <= b.endHeight {
			next, err := b.headers.BlockIDByHeight(b.nextHeight)
			if err != nil {
				yield(flow.ZeroID, fmt.Errorf("failed to fetch block at height %v: %w", b.nextHeight, err))
				return
			}
		
			b.nextHeight++
		
			if !yield(next, nil) {
				return
			}
		}
	}
}

then

for blockID, err := range heightIterator.Range() {
	...
}

Contributor

I like this idea!

Member Author

This is a Go 1.23 feature, right? 1.22 doesn't seem to have it yet.

Thanks, I will add a TODO to revisit this once we upgrade to 1.23.


type HeightIterator struct {
// dependencies
headers storage.Headers
Contributor

only the BlockIDByHeight function is needed from headers. Consider just using a func (height) flow.Identifier

Member Author

That's a good idea. Actually if I change it into a GetBlockIDByHeight function, the whole HeightIterator will be almost identical to ViewIterator, meaning we could just use HeightIterator as ViewIterator by passing a GetBlockIDByView function as GetBlockIDByHeight.
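
A sketch of what injecting the lookup function might look like: `IndexIterator`, `GetBlockID`, `fromIndex`, and the sample maps are hypothetical names and data, and `Identifier` is a plain string here. The same iterator type serves as a height iterator or a view iterator depending only on which lookup is passed in.

```go
package main

import "fmt"

// Identifier is a stand-in for flow.Identifier.
type Identifier string

// GetBlockID resolves an index (a height or a view) to a block ID.
type GetBlockID func(index uint64) (Identifier, error)

// IndexIterator walks an inclusive range and knows nothing about
// whether the indices are heights or views.
type IndexIterator struct {
	getBlockID GetBlockID
	next       uint64
	end        uint64
}

func NewIndexIterator(getBlockID GetBlockID, start, end uint64) (*IndexIterator, error) {
	if end < start {
		return nil, fmt.Errorf("invalid range: end %d is before start %d", end, start)
	}
	return &IndexIterator{getBlockID: getBlockID, next: start, end: end}, nil
}

func (it *IndexIterator) Next() (Identifier, bool, error) {
	if it.next > it.end {
		return "", false, nil
	}
	id, err := it.getBlockID(it.next)
	if err != nil {
		return "", false, fmt.Errorf("could not get block ID at index %d: %w", it.next, err)
	}
	it.next++
	return id, true, nil
}

// fromIndex turns an in-memory map into a lookup function.
func fromIndex(index map[uint64]Identifier) GetBlockID {
	return func(i uint64) (Identifier, error) {
		id, ok := index[i]
		if !ok {
			return "", fmt.Errorf("no block at index %d", i)
		}
		return id, nil
	}
}

// collectAll drains the iterator into a slice.
func collectAll(it *IndexIterator) ([]Identifier, error) {
	var ids []Identifier
	for {
		id, hasNext, err := it.Next()
		if err != nil {
			return ids, err
		}
		if !hasNext {
			return ids, nil
		}
		ids = append(ids, id)
	}
}

func main() {
	byHeight := fromIndex(map[uint64]Identifier{10: "A", 11: "C"})
	byView := fromIndex(map[uint64]Identifier{20: "A", 21: "B", 22: "C"})

	heights, err := NewIndexIterator(byHeight, 10, 11)
	if err != nil {
		fmt.Println(err)
		return
	}
	views, err := NewIndexIterator(byView, 20, 22)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(collectAll(heights))
	fmt.Println(collectAll(views))
}
```

A real view-based lookup may also need to tolerate views that contain no block, which this sketch treats as an error.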

// the range could be either view based range or height based range.
// when specifying the range, the start and end are inclusive, and the end must be greater than or
// equal to the start
type IterateJob struct {
Contributor

suggestion:

Suggested change
- type IterateJob struct {
+ type IterationRange struct {

// ReadNext reads the next block to iterate
// caller must ensure the reader is created by the IterateProgressInitializer,
// otherwise ReadNext would return exception.
ReadNext() (uint64, error)
Contributor

Suggested change
- ReadNext() (uint64, error)
+ LoadState() (uint64, error)

// IterateProgressWriter saves the progress of the iterator
type IterateProgressWriter interface {
// SaveNext persists the next block to be iterated
SaveNext(uint64) error
Contributor

Suggested change
- SaveNext(uint64) error
+ SaveState(uint64) error

}

// IterateProgressWriter saves the progress of the iterator
type IterateProgressWriter interface {
Contributor

Maybe a bit overkill to have separate interfaces for read and save in this case, since you always need both for iterating.

Member Author

You can see in this PR how the writer and reader are separated.

The reader is used by the job creator to read the start height and create a height range. It doesn't need the writer to update progress.

The writer is used by the iterator to save the iterated height. Since the iteration range is decided by the input (IterateJob), the iterator doesn't need the reader to read progress from storage.
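
The split described above can be sketched as follows. `ReadNext`, `SaveNext`, `IterateProgressWriter`, and `IterateJob` come from the snippets quoted in this thread; the reader interface name, `memProgress`, `jobCreator`, `heightIterator`, and the `Checkpoint` method are assumptions made for illustration, with an in-memory store standing in for the database.

```go
package main

import "fmt"

// IterateJob is an inclusive range to iterate.
type IterateJob struct {
	Start, End uint64
}

// IterateProgressReader is an assumed name for the read-side interface.
type IterateProgressReader interface {
	ReadNext() (uint64, error)
}

// IterateProgressWriter is the write-side interface.
type IterateProgressWriter interface {
	SaveNext(uint64) error
}

// memProgress is an in-memory store that satisfies both interfaces.
type memProgress struct {
	next uint64
}

func (m *memProgress) ReadNext() (uint64, error) { return m.next, nil }
func (m *memProgress) SaveNext(n uint64) error   { m.next = n; return nil }

// jobCreator only reads progress: it decides where the next job starts.
type jobCreator struct {
	progress IterateProgressReader
	latest   uint64
}

// CreateJob returns ok=false when there is nothing new to iterate.
func (c *jobCreator) CreateJob() (job IterateJob, ok bool, err error) {
	next, err := c.progress.ReadNext()
	if err != nil {
		return IterateJob{}, false, fmt.Errorf("could not read progress: %w", err)
	}
	if next > c.latest {
		return IterateJob{}, false, nil
	}
	return IterateJob{Start: next, End: c.latest}, true, nil
}

// heightIterator only writes progress: its range comes from the job.
type heightIterator struct {
	progress  IterateProgressWriter
	next, end uint64
}

func newHeightIterator(w IterateProgressWriter, job IterateJob) *heightIterator {
	return &heightIterator{progress: w, next: job.Start, end: job.End}
}

func (it *heightIterator) Next() (height uint64, hasNext bool) {
	if it.next > it.end {
		return 0, false
	}
	h := it.next
	it.next++
	return h, true
}

// Checkpoint persists the next height to iterate (hypothetical method).
func (it *heightIterator) Checkpoint() error {
	return it.progress.SaveNext(it.next)
}

func main() {
	store := &memProgress{next: 100}
	creator := &jobCreator{progress: store, latest: 102}

	job, ok, err := creator.CreateJob()
	if err != nil || !ok {
		fmt.Println("nothing to do:", err)
		return
	}

	it := newHeightIterator(store, job)
	for {
		h, hasNext := it.Next()
		if !hasNext {
			break
		}
		fmt.Println("pruning height", h)
	}
	if err := it.Checkpoint(); err != nil {
		fmt.Println("checkpoint failed:", err)
		return
	}
	fmt.Println("next start:", store.next)
}
```

Neither component depends on the half of the progress API it does not use, even though one store backs both.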

4 participants