[BUG] Searchable snapshot dependency on repository chunk_size #9676

Closed
andrross opened this issue Aug 31, 2023 · 2 comments · Fixed by #12277
Labels: bug, Search:Searchable Snapshots

Comments

@andrross
Member

andrross commented Aug 31, 2023

Background

Every repository implementation accepts an optional chunk_size parameter at repository creation time. This property defines the maximum size of any file uploaded to the repository. Any file larger than that is split into parts of chunk_size bytes each (the last part may be smaller).
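
As a rough illustration of that splitting rule, here is a minimal Python sketch. The function name `chunk_lengths` and the use of `None` for "no chunking" are assumptions made for this example; they are not taken from the OpenSearch code.

```python
from typing import List, Optional


def chunk_lengths(file_length: int, chunk_size: Optional[int]) -> List[int]:
    """Return the sizes of the parts a file of file_length bytes would be stored as.

    chunk_size=None models a repository with no chunking (hypothetical convention).
    """
    if chunk_size is None or file_length <= chunk_size:
        # The file fits in a single part.
        return [file_length]
    full_parts, remainder = divmod(file_length, chunk_size)
    # Every part is chunk_size bytes except possibly the last one.
    return [chunk_size] * full_parts + ([remainder] if remainder else [])
```

For example, a 2500-byte file with a chunk_size of 1000 would be stored as parts of 1000, 1000, and 500 bytes.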

Searchable snapshots work by fetching partial index files on-demand at search time, and storing these parts as 8MiB (current hard-coded default) files on disk (unless the entire file is smaller than 8MiB). A virtual IndexInput wraps this logic, giving the upper layers the appearance of a single file while the implementation fetches and reads what is needed from these 8MiB partial files. Some clever logic in this code relies on the partial file size being a power of 2 so that it can use bit shifting to convert a "block number" into an actual byte offset.
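
The bit-shifting idea can be sketched in a few lines of Python. This only illustrates the arithmetic that a power-of-2 block size allows; the constant and function names below are made up for the example and do not mirror the actual implementation.

```python
# 8 MiB = 2**23 bytes, so a shift of 23 converts between blocks and bytes.
BLOCK_SIZE_SHIFT = 23           # hypothetical name
BLOCK_SIZE = 1 << BLOCK_SIZE_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1     # low 23 bits set


def block_start(block: int) -> int:
    """Byte offset in the virtual file where the given block begins."""
    return block << BLOCK_SIZE_SHIFT


def block_of(position: int) -> int:
    """Block number that contains the given byte position."""
    return position >> BLOCK_SIZE_SHIFT


def offset_in_block(position: int) -> int:
    """Offset of the byte position within its block."""
    return position & BLOCK_MASK
```

These shortcuts are only valid because the block size is a power of 2; with any other block size the shift and mask would have to become a division and a modulo.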

Bug

The searchable snapshot code does not handle the case where fetching an 8MiB section of a file crosses a snapshot chunk boundary. As a result, if the repository chunk_size parameter is not a multiple of 8MiB, this code will fail. In practice, all default chunk sizes are in fact a multiple of 8MiB (fs: no chunking, s3: 1GiB, gcs: 5TiB, hdfs: no chunking). However, a user can configure any value. This was discovered while attempting to implement #9514: some repositories choose a random value between 100 and 1000 bytes for the test case.
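
To make the failing condition concrete, the following Python sketch checks whether a given block straddles a chunk boundary. It is written for illustration only: `block_crosses_chunk_boundary` is a hypothetical helper, and it assumes the block starts inside the file.

```python
BLOCK_SIZE = 8 * 1024 * 1024  # 8 MiB block size described above


def block_crosses_chunk_boundary(
    block: int, chunk_size: int, file_length: int, block_size: int = BLOCK_SIZE
) -> bool:
    """True if the bytes of the given block live in more than one repository chunk."""
    start = block * block_size
    # Last byte of the block (inclusive), clamped to the end of the file.
    last = min(start + block_size, file_length) - 1
    return start // chunk_size != last // chunk_size
```

With a chunk_size of 12MiB and a 24MiB file, block 1 covers bytes 8MiB to 16MiB and therefore spans chunks 0 and 1, while blocks 0 and 2 each stay inside one chunk. With a chunk_size that is a multiple of 8MiB, no block crosses a boundary.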

Possible solution

  • Improve OnDemandBlockSnapshotIndexInput to always download the configured block size (i.e. 8MiB) even if that means downloading multiple file parts from the repository.
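
One way to picture the proposed fix is as a planning step that maps a block's byte range onto one or more repository parts. The Python sketch below is an assumption about how such a mapping could look, not the change that was actually made; `part_reads` and its tuple layout are invented for this example.

```python
from typing import List, Tuple


def part_reads(block_start: int, block_length: int, chunk_size: int) -> List[Tuple[int, int, int]]:
    """Plan the reads needed to fill one block.

    Returns (part_number, offset_in_part, length) tuples, in file order,
    that together cover [block_start, block_start + block_length).
    """
    reads = []
    position = block_start
    end = block_start + block_length
    while position < end:
        part = position // chunk_size
        offset = position % chunk_size
        # Read up to the end of this part or the end of the block, whichever comes first.
        length = min(chunk_size - offset, end - position)
        reads.append((part, offset, length))
        position += length
    return reads
```

For a chunk_size of 1000 bytes, a block starting at byte 800 with length 800 would be served by two reads: 200 bytes from the end of part 0 and 600 bytes from the start of part 1. When the chunk size is a multiple of the block size, the plan always contains a single read, which matches the behavior the current code relies on.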

andrross added the bug, untriaged, and distributed framework labels and removed the untriaged label on Aug 31, 2023

@kkmr
Contributor

kkmr commented Sep 6, 2023

I'm looking into this

@kotwanikunal
Member

@kkmr Are you still looking at it?
