Background

Every repository implementation accepts an optional chunk_size parameter at repository creation time. This property defines the maximum file size that will be uploaded to the repository. Any file larger than that will be broken into smaller files of chunk_size bytes each (with the last chunk potentially smaller).
Searchable snapshots work by fetching partial index files on demand at search time, and storing these parts as 8MiB (current hard-coded default) files on disk (unless the entire file is smaller than 8MiB). A virtual IndexInput wraps this logic, giving the upper layers the appearance of a single file while the implementation fetches and reads what is needed from these 8MiB partial files. Some clever logic in this code relies on the partial file size being a power of 2 so that bit shifting can convert a "block number" into an actual byte offset.
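The power-of-2 trick described above can be sketched as follows. This is an illustrative standalone class, not the actual OnDemandBlockSnapshotIndexInput code: 8MiB is 2^23 bytes, so a shift by 23 and a mask over the low 23 bits replace division and modulo.

```java
public class BlockMath {
    // 8MiB block size, a power of two: 2^23 bytes (the hard-coded default)
    static final int BLOCK_SIZE_SHIFT = 23;
    static final long BLOCK_SIZE = 1L << BLOCK_SIZE_SHIFT; // 8_388_608 bytes
    static final long BLOCK_MASK = BLOCK_SIZE - 1;         // low 23 bits set

    // Which 8MiB partial file holds this byte of the original index file
    static long blockId(long fileOffset) {
        return fileOffset >>> BLOCK_SIZE_SHIFT;
    }

    // Byte offset within that partial file
    static long offsetInBlock(long fileOffset) {
        return fileOffset & BLOCK_MASK;
    }

    // First byte of a given block, back in original-file coordinates
    static long blockStart(long blockId) {
        return blockId << BLOCK_SIZE_SHIFT;
    }

    public static void main(String[] args) {
        long offset = 20_000_000L; // some byte in the original index file
        System.out.println(blockId(offset));       // 2
        System.out.println(offsetInBlock(offset)); // 3_222_784
    }
}
```

These identities only hold because the block size is a power of 2, which is why the code depends on that property.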
Bug
The searchable snapshot code does not handle the case where fetching an 8MiB section of a file crosses one of the snapshot chunk boundaries. This means that if the repository chunk_size parameter is not a multiple of 8MiB, then this code will fail. In practice, all default chunk sizes are in fact a multiple of 8MiB (fs: no chunking, s3: 1GiB, gcs: 5TiB, hdfs: no chunking). However, a user can configure any value. This was discovered while attempting to implement #9514, where some repositories choose a random chunk size between 100 and 1000 bytes for the test case.
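A small sketch of the failure condition: given a block's starting offset and a repository chunk_size, we can check whether the block's first and last bytes land in different chunk files. The class and method names here are illustrative, not from the OpenSearch codebase.

```java
public class ChunkBoundaryCheck {
    static final long BLOCK_SIZE = 8L * 1024 * 1024; // 8MiB fetch unit

    // Repository chunk (part) index that contains a byte of the original file
    static long chunkIndex(long fileOffset, long chunkSize) {
        return fileOffset / chunkSize;
    }

    // Does an 8MiB block starting at blockStart straddle a chunk boundary?
    static boolean crossesChunk(long blockStart, long chunkSize) {
        long lastByte = blockStart + BLOCK_SIZE - 1;
        return chunkIndex(blockStart, chunkSize) != chunkIndex(lastByte, chunkSize);
    }

    public static void main(String[] args) {
        long alignedChunk = 1024L * 1024 * 1024; // 1GiB (s3 default), a multiple of 8MiB
        long oddChunk = 5_000_000L;              // arbitrary user-configured value

        System.out.println(crossesChunk(0, alignedChunk)); // false: block fits in one part
        System.out.println(crossesChunk(0, oddChunk));     // true: block 0 spans parts 0 and 1
    }
}
```

When chunk_size is a multiple of 8MiB, every block falls entirely within one part, which is why the defaults happen to work.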
Possible solution
Improve OnDemandBlockSnapshotIndexInput to always download the configured block size (i.e. 8MiB) even if that means downloading multiple file parts from the repository.
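One way the proposed fix could work, sketched with small numbers and an in-memory array of parts standing in for the repository (the real implementation would stream from the repository's per-part read API; all names here are hypothetical):

```java
import java.util.Arrays;

public class MultiPartBlockReader {
    // Assemble one full block, copying from as many parts as the block spans.
    // parts: the chunked files; chunkSize: the repository chunk_size;
    // blockStart/blockSize: the block's range in original-file coordinates.
    static byte[] readBlock(byte[][] parts, long chunkSize, long blockStart, int blockSize) {
        byte[] out = new byte[blockSize];
        int copied = 0;
        long pos = blockStart;
        while (copied < blockSize) {
            int partIdx = (int) (pos / chunkSize);       // which part holds pos
            int offsetInPart = (int) (pos % chunkSize);  // where in that part
            byte[] part = parts[partIdx];
            // Copy up to the end of this part, then continue into the next one
            int n = Math.min(blockSize - copied, part.length - offsetInPart);
            System.arraycopy(part, offsetInPart, out, copied, n);
            copied += n;
            pos += n;
        }
        return out;
    }

    public static void main(String[] args) {
        // A 10-byte "file" stored with chunk_size 4, which is not a multiple
        // of the block size 6 used here, so the block spans two parts.
        byte[][] parts = { {0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9} };
        byte[] block = readBlock(parts, 4, 2, 6);
        System.out.println(Arrays.toString(block)); // [2, 3, 4, 5, 6, 7]
    }
}
```

This keeps the power-of-2 block math on the read path intact, since the caller still always sees fixed-size blocks regardless of how the repository chunked the file.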