Single definition of batch size in indexing vs transformers #233
Can you please clarify a bit more where to look in the code so I can help fix the issue?
@mdabir1203 Thanks for the interest and wanting to pick this up! 🚀

Currently there are two batch sizes defined. One lives in the indexing pipeline itself: the pipeline needs to know how many chunks it should batch up, and sends those as a list to the batch transformer. However, some batch transformers also have a batch size of their own, regardless of the pipeline's. FastEmbed, for instance, takes a batch size that its models also use internally for embeddings. This leads to the following awkward API:

```rust
let batch_size = 100;

Pipeline::from_loader(loaders::FileLoader::new(tempdir.path()).with_extensions(&["rs"]))
    .then_in_batch(
        batch_size,
        transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()),
    )
```
The value is duplicated and the two need to be kept aligned manually. Specifically for FastEmbed the default is `None`, in which case the library uses its own default batch size. Ideally the batch size would be specified once, only on the transformer (or on the client if that determines it), with a sensible default (256 is great). The default would be set in the pipeline defaults, and transformers with children that determine the batch size would need to propagate it. See the `Persist` trait for how I think it should work. `then_in_batch` would then look like this instead:

```rust
.then_in_batch(transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()))
```
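As a rough sketch of that shape (hypothetical names, not swiftide's actual trait definitions), the batch transformer could own an optional batch size that the pipeline queries, falling back to its own default only when none is set:

```rust
// Hypothetical sketch: the batch transformer owns an optional batch size,
// similar to how the Persist trait exposes one. Names are illustrative.
trait BatchTransformer {
    /// The transformer's preferred batch size, if any. The pipeline falls
    /// back to its own default when this returns None.
    fn batch_size(&self) -> Option<usize> {
        None
    }
}

struct Embed {
    batch_size: Option<usize>,
}

impl BatchTransformer for Embed {
    fn batch_size(&self) -> Option<usize> {
        self.batch_size
    }
}
```

A transformer that wraps a child (an embedding client, for example) would implement `batch_size` by forwarding the child's value, which is how the propagation mentioned above could work.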
Can you show which file and which line of the code this is? I have a probable implementation in mind, but I need to see a bit more. For example, a possible implementation could look like this:
```rust
// Usage (pseudocode sketch; a compilable version follows below)
fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // no specific batch size set
    pipeline.then_in_batch(embed_transformer); // uses the default batch size of 100
}
```

@mdabir1203 Almost everything is in swiftide-indexing; traits and core shared types are in swiftide-core. What's this? Have you looked at our examples or tried to run a pipeline? For searching the code, ripgrep and rust-analyzer are your friends. Best of luck!
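For reference, here is a self-contained version of the sketch above that actually compiles. All types are hypothetical stand-ins, not swiftide's real API:

```rust
// Minimal stand-in types to make the sketch runnable; not swiftide's API.
#[derive(Clone, Copy)]
struct PipelineDefaults {
    default_batch_size: usize,
}

struct Pipeline {
    defaults: PipelineDefaults,
}

struct Embed {
    batch_size: Option<usize>,
}

impl Pipeline {
    fn then_in_batch(&self, transformer: Embed) {
        // The transformer's own batch size wins; otherwise use the pipeline default.
        let batch_size = transformer
            .batch_size
            .unwrap_or(self.defaults.default_batch_size);
        println!("batching with size {batch_size}");
    }
}

fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // no specific batch size set
    pipeline.then_in_batch(embed_transformer); // prints "batching with size 100"
}
```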
…tch size value and Embed/SparseEmbed are able to modify it. Fixes bosun-ai#233
… default ba… (#336)

Fixes #233

BREAKING CHANGE: The batch size of batch transformers when indexing is now configured on the batch transformer. If no batch size or default is configured, a configurable default from the pipeline is used. The default batch size is 256.

---------

Co-authored-by: Timon Vonk <[email protected]>
Co-authored-by: Timon Vonk <[email protected]>
Batch size in indexing should either be set on the pipeline only and inherited by the transformer, or set on the transformer only and used by the pipeline.

EmbeddingModels currently take a double batch size; it's confusing and not needed. I think the latter, with a sane default, would give the cleanest API.
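Based on the merged change above, the effective precedence presumably reduces to something like this (illustrative names only):

```rust
/// Illustrative only: the transformer's own batch size wins, then the
/// pipeline's configured default, then the library-wide fallback of 256.
const DEFAULT_BATCH_SIZE: usize = 256;

fn effective_batch_size(
    transformer_batch_size: Option<usize>,
    pipeline_default: Option<usize>,
) -> usize {
    transformer_batch_size
        .or(pipeline_default)
        .unwrap_or(DEFAULT_BATCH_SIZE)
}
```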