Single definition of batch size in indexing vs transformers #233

Closed
timonv opened this issue Aug 13, 2024 · 4 comments · Fixed by #336
Labels
breaking change · enhancement (New feature or request) · good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments

@timonv
Member

timonv commented Aug 13, 2024

Batch size in indexing should either be on the pipeline only and inherited by the transformer, or set on the transformer only and used by the pipeline.

EmbeddingModels currently end up with a batch size defined in two places; it's confusing and not needed. I think the latter, with a sane default, would give the cleanest API.
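
Roughly, the two options look like this (illustrative only; with_batch_size is a made-up method here, not the current API):

// Option 1: batch size lives on the pipeline and is inherited by the transformer
pipeline.then_in_batch(256, transformers::Embed::new(embedding_model));

// Option 2: batch size lives on the transformer only and is used by the pipeline
pipeline.then_in_batch(transformers::Embed::new(embedding_model).with_batch_size(256));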

@timonv timonv added the enhancement, good first issue, help wanted, and breaking change labels Aug 16, 2024
@mdabir1203

Can you please clarify a bit more where to look in the code so I can help fix this issue?

@timonv
Member Author

timonv commented Sep 4, 2024

@mdabir1203 Thanks for the interest and for wanting to pick this up! 🚀

Currently there are two batch sizes defined. One is on the indexing pipeline itself: the pipeline needs to know how many chunks it should batch up, and sends those as a list to the batch transformer. However, some batch transformers also have their own batch size, regardless of the pipeline's. FastEmbed, for instance, also takes a batch size that it uses internally for its embedding models.

This leads to the following weird API:

let batch_size = 100;

Pipeline::from_loader(loaders::FileLoader::new(tempdir.path()).with_extensions(&["rs"]))
    .then_in_batch(
        batch_size,
        transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()),
    )

It's doubled up and needs to be aligned. Specifically for FastEmbed, the default is None, in which case the library uses its own default batch size.

Ideally I'd have the batch size specified once, on the transformer only (or on the client, if that is what determines it), with sensible defaults (256 is great). This would need to be backed by a default in the pipeline. Transformers with children that determine the batch size would need to propagate it.

See the Persist trait for how I think it should work.
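
The pattern there is roughly this: a trait method returning an optional batch size that defaults to None, so the pipeline can fall back to its own default when an implementation doesn't specify one. A minimal sketch (the real trait lives in swiftide-core and has more to it; the names here are only for illustration):

// Sketch of the pattern: an optional batch size on the trait, defaulting to None.
// The pipeline would then do transformer.batch_size().unwrap_or(pipeline_default).
trait BatchSized {
    fn batch_size(&self) -> Option<usize> {
        None
    }
}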

then_in_batch would then look like this instead:

.then_in_batch(transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()))

For example, a possible implementation could look like this:

  • Remove the batch size argument from then_in_batch
  • Add a batch_size -> Option<usize> method to the BatchTransformerTrait
  • I think Embed and SparseEmbed might then need a batch size, given the generics; maybe there's a nice way to avoid it.
  • Ensure any integrations with batch sizes have sensible defaults if needed; otherwise they can use the default from the pipeline. Afaik this is mostly fastembed.
  • Add a default_batch_size to PipelineDefaults
  • In then_in_batch, try to get the batch size from the transformer and fall back to the pipeline default.

@mdabir1203

Can you show which file and which lines of code this is? I have a probable implementation like this in mind, but I need to see a bit more:

// Define a struct for pipeline defaults
struct PipelineDefaults {
    default_batch_size: usize,
}

// Define the BatchTransformerTrait
trait BatchTransformerTrait {
    fn batch_size(&self) -> Option<usize>;
}

// Example implementation for Embed
struct Embed {
    batch_size: Option<usize>,
}

impl BatchTransformerTrait for Embed {
    fn batch_size(&self) -> Option<usize> {
        self.batch_size
    }
}

// Modify the Pipeline struct
struct Pipeline {
    defaults: PipelineDefaults,
}

impl Pipeline {
    fn then_in_batch(self, transformer: impl BatchTransformerTrait) {
        let batch_size = transformer.batch_size().unwrap_or(self.defaults.default_batch_size);
        // Proceed with processing using the determined batch_size
    }
}

// Usage
fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // No specific batch size set

    pipeline.then_in_batch(embed_transformer); // Uses default batch size of 100
}

@timonv
Member Author

timonv commented Sep 5, 2024

@mdabir1203 Almost everything is in swiftide-indexing; traits and core shared types are in swiftide-core.

// Usage
fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // No specific batch size set

    pipeline.then_in_batch(embed_transformer); // Uses default batch size of 100
}

What's this? Have you looked at our examples or tried to run a pipeline? For searching the code, ripgrep and rust-analyzer are your friends. Best of luck!

devsprint added a commit to devsprint/swiftide that referenced this issue Sep 25, 2024
…tch size value and Embeed/SparseEmbed are able to modify it. Fixes bosun-ai#233
timonv added a commit that referenced this issue Sep 26, 2024
… default ba… (#336)

Fixes #233

BREAKING CHANGE: The batch size of batch transformers when indexing is
now configured on the batch transformer. If no batch size or default is
configured, a configurable default is used from the pipeline. The
default batch size is 256.

---------

Co-authored-by: Timon Vonk <[email protected]>
Co-authored-by: Timon Vonk <[email protected]>
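
Under the behavior described in the breaking change above, usage looks roughly like this (a sketch based on the earlier examples in this thread; exact builder methods and defaults may differ):

// Batch size is configured on the batch transformer itself.
Pipeline::from_loader(loaders::FileLoader::new("src").with_extensions(&["rs"]))
    .then_in_batch(transformers::Embed::new(
        FastEmbed::builder().batch_size(64).build().unwrap(),
    ))

// With no batch size configured on the transformer, the pipeline default of 256 is used.
Pipeline::from_loader(loaders::FileLoader::new("src").with_extensions(&["rs"]))
    .then_in_batch(transformers::Embed::new(FastEmbed::builder().build().unwrap()))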