Single definition of batch size in indexing vs transformers #233

Closed
timonv opened this issue Aug 13, 2024 · 4 comments · Fixed by #336
Labels
breaking change · enhancement (New feature or request) · good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments

@timonv
Member

timonv commented Aug 13, 2024

Batch size in indexing should either be on the pipeline only and inherited by the transformer, or set on the transformer only and used by the pipeline.

EmbeddingModels currently end up with a batch size defined in two places; it's confusing and not needed. I think the latter, with a sane default, would give the cleanest API.
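
Roughly, the two options look like this (illustrative only; with_batch_size is a made-up method here, not the current API):

// Option 1: batch size lives on the pipeline and is inherited by the transformer
pipeline.then_in_batch(256, transformers::Embed::new(embedding_model));

// Option 2: batch size lives on the transformer only and is used by the pipeline
pipeline.then_in_batch(transformers::Embed::new(embedding_model).with_batch_size(256));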

@timonv timonv added the enhancement, good first issue, help wanted, and breaking change labels Aug 16, 2024
@mdabir1203

Can you please clarify a bit more where to look in the code so I can help fix this issue?

@timonv
Member Author

timonv commented Sep 4, 2024

@mdabir1203 Thanks for the interest and for wanting to pick this up! 🚀

Currently there are two batch sizes defined. One is on the indexing pipeline itself: the pipeline needs to know how many chunks it should batch up, and sends those as a list to the batch transformer. However, some batch transformers also have their own batch size, regardless of the pipeline's. FastEmbed, for instance, also takes a batch size that it uses internally for its embedding models.

This leads to the following weird API:

let batch_size = 100;

Pipeline::from_loader(loaders::FileLoader::new(tempdir.path()).with_extensions(&["rs"]))
    .then_in_batch(
        batch_size,
        transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()),
    )

It's doubled up and needs to be aligned. Specifically for FastEmbed, the default is None, in which case the library uses its own default batch size.

Ideally I'd have the batch size specified once, on the transformer only (or on the client, if that is what determines it), with sensible defaults (256 is great). This would need to be backed by a default in the pipeline. Transformers with children that determine the batch size would need to propagate it.

See the Persist trait for how I think it should work.
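
The pattern there is roughly this: a trait method returning an optional batch size that defaults to None, so the pipeline can fall back to its own default when an implementation doesn't specify one. A minimal sketch (the real trait lives in swiftide-core and has more to it; the names here are only for illustration):

// Sketch of the pattern: an optional batch size on the trait, defaulting to None.
// The pipeline would then do transformer.batch_size().unwrap_or(pipeline_default).
trait BatchSized {
    fn batch_size(&self) -> Option<usize> {
        None
    }
}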

then_in_batch would then look like this instead:

.then_in_batch(transformers::Embed::new(FastEmbed::builder().batch_size(batch_size).build().unwrap()))

For example, a possible implementation could look like this:

  • Remove the batch size argument from then_in_batch
  • Add a batch_size -> Option<usize> method to the BatchTransformerTrait
  • I think Embed and SparseEmbed might then need a batch size, given the generics; maybe there's a nice way to avoid it.
  • Ensure any integrations with batch sizes have sensible defaults if needed; otherwise they can use the default from the pipeline. Afaik this is mostly fastembed.
  • Add a default_batch_size to PipelineDefaults
  • In then_in_batch, try to get the batch size from the transformer and fall back to the pipeline default.

@mdabir1203

Can you show which file and which lines of code this is? I have a probable implementation like this in mind, but I need to see a bit more:

// Define a struct for pipeline defaults
struct PipelineDefaults {
    default_batch_size: usize,
}

// Define the BatchTransformerTrait
trait BatchTransformerTrait {
    fn batch_size(&self) -> Option<usize>;
}

// Example implementation for Embed
struct Embed {
    batch_size: Option<usize>,
}

impl BatchTransformerTrait for Embed {
    fn batch_size(&self) -> Option<usize> {
        self.batch_size
    }
}

// Modify the Pipeline struct
struct Pipeline {
    defaults: PipelineDefaults,
}

impl Pipeline {
    fn then_in_batch(self, transformer: impl BatchTransformerTrait) {
        let batch_size = transformer.batch_size().unwrap_or(self.defaults.default_batch_size);
        // Proceed with processing using the determined batch_size
    }
}

// Usage
fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // No specific batch size set

    pipeline.then_in_batch(embed_transformer); // Uses default batch size of 100
}

@timonv
Member Author

timonv commented Sep 5, 2024

@mdabir1203 Almost everything is in swiftide-indexing; traits and core shared types are in swiftide-core.

// Usage
fn main() {
    let defaults = PipelineDefaults { default_batch_size: 100 };
    let pipeline = Pipeline { defaults };

    let embed_transformer = Embed { batch_size: None }; // No specific batch size set

    pipeline.then_in_batch(embed_transformer); // Uses default batch size of 100
}

What's this? Have you looked at our examples or tried to run a pipeline? For searching the code, ripgrep and rust-analyzer are your friends. Best of luck!

devsprint added a commit to devsprint/swiftide that referenced this issue Sep 25, 2024
…tch size value and Embeed/SparseEmbed are able to modify it. Fixes bosun-ai#233
timonv added a commit that referenced this issue Sep 26, 2024
… default ba… (#336)

Fixes #233

BREAKING CHANGE: The batch size of batch transformers when indexing is
now configured on the batch transformer. If no batch size or default is
configured, a configurable default is used from the pipeline. The
default batch size is 256.

---------

Co-authored-by: Timon Vonk <[email protected]>
Co-authored-by: Timon Vonk <[email protected]>
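
Under the behavior described in the breaking change above, usage looks roughly like this (a sketch based on the earlier examples in this thread; exact builder methods and defaults may differ):

// Batch size is configured on the batch transformer itself.
Pipeline::from_loader(loaders::FileLoader::new("src").with_extensions(&["rs"]))
    .then_in_batch(transformers::Embed::new(
        FastEmbed::builder().batch_size(64).build().unwrap(),
    ))

// With no batch size configured on the transformer, the pipeline default of 256 is used.
Pipeline::from_loader(loaders::FileLoader::new("src").with_extensions(&["rs"]))
    .then_in_batch(transformers::Embed::new(FastEmbed::builder().build().unwrap()))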