
Transformer Embeddings

Production-grade embedding generation, for any length of text, for transformer models.

This library simplifies and streamlines the use of encoder transformer models supported by HuggingFace's transformers library (from the model hub or a local path) to generate embeddings for string inputs, similar to the way sentence-transformers does.

Please note that starting with v4, we have dropped support for Python 3.7. If you need to use this library with Python 3.7, the latest compatible release is version 3.1.0.

Why use this over HuggingFace's transformers or sentence-transformers?

Under the hood, we take care of the following (a sketch of the equivalent raw transformers code follows this list):

  1. Supporting any model on the HF model hub, with sensible defaults for inference.
  2. Setting the PyTorch model to eval mode.
  3. Using no_grad() when doing the forward pass.
  4. Batching, and returning output in the format produced by HF transformers.
  5. Padding / truncating to model defaults.
  6. Moving to and from GPUs if available.
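
For reference, here is roughly what that boilerplate looks like when written against the raw transformers API. This is an illustrative sketch (the model name is only an example), not the library's internal code:

import torch
from transformers import AutoModel, AutoTokenizer

# Any encoder model from the HF hub (or a local path) works; this one is an example.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# (6) Move to the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# (2) Set the PyTorch model to eval mode.
model.eval()

texts = ["Lorem ipsum dolor sit amet", "consectetur adipiscing elit"]
# (4, 5) Batch, pad, and truncate to the model's defaults.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
# (3) Forward pass without gradient tracking.
with torch.no_grad():
    output = model(**batch)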

Installation

You can install Transformer Embeddings via pip from PyPI:

$ pip install transformer-embeddings

Usage

from transformer_embeddings import TransformerEmbeddings

transformer = TransformerEmbeddings("model_name")

If you have a previously instantiated model and/or tokenizer, you can pass them in.

transformer = TransformerEmbeddings(model=model, tokenizer=tokenizer)
transformer = TransformerEmbeddings(model_name="model_name", model=model)

or

transformer = TransformerEmbeddings(model_name="model_name", tokenizer=tokenizer)

Note: The model_name should be included if only one of model or tokenizer is passed in.
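
For example, to reuse a model and tokenizer that are already in memory (the model name here is only an example):

from transformers import AutoModel, AutoTokenizer
from transformer_embeddings import TransformerEmbeddings

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

transformer = TransformerEmbeddings(model=model, tokenizer=tokenizer)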

Embeddings

To get output embeddings:

embeddings = transformer.encode(["Lorem ipsum dolor sit amet",
                                 "consectetur adipiscing elit",
                                 "sed do eiusmod tempor incididunt",
                                 "ut labore et dolore magna aliqua."])
embeddings.output
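
The output field holds the forward-pass output in the format produced by HF transformers (see the list above). Assuming the underlying model returns a last_hidden_state, the token-level embeddings can be read as usual:

embeddings.output.last_hidden_state  # token-level embeddings: (num_inputs, seq_len, hidden_size)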

Pooled Output

To get pooled outputs:

from transformer_embeddings import TransformerEmbeddings, mean_pooling

transformer = TransformerEmbeddings("model_name", return_output=False, pooling_fn=mean_pooling)

embeddings = transformer.encode(["Lorem ipsum dolor sit amet",
                                 "consectetur adipiscing elit",
                                 "sed do eiusmod tempor incididunt",
                                 "ut labore et dolore magna aliqua."])

embeddings.pooled
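
mean_pooling collapses token-level embeddings into one vector per input by averaging over tokens, typically weighted by the attention mask so padding tokens do not contribute. A minimal sketch of the idea, not the library's exact implementation:

import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Broadcast the attention mask over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    # Average only over real (non-padding) tokens.
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)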

Exporting the Model

Once you are done training and testing the model, it can be exported into a single tarball:

from transformer_embeddings import TransformerEmbeddings

transformer = TransformerEmbeddings("model_name")
transformer.export(additional_files=["/path/to/other/files/to/include/in/tarball.pickle"])

This tarball can also be uploaded to S3; doing so requires installing the S3 extras (pip install transformer-embeddings[s3]) and passing an s3_path:

from transformer_embeddings import TransformerEmbeddings

transformer = TransformerEmbeddings("model_name")
transformer.export(
    additional_files=["/path/to/other/files/to/include/in/tarball.pickle"],
    s3_path="s3://bucket/models/model-name/date-version/",
)

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the Apache 2.0 license, Transformer Embeddings is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was partly generated from @cjolowicz's Hypermodern Python Cookiecutter template.
