CrossEncoderModule with rerank API #389
Conversation
This module is closely related to EmbeddingModule.
Cross-encoder models use Q and A pairs and are trained to return a relevance score for rank(). The existing rerank APIs in EmbeddingModule had to encode Q and A separately and use cosine similarity as a score. So the API is the same, but the results are supposed to be better (and slower).
Cross-encoder models do not support returning embedding vectors or sentence-similarity.
Support for the existing tokenization and model_info endpoints was also added.
Signed-off-by: Mark Sturdevant <[email protected]>
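To illustrate the difference described above, here is a minimal sketch using sentence-transformers directly; the model names and calls are assumptions for the example, not this module's implementation:

from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I reset my password?"
docs = ["Go to Settings and click 'Reset password'.", "Our office is open 9 to 5."]

# Existing EmbeddingModule-style rerank: encode Q and A separately, score by cosine similarity
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cosine_scores = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))[0]

# Cross-encoder rerank: score each (query, document) pair jointly for relevance
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, d) for d in docs])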
Force-pushed from c118db3 to 5b0989f.
Some questions, mainly around how much ipex will apply here, configurable parameters, and truncation result testing!
@pytest.mark.parametrize("truncate_input_tokens", [-1, 99, 510, 511, 512])
def test_too_many_tokens_with_truncation_working(truncate_input_tokens, loaded_model):
While no errors are raised, maybe there should be at least one test to make sure the truncation leads to the expected final result (mainly to make sure the logic of the _truncation_needed function, e.g. its positioning, is tested and working as expected)?
Yep. Needed to push the PR before I could get to that. Will add a confirming test.
done
error.value_check(
    "<NLP20896115E>",
    artifacts_path,
    ValueError(f"Model config missing '{cls._ARTIFACTS_PATH_KEY}'"),
value_check works with ValueError automatically, so you do not need to pass a ValueError here. You can simply pass f"Model config missing '{cls._ARTIFACTS_PATH_KEY}'" instead.
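In other words, restating the snippet above with the suggested change (not runnable on its own, since error and cls come from the surrounding module):

# Before: wrapping the message in a ValueError explicitly
error.value_check(
    "<NLP20896115E>",
    artifacts_path,
    ValueError(f"Model config missing '{cls._ARTIFACTS_PATH_KEY}'"),
)

# After (the suggestion): pass just the message; value_check raises the ValueError itself
error.value_check(
    "<NLP20896115E>",
    artifacts_path,
    f"Model config missing '{cls._ARTIFACTS_PATH_KEY}'",
)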
Thanks. Fixed
if ipex:
    if autocast:  # IPEX performs best with autocast using bfloat16
        model = ipex.optimize(
            model, dtype=torch.bfloat16, weights_prepack=False
Is bfloat16 supported on all devices for ipex? Or should we make the dtype configurable somehow?
Took this out because we won't really properly test the ipex options for short-term cross-encoder needs. But FYI: various config names related to dtype/bfloat16 were rejected as confusing with other uses, so for embeddings it ended up that ipex + autocast is how you take advantage of ipex with bfloat16 speed.
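For reference, a minimal sketch of the ipex + autocast pattern in the removed code shown above; it assumes intel_extension_for_pytorch is installed and model is a torch.nn.Module, and the flag names are illustrative only:

import torch

try:
    import intel_extension_for_pytorch as ipex  # optional dependency
except ImportError:
    ipex = None

def maybe_ipex_optimize(model, use_ipex: bool, use_autocast: bool):
    # Apply IPEX optimization; use bfloat16 when autocast is requested.
    if use_ipex and ipex is not None:
        if use_autocast:  # IPEX performs best with autocast using bfloat16
            model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)
        else:
            model = ipex.optimize(model, weights_prepack=False)
    return model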
self,
queries: List[str],
documents: List[JsonDict],
top_n: Optional[int] = None,
nit: why not call this top_k?
Long story, but the short version is that some folks thought top_k had other implications and preferred to avoid it and go with top_n. This is now in our rerank API, which is shared between text-embedding models and cross-encoder models, so changing it would not be great.
So today I have top_n for the API we expose, but it becomes top_k as the familiar parameter name in the CrossEncoder functions. Sorry. Is there a better thing to do here?
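Roughly how that mapping looks, assuming the downstream call is CrossEncoder.rank() from recent sentence-transformers releases; the wrapper name and wiring here are illustrative, not this module's exact code:

from typing import List, Optional
from sentence_transformers import CrossEncoder

def rerank(model: CrossEncoder, query: str, documents: List[str], top_n: Optional[int] = None):
    # Public API keeps top_n; CrossEncoder.rank() expects the familiar top_k name.
    return model.rank(query, documents, top_k=top_n)

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
results = rerank(model, "what is python?", ["Python is a language.", "A python is a snake."], top_n=1)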
def smart_batching_collate_text_only(
    self, batch, truncate_input_tokens: Optional[int] = 0
):
    texts = [[] for _ in range(len(batch[0]))]
nit: this can be texts = [[]] * len(batch[0])?
Your way looks better and "looks" equivalent, but it breaks the data fetcher. Using range works.
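For what it's worth, the two forms are not equivalent in Python: [[]] * n repeats one shared inner list, which is likely what breaks the collation. A quick self-contained illustration:

shared = [[]] * 3                      # three references to the same list
shared[0].append("x")
print(shared)                          # [['x'], ['x'], ['x']]

independent = [[] for _ in range(3)]   # three distinct lists
independent[0].append("x")
print(independent)                     # [['x'], [], []]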
Interesting. What's the "data fetcher" here?
This is using a torch DataLoader, so callables and iterators are being used. Overkill if I wrote it from scratch, but I'm using what sentence-transformers/CrossEncoder has been using as much as possible (with our extensions as needed for truncation and token counting).
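A self-contained sketch of that pattern: DataLoader fetches batches of (query, passage) pairs and hands each batch to a collate function, which regroups it into per-column lists for the tokenizer. The toy collate below only mirrors the shape of smart_batching_collate_text_only and skips tokenization:

from torch.utils.data import DataLoader

pairs = [["q1", "doc a"], ["q1", "doc b"], ["q1", "doc c"]]

def collate(batch):
    # batch is a list of [query, passage] pairs; regroup into column lists
    texts = [[] for _ in range(len(batch[0]))]
    for example in batch:
        for idx, text in enumerate(example):
            texts[idx].append(text.strip())
    return texts

loader = DataLoader(pairs, batch_size=2, collate_fn=collate, shuffle=False)
for columns in loader:
    print(columns)  # first batch: [['q1', 'q1'], ['doc a', 'doc b']]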
activation_fct=None,
apply_softmax=False,
Suggested change:
- activation_fct=None,
- apply_softmax=False,
+ activation_fct = None,
+ apply_softmax = False,
Nope. Our linter/formatter insists on no spaces around keyword-parameter equals, which is a good thing.
The odd thing is that the lint/fmt rules are the opposite when there is a type annotation.
Fortunately tox takes care of this.
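That is the standard PEP 8 spacing rule that black and similar formatters enforce, for example:

from typing import Callable, Optional

def predict(activation_fct=None, apply_softmax=False):
    ...  # no annotation: no spaces around "="

def predict_typed(activation_fct: Optional[Callable] = None, apply_softmax: bool = False):
    ...  # annotated default: spaces around "="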
Oh wow 🤔 that's weird.
It is weird, but I find trying to make Python a typed language is generally a little weird (though usually not this odd).
    pred_scores = torch.stack(pred_scores)
elif convert_to_numpy:
    pred_scores = np.asarray(
        [score.cpu().detach().numpy() for score in pred_scores]
Wondering if score.cpu().detach().numpy() should be score.cpu().detach().float().item(), since numpy() can return an array but we want a float here?
Yes, your suggestion looks better. Done.
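A quick sketch of the difference being discussed (each per-pair score from the model is a small tensor):

import torch

score = torch.tensor([0.87])                 # a single-element score tensor
score.cpu().detach().numpy()                 # -> array([0.87], dtype=float32): still an array
score.cpu().detach().float().item()          # -> 0.8700000047683716: a plain Python float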
* Mostly removing unnecessary code
* Some better clarity
Signed-off-by: Mark Sturdevant <[email protected]>
* The "already borrowed" errors are fixed with tokenizers per thread, so there were some misleading comments about not changing params for truncation (which we do for cross-encoder truncation). Signed-off-by: Mark Sturdevant <[email protected]>
Thanks for the reviews! Forgot to mention, regarding the removal of the ipex, etc. code: part of that I was keeping to get MPS support as well, but I've found out that the default CrossEncoder handles MPS and CUDA devices already.
Default is 32. Can be overridden with the embedding batch_size in config or the EMBEDDING_BATCH_SIZE env var. Signed-off-by: Mark Sturdevant <[email protected]>
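A hedged sketch of that override; the env var and config key come from the commit message, while the function name and the precedence (env var over config) are assumptions:

import os

DEFAULT_BATCH_SIZE = 32

def resolve_batch_size(config_batch_size=None):
    # Prefer the EMBEDDING_BATCH_SIZE env var, then the config value, then the default of 32
    env_value = os.getenv("EMBEDDING_BATCH_SIZE")
    if env_value:
        return int(env_value)
    return config_batch_size or DEFAULT_BATCH_SIZE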
* Moved the truncation check to a place that can determine the proper index for the error message (with batching).
* Added test to validate some results after truncation. This is with a tiny model, but works for sanity.
Signed-off-by: Mark Sturdevant <[email protected]>
The part that really tests that a token is truncated was wrong.
* It was backwards and passing because the scores are sorted by rank
* Using the index to get scores in the order of the inputs
* Now correctly xx != xy but xy == xyz (truncated z)
Signed-off-by: Mark Sturdevant <[email protected]>
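A hypothetical sketch of the check that commit describes; the rerank callable, the result fields (index, score), and the truncation parameter are assumptions used only to illustrate the xx != xy but xy == xyz logic:

def check_truncation_effect(rerank, truncate_input_tokens):
    docs = [{"text": "x x"}, {"text": "x y"}, {"text": "x y z"}]
    results = rerank(query="x", documents=docs, truncate_input_tokens=truncate_input_tokens)
    # Results come back sorted by rank, so re-order them by original input index
    by_index = sorted(results, key=lambda r: r.index)
    assert by_index[0].score != by_index[1].score  # "x x" vs "x y" really differ
    assert by_index[1].score == by_index[2].score  # the trailing "z" was truncated away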
Signed-off-by: Mark Sturdevant <[email protected]>
Force-pushed from 45acabb to 8fa67cc.
LGTM - thanks for the updates!