
de.shadow_ops.embedding_lookup() is non thread-safe #278

Closed
alionkun opened this issue Sep 26, 2022 · 1 comment · Fixed by #280
@alionkun
Contributor

alionkun commented Sep 26, 2022

Describe the Problem

I built a model based on de.keras.layers.BasicEmbedding() and everything worked fine during the training phase.
But when serving the trained SavedModel with TFServing, I hit two issues:

  1. TFServing frequently reported a tensor shape mismatch.

    error msg: Input to reshape is a tensor with 60 values, but the requested shape has 228 ....

  2. The TFServing process occasionally core dumped.

Resolving

After double-checking my code and data without any progress, I dived into the TFRA source code and found that the following code can introduce a race condition:

def embedding_lookup(
    shadow,
    ids,
    partition_strategy=None,  # pylint: disable=unused-argument
    name=None,
    validate_indices=None,  # pylint: disable=unused-argument
):
  """
  Shadow version of dynamic_embedding.embedding_lookup. It use existed shadow
  variable to to embedding lookup, and store the result. No by-product will
  be introduced in this call. So it can be decorated by `tf.function`.
  Args:
    shadow: A ShadowVariable object.
    ids: A tensor with any shape as same dtype of params.key_dtype.
    partition_strategy: No used, for API compatiblity with `nn.emedding_lookup`.
    name: A name for the operation.
    validate_indices: No used, just for compatible with nn.embedding_lookup .
  Returns:
    A tensor with shape [shape of ids] + [dim],
      dim is equal to the value dim of params.
      containing the values from the params tensor(s) for keys in ids.
  """
  ids = ops.convert_to_tensor(ids)
  if shadow.ids.dtype != ids.dtype:
    raise ValueError('{} ids is not matched with ShadowVariable with ids'
                     ' {},'.format(ids.dtype, shadow.ids.dtype))
  with ops.name_scope(name, "shadow_embedding_lookup"):
    with ops.control_dependencies([shadow._reset_ids(ids)]):
      return shadow.read_value(do_prefetch=True)

That is, `shadow._reset_ids(ids)` (L250 in the TFRA source) updates ShadowVariable.ids (a ResourceVariable) on every call before the actual lookup. That update is not thread-safe in a multi-threaded scenario, and neither is any API that depends on de.shadow_ops.embedding_lookup(). A sketch of how this can surface in a multi-threaded server follows.
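Below is a minimal reproduction sketch, an illustration only and not the exact serving setup: it assumes TFRA is importable as tensorflow_recommenders_addons, and the get_variable/ShadowVariable arguments may differ slightly across versions. Two threads share one ShadowVariable, and because the _reset_ids(ids) assignment and the following read_value() are not executed atomically as a pair, one thread can read values prefetched for the other thread's ids, which shows up as the shape-mismatch error above.

import threading

import tensorflow as tf
import tensorflow_recommenders_addons as tfra

de = tfra.dynamic_embedding

# One dynamic embedding table and one shared ShadowVariable, as a serving
# process would hold them. Sizes here are illustrative only.
params = de.get_variable(name="emb", dim=4)
shadow = de.shadow_ops.ShadowVariable(params, name="emb_shadow")


def worker(batch_size):
    for _ in range(1000):
        ids = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int64)
        emb = de.shadow_ops.embedding_lookup(shadow, ids)
        # If another thread reset shadow.ids between our _reset_ids() and
        # read_value(), the leading dimension no longer matches batch_size.
        assert emb.shape[0] == batch_size, (emb.shape, batch_size)


# Two request threads with different, arbitrary batch sizes, mimicking
# concurrent TFServing requests hitting the same loaded SavedModel.
threads = [threading.Thread(target=worker, args=(bs,)) for bs in (15, 57)]
for t in threads:
    t.start()
for t in threads:
    t.join()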

A similar issue with DynamicEmbedding was fixed in #24, and a quick fix can be borrowed from there.

I have fixed and verified this issue in my environment and am willing to contribute the fix.

@Lifann @rhdong Could you please take a look at this?

@Lifann
Member

Lifann commented Sep 27, 2022

Thanks for the feedback, @alionkun. One way to solve this is to use sparse_variable.lookup(keys) instead of embedding_lookup(shadow, ids) in the inference phase, or when exporting the inference model. Maybe it's possible to make this internal.
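For reference, a minimal sketch of that workaround, assuming the underlying de.Variable is reachable from the serving or export code; it reuses `params` from the sketch above. Variable.lookup() is a plain read of the hash table and does not mutate any shared shadow state.

@tf.function
def serve_lookup(ids):
    # Query the dynamic_embedding.Variable directly: no per-request write to
    # a shared ShadowVariable, so concurrent requests do not race.
    return params.lookup(ids)


emb = serve_lookup(tf.constant([1, 2, 3], dtype=tf.int64))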
