
de.shadow_ops.embedding_lookup() is non thread-safe #278

Closed
alionkun opened this issue Sep 26, 2022 · 1 comment · Fixed by #280
@alionkun
Contributor

alionkun commented Sep 26, 2022

Describe the Problem

I built a model based on de.keras.layers.BasicEmbedding() and everything worked fine during the training phase.
But when serving the trained SavedModel with TFServing, I hit two issues:

  1. TFServing frequently reported a tensor shape mismatch.

    error msg: Input to reshape is a tensor with 60 values, but the requested shape has 228 ....

  2. The TFServing process occasionally core dumped.

Resolving

After double-checking my code and data without any progress, I dived into the TFRA source code and found that the following code can introduce a race condition:

def embedding_lookup(
    shadow,
    ids,
    partition_strategy=None,  # pylint: disable=unused-argument
    name=None,
    validate_indices=None,  # pylint: disable=unused-argument
):
  """
  Shadow version of dynamic_embedding.embedding_lookup. It use existed shadow
  variable to to embedding lookup, and store the result. No by-product will
  be introduced in this call. So it can be decorated by `tf.function`.
  Args:
    shadow: A ShadowVariable object.
    ids: A tensor with any shape as same dtype of params.key_dtype.
    partition_strategy: No used, for API compatiblity with `nn.emedding_lookup`.
    name: A name for the operation.
    validate_indices: No used, just for compatible with nn.embedding_lookup .
  Returns:
    A tensor with shape [shape of ids] + [dim],
      dim is equal to the value dim of params.
      containing the values from the params tensor(s) for keys in ids.
  """
  ids = ops.convert_to_tensor(ids)
  if shadow.ids.dtype != ids.dtype:
    raise ValueError('{} ids is not matched with ShadowVariable with ids'
                     ' {},'.format(ids.dtype, shadow.ids.dtype))
  with ops.name_scope(name, "shadow_embedding_lookup"):
    with ops.control_dependencies([shadow._reset_ids(ids)]):
      return shadow.read_value(do_prefetch=True)

That is, `shadow._reset_ids(ids)` (L250 in the TFRA source) updates ShadowVariable.ids (a ResourceVariable) on every call before the actual lookup. That update is not thread-safe in a multi-threaded scenario, and neither is any API that depends on de.shadow_ops.embedding_lookup(). A sketch of how this can surface in a multi-threaded server follows.
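Below is a minimal reproduction sketch, an illustration only and not the exact serving setup: it assumes TFRA is importable as tensorflow_recommenders_addons, and the get_variable/ShadowVariable arguments may differ slightly across versions. Two threads share one ShadowVariable, and because the _reset_ids(ids) assignment and the following read_value() are not executed atomically as a pair, one thread can read values prefetched for the other thread's ids, which shows up as the shape-mismatch error above.

import threading

import tensorflow as tf
import tensorflow_recommenders_addons as tfra

de = tfra.dynamic_embedding

# One dynamic embedding table and one shared ShadowVariable, as a serving
# process would hold them. Sizes here are illustrative only.
params = de.get_variable(name="emb", dim=4)
shadow = de.shadow_ops.ShadowVariable(params, name="emb_shadow")


def worker(batch_size):
    for _ in range(1000):
        ids = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int64)
        emb = de.shadow_ops.embedding_lookup(shadow, ids)
        # If another thread reset shadow.ids between our _reset_ids() and
        # read_value(), the leading dimension no longer matches batch_size.
        assert emb.shape[0] == batch_size, (emb.shape, batch_size)


# Two request threads with different, arbitrary batch sizes, mimicking
# concurrent TFServing requests hitting the same loaded SavedModel.
threads = [threading.Thread(target=worker, args=(bs,)) for bs in (15, 57)]
for t in threads:
    t.start()
for t in threads:
    t.join()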

A similar issue with DynamicEmbedding was fixed in #24, and a quick fix can be borrowed from there.

I have fixed and verified this issue in my environment and am willing to contribute the fix.

@Lifann @rhdong Could you please take a look at this?

@Lifann
Member

Lifann commented Sep 27, 2022

Thanks for the feedback, @alionkun. One way to solve this is to use sparse_variable.lookup(keys) instead of embedding_lookup(shadow, ids) in the inference phase, or when exporting the inference model. Maybe it's possible to make this internal.
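For reference, a minimal sketch of that workaround, assuming the underlying de.Variable is reachable from the serving or export code; it reuses `params` from the sketch above. Variable.lookup() is a plain read of the hash table and does not mutate any shared shadow state.

@tf.function
def serve_lookup(ids):
    # Query the dynamic_embedding.Variable directly: no per-request write to
    # a shared ShadowVariable, so concurrent requests do not race.
    return params.lookup(ids)


emb = serve_lookup(tf.constant([1, 2, 3], dtype=tf.int64))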
