
[Question] Running custom Encoder Decoder model #2491

Closed
AvivSham opened this issue Nov 24, 2024 · 5 comments
Labels: question (Further information is requested), triaged (Issue has been triaged by maintainers)

Comments

@AvivSham

Hi All,
Thank you for your amazing work.
We have an encoder-decoder model that we want to run with TensorRT-LLM. We made an architectural modification: the encoder's output dimension is pooled by stacked MLP layers.
What is the recommended way to modify the code to support the new architecture? We assume we need to change both the code that converts the model (to a static computation graph) and the code that runs it.
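For illustration, the kind of head we added looks roughly like this (a minimal PyTorch sketch; the class name, layer count, and dimensions are placeholders, not our actual model):

import torch.nn as nn

# Hypothetical pooling head: stacked MLP layers that reduce the encoder's
# output dimension (sizes are illustrative only).
class EncoderPoolingHead(nn.Module):
    def __init__(self, d_in: int = 1280, d_out: int = 320):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_in // 2),
            nn.GELU(),
            nn.Linear(d_in // 2, d_out),
        )

    def forward(self, encoder_output):
        # encoder_output: (batch, time, d_in) -> (batch, time, d_out)
        return self.mlp(encoder_output)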

Please advise,

@hello-11 added the question and triaged labels on Nov 25, 2024
@hello-11
Collaborator

@AvivSham you can follow this guide.

@AvivSham
Author

Thank you for your response, @hello-11.
We followed the guide and created a custom encoder model by adding a single linear layer:

# Imports assumed for the TensorRT-LLM version used by the Whisper example;
# exact module paths may differ between releases.
from tensorrt_llm import default_net
from tensorrt_llm.functional import Tensor, cast, gelu, transpose, unsqueeze
from tensorrt_llm.layers import Linear
from tensorrt_llm.models.enc_dec.model import WhisperEncoder
from tensorrt_llm.models.modeling_utils import PretrainedConfig


class CustomEncoder(WhisperEncoder):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        # extra linear layer on top of the stock WhisperEncoder
        self.lin = Linear(in_features=1280, out_features=1280)

    def forward(self,
                input_features: Tensor,
                input_lengths=None,
                position_ids=None):
        if default_net().plugin_config.remove_input_padding:
            # BXT,D -> 1,BxT,D -> 1,D,BxT
            input_features = unsqueeze(input_features, 0)
            input_features = transpose(input_features, 1, 2)
        # Encoder conv needs to run in fp32 on Volta/Turing
        x_type = input_features.dtype
        input_features = cast(input_features, self._conv_dtype)
        x = self.conv1(input_features)
        x = gelu(x)
        x = self.conv2(x)
        x = cast(x, x_type)
        x = gelu(x)
        x = transpose(x, 2, 1)
        x = x + cast(self.position_embedding(position_ids), x.dtype)

        if default_net().plugin_config.remove_input_padding:
            #B,T,D -> BxT,D
            x = x.view([-1, self.config.hidden_size])
        hidden_states = x
        input_lengths = input_lengths // self.downsample_factor
        for encoder_layer in self.encoder_layers:
            hidden_states = encoder_layer(hidden_states,
                                          input_lengths=input_lengths)

        x = hidden_states
        x = self.lin(x)
        x = self.ln_post(x)
        x.mark_output('encoder_output', self._dtype)
        return x

We also wrote a new convert_checkpoint.py. Just as a sanity check, we added the following lines (at lines 246-247) to the convert_checkpoint.py file in the Whisper example, since the added linear layer is not included in the whisper-v3.pt file that the example uses:

weights['lin.weight'] = torch.rand(1280, 1280).contiguous()
weights['lin.bias'] = torch.rand(1280).contiguous()
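Since these tensors only exist to exercise the build path, we also considered matching the dtype of the rest of the converted weights (float16 in the example); this is a guess on our side, not something the error below is about:

import torch

# Random values only for the build sanity check; real weights would come from
# the trained model. `weights` is the same dict built earlier in
# convert_checkpoint.py. Matching the checkpoint dtype (float16 in the example)
# is an assumption on our part; torch.rand defaults to float32.
dtype = torch.float16
weights['lin.weight'] = torch.rand(1280, 1280, dtype=dtype).contiguous()
weights['lin.bias'] = torch.rand(1280, dtype=dtype).contiguous()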

when running:

trtllm-build  --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --max_input_len 3000 --max_seq_len=3000

we receive the following error:

  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 662, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 675, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'lin.per_channel_scale'}

After a deep dive, it seems that lin.per_channel_scale, which is related to quantization, is added to the model's named parameters when the model's config is loaded.

I assume it relates to this:

def __post_init__(self):
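
One way to check whether quantization is the trigger (our guess, not verified against the exact code path) is to look at the quantization section of the converted checkpoint's config.json:

import json

# Placeholder path: the same ${checkpoint_dir} used in the trtllm-build command above.
checkpoint_dir = "whisper_checkpoint"
with open(f"{checkpoint_dir}/encoder/config.json") as f:
    cfg = json.load(f)

# If convert_checkpoint.py was run with --use_weight_only, we would expect a
# weight-only quant_algo here, which would explain why every Linear layer,
# including the new lin layer, is expected to provide a per_channel_scale tensor.
print(cfg.get("quantization", {}).get("quant_algo"))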

Can you please advise how to solve this issue?

@hello-11
Collaborator

@AvivSham, did you convert the checkpoint first?

@yuekaizhang

@AvivSham Please use

python3 convert_checkpoint.py --output_dir $checkpoint_dir

rather than

python3 convert_checkpoint.py --use_weight_only --weight_only_precision $WEIGHT_ONLY_PRECISION --output_dir $checkpoint_dir

@AvivSham
Author

@yuekaizhang Thanks!
We were able to work around this issue by following the steps mentioned in #2535.
