Important ablations and findings missing in your work? #12

Closed

randomcodelookup opened this issue Apr 30, 2024 · 4 comments

@randomcodelookup commented Apr 30, 2024

For Figure 4, are you comparing 4-frame input for the n-frames method against 16-frame input for PLLaVA? That does not seem like a fair comparison: with 4x more input data, one would expect PLLaVA to be more stable and achieve better scores. If not, could the authors clarify? Also, if there is no pooling, this is just the n-frame approach, right? There don't seem to be any quantitative results using 16 frames, which looks like a big missing ablation (Fig. 7).

For Section 3.4, what is the difference between merging the weights into the original LLM when we already have the LoRA weights detached? Why not just tweak the LoRA weights directly before merging them in?

Not a big issue, but I also find it confusing to propose "Adaptive Structure Pooling" as if it were a new module when the paper explains nothing about the algorithm beyond the input/output shapes. According to the implementation, this is just adaptive 3D pooling? If that is the case, I think it would be clearer to state so in the paper.

Thanks for responding

@zhoudaquan (Collaborator)


Hi,

Thanks for your interest in our work, and thanks for the discussion.

For your first question, I do not fully understand what you mean by "4x data". If you are referring to computation cost, the two settings are the same, as the 4-frame method uses a larger spatial resolution. If you are referring to the amount of data the model sees during training, we observed that longer training does not improve the 4-frame model's performance; longer training is effectively the same as giving the 4-frame setting more data, and it proved to perform worse.

For the second question, we cannot run 16-frame experiments without pooling due to memory constraints: we quickly hit OOM, so this experiment is not feasible.

For the weight-fusion part: in the actual implementation of the current version, we adjust the LoRA alpha value. In the paper we use a more formal and clean presentation to express our core insight, and that formulation extends to different training settings under joint (image & video) training, which is also one of our next TODOs.
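For concreteness, here is a minimal sketch of what adjusting alpha at fusion time amounts to; the names and shapes are illustrative only, not our actual code:

```python
import torch

def merge_lora(base_weight, lora_A, lora_B, alpha, rank):
    # Standard LoRA merge: W' = W + (alpha / r) * (B @ A).
    # Tuning `alpha` at merge time rescales the low-rank update
    # without touching the trained A/B matrices themselves.
    return base_weight + (alpha / rank) * (lora_B @ lora_A)

# Illustrative dimensions only.
d, r = 4096, 16
W = torch.randn(d, d)       # original LLM weight
A = torch.randn(r, d)       # lora_A: (rank, in_features)
B = torch.randn(d, r)       # lora_B: (out_features, rank)

# A smaller alpha down-weights the video LoRA update, interpolating
# between the original LLM and the fully adapted model.
W_fused = merge_lora(W, A, B, alpha=8, rank=r)
```

Scaling alpha at merge time and rescaling the LoRA matrices directly are mathematically equivalent here; the alpha knob simply leaves the trained weights untouched.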

For the last question about the naming of the 3D pooling: we use "structure pooling" to distinguish it from the strategy where global pooling is applied (spatial dimensions reduced to 1). We will make this clearer in the revision.
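Roughly, the distinction looks like this (a sketch assuming PyTorch's adaptive average pooling; shapes are illustrative):

```python
import torch
import torch.nn as nn

# Video features laid out as (batch, channels, frames, height, width).
feats = torch.randn(1, 1024, 16, 24, 24)

# Structure pooling: the spatial grid is kept (> 1 per side), so each
# output token still corresponds to a spatial region of the frame.
structure_pool = nn.AdaptiveAvgPool3d((16, 12, 12))
print(structure_pool(feats).shape)  # torch.Size([1, 1024, 16, 12, 12])

# Global pooling: spatial dimensions collapsed to 1, leaving a single
# token per frame and discarding the spatial layout entirely.
global_pool = nn.AdaptiveAvgPool3d((16, 1, 1))
print(global_pool(feats).shape)     # torch.Size([1, 1024, 16, 1, 1])
```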

We welcome further discussion, and thank you for your suggestions.

Best regards,
DQ

@randomcodelookup (Author)

Hi authors, thanks for replying. If OOM is an issue, is there a comparison of 4-frame PLLaVA against the 4-frame n-frames method? I initially thought that was Figure 4, but it seems Fig. 4 is 16-frame PLLaVA vs. 4-frame n-frames, correct? The main thing I want to find out is how much improvement or degradation there is when the same amount of data is used. Even if the scores are lower, which would make sense since pooling loses some information, I think that is acceptable if training and inference are faster. It would also be useful to add an inference-time benchmark if the speed is better. Was 4 frames chosen in Fig. 4 because it matches the computational requirements of 16-frame PLLaVA?

The main goal here is only to understand the effect of the pooling on performance. When reading the paper, the conclusion I came to was that this pooling method fundamentally improves video understanding compared to n-frames, but that is not intuitive, because n-frames retains more information, and there is no further ablation or study of why the performance should be better. Only when re-reading the sections did I discover that the number of input frames may differ in the first place. If true, this seems like an important detail that should be mentioned.

I am also curious: what is the effect of pooling the vision tokens before the projection?
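To make the question concrete, here is a rough sketch of the two orderings I mean; the shapes and the MLP projector are purely illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a 16x24x24 grid of 1024-d vision features
# projected into a 4096-d LLM embedding space.
feats = torch.randn(1, 1024, 16, 24, 24)           # (B, C, T, H, W)
projector = nn.Sequential(                          # LLaVA-style MLP projector
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096)
)
pool = nn.AdaptiveAvgPool3d((16, 12, 12))

# (a) Pool first, then project: the projector runs on 4x fewer tokens.
x = pool(feats)                                     # (1, 1024, 16, 12, 12)
x = x.flatten(2).transpose(1, 2)                    # (1, 16*12*12, 1024)
tokens_a = projector(x)                             # (1, 2304, 4096)

# (b) Project first, then pool: the projector sees the full token grid.
y = projector(feats.flatten(2).transpose(1, 2))     # (1, 16*24*24, 4096)
y = y.transpose(1, 2).reshape(1, 4096, 16, 24, 24)
tokens_b = pool(y).flatten(2).transpose(1, 2)       # (1, 2304, 4096)
```

(With a purely linear projector, average pooling and projection commute, so only the cost would differ; with a nonlinear MLP projector the two orderings can give genuinely different tokens.)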

Thank you

@zhoudaquan (Collaborator)


Hi,

Thanks for your interest. However, our main goal is to show that using more frames mitigates the dominant-token phenomenon and thus improves the model. As shown in the paper, the 4-frame setting suffers performance degradation as training goes on; since we use random frame sampling during training, longer training is equivalent to the "more data" in your question. Thus, I do not think the 4-frame setting with pooling is closely related to our report.
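For clarity, a minimal sketch of what we mean by random frame sampling (the actual sampler in the codebase may differ, e.g. segment-based sampling):

```python
import random

def sample_frames(num_video_frames: int, n: int) -> list[int]:
    # Each training step draws a fresh set of n frame indices, so over
    # many steps a 4-frame model effectively sees most of each video.
    return sorted(random.sample(range(num_video_frames), n))

print(sample_frames(64, 4))  # e.g. [3, 17, 40, 58]
```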

We are glad you see that the improvement comes from increasing the number of frames. That is exactly what we claim in this report: the gain comes from more frames, not from the pooling operation itself.

Thank you again for your interest.

Best regards,
DQ

@randomcodelookup (Author)

Got it. Thanks for responding
