
[Benchmark] MVTamperBench #739

Open · wants to merge 49 commits into main
Conversation

@amitbcp (Contributor) commented Jan 21, 2025

Thanks for open-sourcing this benchmarking tool; it enables evaluation of a wide range of multimodal LLMs.

We release MVTamperBench - https://arxiv.org/abs/2412.19794v4 | https://amitbcp.github.io/MVTamperBench/

Details -

Multimodal Large Language Models (MLLMs), also known as Large Multimodal Models (LMMs), are a recent advancement over Vision-Language Models (VLMs) and have driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce **MVTamperBench**, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. It is built from 3.4K original videos, expanded to over 17K tampered clips spanning 19 video tasks.

MVTamperBench challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs for safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding.
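To make the tampering types concrete, here is a minimal, illustrative sketch of two of them (frame dropping and masking) applied to a clip stored as a NumPy array. The segment positions and mask region are assumptions for illustration; this is not the actual MVTamperBench implementation.

```python
import numpy as np

def drop_frames(clip: np.ndarray, start: int, length: int) -> np.ndarray:
    """Dropping: remove a contiguous segment of frames.

    clip has shape (num_frames, height, width, channels).
    """
    keep = np.r_[0:start, start + length:clip.shape[0]]
    return clip[keep]

def mask_frames(clip: np.ndarray, start: int, length: int) -> np.ndarray:
    """Masking: black out a rectangle over a segment of frames."""
    tampered = clip.copy()
    h, w = clip.shape[1], clip.shape[2]
    # Assumed mask region: the central quarter of each affected frame.
    tampered[start:start + length, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0
    return tampered

# Toy example: a 32-frame clip of 64x64 RGB noise.
clip = np.random.randint(0, 256, size=(32, 64, 64, 3), dtype=np.uint8)
print(drop_frames(clip, start=8, length=8).shape)  # (24, 64, 64, 3)
print(mask_frames(clip, start=8, length=8).shape)  # (32, 64, 64, 3)
```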

@amitbcp (Contributor, Author) commented Jan 23, 2025

Hey @FangXinyu-0913 @kennymckormick, could you please have a look at the PR? We would like to release the benchmark. We would also like to discuss how to share our results with you so that they can be hosted on your leaderboard, as we have run an extensive benchmark of 45 MLLMs using VLMEvalKit.

@FangXinyu-0913 self-assigned this Jan 23, 2025
@FangXinyu-0913 (Collaborator) commented:

Hi @amitbcp, I am trying to reproduce the results using InternVL2_5-8B, but there is a significant deviation from the results in the paper:
[screenshots: reproduced results vs. results reported in the paper]

These are the versions of the relevant libraries I am using. Could you tell me which versions you are using, so that I can reproduce the results after making further modifications?
[screenshot: library versions]

@amitbcp (Contributor, Author) commented Jan 29, 2025

Hey @FangXinyu-0913
Sorry for the late reply. Here are my environment versions:
[screenshot: environment versions]

I am also re-running to make sure the results are consistent.

@FangXinyu-0913 (Collaborator) commented:

Thank you @amitbcp. How many frames did you use in the configuration when you tested the InternVL2.5 series?
I am using the default settings in video_dataset_config (i.e., 8 frames). Since the differences in results are so large, I suspect different frame settings may be the cause.

@amitbcp (Contributor, Author) commented Jan 29, 2025

Hey @FangXinyu-0913

We ran our benchmark before the video inference was refactored in aa9f50e.

As I remember, the default was 8 frames before the refactor as well; please correct me if I'm wrong. Also, can you share the command that you ran for the benchmarking?
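For context on why the frame setting matters for reproducibility: a model only sees the frames sampled from each clip, so changing the frame count changes the effective input. Below is a minimal, generic sketch of uniform frame sampling; it is an illustration only, not VLMEvalKit's or MVTamperBench's actual sampling code, and the frame counts are assumptions.

```python
import numpy as np

def sample_frame_indices(total_frames: int, nframe: int = 8) -> np.ndarray:
    """Uniformly sample `nframe` frame indices from a clip of `total_frames` frames.

    With a small nframe, a short tampered segment can fall entirely between
    sampled frames, which is one way different frame settings can shift scores.
    """
    # Take the midpoint of each of nframe equal-length segments.
    segment_starts = np.linspace(0, total_frames, num=nframe, endpoint=False)
    return (segment_starts + total_frames // (2 * nframe)).astype(int)

# Example: a 160-frame clip sampled at 8 vs. 16 frames.
print(sample_frame_indices(160, nframe=8))
print(sample_frame_indices(160, nframe=16))
```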
