[Benchmark] MVTamperBench #739
base: main
Conversation
…cstrings for methods
Hey @FangXinyu-0913 @kennymckormick - can you please have a look at the PR? We would like to release the benchmark. We would also like to check how we can share our results with you so they can be hosted on your leaderboard, as we have run an extensive benchmark of 45 MLLMs using VLMEvalKit.
Hi @amitbcp, I am trying to reproduce the results using InternVL2_5-8B, but there is a significant deviation from the results in the paper. These are the versions of the relevant libraries I am using. Could you tell me which versions you used, so that I can reproduce the results after making further modifications?
Hey @FangXinyu-0913 I am also re-running the evaluation to make sure the results are consistent.
Thank you @amitbcp. How many frames were used in the configuration when you tested the InternVL2.5 series?
Hey @FangXinyu-0913 We ran our benchmark before the video inference was refactored in aa9f50e. As I remember, the default was 8 frames before the refactor as well; please correct me if I'm wrong. Also, can you share the command that you ran for the benchmarking?
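For context, the discussion above is about a VLMEvalKit video-benchmark run along the following lines; the dataset key and frame-count flag in this sketch are assumptions for illustration, not confirmed in this thread.

```bash
# Minimal sketch (assumed dataset key and flags): evaluate InternVL2_5-8B on
# MVTamperBench, sampling 8 frames per clip as discussed above.
python run.py --data MVTamperBench --model InternVL2_5-8B --nframe 8 --verbose
```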
Thanks for open-sourcing this benchmarking tool, which enables the development of evaluations for different multimodal LLMs.
We release MVTamperBench - https://arxiv.org/abs/2412.19794v4 | https://amitbcp.github.io/MVTamperBench/
Details -
Multimodal Large Language Models (MLLMs), also known as Large Multimodal Models (LMMs), are a recent advancement of Vision-Language Models (VLMs) and have driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce **MVTamperBench**, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. Built from 3.4K original videos, expanded to over 17K tampered clips spanning 19 video tasks, MVTamperBench challenges models to detect manipulations in spatial and temporal coherence.
We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs for safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding.
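As an illustration of the kind of manipulations listed above, the sketch below shows what frame-level masking and dropping could look like; it is a minimal, assumed example and not the benchmark's actual generation code.

```python
# Illustrative sketch only -- not the MVTamperBench generation code.
# Demonstrates two of the five tampering types (masking, dropping) on a clip
# represented as a list of HxWx3 uint8 frames.
import numpy as np

def mask_segment(frames, start, end):
    """Black out a temporal segment of the clip, disrupting spatial content."""
    tampered = [f.copy() for f in frames]
    for i in range(start, min(end, len(tampered))):
        tampered[i][:] = 0  # overwrite the whole frame with black pixels
    return tampered

def drop_segment(frames, start, end):
    """Remove a temporal segment entirely, breaking temporal coherence."""
    return frames[:start] + frames[end:]

if __name__ == "__main__":
    # Dummy 16-frame "video" of random noise, standing in for a real clip.
    clip = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
    masked = mask_segment(clip, 4, 8)   # still 16 frames, 4 of them blacked out
    dropped = drop_segment(clip, 4, 8)  # 12 frames remain
    print(len(masked), len(dropped))
```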