Applying the model to new video data #2

Open
plnguyen2908 opened this issue Jun 21, 2024 · 1 comment
Comments

@plnguyen2908

Hi,

I am really interested in your model. However, I do not know how to apply it to new video data. Could you help me with a few questions?

  1. I currently have an mp4 file. I have extracted it into frames and mel-spectrogram data. However, the timestamps for the video and audio do not match; do I need to align them?
  2. For the face crops, which model can I use, and do the crops need to have the same dimensions after cropping? From your paper, they need to be H x W.
  3. After obtaining the face crops and audio data, do I need to do any preprocessing? If so, are there any notes on how to preprocess the two?
  4. In the inference step, if I am not mistaken, I need to load the loconet model from loconet.py with its weights and run something similar to the evaluate_network function, right?

I hope you can answer my questions, or even better, create a new .ipynb file with an example of how to run the model on a new video.

@SJTUwxz
Owner

SJTUwxz commented Aug 13, 2024

Hi! Thank you for your interest in our work!

  1. Yes, the audio and the video need to be temporally aligned to be used as input to the model (see the preprocessing sketch after this list).
  2. I am using the face crop annotations released by the datasets (e.g., AVA-ActiveSpeaker). For an arbitrary video, you can use any face detection method, such as RetinaFace. The face crops need to be resized to the same dimensions during preprocessing so that they can be concatenated into face tracks (see the face-crop sketch after this list).
  3. I think you need to generate face tracks from the face crops. The audio only needs to be processed into mel-spectrogram features, which are used as input. To generate the face track of a single speaker, you can refer to TalkNet's pipeline for inference on arbitrary videos.
  4. During inference, you can load our released checkpoint trained on the AVA-ActiveSpeaker dataset. After preprocessing your video and obtaining the face tracks and audio mel-spectrogram, pass them to the model as input to get the prediction scores for speaking activity (see the inference sketch after this list).
  5. We do not currently have plans to release code for inference on arbitrary videos, but I will keep you posted when we release a demo version later.
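
As a rough illustration of points 1 and 3, a minimal preprocessing sketch (not this repo's actual code) could look like the following. It assumes ffmpeg is installed and uses librosa for the mel-spectrogram; the 25 fps / 16 kHz / 10 ms hop values are common choices in active speaker detection pipelines, not necessarily the exact settings used by LoCoNet.

```python
# Hypothetical preprocessing sketch: extract temporally aligned video frames
# and a log mel-spectrogram from one mp4 file. Paths and parameters are examples.
import os
import subprocess
import librosa

VIDEO = "input.mp4"   # your source video
FPS = 25              # video frame rate after resampling
SR = 16000            # audio sample rate
HOP = 160             # 10 ms hop -> 100 spectrogram frames per second

os.makedirs("frames", exist_ok=True)

# 1) Re-encode video frames at a fixed frame rate so timestamps are predictable.
subprocess.run(["ffmpeg", "-y", "-i", VIDEO, "-vf", f"fps={FPS}",
                "frames/%06d.jpg"], check=True)

# 2) Extract mono 16 kHz audio from the same file (same time origin as the frames).
subprocess.run(["ffmpeg", "-y", "-i", VIDEO, "-ac", "1", "-ar", str(SR),
                "audio.wav"], check=True)

# 3) Compute a log mel-spectrogram; with a 10 ms hop there are
#    SR / HOP / FPS = 4 spectrogram frames per video frame.
wav, _ = librosa.load("audio.wav", sr=SR)
mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=400,
                                     hop_length=HOP, n_mels=80)
log_mel = librosa.power_to_db(mel)          # shape (80, T_audio)
n_video_frames = int(len(wav) / SR * FPS)
print(log_mel.shape, n_video_frames)
```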
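
For point 2, here is a minimal face-crop sketch. It uses OpenCV's bundled Haar cascade purely as a stand-in for a stronger detector such as RetinaFace, and the 112x112 crop size and grayscale conversion follow TalkNet-style pipelines; check this repo's config and data loader for the actual input resolution and color format.

```python
# Hypothetical face-crop sketch: detect one face per frame and resize every
# crop to a fixed size so the crops can be stacked into a face track.
import glob
import cv2
import numpy as np

CROP_SIZE = 112   # illustrative; use the resolution the model expects
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

track = []
for path in sorted(glob.glob("frames/*.jpg")):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        continue  # a real pipeline would interpolate or split the track here
    x, y, w, h = faces[0]                            # keep the first detection
    crop = cv2.resize(img[y:y + h, x:x + w], (CROP_SIZE, CROP_SIZE))
    track.append(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY))

face_track = np.stack(track)                         # (T, H, W) for one speaker
print(face_track.shape)
```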
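
For point 4, the general shape of inference is: load the released checkpoint, put the model in eval mode, and feed the audio features and face track through the forward pass to get speaking scores. The sketch below uses a small placeholder module instead of the real class in loconet.py, and the checkpoint name, forward signature, and sigmoid post-processing are assumptions; see evaluate_network in the repo for the actual evaluation loop.

```python
# Hypothetical inference sketch. `PlaceholderASD` stands in for the real
# LoCoNet model; only the overall flow (load weights, eval, no_grad, scores)
# is meant to carry over.
import torch
import torch.nn as nn


class PlaceholderASD(nn.Module):
    """Stand-in model: takes audio + video features, returns a speaking logit."""

    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(80, 64)
        self.video_proj = nn.Linear(112 * 112, 64)
        self.head = nn.Linear(128, 1)

    def forward(self, audio_feat, video_feat):
        a = self.audio_proj(audio_feat).mean(dim=1)              # (B, 64)
        v = self.video_proj(video_feat.flatten(2)).mean(dim=1)   # (B, 64)
        return self.head(torch.cat([a, v], dim=-1))              # (B, 1) logit


model = PlaceholderASD()
# For the real model you would instead do roughly (names assumed, not verified):
#   model = loconet(cfg)                                  # from loconet.py
#   state = torch.load("released_checkpoint.pth", map_location="cpu")
#   model.load_state_dict(state, strict=False)
model.eval()

audio_feat = torch.randn(1, 400, 80)        # (B, T_audio, n_mels) log mel frames
video_feat = torch.randn(1, 100, 112, 112)  # (B, T_video, H, W) face track
with torch.no_grad():
    score = torch.sigmoid(model(audio_feat, video_feat))  # speaking probability
print(score)
```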

Thank you and please let me know if you have further questions!
