Hi,
I am really interested in your model. However, I do not know how to apply it to new, unseen video data. Could you help me with a few questions?
I currently have an mp4 file, which I have extracted into frames and mel-spectrogram data. However, the timestamps of the video and audio do not match; do I need to align them?
For the face crops, which model can I use, and do the crops need to have the same dimensions after cropping? From your paper, they need to be H x W.
After obtaining the face crops and audio data, do I need any further preprocessing? If so, are there any notes on how to preprocess these two inputs?
For the inference step, if I am not mistaken, I need to load the LoCoNet model from loconet.py with the weights and run something similar to the evaluate_network function, right?
I hope you can answer my questions, or even better, create a new .ipynb file with an example of how to run the model on a new video.
Yes, the audio and the video need to be aligned temporally to be used as input to the model.
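For illustration only, here is a rough alignment sketch (not from our released code), assuming a 25 fps video and a 16 kHz audio track with a 10 ms mel-spectrogram hop, so that 4 mel frames correspond to each video frame; both streams are truncated to the shorter one:

```python
# Hypothetical alignment sketch: 25 fps video, 16 kHz audio,
# 10 ms hop => 100 mel frames per second => 4 mel frames per video frame.
import librosa

def align_audio_video(wav_path, num_video_frames, video_fps=25, sr=16000):
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=512,
        hop_length=int(sr * 0.010), win_length=int(sr * 0.025), n_mels=80,
    ).T  # shape: (num_mel_frames, 80)

    mel_per_frame = int(100 / video_fps)  # 4 mel frames per video frame
    usable = min(num_video_frames, mel.shape[0] // mel_per_frame)
    # Keep exactly mel_per_frame audio frames for each usable video frame.
    return mel[: usable * mel_per_frame], usable
```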
I am using the face crop annotations released by the datasets (e.g. AVA-ActiveSpeaker). For a random video, you can choose any face detection method, such as RetinaFace. The face crops need to be resized to the same dimensions during preprocessing so that they can be concatenated into face tracks.
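As a rough sketch of the crop-and-resize step (the detector below is OpenCV's Haar cascade used as a stand-in for RetinaFace, and the 112x112 crop size is an assumption, not a value from the paper):

```python
# Illustrative crop-and-resize step; swap in RetinaFace or another detector as needed.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def crop_faces(frame_bgr, size=112):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in boxes:
        face = frame_bgr[y:y + h, x:x + w]
        # Every crop is resized to the same H x W so crops can be stacked into a track.
        crops.append(cv2.resize(face, (size, size)))
    return crops, boxes
```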
I think you need to generate face tracks from the face crops. The audio only needs to be processed into a mel-spectrogram feature, which is used directly as input. To generate the face track of a single speaker, you can refer to TalkNet's inference pipeline for random videos.
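A minimal IoU-based linking sketch, in the spirit of TalkNet's demo pipeline (the 0.5 threshold and the box format (x, y, w, h) are assumptions):

```python
# Minimal face-tracking sketch: link detections in consecutive frames into tracks.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter + 1e-6)

def build_tracks(per_frame_boxes, iou_thresh=0.5):
    tracks = []  # each track is a list of (frame_idx, box)
    for f_idx, boxes in enumerate(per_frame_boxes):
        for box in boxes:
            for track in tracks:
                last_f, last_box = track[-1]
                # Extend a track if the box overlaps its last detection in the previous frame.
                if last_f == f_idx - 1 and iou(box, last_box) > iou_thresh:
                    track.append((f_idx, box))
                    break
            else:
                tracks.append([(f_idx, box)])  # start a new track
    return tracks
```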
During inference, you can load our released checkpoint trained on the AVA-ActiveSpeaker dataset. After preprocessing your video and obtaining the face tracks and the audio mel-spectrogram, feed them to the model to get the prediction scores of speaking activity.
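For illustration only, a loading-and-forward sketch might look like the following; the import path, constructor arguments, checkpoint file name, and the (audio, video) call signature are all assumptions, so please check loconet.py and the evaluate_network function for the exact interface:

```python
# Hypothetical sketch only: names and shapes below are assumptions, not the released API.
import torch
from loconet import loconet  # assumed import path

model = loconet(cfg)  # cfg: the same config object used for training (assumed)
state = torch.load("path/to/released_ava_checkpoint.model", map_location="cpu")
model.load_state_dict(state, strict=False)
model.eval()

with torch.no_grad():
    # audio_feat: mel-spectrogram tensor for the track's audio segment
    # video_feat: stacked, resized face-crop tensor for the corresponding face track
    scores = model(audio_feat, video_feat)  # per-frame speaking-activity scores (assumed output)
```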
We don't have plans to release inference code for random videos at the moment, but I will keep you posted when we release a demo version later.
Thank you and please let me know if you have further questions!