Backchanneling occurs during a conversation when one participant is speaking and another participant responds to the speaker. These responses can be verbal cues ("uh-huh", "hmm"), visual cues (nodding, facial expressions), or both; these cues serve as the input modalities for backchannel detection. Backchannels play an important role in encouraging the current speaker to hold their turn and continue speaking, which enables smooth conversation. Predicting backchannels is therefore beneficial for building human-like conversational agents or robots.
Refer to the base report for a full description of our experiments: https://drive.google.com/file/d/1XLZRns5FUpb33731iPfz5u1UrFK_jzQ2/view?usp=sharing
The above Python notebook and report serve as the base for further experiments, in which we evaluate different fusion techniques in transformers for the same task.
We experiment with the following fusion techniques (a rough sketch of two of them follows the list):
- One-stream
- One-to-one stream
- One-to-two stream
- Two-to-one stream
- Cross-attention
- Cross-to-one stream
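The exact architectures are in the linked notebook and code repository. As a rough illustration only, the sketch below shows a minimal one-stream fusion (verbal and visual feature sequences concatenated and fed to a single transformer encoder) and a cross-attention fusion block. The use of PyTorch, the module names, and all dimensions are assumptions for illustration, not the published implementation.

```python
# Minimal sketch of two fusion variants (assumed PyTorch; names and dimensions are illustrative).
import torch
import torch.nn as nn


class OneStreamFusion(nn.Module):
    """Concatenate verbal and visual token sequences and encode them with one transformer."""

    def __init__(self, d_model=128, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, verbal, visual):
        # verbal: (B, T_a, d_model), visual: (B, T_v, d_model)
        x = torch.cat([verbal, visual], dim=1)       # single fused stream
        x = self.encoder(x)
        return self.classifier(x.mean(dim=1))        # pooled backchannel logits


class CrossAttentionFusion(nn.Module):
    """Each modality attends to the other before classification."""

    def __init__(self, d_model=128, nhead=4, num_classes=2):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, verbal, visual):
        a, _ = self.a2v(verbal, visual, visual)      # verbal queries attend to visual keys/values
        v, _ = self.v2a(visual, verbal, verbal)      # visual queries attend to verbal keys/values
        pooled = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


# Usage with dummy features (hypothetical shapes)
verbal = torch.randn(8, 50, 128)   # e.g. audio/text embeddings
visual = torch.randn(8, 50, 128)   # e.g. facial-feature embeddings
print(OneStreamFusion()(verbal, visual).shape)       # torch.Size([8, 2])
print(CrossAttentionFusion()(verbal, visual).shape)  # torch.Size([8, 2])
```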
After extensive evaluation, we find that the simple one-stream fusion is sufficient for backchannel detection.
Check out the leaderboard for this competition: Leaderboard
This project on backchannel detection, along with our related project on backchannel estimation, was submitted to and accepted at IJCNN'23: Official Paper
Refer to our official code, where we experiment with the different fusion techniques mentioned in the published paper: CodeRepo