
Listen Then See: Video Alignment with Speaker Attention

Video-based Question Answering (Video QA) is a challenging task that becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information; in addition, it requires processing nuanced human behavior. The complexities involved are exacerbated by the dominance of the primary modality (text) over the others, so the task's secondary modalities need help to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation-fusion approach that achieves state-of-the-art results (82.06% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge to the language modality. This leads to enhanced performance by reducing the prevalent issues of language overfitting and the resultant video-modality bypassing encountered by existing techniques. Our code and models are publicly available at [1].

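The abstract's "audio as a bridge" idea can be illustrated schematically. The sketch below is not the authors' architecture; it is a toy numpy illustration, assuming a hypothetical learned projection `W_align` that maps audio embeddings into the language space, followed by simple dot-product attention over video frames and concatenation-based fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings: one text query, one audio clip, five video frames.
d = 8
text = rng.normal(size=(1, d))    # language (primary) modality
audio = rng.normal(size=(1, d))   # audio modality, to be aligned with text
video = rng.normal(size=(5, d))   # video-frame features (secondary modality)

# Hypothetical alignment step: project audio into the language space.
W_align = rng.normal(size=(d, d)) * 0.1
audio_aligned = audio @ W_align

# Use the aligned audio as a query to attend over video frames,
# so video information enters via the audio "bridge".
scores = softmax(audio_aligned @ video.T / np.sqrt(d))  # (1, 5)
video_summary = scores @ video                          # (1, d)

# Fuse: concatenate the text embedding with the audio-guided video summary.
fused = np.concatenate([text, video_summary], axis=-1)  # (1, 2*d)
print(fused.shape)  # (1, 16)
```

In a real system the projection and attention weights would be learned jointly, but the data flow (audio aligned to language, then steering attention over video) mirrors the bridging role the abstract describes.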
Bibliographic Details
Main Authors: Agrawal, Aviral; Samudio Lezcano, Carlos Mateo; Balam Heredia-Marin, Iqui; Sethi, Prabhdeep Singh
Format: Conference Proceeding
Language: English
container_end_page 2027
container_start_page 2018
creator Agrawal, Aviral
Samudio Lezcano, Carlos Mateo
Balam Heredia-Marin, Iqui
Sethi, Prabhdeep Singh
description Video-based Question Answering (Video QA) is a challenging task that becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information; in addition, it requires processing nuanced human behavior. The complexities involved are exacerbated by the dominance of the primary modality (text) over the others, so the task's secondary modalities need help to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation-fusion approach that achieves state-of-the-art results (82.06% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge to the language modality. This leads to enhanced performance by reducing the prevalent issues of language overfitting and the resultant video-modality bypassing encountered by existing techniques. Our code and models are publicly available at [1].
doi_str_mv 10.1109/CVPRW63382.2024.00207
format conference_proceeding
publisher IEEE
publication_date 2024-06-17
eisbn 9798350365474
coden IEEPAD
fulltext fulltext_linktorsrc
identifier EISSN: 2160-7516
ispartof 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, p.2018-2027
issn 2160-7516
language eng
recordid cdi_ieee_primary_10678502
source IEEE Xplore All Conference Series
subjects Accuracy
Alignment
Audio modality
Bridges
Codes
Computer vision
Conferences
Fusion
LLM
Multimodal learning
Question answering (information retrieval)
Social Interaction
Video QA
Visualization
title Listen Then See: Video Alignment with Speaker Attention