COVA: Text-Guided Composed Retrieval for Audio-Visual Content

Gyuwon Han¹, Young Kyun Jang², Chanho Eom¹
¹Chung-Ang University, ²Google DeepMind
Figure: Comparison between (a) existing CoVR, which only accounts for visual modifications, and (b) our proposed COVA benchmark, which jointly considers both visual and auditory modifications for more realistic retrieval scenarios.

Abstract

Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (COVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark of video pairs with cross-modal changes and textual queries describing the differences, enabling retrieval based on audio as well. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for COVA.
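
To make the idea of modality-selective fusion concrete, here is a minimal sketch of one way a text query could weight video and audio features before gallery retrieval. It is not the AVT architecture, whose details are not given on this page. The function names (`fuse_query`, `rank_gallery`), the softmax gate over text-to-modality cosine similarities, and the `temperature` parameter are all assumptions made for illustration, and the embeddings are random stand-ins for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def fuse_query(video_emb, audio_emb, text_emb, temperature=0.1):
    """Hypothetical text-gated fusion (an assumption, not the authors' AVT).

    The modification text decides how much weight the reference video and
    reference audio embeddings receive in the composed query.
    """
    v, a, t = (l2_normalize(e) for e in (video_emb, audio_emb, text_emb))
    # Gate: softmax over the text's cosine similarity to each modality.
    gate = softmax(np.array([t @ v, t @ a]) / temperature)
    fused = gate[0] * v + gate[1] * a + t
    return l2_normalize(fused), gate

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = l2_normalize(gallery_embs) @ query_emb
    return np.argsort(-scores)

# Usage with random stand-in embeddings (real ones would come from encoders).
rng = np.random.default_rng(0)
video, audio, text = rng.normal(size=(3, 8))
query, gate = fuse_query(video, audio, text)
gallery = rng.normal(size=(5, 8))
ranking = rank_gallery(query, gallery)
```

A lower `temperature` pushes the gate toward picking a single modality, while a higher one blends video and audio more evenly.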

Datasets

EXAMPLES OF GENERATED AV-COMP TRIPLETS

Query: U6ElTfA5lSw_10
Modification Text:

Object: Replace the transparent plastic container with a black metal cage containing various colorful toys and perches.

Action: Change the parrot's action from perching on a container to climbing out of the cage onto the top bar and observing its surroundings.

Attribute: Change the dark background to an indoor setting near a window with blinds and natural light.

Audio: Replace the sound of a man speaking and a bird whistling with a bird chirping, a person burping, and the bird imitating the burp.
Target: NBR-XVmJNjQ_60
Hard Negative: eeNw3_nsEdM_310
Hard Negative: QM6X_mK_KAE_50

Query: 7JCV2B6mbDo_30
Modification Text:

Object: Replace the child in a pink shirt, teddy bear, and chair with a baby in a white outfit with red patterns.

Action: Change the action from the child walking with support from a chair to a baby walking to an adult, touching their leg, and being picked up.

Attribute: Update the setting to an indoor home environment with a wooden cabinet and refrigerator, and change the clothing to a light patterned outfit for the baby and a dark shirt with beige pants for the adult.

Audio: Change the audio from a baby's loud cries to a baby crying while a kid sings and laughs.
Target: VcZykKLnTnI_30

Query: cew6UO_TiWI_60
Modification Text:

Object: The one duck remains the same, but its color changes from white to brown and green.

Action: The action remains unchanged.

Attribute: A wooden fence is added to the background.

Audio: Replace the sound of splashing with a woman speaking, while a duck quacks.
Target: II7uLXgHSD8_30
Hard Negative: q0uIdT4wzRk_20

Query: qkQ7ooIUNd0_60
Modification Text:

Object: Remove the trees from the background, focusing only on the sheep and the grassy field.

Action: Change the sheep's movement from walking in a line to running across the field in a scattered and playful manner.

Attribute: Shift the mood from calm and serene to lively and energetic, and change the sheep's appearance from all white to a mix of white, brown, and speckled patterns.

Audio: Replace the sound of a bell ringing with birds chirping and a man and woman speaking.
Target: YRg_topnqRI_40
Hard Negative: E4ECgoC8ahg_20

Query: 2QsWqMg_j08_30
Modification Text:

Object: Change the man's attire from a striped shirt and headscarf to traditional attire, change the horse from white to brown, and add other people to the scene.

Action: Change the action from the man leading the horse calmly down a sidewalk to the horse bending down while the man adjusts its position as others observe.

Attribute: Transform the urban setting into a rural or semi-rural outdoor environment with a paved area.

Audio: Replace the sound of a child speaking with people shouting.
Target: 6O6rqrMirkU_30
Hard Negative: DukP2K1j2Kg_30

Query: lxVT6iqlJ2k_27
Modification Text:

Object: Replace the brown cushion and various items with a zebra-patterned blanket, a wooden floor.

Action: Change the cat's action from standing on its hind legs to walking away.

Attribute: Change the environment from a naturally lit, casual home setting to a cozy bedroom with warm, dim lighting.

Audio: Remove the sound of lips kissing, leaving only a woman speaking and a cat meowing.
Target: Kx0eryXWMgE_7
Hard Negative: 2JgHbC7yyTU_0
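
The examples above share a common shape: a query clip, a four-aspect modification text, a target clip, and zero or more hard negatives. As a rough sketch, one such triplet could be held in code as follows. The class name `AVCompTriplet`, its field names, and the choice to join the four aspects into one string are assumptions for illustration and do not describe the released AV-Comp file format. The values are copied from the duck example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Aspect order as shown in the examples on this page.
ASPECTS = ("Object", "Action", "Attribute", "Audio")

@dataclass
class AVCompTriplet:
    """Hypothetical container for one AV-Comp example (field names are assumptions)."""
    query_id: str
    target_id: str
    modification: Dict[str, str]
    hard_negative_ids: List[str] = field(default_factory=list)

    def modification_text(self) -> str:
        """Join the per-aspect descriptions into a single text query."""
        return " ".join(self.modification[k] for k in ASPECTS if k in self.modification)

duck = AVCompTriplet(
    query_id="cew6UO_TiWI_60",
    target_id="II7uLXgHSD8_30",
    modification={
        "Object": "The one duck remains the same, but its color changes from white to brown and green.",
        "Action": "The action remains unchanged.",
        "Attribute": "A wooden fence is added to the background.",
        "Audio": "Replace the sound of splashing with a woman speaking, while a duck quacks.",
    },
    hard_negative_ids=["q0uIdT4wzRk_20"],
)
print(duck.modification_text())
```

Keeping the aspects in a dictionary rather than one flat string makes it easy to query with a subset, for example the Audio description alone.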

STATISTICS

Dataset Composition Overview
[Figure: Unique Video and Pair Count]

Overall Aspect-Wise Statistics
[Figure: All Dataset Statistics]

Word Cloud of Object in Modification Text (Overall)
[Figure: Object word cloud]

Word Cloud of Action in Modification Text (Overall)
[Figure: Action word cloud]

Word Cloud of Attribute in Modification Text (Overall)
[Figure: Attribute word cloud]

Word Cloud of Audio in Modification Text (Overall)
[Figure: Audio word cloud]

Video-based Clustering (Video Distribution per Cluster)
[Figure: Video clustering distribution]

Audio-based Clustering (Audio Distribution per Cluster)
[Figure: Audio clustering distribution]