VC-Agent:
An Interactive Agent for Customized Video Dataset Collection

¹SSE, CUHKSZ    ²FNii-Shenzhen    ³Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHKSZ    ⁴Kuaishou Technology    ⁵HKUST    ⁶Shenzhen University

Corresponding Author
We propose VC-Agent, the first interactive MLLM-based agent that can effectively scale up the collection of customized video datasets from the Internet.

Abstract

Under scaling laws, video data from the Internet is becoming increasingly important. However, collecting large numbers of videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study how to expedite this collection process and propose VC-Agent, the first interactive agent that can understand users' queries and feedback and accordingly retrieve and scale up relevant video clips with minimal user input. Specifically, for the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. For the agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that can be updated as user interaction continues. Finally, we provide a new benchmark for personalized video dataset collection and carefully conduct a user study to verify our agent's usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection.


Method Overview

An overview of our framework, illustrating the entire workflow of VC-Agent. It mainly consists of the User Interface (front-end, shown in the left panel) and the Agent Functions (back-end, shown in the right panel). Initially, the user interacts with the agent to express their coarse demands 𝑄. Upon receiving 𝑄, our agent controls the Video Proposal module to download and retrieve candidate videos. Next, several candidates 𝑉𝑟 are sent back to the user, who provides the accepted/rejected video sets 𝑉+/𝑉− and the comment set 𝐶. Subsequently, 𝑉+, 𝑉−, and 𝐶 guide our agent to define or update the Filtering Policy. The updated policy is then used to filter candidate videos, after which a new batch of video samples 𝑉 is returned to the user for the next interaction iteration. This process repeats until the user is satisfied with all returned videos. Finally, our agent constructs the dataset in a fully automatic manner.
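The interaction loop described in this caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual VC-Agent implementation: `KeywordPolicy`, `collect`, `propose`, and `ask_user` are hypothetical names, videos are represented only by caption strings, and the toy keyword policy stands in for the MLLM-based filtering policies, whose details are not given here.

```python
from typing import Callable, List, Set, Tuple

# (V+, V-, C): accepted videos, rejected videos, free-text comments.
Feedback = Tuple[Set[str], Set[str], List[str]]


class KeywordPolicy:
    """Hypothetical toy filtering policy (NOT the paper's policies).

    It drops explicitly rejected videos and any video whose caption
    contains a word the user mentioned in a comment.
    """

    def __init__(self) -> None:
        self.rejected: Set[str] = set()
        self.banned_words: Set[str] = set()

    def update(self, v_pos: Set[str], v_neg: Set[str], comments: List[str]) -> None:
        # v_pos is unused in this toy; a real policy would also learn from it.
        self.rejected |= v_neg
        self.banned_words |= {c.lower() for c in comments if c}

    def accepts(self, video: str) -> bool:
        if video in self.rejected:
            return False
        return not any(w in video.lower() for w in self.banned_words)


def collect(
    query: str,
    propose: Callable[[str], List[str]],
    ask_user: Callable[[List[str]], Feedback],
    policy: KeywordPolicy,
    batch_size: int = 3,   # assumed batch size, not from the source
    max_rounds: int = 10,  # assumed safety bound, not from the source
) -> List[str]:
    """Iterate: filter, show a batch, get (V+, V-, C), update the policy."""
    candidates = propose(query)  # stand-in for the Video Proposal module
    for _ in range(max_rounds):
        batch = [v for v in candidates if policy.accepts(v)][:batch_size]
        v_pos, v_neg, comments = ask_user(batch)
        if not v_neg:  # user is satisfied with every returned video
            break
        policy.update(v_pos, v_neg, comments)
    # Final stage: build the dataset automatically with the learned policy.
    return [v for v in candidates if policy.accepts(v)]


# Demo with a simulated user who dislikes cartoons.
def demo_propose(query: str) -> List[str]:
    return ["cat playing piano", "dog running", "cat cartoon animation",
            "cat sleeping", "cartoon dog"]


def demo_user(batch: List[str]) -> Feedback:
    v_neg = {v for v in batch if "cartoon" in v}
    v_pos = set(batch) - v_neg
    comments = ["cartoon"] if v_neg else []
    return v_pos, v_neg, comments


dataset = collect("real animal videos", demo_propose, demo_user, KeywordPolicy())
print(dataset)
```

In this toy run the simulated user rejects one cartoon clip in the first round, the policy is updated from the rejection and the comment, and the second round contains only acceptable clips, so the loop stops and the remaining candidates are filtered without further input.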

The overview of our user-assisted double-check strategy. Samples with low confidence are retained in a dedicated buffer. Once 100 such samples have been accumulated, a subset is randomly selected for user review during the interaction phase, prompting the user to verify them again.
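One way to picture the buffer behind this double-check strategy is the sketch below. Only the capacity of 100 comes from the caption; the class name `DoubleCheckBuffer`, the confidence threshold, the review subset size, and the choice to clear the buffer after sampling are assumptions made for illustration.

```python
import random
from typing import List, Optional


class DoubleCheckBuffer:
    """Hypothetical buffer for low-confidence samples awaiting user review."""

    def __init__(
        self,
        capacity: int = 100,      # from the caption: review is triggered at 100 samples
        review_size: int = 10,    # assumption: size of the random subset shown to the user
        threshold: float = 0.5,   # assumption: below this, a sample counts as low-confidence
        seed: Optional[int] = None,
    ) -> None:
        self.capacity = capacity
        self.review_size = review_size
        self.threshold = threshold
        self.items: List[str] = []
        self.rng = random.Random(seed)

    def add(self, video: str, confidence: float) -> List[str]:
        """Buffer a low-confidence sample.

        Returns a random subset for user review once the buffer is full,
        otherwise an empty list.
        """
        if confidence >= self.threshold:
            return []  # confident decisions bypass the buffer
        self.items.append(video)
        if len(self.items) < self.capacity:
            return []
        review = self.rng.sample(self.items, self.review_size)
        self.items.clear()  # assumption: the buffer restarts after each review
        return review
```

The subset returned by `add` would be shown to the user during the next interaction round, and the user's verdicts could then be fed back as additional accepted/rejected samples.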

User Interface Demo

User Interface Demo of VC-Agent.

Qualitative Comparison

Qualitative comparison results on two different tasks among: (1) the baseline model; (2) the baseline model finetuned with our collected dataset; (3) the baseline model finetuned with the data manually collected and filtered by LLaVA-OneVision.

BibTeX

@inproceedings{zhang2025vcagent,
  title={VC-Agent: An Interactive Agent for Customized Video Dataset Collection},
  author={Yidan Zhang and Mutian Xu and Yiming Hao and Kun Zhou and Jiahao Chang and Xiaoqiang Liu and Pengfei Wan and Xiaoguang Han},
  year={2025},
}