Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
Yunfei Li3, Juncheng Li1, Siliang Tang1, Yueting Zhuang1
1Zhejiang University, 2Nanyang Technological University, 3Alibaba Group
*Equal Contribution.
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles for effective data selection: informativeness, uniqueness, and representativeness. We argue that a valuable sample should be informative for the task, non-redundant, and representative of the sample distribution (i.e., not an outlier). We further propose practical ways to score samples against each principle, which adapt automatically to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development.
- Informativeness: a valuable sample should be informative for the task, especially a hard one. For example, if the task is reasoning, describing the movement differences between skiing and ice skating is more informative and complex than simply describing someone skiing. In Fig. 1, where each axis (heuristically) represents an orthogonal dimension of task information, points along the diagonal carry more information about the task.
- Uniqueness: a valuable sample should be distinct from others, offering unique insights rather than prevalent commonsense knowledge (cf. Fig. 1, where points near the blue dashed region in the intra-cluster space exhibit high uniqueness).
- Representativeness: a valuable sample should be typical of the data distribution, which prevents selecting noisy outliers or mislabeled samples (cf. Fig. 1, where the clusters connected by blue lines in the inter-cluster space exhibit high representativeness of the overall dataset).
- We identify three key principles (i.e., informativeness, uniqueness, and representativeness) from a systematic perspective to master multi-modal data selection.
- We propose a unified framework, DataTailor, that adaptively integrates these principles into a collaborative value evaluation to optimize multi-modal data selection.
- Extensive results show DataTailor's effectiveness in optimizing all three principles during selection and achieving new SOTA performance on various benchmarks.
- Create a new conda environment
conda create --name datatailor python=3.10
conda activate datatailor
pip install -r requirements.txt
- Prepare candidate datasets (1. MiniGPT4-Instruction, 2. LLaVA-665K, 3. Mplug-OWL-264K, 4. Bunny-695K)
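These candidate datasets are distributed in (or are commonly converted to) a LLaVA-style visual-instruction format. The record below is only an illustrative sketch of that convention; field names and paths are assumptions, so check each dataset's release for its exact schema.

```python
# Illustrative LLaVA-style instruction sample (common convention; the exact
# fields may differ across the four candidate datasets).
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the skier in the picture doing?"},
        {"from": "gpt", "value": "The skier is carving down a snow-covered slope."},
    ],
}
```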
- DataTailor framework
- informativeness
bash datatailor.sh
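`datatailor.sh` runs the informativeness stage, scoring each sample individually. The sketch below is a hypothetical proxy rather than the repo's exact formulation: it measures how many orthogonal directions a sample's token features occupy (cf. the diagonal intuition in Fig. 1); the function name and feature source are assumptions.

```python
import numpy as np

def informativeness_score(token_features: np.ndarray) -> float:
    """Toy proxy: effective rank (exponentiated entropy of the normalized
    singular-value spectrum) of a sample's token-feature matrix
    (num_tokens x hidden_dim). Samples whose features spread energy over
    more orthogonal directions score higher."""
    s = np.linalg.svd(token_features, compute_uv=False)
    p = s / (s.sum() + 1e-12)                 # normalize the spectrum
    entropy = -(p * np.log(p + 1e-12)).sum()
    return float(np.exp(entropy))

# Random features standing in for an MLLM's fused vision-text token embeddings.
rng = np.random.default_rng(0)
print(informativeness_score(rng.normal(size=(64, 256))))
```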
- cross-modal domain clustering
python make_clustering.py
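`make_clustering.py` groups samples into cross-modal domains that the later relation-based scoring works within and across. The sketch below only illustrates the general idea with k-means over fused image-text embeddings; the actual embedding source, clustering algorithm, and number of clusters used by the script are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for fused image-text embeddings (n_samples x dim); in practice these
# would come from the MLLM's vision/text encoders.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))

n_clusters = 16  # illustrative; the repo may choose this differently or adaptively
kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)
print(np.bincount(cluster_ids))  # cluster sizes
```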
- uniqueness and representativeness
python relationship_value_calculation.py
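`relationship_value_calculation.py` computes the relation-based values within and across the clusters from the previous step. Below is a simplified sketch, not the repo's exact formulas: uniqueness as one minus the mean cosine similarity to a sample's intra-cluster neighbors, and representativeness as how central a sample's cluster sits among all cluster centroids.

```python
import numpy as np

def uniqueness_scores(emb: np.ndarray, cluster_ids: np.ndarray, k: int = 10) -> np.ndarray:
    """Toy uniqueness: one minus the mean cosine similarity to the k most
    similar samples inside the same cluster (higher = less redundant)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = np.zeros(len(emb))
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        sims = emb[idx] @ emb[idx].T
        np.fill_diagonal(sims, -np.inf)          # ignore self-similarity
        kk = min(k, len(idx) - 1)
        if kk <= 0:                              # singleton cluster
            scores[idx] = 1.0
            continue
        topk = np.sort(sims, axis=1)[:, -kk:]    # k most similar neighbors
        scores[idx] = 1.0 - topk.mean(axis=1)
    return scores

def representativeness_scores(emb: np.ndarray, cluster_ids: np.ndarray) -> np.ndarray:
    """Toy representativeness: how typical a sample's cluster is among all
    cluster centroids, so isolated or outlier clusters score low."""
    labels = np.unique(cluster_ids)
    centroids = np.stack([emb[cluster_ids == c].mean(axis=0) for c in labels])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_score = (centroids @ centroids.T).mean(axis=1)
    lookup = dict(zip(labels, cluster_score))
    return np.array([lookup[c] for c in cluster_ids])

# Tiny usage with random stand-ins for embeddings and cluster assignments.
rng = np.random.default_rng(0)
emb, cids = rng.normal(size=(200, 64)), rng.integers(0, 8, size=200)
print(uniqueness_scores(emb, cids)[:5], representativeness_scores(emb, cids)[:5])
```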
- collaborative multi-modal data selection
python datatailor_selector.py
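`datatailor_selector.py` fuses the per-sample values from the three principles and keeps the selected subset (15% of the pool in the paper's main setting). The placeholder sketch below uses equal-weight fusion and random stand-in scores purely for illustration; DataTailor integrates the principles adaptively, so this is not the actual combination rule.

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so the three principles are comparable."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Stand-ins for the per-sample scores produced by the previous stages.
rng = np.random.default_rng(0)
info, uniq, rep = (rng.random(1000) for _ in range(3))

# Equal-weight fusion after normalization (placeholder combination rule).
value = minmax(info) + minmax(uniq) + minmax(rep)

budget = int(0.15 * len(value))                 # keep 15% of the pool
selected = np.sort(np.argsort(value)[::-1][:budget])
print(f"selected {len(selected)} of {len(value)} samples")
```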
- Selected data fine-tuning
bash finetune_lora_stage2_7b.sh