PyTorch implementation for the TIP 2023 paper “Plug-and-Play Regulators for Image-Text Matching”.
It is built on top of SGRAF, GPO, and Awesome_Matching.
For any problems, please contact me at [email protected] ([email protected] is deprecated).
The framework of RCAR:
The reported results (importing GloVe embeddings or BERT can yield better results):

| Dataset | Module | Sentence retrieval ||| Image retrieval |||
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Flickr30K | T2I | 79.7 | 95.0 | 97.4 | 60.9 | 84.4 | 90.1 |
| | I2T | 76.9 | 95.5 | 98.0 | 58.8 | 83.9 | 89.3 |
| | ALL | 82.3 | 96.0 | 98.4 | 62.6 | 85.8 | 91.1 |
| MSCOCO1K | T2I | 79.1 | 96.5 | 98.8 | 63.9 | 90.7 | 95.9 |
| | I2T | 79.3 | 96.5 | 98.8 | 63.8 | 90.4 | 95.8 |
| | ALL | 80.9 | 96.9 | 98.9 | 65.7 | 91.4 | 96.4 |
| MSCOCO5K | T2I | 59.1 | 84.8 | 91.8 | 42.8 | 71.5 | 81.9 |
| | I2T | 58.4 | 84.6 | 91.9 | 41.7 | 71.4 | 81.7 |
| | ALL | 61.3 | 86.1 | 92.6 | 44.3 | 73.2 | 83.2 |
Use `pip install -r requirements.txt` to install the following dependencies:

- Python 3.7.11
- PyTorch 1.7.1
- NumPy 1.21.5
- Punkt Sentence Tokenizer:

```python
import nltk
nltk.download()
> d punkt
```

We follow SCAN to obtain image features and vocabularies, which can be downloaded by using:
https://www.kaggle.com/datasets/kuanghueilee/scan-features

Another download link is available below:

https://drive.google.com/drive/u/0/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC

```
data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   └── id_mapping.json  # mapping from f30k index to image's file name
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)
```
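Once downloaded, the `precomp` folders can be sanity-checked from Python. The sketch below fabricates a tiny stand-in instead of the real download; the `*_ims.npy` filename, the `[n_images, 36, 2048]` BUTD region-feature shape, and the 5-captions-per-image ratio follow the SCAN convention and are assumptions here, not guarantees about this repo's loader:

```python
import os
import tempfile
import numpy as np

# Fabricate a tiny stand-in for data/f30k/precomp/ (the real features come
# from the download links above). Shapes are assumptions based on SCAN's
# BUTD features: 36 regions x 2048-d per image.
root = tempfile.mkdtemp()
dummy = np.random.rand(2, 36, 2048).astype(np.float32)  # 2 fake images
np.save(os.path.join(root, "train_ims.npy"), dummy)
with open(os.path.join(root, "train_caps.txt"), "w") as f:
    f.write("a dog runs across the grass\n" * 10)  # SCAN-style: 5 captions per image

# A quick consistency check, mirroring what a data loader would verify.
feats = np.load(os.path.join(root, "train_ims.npy"))
caps = open(os.path.join(root, "train_caps.txt")).read().splitlines()
assert feats.ndim == 3 and feats.shape[1:] == (36, 2048)
assert len(caps) == 5 * feats.shape[0]
```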
Modify `model_path`, `split`, and `fold5` in the `eval.py` file.
Note that `fold5=True` is only for evaluation on MSCOCO1K (averaged over 5 folds of 1,000 test images), while `fold5=False` is for MSCOCO5K and Flickr30K. Pretrained models and log files can be downloaded from Flickr30K_RCAR and MSCOCO_RCAR.
Then run `python eval.py` in the terminal.
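For reference, a minimal sketch of what the 5-fold averaging behind `fold5=True` means. This is an illustration only, simplified to a square similarity matrix with ground-truth matches on the diagonal; the actual `eval.py` works on image-caption similarities with 5 captions per image:

```python
import numpy as np

def recall_at_k(sims, k):
    """Recall@k for a (n, n) similarity matrix whose ground-truth
    matches lie on the diagonal (a simplifying assumption)."""
    order = np.argsort(-sims, axis=1)  # descending ranking per query
    ranks = (order == np.arange(len(sims))[:, None]).argmax(axis=1)
    return 100.0 * np.mean(ranks < k)

def fold5_average(sims_full, k=1, fold_size=1000):
    """Average Recall@k over 5 disjoint 1K folds of a 5K test set,
    i.e. the "5 folders average" used for MSCOCO1K."""
    scores = []
    for i in range(5):
        s, e = i * fold_size, (i + 1) * fold_size
        scores.append(recall_at_k(sims_full[s:e, s:e], k))
    return float(np.mean(scores))
```

With a perfect similarity matrix (`np.eye(5000)`), `fold5_average` returns 100.0 for any k, which is a quick way to check the bookkeeping.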
Uncomment the required parts (BASELINE, RAR, RCR, RCAR) in the `train_xxxx_xxx.sh` file.
Then run `./train_xxxx_xxx.sh` in the terminal.
If RCAR is useful for your research, please cite the following paper:

```
@article{Diao2023RCAR,
   author={Diao, Haiwen and Zhang, Ying and Liu, Wei and Ruan, Xiang and Lu, Huchuan},
   journal={IEEE Transactions on Image Processing},
   title={Plug-and-Play Regulators for Image-Text Matching},
   year={2023},
   volume={32},
   pages={2322-2334}
}
```
