Dongchen Si1,4,5*, Di Wang2*, Erzhong Gao4,5, Xiaolei Qin3, Liu Zhao4,5, Jing Zhang2, Minqiang Xu4,5 β, Jianbo Zhan4,5 β, Jianshe Wang4,5, Lin Liu4,5, Bo Du2, Liangpei Zhang3
1 Xinjiang University, China,
2 School of Computer Science, Wuhan University, China,
3 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China,
4 iFlytek Co., Ltd, China,
5 National Engineering Research Center of Speech and Language Information Processing, China
2025.08.08
- We uploaded our work to arXiv.
Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness.
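As a rough illustration of how a classical spectral index can be turned into a textual attribute, the sketch below computes two widely used indices, NDVI and NDWI, for a single pixel and formats them as a short description. It is not the SPIE construction pipeline: the choice of indices, the function names, the 0.3 threshold, and the output wording are all assumptions made for this example.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green: float, nir: float) -> float:
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return normalized_difference(green, nir)


def describe_pixel(green: float, red: float, nir: float,
                   threshold: float = 0.3) -> str:
    """Format index values as text; the threshold is illustrative only."""
    v = ndvi(nir, red)
    w = ndwi(green, nir)
    if w > threshold:
        label = "water-like"
    elif v > threshold:
        label = "vegetation-like"
    else:
        label = "other"
    return f"NDVI={v:.2f}, NDWI={w:.2f}: {label}"


# Hypothetical surface reflectance values for two pixels.
print(describe_pixel(green=0.08, red=0.05, nir=0.45))
print(describe_pixel(green=0.10, red=0.06, nir=0.02))
```

High near-infrared reflectance relative to red pushes NDVI toward 1 (dense vegetation), while high green reflectance relative to near-infrared pushes NDWI toward 1 (open water), which is why simple thresholds on these indices can serve as coarse spectral priors.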
The SPIE dataset will be released soon.
The code will be released soon.
If you find SPEX helpful, please consider giving this repo a ⭐ and citing:
@article{SPEX,
title={SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images},
author={Dongchen Si and Di Wang and Erzhong Gao and Xiaolei Qin and Liu Zhao and Jing Zhang and Minqiang Xu and Jianbo Zhan and Jianshe Wang and Lin Liu and Bo Du and Liangpei Zhang},
journal={arXiv preprint arXiv:2508.05202},
year={2025}
}