LIU Xuanguang, LI Yujie, ZHANG Zhenchao, DAI Chenguang, ZHANG Hao, MIAO Yuzhe, ZHU Han, LU Jinhao
[Objectives] Existing semantic change detection methods do not fully exploit local and global features in very high-resolution images and often overlook the spatial-temporal dependencies between bi-temporal remote sensing images, which leads to inaccurate land cover classification. In addition, the detected change regions suffer from ambiguous boundaries, so the detected boundaries agree poorly with the actual ones. [Methods] To address these issues, and inspired by the long-sequence modeling capability of the Vision State Space Model (VSSM), we propose CVS-Net, a semantic change detection network that combines Convolutional Neural Networks (CNNs) with VSSM. CVS-Net draws on the local feature extraction capability of CNNs and the long-distance dependency modeling ability of VSSM. We further embed a VSSM-based bi-directional spatial-temporal feature modeling module into CVS-Net to guide the network in capturing spatial-temporal change relations. Finally, we introduce a boundary-aware reinforcement branch to improve boundary localization. [Results] We validate the proposed method on the SECOND and Fuzhou GF2 (FZ-SCD) datasets and compare it with five state-of-the-art methods: HRSCD.str4, Bi-SRNet, ChangeMamba, ScanNet, and TED. In the comparative experiments, our method outperforms these approaches, achieving a SeK of 23.95% and an mIoU of 72.89% on the SECOND dataset, and a SeK of 23.02% and an mIoU of 72.60% on the FZ-SCD dataset. In the ablation experiments, as the proposed modules were added one by one, the SeK rose to 21.26%, 23.04%, and 23.95%, respectively, demonstrating the effectiveness of each module. 
Notably, compared with CNN-based, Transformer-based, and Mamba-based feature extractors, the proposed CNN-VSS feature extractor achieved the highest SeK, mIoU, and Fscd, indicating robust feature extraction and an effective balance between local and global feature representation. In addition, ST-SS2D improved the SeK by 1.19% on average compared with other spatial-temporal modeling methods, effectively capturing the spatial-temporal dependencies of bi-temporal features and enhancing the model's ability to infer potential feature changes. Furthermore, the proposed edge-enhancement branch improved the consistency between detected and actual boundaries, achieving a consistency degree of 92.97%. [Conclusions] The proposed method significantly improves both the attribute accuracy and the geometric accuracy of semantic change detection, providing technical references and data support for sustainable urban development and land resource management.
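
The abstract reports SeK and mIoU without stating how they are computed. The following is a minimal sketch of the two metrics as they are commonly defined for the SECOND benchmark, which we assume is the definition used here: mIoU averages the IoU of the binary no-change and change classes, and SeK is Cohen's kappa computed with the true no-change count zeroed out, scaled by exp(IoU_change - 1). The function names and the list-of-lists confusion matrix layout are illustrative choices, not taken from the paper.

```python
import math


def kappa(hist):
    """Cohen's kappa from a square confusion matrix given as a list of lists."""
    n = len(hist)
    total = sum(sum(row) for row in hist)
    if total == 0:
        return 0.0
    # Observed agreement: fraction of pixels on the diagonal.
    po = sum(hist[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total).
    pe = sum(
        sum(hist[i]) * sum(hist[r][i] for r in range(n)) for i in range(n)
    ) / (total * total)
    if pe == 1:
        return 0.0
    return (po - pe) / (1 - pe)


def miou_and_sek(hist):
    """Return (mIoU, SeK) under the assumed SECOND-style definitions.

    hist[i][j] counts pixels whose reference class is i and predicted
    class is j. Class 0 is assumed to be "no change"; all other classes
    are semantic change categories.
    """
    n = len(hist)
    total = sum(sum(row) for row in hist)
    tn = hist[0][0]                               # unchanged, predicted unchanged
    fp = sum(hist[0]) - tn                        # unchanged, predicted changed
    fn = sum(hist[r][0] for r in range(n)) - tn   # changed, predicted unchanged
    tp = total - tn - fp - fn                     # changed in both, any class

    iou_nc = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    iou_c = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    miou = (iou_nc + iou_c) / 2

    # Zero the true no-change count so the dominant unchanged pixels
    # do not inflate the kappa term.
    hist_n0 = [row[:] for row in hist]
    hist_n0[0][0] = 0
    sek = kappa(hist_n0) * math.exp(iou_c - 1)
    return miou, sek


if __name__ == "__main__":
    # Change regions located perfectly, but the two change classes swapped.
    swapped = [[50, 0, 0], [0, 0, 30], [0, 20, 0]]
    print(miou_and_sek(swapped))
```

In the swapped example the binary change mask is perfect, so mIoU stays at its maximum while SeK turns negative. Under these assumed definitions, this is why SeK is the more demanding of the two scores: it penalizes wrong semantic labels inside correctly located change regions.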