[Objective] Mainstream semantic segmentation methods, designed primarily for small natural images, face significant challenges when applied to large-scale remote sensing imagery (e.g., 5000×5000 pixels): spatial feature loss from fragmented processing, block stitching artifacts from patch-based strategies, and prohibitive computational resource demands. To overcome these limitations, this study proposes the large-scale segment anything model (LS-SAM), an enhanced fine-tuning framework based on the segment anything model (SAM) and optimized for accurate, efficient building extraction from ultra-high-resolution remote sensing images. The primary objectives are to: (1) enable end-to-end processing of full-scale images while preserving spatial and contextual integrity; (2) balance computational efficiency with high segmentation accuracy for practical deployment; and (3) address the limitations of existing methods in handling large-scale geospatial data.
[Method] The proposed LS-SAM framework addresses these challenges through four innovations. (1) A dynamic positional encoding generator (PEG) replaces SAM's fixed positional encoding: a depthwise convolution (kernel size = 3) over the patch-embedding grid projects spatial coordinates into learnable embeddings, enabling inputs of arbitrary size (e.g., 5000×5000 pixels) while preserving positional relationships. (2) A hybrid encoder integrates a CNN backbone with the Transformer: the CNN extracts hierarchical local features (edges, textures), which are fused with SAM's global attention outputs via skip connections. (3) An SMS-AdaptFormer employs parallel convolutional branches with varying kernel sizes (1×1, 3×3, 5×5) and dilation rates (r = 8, 14, 20): the small kernels refine local details while the dilated convolutions expand receptive fields, and the branch features are aggregated by weighted summation for precise segmentation of diverse buildings. (4) A dynamic training strategy is adopted: during training, the model takes full-resolution images and applies random crops (e.g., 512×512 pixels) while PEG generates adaptive positional encodings; at inference, PEG handles any input size, and the combined CNN-Transformer encoder processes large images (e.g., 5000×5000 pixels) end-to-end, with no chunking or stitching required.
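To make innovations (1) and (3) concrete, the sketch below shows one plausible PyTorch realization; the module names, embedding dimension, and softmax-weighted fusion are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the PEG and SMS-AdaptFormer ideas described above.
# Assumptions: class/parameter names are hypothetical; dim=256 is an
# example embedding size, not a value stated in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEG(nn.Module):
    """Dynamic positional encoding generator (conditional positional
    encoding style, cf. Chu et al. 2021): a 3x3 depthwise convolution
    over the patch-token grid yields position-aware embeddings for any
    input resolution, replacing SAM's fixed positional encoding."""
    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the convolution depthwise; padding=1 keeps H, W.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) patch-embedding grid of arbitrary spatial size.
        return x + self.proj(x)  # residual keeps the content features

class SMSAdaptFormer(nn.Module):
    """Sketch of the SMS-AdaptFormer idea: small-kernel branches
    (1x1, 3x3, 5x5) for local detail plus dilated 3x3 branches
    (r = 8, 14, 20) for enlarged receptive fields, fused by a
    learnable weighted summation."""
    def __init__(self, dim: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.Conv2d(dim, dim, 5, padding=2),
            nn.Conv2d(dim, dim, 3, padding=8, dilation=8),
            nn.Conv2d(dim, dim, 3, padding=14, dilation=14),
            nn.Conv2d(dim, dim, 3, padding=20, dilation=20),
        ])
        # One learnable fusion weight per branch, softmax-normalized.
        self.weights = nn.Parameter(torch.zeros(len(self.branches)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.weights, dim=0)
        return sum(w[i] * b(x) for i, b in enumerate(self.branches))

if __name__ == "__main__":
    feats = torch.randn(1, 256, 64, 64)  # e.g., 16x16 patches of a 1024x1024 image
    feats = PEG(256)(feats)              # adaptive positional encoding
    feats = SMSAdaptFormer(256)(feats)   # multi-scale refinement
    print(feats.shape)                   # torch.Size([1, 256, 64, 64])
```

Because the depthwise convolution is translation-equivariant and preserves the spatial grid at any resolution, the same PEG weights apply unchanged to a full 5000×5000 scene at inference, avoiding the re-interpolation that fixed positional embeddings would require.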
[Result] Experiments on four public datasets (IAILD, MBD, WBDS, and WAID) demonstrate LS-SAM's superiority. It achieves 86.7% mIoU on IAILD, outperforming DeepLabV3+ (81.25%) and SAM (76.98%); on WBDS and WAID it attains 96.11% and 94.14% mIoU, respectively, demonstrating robust generalization. LS-SAM reduces GPU memory usage during training to 12 GB (vs. 24 GB for vanilla SAM) and reaches 10.1 FPS inference on 5000×5000-pixel images (NVIDIA RTX 3090 Ti). Visual results on the Inria and WBDS datasets show that LS-SAM effectively mitigates boundary ambiguities and block stitching errors, particularly in dense urban areas and complex terrain. Additionally, ablation experiments reveal that removing PEG reduces mIoU by 2.06%, while disabling SMS-AdaptFormer reduces accuracy by 1.02%, confirming the contribution of each component.
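As a reminder of the metric behind these numbers, mIoU is taken here to be the standard class-averaged intersection over union (the paper's exact evaluation protocol is assumed, not restated):

```latex
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}
```

where TP_c, FP_c, and FN_c count true-positive, false-positive, and false-negative pixels for class c; for binary building extraction, presumably C = 2 (building and background).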
[Conclusion] LS-SAM provides an effective solution for large-scale geospatial analysis by harmonizing global context modeling with local detail preservation. The framework substantially mitigates block stitching errors and computational bottlenecks, achieving state-of-the-art performance in building extraction tasks. This work lays a foundation for advancing large-scale remote sensing interpretation, with potential applications in urban planning, disaster response, and environmental monitoring. Future work will focus on scaling the architecture to ultra-large imagery (e.g., 10000×10000 pixels) and enhancing cross-modal adaptability for multi-sensor data fusion.
Ding Y, Li P L, Zhang M, Zhang Z L, Li H F, Hu Y, Ma Z Z, Ao Y.2021. Key technology and application of remote sensing intelligent monitoring for the typical elements change of land and resources. Geomatics World, 28(6): 65-71 (in Chinese)
Li X H, Bai X C, Li Z J, Zuo Z Y.2022. High-resolution image building extraction based on multi-level feature fusion network. Geomatics and Information Science of Wuhan University, 47(8): 1236-1244 (in Chinese)
Liu Z T, Gong X Q, Xia Y P, Chen X Y, Wu J T.2024. KU-Net: An improved U-Net method for building extraction from high resolution remote sensing imagery. Remote Sensing Information, 39(5): 121-131 (in Chinese)
Ren H Q, Cai G Y, Li Z Q.2020. An approach for urban building shadow extraction based on deep learning. Geomatics World, 27(2): 81-86 (in Chinese)
Wu Y Y, Feng D J, Hu M J, Qian F.2019. Extraction of building area shadow based on aerial images. Geomatics World, 26(6): 44-48 (in Chinese)
Zhang Q, Jia Y H, Wu X L, Hu Z W.2014. A rapid image registration method based on restricted geometry constraints for large-size remote sensing image. Geomatics and Information Science of Wuhan University, 39(1): 17-21, 31 (in Chinese)
Zhao C C, Cao X, Shi F, Chen X H, Cui X H.2023. Extraction of building height information based on GeoEye-1 stereo image pairs. Journal of Spatio-temporal Information, 30(1): 25-32 (in Chinese)
Zhu Y B, Xu Q H, Yang J T, Mo H L.2020. Full convolution neural network based building extraction approach from high resolution aerial image. Geomatics World, 27(2): 101-106 (in Chinese)
Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L.2018b. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848
Chen L C, Zhu Y K, Papandreou G, Schroff F, Adam H.2018a. Encoder-decoder with atrous separable convolution for semantic image segmentation. Computer Vision - ECCV 2018. Cham: Springer International Publishing. 833-851
Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille A L.2016. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062v4. https://doi.org/10.48550/arXiv.1412.7062
Chen L, Papandreou G, Schroff F, Adam H.2017. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587v3. https://doi.org/10.48550/arXiv.1706.05587
Chen S F, Ge C J, Tong Z, Wang J L, Song Y B, Wang J, Luo P.2022. AdaptFormer: Adapting vision Transformers for scalable visual recognition. arXiv:2205.13535v3. https://doi.org/10.48550/arXiv.2205.13535
Chen W Y, Jiang Z Y, Wang Z Y, Cui K X, Qian X N.2019. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA. 8924-8933
Chu X X, Tian Z, Zhang B, Wang X L, Shen C H.2021. Conditional positional encodings for vision Transformers. arXiv:2102.10882v3. https://doi.org/10.48550/arXiv.2102.10882
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N.2021. An image is worth 16×16 words: Transformers for image recognition at scale//International Conference on Learning Representations. 1-20
Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A.2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338
He K M, Zhang X Y, Ren S Q, Sun J.2016. Deep residual learning for image recognition//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA. 770-778
Hu A N, Wu L, Chen S Q, Xu Y Y, Wang H T, Xie Z.2023. Boundary shape-preserving model for building mapping from high-resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 61: 5610217
Hu Z W, Li Q Q, Zou Q, Zhang Q, Wu G F.2016. A bilevel scale-sets model for hierarchical representation of large remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12): 7366-7377
Huynh C, Tran A T, Luu K, Hoai M.2021. Progressive semantic segmentation//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16750-16759
Ji S P, Wei S Q, Lu M.2019. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing, 57(1): 574-586
Kirillov A, Mintun E, Ravi N, Mao H Z, Rolland C, Gustafson L, Xiao T T, Whitehead S, Berg A C, Lo W Y, Dollár P, Girshick R.2023. Segment anything//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France. 3992-4003
LeCun Y, Bottou L, Bengio Y, Haffner P.1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S, Guo B N.2021. Swin Transformer: Hierarchical vision Transformer using shifted windows//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada. 9992-10002
Long J, Shelhamer E, Darrell T.2015. Fully convolutional networks for semantic segmentation//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA. 3431-3440
Maggiori E, Tarabalka Y, Charpiat G, Alliez P.2017. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark//2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). Fort Worth, TX, USA. 3226-3229
Mnih V.2013. Machine learning for aerial image labeling. Doctoral Dissertation. Toronto: University of Toronto
Pang S Y, Shi Y P, Hu H C, Ye L Z, Chen J.2024. PTRSegNet: A patch-to-region bottom-up pyramid framework for the semantic segmentation of large-format remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17: 3664-3673
Ronneberger O, Fischer P, Brox T.2015. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Cham: Springer International Publishing. 234-241
Shan L L, Li M L, Li X B, Bai Y, Lv K, Luo B, Chen S B, Wang W Q.2021. UHRSNet: A semantic segmentation network specifically for ultra-high-resolution images//2020 25th International Conference on Pattern Recognition (ICPR). 1460-1466
Song Y R, Zhou Q Y, Li X T, Fan D P, Lu X Q, Ma L Z.2024. BA-SAM: Scalable bias-mode attention mask for segment anything model//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA. 3162-3173
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I.2023. Attention is all you need. arXiv:1706.03762v7. https://doi.org/10.48550/arXiv.1706.03762
Wang L B, Fang S H, Meng X L, Li R.2022. Building extraction with vision transformer. IEEE Transactions on Geoscience and Remote Sensing, 60: 5625711
Wang S S, Zuo Z Q, Yan S H, Zeng W M, Pang S Y.2024. A novel global-local feature aggregation framework for semantic segmentation of large-format high-resolution remote sensing images. Applied Sciences, 14(15): 6616