Spatio-temporal Perception
HE Xiaohui, WU Kaixuan, LI Panle, QIAO Mengjia, CHENG Xijie
[Objective] Mainstream semantic segmentation methods, which are designed primarily for small natural images, face significant challenges when applied to large-scale remote sensing imagery (e.g., 5000×5000 pixels): spatial feature loss due to fragmented processing, block stitching artifacts from patch-based strategies, and prohibitive computational resource demands. To overcome these limitations, this study proposes the large-scale segment anything model (LS-SAM), an enhanced fine-tuning framework based on the segment anything model (SAM) and specifically optimized for accurate and efficient building extraction from ultra-high-resolution remote sensing images. The primary objectives are to: (1) enable end-to-end processing of full-scale images while preserving spatial and contextual integrity; (2) balance computational efficiency with high segmentation accuracy for practical deployment; and (3) address the limitations of existing methods in handling large-scale geospatial data.
[Method] The proposed LS-SAM framework addresses the challenges of large-scale remote sensing image processing through four innovations. (1) A dynamic positional encoding generator (PEG) replaces SAM's fixed positional encoding: depthwise convolutions (kernel size 3) adaptively partition an input image of arbitrary size H×W into patches and project spatial coordinates into learnable embeddings, enabling arbitrary-sized inputs (e.g., 5000×5000 pixels) while preserving positional relationships. (2) A hybrid encoder integrates a CNN backbone with the Transformer: the CNN extracts hierarchical local features (edges, textures) that are fused with SAM's global attention outputs via skip connections. (3) An SMS-AdaptFormer employs parallel convolutional branches with varying kernel sizes (1×1, 3×3, 5×5) and dilation rates (r = 8, 14, 20): the small kernels refine local details, the dilated convolutions expand receptive fields, and the branch features are aggregated by weighted summation for precise segmentation of diverse buildings. (4) A dynamic training strategy is adopted: during training, the model takes full-resolution images and applies random crops (e.g., 512×512 pixels) while the PEG generates adaptive positional encodings; at inference, the PEG handles any input size and the combined CNN-Transformer encoder processes large images (e.g., 5000×5000 pixels) end to end, with no chunking or stitching required. An illustrative sketch of the PEG and the multi-scale branch aggregation is given below.
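The abstract only names these components, so the following PyTorch sketch is a rough, non-authoritative illustration of how two of them might be realized: a depthwise 3×3 convolution acting as a size-agnostic positional encoding generator, and a parallel-branch block with 1×1/3×3/5×5 kernels plus dilated 3×3 convolutions (r = 8, 14, 20) fused by weighted summation. The class names, the softmax branch weighting, the residual connections, and the channel dimension are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class ConvPEG(nn.Module):
    # Illustrative positional encoding generator (assumption, not the paper's code):
    # a depthwise 3x3 convolution over the patch-embedding feature map, so positional
    # information adapts to any input size instead of a fixed-size encoding table.
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) patch embeddings; the residual keeps the content features intact.
        return x + self.proj(x)

class MultiScaleBranchBlock(nn.Module):
    # Illustrative SMS-AdaptFormer-style block (assumption): parallel 1x1/3x3/5x5
    # convolutions plus dilated 3x3 convolutions (r = 8, 14, 20), fused by learnable weights.
    def __init__(self, dim: int, dilations=(8, 14, 20)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2) for k in (1, 3, 5)]
            + [nn.Conv2d(dim, dim, kernel_size=3, padding=r, dilation=r) for r in dilations]
        )
        # One learnable scalar per branch for the weighted summation.
        self.weights = nn.Parameter(torch.ones(len(self.branches)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        out = sum(w[i] * branch(x) for i, branch in enumerate(self.branches))
        return x + out  # residual connection around the multi-scale aggregation

if __name__ == "__main__":
    feats = torch.randn(1, 256, 64, 64)        # hypothetical encoder feature map
    feats = ConvPEG(256)(feats)                # size-agnostic positional encoding
    feats = MultiScaleBranchBlock(256)(feats)  # multi-scale local refinement
    print(feats.shape)                         # torch.Size([1, 256, 64, 64])

Because the positional signal in this sketch comes from a convolution rather than a fixed-size lookup table, the same module can serve both 512×512 training crops and full-resolution inference inputs, which matches the dynamic training strategy described above.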
[Result] Experiments on four public datasets (IAILD, MBD, WBDS, and WAID) demonstrate LS-SAM's superiority. It achieves 86.7% mIoU on IAILD, outperforming DeepLabV3+ (81.25%) and SAM (76.98%). On the WBDS and WAID datasets, LS-SAM attains 96.11% and 94.14% mIoU, respectively, demonstrating robust generalization. It reduces GPU memory usage during training to 12 GB, versus 24 GB for vanilla SAM, and reaches an inference speed of 10.1 FPS on 5000×5000-pixel images on an NVIDIA RTX 3090Ti. Visual results on the Inria and WBDS datasets show that LS-SAM effectively mitigates boundary ambiguities and block stitching errors, particularly in dense urban areas and complex terrains. Additionally, ablation experiments reveal that removing the PEG reduces mIoU by 2.06%, while disabling the SMS-AdaptFormer reduces accuracy by 1.02%, confirming the contribution of each component.
[Conclusion] LS-SAM provides an effective solution for large-scale geospatial analysis by harmonizing global context modeling with local detail preservation. The framework significantly mitigates block stitching errors and computational bottlenecks, achieving state-of-the-art performance in building extraction tasks. This work establishes a foundation for advancing large-scale remote sensing interpretation, with potential applications in urban planning, disaster response, and environmental monitoring. Future work will focus on scaling the architecture to ultra-large imagery (e.g., 10000×10000 pixels) and enhancing cross-modal adaptability for multi-sensor data fusion.