Res2Net-ViT: A no-reference image quality assessment model with multi-scale feature aggregation

ZHANG Bo; HAO Caixia; HU Yanxiang; ZHANG Yuxin; MA Feixiang

doi:10.19638/j.issn1671-1114.20260208

Journal of Tianjin Normal University (Natural Science Edition), 2026, Vol. 46, Issue 2: 65-73

Information and Computer Science

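School of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

ZHANG Bo (born 1981), male, laboratory technician; research interest: computer vision. E-mail: tjnuzhangbo@163.com

Received: 2024-10-12

Published online: 2026-06-03

Funding

Ministry of Education University-Industry Collaborative Education Program (220900287135507)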

Abstract

In image quality assessment, Transformer models have difficulty capturing image features at different scales and positions. To address this, a hybrid model based on Res2Net and ViT (Vision Transformer) is proposed for no-reference image quality assessment. The model uses the multi-scale and cross-scale connections of Res2Net to enlarge the receptive field and strengthen the feature representation of the feature extractor, thereby obtaining more useful image detail. The feature maps generated by Res2Net replace the image patches as the Transformer input, and the Transformer captures global features, so that the hybrid model balances the detail information and the global information of the image. Experiments are conducted on both authentic (real-world) distortion datasets and synthetic distortion datasets. The results show that the proposed model achieves good assessment performance and a certain degree of generalization ability, and that its overall performance is superior to that of other CNN-based models.

Key words: image quality assessment; global feature; local feature; Res2Net; Vision Transformer (ViT) model

Cite this article

ZHANG Bo, HAO Caixia, HU Yanxiang, ZHANG Yuxin, MA Feixiang. Res2Net-ViT: A no-reference image quality assessment model with multi-scale feature aggregation[J]. Journal of Tianjin Normal University (Natural Science Edition), 2026, 46(2): 65-73. DOI: 10.19638/j.issn1671-1114.20260208
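
The data flow summarized in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. It assumes the hierarchical connection described in the Res2Net paper (channels are split into groups, and each group after the second adds the previous group's output before filtering), and it assumes the backbone's feature map is flattened into one token per spatial position before entering the Transformer. The `toy_filter` stand-in for a learned 3x3 convolution and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def toy_filter(x: Vector) -> Vector:
    """Hypothetical stand-in for a learned 3x3 convolution: a 3-tap moving average."""
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def res2net_split(groups: List[Vector], op: Callable[[Vector], Vector]) -> List[Vector]:
    """Hierarchical connection across channel groups (assumed Res2Net-style).

    y_1 = x_1, y_2 = op(x_2), y_i = op(x_i + y_{i-1}) for i > 2.
    """
    outputs = [groups[0]]          # first group passes through unchanged
    prev = None
    for x in groups[1:]:
        inp = x if prev is None else [a + b for a, b in zip(x, prev)]
        prev = op(inp)             # each later group sees a wider neighbourhood
        outputs.append(prev)
    return outputs


def feature_map_to_tokens(fmap: List[List[List[float]]]) -> List[Vector]:
    """Flatten a C x H x W feature map into H*W tokens of dimension C."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]


# Four groups, each holding the same unit impulse, to expose the receptive field.
impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
ys = res2net_split([impulse[:] for _ in range(4)], toy_filter)
widths = [sum(1 for v in y if v != 0) for y in ys]
print(widths)  # [1, 3, 5, 7]

# A 2-channel, 2 x 3 feature map becomes 6 tokens of dimension 2.
tokens = feature_map_to_tokens([[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                                [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]])
print(len(tokens), tokens[0])  # 6 [1.0, 7.0]
```

The growing widths show why later groups carry coarser-scale information while the first group keeps fine detail. In a real model the tokens would then pass through a Transformer encoder that relates all positions to one another.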
