CN-122023751-A - Method, device and system for detecting saliency target and electronic equipment
Abstract
The disclosure provides a method, a device, a system, and electronic equipment for salient object detection, and belongs to the field of computer vision. The method comprises: obtaining a first visible light image and a first thermal infrared image of an object to be detected; and inputting the first visible light image and the first thermal infrared image into a pre-trained first model to obtain a first salient target image of the object to be detected, wherein the first model is an image fusion neural network model determined based on a second visible light image and a second thermal infrared image of a training object and comprises a self-adaptive enhancement module, an encoding and fusion module, and a three-stream differential collaborative decoder. The technical scheme thus coordinates three core stages, from input enhancement through feature fusion to decoding, and addresses, in a layer-by-layer progressive manner, the low detection precision, weak anti-interference capability, and poor fusion effect of existing methods, improving salient object detection precision and suiting a variety of application scenarios.
Inventors
- HUANG SHENGUANG
- SHAO FENG
- SUN ZHIJUN
- ZHU ZHENGANG
- WU JUNHENG
- HE WEIGUO
- WANG JIAN
- WU GAODE
- JIN RUINING
Assignees
- Ningbo Port Information Communication Co., Ltd. (宁波港信息通信有限公司)
- Ningbo University (宁波大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260414
Claims (16)
- 1. A method of salient object detection, the method comprising: acquiring a first visible light image and a first thermal infrared image of an object to be detected; and inputting the first visible light image and the first thermal infrared image into a pre-trained first model to obtain a first salient target image of the object to be detected, wherein the first model is an image fusion neural network model determined based on a second visible light image and a second thermal infrared image of a training object, the first model comprises a self-adaptive enhancement module, an encoding and fusion module and a three-stream differential collaborative decoder, the self-adaptive enhancement module is used for carrying out complementary enhancement based on the second visible light image and the second thermal infrared image to obtain an enhanced third visible light image and a third thermal infrared image, the encoding and fusion module is used for respectively extracting multi-scale features of the third visible light image and the third thermal infrared image and carrying out cross-modal feature fusion to obtain fusion features, the three-stream differential collaborative decoder is used for decoding through three parallel decoding paths to obtain a second salient target image, the three parallel decoding paths comprise a visible light differential decoding path, a thermal infrared differential decoding path and a collaborative fusion decoding path, the visible light differential decoding path is used for decoding based on the third visible light image and the fusion features to obtain visible light differential saliency prediction features, the thermal infrared differential decoding path is used for decoding based on the third thermal infrared image and the fusion features to obtain thermal infrared differential saliency prediction features, and the collaborative fusion decoding path is used for decoding based on the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features and the fusion features to obtain the second salient target image.
- 2. The method of claim 1, wherein the self-adaptive enhancement module comprises a context-aware saliency re-weighting unit and a reverse gating structure unit; the context-aware saliency re-weighting unit is used for receiving the second visible light image and the second thermal infrared image, generating a thermal saliency map based on the second thermal infrared image, generating a guide feature and a cooperative gating signal based on the thermal saliency map, and performing superimposed weighted enhancement on the second visible light image by utilizing the guide feature and the cooperative gating signal to obtain the third visible light image; and the reverse gating structure unit is used for receiving the second visible light image and the second thermal infrared image and respectively extracting their gradient features, generating a reverse gating weight and structural texture features based on the visible light gradient features and the thermal infrared gradient features, weighting and screening the structural texture features based on the reverse gating weight, and projecting the screened structural texture features into the second thermal infrared image for restoration and enhancement to obtain the third thermal infrared image.
- 3. The method of claim 2, wherein the context-aware saliency re-weighting unit comprises a local contrast enhancer, a guide signal processor, a cooperative gating generator, a first element-wise multiplication layer and a first adder; the local contrast enhancer is used for performing local contrast enhancement on the second thermal infrared image to generate the thermal saliency map, the guide signal processor is used for carrying out a first convolution and a first normalization on the thermal saliency map and outputting the guide feature, the cooperative gating generator is used for splicing the second visible light image and the thermal saliency map and generating the cooperative gating signal through a second convolution, a second normalization and a first activation, the first element-wise multiplication layer is used for multiplying the guide feature and the cooperative gating signal to obtain an enhancement term, and the first adder is used for scaling the enhancement term and superimposing it on the second visible light image to obtain the third visible light image.
- 4. The method of claim 2, wherein the reverse gating structure unit comprises a gradient extraction module, a reverse gating generator, a structure projector, a second element-wise multiplication layer and a second adder; the gradient extraction module is used for respectively extracting gradient features of the second visible light image and the second thermal infrared image to obtain the visible light gradient features and the thermal infrared gradient features, the reverse gating generator is used for splicing the visible light gradient features, the thermal infrared gradient features and the difference between them and generating the reverse gating weight through a third convolution, a third normalization and a second activation, the structure projector is used for performing a fourth convolution and a fourth normalization on the visible light gradient features to output the structural texture features, the second element-wise multiplication layer is used for multiplying the reverse gating weight and the structural texture features to obtain weighted structural texture features, and the second adder is used for superimposing the weighted structural texture features on the second thermal infrared image to obtain the third thermal infrared image.
- 5. The method of claim 1, wherein the encoding and fusion module comprises two lightweight visual transformer networks MobileViT-XS and at least five fusion units, wherein the two MobileViT-XS networks are structurally identical and each comprises at least five lightweight visual transformer layers; at each lightweight visual transformer layer, the two MobileViT-XS networks respectively extract the multi-scale features of the third visible light image and the third thermal infrared image and output a corresponding hierarchical feature map for each image, wherein the multi-scale features comprise image texture features, image contour features, local semantic features, global semantic features and deep semantic features; and each fusion unit is used for carrying out channel splicing on the third visible light image features and the third thermal infrared image features of the same level to obtain the fusion features.
- 6. The method according to claim 1, wherein the visible light differential decoding path and the thermal infrared differential decoding path in the three-stream differential collaborative decoder are identical in structure, and each comprises a bilinear interpolation upsampling unit, a channel splicing unit and a depthwise separable convolution normalization activation unit; the bilinear interpolation upsampling unit is used for upsampling the third visible light image or the third thermal infrared image to obtain a first upsampled feature or a second upsampled feature, respectively, the channel splicing unit is used for carrying out channel splicing on the first upsampled feature, the third visible light image and the fusion features, or on the second upsampled feature, the third thermal infrared image and the fusion features, to obtain a first spliced feature or a second spliced feature, respectively, and the depthwise separable convolution normalization activation unit is used for carrying out a fifth convolution, a fifth normalization and a third activation on the first spliced feature or the second spliced feature to correspondingly obtain the visible light differential saliency prediction features or the thermal infrared differential saliency prediction features.
- 7. The method of claim 1, wherein the collaborative fusion decoding path in the three-stream differential collaborative decoder comprises a context-aware decoupling and aggregation module and a modal-aware dynamic aggregation module; the context-aware decoupling and aggregation module is configured to receive the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features and the fusion features and determine a first high-level collaborative feature based on them; and the modal-aware dynamic aggregation module is configured to receive the first high-level collaborative feature, the visible light differential saliency prediction features and the thermal infrared differential saliency prediction features and determine the second salient target image based on them.
- 8. The method of claim 7, wherein the context-aware decoupling and aggregation module comprises a boundary attention branch, a region attention branch, a boundary enhancer, a region enhancer, a gated interaction recombination unit and a residual connection unit; the boundary attention branch is used for performing boundary region perception on an input feature to generate a boundary attention map and boundary-weighting the input feature based on the boundary attention map to obtain a first boundary feature, the region attention branch is used for performing region semantic perception on the input feature to generate a region attention map and region-weighting the input feature based on the region attention map to obtain a first region feature, the boundary enhancer is used for performing directional enhancement on the first boundary feature to obtain a second boundary feature, the region enhancer is used for performing semantic enhancement on the first region feature to obtain a second region feature, the gated interaction recombination unit is used for performing gated weighted interaction between the second boundary feature and the second region feature to obtain a recombined feature, and the residual connection unit is used for superimposing the recombined feature on the input feature to obtain the first high-level collaborative feature, wherein the input feature is the fusion features or a spliced feature determined based on recombination of the fusion features.
- 9. The method according to claim 7, wherein the modal-aware dynamic aggregation module comprises an initial fusion unit, a spatial weight generation unit, a detail enhancement branch, a structure enhancement branch and a weighted summation unit; the initial fusion unit is used for adding the first high-level collaborative feature, the visible light differential saliency prediction features and the thermal infrared differential saliency prediction features element by element to obtain a second high-level collaborative feature, the spatial weight generation unit is used for carrying out a sixth convolution, a sixth normalization and a fourth activation on the second high-level collaborative feature to generate a spatial weight map, the detail enhancement branch is used for extracting texture information of the second high-level collaborative feature to obtain texture features, the structure enhancement branch is used for superimposing the first high-level collaborative feature and the second high-level collaborative feature to obtain structural features, and the weighted summation unit is used for carrying out weighted summation of the texture features and the structural features using the spatial weight map to obtain the second salient target image.
- 10. An apparatus for salient object detection, the apparatus comprising: an acquisition unit, used for acquiring a first visible light image and a first thermal infrared image of an object to be detected; and an input unit, used for inputting the first visible light image and the first thermal infrared image into a pre-trained first model to obtain a first salient target image of the object to be detected, wherein the first model is an image fusion neural network model determined based on a second visible light image and a second thermal infrared image of a training object, the first model comprises a self-adaptive enhancement module, an encoding and fusion module and a three-stream differential collaborative decoder, the self-adaptive enhancement module is used for carrying out complementary enhancement based on the second visible light image and the second thermal infrared image to obtain an enhanced third visible light image and a third thermal infrared image, the encoding and fusion module is used for respectively extracting multi-scale features of the third visible light image and the third thermal infrared image and carrying out cross-modal feature fusion to obtain fusion features, the three-stream differential collaborative decoder is used for decoding through three parallel decoding paths to obtain a second salient target image, the three parallel decoding paths comprise a visible light differential decoding path, a thermal infrared differential decoding path and a collaborative fusion decoding path, the visible light differential decoding path is used for decoding based on the third visible light image and the fusion features to obtain visible light differential saliency prediction features, the thermal infrared differential decoding path is used for decoding based on the third thermal infrared image and the fusion features to obtain thermal infrared differential saliency prediction features, and the collaborative fusion decoding path is used for decoding based on the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features and the fusion features to obtain the second salient target image.
- 11. The apparatus of claim 10, wherein the self-adaptive enhancement module comprises a context-aware saliency re-weighting unit and a reverse gating structure unit; the context-aware saliency re-weighting unit is used for receiving the second visible light image and the second thermal infrared image, generating a thermal saliency map based on the second thermal infrared image, generating a guide feature and a cooperative gating signal based on the thermal saliency map, and performing superimposed weighted enhancement on the second visible light image by utilizing the guide feature and the cooperative gating signal to obtain the third visible light image; and the reverse gating structure unit is used for receiving the second visible light image and the second thermal infrared image and respectively extracting their gradient features, generating a reverse gating weight and structural texture features based on the visible light gradient features and the thermal infrared gradient features, weighting and screening the structural texture features based on the reverse gating weight, and projecting the screened structural texture features into the second thermal infrared image for restoration and enhancement to obtain the third thermal infrared image.
- 12. The apparatus of claim 10, wherein the encoding and fusion module comprises two lightweight visual transformer networks MobileViT-XS and at least five fusion units, wherein the two MobileViT-XS networks are structurally identical and each comprises at least five lightweight visual transformer layers; at each lightweight visual transformer layer, the two MobileViT-XS networks respectively extract the multi-scale features of the third visible light image and the third thermal infrared image and output a corresponding hierarchical feature map for each image, wherein the multi-scale features comprise image texture features, image contour features, local semantic features, global semantic features and deep semantic features; and each fusion unit is used for carrying out channel splicing on the third visible light image features and the third thermal infrared image features of the same level to obtain the fusion features.
- 13. The device according to claim 10, wherein the visible light differential decoding path and the thermal infrared differential decoding path in the three-stream differential collaborative decoder are identical in structure, and each comprises a bilinear interpolation upsampling unit, a channel splicing unit and a depthwise separable convolution normalization activation unit; the bilinear interpolation upsampling unit is used for upsampling the third visible light image or the third thermal infrared image to obtain a first upsampled feature or a second upsampled feature, respectively, the channel splicing unit is used for carrying out channel splicing on the first upsampled feature, the third visible light image and the fusion features, or on the second upsampled feature, the third thermal infrared image and the fusion features, to obtain a first spliced feature or a second spliced feature, respectively, and the depthwise separable convolution normalization activation unit is used for carrying out a fifth convolution, a fifth normalization and a third activation on the first spliced feature or the second spliced feature to correspondingly obtain the visible light differential saliency prediction features or the thermal infrared differential saliency prediction features.
- 14. The apparatus of claim 10, wherein the collaborative fusion decoding path in the three-stream differential collaborative decoder comprises a context-aware decoupling and aggregation module and a modal-aware dynamic aggregation module; the context-aware decoupling and aggregation module is configured to receive the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features and the fusion features and determine a first high-level collaborative feature based on them; and the modal-aware dynamic aggregation module is configured to receive the first high-level collaborative feature, the visible light differential saliency prediction features and the thermal infrared differential saliency prediction features and determine the second salient target image based on them.
- 15. A system for salient object detection, the system comprising: a self-adaptive enhancement module, used for receiving a first visible light image and a first thermal infrared image of an object to be detected and carrying out complementary enhancement on them to obtain an enhanced fourth visible light image and a fourth thermal infrared image; an encoding and fusion module, used for respectively extracting multi-scale features of the fourth visible light image and the fourth thermal infrared image and carrying out cross-modal feature fusion to obtain fusion features of the object to be detected, wherein the multi-scale features comprise image texture features, image contour features, local semantic features, global semantic features and deep semantic features; and a three-stream differential collaborative decoder, used for decoding the fourth visible light image, the fourth thermal infrared image and the fusion features of the object to be detected through three parallel decoding paths to obtain a first salient target image, wherein the three parallel decoding paths comprise a visible light differential decoding path, a thermal infrared differential decoding path and a collaborative fusion decoding path, the visible light differential decoding path is used for decoding based on the fourth visible light image and the fusion features of the object to be detected to obtain visible light differential saliency prediction features of the object to be detected, the thermal infrared differential decoding path is used for decoding based on the fourth thermal infrared image and the fusion features of the object to be detected to obtain thermal infrared differential saliency prediction features of the object to be detected, and the collaborative fusion decoding path is used for decoding based on the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features and the fusion features of the object to be detected to obtain the first salient target image.
- 16. An electronic device, comprising: a memory for storing computer-readable instructions; and a processor for executing the computer-readable instructions to cause the electronic device to perform the method of any of claims 1-9.
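To make the claimed units concrete, the context-aware saliency re-weighting unit of claim 3 can be illustrated with a minimal single-channel NumPy sketch. This is not the patented implementation: the scalar weights `w_guide` and `w_gate`, the box-mean local contrast, and the scaling factor `alpha` are all hypothetical stand-ins for the patent's unspecified convolutions, normalizations, and local contrast enhancer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_contrast(t, k=3):
    # Thermal saliency map: absolute deviation from a local box mean
    # (a stand-in for the patent's unspecified local contrast enhancer).
    h, w = t.shape
    pad = k // 2
    tp = np.pad(t, pad, mode="edge")
    mean = np.array([[tp[i:i + k, j:j + k].mean() for j in range(w)]
                     for i in range(h)])
    s = np.abs(t - mean)
    return s / (s.max() + 1e-8)

def csru(rgb, thermal, w_guide=1.0, w_gate=(1.0, 1.0), alpha=0.5):
    # Context-aware saliency re-weighting (claim 3), single-channel sketch:
    # hypothetical scalar weights play the role of the 1x1 convolutions.
    sal = local_contrast(thermal)                      # thermal saliency map
    guide = np.tanh(w_guide * sal)                     # guide feature (conv + norm stand-in)
    gate = sigmoid(w_gate[0] * rgb + w_gate[1] * sal)  # co-gating from "spliced" inputs
    return rgb + alpha * guide * gate                  # scaled enhancement term added back
```

The structure follows the claim's data flow: the multiplication layer forms the enhancement term `guide * gate`, and the adder superimposes its scaled version on the visible light input.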
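The reverse gating structure unit of claim 4 admits a similar single-channel sketch, again with hypothetical scalar weights standing in for the third and fourth convolutions, and forward differences standing in for the gradient extraction module.

```python
import numpy as np

def grad_mag(img):
    # Gradient magnitude from forward differences (a stand-in for the
    # patent's gradient extraction module).
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.abs(gx) + np.abs(gy)

def inverse_gating(rgb, thermal, w=(1.0, -1.0, 2.0), w_proj=1.0):
    # Reverse gating structure unit (claim 4), single-channel sketch.
    gv, gt = grad_mag(rgb), grad_mag(thermal)
    # Reverse gate: opens where visible gradients exceed thermal ones,
    # i.e. where the thermal image is missing structure.  The gate is a
    # sigmoid over the "spliced" gv, gt and their difference, per the claim.
    gate = 1.0 / (1.0 + np.exp(-(w[0] * gv + w[1] * gt + w[2] * (gv - gt))))
    structure = np.tanh(w_proj * gv)   # structural texture features from visible light
    return thermal + gate * structure  # project screened texture into the thermal image
```

Feeding an image with a strong visible-only edge shows the intended behavior: texture is injected into the thermal channel only where the gate opens.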
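One differential decoding path of claim 6 (upsampling, channel splicing, then a depthwise-separable convolution with normalization and activation) can be sketched as follows. The three-channel stack, the kernels `dw_kernels`, and the pointwise weights `pw_weights` are hypothetical simplifications of the real multi-channel features and learned parameters.

```python
import numpy as np

def upsample2x(x):
    # Bilinear 2x upsampling of a 2-D map (edge-replicating).
    h, w = x.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    yf = np.floor(ys).astype(int); xf = np.floor(xs).astype(int)
    y0 = np.clip(yf, 0, h - 1); y1 = np.clip(yf + 1, 0, h - 1)
    x0 = np.clip(xf, 0, w - 1); x1 = np.clip(xf + 1, 0, w - 1)
    wy = (ys - yf)[:, None]; wx = (xs - xf)[None, :]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def dwconv3(x, k):
    # Depthwise 3x3 convolution of one channel (zero padding).
    h, w = x.shape
    xp = np.pad(x, 1)
    return np.array([[(xp[i:i + 3, j:j + 3] * k).sum() for j in range(w)]
                     for i in range(h)])

def diff_decode_path(feat, enhanced, fused, dw_kernels, pw_weights):
    # One differential decoding path (claim 6): bilinear upsampling ->
    # channel splicing -> depthwise-separable conv + normalization + ReLU.
    up = upsample2x(feat)                                  # upsampled feature
    cat = np.stack([up, enhanced, fused])                  # channel splicing, (3, H, W)
    dw = np.stack([dwconv3(c, k) for c, k in zip(cat, dw_kernels)])  # depthwise stage
    pw = np.tensordot(pw_weights, dw, axes=1)              # pointwise 1x1 mixing
    pw = (pw - pw.mean()) / (pw.std() + 1e-8)              # normalization stand-in
    return np.maximum(pw, 0.0)                             # activation (ReLU)
```

The same function serves both the visible light and thermal infrared paths, matching the claim's statement that the two paths are structurally identical.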
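Finally, the modal-aware dynamic aggregation module of claim 9 reduces to a short sketch. Note one assumption beyond the claim text: the structure branch here receives the complementary weight `1 - W`, whereas the claim only states that both branches are weighted by the spatial weight map; the scalar weights are likewise hypothetical.

```python
import numpy as np

def modal_aware_aggregation(f1, pv, pt, w_spatial=1.0, w_detail=1.0):
    # Modal-aware dynamic aggregation (claim 9), single-channel sketch.
    f2 = f1 + pv + pt                          # initial element-wise fusion
    w = 1.0 / (1.0 + np.exp(-w_spatial * f2))  # spatial weight map (conv/norm/act stand-in)
    texture = np.tanh(w_detail * f2)           # detail-enhancement branch
    structure = f1 + f2                        # structure-enhancement branch
    return w * texture + (1.0 - w) * structure # assumed complementary weighting
```

Here `f1` is the first high-level collaborative feature and `pv`, `pt` are the two differential saliency prediction features; the output corresponds to the second salient target image.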
Description
Method, device and system for detecting saliency target and electronic equipment Technical Field The present disclosure relates to the field of computer vision, and in particular to a method, apparatus, system, and electronic device for salient object detection. Background Salient object detection aims to simulate the human visual attention mechanism and to rapidly locate and segment the most visually conspicuous objects in a complex scene. As an important preprocessing step, it is widely used in downstream tasks such as image segmentation, object tracking, content-aware image editing, and robot navigation, so a high-precision salient object detection method is necessary. However, most current salient object detection methods rely on a single visible light or thermal infrared modality, or on a simple fusion of the two. These approaches have clear limitations. First, single-modality visible light detection is easily disturbed by illumination changes and background clutter; effective features are difficult to extract in low-illumination or shadowed scenes, and detection precision drops. Single-modality thermal infrared detection suffers from low resolution, blurred target boundaries, and thermal crossover interference, which likewise limit detection precision. Meanwhile, simple visible light and thermal infrared fusion adopts coarse schemes such as channel splicing and element-wise addition, which ignore the essential differences between the two modalities, fail to exploit their complementary advantages, and easily amplify modality noise, resulting in poor fusion and reduced detection precision. In summary, existing salient object detection methods suffer reduced detection accuracy and are difficult to adapt to diverse application scenarios.
Disclosure of Invention The disclosure provides a method, a device, a system, and electronic equipment for salient object detection, which are intended to alleviate, to a certain extent, the reduced detection precision of existing salient object detection methods and their difficulty in adapting to diverse application scenarios. According to one aspect of the present disclosure, there is provided a method of salient object detection, the method comprising: acquiring a first visible light image and a first thermal infrared image of an object to be detected; and inputting the first visible light image and the first thermal infrared image into a pre-trained first model to obtain a first salient target image of the object to be detected; the first model is an image fusion neural network model determined based on a second visible light image and a second thermal infrared image of a training object, and comprises a self-adaptive enhancement module, an encoding and fusion module, and a three-stream differential collaborative decoder, wherein the self-adaptive enhancement module is used for carrying out complementary enhancement based on the second visible light image and the second thermal infrared image to obtain an enhanced third visible light image and a third thermal infrared image, the encoding and fusion module is used for respectively extracting multi-scale features of the third visible light image and the third thermal infrared image and carrying out cross-modal feature fusion to obtain fusion features, the three-stream differential collaborative decoder is used for decoding through three parallel decoding paths to obtain a second salient target image, and the three parallel decoding paths comprise a visible light differential decoding path, a thermal infrared differential decoding path, and a collaborative fusion decoding path; the visible light differential decoding path decodes based on the third visible light image and the fusion features to obtain visible light differential saliency prediction features, the thermal infrared differential decoding path decodes based on the third thermal infrared image and the fusion features to obtain thermal infrared differential saliency prediction features, and the collaborative fusion decoding path decodes based on the visible light differential saliency prediction features, the thermal infrared differential saliency prediction features, and the fusion features to obtain the second salient target image. In addition, the self-adaptive enhancement module comprises a context-aware saliency re-weighting unit and a reverse gating structure unit, wherein the context-aware saliency re-weighting unit is used for receiving the second visible light image and the second thermal infrared image, generating a thermal saliency map based on the second thermal infrared image, generating a guide feature and a cooperative gating signal based on the thermal saliency map, and carrying out superimposed weighted enhancement on the second visible light image by utilizing the guide feature and the cooperative gating signal to obtain the third visible light image; the reverse gating structure unit is used for receiving and respectively extrac