CN-121982597-A - City management inspection method and device based on unmanned aerial vehicle, electronic equipment and program product
Abstract
The application discloses an unmanned aerial vehicle (UAV)-based city management inspection method and device, an electronic device, and a program product. The method extracts features from an image to be detected that is acquired by the UAV, fuses the features, performs detection, and generates an inspection result. During feature extraction, an integrated strategy of state-space-based continuous context modeling, global scale-adaptive modulation, and residual information retention is adopted, which improves recognition sensitivity and robustness for low-contrast, multi-scale targets with almost no increase in inference latency. During feature fusion, multi-scale global pooling explicitly models height, width, and global statistical features to capture the distribution of large-scale abnormal regions, and a lightweight windowed local self-attention mechanism is introduced to mine the structural associations and relative position information among local pixels, enhancing the perceptibility of scattered or occluded targets. In addition, a median-reference nonlinear category weight distribution strategy is provided so that the classification loss can be finely regulated.
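The "global scale self-adaptive modulation" referenced above (detailed in claim 3) can be illustrated with a toy sketch. The pooling, the identity-like `proj` mapping, and the temperature-scaled softmax are assumptions for illustration, since the source does not fix concrete operators:

```python
from math import exp

def scale_adaptive_modulation(scale_features, temperature=1.0, proj=None):
    """Toy global scale-adaptive modulation over multi-scale feature maps.

    Each scale is a 2-D list (H x W) of scalars. A global average pool gives
    one statistic per scale; a (hypothetical) lightweight linear map `proj`
    turns the pooled statistics into per-scale scores; a temperature-scaled
    softmax yields normalized weights; each scale's map is then re-weighted.
    """
    # Global average pooling per scale -> one statistic per scale.
    pooled = [sum(sum(row) for row in f) / (len(f) * len(f[0]))
              for f in scale_features]
    if proj is None:
        proj = [1.0] * len(pooled)  # identity-like lightweight mapping for the sketch
    scores = [p * g for p, g in zip(pooled, proj)]
    # Temperature-adjusted, numerically stable softmax -> normalized weights.
    mx = max(s / temperature for s in scores)
    e = [exp(s / temperature - mx) for s in scores]
    z = sum(e)
    weights = [v / z for v in e]
    # Re-weight each scale's global feature by its normalized weight.
    enhanced = [[[w * val for val in row] for row in f]
                for w, f in zip(weights, scale_features)]
    return weights, enhanced
```

With two constant maps of mean 1.0 and 3.0, the larger-response scale receives the larger weight, and the weights sum to one.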
Inventors
- Wang Peng
- Liu Jiamei
- Xu Yaofan
- Zhang Kai
- Cai Dashi
Assignees
- 深圳市锐明像素科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260409
Claims (10)
- 1. A city management inspection method based on an unmanned aerial vehicle, characterized by comprising the following steps: extracting features from an input image to be detected to obtain image features, wherein the image to be detected is a video frame or a static image that is acquired by an unmanned aerial vehicle and contains urban space; performing fusion processing on the image features to obtain target fusion features; and outputting an inspection result corresponding to the image to be detected based on the target fusion features; wherein, during feature extraction, the following operations are performed: converting a first input feature into a block-level representation and introducing position information to obtain a feature sequence; performing state-space-based continuous context modeling on the feature sequence to obtain a global feature; performing enhancement processing on the global feature to obtain a target global feature; and fusing the target global feature with the first input feature to obtain a first output feature corresponding to the first input feature; wherein the first input feature is extracted from the image to be detected, and the image features are obtained from the first output feature.
- 2. The method of claim 1, wherein performing state-space-based continuous context modeling on the feature sequence to obtain the global feature comprises: based on a state space model, cyclically performing the following operations to recursively update the feature sequence position by position, determining the output feature of each position from that position's current hidden state, and fusing the output features of all positions to obtain the global feature: for the current position, acquiring the corresponding current input feature and the historical hidden state passed from the preceding position; generating a learnable parameter adapted to the current input feature based on a preset selective scanning mechanism; performing a structure-aware gating operation according to the degree of structural change of the current input feature to obtain a gating weight, wherein the gating weight is used to adjust the fusion proportion between the historical hidden state and the current input feature; and adjusting the state update relation based on the learnable parameter, and performing weighted fusion of the historical hidden state and the current input feature in combination with the gating weight to obtain the current hidden state.
- 3. The method of claim 1, wherein the global feature is a multi-scale feature, and enhancing the global feature to obtain the target global feature comprises: performing a global average pooling operation on the global feature at each scale to obtain the global statistical feature corresponding to each scale; fusing the global statistical features of all scales to obtain a multi-scale aggregation feature; performing a lightweight mapping operation on the multi-scale aggregation feature to obtain a multi-scale association representation; adjusting the multi-scale association representation based on a preset learnable temperature coefficient and obtaining normalized weights through a normalization operation; at each scale, weighting the global feature of the current scale based on the normalized weight corresponding to the current scale to obtain an enhanced global feature of the current scale; and fusing the enhanced global features of all scales to obtain the target global feature.
- 4. The city management inspection method according to any one of claims 1-3, characterized in that, during the fusion processing, the following operations are performed: performing multi-scale modeling and interaction enhancement processing on a second input feature to obtain a first attention representation in the width dimension and a first attention representation in the height dimension; performing local self-attention structure perception enhancement processing on the second input feature to obtain a second attention representation in the height dimension and a second attention representation in the width dimension; performing weighted fusion of the first and second attention representations of the width dimension, and of the first and second attention representations of the height dimension, based on a content-adaptive gating fusion mechanism, to obtain a fused attention representation of the height dimension and a fused attention representation of the width dimension; and enhancing the second input feature based on the fused attention representations of the height and width dimensions to obtain a second output feature corresponding to the second input feature; wherein the second input feature is obtained from the image features, and the target fusion features are obtained from the second output feature.
- 5. The method of claim 4, wherein performing multi-scale modeling and interaction enhancement processing on the second input feature to obtain the first attention representation of the width dimension and the first attention representation of the height dimension comprises: in each preset dimension, sequentially performing an adaptive average pooling operation, a channel compression operation, and a batch normalization operation; performing an activation operation on the resulting normalized context compression representation of the current dimension; weighting the normalized context compression representation of the current dimension based on the resulting attention weight of the current dimension; and projecting and scaling the resulting enhanced representation of the current dimension to obtain the target feature of the current dimension, wherein the preset dimensions comprise a width dimension, a height dimension, and a global dimension; and fusing the target features of all preset dimensions and performing dimension-related feature recalibration processing on the fused result to obtain the first attention representation of the width dimension and the first attention representation of the height dimension.
- 6. The method of claim 4, wherein performing local self-attention structure perception enhancement processing on the second input feature to obtain the second attention representation of the height dimension and the second attention representation of the width dimension comprises: dividing the second input feature into a plurality of regular, non-overlapping local regions; performing a linear projection on each local region to generate the query, key, and value of the current local region, and introducing relative position encoding to compute the attention response within the current local region; aggregating and projecting the attention responses of all local regions to obtain a structure perception enhancement feature; performing batch normalization and a linear activation operation on the intermediate representation, weighting the batch-normalized intermediate representation based on the resulting attention weight, and projecting and scaling the resulting enhanced representation to obtain a target enhancement feature; and performing a dimension-related feature recalibration operation based on the target enhancement feature to obtain the second attention representation of the height dimension and the second attention representation of the width dimension.
- 7. The method of any one of claims 1-3, 5, and 6, wherein the method is implemented based on a pre-trained city management inspection model whose classification loss function applies a median-reference-based nonlinear class weight distribution to the class-wise cross-entropy loss, wherein: C represents the total number of categories; n_i represents the number of samples of the i-th class, with n_i > 0; n = (n_1, ..., n_C) is the vector of per-class sample counts; m is the median of n; r_i represents the class balancing factor of the i-th class; α and β are nonlinear mapping parameters satisfying a preset constraint; t_i represents the intermediate mapping variable; t_min and t_max respectively represent the minimum and maximum of t_i over all classes; w_low and w_high respectively represent the lower and upper bounds of the category weight; w_i represents the weight of the i-th class; L represents the classification loss; N represents the total number of samples; p represents the probability that the model predicts the i-th class; y represents the corresponding ground-truth label; and ε is a numerical stability constant.
- 8. A city management inspection device based on an unmanned aerial vehicle, characterized by comprising a feature extraction structure, a feature fusion structure, and a detection structure, wherein the feature extraction structure is configured to extract features from an input image to be detected to obtain image features; the feature fusion structure is configured to perform fusion processing on the image features to obtain target fusion features; and the detection structure is configured to output an inspection result corresponding to the image to be detected based on the target fusion features; wherein the feature extraction structure is specifically configured to perform the following operations during feature extraction: converting a first input feature into a block-level representation and introducing position information to obtain a feature sequence; performing state-space-based continuous context modeling on the feature sequence to obtain a global feature; performing enhancement processing on the global feature to obtain a target global feature; and fusing the target global feature with the first input feature to obtain a first output feature corresponding to the first input feature; wherein the first input feature is extracted from the image to be detected, and the image features are obtained from the first output feature.
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the unmanned aerial vehicle-based city management inspection method of any one of claims 1 to 7.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the unmanned aerial vehicle-based city management inspection method according to any one of claims 1 to 7.
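The median-reference nonlinear class weighting of claim 7 can be sketched as follows. Because the source reproduces the equations only through symbol roles, the concrete forms below are assumptions: a balancing factor m/n_i, a power mapping with exponent gamma, and min-max rescaling into [w_low, w_high], followed by a class-weighted cross-entropy.

```python
from statistics import median
from math import log

def median_reference_weights(class_counts, gamma=0.5, w_low=0.5, w_high=2.0, eps=1e-6):
    """Hypothetical median-referenced nonlinear class weights.

    Assumed form (the source gives only the symbol roles, not the equations):
      r_i = m / n_i                         # class balancing factor, m = median
      t_i = r_i ** gamma                    # nonlinear (power) mapping
      w_i = w_low + (w_high - w_low) * (t_i - t_min) / (t_max - t_min + eps)
    """
    m = median(class_counts)
    t = [(m / (n + eps)) ** gamma for n in class_counts]
    t_min, t_max = min(t), max(t)
    return [w_low + (w_high - w_low) * (ti - t_min) / (t_max - t_min + eps)
            for ti in t]

def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Class-weighted cross-entropy: L = -(1/N) * sum_j w_{y_j} * log(p_{j,y_j} + eps)."""
    total = sum(weights[y] * log(p[y] + eps) for p, y in zip(probs, labels))
    return -total / len(labels)
```

For counts [100, 10, 1000] the rarest class lands at the upper bound, the most frequent at the lower bound, and intermediate classes in between, so the loss is tilted toward rare anomaly categories without letting any weight explode.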
Description
City management inspection method and device based on unmanned aerial vehicle, electronic equipment and program product
Technical Field
The application belongs to the technical field of image processing, and particularly relates to an unmanned aerial vehicle-based city management inspection method and device, an electronic device, and a computer program product.
Background
With the advancement of smart city construction and fine-grained city management, intelligent urban supervision based on unmanned aerial vehicle (UAV) remote sensing vision has gradually become an important technical means. A UAV conducts high-altitude inspection of urban public space, and targets such as construction waste, disorderly stacked materials, household waste, and urban order anomalies are identified and localized with deep learning algorithms, enabling more efficient urban governance. In a city management inspection scenario, the inspection targets are typically either locally discrete or globally continuous in space. This limits the feature characterization capability of existing methods in complex scenes, thereby affecting the accuracy and robustness of overall recognition.
Disclosure of Invention
The application provides an unmanned aerial vehicle-based city management inspection method and device, an electronic device, and a computer program product, which can simultaneously extract fine local features and uniformly model cross-regional global context, thereby improving the recognition accuracy and robustness for inspection targets in complex scenes.
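The fine local features mentioned above are mined in the fusion stage through a lightweight windowed local self-attention (claim 6), which can be illustrated with a toy one-dimensional, single-channel sketch. The scalar projection weights wq/wk/wv and the zero relative-position bias are placeholders for the learned parameters of the claimed mechanism:

```python
from math import exp

def window_self_attention(x, win=4, wq=1.0, wk=1.0, wv=1.0, rel_bias=None):
    """Toy windowed self-attention over a 1-D, single-channel feature row.

    The sequence is split into regular, non-overlapping windows; inside each
    window, scalar queries/keys/values are formed with (hypothetical) weights
    wq, wk, wv; a relative-position bias is added to the logits; and each
    position is replaced by the softmax-weighted sum of values in its window.
    """
    if rel_bias is None:
        # Relative position encoding, indexed by offset j - i (zeros here).
        rel_bias = {d: 0.0 for d in range(-win + 1, win)}
    out = []
    for s in range(0, len(x), win):
        w = x[s:s + win]                        # one non-overlapping local region
        q = [wq * u for u in w]                 # linear projections to q, k, v
        k = [wk * u for u in w]
        v = [wv * u for u in w]
        for i in range(len(w)):
            logits = [q[i] * k[j] + rel_bias[j - i] for j in range(len(w))]
            mx = max(logits)                    # numerically stable softmax
            e = [exp(l - mx) for l in logits]
            z = sum(e)
            out.append(sum(e[j] / z * v[j] for j in range(len(w))))
    return out
```

On a constant input the attention distribution is uniform and the output reproduces the input, which is a quick sanity check that the windowing and normalization are wired correctly.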
In a first aspect, the application provides an unmanned aerial vehicle-based city management inspection method, comprising the following steps: extracting features from an input image to be detected to obtain image features, wherein the image to be detected is a video frame or a static image that is acquired by an unmanned aerial vehicle and contains urban space; performing fusion processing on the image features to obtain target fusion features; and outputting an inspection result corresponding to the image to be detected based on the target fusion features; wherein, during feature extraction, the following operations are performed: converting a first input feature into a block-level representation and introducing position information to obtain a feature sequence; performing state-space-based continuous context modeling on the feature sequence to obtain a global feature; performing enhancement processing on the global feature to obtain a target global feature; and fusing the target global feature with the first input feature to obtain a first output feature corresponding to the first input feature; wherein the first input feature is extracted from the image to be detected, and the image features are obtained from the first output feature.
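The state-space recursion used during feature extraction (elaborated in claim 2) can be sketched in scalar form. The gate definition, the input-conditioned decay, and the averaging readout below are assumptions, since the source specifies only the roles of these quantities, not their concrete forms:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def selective_scan(xs, wa=1.0, wb=1.0, wc=1.0, k=4.0):
    """Toy position-by-position recursive update of a 1-D feature sequence.

    Hypothetical scalar instantiation of the described operations:
      delta_t : degree of structural change, |x_t - x_{t-1}|
      g_t     : structure-aware gate, sigmoid(k * delta_t)
                (assumption: a larger change shifts weight to the current input)
      a_t     : input-conditioned state decay from the selective scanning step
      h_t     : (1 - g_t) * a_t * h_{t-1} + g_t * (wb * x_t)   # gated fusion
      y_t     : wc * h_t, the per-position output read from the hidden state
    The outputs of all positions are averaged into a fused global feature.
    """
    h, prev, ys = 0.0, 0.0, []
    for x in xs:
        delta = abs(x - prev)          # structural change vs. the preceding input
        g = sigmoid(k * delta)         # gating weight: history vs. current input
        a_t = sigmoid(wa * x)          # learnable parameter adapted to the input
        h = (1.0 - g) * a_t * h + g * (wb * x)
        ys.append(wc * h)
        prev = x
    return sum(ys) / len(ys), ys
```

A step edge in the input produces a large gate value, so the hidden state snaps toward the new input rather than drifting, which is the behavior the structure-aware gating is intended to capture.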
Further, performing state-space-based continuous context modeling on the feature sequence to obtain the global feature comprises: based on a state space model, cyclically performing the following operations to recursively update the feature sequence position by position, determining the output feature of each position from that position's current hidden state, and fusing the output features of all positions to obtain the global feature: for the current position, acquiring the corresponding current input feature and the historical hidden state passed from the preceding position; generating a learnable parameter adapted to the current input feature based on a preset selective scanning mechanism; performing a structure-aware gating operation according to the degree of structural change of the current input feature to obtain a gating weight, wherein the gating weight is used to adjust the fusion proportion between the historical hidden state and the current input feature; and adjusting the state update relation based on the learnable parameter, and performing weighted fusion of the historical hidden state and the current input feature in combination with the gating weight to obtain the current hidden state. Further, the global feature is a multi-scale feature, and the enhancement processing performed on the global feature to obtain the target global feature comprises the following steps: performing a global average pooling operation on the global feature at each scale to obtain the global statistical feature corresponding to each scale; concatenating the global statistical features of all scales to obtain a multi-scale aggregation feature; performing a lightweight mapping operation on the multi-scale aggregation feature to obtain a multi-scale association representation; adjusting the multi-scale association representation based on a preset learnable temperature coefficient, and