
CN-122024177-A - Pedestrian alignment method, system, electronic equipment and storage medium


Abstract

The embodiment of the invention provides a pedestrian alignment method, system, electronic device and storage medium. The method encodes air-ground collaborative pedestrian video through a ViT (Vision Transformer) context encoder and extracts a physical-attribute-invariant identity representation; it applies a cross-view random mask to the spatio-temporal representation to obtain a context representation; it builds a cross-view joint embedding prediction network that, conditioned on the context representation and a target-view encoding, predicts the identity representation of the target view in representation space, taking the output of a target encoder as the training target, and achieves ground-aerial representation alignment through bidirectional training; and, when a ground query video is received, it extracts the target identity representation, performs similarity matching against an aerial gallery, and determines the target pedestrian. Because the cross-view prediction operates purely in representation space, no pixel-level reconstruction is needed, and efficient, stable bidirectional feature alignment between the ground surveillance view and the unmanned aerial vehicle aerial view is achieved.

Inventors

  • ZHAI YAJING
  • DONG YIYING
  • JIN XIN

Assignees

  • Eastern Institute of Technology, Ningbo (宁波市东方理工高等研究院)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (10)

  1. A pedestrian alignment method, the method comprising: preprocessing an air-ground collaborative pedestrian video and encoding it with a video encoder to obtain a continuous spatio-temporal representation; inputting the continuous spatio-temporal representation into a ViT context encoder, extracting an identity representation based on temporal average pooling, and applying a cross-view random mask to the continuous spatio-temporal representation to obtain a context representation; building a cross-view joint embedding prediction network, feeding the predictor with the context representation and the target-view encoding as conditions, predicting the identity representation of the target view in representation space, updating the parameters of a target encoder by exponential moving average, taking the output of the target encoder as the prediction target, and training the predictor bidirectionally, wherein the loss function is constrained to meet a preset requirement during the bidirectional training; a process of bidirectionally training the predictor based on the prediction network, comprising: mapping the identity representation into a base query vector, converting the parameters of the target view into a target-view condition vector, and feeding the target-view condition vector into the predictor as modulation information; feeding source-view features into the predictor, obtaining target-view-related features via the prediction network, combining them with the identity representation to generate a target-view identity representation, taking the real view representation output by the target encoder for the target-view video as the prediction target, and computing the L2 distance between the target-view identity representation and the real view representation as the representation prediction loss; introducing VICReg regularization terms during training, applying variance, invariance and covariance constraints to the target-view identity representation through the VICReg terms, applying a consistency constraint to the identity-stable features, and applying a view supervision constraint to the view-related features; and, when a ground query video is received, extracting a target identity representation of the ground query video, inputting it into the predictor, predicting the target representation of the corresponding aerial view in representation space, and determining the target pedestrian through feature matching.
  2. The pedestrian alignment method of claim 1, wherein inputting the continuous spatio-temporal representation into a ViT context encoder and extracting an identity representation based on temporal average pooling comprises: inputting the spatio-temporal representation into the ViT context encoder, obtaining continuous embedding vectors corresponding to each spatio-temporal token through embedding conversion, and performing sequence reshaping and position encoding on the continuous embedding vectors; acquiring motion associations between the continuous embedding vectors across successive frames via the multi-head self-attention mechanism of the ViT context encoder; and performing average pooling along the time dimension based on the motion associations, followed by global spatial average pooling of the corresponding feature map, to obtain a single vector, namely the identity representation (an illustrative sketch of this pooling follows the claims).
  3. The pedestrian alignment method of claim 1, wherein the implementation of the bidirectional training comprises: using the aerial and ground videos in the pedestrian video as the data basis, pairing aerial data and ground video data from the same day into a joint spatio-temporal sequence, and constructing the bidirectional training as follows: extracting the identity representation and source-view features of a ground video with the ViT context encoder, inputting them together with the aerial target-view condition into the target-view condition prediction module to generate an aerial target-view identity representation, taking the real aerial representation output by the target encoder for the aerial video as the supervision target, and computing the prediction loss for training; and extracting the identity representation and source-view features of an aerial video with the ViT context encoder, inputting them together with the ground target-view condition into the target-view condition prediction module to generate a ground target-view identity representation, taking the real ground representation output by the target encoder for the ground video as the supervision target, and computing the prediction loss for training.
  4. The pedestrian alignment method of claim 3, wherein the loss function comprises a total loss formed by weighting the representation prediction loss, the VICReg regularization loss, the identity-stable feature consistency loss and the view supervision loss, calculated as $L_{\mathrm{total}} = L_{\mathrm{pred}} + \alpha L_{\mathrm{VICReg}} + \beta L_{\mathrm{id}} + \gamma L_{\mathrm{view}}$, where $\alpha$, $\beta$ and $\gamma$ are the weight coefficients of the respective loss terms; the representation prediction loss is calculated as $L_{\mathrm{pred}} = \frac{1}{N}\sum_{i=1}^{N}\lVert p_i - \mathrm{sg}(z_i)\rVert_2^2$, where $p_i$ is the target-view identity representation generated for the $i$-th sample, $z_i$ is the real target-view representation output by the target encoder for the target-view video, $\mathrm{sg}(\cdot)$ is the stop-gradient operation, and $N$ is the number of samples in a batch; the VICReg regularization loss is calculated as $L_{\mathrm{VICReg}} = \lambda L_{\mathrm{var}} + \mu L_{\mathrm{inv}} + \nu L_{\mathrm{cov}}$, where $L_{\mathrm{var}}$ is the variance term, $L_{\mathrm{inv}}$ the invariance term, $L_{\mathrm{cov}}$ the covariance term, and $\lambda$, $\mu$, $\nu$ the respective weight coefficients; the identity-stable feature consistency loss is calculated as $L_{\mathrm{id}} = \frac{1}{M}\sum\lVert f_{\mathrm{id}}(a) - f_{\mathrm{id}}(b)\rVert_2^2$, where $f_{\mathrm{id}}(a)$ and $f_{\mathrm{id}}(b)$ denote the identity-stable features extracted for the same target pedestrian under different views or different shooting heights, and $M$ is the number of paired samples; the view supervision loss is calculated as $L_{\mathrm{view}} = L_{\mathrm{cls}} + \eta L_{\mathrm{reg}}$, where $\eta$ is a weight coefficient, $L_{\mathrm{cls}}$ constrains the prediction error of the view-related features against discrete view-category labels, and $L_{\mathrm{reg}}$ constrains the prediction error of the view-related features against continuous view parameters or shooting-height parameters (an illustrative sketch of the total loss follows the claims).
  5. The pedestrian alignment method of claim 4, wherein the total loss further comprises a temporal consistency loss calculated as $L_{\mathrm{temp}} = \frac{1}{N(T-1)}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\lVert p_{i,t+1} - p_{i,t}\rVert_2^2$, where $p_{i,t}$ and $p_{i,t+1}$ denote the target-view identity representations generated for the $i$-th sample in adjacent frames or adjacent time windows, and $T$ is the temporal length.
  6. The pedestrian alignment method of claim 1, wherein inputting the target identity representation into the predictor to predict the target representation of the corresponding aerial view in representation space comprises: presetting an aerial view parameter set, converting it through sine/cosine encoding into a view encoding vector matching the input dimension of the predictor, concatenating the view encoding vector with the target identity representation, inputting the result into the predictor, and outputting the target representation of the aerial view in a single forward pass (an illustrative sketch of the view encoding and matching follows the claims).
  7. The pedestrian alignment method of claim 6, wherein determining the target pedestrian by feature matching comprises: comparing the target representation with real aerial features pre-stored in an aerial drone gallery, and computing the cosine similarity or Euclidean distance between the target representation and the real aerial features; and computing a similarity score from the cosine similarity or Euclidean distance, ranking the pedestrians corresponding to the real aerial features by the similarity score, and matching the target pedestrian based on the ranking result.
  8. A pedestrian alignment system, the system comprising: an encoding quantization module for preprocessing an air-ground collaborative pedestrian video and encoding it with a video encoder to obtain a continuous spatio-temporal representation; a representation extraction module for inputting the continuous spatio-temporal representation into a ViT context encoder, extracting an identity representation based on temporal average pooling, and applying a cross-view random mask to the continuous spatio-temporal representation to obtain a context representation; a cross-view prediction training module for building a cross-view joint embedding prediction network, feeding the predictor with the context representation and the target-view encoding as conditions, predicting the identity representation of the target view in representation space, updating the parameters of a target encoder by exponential moving average, taking the output of the target encoder as the prediction target, and training the predictor bidirectionally, wherein the loss function is constrained to meet a preset requirement during the bidirectional training; a process of bidirectionally training the predictor based on the prediction network, comprising: mapping the identity representation into a base query vector, converting the parameters of the target view into a target-view condition vector, and feeding the target-view condition vector into the predictor as modulation information; feeding source-view features into the predictor, obtaining target-view-related features via the prediction network, combining them with the identity representation to generate a target-view identity representation, taking the real view representation output by the target encoder for the target-view video as the prediction target, and computing the L2 distance between the target-view identity representation and the real view representation as the representation prediction loss; introducing VICReg regularization terms during training, applying variance, invariance and covariance constraints to the target-view identity representation through the VICReg terms, applying a consistency constraint to the identity-stable features, and applying a view supervision constraint to the view-related features; and a query module for, when a ground query video is received, extracting a target identity representation of the ground query video, inputting it into the predictor, predicting the target representation of the corresponding aerial view in representation space, and determining the target pedestrian through feature matching.
  9. An electronic device comprising a processor and a memory, the processor being connected to the memory; the memory is configured to store executable program code; and the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code to perform the method according to any one of claims 1-7.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
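
The sketches below are editorial illustrations of the claimed steps, not part of the claims and not the patent's actual code. This first one shows a minimal reading of claim 2's identity extraction: temporal average pooling over the spatio-temporal tokens followed by global spatial average pooling. The tensor layout and the assumption that the tokens already carry the ViT's self-attention output are illustrative choices.

```python
# Illustrative sketch of claim 2 (not the patent's code): identity extraction
# by temporal average pooling followed by global spatial average pooling.
# The (B, T, H*W, D) token layout is an assumption.
import torch

def extract_identity(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, T, H*W, D) spatio-temporal features from the ViT context
    encoder (after embedding, reshaping, position encoding, self-attention).
    Returns one identity vector per clip, shape (B, D)."""
    temporal = tokens.mean(dim=1)    # temporal average pooling -> (B, H*W, D)
    identity = temporal.mean(dim=1)  # global spatial average pooling -> (B, D)
    return identity
```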
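The next sketch assembles the total loss of claims 4 and 5. The hinge threshold in the variance term, the weight values, and the extra weight on the temporal term (which claim 5 adds without naming a coefficient) are assumptions; the prediction loss uses PyTorch's mean-squared form, which matches the claim's $\frac{1}{N}\sum\lVert\cdot\rVert_2^2$ up to a constant factor.

```python
# Illustrative sketch of the losses in claims 4-5 (not the patent's code).
# Weight values, the variance hinge, and feature shapes are assumptions.
import torch
import torch.nn.functional as F

def vicreg(z, z_target, lam=25.0, mu=25.0, nu=1.0):
    """VICReg regularizer over (N, D) batches: invariance, variance
    (hinge on per-dimension std), and covariance (off-diagonal) terms."""
    inv = F.mse_loss(z, z_target)                        # invariance term
    std = torch.sqrt(z.var(dim=0) + 1e-4)
    var = torch.mean(F.relu(1.0 - std))                  # variance term
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]        # covariance term
    return lam * inv + mu * var + nu * cov_loss

def total_loss(p, z, f_id_a, f_id_b, view_logits, view_labels,
               view_pred, view_params, p_seq,
               alpha=1.0, beta=1.0, gamma=1.0, eta=1.0, delta=1.0):
    """p: predicted target-view identity reps (N, D); z: target-encoder
    outputs (N, D); p_seq: per-timestep predictions (N, T, D)."""
    l_pred = F.mse_loss(p, z.detach())                   # L_pred, stop-gradient via detach
    l_vic = vicreg(p, z.detach())                        # L_VICReg
    l_id = F.mse_loss(f_id_a, f_id_b)                    # identity consistency L_id
    l_view = (F.cross_entropy(view_logits, view_labels)  # L_cls
              + eta * F.mse_loss(view_pred, view_params))  # + eta * L_reg
    l_temp = F.mse_loss(p_seq[:, 1:], p_seq[:, :-1])     # temporal consistency L_temp
    return l_pred + alpha * l_vic + beta * l_id + gamma * l_view + delta * l_temp
```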
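The last claims-related sketch covers claims 6 and 7: a transformer-style sine/cosine encoding of continuous aerial view parameters, and cosine-similarity ranking against a pre-stored aerial gallery. The choice of view parameters, the divisibility assumption on the encoding dimension, and the gallery layout are all assumptions; claim 7 equally permits Euclidean distance in place of cosine similarity.

```python
# Illustrative sketch of claims 6-7 (not the patent's code): sine/cosine view
# encoding and cosine-similarity gallery ranking. Shapes are assumptions.
import torch
import torch.nn.functional as F

def view_encoding(params: torch.Tensor, dim: int) -> torch.Tensor:
    """params: (B, P) continuous view parameters (e.g. yaw, pitch, height).
    Returns a (B, dim) sine/cosine encoding; assumes dim is divisible by 2*P."""
    half = dim // (2 * params.shape[1])
    freqs = torch.exp(torch.arange(half, dtype=torch.float32)
                      * (-torch.log(torch.tensor(1e4)) / half))
    angles = params.unsqueeze(-1) * freqs                # (B, P, half)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(1)                                # (B, P * 2 * half)

def rank_gallery(query_rep: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """query_rep: (D,) predicted aerial-view representation; gallery: (G, D)
    pre-stored real aerial features. Returns indices sorted by descending
    cosine similarity; Euclidean distance is the claim's stated alternative."""
    sims = F.cosine_similarity(query_rep.unsqueeze(0), gallery, dim=1)
    return sims.argsort(descending=True)
```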

Description

Pedestrian alignment method, system, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of unmanned aerial vehicle (UAV) aerial photography, and in particular to a pedestrian alignment method, a pedestrian alignment system, electronic equipment and a storage medium.

Background

With the development of UAV technology, cross-view pedestrian re-identification tasks that coordinate a UAV view with a ground surveillance view currently follow three main technical paths: metric-learning feature alignment, geometric-transformation preprocessing, and cross-domain image conversion. Geometric-transformation methods use a spatial transformer network or a homography matrix to try to compensate for viewpoint differences through affine transformation, while cross-domain image conversion methods use a generative adversarial network to perform pixel-level style transfer and thereby visually map aerial images to ground images.

However, the prior art exposes limitations in the highly challenging asymmetric-mapping scenario of air-ground coordination. First, because of the significant inter-view domain gap, the extracted features tend to couple too many view-specific attributes: the model tends to capture top-down silhouette features in the aerial view or frontal texture features in the ground view, rather than intrinsic view-invariant identity features. As a result, during retrieval the feature distribution shift caused by view differences far exceeds the dynamic range of the identity-discriminative features, limiting cross-view retrieval accuracy. Second, existing spatial projection models are mostly based on a two-dimensional planar assumption and struggle to accurately represent the nonlinear geometric distortion between a near-vertical aerial view and a near-horizontal ground view; in particular, when facing the multi-height self-occlusion of a pedestrian as a three-dimensional physical entity, simple geometric transformation easily distorts the topological relations among features. In addition, because existing generative models lack effective modeling of the target's underlying physical attributes and their latent state space, they amount to black-box pixel-level mapping: identity consistency constraints are hard to maintain during view conversion, visual artifacts or loss of key identity details easily occur, and the generated images lose their discriminative value.

Disclosure of Invention

Aiming at the problems in the prior art, the embodiments of the invention provide a pedestrian alignment method, a pedestrian alignment system, electronic equipment and a storage medium.
In a first aspect, embodiments of the present disclosure provide a pedestrian alignment method, the method including: preprocessing an air-ground collaborative pedestrian video and encoding it with a video encoder to obtain a continuous spatio-temporal representation; inputting the continuous spatio-temporal representation into a ViT context encoder, extracting an identity representation based on temporal average pooling, and applying a cross-view random mask to the continuous spatio-temporal representation to obtain a context representation; building a cross-view joint embedding prediction network, feeding the predictor with the context representation and the target-view encoding as conditions, predicting the identity representation of the target view in representation space, updating the parameters of a target encoder by exponential moving average, taking the output of the target encoder as the prediction target, and training the predictor bidirectionally, wherein the loss function is constrained to meet a preset requirement during the bidirectional training; a process of bidirectionally training the predictor based on the prediction network (sketched below), comprising: mapping the identity representation into a base query vector, converting the parameters of the target view into a target-view condition vector, and feeding the target-view condition vector into the predictor as modulation information; feeding source-view features into the predictor, obtaining target-view-related features via the prediction network, combining them with the identity representation to generate a target-view identity representation, taking the real view representation output by the target encoder for the target-view video as the prediction target, and computing the L2 distance between the target-view identity representation and the real view representation as the representation prediction loss; introducing VICReg regularization terms during training, applying variance, invariance and covariance constraints to the target-view identity representation through the VICReg terms, applying a consistency constraint to the identity-stable features, and applying a view supervision constraint to the view-related features.
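
As a minimal sketch of the bidirectional predictor training just described (not the patent's actual code): a context encoder produces source-view identity representations, a small predictor is conditioned on the target view, and an exponential-moving-average copy of the encoder supplies stop-gradient targets. The module definitions, the 0.996 momentum, and the learned view embedding (standing in for the sine/cosine parameter encoding of claim 6) are assumptions.

```python
# Illustrative sketch of the bidirectional EMA training (not the patent's code).
# Architectures, dimensions, and the momentum value are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # assumed representation dimension

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4)
target_encoder = copy.deepcopy(context_encoder)          # EMA copy, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))
view_embed = nn.Embedding(2, D)                          # 0 = ground, 1 = aerial

@torch.no_grad()
def ema_update(momentum: float = 0.996) -> None:
    """Update the target encoder as an exponential moving average
    of the context encoder, as the method describes."""
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1.0 - momentum)

def directional_step(src_tokens, tgt_tokens, tgt_view_id):
    """One direction (e.g. ground -> aerial); swap the arguments for the other.
    src_tokens, tgt_tokens: (B, L, D) masked spatio-temporal token sequences."""
    identity = context_encoder(src_tokens).mean(dim=1)    # source identity rep (B, D)
    cond = view_embed(tgt_view_id)                        # target-view condition (B, D)
    pred = predictor(torch.cat([identity, cond], dim=1))  # predicted target-view rep
    with torch.no_grad():
        target = target_encoder(tgt_tokens).mean(dim=1)   # stop-gradient target
    return F.mse_loss(pred, target)                       # L2 representation loss

# Bidirectional training: sum both directions, step, then EMA-update the target:
#   loss = directional_step(ground, aerial, aerial_id) \
#        + directional_step(aerial, ground, ground_id)
#   loss.backward(); optimizer.step(); ema_update()
```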