
CN-116343260-B - Motion excitation-based space-time feature erasure video pedestrian re-identification method

CN116343260B

Abstract

The invention discloses a motion-excitation-based spatio-temporal feature erasure method for video pedestrian re-identification. The method comprises: obtaining pedestrian video data containing a plurality of actions; constructing a deep residual network model based on the pedestrian video data; training the deep residual network model based on motion excitation and spatio-temporal features; and performing video pedestrian re-identification with the trained deep residual network model. Current video pedestrian re-identification research focuses on apparent and fine-grained features, i.e. static features within a single frame, while ignoring the dynamic features specific to video, i.e. motion features formed across multiple frames. Addressing this problem, the invention starts from motion feature extraction and proposes a motion-excitation-based spatio-temporal feature erasure algorithm for video pedestrian re-identification.

Inventors

  • CHEN YING
  • HUANG YOUJUN
  • YE FANGBIN
  • GAO LICHAO

Assignees

  • 厦门身份宝网络科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2023-02-22

Claims (9)

  1. A motion-excitation-based spatio-temporal feature erasure method for video pedestrian re-identification, characterized by comprising the following steps: acquiring pedestrian video data comprising a plurality of actions; constructing a deep residual network model based on the pedestrian video data, wherein the deep residual network model comprises at least a first block, a second block, a third block, a fourth block, a spatio-temporal feature module and a random frame feature erasing module connected in sequence by information transmission, wherein at least 2 motion excitation modules are arranged in the second block and at least 3 motion excitation modules are arranged in the third block; and training the deep residual network model based on motion excitation and spatio-temporal features, and performing video pedestrian re-identification with the trained deep residual network model, wherein the training process comprises: the deep residual network model respectively extracts apparent features, motion excitation features, spatial information and temporal information of pedestrians in the video data, and extracts pedestrian features from the apparent features using the motion excitation features as excitation information; a dimension reduction operation is performed on the pedestrian features, the features are divided into temporal and spatial parts, temporal features and spatial features are respectively extracted by convolution operations, and the temporal features and spatial features are concatenated to obtain pedestrian features containing spatio-temporal relations; random frame feature erasing is performed on the pedestrian features containing spatio-temporal relations: from the total number of frames of a video clip, between 1 frame and all frames are randomly extracted, the features of the extracted frames are averaged along the time dimension and then represented as a vector feature; and the vector feature is input into the deep residual network model, and the model is trained with a cross-entropy loss function, a hard-sample-mining triplet loss function and a stochastic gradient descent algorithm.
  2. The motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of claim 1, wherein training the deep residual network model based on motion excitation and spatio-temporal features comprises: the deep residual network model respectively extracts apparent features, motion excitation features, spatial information and temporal information of pedestrians in the video data, and extracts pedestrian features from the apparent features using the motion excitation features as excitation information; the pedestrian features are refined using the spatial information and the temporal information to obtain pedestrian features containing spatio-temporal relations; and the pedestrian features containing spatio-temporal relations are input into the deep residual network model for training to obtain the trained deep residual network model.
  3. The motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of claim 2, wherein the deep residual network model extracts apparent features and motion excitation features of pedestrians in the video data, and extracting pedestrian features from the apparent features using the motion excitation features as excitation information specifically comprises: the deep residual network model extracts the apparent features; dimension reduction is performed on the apparent features to obtain dimension-reduced features, and convolution is performed on the dimension-reduced features to obtain convolved dimension-reduced features; a frame-difference feature is obtained from the dimension-reduced features and the convolved dimension-reduced features by a feature frame-difference method and preprocessing; and the frame-difference feature is multiplied by the apparent features to obtain the motion excitation features, and the motion excitation features are added to the apparent features to obtain the pedestrian features.
  4. The motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of claim 2, wherein the deep residual network model respectively extracts spatial information and temporal information of pedestrians in the video data and refines the pedestrian features using the spatial information and the temporal information to obtain pedestrian features containing spatio-temporal relations, specifically comprising: a dimension reduction operation to half the dimension is performed on the pedestrian features, the features are divided into temporal and spatial parts, and temporal features and spatial features are respectively extracted; in the temporal part, the input features undergo matrix transformation, 3D convolution, batch normalization, activation-function activation, matrix transformation and averaging to obtain the temporal features; in the spatial part, the input features undergo 2D convolution, batch normalization, activation-function activation and averaging to obtain the spatial features; and the spatial features and the temporal features are concatenated to obtain the pedestrian features containing spatio-temporal relations.
  5. The motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of claim 2, wherein inputting the pedestrian features containing spatio-temporal relations into the deep residual network model for training to obtain the trained deep residual network model comprises: the pedestrian features containing spatio-temporal relations are input into the deep residual network model for training; the parameters of the deep residual network model are updated through a cross-entropy loss function, a hard-sample-mining triplet loss function and a stochastic gradient descent algorithm; and the parameter-updated deep residual network model is trained and updated through a limited number of iterations.
  6. The motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of claim 2, wherein inputting the pedestrian features containing spatio-temporal relations into the deep residual network model for training to obtain the trained deep residual network model comprises: the parameters of the deep residual network model are updated through a cross-entropy loss function, a hard-sample-mining triplet loss function and a stochastic gradient descent algorithm, and the parameter-updated deep residual network model is trained and updated through a limited number of iterations.
  7. A motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification device applying the motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of any one of claims 1-6, characterized by comprising an acquisition module, a construction module, a training module and an identification module connected in sequence; the acquisition module is used to acquire pedestrian video data comprising a plurality of actions; the construction module is used to construct a deep residual network model based on the pedestrian video data; and the training and identification modules are used to train the deep residual network model based on motion excitation and spatio-temporal features and to perform video pedestrian re-identification with the trained deep residual network model.
  8. A motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of any one of claims 1 to 6.
  9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method of any one of claims 1 to 6.
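
The motion excitation module of claim 3 (dimension reduction, convolution, feature frame difference, attention-weighted residual) can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the learned 1x1 and 3x3 convolutions are stood in by fixed, deterministic channel projections, and the function name `motion_excitation` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_excitation(feat, r=4):
    """Frame-difference motion excitation over appearance features.

    feat: (T, C, H, W) per-frame appearance features from a residual block.
    All weights here are fixed illustrative projections, not learned convs.
    """
    T, C, H, W = feat.shape
    Cr = max(C // r, 1)
    P_down = np.full((Cr, C), 1.0 / C)   # stand-in for the 1x1 reduction conv
    P_up = np.full((C, Cr), 1.0 / Cr)    # stand-in for the 1x1 restoration conv
    red = np.einsum('oc,tchw->tohw', P_down, feat)   # dimension-reduced features
    conv = red  # a learned 3x3 spatial conv would act here; identity keeps this deterministic
    diff = np.zeros_like(red)            # feature frame difference, zero-padded last frame
    diff[:-1] = conv[1:] - red[:-1]
    att = sigmoid(np.einsum('co,tohw->tchw', P_up, diff))  # motion attention map
    return feat + feat * att             # excite appearance features, then residual add
```

For a static clip (identical frames) the frame difference is zero, so every attention value is sigmoid(0) = 0.5 and the output is simply 1.5 times the input, which makes the residual structure easy to check.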
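
The spatio-temporal feature module of claim 4 (halve the dimension, split into temporal and spatial branches, concatenate) can be sketched like this. It is an illustrative NumPy reduction of the claim, assuming an even channel count; the 3D/2D convolutions and batch normalization are stood in by a ReLU so the sketch stays deterministic, and `spatiotemporal_features` is a hypothetical name.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def spatiotemporal_features(feat):
    """Split pedestrian features into temporal and spatial branches, then concatenate.

    feat: (T, C, H, W) pedestrian features; C is assumed even.
    Convolutions and batch norm are stood in by ReLU for determinism.
    """
    T, C, H, W = feat.shape
    half = C // 2
    x = feat[:, :half]                                 # halved-dimension features
    # temporal branch: (matrix transform + 3D conv + BN + ReLU), then average over space
    t_feat = relu(x).mean(axis=(2, 3)).mean(axis=0)    # -> (half,)
    # spatial branch: (2D conv + BN + ReLU), then average over time
    s_feat = relu(x).mean(axis=0).mean(axis=(1, 2))    # -> (half,)
    return np.concatenate([t_feat, s_feat])            # (C,) spatio-temporal feature
```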
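
The random frame feature erasing step of claim 1 (randomly keep between 1 frame and all frames of a clip, then average over time into one vector feature) admits a short sketch. The function name and the use of `numpy.random.default_rng` are illustrative choices, not part of the patent.

```python
import numpy as np

def random_frame_erase(feat, seed=None):
    """Randomly keep between 1 and all frames, then average over time.

    feat: (T, D) per-frame pedestrian feature vectors for one video clip.
    Returns a single (D,) vector feature representing the clip.
    """
    rng = np.random.default_rng(seed)
    T = feat.shape[0]
    k = int(rng.integers(1, T + 1))              # number of frames to keep
    keep = rng.choice(T, size=k, replace=False)  # indices of surviving frames
    return feat[keep].mean(axis=0)               # temporal average -> vector feature
```

Because every subset of frames is averaged, a clip whose frames all carry the same feature yields that feature unchanged, regardless of which frames are erased.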

Description

Motion excitation-based space-time feature erasure video pedestrian re-identification method

Technical Field

The invention relates to the technical field of video pedestrian re-identification, in particular to a motion-excitation-based spatio-temporal feature erasure method for video pedestrian re-identification.

Background

Artificial intelligence is widely used in the public safety field, and pedestrian re-identification is a typical application of artificial intelligence in that field. Pedestrian re-identification is widely applied in criminal investigation, intelligent security, intelligent monitoring, smart cities, unmanned supermarkets and other areas, and provides a beneficial guarantee for protecting people's lives and property. In early criminal investigation cases, in order to trace the movements of a criminal suspect across surveillance cameras, many case handlers had to monitor numerous cameras and watch long stretches of video for manual screening and investigation, which is time-consuming, labor-intensive and inefficient, and may delay the case so that progress stalls. Pedestrian re-identification technology can accurately retrieve the corresponding suspect from massive video data and provides powerful help for solving cases. The video pedestrian re-identification task itself faces complex challenges: pedestrians with the same identity often differ in pose, scale, clarity and the like when captured by cameras at different angles, while pedestrians with different identities may be similar in appearance, and images under the same camera are often quite alike, making them difficult to distinguish.
Under these conditions, accurate video pedestrian re-identification is very difficult. In video pedestrian re-identification research, a common assumption is that the multiple frames of a video sequence complement and reference each other: for example, when the target pedestrian is occluded in one frame of the sequence, it may not be occluded in other frames, and when the target pedestrian is motion-blurred in one frame, other frames may be in correct focus. Video pedestrian re-identification therefore focuses on the fusion of multi-frame features, in the hope of obtaining discriminative pedestrian video features. However, most current video pedestrian re-identification research borrows ideas from image-based pedestrian re-identification and focuses too heavily on apparent features while neglecting pedestrians' motion features. If the appearance of different persons is similar, apparent features are hard to distinguish, whereas motion features are more discriminative than apparent features.

Disclosure of Invention

In view of the above, the invention aims to provide a motion-excitation-based spatio-temporal feature erasure method for video pedestrian re-identification, in which motion information is introduced as excitation of appearance information; this is more accurate than a model that only attends to appearance information.
According to one aspect of the invention, a motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method is provided. The method comprises: obtaining pedestrian video data containing a plurality of motions; constructing a deep residual network model based on the pedestrian video data; training the deep residual network model based on motion excitation and spatio-temporal features; and performing video pedestrian re-identification with the trained deep residual network model. In this technical scheme, addressing the problem that current video pedestrian re-identification research focuses on apparent and fine-grained features, i.e. static features within a single frame, while ignoring the dynamic features specific to video, i.e. motion features formed across multiple frames, the scheme starts from motion feature extraction and proposes a motion-excitation-based spatio-temporal feature erasure video pedestrian re-identification method. Motion information is introduced into the video pedestrian re-identification method as excitation of appearance information, making it more accurate than a model that only attends to appearance information. In some embodiments, the training of the deep residual network model by using the deep
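
The training objective named in the claims (a cross-entropy loss plus a triplet loss with hard sample mining) can be sketched in NumPy. The batch-hard mining strategy below is the common reading of "hard sample mining" (hardest positive and hardest negative per anchor within a batch); the function names and the margin value are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: per anchor, hardest positive minus hardest negative."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all clip features in the batch
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    losses = []
    for i in range(len(labels)):
        pos = labels == labels[i]
        neg = ~pos
        hardest_pos = d[i][pos].max()   # farthest same-identity sample
        hardest_neg = d[i][neg].min()   # closest different-identity sample
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses))

def cross_entropy_loss(logits, labels):
    """Softmax cross-entropy over identity logits, numerically stabilized."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())
```

With two well-separated identity clusters the batch-hard triplet loss is zero (every hardest negative is farther than every hardest positive plus the margin), while a confident correct logit drives the cross-entropy loss toward zero.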