
CN-117112813-B - Text and pedestrian video retrieval method based on cross-modal learning

CN117112813B

Abstract

The invention provides a text and pedestrian video retrieval method based on cross-modal learning. The method comprises: applying graying and masking to each of a plurality of videos to be identified, and applying graying filtering to the query text; processing each video through a visual feature extraction network to obtain three groups of video global features; processing the text through a text encoder to obtain two groups of text global features; capturing fine-grained information with a trained video retrieval model; obtaining the similarity between each video and the text from the global and fine-grained features; and ranking the similarities to return the video with the highest similarity as the retrieval result. By combining gray-scale with color video frames and global with local features, the invention mitigates the interference caused by insufficient pedestrian detail and by color dependence in current video retrieval methods, refines the pedestrian and text features, and thereby improves the retrieval precision of the model.

Inventors

  • ZHU AICHUN
  • ZHANG XU
  • DONG GUANNAN
  • NI FAN
  • HU FANGQIANG

Assignees

  • Nanjing Tech University (南京工业大学)

Dates

Publication Date
2026-05-12
Application Date
2023-07-24

Claims (7)

  1. A method for text and pedestrian video retrieval based on cross-modal learning, the method comprising the steps of: step 1, respectively applying graying and masking to a plurality of videos to be identified to obtain a gray video frame sequence and a masked video frame sequence, and applying graying filtering to the text to obtain a grayed text; step 2, processing each video from step 1 through a visual feature extraction network to obtain three groups of video global features for each video, the three groups comprising an original-video global feature, a gray-video global feature, and a mask-video global feature; step 3, capturing the fine-grained information in the gray video, the mask video, and the original text corresponding to each video with the trained video retrieval model, and processing it to obtain the fine-grained features of the gray video, the mask video, and the original text, obtaining the similarity between each video and the text from the global features and the fine-grained features, and ranking the similarities to return the video with the highest similarity as the retrieval result; the video processing in step 2 specifically uses ResNet-50 visual feature extraction networks to process the original video frame sequence V_O, the gray video frame sequence V_G, and the masked video frame sequence V_M respectively, obtaining three sets of features and the corresponding self-attention maps, comprising the original-video global feature F_VO, the gray-video global feature F_VG, the mask-video global feature F_VM, the original-video self-attention map A_VO, the gray-video self-attention map A_VG, and the mask-video self-attention map A_VM: (F_VO, A_VO) = ResNet50(V_O); (F_VG, A_VG) = ResNet50(V_G); (F_VM, A_VM) = ResNet50(V_M); the text encoder in step 2 comprises one BERT model and one Bi-LSTM, and encodes the original text and the grayed text respectively to obtain two sets of text global feature vectors and the corresponding self-attention maps, comprising the original-text global feature F_TO, the grayed-text global feature F_TG, the original-text self-attention map A_TO, and the grayed-text self-attention map A_TG: (F_TO, A_TO) = TextEncoder(T_O); (F_TG, A_TG) = TextEncoder(T_G); wherein T_O and T_G denote the original text and the grayed text, respectively; the fine-grained features in step 3 are obtained as follows: step 31, processing the gray-video global feature F_VG, the mask-video global feature F_VM, and the original-text global feature F_TO by a weighted sum with the corresponding self-attention maps to obtain self-attention maps containing fine-grained information for the gray video, the mask video, and the original text; wherein α_1, α_2, α_3 are feature-dimension adjustment coefficients that keep the feature dimensions consistent, λ_1, λ_2, λ_3 are the weights of the weighted sum, and A_VG, A_VM, A_TO are the gray-video, mask-video, and original-text self-attention maps, respectively; step 32, processing the fine-grained self-attention maps with a softmax function to obtain a ranking of the information content carried by each token in each attention map, and taking the tokens with the highest information content, according to a preset percentage, as the corresponding gray-video fine-grained feature f_VG, mask-video fine-grained feature f_VM, and original-text fine-grained feature f_TO. (An illustrative sketch of steps 31-32 follows the claims.)
  2. The method for text and pedestrian video retrieval based on cross-modal learning according to claim 1, wherein the graying processing of the video in step 1 comprises: first performing frame extraction on the video to obtain a video frame sequence V_O = {v_1, …, v_N}; then converting the video frames using a graying function in OpenCV to obtain a gray video frame sequence V_G = {v_G1, …, v_GN}.
  3. The cross-modal learning based text and pedestrian video retrieval method of claim 2, wherein the masking of the video in step 1 employs a visual encoder built on a Vision Transformer model, comprising: inputting the video frame sequence V_O = {v_1, …, v_N} into the visual encoder to obtain the self-attention map A of the video frames, and masking a portion of the image blocks according to a predetermined masking rate to obtain the masked video frame sequence V_M = {v_M1, …, v_MN}.
  4. The cross-modal learning based text and pedestrian video retrieval method of claim 1, wherein the filtering of the text in step 1 comprises: filtering the adjectives in the text using the NLTK part-of-speech tagger, screening out the adjectives representing colors, and replacing the positions of the color adjectives with the placeholder [MASK] to obtain the filtered text.
  5. The method for text and pedestrian video retrieval based on cross-modal learning according to claim 1, wherein in step 3 the similarity between a video and the text is obtained as follows: step 33, computing cosine similarities for the three groups of video-text features of any video: S_O = cos(F_VO, F_TO); S_G = cos(f_VG, F_TG); S_M = cos(f_VM, f_TO); wherein S_O is the cosine similarity between the original-video global feature F_VO and the original-text global feature F_TO, S_G is the cosine similarity between the gray-video fine-grained feature f_VG and the grayed-text global feature F_TG, and S_M is the cosine similarity between the mask-video fine-grained feature f_VM and the original-text fine-grained feature f_TO; step 34, processing the three similarities with a softmax function to obtain the weight of each group of features: (w_O, w_G, w_M) = softmax(S_O, S_G, S_M); step 35, weighting and summing the three groups of video global features with these weights to obtain the effective video feature: F_V = w_O · F_VO + w_G · F_VG + w_M · F_VM; step 36, computing the cosine similarity between the effective video feature F_V and the original-text fine-grained feature f_TO as the final similarity S between the video and the text: S = cos(F_V, f_TO). (An illustrative sketch of steps 33-36 follows the claims.)
  6. The method for text and pedestrian video retrieval based on cross-modal learning according to claim 1, wherein the video retrieval model is trained with a cross-entropy loss L_ce, a cross-modal projection matching loss L_cmpm (CMPM), and a cross-modal projection classification loss L_cmpc (CMPC), the final loss being obtained by the following formula: L = L_ce + L_cmpm + L_cmpc.
  7. The cross-modal learning based text and pedestrian video retrieval method of claim 6, wherein the cross-entropy loss L_ce is obtained by the following formula: L_ce = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i); wherein N is the total number of samples, i denotes the sample number, a sample comprising a video and its corresponding text, with ID denoting the video number; y_i is the real video number corresponding to sample i, and ŷ_i is the predicted video number corresponding to sample i. (An illustrative sketch of the losses in claims 6-7 follows the claims.)
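
For concreteness, what follows is a minimal sketch, in PyTorch, of how the fine-grained feature selection of claim 1 (steps 31-32) could look. The fusion weights, the handling of the dimension-adjustment coefficients, the selection ratio, and all variable names are assumptions for illustration; the patent's exact formulas are not reproduced here.

```python
# Minimal illustrative sketch of claim 1, steps 31-32 (not the patent's
# exact formulas). Assumes all attention maps are 1-D tensors of equal
# length; `lambdas` and `top_ratio` are hypothetical parameters.
import torch

def fuse_attention(a_vg, a_vm, a_to, lambdas=(0.4, 0.3, 0.3)):
    """Step 31 (sketch): weighted sum of the gray-video, mask-video, and
    original-text self-attention maps, yielding an attention map that
    carries fine-grained cross-modal information."""
    l_g, l_m, l_t = lambdas
    return l_g * a_vg + l_m * a_vm + l_t * a_to

def select_fine_grained(tokens, attn, top_ratio=0.3):
    """Step 32 (sketch): rank tokens by softmax-normalized attention and
    keep the top `top_ratio` fraction as fine-grained features.
    tokens: [num_tokens, dim]; attn: [num_tokens]."""
    scores = torch.softmax(attn, dim=0)          # information-content ranking
    k = max(1, int(top_ratio * tokens.size(0)))  # preset percentage
    idx = scores.topk(k).indices
    return tokens[idx]                           # fine-grained feature tokens
```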
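Similarly, a minimal sketch of the similarity fusion in claim 5 (steps 33-36). All feature inputs are assumed to be pooled 1-D vectors of equal dimension, and the argument names are hypothetical.

```python
# Minimal illustrative sketch of claim 5, steps 33-36.
import torch
import torch.nn.functional as F

def video_text_similarity(f_vo, f_vg, f_vm,      # video global features
                          g_vg, g_vm,            # gray/mask fine-grained (pooled)
                          t_o, t_g, t_o_fine):   # text global / grayed global / fine-grained
    # Step 33: three branch cosine similarities, paired as in the claim.
    s = torch.stack([
        F.cosine_similarity(f_vo, t_o, dim=-1),       # original video vs. original text
        F.cosine_similarity(g_vg, t_g, dim=-1),       # gray fine-grained vs. grayed text
        F.cosine_similarity(g_vm, t_o_fine, dim=-1),  # mask fine-grained vs. text fine-grained
    ])
    # Step 34: softmax turns the branch similarities into fusion weights.
    w = torch.softmax(s, dim=0)
    # Step 35: effective video feature = weighted sum of the three globals.
    f_v = w[0] * f_vo + w[1] * f_vg + w[2] * f_vm
    # Step 36: final similarity between the effective video feature and
    # the original-text fine-grained feature.
    return F.cosine_similarity(f_v, t_o_fine, dim=-1)
```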
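Finally, a sketch of the training objective in claims 6-7, assuming the three losses are combined by simple addition and the identity loss is a standard cross-entropy over video-ID logits; the CMPM and CMPC terms are left as placeholders.

```python
# Minimal illustrative sketch of claims 6-7. `cmpm_loss` and `cmpc_loss`
# stand in for the CMPM/CMPC terms, which are not implemented here.
import torch
import torch.nn.functional as F

def id_cross_entropy(logits, video_ids):
    """Claim 7 (sketch): cross-entropy between per-sample scores over
    video IDs (`logits`, shape [N, num_ids]) and the real video numbers
    (`video_ids`, shape [N]); averaged over the N samples."""
    return F.cross_entropy(logits, video_ids)

def total_loss(logits, video_ids, cmpm_loss, cmpc_loss):
    """Claim 6 (sketch): equal-weight sum of the three losses (assumed)."""
    return id_cross_entropy(logits, video_ids) + cmpm_loss + cmpc_loss
```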

Description

Text and pedestrian video retrieval method based on cross-modal learning

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a text and pedestrian video retrieval method based on cross-modal learning.

Background

The purpose of pedestrian video retrieval is to search for and locate a target pedestrian in a video surveillance database based on a language description provided by a witness. Image-text retrieval is by now a mature technology whose speed and accuracy have improved considerably, but retrieval at the video level is still in its early stages; it is more difficult than image retrieval, yet necessary. Video carries temporal characteristics that images lack: a video segment reveals the continuous behavior of a pedestrian over a period of time, so it is rich in action features and continuous action changes that an image cannot provide. During video retrieval, the extraction of pedestrian features is often disturbed by factors such as background clutter and color dependence, so some fine-grained features are lost and the retrieval effect suffers. How to screen out the interfering information in the video and shield the retrieval from its dependence on color information is therefore an urgent problem to be solved.

Disclosure of the Invention

The invention aims to solve the loss of precision in pedestrian video retrieval caused by background interference and color dependence, and provides a text and pedestrian video retrieval method based on cross-modal learning. The technical scheme of the invention is as follows.

The invention provides a text and pedestrian video retrieval method based on cross-modal learning, comprising the following steps: step 1, respectively applying graying and masking to a plurality of videos to be identified to obtain a gray video frame sequence and a masked video frame sequence, and applying graying filtering to the text to obtain a grayed text; step 2, processing each video from step 1 through a visual feature extraction network to obtain three groups of video global features for each video, comprising an original-video global feature, a gray-video global feature, and a mask-video global feature; and step 3, capturing the fine-grained information in the gray video, the mask video, and the original text corresponding to each video with the trained video retrieval model, processing it to obtain the fine-grained features of the original text, obtaining the similarity between each video and the text from the global and fine-grained features, and ranking the similarities to return the video with the highest similarity as the retrieval result.

Further, the graying processing of the video in step 1 comprises: first performing frame extraction on the video to obtain a video frame sequence V_O = {v_1, …, v_N}; then converting the video frames with a graying function in OpenCV to obtain the gray video frame sequence V_G = {v_G1, …, v_GN}. An illustrative sketch of this graying step is given below.
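
The following is a minimal sketch of the frame extraction and OpenCV graying just described. The sampling stride and function names are illustrative assumptions, not the patent's implementation.

```python
# Minimal illustrative sketch of the step-1 graying pipeline using OpenCV.
import cv2

def extract_frames(video_path, stride=8):
    """Frame extraction: sample every `stride`-th frame from the video,
    yielding the sequence V_O = {v_1, ..., v_N}."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def grayscale_frames(frames):
    """Graying: convert each BGR frame with OpenCV's cvtColor,
    yielding the gray sequence V_G = {v_G1, ..., v_GN}."""
    return [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
```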
Further, the masking of the video in step 1 employs a visual encoder built on a Vision Transformer model, comprising: inputting the video frame sequence V_O = {v_1, …, v_N} into the visual encoder to obtain the self-attention map A of the video frames, and covering part of the image blocks according to a preset mask rate to obtain the masked video frame sequence V_M = {v_M1, …, v_MN} (see the first sketch after this passage).

Further, the filtering processing of the text in step 1 comprises: filtering the adjectives in the text with the part-of-speech tagger in NLTK, screening out the adjectives representing colors, and replacing the positions of the color adjectives with the placeholder [MASK] to obtain the filtered text (see the second sketch after this passage).

Further, the video processing in step 2 specifically uses ResNet-50 visual feature extraction networks to process the original video frame sequence V_O, the gray video frame sequence V_G, and the masked video frame sequence V_M respectively, obtaining three sets of features and the corresponding self-attention maps, comprising the original-video global feature F_VO, the gray-video global feature F_VG, the mask-video global feature F_VM, the original-video self-attention map A_VO, the gray-video self-attention map A_VG, and the mask-video self-attention map A_VM.

Further, the text encoder in step 2 comprises one BERT model and one Bi-LSTM, and encodes the original text and the grayed text respectively to obtain two sets of text global feature vectors and the corresponding self-attention maps, comprising the original-text global feature F_TO, the grayed-text global feature F_TG, the original-text self-attention map A_TO, and the grayed-text self-attention map A_TG.
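
A minimal sketch of the step-1 masking described above. It assumes the ViT self-attention map over image patches is already available; whether the patent masks low- or high-attention blocks is an assumption here.

```python
# Minimal illustrative sketch of attention-guided patch masking.
import torch

def mask_frame_patches(patches, attn, mask_rate=0.5):
    """Zero out the `mask_rate` fraction of patches with the lowest
    attention. patches: [num_patches, patch_dim]; attn: [num_patches]."""
    k = int(mask_rate * attn.numel())
    masked = patches.clone()
    if k > 0:
        drop = attn.topk(k, largest=False).indices  # least-attended blocks
        masked[drop] = 0.0
    return masked
```

A sketch of the step-1 text filtering with NLTK's part-of-speech tagger follows. The color word list is an illustrative assumption; the code requires NLTK's 'punkt' and 'averaged_perceptron_tagger' data packages.

```python
# Minimal illustrative sketch of color-adjective filtering with NLTK.
# Run nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# once before use.
import nltk

# Hypothetical color vocabulary for illustration.
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow",
               "gray", "grey", "brown", "pink", "purple", "orange"}

def gray_filter_text(text):
    """Replace adjectives that represent colors with the [MASK] placeholder."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    out = [("[MASK]" if tag.startswith("JJ") and tok.lower() in COLOR_WORDS
            else tok) for tok, tag in tagged]
    return " ".join(out)

# Example: "a man in a red jacket" -> "a man in a [MASK] jacket"
```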