CN-116469170-B - Human-object interaction recognition method based on recombined sample learning
Abstract
The invention provides a human-object interaction recognition method based on recombined sample learning, comprising the following steps: S1, inputting the image to be recognized into a convolutional neural network for feature extraction and then encoding with a Transformer encoder to obtain a global feature vector and a position encoding vector; S2, inputting the global feature vector, the position encoding vector and a query vector group into a human-object pair decoder to obtain a human-object pair feature vector group; S3, inputting the global feature vector, the position encoding vector and the human-object pair feature vector group into an interaction decoder to obtain an interaction feature vector group; S4, inputting the human-object pair feature vector group and the interaction feature vector group into a feedforward neural network to obtain a human-object pair prediction and an interaction prediction; and S5, obtaining the recognition result from the human-object pair prediction, the interaction prediction and the N_i classes of human-object interactions. In summary, the method improves the accuracy of human-object interaction recognition.
Inventors
- Liang Shuang
- Zhuang Zikun
- Wang Jiawen
- Xie Chi
Assignees
- Tongji University (同济大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-04-23
Claims (8)
- 1. A human-object interaction recognition method based on recombined sample learning, for recognizing an image to be recognized according to a query vector group Q_p containing N_q randomly initialized query vectors and N_i classes of human-object interactions to obtain a recognition result, characterized by comprising the following steps:
Step S1, inputting the image to be recognized into a convolutional neural network for feature extraction, and then encoding with a Transformer encoder to obtain a global feature vector X_s and a position encoding vector E;
Step S2, inputting the global feature vector X_s, the position encoding vector E and the query vector group Q_p into a human-object pair decoder to obtain the feature vectors of N_q human-object pairs as a human-object pair feature vector group R_p;
Step S3, inputting the global feature vector X_s, the position encoding vector E and the human-object pair feature vector group R_p into an interaction decoder to obtain the interaction feature vectors of the N_q human-object pairs as an interaction feature vector group R_i;
Step S4, inputting the human-object pair feature vector group R_p and the interaction feature vector group R_i into a feedforward neural network to obtain the human-object pair prediction Y_p = {(b_n^h, b_n^o, o_n), n ∈ {1, 2, ..., N_q}} and the interaction prediction Y_i = {a_n, n ∈ {1, 2, ..., N_q}};
Step S5, according to the human-object pair prediction Y_p, the interaction prediction Y_i and the N_i classes of human-object interactions, obtaining the human-object interaction prediction result Y' = {y'_n = (b_n^h, b_n^o, u_n, v_n, c_n), n ∈ {1, 2, ..., N_q}} of the N_q human-object pairs as the recognition result;
wherein b_n^h is the human box of the nth human-object pair, b_n^o is the object box of the nth human-object pair, o_n is the object-class confidence vector of the nth human-object pair containing the prediction confidences of all object classes, a_n is the action-class confidence vector of the nth human-object pair containing the prediction confidences of all action classes, y'_n is the human-object interaction prediction result of the nth human-object pair, u_n is the predicted object class of the nth human-object pair, v_n is the predicted action class of the nth human-object pair, and c_n is the maximum confidence of the nth human-object pair;
wherein the model comprising the convolutional neural network, the Transformer encoder, the human-object pair decoder, the interaction decoder and the feedforward neural network is trained with training samples comprising a plurality of training images and their corresponding ground-truth labels, the N_i classes of human-object interactions, and the query vector group Q_p containing the N_q randomly initialized query vectors, through the following steps:
Step T1, randomly selecting a training image I_1 and a training image I_2 from the training samples, inputting them into the convolutional neural network for feature extraction, and then encoding with the Transformer encoder to obtain a global feature vector and a position encoding vector for each training image;
Step T2, for each training image, inputting the global feature vector, the position encoding vector and the query vector group Q_p into the human-object pair decoder to obtain the feature vectors of N_q human-object pairs as a human-object pair feature vector group;
Step T3, for each training image, inputting the global feature vector, the position encoding vector and the human-object pair feature vector group into the interaction decoder to obtain the interaction feature vectors of the N_q human-object pairs as an interaction feature vector group;
Step T4, for each training image, inputting the human-object pair feature vector group and the interaction feature vector group into the feedforward neural network to obtain a human-object pair prediction and an interaction prediction, and directly combining the two to obtain a human-object interaction prediction;
Step T5, for each training image, matching the human-object interaction prediction with the ground-truth labels corresponding to that training image using the Hungarian algorithm to obtain the most accurate human-object interaction prediction;
Step T6, according to the most accurate human-object interaction predictions of the two training images, obtaining the corresponding most accurate human-object pair feature vectors and most accurate interaction feature vectors, performing cross recombination or internal recombination on them to obtain a recombined human-object interaction feature vector group, inputting it into the interaction-action classification feedforward network to obtain recombined human-object interaction predictions, and recombining the ground-truth labels corresponding to the two training images to obtain recombined ground-truth labels;
Step T7, calculating a loss function from the most accurate human-object interaction predictions and ground-truth labels of the two training images and from the recombined human-object interaction predictions and recombined ground-truth labels, and optimizing the parameters of the interaction decoder according to the loss function calculation result;
Step T8, repeating steps T1 to T7 until all training images in the training samples have been used to optimize the parameters of the model, whereupon model training is complete;
wherein the most accurate human-object interaction prediction of the kth training image is Y'_k = {y'_kn = (b_kn^h, b_kn^o, o_kn, a_kn), n ∈ {1, 2, ..., M_k}}, in which y'_kn is the most accurate human-object interaction prediction of the nth human-object pair of the kth training image, b_kn^h is its most accurate human box, b_kn^o is its most accurate object box, o_kn is its most accurate object-class confidence vector, a_kn is its most accurate action-class confidence vector, and M_k is the true number of human-object interaction instances in the kth training image.
- 2. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein step S1 comprises the following substeps: S1-1, inputting the image to be recognized into the convolutional neural network for feature extraction to obtain a visual feature map X_v; S1-2, obtaining the position encoding vector E from the visual feature map X_v; and S1-3, inputting the visual feature map X_v and the position encoding vector E into the Transformer encoder for encoding to obtain the global feature vector X_s.
- 3. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein the feedforward neural network comprises a human-object pair classification feedforward network and an interaction-action classification feedforward network, and step S4 comprises the following substeps: S4-1, inputting the human-object pair feature vector group R_p into the human-object pair classification feedforward network to obtain the human-object pair predictions of the N_q human-object pairs; and S4-2, splicing the human-object pair feature vector group R_p with the interaction feature vector group R_i and inputting the spliced vectors into the interaction-action classification feedforward network to obtain the interaction predictions of the N_q human-object pairs.
- 4. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein step S5 comprises the following substeps: S5-1, for each human-object pair, computing the products of the action-class confidence vector and the object-class confidence vector according to the N_i classes of human-object interactions to obtain N_i confidence products; S5-2, for each human-object pair, selecting the largest confidence product as the maximum confidence of that pair, and taking the object class and the action class corresponding to that maximum as the predicted object class and the predicted action class of that pair; and S5-3, sorting the maximum confidences of the N_q human-object pairs in descending order to obtain the human-object interaction prediction result of the N_q human-object pairs.
- 5. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein, when cross recombination is performed according to the most accurate human-object interaction predictions of the training image I_1 and the training image I_2, step T6 comprises the following substeps:
Step T6-1, according to the most accurate human-object interaction prediction Y'_1 of the training image I_1, obtaining the most accurate human-object pair feature vector group R_p1 and the most accurate interaction feature vector group R_i1 of the training image I_1;
Step T6-2, according to the most accurate human-object interaction prediction Y'_2 of the training image I_2, obtaining the most accurate human-object pair feature vector group R_p2 and the most accurate interaction feature vector group R_i2 of the training image I_2;
Step T6-3, splicing each human-object pair feature vector in R_p1 one by one with each interaction feature vector in R_i2 to obtain a recombined human-object interaction feature group R_12;
Step T6-4, splicing each human-object pair feature vector in R_p2 one by one with each interaction feature vector in R_i1 to obtain a recombined human-object interaction feature group R_21;
Step T6-5, inputting the recombined feature groups R_12 and R_21 respectively into the interaction-action classification feedforward network to obtain the interaction predictions Y_i12 and Y_i21 respectively;
Step T6-6, combining the human-object pair prediction Y_p1 of the training image I_1 with the interaction prediction Y_i12 to obtain the recombined human-object interaction prediction Y_12, and combining the human-object pair prediction Y_p2 of the training image I_2 with the interaction prediction Y_i21 to obtain the recombined human-object interaction prediction Y_21, Y_12 and Y_21 together constituting the recombined human-object interaction prediction;
Step T6-7, according to the ground-truth label G_1 corresponding to the training image I_1, obtaining the human-object pair ground-truth label G_p1 and the interaction ground-truth label G_i1, and according to the ground-truth label G_2 corresponding to the training image I_2, obtaining the human-object pair ground-truth label G_p2 and the interaction ground-truth label G_i2;
Step T6-8, according to the N_i classes of human-object interactions, pairing and combining G_p1 with G_i2 one by one to obtain the recombined ground-truth label G_12 corresponding to the recombined human-object interaction prediction Y_12, and pairing and combining G_p2 with G_i1 one by one to obtain the recombined ground-truth label G_21 corresponding to the recombined human-object interaction prediction Y_21, G_12 and G_21 together constituting the recombined ground-truth labels.
- 6. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein, when internal recombination is performed according to the most accurate human-object interaction predictions of the training image I_1 and the training image I_2, step T6 comprises the following substeps:
Step T6-1, according to the most accurate human-object interaction prediction Y'_1 of the training image I_1, obtaining the most accurate human-object pair feature vector group R_p1 and the most accurate interaction feature vector group R_i1 of the training image I_1;
Step T6-2, according to the most accurate human-object interaction prediction Y'_2 of the training image I_2, obtaining the most accurate human-object pair feature vector group R_p2 and the most accurate interaction feature vector group R_i2 of the training image I_2;
Step T6-3, splicing each human-object pair feature vector in R_p1 one by one with each interaction feature vector in R_i1, and removing the original combinations of human-object pair feature vector and interaction feature vector, to obtain a recombined human-object interaction feature group R_11;
Step T6-4, splicing each human-object pair feature vector in R_p2 one by one with each interaction feature vector in R_i2, and removing the original combinations of human-object pair feature vector and interaction feature vector, to obtain a recombined human-object interaction feature group R_22;
Step T6-5, inputting the recombined feature groups R_11 and R_22 respectively into the interaction-action classification feedforward network to obtain the interaction predictions Y_i11 and Y_i22 respectively;
Step T6-6, combining the human-object pair prediction Y_p1 of the training image I_1 with the interaction prediction Y_i11 to obtain the recombined human-object interaction prediction Y_11, and combining the human-object pair prediction Y_p2 of the training image I_2 with the interaction prediction Y_i22 to obtain the recombined human-object interaction prediction Y_22, Y_11 and Y_22 together constituting the recombined human-object interaction prediction;
Step T6-7, according to the ground-truth label G_1 corresponding to the training image I_1, obtaining the human-object pair ground-truth label G_p1 and the interaction ground-truth label G_i1, and according to the ground-truth label G_2 corresponding to the training image I_2, obtaining the human-object pair ground-truth label G_p2 and the interaction ground-truth label G_i2;
Step T6-8, according to the N_i classes of human-object interactions, pairing and combining G_p1 with G_i1 one by one and removing the original label combinations corresponding to the training image I_1 to obtain the recombined ground-truth label G_11 corresponding to the recombined human-object interaction prediction Y_11, and pairing and combining G_p2 with G_i2 one by one and removing the original label combinations corresponding to the training image I_2 to obtain the recombined ground-truth label G_22 corresponding to the recombined human-object interaction prediction Y_22, G_11 and G_22 together constituting the recombined ground-truth labels.
- 7. The human-object interaction recognition method based on recombined sample learning according to claim 5 or 6, wherein: when the combination of an object class in the human-object pair ground-truth label with one action class of an interaction feature vector in the interaction ground-truth label falls outside the N_i classes of human-object interactions, the element corresponding to that action class in the interaction feature vector is set to 0 in the recombined ground-truth label; when the combination of an object class with all action classes in one interaction feature vector in the interaction ground-truth label falls outside the N_i classes of human-object interactions, the combination of that object class and that interaction feature vector is removed from the recombined ground-truth label; and when the combination of an object class with all action classes in all interaction feature vectors in the interaction ground-truth label falls outside the N_i classes of human-object interactions, that object class corresponds to one all-zero interaction feature vector in the recombined ground-truth label.
- 8. The human-object interaction recognition method based on recombined sample learning according to claim 1, wherein in step T7 the loss function is calculated as: L = λ_b·L_b + λ_u·L_u + λ_o·L_o + λ_a·L_a, where L_b, L_u, L_o and L_a are the box regression loss, the box intersection-over-union loss, the object class loss and the action class loss respectively, λ_b, λ_u, λ_o and λ_a are their respective weight hyper-parameters, and L is the loss function; and the loss function calculation result is: L_batch = ρ·L_orig + (1-ρ)·L_compo, where L_batch is the loss function calculation result, L_orig is the value of L computed from the most accurate human-object interaction predictions and the ground-truth labels, L_compo is the value of L computed from the recombined human-object interaction predictions and the recombined ground-truth labels, and ρ is a weight hyper-parameter balancing the two.
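The S1–S4 data flow of claim 1 (CNN features → Transformer encoding → pair decoder → interaction decoder → feedforward heads) can be illustrated with toy stand-ins. The NumPy sketch below only demonstrates the tensor shapes: the random projections, the attention stand-ins for the decoders, and every dimension (N_q, class counts, feature width) are illustrative assumptions, not the patent's trained modules.

```python
import numpy as np

rng = np.random.default_rng(0)
N_q, n_obj, n_act, d = 4, 5, 6, 32   # query count, class counts, feature width (assumed)

# S1: stand-in for CNN + Transformer encoder -> global features X_s, position codes E
X_s = rng.normal(size=(49, d))       # a 7x7 feature map, flattened
E = rng.normal(size=(49, d))

def attend(queries, keys, values):
    """Scaled dot-product attention: a toy stand-in for a decoder layer."""
    logits = queries @ keys.T / np.sqrt(keys.shape[1])
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ values

# S2: pair decoder -> one pair feature vector per query (group R_p)
Q_p = rng.normal(size=(N_q, d))
R_p = attend(Q_p, X_s + E, X_s)

# S3: interaction decoder, conditioned on R_p (group R_i)
R_i = attend(R_p, X_s + E, X_s)

# S4: feedforward heads -> boxes and object confidences o_n from R_p,
# action confidences a_n from the spliced [R_p, R_i] (cf. claim 3, step S4-2)
W_box = rng.normal(scale=0.1, size=(d, 8))        # human box (4) + object box (4)
W_obj = rng.normal(scale=0.1, size=(d, n_obj))
W_act = rng.normal(scale=0.1, size=(2 * d, n_act))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
boxes = R_p @ W_box
o = sigmoid(R_p @ W_obj)
a = sigmoid(np.concatenate([R_p, R_i], axis=1) @ W_act)
print(boxes.shape, o.shape, a.shape)   # (4, 8) (4, 5) (4, 6)
```

The point of the sketch is the cascade: the interaction decoder consumes the pair decoder's output, so the action head sees both feature groups.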
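Step S5 (claim 4) scores each pair by the product of its object and action confidences. A minimal NumPy sketch with made-up confidence vectors; for simplicity every object/action pairing here is treated as one of the N_i valid interaction classes, whereas a real model would mask pairings outside them.

```python
import numpy as np

# Made-up confidences for N_q = 3 pairs, 2 object classes, 2 action classes.
o = np.array([[0.9, 0.1], [0.2, 0.7], [0.5, 0.5]])   # object-class confidences o_n
a = np.array([[0.8, 0.3], [0.6, 0.4], [0.1, 0.2]])   # action-class confidences a_n

# S5-1: all (object, action) confidence products for each pair
prod = o[:, :, None] * a[:, None, :]                 # shape (N_q, n_obj, n_act)

# S5-2: the largest product is the pair's maximum confidence c_n; its indices
# give the predicted object class u_n and predicted action class v_n
flat = prod.reshape(len(o), -1)
c = flat.max(axis=1)
u, v = np.unravel_index(flat.argmax(axis=1), prod.shape[1:])

# S5-3: sort the N_q pairs by maximum confidence, descending
order = np.argsort(-c)
ranked = [(int(u[i]), int(v[i]), round(float(c[i]), 2)) for i in order]
print(ranked)   # [(0, 0, 0.72), (1, 0, 0.42), (0, 1, 0.1)]
```

Each tuple is (u_n, v_n, c_n); the ranked list is the recognition result of step S5-3.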
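The recombination of claims 5 and 6 splices Hungarian-matched pair features with interaction features, either across two images or within one image with the original combinations removed. A NumPy sketch under assumed sizes (M_1 = 2 and M_2 = 3 matched instances, feature width 8; all names are illustrative, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Most accurate (matched) feature groups for two training images,
# with M_1 = 2 and M_2 = 3 matched instances (sizes assumed for illustration).
Rp1, Ri1 = rng.normal(size=(2, d)), rng.normal(size=(2, d))
Rp2, Ri2 = rng.normal(size=(3, d)), rng.normal(size=(3, d))

def splice(Rp, Ri, drop_original=False):
    """Concatenate every pair feature with every interaction feature.

    With drop_original=True, each feature's original partner (same index)
    is removed -- the internal-recombination variant of claim 6.
    """
    combos = [np.concatenate([p, i])
              for n, p in enumerate(Rp) for m, i in enumerate(Ri)
              if not (drop_original and n == m)]
    return np.stack(combos)

cross_12 = splice(Rp1, Ri2)                        # claim 5: pairs of I_1 x actions of I_2
cross_21 = splice(Rp2, Ri1)                        # claim 5: pairs of I_2 x actions of I_1
internal_1 = splice(Rp1, Ri1, drop_original=True)  # claim 6: within I_1, originals removed
print(cross_12.shape, cross_21.shape, internal_1.shape)   # (6, 16) (6, 16) (2, 16)
```

The spliced rows are what the interaction-action classification feedforward network consumes in step T6-5.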
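The claim-8 loss is a weighted sum of four terms, with the original and recombined batches balanced by ρ. A minimal sketch of the two formulas; the component loss values and all hyper-parameter values below are placeholders, not values given in the patent.

```python
def total_loss(L_b, L_u, L_o, L_a, lam_b=2.5, lam_u=1.0, lam_o=1.0, lam_a=1.0):
    """L = lambda_b*L_b + lambda_u*L_u + lambda_o*L_o + lambda_a*L_a."""
    return lam_b * L_b + lam_u * L_u + lam_o * L_o + lam_a * L_a

def batch_loss(L_orig, L_compo, rho=0.5):
    """L_batch = rho*L_orig + (1 - rho)*L_compo."""
    return rho * L_orig + (1 - rho) * L_compo

L_orig = total_loss(0.4, 0.3, 0.2, 0.1)    # loss on matched original predictions
L_compo = total_loss(0.5, 0.4, 0.3, 0.2)   # loss on recombined predictions/labels
print(round(batch_loss(L_orig, L_compo), 3))   # 1.875
```

With ρ near 1 the recombined samples barely influence training; with ρ near 0 they dominate, so ρ controls how aggressively the recombined samples regularize the interaction decoder.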
Description
Human-object interaction recognition method based on recombined sample learning

Technical Field

The invention relates to the field of human-object interaction recognition, in particular to a human-object interaction recognition method based on recombined sample learning.

Background

As an important direction in the field of artificial intelligence, action recognition has long been a research hotspot in academia and industry. Human-object interaction recognition is a key subtask of action recognition: it must locate the interacting human and object instances in an image and recognize the interaction relationship between the human and the object. Human-object interaction recognition is a core technology for deeper scene understanding and visual cognition, with broad application prospects and large market demand in fields such as security monitoring, video retrieval and autonomous driving. In recent years, deep learning has become the mainstream approach in this field, and as the Transformer architecture has brought great change to computer vision, recent research has proposed a number of one-stage Transformer-based methods that realize end-to-end human-object interaction recognition. Human-object interaction recognition suffers from a severe long-tail data distribution, so the data across interaction categories is unbalanced. To address this problem, some prior studies have proposed few-shot or zero-shot human-object interaction recognition methods that generalize and transfer knowledge from head action classes to tail action classes or to new action classes never seen in training. One family of such methods generates new training samples for the model to learn from by recombining partial features of different classes of human-object interactions.
However, existing methods of this kind mostly adopt a traditional two-stage CNN framework: the visual features of human-object interactions generally lack global context information, and recombining feature samples often loses even more of it, so the recombined feature samples have weak representational power, the learning effect is poor, and the recognition accuracy of human-object interaction models based on recombined feature samples suffers.

Disclosure of Invention

The present invention has been made to solve the above problems, and its object is to provide a human-object interaction recognition method based on recombined sample learning. The invention provides a human-object interaction recognition method based on recombined sample learning, for recognizing an image to be recognized according to a query vector group Q_p containing N_q randomly initialized query vectors and N_i classes of human-object interactions to obtain a recognition result, characterized in that the method comprises the following steps: Step S1, inputting the image to be recognized into a convolutional neural network for feature extraction, and then encoding with a Transformer encoder to obtain a global feature vector X_s and a position encoding vector E; Step S2, inputting the global feature vector X_s, the position encoding vector E and the query vector group Q_p into a human-object pair decoder to obtain the feature vectors of N_q human-object pairs as a human-object pair feature vector group R_p; Step S3, inputting the global feature vector X_s, the position encoding vector E and the human-object pair feature vector group R_p into an interaction decoder to obtain the interaction feature vector group R_i of the N_q human-object pairs; Step S4, inputting the human-object pair feature vector group R_p and the interaction feature vector group R_i into a feedforward neural network to obtain the human-object pair prediction Y_p = {(b_n^h, b_n^o, o_n), n ∈ {1, 2, ..., N_q}} and the interaction prediction Y_i = {a_n, n ∈ {1, 2, ..., N_q}}; Step S5, according to the human-object pair prediction Y_p, the interaction prediction Y_i and the N_i classes of human-object interactions, obtaining the human-object interaction prediction result Y' = {y'_n = (b_n^h, b_n^o, u_n, v_n, c_n), n ∈ {1, 2, ..., N_q}} of the N_q human-object pairs as the recognition result, wherein b_n^h is the human box of the nth human-object pair, b_n^o is the object box of the nth human-object pair, o_n is the object-class confidence vector of the nth human-object pair containing the prediction confidences of all object classes, a_n is the action-class confidence vector of the nth human-object pair containing the prediction confidences of all action classes, y'_n is the human-object interaction prediction result of the nth human-object pair, u_n is the predicted object class of the nth human-object pair, v_n is the predicted action class of the nth human-object pair, and c_n is the maximum confidence of the nth human-object pair. The human-object interaction recognition method based on the recombina