CN-121982608-A - Multi-mode entity identification method integrating video and voice
Abstract
The invention discloses a multi-modal entity identification method fusing video and voice. The method first extracts a visual candidate entity set from the video stream and an auditory candidate entity set from the audio stream; it then establishes a fine-grained temporal correspondence between visual and acoustic features through a learnable temporal alignment network to generate cross-modally aligned joint feature representations; next, taking the candidate entities as nodes, it constructs a cross-modal heterogeneous graph containing multiple types of semantic relation edges and performs message propagation and node feature updating with a graph neural network; finally, it calculates a cross-modal alignment confidence based on the updated node features, fuses the multi-modal candidate entities referring to the same entity, and outputs the category, position, time interval, and confidence of each entity. The invention realizes fine-grained temporal alignment and deep fusion of video and voice, and remarkably improves the accuracy and robustness of cross-modal entity recognition.
Inventors
- SUN TAO
- YANG ZHEN
Assignees
- 重庆小易智联智能技术有限公司 (Chongqing Xiaoyi Zhilian Intelligent Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-22
Claims (10)
- 1. A multi-modal entity recognition method integrating video and voice, characterized by comprising the following steps: performing visual analysis on the input video stream to generate a visual candidate entity set containing spatio-temporal information; performing speech recognition and text analysis on the input audio stream to generate an auditory candidate entity set containing time information; extracting feature sequences of the visual candidate entities and the auditory candidate entities, establishing a fine-grained temporal correspondence between visual features and acoustic features through a learnable temporal alignment network, and generating a cross-modally aligned joint feature representation; taking the visual candidate entities and the auditory candidate entities as nodes, and constructing, based on the joint feature representation, a cross-modal heterogeneous graph containing multiple types of semantic relation edges; inputting the cross-modal heterogeneous graph into a cross-modal graph neural network for message propagation and node feature updating, and calculating, based on the updated node features, the cross-modal alignment confidence that any two candidate entities refer to the same real entity; and fusing, according to the cross-modal alignment confidence, the multi-modal candidate entities that refer to the same real entity, and classifying and outputting the category, spatial position, time interval and cross-modal confidence of each recognized entity (illustrative data structures for the candidate entities are sketched after the claims).
- 2. The method of claim 1, wherein performing visual analysis on the input video stream to generate a visual candidate entity set containing spatio-temporal information comprises: performing target detection on the video frames with a deep-learning detection model such as YOLO or Faster R-CNN, and taking the detected target boxes and the tracking tracks obtained by the SORT or DeepSORT algorithm as visual candidate entities; and/or performing optical character recognition on the video frames with a CRNN or a Transformer-based model, and taking the recognized text regions and their appearance periods as visual candidate entities; and/or performing face detection with an MTCNN or RetinaFace model, performing cluster association using face features, and taking the detected face regions and their appearance periods as visual candidate entities (a detection sketch follows the claims).
- 3. The method of claim 1, wherein performing speech recognition and text analysis on the input audio stream to generate an auditory candidate entity set containing time information comprises: recognizing the audio stream with an end-to-end speech recognition model such as Conformer or Wav2Vec 2.0, and outputting a text sequence with word-level or character-level timestamps; performing entity recognition on the text sequence with a named entity recognition model based on BERT or RoBERTa; and associating each recognized named entity with the time interval in which it occurs in the audio stream, so as to constitute an auditory candidate entity (an NER sketch follows the claims).
- 4. The method for multi-modal entity recognition fusing video and speech according to claim 1, wherein the learnable temporal alignment network specifically performs the following operations: encoding the visual candidate entity sequence and the auditory candidate entity sequence into feature vector sequences V and A through a visual encoder and an acoustic encoder respectively; calculating, with a differentiable dynamic time warping algorithm, the optimal soft alignment path and the alignment weight matrix between the visual feature sequence V and the acoustic feature sequence A; and, based on the alignment weight matrix, performing temporal attention fusion of the visual feature sequence and the acoustic feature sequence to generate a cross-modally aligned joint feature vector at each time step (a fusion-module sketch follows the claims).
- 5. The method for multi-modal entity recognition merging video and speech according to claim 4, wherein the differentiable dynamic time warping algorithm is implemented by Soft-DTW, and an accumulated distance matrix R is calculated recursively, wherein: a local distance matrix D is defined whose element d(t, s) is the distance between the t-th time step of the visual feature sequence V and the s-th time step of the acoustic feature sequence A, t and s respectively denoting index positions in the two sequences; the recursion is initialized with r(0, 0) = 0 and r(t, 0) = r(0, s) = +∞; for t = 1 to T and s = 1 to S, the accumulated distance r(t, s), representing the smallest accumulated distance over all possible paths from the sequence start (1, 1) to the current point (t, s), is calculated recursively using a differentiable softmin operation: r(t, s) = d(t, s) + softmin_γ(r(t−1, s), r(t, s−1), r(t−1, s−1)), with softmin_γ(a_1, …, a_n) = −γ · log Σ_i exp(−a_i / γ), wherein γ is a smoothing parameter, exp is the exponential function, and a_1, …, a_n are the input arguments; based on the calculated accumulated distances, the accumulated distance matrix R is constructed, and the optimal soft alignment path aligning the whole visual sequence with the whole acoustic sequence is determined from R; the gradient of the alignment objective r(T, S) with respect to each element d(t, s) of the local distance matrix D is computed by automatic differentiation or a back-propagation algorithm to construct a gradient matrix E; and the gradient matrix E is normalized to obtain the final alignment weight matrix (a numerical Soft-DTW sketch follows the claims).
- 6. The method for identifying multi-modal entities fusing video and voice according to claim 1, wherein, when constructing the cross-modal heterogeneous graph, the multiple types of semantic relation edges at least comprise the following types: a temporal co-occurrence edge, connecting any two candidate entity nodes whose overlap over their time intervals exceeds a first threshold; a spatial adjacency edge, connecting visual candidate entity nodes whose spatial intersection-over-union exceeds a second threshold; a semantic similarity edge, connecting any two nodes whose node feature vectors have a cosine similarity exceeding a third threshold; and a coreference link edge, connecting a pronoun node with the entity candidate nodes it may refer to, based on pronoun resolution analysis of the audio text (an edge-construction sketch follows the claims).
- 7. The method for multi-modal entity recognition fusing video and speech according to claim 6, wherein the weight of a coreference link edge is calculated by: performing dependency syntactic analysis on the speech recognition text and identifying a pronoun p and its context; calculating a comprehensive matching score S(p, e_j) between the pronoun p and each candidate entity node e_j: S(p, e_j) = w_1·O_time(p, e_j) + w_2·Sim_sem(p, e_j) + w_3·Sal_vis(e_j), wherein w_1, w_2 and w_3 are learnable or preset weight coefficients, O_time is the degree of temporal overlap, Sim_sem is the semantic similarity, and Sal_vis is the visual saliency score of e_j at the corresponding time; and normalizing S(p, e_j) with a sigmoid function to serve as the initial weight of the coreference link edge (a scoring sketch follows the claims).
- 8. The method for identifying multi-modal entities fusing video and voice according to claim 1, wherein the cross-modal graph neural network adopts a hierarchical message passing mechanism, and the node feature updating process is as follows: for each node i in the heterogeneous graph, the update at the l-th layer comprises: intra-modal aggregation, i.e., aggregating messages from same-modality neighbors N_intra(i): m_intra^(l) = AGGREGATE_intra({h_j^(l) : j ∈ N_intra(i)}); cross-modal aggregation, i.e., aggregating messages from neighbors N_cross(i) of different modalities: m_cross^(l) = AGGREGATE_cross({h_j^(l) : j ∈ N_cross(i)}); and node update, i.e., updating the node representation by combining the node's own feature with the aggregated information: h_i^(l+1) = UPDATE(h_i^(l), m_intra^(l), m_cross^(l)); wherein h_j^(l) denotes the feature vector output by the j-th neighbor node at the l-th layer, AGGREGATE_intra denotes the intra-modal aggregation function, AGGREGATE_cross denotes the cross-modal aggregation function, the AGGREGATE functions are mean pooling, attention pooling or a graph attention network, and the UPDATE function is a gated recurrent unit or a fully connected layer (a message-passing sketch follows the claims).
- 9. The method for identifying multi-modal entities fusing video and speech according to claim 8, wherein calculating the cross-modal alignment confidence is specifically as follows: for any two nodes i and j with final-layer features h_i and h_j, their cross-modal alignment confidence c_ij is calculated by a bilinear function: c_ij = σ(h_i^T · W · h_j + b), wherein W is a learnable parameter matrix, b is a bias term, and σ is the sigmoid activation function (a bilinear-scoring sketch follows the claims).
- 10. The method according to claim 9, characterized in that the entity fusion process is specifically: candidate entity node pairs whose cross-modal alignment confidence c_ij exceeds a preset fusion threshold are judged to refer to the same real entity; for the set C of all nodes determined to be the same entity, its fused entity feature f_C and spatial position p_C are calculated by weighted average: f_C = Σ_k α_k·f_k / Σ_k α_k, p_C = Σ_k α_k·p_k / Σ_k α_k, wherein α_k is the modality confidence or alignment confidence of the k-th node in C, f_k is the feature vector of the k-th node in C, and p_k is its spatial position (a fusion sketch follows the claims).
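As a concrete reference for claim 1, here is a minimal sketch of how the visual and auditory candidate entities could be represented. The field names (modality, category, time_span, bbox, feature) are illustrative assumptions, not the patent's own notation.

```python
# Illustrative sketch (not the patent's notation): a candidate entity record
# carrying the category, spatial position, time interval and confidence that
# claim 1 says are finally output, plus the feature used for alignment.
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class CandidateEntity:
    """One visual or auditory candidate entity with its spatio-temporal info."""
    modality: str                       # "visual" or "auditory"
    category: str                       # e.g. "person", "text", "face", "PER", "LOC"
    time_span: Tuple[float, float]      # (start, end) in seconds
    bbox: Optional[Tuple[float, float, float, float]] = None   # visual entities only
    feature: Optional[np.ndarray] = None                        # encoder feature vector
    confidence: float = 1.0


# Example: a detected face and a spoken mention that may refer to the same person.
face = CandidateEntity("visual", "face", (12.0, 15.5), bbox=(0.2, 0.1, 0.4, 0.5))
mention = CandidateEntity("auditory", "PER", (13.1, 13.6))
```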
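For the detection branch of claim 2, a minimal per-frame sketch using the torchvision Faster R-CNN model as a stand-in detector; tracking (SORT/DeepSORT), OCR, and face detection are omitted, and the 0.5 score threshold is an assumption.

```python
# Minimal per-frame detection sketch.  fasterrcnn_resnet50_fpn is used as the
# stand-in detector; the 0.5 score threshold is an assumption.
import torch
import torchvision


def detect_frame_entities(frame: torch.Tensor, t: float, model, score_thr: float = 0.5):
    """frame: float tensor [3, H, W] scaled to [0, 1]; t: frame timestamp in seconds."""
    model.eval()
    with torch.no_grad():
        out = model([frame])[0]          # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"] >= score_thr
    return [
        {"bbox": box.tolist(), "label": int(lbl), "time": t, "score": float(s)}
        for box, lbl, s in zip(out["boxes"][keep], out["labels"][keep], out["scores"][keep])
    ]


if __name__ == "__main__":
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    print(detect_frame_entities(torch.rand(3, 480, 640), t=0.04, model=detector))
```

In a full pipeline, each kept detection would then be linked across frames by the tracker to form the tracking track that the claim treats as the visual candidate entity.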
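For claim 3, a sketch of the auditory branch that assumes a word-level timestamped transcript has already been produced upstream (e.g. by Conformer or Wav2Vec 2.0): it runs a BERT-style NER model and attaches each entity to the time interval of the words it covers. The checkpoint name and the example transcript are assumptions.

```python
# Sketch of claim 3's auditory branch over an assumed word-level transcript.
from transformers import pipeline

# Word-level ASR output as (word, start_sec, end_sec), produced upstream.
words = [("Alice", 13.1, 13.4), ("visited", 13.4, 13.8), ("Chongqing", 13.9, 14.5)]

# Rebuild the text and remember each word's character span.
text, spans = "", []
for w, s, e in words:
    if text:
        text += " "
    spans.append((len(text), len(text) + len(w), s, e))
    text += w

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

auditory_entities = []
for ent in ner(text):
    # Time interval = union of the intervals of the words the entity overlaps.
    times = [(s, e) for c0, c1, s, e in spans if c0 < ent["end"] and c1 > ent["start"]]
    auditory_entities.append({
        "text": ent["word"],
        "category": ent["entity_group"],
        "time_span": (min(t for t, _ in times), max(t for _, t in times)),
        "score": float(ent["score"]),
    })

print(auditory_entities)   # e.g. a PER entity around 13.1-13.4 s and a LOC entity around 13.9-14.5 s
```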
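For claim 4, a minimal PyTorch sketch of the temporal alignment and fusion module: two recurrent encoders produce the visual and acoustic feature sequences, and an externally supplied alignment weight matrix (for example, from the Soft-DTW sketch below) re-weights the acoustic sequence so that every visual time step receives an aligned joint feature. The GRU encoders, layer sizes, and concatenation-based fusion are assumptions.

```python
# Sketch of the alignment-and-fusion module of claim 4 (sizes and the
# concatenation-based fusion are assumptions).
import torch
import torch.nn as nn


class TemporalAlignmentFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.visual_enc = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.audio_enc = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis, aud, align_w):
        """vis: [B, T, d]; aud: [B, S, d]; align_w: [B, T, S], rows sum to 1."""
        v, _ = self.visual_enc(vis)
        a, _ = self.audio_enc(aud)
        a_aligned = torch.bmm(align_w, a)                      # weighted acoustic context per visual step
        return self.fuse(torch.cat([v, a_aligned], dim=-1))    # joint feature at each time step


fusion = TemporalAlignmentFusion(dim=256)
vis, aud = torch.randn(1, 40, 256), torch.randn(1, 60, 256)
w = torch.softmax(torch.randn(1, 40, 60), dim=-1)              # stand-in alignment weights (see next sketch)
print(fusion(vis, aud, w).shape)                               # torch.Size([1, 40, 256])
```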
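For claim 5, a small numerical sketch of the Soft-DTW recursion: the cumulative distance r(t, s) is built with a differentiable softmin, and the alignment weight matrix is recovered as the gradient of r(T, S) with respect to the local distance matrix, here obtained via autograd rather than an explicit backward recursion. The smoothing value gamma = 0.1 and the Euclidean local distance are assumptions.

```python
# Soft-DTW sketch: forward recursion with a differentiable softmin; alignment
# weights are the gradient of r(T, S) with respect to the local distances.
import torch


def softmin(a, b, c, gamma):
    x = torch.stack([a, b, c])
    return -gamma * torch.logsumexp(-x / gamma, dim=0)


def soft_dtw_alignment(V, A, gamma=0.1):
    """V: [T, d], A: [S, d]; returns (soft-DTW cost, alignment weight matrix [T, S])."""
    D = torch.cdist(V, A).detach().requires_grad_(True)        # local distances d(t, s)
    T, S = D.shape
    inf = torch.tensor(float("inf"))
    R = [[None] * (S + 1) for _ in range(T + 1)]
    for t in range(T + 1):
        for s in range(S + 1):
            if t == 0 and s == 0:
                R[t][s] = torch.tensor(0.0)                    # r(0, 0) = 0
            elif t == 0 or s == 0:
                R[t][s] = inf                                  # border set to +inf
            else:
                R[t][s] = D[t - 1, s - 1] + softmin(R[t - 1][s], R[t][s - 1], R[t - 1][s - 1], gamma)
    total = R[T][S]
    (weights,) = torch.autograd.grad(total, D)                 # expected alignment, entries in [0, 1]
    return total.detach(), weights


V, A = torch.randn(8, 16), torch.randn(11, 16)
cost, W = soft_dtw_alignment(V, A)
print(cost.item(), W.shape)                                    # scalar cost, weights of shape [8, 11]
```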
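For claim 6, a sketch of the edge construction over candidate-entity nodes: one edge type is added per relation whenever the corresponding temporal overlap, spatial IoU, or cosine similarity exceeds its threshold. Entities are assumed to be dicts carrying modality, time_span, bbox, and feature; the threshold values are assumptions, and coreference edges (claim 7) are added separately.

```python
# Sketch of claim 6's edge construction; entities are dicts with the keys
# shown below, and the thresholds are assumptions.
import itertools

import networkx as nx
import numpy as np


def interval_overlap(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1, ix2, iy2 = max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def build_heterogeneous_graph(entities, t_thr=0.3, s_thr=0.5, sim_thr=0.7):
    g = nx.MultiGraph()
    for i, e in enumerate(entities):
        g.add_node(i, **e)
    for i, j in itertools.combinations(range(len(entities)), 2):
        a, b = entities[i], entities[j]
        if interval_overlap(a["time_span"], b["time_span"]) > t_thr:
            g.add_edge(i, j, type="temporal_cooccurrence")
        if a["modality"] == b["modality"] == "visual" and box_iou(a["bbox"], b["bbox"]) > s_thr:
            g.add_edge(i, j, type="spatial_adjacency")
        if a.get("feature") is not None and b.get("feature") is not None \
                and cosine(a["feature"], b["feature"]) > sim_thr:
            g.add_edge(i, j, type="semantic_similarity")
    # Coreference link edges (pronoun -> candidate) are added separately, per claim 7.
    return g
```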
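For claim 7, a small sketch of the coreference edge weight as a weighted sum of temporal overlap, semantic similarity, and visual saliency passed through a sigmoid; the weights (0.4, 0.4, 0.2) are placeholder values for the learnable or preset coefficients w_1 to w_3.

```python
# Sketch of the claim 7 edge weight; w = (0.4, 0.4, 0.2) stands in for the
# learnable or preset coefficients w_1..w_3.
import math


def coref_edge_weight(time_overlap, sem_sim, vis_saliency, w=(0.4, 0.4, 0.2)):
    score = w[0] * time_overlap + w[1] * sem_sim + w[2] * vis_saliency
    return 1.0 / (1.0 + math.exp(-score))     # sigmoid-normalized initial edge weight


print(coref_edge_weight(0.8, 0.65, 0.3))      # roughly 0.66 for this example
```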
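For claim 8, a PyTorch sketch of one layer of hierarchical message passing: mean-pooling aggregation over same-modality and cross-modality neighbors, followed by a GRU-cell update that combines the node's own feature with both messages. Mean pooling and the GRU cell are two of the options the claim lists; the dense adjacency masks and the feature size are assumptions.

```python
# One layer of the hierarchical message passing of claim 8 (mean pooling +
# GRU-cell update; dense 0/1 adjacency masks are an assumption).
import torch
import torch.nn as nn


class HierarchicalMPLayer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.update = nn.GRUCell(input_size=2 * dim, hidden_size=dim)

    def forward(self, h, intra_adj, cross_adj):
        """h: [N, d] node features; intra_adj / cross_adj: [N, N] 0/1 neighbor masks."""
        def mean_agg(adj):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
            return adj @ h / deg                       # mean of the neighbors' features
        m_intra = mean_agg(intra_adj)                  # same-modality message
        m_cross = mean_agg(cross_adj)                  # cross-modality message
        return self.update(torch.cat([m_intra, m_cross], dim=-1), h)   # h^(l+1)


layer = HierarchicalMPLayer(dim=256)
h = torch.randn(5, 256)
intra = torch.eye(5).roll(1, dims=0)                   # toy same-modality neighbors
cross = torch.eye(5).roll(2, dims=0)                   # toy cross-modality neighbors
print(layer(h, intra, cross).shape)                    # torch.Size([5, 256])
```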
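For claim 9, a sketch of the bilinear alignment confidence c_ij = sigmoid(h_i^T W h_j + b), with a learnable matrix W and bias b; the initialization scale is an assumption.

```python
# Bilinear alignment confidence of claim 9: c_ij = sigmoid(h_i^T W h_j + b).
import torch
import torch.nn as nn


class AlignmentConfidence(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)    # learnable parameter matrix
        self.b = nn.Parameter(torch.zeros(1))                  # bias term

    def forward(self, h_i, h_j):
        """h_i, h_j: [N, d] final-layer node features for N node pairs."""
        return torch.sigmoid((h_i @ self.W * h_j).sum(-1) + self.b)


conf = AlignmentConfidence(256)
print(conf(torch.randn(4, 256), torch.randn(4, 256)))          # four confidences in (0, 1)
```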
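For claim 10, a sketch of the fusion step: node pairs whose alignment confidence exceeds a threshold are grouped (here with a simple union-find, an assumption about how transitive groups are formed), and each group's feature and spatial position are the confidence-weighted averages of its members.

```python
# Fusion sketch for claim 10: group nodes by thresholded pairwise confidence
# (union-find) and average each group's feature / position with the per-node
# confidences as weights.
import numpy as np


def fuse_entities(features, positions, confidences, pairwise_conf, thr=0.8):
    """features: [N, d]; positions: [N, 4]; confidences: [N]; pairwise_conf: [N, N]."""
    n = len(features)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_conf[i, j] > thr:
                parent[find(i)] = find(j)              # judged to be the same real entity

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    fused = []
    for members in groups.values():
        w = confidences[members] / confidences[members].sum()       # alpha_k, normalized
        fused.append({
            "feature": (w[:, None] * features[members]).sum(0),     # f_C
            "position": (w[:, None] * positions[members]).sum(0),   # p_C
            "members": members,
        })
    return fused


feats = np.random.rand(3, 8)
pos = np.array([[0.1, 0.1, 0.3, 0.4]] * 3, dtype=float)
conf = np.array([0.9, 0.8, 0.6])
pc = np.array([[0.0, 0.95, 0.1], [0.95, 0.0, 0.2], [0.1, 0.2, 0.0]])
print(len(fuse_entities(feats, pos, conf, pc)))        # 2 fused entities: {0, 1} and {2}
```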
Description
Multi-mode entity identification method integrating video and voice

Technical Field

The invention belongs to the technical field of multi-modal information processing and artificial intelligence, and particularly relates to a multi-modal entity identification method integrating video and voice.

Background

With the rapid development of artificial intelligence and multimedia processing technology, video and voice, as two core modalities of information transmission, are increasingly widely applied in fields such as security monitoring, intelligent interaction, content analysis, and intelligent medical treatment. How to effectively identify entities (such as people, objects, places, and events) from video and speech, and to fuse and align the cross-modal information, has become a key challenge in multi-modal intelligent understanding.

Currently, entity recognition techniques have made significant progress within single modalities. On the visual side, deep-learning-based object detection (e.g., YOLO, Faster R-CNN), face recognition, and Optical Character Recognition (OCR) can extract entities and their spatio-temporal information from video frames with high accuracy. On the auditory side, Automatic Speech Recognition (ASR) and Named Entity Recognition (NER) can recognize text entities and their timestamps from an audio stream. However, most existing methods are designed for a single modality and lack systematic modeling of the intrinsic correlations among modalities, which leads to the following problems in practical applications.

First, temporal alignment is difficult. Video and speech signals are often asynchronous during acquisition, transmission, and processing, and traditional multi-modal fusion methods (such as feature concatenation, early fusion, or late fusion) tend to ignore their correspondence on a fine-grained time axis, resulting in low matching accuracy for cross-modal entities.

Second, semantic association is missing. Visual entities in video and mentioned entities in speech exhibit not only temporal co-occurrence but also rich spatial, semantic, and coreference relations. Existing methods rely on manual rules or simple similarity calculations for association, making it difficult to model complex cross-modal semantic dependencies; performance is particularly limited in complex scenes with multiple people, multiple objects, and multiple references.

Third, information fusion is inadequate. Most existing systems adopt a two-stage strategy of independent recognition followed by fusion, i.e., entities are recognized from video and speech separately and then aligned in post-processing. This approach cannot exploit the context of the other modality during recognition, which easily leads to information loss and error propagation.

In addition, existing cross-modal learning methods often depend on large-scale annotated data and have fixed model structures, making them difficult to adapt to the practical requirements of different scenes, devices, and languages; their generalization and self-adaptation capabilities are therefore limited.
Therefore, a multi-modal entity recognition method is needed that achieves fine-grained alignment and deep fusion of the cross-modal semantic relations between video and speech, and that has good extensibility and self-adaptation capability, so as to improve the accuracy, consistency, and practicability of entity recognition in complex scenes.

Disclosure of Invention

In view of this, the invention provides a multi-modal entity recognition method fusing video and speech, which innovatively performs fine-grained dynamic alignment of video and speech at the temporal level through a learnable temporal alignment network, thereby solving the problem of difficult entity matching caused by temporal misalignment between modalities in traditional methods. By constructing a cross-modal heterogeneous graph and introducing a graph neural network, deep fusion of visual and auditory features over semantic and spatial relations is achieved, which remarkably improves the accuracy and robustness of cross-modal entity recognition.

In order to achieve the above purpose, the present invention provides the following technical solution: a multi-modal entity identification method integrating video and voice, comprising the following steps: performing visual analysis on the input video stream to generate a visual candidate entity set containing spatio-temporal information; performing speech recognition and text analysis on the input audio stream to generate an auditory candidate entity set containing time information; extracting feature sequences of visual candidate entities