CN-121980328-A - Human-machine collaborative labeling method and device based on multimodal fusion and active learning
Abstract
The invention discloses a human-machine collaborative labeling method and device based on multimodal fusion and active learning. The method comprises: obtaining a multimodal dataset to be labeled and performing multimodal feature decoupling and alignment to obtain multimodal samples; inputting an unlabeled multimodal sample into a current multimodal fusion model to obtain a corresponding sample fusion representation; determining a fusion uncertainty of the multimodal sample from the sample fusion representation and screening out candidate samples accordingly; performing cross-modal enhanced pre-labeling on the candidate samples to obtain corresponding pre-labeling results and displaying them through a human-machine interaction interface; in response to a labeling operation by an annotator, obtaining a corresponding multimodal label and performing a consistency check on it to obtain a compatibility probability; and, when the compatibility probability is smaller than a preset first threshold, issuing a labeling error reminder through the human-machine interaction interface. The method improves the efficiency and accuracy of multimodal data labeling and can be applied in the technical field of data processing.
Inventors
- Chang Jie
- Lin Daidi
- Yan Liming
- Bi Jiayu
- Chen Zhengwen
Assignees
- Tianyi IoT Technology Co., Ltd. (天翼物联科技有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-12
Claims (10)
- 1. A human-machine collaborative labeling method based on multimodal fusion and active learning, characterized by comprising the following steps: acquiring a multimodal dataset to be labeled, and performing multimodal feature decoupling and alignment on data samples in the multimodal dataset by using a pre-training model to obtain multimodal samples; inputting an unlabeled multimodal sample into a current multimodal fusion model to obtain a corresponding sample fusion representation, determining a fusion uncertainty of the multimodal sample according to the sample fusion representation, and screening out candidate samples according to the fusion uncertainty; performing cross-modal enhanced pre-labeling on the candidate samples to obtain corresponding pre-labeling results, and displaying the pre-labeling results through a human-machine interaction interface; in response to a labeling operation of an annotator, obtaining a corresponding multimodal label, and performing a consistency check on the multimodal label to obtain a compatibility probability of the multimodal label; when the compatibility probability is smaller than a preset first threshold, issuing a labeling error reminder through the human-machine interaction interface; and when the compatibility probability is greater than or equal to the first threshold, determining the candidate sample to be a labeled sample, and performing incremental training on the current multimodal fusion model with the labeled sample.
- 2. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein performing multimodal feature decoupling and alignment on the data samples in the multimodal dataset by using a pre-training model to obtain multimodal samples specifically comprises: determining modal dimensions of the multimodal dataset, the modal dimensions comprising at least two of an image dimension, a text dimension, and an audio dimension; extracting feature representations of the data samples in the different modalities through the pre-training model; and performing temporal alignment on the feature representations to obtain the multimodal sample corresponding to each data sample.
- 3. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein determining the fusion uncertainty of the multimodal sample according to the sample fusion representation and screening candidate samples according to the fusion uncertainty specifically comprises: calculating the variance of the sample fusion representation in the representation space based on a Monte Carlo method to obtain a fusion-representation uncertainty; determining an inter-modal uncertainty based on the standard deviation or variance of the uncertainty index of each modality; calculating a cosine-similarity entropy between the sample fusion representation and the fusion representations of the labeled samples, and determining a diversity factor as the negative of that entropy; performing a weighted summation of the fusion-representation uncertainty, the inter-modal uncertainty, and the diversity factor according to preset weight coefficients to obtain the fusion uncertainty; and selecting the multimodal samples with the largest fusion uncertainty as the candidate samples.
- 4. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein the cross-modal enhanced pre-labeling of the candidate samples specifically comprises: performing label prediction on the feature representations of the candidate sample in the different modalities through per-modality classification models to obtain initial pre-labels for the different modalities, the initial pre-labels comprising at least two of an image pre-label, a text pre-label, and an audio pre-label; combining the feature representations and the initial pre-labels across modalities, and calculating a similarity score between the feature representation and the initial pre-label in each combination; when the similarity scores of all combinations are greater than a preset similarity threshold, performing cross-modal fusion on the initial pre-labels to obtain the pre-labeling result; and when the similarity score of any combination is smaller than the similarity threshold, taking the initial pre-labels as the pre-labeling result and generating a corresponding inconsistency prompt.
- 5. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein performing the consistency check on the multimodal label to obtain the compatibility probability of the multimodal label specifically comprises: constructing a multimodal annotation sample from the annotation information of the different modalities of the multimodal label; and inputting the multimodal annotation sample into a pre-trained multimodal consistency evaluation model to obtain the compatibility probability of the multimodal label.
- 6. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein issuing the labeling error reminder through the human-machine interaction interface specifically comprises: determining a conflict description of the multimodal label; and generating a labeling error reminder text according to the conflict description, and displaying the labeling error reminder text on the human-machine interaction interface.
- 7. The human-machine collaborative labeling method based on multimodal fusion and active learning according to claim 1, wherein the incremental training of the current multimodal fusion model with the labeled sample specifically comprises: inputting the labeled sample into the current multimodal fusion model to obtain a model recognition result; determining a loss value according to the model recognition result and the multimodal label; and updating parameters of the current multimodal fusion model according to the loss value.
- 8. A human-machine collaborative labeling device based on multimodal fusion and active learning, characterized by comprising: a data preprocessing module for acquiring a multimodal dataset to be labeled, and performing multimodal feature decoupling and alignment on data samples in the multimodal dataset by using a pre-training model to obtain multimodal samples; a sample screening module for inputting unlabeled multimodal samples into a current multimodal fusion model to obtain corresponding sample fusion representations, determining fusion uncertainties of the multimodal samples according to the sample fusion representations, and screening candidate samples according to the fusion uncertainties; a pre-labeling module for performing cross-modal enhanced pre-labeling on the candidate samples to obtain corresponding pre-labeling results and displaying the pre-labeling results through a human-machine interaction interface; a consistency verification module for obtaining corresponding multimodal labels in response to labeling operations of annotators, and performing consistency checks on the multimodal labels to obtain compatibility probabilities of the multimodal labels; a reminder module for issuing a labeling error reminder through the human-machine interaction interface when the compatibility probability is smaller than a preset first threshold; and an incremental training module for determining the candidate sample to be a labeled sample when the compatibility probability is greater than or equal to the first threshold, and performing incremental training on the current multimodal fusion model with the labeled sample.
- 9. An electronic device, comprising a memory, a processor, a computer program stored in the memory and executable on the processor, and a data bus for enabling connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the human-machine collaborative labeling method based on multimodal fusion and active learning according to any one of claims 1 to 7.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the human-machine collaborative labeling method based on multimodal fusion and active learning according to any one of claims 1 to 7.
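The patent does not fix a concrete alignment procedure for claim 2. As an illustrative sketch in Python with NumPy (the function name, the shared-timeline grid, and the nearest-neighbour resampling strategy are all assumptions, not the patented method), per-modality feature sequences extracted by pre-trained encoders could be temporally aligned like this:

```python
import numpy as np

def align_modalities(features: dict, timestamps: dict, n_steps: int = 8) -> dict:
    """Align per-modality feature sequences onto a shared timeline by
    nearest-neighbour resampling (one plausible alignment strategy)."""
    # restrict to the time span covered by every modality
    t_min = max(ts[0] for ts in timestamps.values())
    t_max = min(ts[-1] for ts in timestamps.values())
    grid = np.linspace(t_min, t_max, n_steps)
    aligned = {}
    for mod, feats in features.items():
        ts = timestamps[mod]
        # index of the nearest original timestamp for each grid point
        idx = np.abs(ts[None, :] - grid[:, None]).argmin(axis=1)
        aligned[mod] = feats[idx]
    return aligned
```

After this step every modality contributes a feature sequence of identical length, which is what the downstream fusion model consumes.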
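The composite acquisition score of claim 3 can be sketched in NumPy. The weight coefficients, the use of Monte-Carlo (e.g. MC-dropout) forward passes, and the softmax normalisation of the cosine similarities below are illustrative assumptions; the claim only fixes the three ingredients and their weighted sum:

```python
import numpy as np

def fusion_uncertainty(mc_reps, modality_uncerts, labeled_reps, w=(0.5, 0.3, 0.2)):
    """mc_reps: (T, d) Monte-Carlo samples of one sample's fusion representation;
    modality_uncerts: per-modality uncertainty indices, e.g. [u_img, u_txt, u_aud];
    labeled_reps: (N, d) fusion representations of already-labeled samples."""
    # 1) fusion-representation uncertainty: variance over the MC passes
    u_fuse = mc_reps.var(axis=0).mean()
    # 2) inter-modal uncertainty: spread of the per-modality indices
    u_modal = np.std(modality_uncerts)
    # 3) diversity factor: negative entropy of cosine similarities to the labeled pool
    rep = mc_reps.mean(axis=0)
    sims = labeled_reps @ rep / (
        np.linalg.norm(labeled_reps, axis=1) * np.linalg.norm(rep) + 1e-12)
    p = np.exp(sims) / np.exp(sims).sum()        # normalise to a distribution
    entropy = -(p * np.log(p + 1e-12)).sum()     # cosine-similarity entropy
    diversity = -entropy                         # claim 3: its negative
    return float(w[0] * u_fuse + w[1] * u_modal + w[2] * diversity)

def select_candidates(scores, k):
    """Pick the k samples with the largest fusion uncertainty."""
    return np.argsort(scores)[::-1][:k]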
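One plausible reading of the claim-4 agreement test, sketched in NumPy: each modality's classifier emits a class-probability vector, pairwise cosine similarity between those vectors serves as the similarity score, and the 0.8 threshold is an arbitrary illustration:

```python
import numpy as np
from itertools import combinations

def cross_modal_prelabel(probs: dict, sim_threshold: float = 0.8):
    """probs: modality -> class-probability vector from that modality's classifier.
    Returns (pre_label, inconsistent_flag)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    if all(cos(a, b) > sim_threshold for a, b in combinations(probs.values(), 2)):
        # all pairwise scores exceed the threshold: cross-modal fusion
        fused = np.mean(list(probs.values()), axis=0)
        return int(fused.argmax()), False
    # otherwise keep per-modality pre-labels and raise an inconsistency prompt
    return {m: int(p.argmax()) for m, p in probs.items()}, True
```

When the flag is raised, the interface would display the per-modality pre-labels together with the inconsistency prompt rather than a fused label.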
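Claims 5 and 6 rely on a pre-trained multimodal consistency evaluation model, whose architecture the patent does not disclose. As a heavily hedged stand-in, the sketch below scores concatenated per-modality annotation vectors with a placeholder logistic layer (`W` and `b` stand for learned parameters) and applies the first-threshold rule:

```python
import numpy as np

def compatibility_probability(label_vecs: dict, W: np.ndarray, b: float) -> float:
    """Toy stand-in for the consistency evaluation model: concatenate the
    per-modality annotation vectors and score them with a logistic layer."""
    x = np.concatenate([label_vecs[m] for m in sorted(label_vecs)])
    return float(1.0 / (1.0 + np.exp(-(W @ x + b))))

def check_label(label_vecs, W, b, threshold=0.6):
    """Apply the preset first threshold: below it, raise the claim-6 reminder."""
    p = compatibility_probability(label_vecs, W, b)
    return ("error_reminder" if p < threshold else "accepted", p)
```

The 0.6 threshold is illustrative; in practice it would be tuned so that genuine cross-modal label conflicts fall below it.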
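The incremental-training step of claim 7 (forward pass, loss against the verified multimodal label, parameter update) can be sketched on a toy linear softmax head; the real model would be the full multimodal fusion network, and the learning rate here is an arbitrary illustration:

```python
import numpy as np

def incremental_step(W, x, y, lr=0.05):
    """One incremental-training step on a linear softmax head.
    W: (classes, dim) weights; x: (dim,) fusion representation of the
    labeled sample; y: verified label index. Returns (new_W, loss)."""
    logits = W @ x
    z = np.exp(logits - logits.max())       # numerically stable softmax
    probs = z / z.sum()
    loss = -np.log(probs[y] + 1e-12)        # cross-entropy against the label
    grad = np.outer(probs, x)               # dL/dW for softmax cross-entropy
    grad[y] -= x
    return W - lr * grad, float(loss)
```

Repeating this step as each verified sample arrives is what makes the training "incremental": the fusion model is refined continuously instead of being retrained from scratch.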
Description
Human-machine collaborative labeling method and device based on multimodal fusion and active learning

Technical Field
The invention relates to the technical field of data processing, and in particular to a human-machine collaborative labeling method and device based on multimodal fusion and active learning.

Background
With the wide application of deep learning in fields such as computer vision and natural language processing, the demand for large-scale, high-quality annotated data is increasingly urgent. Traditional manual labeling is costly and inefficient, and labeling consistency is difficult to ensure. Active learning (AL) prioritizes for labeling the samples that carry the most information, so as to obtain the best model performance at the least labeling cost. However, existing active learning research has focused on single-modality data. For the labeling of multimodal data (e.g., image-text or video-audio pairs), the following problems remain: 1) sample value is difficult to measure, because the uncertainty and information content of heterogeneous data from different modalities are hard to evaluate in a unified way; 2) inter-modal relations are underused, since existing methods typically take a simple weighted average of the per-modality uncertainties and cannot deeply mine the complementary and conflicting information between modalities to guide sample selection; 3) the labeling workflow is not optimized, as the labeling interface is decoupled from model prediction, so annotators receive no effective intelligent assistance from the model and there is still large room to improve labeling efficiency. These problems need to be solved.

Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems existing in the prior art.
Therefore, an object of the embodiments of the invention is to provide a human-machine collaborative labeling method based on multimodal fusion and active learning, which improves the efficiency and accuracy of multimodal data labeling. Another object of the embodiments of the invention is to provide a human-machine collaborative labeling device based on multimodal fusion and active learning. To achieve these technical aims, the technical scheme adopted by the embodiments of the invention comprises the following. On the one hand, an embodiment of the invention provides a human-machine collaborative labeling method based on multimodal fusion and active learning, which comprises the following steps: acquiring a multimodal dataset to be labeled, and performing multimodal feature decoupling and alignment on data samples in the multimodal dataset by using a pre-training model to obtain multimodal samples; inputting an unlabeled multimodal sample into a current multimodal fusion model to obtain a corresponding sample fusion representation, determining a fusion uncertainty of the multimodal sample according to the sample fusion representation, and screening out candidate samples according to the fusion uncertainty; performing cross-modal enhanced pre-labeling on the candidate samples to obtain corresponding pre-labeling results, and displaying the pre-labeling results through a human-machine interaction interface; in response to a labeling operation of an annotator, obtaining a corresponding multimodal label, and performing a consistency check on the multimodal label to obtain a compatibility probability of the multimodal label; when the compatibility probability is smaller than a preset first threshold, issuing a labeling error reminder through the human-machine interaction interface; and when the compatibility probability is greater than or equal to the first threshold, determining the candidate sample to be a labeled sample, and performing incremental training on the current multimodal fusion model with the labeled sample.

Further, in an embodiment of the present invention, performing multimodal feature decoupling and alignment on the data samples in the multimodal dataset by using a pre-training model to obtain multimodal samples specifically includes: determining modal dimensions of the multimodal dataset, the modal dimensions comprising at least two of an image dimension, a text dimension, and an audio dimension; extracting feature representations of the data samples in the different modalities through the pre-training model; and performing temporal alignment on the feature representations to obtain the multimodal sample corresponding to each data sample. Further, in one embodiment of the present invention, determining the fusion uncertainty of the multimodal sample according to the sample fusion representation and screening candidate samples according to the fusion uncertainty specifically includes: calculating the variance of the sample fusion representation in the representation