CN-122020157-A - Method for generating multi-mode training data, training method and electronic equipment

CN122020157A

Abstract

Embodiments of this application provide a method for generating multi-modal training data, a training method, and an electronic device. The method includes: obtaining a synthesized dialogue in plain-text form; determining a virtual physiological data time series and a virtual video target-frame description adapted to the synthesized dialogue, to obtain a multi-modal synthetic data set; collecting multi-modal information from a real dialogue process to obtain a multi-modal real data set, which includes the real dialogue text data, the physiological data time series, and the video target-frame description corresponding to the real dialogue process; and obtaining multi-modal training data from the multi-modal real data set and the multi-modal synthetic data set. With the embodiments of this application, multi-modal training data can be provided, and synthesized dialogues and real dialogues can be unified into a single annotation format, so that the fit between the training data and the model being trained is improved, and thereby the training effect on the model is improved.
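
The "unified annotation format" mentioned in the abstract can be pictured as a single record type shared by synthesized and real dialogues. The Python sketch below is illustrative only; the field names (`text`, `physio_series`, `frame_descriptions`, `is_synthetic`) and the `merge_datasets` helper are assumptions, not terminology from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultimodalSample:
    """One dialogue annotated in a format shared by synthetic and real data."""
    text: str                      # dialogue text (transcribed or synthesized)
    physio_series: List[float]     # physiological index over time (virtual or measured)
    frame_descriptions: List[str]  # textual descriptions of video target frames
    is_synthetic: bool = False     # True for synthesized dialogues

def merge_datasets(real: List[MultimodalSample],
                   synthetic: List[MultimodalSample]) -> List[MultimodalSample]:
    """Combine real and synthetic samples into one training set."""
    return real + synthetic
```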

Inventors

  • MA XIAOFENG
  • MA BIN
  • LIU ZHIPAN
  • YANG JINING
  • LI PAN
  • GUO MENG
  • SU YUEQI

Assignees

  • 数字宁夏建设运营有限责任公司 (Digital Ningxia Construction and Operation Co., Ltd.)
  • 北京大数据研究院 (Beijing Institute of Big Data Research)

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A method of generating multi-modal training data, the method comprising: obtaining a synthesized dialogue of a plain-text type; determining a virtual physiological data time series and a virtual video target-frame description adapted to the synthesized dialogue, to obtain a multi-modal synthetic data set, wherein the virtual physiological data time series characterizes a sequence of changes in a physiological index matched to the synthesized dialogue, and the virtual video target-frame description is a textual description of visual information at at least one target moment, obtained based on the emotional state of the synthesized dialogue; collecting multi-modal information from a real dialogue process to obtain a multi-modal real data set, wherein the multi-modal real data set comprises real dialogue text data, a physiological data time series corresponding to the real dialogue process, and a video target-frame description; and obtaining multi-modal training data from the multi-modal real data set and the multi-modal synthetic data set.
  2. The method of claim 1, wherein obtaining the multi-modal training data from the multi-modal real data set and the multi-modal synthetic data set comprises: aligning the multi-modal data included in each dialogue in the multi-modal real data set or the multi-modal synthetic data set to obtain an aligned multi-modal data stream; and slicing and encapsulating the aligned multi-modal data stream according to a set duration to obtain the multi-modal training data.
  3. The method of claim 2, wherein the multi-modal real data set and the multi-modal synthetic data set each include text-modality data and non-text-modality data; and aligning the multi-modal data included in each dialogue in the multi-modal real data set or the multi-modal synthetic data set to obtain an aligned multi-modal data stream comprises: converting the text-modality data into a time series, using the time axis corresponding to the text modality as the reference time axis, to obtain a reference sequence; extracting a feature sequence from the non-text-modality data to obtain at least one sequence to be aligned; and aligning each sequence to be aligned with the reference sequence to obtain the aligned multi-modal data stream.
  4. The method of claim 3, wherein the non-text-modality data includes video-modality data and audio-modality data; and extracting a feature sequence from the non-text-modality data to obtain at least one sequence to be aligned comprises: extracting features from each frame of the video-modality data and ordering the extraction results in time to obtain a video feature sequence; and extracting the audio energy of the audio-modality data over a first duration and ordering the extraction results in time to obtain an audio feature sequence, wherein the first duration is a preset duration.
  5. The method of any of claims 3-4, wherein aligning each sequence to be aligned with the reference sequence to obtain the aligned multi-modal data stream comprises: computing a target distance path between each sequence to be aligned and the reference sequence; and warping the time axis of each non-reference modality according to the target distance path to achieve alignment, wherein the non-reference modalities include the video modality corresponding to the video-modality data, the physiological modality corresponding to the physiological data time series, and the audio modality corresponding to the audio-modality data (see the alignment sketch following the claims).
  6. The method of claim 2, wherein slicing and encapsulating the aligned multi-modal data stream according to a set duration to obtain the multi-modal training data comprises: determining the identifier and time-interval information of the data block corresponding to each set duration; establishing a correspondence between the multi-modal data subset corresponding to each piece of time-interval information and the corresponding data-block identifier, wherein the multi-modal data subsets are obtained by segmenting the aligned multi-modal data stream according to the set duration and include an audio data subset, a video data subset, a physiological data subset, and a text data subset, and each data subset corresponds to one set duration; and obtaining the multi-modal training data at least according to the correspondence (see the slicing sketch following the claims).
  7. The method of claim 6, wherein the audio data subset comprises audio data and a transcription of the audio data, the video data subset comprises a video target frame and an emotion label corresponding to the video target frame, the physiological data subset comprises a physiological data time series, and the text data subset comprises transcribed text of a real consultation dialogue and/or text of a synthesized dialogue.
  8. The method of any of claims 6-7, wherein obtaining the multi-modal training data at least according to the correspondence comprises: extracting multi-modal features from each data block, wherein the multi-modal features include text semantic features, audio emotion features, video facial features, and physiological data statistical features; and organizing the multi-modal features by association according to preset modules of a consultation report, to obtain the multi-modal training data.
  9. A method of model training, the method comprising: obtaining multi-modal training data according to the method of any one of claims 1-8; and training a consultation report generation model on the multi-modal training data.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-9 when executing the program.
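
Claims 3-5 describe aligning non-text feature sequences against a reference sequence derived from the text modality by computing a "target distance path" and warping the non-reference time axes, which is the behavior of dynamic time warping (DTW). The sketch below is a minimal, assumed illustration of that idea, not the patented procedure; it also includes the window-based audio-energy feature of claim 4. All function names are hypothetical.

```python
import numpy as np

def audio_energy_sequence(samples: np.ndarray, window: int) -> np.ndarray:
    """Audio energy over fixed-length windows (claim 4's preset 'first duration')."""
    n = len(samples) // window
    return np.array([np.sum(samples[i * window:(i + 1) * window] ** 2) for i in range(n)])

def dtw_path(ref: np.ndarray, seq: np.ndarray):
    """Minimal-cost warping path between a reference sequence and a sequence to be aligned."""
    n, m = len(ref), len(seq)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - seq[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end of the cost matrix to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_reference(ref: np.ndarray, seq: np.ndarray) -> np.ndarray:
    """Warp seq onto the reference time axis: one value per reference step."""
    path = dtw_path(ref, seq)
    warped = np.zeros(len(ref))
    for i, j in path:
        warped[i] = seq[j]   # when several source steps map to one reference step, keep the last
    return warped
```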
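
Claims 2 and 6-7 then slice the aligned stream into blocks of a set duration and key each block's per-modality subsets to a data-block identifier and time interval. The following sketch of that bookkeeping uses assumed names (`AlignedStream`, `slice_into_blocks`) and assumes one sample per second on the shared time axis; it is not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class AlignedStream:
    """Aligned multi-modal data sampled on a shared time axis (one entry per second, assumed)."""
    text: List[str]
    audio: List[float]
    video: List[str]       # e.g. target-frame descriptions or frame references
    physio: List[float]

def slice_into_blocks(stream: AlignedStream, block_seconds: int
                      ) -> Dict[str, Tuple[Tuple[int, int], Dict[str, list]]]:
    """Map each block identifier to its time interval and its per-modality data subsets."""
    blocks: Dict[str, Tuple[Tuple[int, int], Dict[str, list]]] = {}
    total = len(stream.text)
    for start in range(0, total, block_seconds):
        end = min(start + block_seconds, total)
        block_id = f"block_{start:06d}"
        subsets = {
            "text": stream.text[start:end],
            "audio": stream.audio[start:end],
            "video": stream.video[start:end],
            "physio": stream.physio[start:end],
        }
        blocks[block_id] = ((start, end), subsets)
    return blocks
```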

Description

Method for generating multi-modal training data, training method and electronic device

Technical Field

This application relates to the field of psychological consultation, and in particular to a method for generating multi-modal training data, a training method, and an electronic device.

Background

As large multi-modal models (LMMs) are applied more deeply in the mental-health field, automatically generating professional psychological consultation reports has become an important way to improve consultation efficiency. However, the performance of these models is largely determined by their training data. Training data provided by related technologies has at least the following defect: when single-modality training data is used, the model has fewer learnable features, which affects the quality of the psychological consultation report the model outputs.

Disclosure of Invention

Embodiments of this application aim to provide a method for generating multi-modal training data, a training method, and an electronic device. With these embodiments, multi-modal training data can be provided, and synthesized dialogues and real dialogues can be unified into a single annotation format, improving the fit between the training data and the model being trained and thereby the training effect on the model.

In a first aspect, an embodiment of this application provides a method for generating multi-modal training data. The method includes: obtaining a synthesized dialogue of a plain-text type; determining a virtual physiological data time series and a virtual video target-frame description adapted to the synthesized dialogue, to obtain a multi-modal synthetic data set, where the virtual physiological data time series characterizes a sequence of changes in a physiological index matched to the synthesized dialogue, and the virtual video target-frame description is a textual description of visual information at at least one target moment obtained from the emotional state of the synthesized dialogue; collecting multi-modal information from a real dialogue process to obtain a multi-modal real data set, which includes real dialogue text data, a physiological data time series corresponding to the real dialogue process, and a video target-frame description; and obtaining multi-modal training data from the multi-modal real data set and the multi-modal synthetic data set. By providing matched information from other modalities for the synthesized dialogue, the embodiments of this application significantly increase the amount of information carried by the resulting training data, and thus the quality of a model trained on the multi-modal training data.

In some embodiments, obtaining the multi-modal training data from the multi-modal real data set and the multi-modal synthetic data set includes: aligning the multi-modal data included in each dialogue in the multi-modal real data set or the multi-modal synthetic data set to obtain an aligned multi-modal data stream; and slicing and encapsulating the aligned multi-modal data stream according to a set duration to obtain the multi-modal training data.
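
The first aspect pairs each synthesized dialogue with a virtual physiological data time series driven by the dialogue's emotional state. As a minimal sketch under assumed conventions (an emotion label per utterance, a heart-rate-like index, and an invented emotion-to-baseline mapping used purely for illustration), this could look like:

```python
import random
from typing import List, Tuple

# Assumed mapping from utterance emotion to a baseline physiological level (illustrative only).
EMOTION_BASELINE = {"calm": 70.0, "anxious": 95.0, "sad": 78.0, "angry": 100.0}

def virtual_physio_series(dialogue: List[Tuple[str, str]],
                          steps_per_utterance: int = 5) -> List[float]:
    """Generate a virtual physiological time series adapted to a synthesized dialogue.

    dialogue: list of (utterance_text, emotion_label) pairs.
    Each utterance contributes a few noisy samples around its emotion's baseline level.
    """
    series: List[float] = []
    for _text, emotion in dialogue:
        base = EMOTION_BASELINE.get(emotion, 75.0)
        for _ in range(steps_per_utterance):
            series.append(round(base + random.gauss(0.0, 2.0), 1))
    return series

# Example: a short synthesized exchange with emotion labels.
sample_dialogue = [("I haven't been sleeping well.", "anxious"),
                   ("Let's talk about what keeps you up.", "calm")]
print(virtual_physio_series(sample_dialogue))
```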
Some embodiments of this application further align the multi-modal data and then segment and encapsulate the aligned data, yielding more structured training data that can be invoked directly, which speeds up model training.

In some embodiments, the multi-modal real data set and the multi-modal synthetic data set each comprise text-modality data and non-text-modality data, and aligning the multi-modal data included in each dialogue in the multi-modal real data set or the multi-modal synthetic data set to obtain an aligned multi-modal data stream includes: converting the text-modality data into a time series, using the time axis corresponding to the text modality as the reference time axis, to obtain a reference sequence; extracting a feature sequence from the non-text-modality data to obtain at least one sequence to be aligned; and aligning each sequence to be aligned with the reference sequence to obtain the aligned multi-modal data stream. In some embodiments of this application, using the time axis of the text-modality data as a reference and aligning the feature sequences of the other modalities against the reference sequence improves the alignment of the multi-modal data.

In some embodiments, the non-text-modality data comprises video-modality data and audio-modality data, and extracting a feature sequence from the non-text-modality data to obtain at least one sequence to be aligned includes extracting features from each frame of the video-modality data and ordering the extraction results in time to obtain a video