CN-114757188-B - Standardized medical text rewriting method based on a generative adversarial network
Abstract
The invention discloses a standardized medical text rewriting method based on a generative adversarial network. The method comprises: extracting spoken and standardized medical question-answer corpora to obtain a data set; constructing a standardized medical text generator and a spoken medical text generator using a Transformer model, and pre-training them with a user health term mapping table to obtain standardized medical text; constructing a standardized medical text discriminator and a spoken medical text discriminator using an LSTM neural network; optimizing the two discriminators, respectively, with loss functions that take medical text features into account; and optimizing the two generators by reinforcement learning. The method achieves bidirectional style transfer and rewriting between spoken text and standardized text, removes the heavy dependence of traditional text style transfer models on annotated corpora, keeps the model reliable when no parallel corpus is available, and reduces the workload of manual data annotation.
Inventors
- WANG ZUMIN
- XU CHANG
- JI CHANGQING
- QIN JING
Assignees
- Dalian University (大连大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20220520
Claims (2)
- 1. A method for rewriting normalized medical text based on a generative adversarial network, comprising:
  extracting and processing spoken and normalized medical question-answering corpora to obtain a data set;
  constructing a normalized medical text generator and a spoken medical text generator using a Transformer model, and pre-training them through a user health term mapping table to obtain normalized medical texts;
  constructing a normalized medical text discriminator and a spoken medical text discriminator using an LSTM neural network;
  optimizing the normalized medical text discriminator and the spoken medical text discriminator, respectively, using loss functions in combination with medical text features;
  optimizing the normalized medical text generator and the spoken medical text generator by reinforcement learning;
  wherein constructing the normalized medical text generator and the spoken medical text generator using a Transformer model specifically comprises: adopting a CycleGAN structure to construct the two generators, which have opposite generation directions and, once connected, form a closed loop and provide feedback information to each other; pre-training the two generators using maximum likelihood estimation, with the maximum length of a generated sentence set to 30 words, the word embedding dimension Embedding_size set to 512, the Encoder and the Decoder each set to a six-layer structure, and the user health term mapping table used as the generation vocabulary; and pre-training word vectors on the training sets divided from the spoken-style sample sentences and the normalized-style sample sentences to generate the initial Embedding value corresponding to each word;
  wherein optimizing the normalized medical text discriminator and the spoken medical text discriminator, respectively, using loss functions specifically comprises: with the generator held fixed, randomly sampling real samples and samples generated by the normalized medical text generator, and then minimizing the cross entropy, the loss function of the normalized medical text discriminator being defined in terms of the generative adversarial loss of the normalized medical text discriminator, a loss term coefficient, the sequence labeling loss, and a further loss term coefficient, both coefficients being less than 0.5; and, with the generator held fixed, randomly sampling real samples and samples generated by the spoken medical text generator, and then minimizing the cross entropy, the loss function of the spoken medical text discriminator being defined in terms of the generative adversarial loss of the spoken medical text discriminator, a loss term coefficient, the sequence labeling loss, and a further loss term coefficient, both coefficients being less than 0.5;
  wherein, in the course of optimizing the normalized medical text discriminator, the spoken medical text discriminator, the normalized medical text generator and the spoken medical text generator, the discriminators and generators, which have opposing objectives, compete against each other until a Nash equilibrium is reached;
  wherein spoken sentences in the data set are used as X-style samples, and sentences containing normalized terms are used as pseudo-parallel samples of the target style Y to be converted to; spoken sentences in the test set that can be mapped to terms are marked through the user health term mapping table and provided as a hidden layer to the normalized medical text generator;
  wherein the data set comprises n samples, where i denotes the i-th sample, n denotes the total number of samples, and x and y denote a spoken-style sample sentence and a normalized-style sample sentence, respectively; a spoken-style sample sentence is represented as a sequence of words, where t denotes the t-th word of the sentence and T denotes the sentence length, i.e., the number of words; and
  wherein, in order to associate the spoken-style sample sentences with the normalized-style sample sentences, after the medical entities in each sentence are identified through word segmentation, the unnormalized spoken-style sample sentences are marked in combination with the user health term mapping table, producing a marking sequence in which positions of the sample sentence that need to be normalized are marked 1 and positions that do not need to be normalized are marked 0.
- 2. The method for rewriting normalized medical text based on a generative adversarial network according to claim 1, wherein constructing the normalized medical text discriminator and the spoken medical text discriminator using an LSTM neural network specifically comprises: replacing the last hidden layer of the LSTM neural network with a binary logistic regression layer, which determines whether the input medical text comes from the real samples or is a sample generated by the normalized medical text generator; and applying a nonlinear transformation to the input high-dimensional medical text sequence to obtain the Embedding of each word in the sequence, then feeding the sequence into each basic cell and obtaining the output probability of each word in combination with the fully connected hidden layer.
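The 0/1 marking step described in claim 1 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name `mark_sentence`, the toy mapping table, and the example tokens are all hypothetical, and word segmentation and medical entity recognition are assumed to have been done beforehand.

```python
from typing import Dict, List


def mark_sentence(tokens: List[str], term_map: Dict[str, str]) -> List[int]:
    """Mark each token 1 if it appears in the mapping table (needs normalizing), else 0."""
    return [1 if token in term_map else 0 for token in tokens]


# Hypothetical mapping table: spoken term -> normalized term.
toy_term_map = {"tummy": "abdomen", "throwing up": "vomiting"}

# Assumes the sentence has already been segmented into words and entities.
tokens = ["my", "tummy", "hurts", "and", "I", "keep", "throwing up"]
marks = mark_sentence(tokens, toy_term_map)
print(marks)  # [0, 1, 0, 0, 0, 0, 1]
```

The marking sequence has the same length as the segmented sentence, so each mark lines up with the word position it describes.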
Description
Standardized medical text rewriting method based on a generative adversarial network
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a standardized medical text rewriting method based on a generative adversarial network.
Background
Text style transfer has long been a hot topic in the field of natural language generation. It refers to converting or generating text with another specific style or attribute while keeping the semantic content of the original text unchanged, and ensuring that the newly generated text is fluent and realistic. Text style transfer covers the transfer of writing style or sentiment, and can be applied to dialogue question-answering systems for chatbots, text rewriting, and the compliance checking or generation of professional copywriting and documents. Most existing text generation models are difficult to train, and their generated content contains grammatical errors or missing semantics; applying a text style transfer model can flexibly reduce this training difficulty. In recent years, the development of deep learning has allowed natural language processing to be widely applied to a variety of scenarios and complex tasks. In the medical field, online consultation is also gradually becoming widespread, and the many medical and health websites now available let patients perform self-diagnosis through questions and answers without leaving home. However, because they lack specialized medical knowledge, users often describe their illness unclearly and in colloquial language when using these platforms, so AI-assisted diagnosis has difficulty understanding the information they provide.
This barrier is not limited to machine reading comprehension; it is often bidirectional. Because of the patient's colloquial descriptions or the doctor's specialized terminology, communication between doctor and patient is hindered and online consultation is inefficient. Applying text style transfer to text rewriting and text normalization therefore offers a good way to address these problems.
Currently, text style transfer methods can generally be divided into two types: supervised learning and unsupervised learning. Supervised learning resembles machine translation and uses a parallel data set for style conversion, so the converted text has high precision and a good conversion effect. Most existing text style transfer models also adopt an end-to-end model similar to statistical machine translation, but annotated corpora for such models are lacking and manual annotation consumes a great deal of manpower and material resources, so research on text style transfer has shifted toward unsupervised learning. Compared with supervised style transfer models that resemble machine translation, unsupervised models can effectively separate the attributes of a text from its content, can be trained without a large amount of paired data, and can still produce satisfactory generated text. However, progress on unsupervised text style transfer is far slower than on image style transfer, because style transfer applied to text runs into the problem of text discreteness. Text discreteness causes a loss of fluency and content integrity during transfer, so the model generates low-quality text and generalizes poorly.
Secondly, model quality is difficult to evaluate: unlike image style discrimination, the definition of language style is fuzzy, which makes evaluating the model more challenging.
Disclosure of Invention
The invention aims to provide a standardized medical text rewriting method based on a generative adversarial network, which enables bidirectional conversion between patients' spoken descriptions of their condition and the professional standardized terms used by doctors and AI-assisted diagnosis. To achieve the above object, the present application provides a standardized medical text rewriting method based on a generative adversarial network, comprising: extracting and processing spoken and normalized medical question-answering corpora to obtain a data set; constructing a normalized medical text generator and a spoken medical text generator using a Transformer model, and pre-training them through a user health term mapping table to obtain normalized medical texts; constructing a normalized medical text discriminator D Φ1(Y) (Y) and a spoken medical text discriminator D Φ2(X) (X) using an LSTM neural network; optimizing the normalized medical text discriminator D Φ1(Y) (Y) and the spoken medical text discriminator D Φ2(X) (