CN-122020186-A - Nursing field text annotation corpus construction method for nursing teaching task
Abstract
The invention relates to the fields of artificial intelligence and medical information processing, and discloses a method for constructing a nursing field text annotation corpus for nursing teaching tasks. The method comprises the following steps: S1, acquiring original nursing text data and constructing a training data set; S2, cleaning the nursing text data and standardizing the entity annotations; S3, constructing an entity recognition model for the nursing field; S4, calculating a total loss function based on a main loss and an auxiliary loss; S5, training the entity recognition model with the cleaned and standardized training data set; and S6, automatically annotating original nursing text with the entity recognition model. By constructing a dynamic dictionary for the nursing field and introducing a fuzzy matching function that fuses edit-distance similarity and semantic similarity, the method maps spelling errors, term variants and the like in the original text to standard words, thereby achieving deep standardized cleaning of the nursing text and addressing the problem of data noise.
Inventors
- LIU JING
- WU DIANCHEN
- ZHAO LIXIN
- TAO RAN
Assignees
- 天津天堰科技股份有限公司
- 潍坊护理职业学院
Dates
- Publication Date
- 20260512
- Application Date
- 20260414
Claims (10)
- 1. A method for constructing a nursing field text annotation corpus for nursing teaching tasks, characterized by comprising the following steps: S1, acquiring original nursing text data, manually annotating the acquired data, and constructing a training data set; S2, performing standardized cleaning of the text by fuzzy matching based on a nursing dictionary, and performing consistency correction on the entity annotations by using a graph structure, thereby completing the standardization of the data; S3, constructing an entity recognition model for the nursing field, wherein the entity recognition model takes a normalized text sequence and a corrected label sequence as input, and uses fuzzy matching features, an entity co-occurrence graph and an annotation correction function to construct a deep neural network that fuses local context, global co-occurrence knowledge and annotation priors; S4, constructing a total loss function, wherein the total loss function comprises a main loss for ensuring the accuracy of entity boundary and type recognition, and an auxiliary loss for forcing the model's representations in the feature space to be consistent with the semantic relations provided by the correction function; S5, training the entity recognition model with the cleaned and standardized training data set; and S6, automatically annotating the original nursing text by using the entity recognition model.
- 2. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1, characterized in that the annotation adopts a BIO-based entity annotation scheme, wherein B denotes the beginning of an entity, I denotes the inside of an entity, and O denotes a non-entity.
- 3. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1, characterized in that, in step S2, the standardized cleaning of the text by fuzzy matching based on a nursing dictionary specifically comprises: constructing a dynamic dictionary for the nursing field, and introducing a fuzzy matching function based on edit distance and contextual semantics to map variant forms to standard entities; after the original text is segmented, applying the fuzzy matching function to each token to obtain a normalized text sequence $X' = (x'_1, \dots, x'_n)$: if the fuzzy matching function matches the token $x_i$ to a standard word $w^*$ with a similarity not lower than a threshold $\theta$, then $x'_i = w^*$; otherwise $x'_i = x_i$; wherein $x'_i$ denotes the text content of the $i$-th token in the normalized text sequence; $X$ denotes the segment to be normalized in the original nursing text and is a list of strings of length $n$; $n$ denotes the length of the sequence, equal to the number of tokens after segmentation; and $x'_n$ denotes the text content of the $n$-th token in the normalized text sequence.
- 4. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1 or 3, characterized in that, in S2, performing consistency correction on the entity annotations by using a graph structure to complete the standardization of the data specifically comprises: extracting all entity fragments from the original annotation sequence to obtain an entity fragment list $S$, wherein each fragment $s_i$ comprises the entity word and its start and end positions in the text; $s_i$ denotes the $i$-th entity fragment, containing the entity word text and its start and end positions in the text; $i$ denotes the index of an entity fragment, ranging over all $m$ fragments; and $m$ denotes the total number of entity fragments in the current document; constructing an entity co-occurrence graph $G = (V, C)$, where $V$ is the vertex set of the graph and $C$ is the edge weight set of the graph, wherein each node $v_a \in V$ represents an entity word, and each edge weight $c_{ab} \in C$ represents the number of documents in the corpus in which the entity words $v_a$ and $v_b$ co-occur; and defining an annotation correction function for any two adjacent or overlapping entity fragments in the current document.
- 5. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1, characterized in that the annotation correction function adjusts annotations with inconsistent boundaries by the consensus of neighbor entities: for each pair of adjacent or overlapping entity fragments $s_i$ and $s_j$ satisfying $\mathrm{Overlap}(s_i, s_j) \ge \tau$, if the correction function outputs a merge indication, $s_i$ and $s_j$ are merged into one entity fragment and the label sequence is updated; otherwise, the original state is maintained or a split is carried out; wherein $\mathrm{Overlap}(s_i, s_j)$ denotes the overlap ratio of the two entity fragments $s_i$ and $s_j$ on the text sequence, which is 0 if the two fragments do not overlap; and $\tau$ denotes a boundary overlap threshold, a preset constant used to determine whether two fragments point to the same text region: when $\mathrm{Overlap}(s_i, s_j) \ge \tau$, the two fragments are considered to have an annotation conflict and correction is needed.
- 6. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1, characterized in that step S3 specifically comprises: S31, context encoding enhanced by fuzzy matching features: constructing a feature fusion module, and performing weighted fusion of the feature vectors extracted by fuzzy matching with the output vectors of a pre-trained language model to obtain enhanced token representations; S32, global context fusion based on the entity co-occurrence graph: constructing the entity co-occurrence graph, using a graph neural network to learn a global embedding vector for each entity node in the graph, and dynamically mapping and fusing the global embedding vectors into the enhanced token representations; S33, boundary-aware span decoding with correction fusion: adopting a span-based decoding strategy to calculate, for every possible contiguous span, the confidence that it is an entity, while introducing the corrected labels and the correction function generated by the annotation correction process as prior constraints to adjust the model's attention to specific merge or split regions; and S34, entity type classification with graph embedding enhancement: for a target span judged to be an entity, obtaining its span representation, retrieving the neighbor node information corresponding to the target span from the entity co-occurrence graph, concatenating or interacting the span representation with the neighbor node information, and predicting the specific entity type of the target span.
- 7. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 1, characterized in that step S4 specifically comprises: constructing a main loss function: constructing positive samples from the entity fragments annotated in the corrected labels, constructing negative samples by randomly sampling non-entity fragments, and calculating a main classification loss from the positive and negative samples; constructing a consistency loss function: for all fragment pairs satisfying the position overlap condition, determining the merge or split relationship between the fragments of each pair based on the semantic indication provided by the correction function; constructing a consistency regularization term, and calculating the consistency loss function by constraining the distances between fragment pairs in the feature space, so that the distance between the representation vectors of a fragment pair with a merge relationship is minimized and the distance between the representation vectors of a fragment pair with a split relationship is maximized; and constructing the total loss function: performing a weighted summation of the main loss function and the consistency loss function to obtain the total loss function, and updating the model parameters by using the total loss function.
- 8. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 6, characterized in that, in S5, the model is trained end to end and all trainable parameters are optimized by the back-propagation algorithm, specifically: data loading and batching: dividing the cleaned and standardized training data set into a training set and a validation set, and dividing the training set into a plurality of training batches, wherein the training data set comprises normalized text sequences and the corresponding corrected label sequences; forward computation: inputting the normalized text sequences of each batch into the entity recognition model, obtaining initial context representations through the pre-trained language model, obtaining enhanced token representations by fusing the fuzzy matching features through a gated linear fusion unit, obtaining final token representations by fusing global context information through a graph attention network, and calculating the entity scores and type probabilities of all possible spans based on the span decoding strategy; parameter update: calculating the loss value of the current batch according to the total loss function, and updating all trainable parameters in the entity recognition model based on the back-propagation algorithm, wherein the trainable parameters comprise the pre-trained language model parameters, the graph attention network parameters, the multi-layer perceptron parameters and the embedding vectors; and validation: after each training epoch, evaluating the current model on the validation set to obtain the precision, recall and F1 score of entity recognition, triggering an early-stopping mechanism if the F1 score on the validation set does not improve for a preset number of consecutive epochs, and saving the model parameters corresponding to the highest F1 score on the validation set as the final training result.
- 9. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 6, characterized in that the feature vector comprises the edit similarity, the semantic similarity and the matching confidence between a token and its best-matching standard word.
- 10. The method for constructing a nursing field text annotation corpus for nursing teaching tasks according to claim 6, characterized in that S32 specifically comprises: S321, for each token in the current document, determining the set of entity types it may correspond to according to the nursing dictionary, and aggregating the graph embeddings of all entity types in the set through an attention mechanism to obtain a global context vector for the token; and S322, concatenating the global context vector of the token with the enhanced token representation to form the final token representation.
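As an illustration of the S2 cleaning and correction steps described in claims 3 to 5, the following is a minimal Python sketch under stated assumptions, not the claimed implementation. The weighted sum with weight `alpha`, the hand-written `TOY_SEMANTIC` scores, the use of intersection-over-union as the overlap ratio, the end-exclusive span convention, and the boolean `consensus_says_merge` (standing in for the decision that would be derived from the entity co-occurrence graph) are all hypothetical choices that the claims do not specify.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed convention)


def edit_similarity(a: str, b: str) -> float:
    """Return 1 - Levenshtein(a, b) / max(len(a), len(b))."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def normalize_tokens(
    tokens: Sequence[str],
    dictionary: Sequence[str],
    semantic_sim: Callable[[str, str], float],
    alpha: float = 0.5,       # assumed fusion weight
    threshold: float = 0.55,  # assumed similarity threshold
) -> List[str]:
    """Map each token to its best standard word if the fused score reaches the threshold."""
    normalized = []
    for tok in tokens:
        best_word, best_score = tok, 0.0
        for std in dictionary:
            # Assumed fusion: weighted sum of edit and semantic similarity.
            score = alpha * edit_similarity(tok, std) + (1 - alpha) * semantic_sim(tok, std)
            if score > best_score:
                best_word, best_score = std, score
        normalized.append(best_word if best_score >= threshold else tok)
    return normalized


def overlap_ratio(a: Span, b: Span) -> float:
    """Intersection over union of two spans (assumed definition); 0.0 if disjoint."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def correct_pair(a: Span, b: Span, tau: float, consensus_says_merge: bool) -> List[Span]:
    """Merge two conflicting spans when the (hypothetical) consensus flag says so."""
    if overlap_ratio(a, b) >= tau and consensus_says_merge:
        return [(min(a[0], b[0]), max(a[1], b[1]))]
    return [a, b]


# Toy semantic scores, invented for this demo only.
TOY_SEMANTIC = {("bp", "blood pressure"): 1.0, ("cathether", "catheter"): 1.0}


def toy_semantic_sim(token: str, standard: str) -> float:
    return TOY_SEMANTIC.get((token, standard), 0.0)


tokens = ["bp", "cathether", "patient"]
dictionary = ["blood pressure", "catheter"]
normalized = normalize_tokens(tokens, dictionary, toy_semantic_sim)
merged = correct_pair((0, 3), (1, 4), tau=0.5, consensus_says_merge=True)
print(normalized)
print(merged)
```

In this toy run the abbreviation and the misspelling are mapped to their dictionary entries, the out-of-dictionary token is left unchanged, and the two overlapping spans are merged into one. In practice the semantic score would more plausibly come from contextual embeddings, and the merge decision from co-occurrence statistics, neither of which is modeled here.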
Description
Nursing field text annotation corpus construction method for nursing teaching task
Technical Field
The invention relates to the fields of artificial intelligence and medical information processing, and in particular to a method for constructing a nursing field text annotation corpus for nursing teaching tasks.
Background
With the rapid development of medical informatization and intelligent technology, nursing is an important component of clinical care, and the text data it generates has great application value. Hospital electronic medical record systems, nursing record documents, clinical nursing pathways, nurse shift reports and related nursing research literature have accumulated massive amounts of unstructured or semi-structured nursing text. These texts record in detail key information such as changes in patients' symptoms, vital sign monitoring data, the execution of nursing procedures, feedback on medication use and nursing assessment results, and are the core carriers reflecting real clinical nursing scenarios. Fully mining and analyzing nursing text data is of great significance for assisting clinical nursing decisions, improving the quality of nursing teaching, monitoring the quality of nursing services and promoting nursing research.
However, the prior art generally has the following problems:
1. When processing nursing text, the prior art generally cleans the data with generic character replacement or regular-expression matching, which cannot effectively handle the ambiguous abbreviations, misspellings and rich semantically equivalent variants specific to the nursing field. As a result, the same entity is often split into different words, which seriously affects the accuracy and robustness of subsequent entity recognition.
2. When faced with inconsistent entity boundaries produced by manual annotation or model prediction, the prior art lacks an effective automatic correction mechanism and generally relies only on manual review, which is inefficient and makes it difficult to ensure annotation consistency across a large-scale corpus.
3. Entity recognition models in the prior art mostly rely only on local context information, and ignore the strong co-occurrence and semantic relations that exist among entities in nursing text at the level of the whole corpus. The discriminative ability of such models is therefore clearly limited when handling ambiguous entity boundaries, semantic ambiguity and rare entities.
4. Model training in the prior art usually takes fitting the annotated labels as its single objective and cannot effectively use prior knowledge of the nursing field to guide model learning. As a result, the model's representations in the feature space deviate from domain expertise, and its generalization ability and robustness need improvement.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a method for constructing a nursing field text annotation corpus for nursing teaching tasks, so as to solve the problems described above.
The invention achieves the above object through the following technical solution: a method for constructing a nursing field text annotation corpus for nursing teaching tasks, comprising the following steps: S1, acquiring original nursing text data, manually annotating the acquired data, and constructing a training data set; S2, performing standardized cleaning of the text by fuzzy matching based on a nursing dictionary, and performing consistency correction on the entity annotations by using a graph structure, thereby completing the standardization of the data; S3, constructing an entity recognition model for the nursing field, wherein the entity recognition model takes a normalized text sequence and a corrected label sequence as input, and uses fuzzy matching features, an entity co-occurrence graph and an annotation correction function to construct a deep neural network that fuses local context, global co-occurrence knowledge and annotation priors; S4, constructing a total loss function, wherein the total loss function comprises a main loss for ensuring the accuracy of entity boundary and type recognition, and an auxiliary loss for forcing the model's representations in the feature space to be consistent with the semantic relations provided by the correction function; S5, training the entity recognition model with the cleaned and standardized training data set; and S6, automatically annotating the original nursing text by using the entity recognition model. Further, the annotation adopts a BIO-based entity annotation scheme, wherein B denotes the beginning of an entity, I denotes the inside of an entity, and O denotes a non-entity. Fu