CN-115270795-B - Named entity recognition technology for the environmental impact assessment field based on small-sample learning

Abstract

The invention discloses a named entity recognition technology for the environmental impact assessment (EIA) field based on small-sample learning. The method comprises: acquiring and preprocessing corpus from EIA-field documents; manually labeling the preprocessed corpus to obtain manually labeled samples and unlabeled samples; manually collecting and organizing EIA-field entities, storing them in an entity library in the form of a word list, and expanding the entity library; building a named entity recognition (NER) model; training the NER model in stages with the manually labeled samples and the unlabeled samples to obtain a trained NER model; and, in the prediction stage, correcting the model's predictions with the expanded entity library. The invention uses a small number of manually labeled samples together with a manually curated entity library as supervision signals, expands the pseudo-label data step by step in stages, and improves the generalization ability of the model with the mixed data, so that a relatively effective NER model can be trained with little manually labeled data.

Inventors

  • ZHANG JIANBING
  • WANG JIULIANG
  • CHU YOUGANG
  • HUANG SHUJIAN
  • DAI XINYU
  • CHEN JIAJUN

Assignees

  • Nanjing University (南京大学)

Dates

Publication Date
20260421
Application Date
20220721
Priority Date
20220721

Claims (9)

  1. A named entity recognition technology for the environmental impact assessment (EIA) field based on small-sample learning, characterized by comprising the following steps:
     Step 1: acquiring corpus from EIA-field documents, preprocessing the corpus, and manually labeling the preprocessed corpus, with at least 10 samples labeled for each entity type, to obtain manually labeled samples and unlabeled samples;
     Step 2: manually collecting and organizing entities in the EIA field and storing them in an entity library in the form of a word list;
     Step 3: building a named entity recognition (NER) model consisting of a pre-trained encoder, a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), wherein the pre-trained encoder is obtained by taking an encoder pre-trained on the general domain and further pre-training it on EIA-field corpus;
     Step 4: training the NER model in stages with the manually labeled samples and the unlabeled samples to obtain a trained NER model;
     Step 5: in the prediction stage, correcting the prediction result of the NER model with the expanded entity library to obtain the final recognition result, thereby completing named entity recognition in the EIA field based on small-sample learning;
     wherein, in Step 2, the entity library is expanded as follows:
     Step 2-1: obtaining the entities t from the entity library and constructing an entity word list T;
     Step 2-2: randomly selecting sentences from the unlabeled samples obtained in Step 1 to obtain a sample set S containing samples s, comparing each sample against the entity word list T, counting the number of entities contained in each sample s, and sorting all samples in the sample set in descending order of that count;
     Step 2-3: performing data augmentation on a sample s' in the base sample set S', obtaining an augmented sample s'_p by synonym substitution and back translation, and computing the perplexity of s'_p and the cosine similarity between s'_p and the sample s'; s'_p is kept as a qualified augmented sample only when its perplexity is lower than a threshold S_ppl and its cosine similarity with s' is higher than a threshold S_sim; otherwise s'_p is discarded;
     Step 2-4: comparing the qualified augmented sample s'_p with the original sample s', examining the contiguous text span t_span that changed, and determining the part of speech of t_span; if the probability that its part of speech is a noun is higher than a threshold p_noun, t_span is regarded as a new entity, the augmented sample s'_p and the original sample s' are stored for later use, and t_span is added to the entity library;
     Steps 2-3 and 2-4 are performed on all samples in the base sample set S'.
  2. The named entity recognition technology for the EIA field based on small-sample learning according to claim 1, wherein in Step 1 the corpus is preprocessed as follows: deleting incomplete sentences from the corpus, removing structurally complex sentences that contain formulas, de-duplicating the corpus, and converting it to a uniform encoding; then manually screening the corpus, keeping a sentence for later use if it contains entities of a target entity type and discarding it otherwise, until at least 10 sentences have been selected for each target entity type.
  3. The named entity recognition technology for the EIA field based on small-sample learning according to claim 2, wherein in Step 1 the preprocessed corpus is manually labeled as follows: the corpus obtained by preprocessing is labeled manually in the BIO scheme; the labeled corpus forms the manually labeled samples, and the corpus left unlabeled forms the unlabeled samples.
  4. The named entity recognition technology for the EIA field based on small-sample learning according to claim 3, wherein in Step 3 the pre-trained encoder is obtained as follows: Step 3-1: acquiring an encoder Encoder_pre pre-trained on the general domain, continuing its pre-training task for 2 rounds on the corpus preprocessed in Step 1, and storing the resulting encoder Encoder_cont for later use.
  5. The named entity recognition technology for the EIA field based on small-sample learning according to claim 4, wherein in Step 3 the pre-trained encoder is obtained as follows: Step 3-2: taking the base sample set S' and the augmented sample set S'_p obtained in Step 2, and pre-training the encoder Encoder_cont stored in Step 3-1 for 2 rounds on a masked entity language modeling task, that is, masking entities according to the masked language modeling (Masked LM) strategy and predicting them again, so that entity semantic knowledge is injected into the encoder, to obtain the pre-trained encoder Encoder_entity.
  6. The named entity recognition technology for the EIA field based on small-sample learning according to claim 5, wherein in Step 3 the pre-trained encoder is obtained as follows: Step 3-3: assembling the NER model from the pre-trained encoder Encoder_entity obtained in Step 3-2, the bidirectional long short-term memory network (BiLSTM) and the conditional random field (CRF); in the training stage, using the manually labeled samples, concatenating the embedding vector of the entity to the embedding vector of the manually labeled sample, and fine-tuning the whole NER model by supervised training, with the negative log-likelihood as the loss function.
  7. The named entity recognition technology for the EIA field based on small-sample learning according to claim 6, wherein in Step 4 the NER model is trained in stages with the manually labeled samples and the unlabeled samples as follows:
     Step 4-1: obtaining the manually labeled samples S_fewshot, selecting 10 corresponding samples for each entity type, and constructing a labeled small-sample training set; selecting sentences from the unlabeled samples of Step 1 and constructing an unlabeled training set;
     Step 4-2: training the NER model on the small-sample training set by supervised learning, taking the trained model as the teacher model, and storing it for later use;
     Step 4-3: predicting on the unlabeled training set with the teacher model to generate pseudo labels, forming a pseudo-label data set S_pseudo; computing a confidence for each piece of pseudo-label data s_pseudo in S_pseudo, sorting the pseudo-label data in descending order of confidence, selecting the first N pieces with the highest confidence, and adding them to the labeled data set to obtain an extended labeled data set, where N is 3 to 5 times the size of the training set used to train the teacher model;
     Step 4-4: copying the structure and network parameters of the teacher model to obtain a student model, and training the student model on the extended labeled data set with noise introduced, the noise being either gradient noise applied while training the student model or data noise introduced by inserting, shuffling and deleting training data;
     Step 4-5: taking the student model as the teacher model of the next iteration, repeating Steps 4-2 to 4-4 to train a new student model, and taking the student model obtained after 2 or 3 iterations as the final NER model.
  8. The named entity recognition technology for the EIA field based on small-sample learning according to claim 7, wherein in Step 5 the prediction result of the NER model is corrected with the entity library as follows: Step 5-1: inputting a target sample s_pred whose entities are to be predicted, and predicting with the trained NER model to obtain a candidate entity t_cand.
  9. The named entity recognition technology for the EIA field based on small-sample learning according to claim 8, wherein in Step 5 the prediction result of the NER model is corrected with the entity library as follows: Step 5-2: obtaining the entities in the entity library and the candidate entity t_cand, comparing each entity with t_cand, and finding the entity t_po with the largest overlap proportion with t_cand together with the corresponding overlap proportion p_overlap; if p_overlap is larger than a threshold S_po and the entity t_po occurs in s_pred, the prediction result of the model is corrected to t_po; otherwise no correction is made and the prediction result remains t_cand, thereby completing named entity recognition in the EIA field based on small-sample learning.
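
The correction step of claims 8 and 9 can be sketched as follows. This is a minimal illustration, not the patented implementation. The claims do not define how the overlap proportion is computed, so the sketch assumes a character-level measure (longest common substring divided by the longer string's length). The function names and the default threshold value 0.6 are likewise hypothetical.

```python
from typing import Iterable


def overlap_ratio(candidate: str, entity: str) -> float:
    """Longest common substring length divided by the longer string's length.

    ASSUMPTION: the claims do not define the overlap proportion; this is one
    plausible character-level definition chosen for illustration.
    """
    if not candidate or not entity:
        return 0.0
    best = 0
    # prev[j] holds the common-suffix length ending at the previous candidate
    # character and the j-th entity character.
    prev = [0] * (len(entity) + 1)
    for c_char in candidate:
        curr = [0] * (len(entity) + 1)
        for j, e_char in enumerate(entity, start=1):
            if c_char == e_char:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best / max(len(candidate), len(entity))


def correct_prediction(
    s_pred: str,
    t_cand: str,
    entity_library: Iterable[str],
    s_po: float = 0.6,  # hypothetical threshold value, not from the patent
) -> str:
    """Return t_po if it overlaps t_cand enough and occurs in s_pred, else t_cand."""
    t_po, p_overlap = None, 0.0
    for entity in entity_library:
        ratio = overlap_ratio(t_cand, entity)
        if ratio > p_overlap:
            t_po, p_overlap = entity, ratio
    if t_po is not None and p_overlap > s_po and t_po in s_pred:
        return t_po
    return t_cand


if __name__ == "__main__":
    library = ["sulfur dioxide", "chemical oxygen demand"]
    sentence = "The plant emits sulfur dioxide at night."
    # The model cut the entity short; the library restores the full span.
    print(correct_prediction(sentence, "sulfur dioxid", library))
```

Requiring that t_po actually occurs in the input sentence keeps the library from overwriting a prediction with an entity that merely looks similar but is not present in the text.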

Description

Named entity recognition technology for the environmental impact assessment field based on small-sample learning

Technical Field

The invention relates to named entity recognition technology, and in particular to a named entity recognition technology for the environmental impact assessment (EIA) field based on small-sample learning.

Background

With the rapid development of artificial intelligence, intelligent writing-assistance technology has been widely applied in many areas of production and daily life, such as automatic contract generation, legal document correction and essay correction. Named entity recognition (NER) is one of the prerequisite steps and core components of an intelligent writing-assistance system, and is responsible for extracting entities with specific meanings from unstructured text. The recognition result determines the accuracy of the revision suggestions given by the system, and therefore directly affects user satisfaction with it. In the EIA field, the entity types to be predicted are new types, labeled data is lacking, and manually labeling a large number of samples is expensive, so conventional NER techniques cannot be applied.

Small-sample NER techniques usually proceed in two steps. In the first step, a small number of manually labeled samples are used as supervision signals to extract useful structural information or pseudo-label information from large-scale unlabeled data, so that the unlabeled data is converted into usable data. In the second step, the NER model is trained on the labeled data combined with the converted data. The most commonly used NER model structure consists of a pre-trained encoder, a bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM) and a conditional random field (Conditional Random Field, CRF).
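
The first of the two steps, turning unlabeled sentences into usable pseudo-labeled data, can be sketched as a confidence-based selection. This is a minimal illustration under stated assumptions. The `PseudoSample` type and the function name are hypothetical. The confidence score is assumed to be supplied by an already trained model. The ratio of 3 to 5 times the labeled set size follows Step 4-3 of claim 7.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PseudoSample:
    """A sentence with model-predicted BIO tags (hypothetical container)."""
    tokens: List[str]
    tags: List[str]
    confidence: float  # ASSUMPTION: sentence-level score provided by the model


def select_pseudo_labels(
    pseudo: Sequence[PseudoSample],
    labeled_size: int,
    ratio: float = 3.0,
) -> List[PseudoSample]:
    """Keep the N most confident pseudo-labeled samples, N = ratio * labeled_size."""
    if not 3.0 <= ratio <= 5.0:
        raise ValueError("claim 7 states a ratio between 3 and 5")
    n = int(ratio * labeled_size)
    # Sort in descending order of confidence and keep the first N.
    ranked = sorted(pseudo, key=lambda sample: sample.confidence, reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    pool = [
        PseudoSample(["COD", "exceeds", "limit"], ["B-POL", "O", "O"], 0.2),
        PseudoSample(["SO2", "emission"], ["B-POL", "O"], 0.9),
        PseudoSample(["noise", "level"], ["B-POL", "O"], 0.5),
        PseudoSample(["dust", "control"], ["B-POL", "O"], 0.7),
    ]
    kept = select_pseudo_labels(pool, labeled_size=1, ratio=3)
    print([sample.confidence for sample in kept])
```

The selected samples would then be merged with the manually labeled set for the second step; how the confidence is computed is left open here because the source does not specify it.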
There are many specific training methods for the model. According to how information is extracted in the first step and how the model is trained in the second step, the common methods fall into the following three categories.

Method one (see: Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning [J]. 2017.): this approach can be transferred to the small-sample NER task and addresses the small-sample NER problem with meta-learning. The scheme uses a prototypical network (Prototypical Network), which assumes that the embedding vectors of all entities of the same entity type lie close together in the representation space, and therefore takes the center of these vectors as the embedding vector of the entity type. In the prediction stage, the distance between the embedding vector of a word and the embedding vector of each candidate entity type is compared, and the entity type with the shortest distance is taken as the prediction.

Method two (see: Jiang H, Zhang D, Cao T, et al. Named entity recognition with small strongly labeled and large weakly labeled data [J]. 2021.): this approach can be transferred to the small-sample NER task and addresses the small-sample NER problem from the data perspective. The scheme uses distant supervision (Distant Supervision) to transform unlabeled data into noisy pseudo-label data based on certain assumed rules. To preserve the accuracy of the model, the pseudo-label data must be denoised. Finally, the labeled data and the pseudo-label data are combined and the NER model is trained by supervised learning.

Method three (see: Jiang H, Zhang D, Cao T, et al. Named entity recognition with small strongly labeled and large weakly labeled data [J]. 2021.): this approach can be transferred to the small-sample NER task and addresses the small-sample NER problem from the generalization perspective.
The scheme uses self-training (Self-Training) and improves the generalization of the model iteratively, in stages. In each iteration, high-quality samples are used to guarantee the accuracy of the teacher (Teacher) model; the high-quality samples are then mixed with pseudo-label data, training noise is added, and a student (Student) model with stronger generalization is trained.

Existing small-sample NER methods can use a small number of labeled samples together with a large-scale unlabeled corpus for joint training, and thereby obtain a high-precision NER model. However, these schemes either rest on oversimplified assumptions or use the mixed data in a relatively limited way, so they cannot be applied well to realistic application scenarios such as the EIA field. In particular, a meta-learning based approach assumes that the embedding vectors of entities belonging to the same entity type are close in the representation space. However, in real scenarios, even when entities belong to the same entity type, each entity carries its own specific semantics, and the distribution in the representation space is difficult to ensure that the entity types closest to the entity embedded vector are not ens