
CN-115879460-B - Text content-oriented new tag entity identification method, device, equipment and medium


Abstract

The application relates to a method, device, equipment and medium for identifying new tag entities in text content. The method comprises: constructing a whole-word masking language model task and an NTP task from a training data set and retraining a pre-trained model; building a candidate entity recognition model from the retrained model and GlobalPointer; performing new-tag recognition on an information data set with the candidate entity recognition model and ranking the recognized new tags to obtain the entity tags with the highest article association; filtering a manually annotated tag lexicon against those entity tags to obtain a new tag lexicon; cleaning the new tag lexicon; modifying and expanding the training data set with the cleaned lexicon; training the candidate entity recognition model on the expanded training set; and performing new-tag entity recognition on text content with the trained entity recognition model. The method improves the recognition accuracy for new tag entities.

Inventors

  • XU CHENG
  • CHOU XIAOHUI

Assignees

  • 宁波深擎信息科技有限公司
  • 上海深擎信息科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2022-08-16

Claims (10)

  1. A method for identifying new tag entities in text content, the method comprising: acquiring a manually annotated tag lexicon, information texts and an information data set, wherein the tag lexicon comprises feature words and the tags corresponding to the feature words; screening the information texts against the tag lexicon to obtain a training data set; training a pre-constructed BERT model on the training data set in a self-supervised manner to obtain a pre-trained model; constructing a whole-word masking language model task and an NTP task from the training data set and retraining the pre-trained model to obtain a retrained model; constructing a candidate entity recognition model from the retrained model and a GlobalPointer global pointer; performing new-tag recognition on the information data set with the candidate entity recognition model and ranking the recognized new tags to obtain the entity tags with the highest article association degree; filtering the manually annotated tag lexicon against those entity tags to obtain a new tag lexicon; cleaning the new tag lexicon according to the number of feature words corresponding to each tag, the number of times each feature word has been filtered, the number of days since each feature word was added to the new tag lexicon, and the number of times each tag has been filtered within a preset period, to obtain a cleaned tag lexicon; modifying and expanding the training data set with the cleaned tag lexicon to obtain an expanded training set, and training the candidate entity recognition model on the expanded training set to obtain a trained entity recognition model; and performing new-tag entity recognition on text content with the trained entity recognition model.
  2. The method according to claim 1, further comprising: encoding the sentences in which the tags of the new tag lexicon occur with the BERT model, concatenating the four layers of word vectors at the positions corresponding to each tag and average-pooling them to obtain a word vector for each tag; storing the word vectors of all tags in a Faiss index, vectorizing each new-tag entity recognition result in the Faiss index, computing its cosine similarity against all tags in the new tag lexicon and returning the two highest-scoring tags as first candidate synonym tags; building a BK-tree over all tags in the new tag lexicon using edit distance, normalizing the new-tag entity recognition result and retrieving from the BK-tree the tags whose edit distance is less than 2 as second candidate synonym tags; and locating the new-tag entity recognition result within the lexicon according to the first and second candidate synonym tags, thereby expanding the new tag lexicon.
  3. The method of claim 1, wherein constructing the whole-word masking language model task and the NTP task comprises: performing random whole-word masking on the training data set and segmenting the masked text with the ansj word segmenter to obtain a segmentation vocabulary, wherein a whole word refers to a complete word in the text; and extracting the vectors of the sentences containing the tags to obtain vector representations of the tags, and using the hierarchical relationship between the tags and the feature words in the tag lexicon as the NTP task over the tag lexicon.
  4. The method of claim 1, wherein performing new-tag recognition on the information data set with the candidate entity recognition model and ranking the results to obtain the entity tag with the highest article association degree comprises: performing new-tag recognition on the information data set with the candidate entity recognition model; computing a full-text vector representation of each text in the information data set to obtain a first vector representation; masking all occurrences of the current tag word in the text and computing a vector representation of the masked text to obtain a second vector representation; computing the cosine similarity between the first and second vector representations to obtain the article association degree with and without the current tag, the article association degree being a measure of the tag's importance; and sorting all tags in the information data set in ascending order of this importance measure to obtain the entity tag with the highest article association degree.
  5. The method of claim 1, wherein cleaning the new tag lexicon according to the number of feature words corresponding to each tag, the number of times each feature word has been filtered, the number of days since each feature word was added to the new tag lexicon, and the number of times each tag has been filtered within a preset period, to obtain a cleaned tag lexicon, comprises: calculating an aging value for each feature word in the new tag lexicon from the number of feature words corresponding to its tag, the number of times the feature word has been filtered, the number of days since the feature word was added to the new tag lexicon, and the number of times its tag has been filtered within the preset period; and if the aging value of a feature word remains below 0 for three consecutive preset periods, deleting the feature word and its corresponding tag to obtain the cleaned tag lexicon.
  6. The method of claim 5, wherein calculating the aging value of a feature word comprises: calculating the aging value according to the following formula [formula not reproduced in this text], wherein i denotes a feature word, T_i denotes the tag corresponding to feature word i, M_i denotes the number of times the feature word is filtered within the preset period, D_i denotes the number of days since the feature word was added to the new tag lexicon, [symbol not reproduced] denotes the number of feature words corresponding to T_i, and [symbol not reproduced] denotes the number of times T_i is filtered within the preset period.
  7. The method of claim 1, wherein screening the information texts against the tag lexicon to obtain a training data set comprises: extracting from the information texts the sentences containing words of the tag lexicon, and forming the training data set from the extracted sentences and the tag-lexicon words corresponding to those sentences.
  8. The method of claim 1, wherein training the pre-constructed BERT model on the training data set in a self-supervised manner to obtain the pre-trained model comprises: masking words in the sentences of the training data set and performing fill-in-the-blank training on the masked training data set with the BERT model to obtain a trained BERT model; and shuffling the order of the sentences in the training data set and performing sentence-ordering training on the shuffled training data set with the trained BERT model to obtain the pre-trained model.
  9. A new tag entity identification device for text content, the device comprising: a training data set construction module for acquiring a manually annotated tag lexicon, information texts and an information data set, wherein the tag lexicon comprises feature words and the tags corresponding to the feature words; a model training module for training a pre-constructed BERT model on the training data set in a self-supervised manner to obtain a pre-trained model; a candidate entity recognition model construction module for constructing a candidate entity recognition model from the retrained model and the GlobalPointer global pointer; a tag lexicon filtering module for performing new-tag recognition on the information data set with the candidate entity recognition model and ranking the recognized new tags to obtain the entity tags with the highest article association degree; a new tag lexicon cleaning module for cleaning the new tag lexicon according to the number of feature words corresponding to each tag, the number of times each feature word has been filtered, the number of days since each feature word was added to the new tag lexicon and the number of times each tag has been filtered within a preset period, to obtain a cleaned tag lexicon; and a new tag entity recognition module for modifying and expanding the training data set with the cleaned tag lexicon to obtain an expanded training set, training the candidate entity recognition model on the expanded training set to obtain a trained entity recognition model, and performing new-tag entity recognition on text content with the trained entity recognition model.
  10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
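The BK-tree retrieval in claim 2 (second candidate synonym tags are all lexicon tags within edit distance less than 2 of a normalized recognition result) can be sketched with a plain BK-tree over Levenshtein distance. This is a generic textbook implementation, not code from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree over edit distance, used to retrieve tags within a small
    edit distance of a recognized entity (claim 2's second candidate set)."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})   # node = (word, {distance: child})
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:               # word already stored
                return
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, max_dist):
        """Return all stored words within max_dist of `word`."""
        out, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = edit_distance(word, w)
            if d <= max_dist:
                out.append(w)
            # triangle inequality prunes children outside [d-max, d+max]
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return out
```

For the "edit distance smaller than 2" criterion of claim 2, the query would be `tree.search(normalized_result, 1)`.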
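Claim 4's article association score (cosine similarity between the full-text embedding and the embedding of the text with the tag masked out) can be sketched as follows. The `embed` callable is a placeholder for the BERT encoder the patent uses; any sentence-embedding function with the same shape works:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_tags_by_association(text, tags, embed, mask_token="[MASK]"):
    """Compare the full-text embedding with the embedding of the text in
    which every occurrence of a tag has been masked (claim 4). The more
    the similarity drops, the more of the article's meaning the tag
    carries, i.e. the higher its article association."""
    full = embed(text)
    scored = [(tag, cosine(full, embed(text.replace(tag, mask_token))))
              for tag in tags]
    # ascending similarity: the most strongly associated tag comes first,
    # matching the claim's small-to-large sort
    return sorted(scored, key=lambda pair: pair[1])
```

With a real encoder, `embed` would return the model's pooled sentence vector; the toy bag-of-characters embedding in the test below is only for illustration.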
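The cleanup rule of claims 5 and 6 (delete a feature word whose aging value stays below zero for three consecutive evaluation periods) can be sketched as follows. The patent's aging formula is not reproduced in the source, so `aging_value` below is a hypothetical stand-in that merely combines the four stated inputs; the class and field names are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class FeatureWord:
    word: str
    tag: str
    days_in_lexicon: int      # D_i: days since the word entered the new tag lexicon
    times_filtered: int       # M_i: times the word was filtered in the current period
    tag_times_filtered: int   # times the word's tag was filtered in the current period
    tag_word_count: int       # number of feature words under the same tag
    negative_streak: int = 0  # consecutive periods with a negative aging value

def aging_value(fw: FeatureWord) -> float:
    """Hypothetical aging score combining the four inputs named in claim 6:
    heavy filtering and old age push the value down, a well-populated tag
    pushes it up. This is NOT the patent's actual formula."""
    return (fw.tag_word_count - fw.times_filtered
            - fw.tag_times_filtered - 0.1 * fw.days_in_lexicon)

def clean_lexicon(lexicon):
    """One evaluation period of claim 5's rule: drop a feature word (and
    its tag entry) once its aging value has been negative for three
    consecutive periods."""
    kept = []
    for fw in lexicon:
        fw.negative_streak = fw.negative_streak + 1 if aging_value(fw) < 0 else 0
        if fw.negative_streak < 3:
            kept.append(fw)
    return kept
```

Running `clean_lexicon` once per preset period reproduces the "below 0 for three periods" deletion behavior regardless of which concrete formula is substituted for `aging_value`.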

Description

Text content-oriented new tag entity identification method, device, equipment and medium

Technical Field

The present application relates to the field of data processing technologies, and in particular to a method, an apparatus, a computer device and a storage medium for identifying new tag entities in text content.

Background

A complete tag lexicon is critical for a tagging system. A tagging system outputs a set of words (keywords) that summarize or classify content into categories; these output words are called tags, and the keywords that can be summarized under a tag are called feature words. However, tags and feature words cannot be found by manual accumulation alone, so automatic tag discovery is technically required. Tag discovery is essentially close to the problem of new-word discovery in Chinese natural language processing, but is not exactly equivalent to it, because some tag words may not be new words but already existing ones. A common traditional approach based on mutual information is the unsupervised, statistics-based new-word discovery described by Matrix67: candidate words are recalled by full n-gram enumeration, each candidate is scored by its internal cohesion (pointwise mutual information, PMI) and its freedom of use (left-right entropy), and recalled candidates are then ranked by score.
Internal cohesion is the probability that the characters of a candidate word co-occur: a high value indicates that the characters frequently appear together as a unit and are therefore likely to form a word. Freedom of use measures whether the characters appearing to the left and right of the candidate are sufficiently varied: a character group may co-occur frequently enough, yet if only a handful of characters ever appear to its left, its left context is not rich, i.e. its freedom of use is low. The score of a candidate word therefore balances its internal cohesion against its external freedom; candidates are ranked by score from high to low, likely new words are screened out, and existing vocabulary is filtered away to obtain the final result. However, this unsupervised, corpus-based tag discovery method depends on a large volume of input text, because it is statistical: frequencies, and hence cohesion and freedom values, must be enumerated and computed over the entire input.
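The cohesion-plus-freedom scoring described above can be sketched directly on character n-grams. The score combination (a simple sum) and the toy corpus in the test are illustrative choices, not taken from Matrix67's article:

```python
import math
from collections import Counter

def new_word_scores(text: str, n: int = 2):
    """Score every character n-gram in `text` by internal cohesion (PMI)
    plus freedom of use (the smaller of left- and right-neighbor entropy)."""
    chars = Counter(text)
    total = len(text)
    grams = Counter(text[i:i + n] for i in range(total - n + 1))

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

    scores = {}
    for g, freq in grams.items():
        # cohesion: log of the observed n-gram probability over the product
        # of its characters' independent probabilities
        p_gram = freq / (total - n + 1)
        p_indep = math.prod(chars[c] / total for c in g)
        pmi = math.log(p_gram / p_indep)
        # freedom: entropy of the characters adjacent to each occurrence
        left = Counter(text[i - 1] for i in range(1, total - n + 1)
                       if text[i:i + n] == g)
        right = Counter(text[i + n] for i in range(total - n)
                        if text[i:i + n] == g)
        freedom = min(entropy(left), entropy(right))
        scores[g] = pmi + freedom  # weighting the two terms is a design choice
    return scores
```

A cohesive n-gram that recurs with varied neighbors scores above one that appears once or always in the same context, which is exactly the ranking behavior the background section describes.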
Consequently, when the input is a single article no useful result can be obtained, and it is difficult to find words (tags) that contain English characters or are too long, because recall is based on n-gram enumeration and a large n makes enumeration too inefficient. Moreover, because the method performs full recall followed by ranking and filtering to remove erroneous candidates, it depends heavily on the completeness of the existing vocabulary, and the proportion of usable words in the result is very low; even among the top 100 candidates, fewer than 50% are usable.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for identifying new tag entities in text content that can improve the accuracy of new-tag entity recognition. A method for identifying new tag entities in text content comprises: acquiring a manually annotated tag lexicon, information texts and an information data set, wherein the tag lexicon comprises feature words and the tags corresponding to the feature words; screening the information texts against the tag lexicon to obtain a training data set; training a pre-constructed BERT model on the training data set in a self-supervised manner to obtain a pre-trained model; constructing a whole-word masking language model task and an NTP task from the training data set and retraining the pre-trained model to obtain a retrained model; constructing a candidate entity identificat