CN-116522949-B - Word sense similarity calculation method and device based on adjacent vocabulary features

CN116522949B

Abstract

The invention provides a word sense similarity calculation method based on adjacent vocabulary features, and relates to the field of natural language processing. First, an example word extraction module extracts the example word from the example sentence and the sentence to be matched using a longest common substring algorithm. Second, an example sentence processing module segments the example sentence apart from the example word and tags the part of speech of the example word. Then, a feature extraction module extracts features from the words surrounding the example word in the example sentence, calculates the association degree between each surrounding word and the example word and each option word over a corpus, and combines the calculated association degrees with the surrounding words' features through weighting to form the features of the example word and the option words. Finally, an option processing module compares the parts of speech of the option words and the example word, and calculates similarity scores from the option words' features so as to select the optimal option word. The invention determines the features of the current word by extracting the features of surrounding words and calculating association degrees, and uses the corpus to calculate the mutual information of two words, from which their association degree is obtained.

Inventors

  • SHAO YUBIN
  • QI YUTING
  • LONG HUA
  • DU QINGZHI
  • ZHANG FENG
  • YANG RONGTAI

Assignees

  • 昆明理工大学 (Kunming University of Science and Technology)

Dates

Publication Date
2026-05-05
Application Date
2023-03-29

Claims (6)

  1. A word sense similarity calculation method based on adjacent vocabulary features, characterized by comprising the following steps:
     Step 1: extract the example word from the example sentence and the sentence to be matched with a longest common substring algorithm. Obtain the array of common substrings of the example sentence and the sentence to be matched with the longest common substring algorithm, remove the color labels from the common substrings in the example sentence, and take the characters that still carry color labels as the example word.
     Step 2: perform word segmentation on the characters of the example sentence other than the example word, and tag the part of speech of the example word. Segment the characters other than the color-labeled example word with an HMM model, and tag the part of speech of the color-labeled example word with the Viterbi algorithm.
     Step 3: extract the feature vectors of the example word and the option words from the segmented example sentence. First, convert the segmented words surrounding the color-labeled example word into word vectors for feature extraction; second, calculate over a corpus, by mutual information, the association degree between each surrounding segmented word and the example word and each option word; then weight the word vectors of the surrounding segmented words by these association degrees to form new vectors serving as the feature vectors of the example word and the option words.
     Step 4: calculate the similarity score between each option word and the example word to obtain the optimal option word. Calculate the similarity score of each option word from the part of speech and feature vector of the example word together with the part of speech and feature vector of the option word; the option word with the highest similarity score is the optimal option word.
     Step 3 specifically comprises the following steps:
     Step 3.1: obtain, with a word2vec model, the word-vector matrix of the segmented words other than the example word c_f.
     Step 3.2: calculate in the corpus the association degree between the example word (and, likewise, each option word) and every other segmented word as their mutual information: I(x, y) = log( p(x, y) / ( p(x) p(y) ) ), where p(x, y) is the probability that x and y occur together in the corpus, p(x) and p(y) are the probabilities of their independent occurrence, and I(x, y) is the mutual information of x and y.
     Step 3.3: after the association degrees between the example word (and each option word) and all other segmented words are calculated, compute from the association degrees the weight of each segmented word's word vector with respect to the example word and the option words.
     Step 3.4: multiply each segmented word's word vector by its corresponding weight; the weighted vectors form the feature vectors extracted from the segmentation, comprising the feature vector of the example word and the feature vector of each option word.
  2. The word sense similarity calculation method based on adjacent vocabulary features according to claim 1, wherein the example sentence has the same length as the sentence to be matched, and, apart from the marks and the example word, the remaining words and punctuation marks are identical.
  3. The word sense similarity calculation method based on adjacent vocabulary features according to claim 1, wherein Step 1 specifically comprises the following steps:
     Step 1.1: obtain the example sentence text1 = {a_1, a_2, ..., a_n}, where a_1, ..., a_n are the n characters of text1, and the sentence to be matched text2 = {b_1, b_2, ..., [mask], ..., b_n}, where b_1, ..., b_n are the n characters of text2.
     Step 1.2: perform example word recognition on the example sentence text1 and the sentence to be matched text2. Let A = text1 and B = text2, mark color labels on text1, and solve the longest common substring of A and B by dynamic programming; store the solved longest common substrings of A and B in an array L = {l_1, ..., l_p}, p ≤ n. Remove the color labels of the common substrings in text1 according to the longest common substring array L; the characters that still carry color labels are the example word c_f.
  4. The word sense similarity calculation method based on adjacent vocabulary features according to claim 1, wherein Step 2 specifically comprises: perform word segmentation on the characters other than the example word c_f with an HMM model to obtain the segmented text, and then tag the part of speech of the example word c_f with the Viterbi algorithm to obtain its part of speech.
  5. The word sense similarity calculation method based on adjacent vocabulary features according to claim 1, wherein Step 4 specifically comprises the following steps:
     Step 4.1: compare the part of speech of the example word with the part of speech of each option word, and record the result of the part-of-speech comparison.
     Step 4.2: calculate the similarity score between the example word and each option word from the part-of-speech comparison result and their feature vectors.
     Step 4.3: compare the scores of the example word against each option word; the option word with the highest score is the optimal option word.
  6. An apparatus implementing the word sense similarity calculation method based on adjacent vocabulary features according to claim 1, comprising the following modules: an example word extraction module, for extracting the example word from the example sentence and the sentence to be matched with a longest common substring algorithm; an example sentence processing module, for segmenting the words other than the example word in the example sentence and tagging the part of speech of the example word; a feature extraction module, for extracting the feature vectors of the example word and the option words; and an option processing module, for calculating the similarity score between the option words and the example word to obtain the optimal option word.
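Claim 1's Step 3 — mutual-information association degrees used as weights over the neighboring words' vectors — can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names (`pmi`, `feature_vector`), the sentence-level co-occurrence counting, and the absolute-value normalization of the weights are assumptions, since the patent's exact probability estimation and weight formula do not survive in this extraction.

```python
import math
from collections import Counter

def pmi(word_x, word_y, sentences):
    """Mutual information of two words over a toy corpus:
    log( p(x, y) / (p(x) * p(y)) ), counted at sentence level."""
    n = len(sentences)
    occ = Counter()
    co = 0
    for sent in sentences:
        words = set(sent)
        for w in (word_x, word_y):
            if w in words:
                occ[w] += 1
        if word_x in words and word_y in words:
            co += 1
    if co == 0 or occ[word_x] == 0 or occ[word_y] == 0:
        return 0.0  # no evidence of association
    return math.log((co / n) / ((occ[word_x] / n) * (occ[word_y] / n)))

def feature_vector(target, neighbors, vectors, sentences):
    """Weight each neighbor's word vector by its association degree with
    the target word and sum, yielding the target's feature vector."""
    weights = [pmi(target, w, sentences) for w in neighbors]
    total = sum(abs(w) for w in weights) or 1.0  # assumed normalization
    dim = len(next(iter(vectors.values())))
    feat = [0.0] * dim
    for w, wt in zip(neighbors, weights):
        for k in range(dim):
            feat[k] += (wt / total) * vectors[w][k]
    return feat
```

The same `feature_vector` call is applied once with the example word and once with each option word as `target`, so both sides of the Step 4 comparison are built from the same surrounding segmentation.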

Description

Word sense similarity calculation method and device based on adjacent vocabulary features

Technical Field

The invention relates to a word sense similarity calculation method and device based on adjacent vocabulary features, in particular to a method for calculating the semantic similarity of words, and belongs to the field of natural language processing.

Background

The semantic similarity of words has an irreplaceable role in the field of natural language processing. However, the semantic relationships between words are very complex, and it is difficult to measure the similarity of word meanings with a single number: the same pair of words may be very similar in one respect yet quite different in another. Word sense similarity calculation is widely applied in many fields, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. Two methods of word sense similarity calculation are currently common. The first calculates word sense similarity from a synonym dictionary: all words are organized into one or more tree structures, and the path length between two nodes serves as their semantic distance. The second calculates it from large-scale corpus statistics, using the probability distribution of each word's context to estimate the semantic similarity between words. The first method depends entirely on the semantic dictionary, so its results are sensitive to the coverage and completeness of that dictionary; the second relies on large-scale corpus statistics, so its results may be inaccurate when the corpus is too small.
The invention introduces part-of-speech discrimination into corpus-statistics-based word sense similarity calculation, so that accurate results can be obtained even when the corpus is not large.

Disclosure of the Invention

To address the shortcomings of existing word sense similarity calculation, the invention aims to provide a method that calculates word sense similarity by replacing the feature vector of the word to be matched with feature vectors built from adjacent words. The technical scheme of the invention is a vocabulary matching algorithm based on adjacent vocabulary features: the weighted features of the segmented words surrounding the example word stand in for the features of the example word and the option words, and the similarity score between the example word and each option word is then calculated to select the optimal option word. The specific steps are as follows.

Step 1: extract the example word from the example sentence and the sentence to be matched with a longest common substring algorithm. Obtain the array of common substrings of the example sentence and the sentence to be matched with the longest common substring algorithm, remove the color labels from the common substrings in the example sentence, and take the characters that still carry color labels as the example word. The example sentence has the same length as the sentence to be matched, and, apart from the marks and the example word, the remaining words and punctuation marks are identical.
Step 1.1: obtain the example sentence text1 and the sentence to be matched text2, where text1 = {a_1, a_2, ..., a_n}, with a_1, ..., a_n the n characters of text1, and text2 = {b_1, b_2, ..., [mask], ..., b_n}, with b_1, ..., b_n the n characters of text2.

Step 1.2: perform example word recognition on the example sentence text1 and the sentence to be matched text2. Let A = text1 and B = text2, and mark color labels on text1. Solve the longest common substring of A and B by dynamic programming, as follows. The state transition equation is defined as: c[i, j] = c[i-1, j-1] + 1 if a_i = b_j, and c[i, j] = 0 otherwise. From this equation, the length of the longest common substring of A and B is max(c[i, j]), i, j ∈ {1, ..., n}. Once the length of the longest common substring is known, its starting position is marked, and the whole longest common substring can be recovered. Store the solved longest common substrings of A and B in an array L, where L = {l_1, ..., l_p}, p ≤ n. Remove the color labels of the common substrings in text1 according to the longest common substring array L; the characters that still carry color labels are the example word c_f.

Step 2: word segmentation is carried out on the characters other than the example word in the example sentence, and part-of-speech tagging is carried out on the example word.
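The Step 1.2 dynamic program can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the color labels are modeled as a boolean mask over text1, and when several common substrings tie for the maximum length, all of them are unlabeled, approximating the array L of the description.

```python
def longest_common_substrings(a, b):
    """Longest common substring by dynamic programming, following the
    transition c[i][j] = c[i-1][j-1] + 1 if a[i-1] == b[j-1], else 0."""
    n, m = len(a), len(b)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    ends = []  # end positions in a of the maximal common substrings
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                if c[i][j] > best:
                    best, ends = c[i][j], [i]
                elif c[i][j] == best:
                    ends.append(i)
    # Recover each longest common substring from its marked end position.
    return sorted({a[i - best:i] for i in ends}) if best else []

def example_word(text1, text2):
    """Characters of text1 not covered by any longest common substring
    play the role of the color-labeled example word c_f."""
    labeled = [True] * len(text1)  # every character starts color-labeled
    for sub in longest_common_substrings(text1, text2):
        start = text1.find(sub)
        while start != -1:
            for k in range(start, start + len(sub)):
                labeled[k] = False  # remove labels on common substrings
            start = text1.find(sub, start + 1)
    return "".join(ch for ch, keep in zip(text1, labeled) if keep)
```

For example, with `text1 = "thecatsat"` and `text2 = "the???sat"`, the tied longest common substrings "the" and "sat" are unlabeled, leaving "cat" as the example word.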