CN-122021554-A - Sliding window-based heterogeneous text matrix local matching association degree calculation method
Abstract
The application provides a sliding window-based method for calculating the local matching association degree of heterogeneous text matrices, and belongs to the technical field of text information processing. The method addresses the loss of association degree accuracy caused by vector padding or truncation when prior-art approaches process texts of different lengths. The method comprises: obtaining a first text and a second text; converting them into a first text representation and a second text representation of different lengths; determining the longer and the shorter text representation according to their lengths; defining a sliding window whose size equals the length of the shorter text representation; moving the sliding window over the longer text representation by a preset step size to obtain a plurality of local representations; calculating, for each local representation, a local association degree value between that local representation and the shorter text representation; and finally selecting, according to a preset rule, a representative value from all calculated local association degree values as the final association degree. Because the text representations need not be padded or truncated, the method avoids introducing padding noise or discarding information, which improves the accuracy of association degree calculation between heterogeneous texts.
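The steps summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: each text representation is assumed to be a list of equal-dimension word vectors, the local association degree is assumed to be the cosine similarity of mean-pooled vectors, and the preset rule is assumed to be "take the maximum". The function names (`mean_vector`, `cosine`, `local_scores`, `association_degree`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Average a non-empty sequence of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_scores(first: Sequence[Vector], second: Sequence[Vector],
                 step: int = 1) -> List[float]:
    """Slide a window the size of the shorter representation over the longer one."""
    if not first or not second:
        raise ValueError("both representations must be non-empty")
    if step < 1:
        raise ValueError("step must be a positive integer")
    longer, shorter = (first, second) if len(first) >= len(second) else (second, first)
    window = len(shorter)                    # window size = shorter length
    target = mean_vector(shorter)
    scores = []
    for start in range(0, len(longer) - window + 1, step):
        local = longer[start:start + window]  # one local representation
        scores.append(cosine(mean_vector(local), target))
    return scores


def association_degree(first: Sequence[Vector], second: Sequence[Vector],
                       step: int = 1) -> float:
    """Assumed preset rule: the maximum local similarity is the final value."""
    return max(local_scores(first, second, step))


# Toy data (hypothetical): a 3-token text and a 1-token text in 2 dimensions.
long_rep = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
short_rep = [[0.0, 1.0]]
scores = local_scores(long_rep, short_rep)
final = association_degree(long_rep, short_rep)
```

Neither representation is padded or truncated: the shorter one is compared only against slices of the longer one that have exactly its length. Swapping the argument order gives the same result, because the longer and shorter roles are decided by length.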
Inventors
- LI MING
- YUAN YE
- KONG FEI
Assignees
- 北京中绿讯科科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251230
Claims (10)
- 1. A sliding window-based method for calculating the local matching association degree of heterogeneous text matrices, characterized by comprising the following steps: Step one, obtaining a first text and a second text to be compared, and converting the first text and the second text into a first text representation with a first length and a second text representation with a second length respectively, wherein the first length and the second length are different; Step two, determining a longer text representation and a shorter text representation according to the first length and the second length, and defining a sliding window, wherein the size of the sliding window is the same as the length of the shorter text representation; Step three, moving the sliding window over the longer text representation according to a preset step size to obtain a plurality of local representations of the longer text representation; Step four, for each local representation, calculating a local association degree value between the local representation and the shorter text representation; Step five, selecting one or more representative values from all the calculated local association degree values according to a preset rule, and taking the representative values as the final association degree of the first text and the second text.
- 2. The method of claim 1, wherein the step of converting the first text and the second text into text representations comprises: performing word segmentation on the first text and the second text respectively; extracting, for each segmented word, multidimensional features comprising at least part of speech, word frequency, and word sense; and vectorizing the multidimensional features and constructing the text representation in the form of a three-dimensional array matrix.
- 3. The method of claim 1, wherein the text representation is a sequence of word vectors generated by a pre-trained language model.
- 4. The method of claim 3, wherein the step of calculating the local association degree value comprises: calculating the average vector of the word vector sequence contained in the local representation and the average vector of the word vector sequence contained in the shorter text representation, respectively; and calculating the cosine similarity between the two average vectors as the local association degree value.
- 5. The method of claim 1, wherein the local association degree value includes at least one of a similarity value obtained by cosine similarity calculation and a distance value obtained by Euclidean distance calculation.
- 6. The method of claim 5, wherein the preset rules include at least one of: selecting a maximum value from all the calculated similarity values as a final association degree; and selecting a minimum value from all the calculated distance values as a final association degree.
- 7. The method of claim 1, wherein the preset rule comprises: sorting all the calculated local association degree values; selecting the local association degree values ranked within a preset top percentage; and calculating the final association degree as a weighted average of the selected local association degree values.
- 8. The method of claim 1, wherein the preset step size is a fixed value of 1.
- 9. The method of claim 1, wherein the preset step size is an adaptive step size, and the adjustment comprises: setting an initial step size and an association degree threshold; when the calculated local association degree value is lower than the association degree threshold, using the initial step size for the next movement; and when the calculated local association degree value is not lower than the association degree threshold, using a fine step size smaller than the initial step size for the next movement.
- 10. The method of claim 1, wherein calculating the local association degree value includes employing at least one of Jaccard similarity or Manhattan distance.
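Two of the dependent claims describe procedures that are easier to follow in code: the top-percentage weighted average of claim 7 and the adaptive step size of claim 9. The sketch below is illustrative only. The claims do not specify the weights, so linearly decaying rank weights are an assumption here, and the function names and the `score_at` callback (which returns the local association degree value for a window starting at a given position) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_percent_weighted_average(scores: Sequence[float], percent: float) -> float:
    """Claim 7 sketch: sort, keep the top `percent` fraction, weighted-average them."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < percent <= 1.0:
        raise ValueError("percent must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent))
    top = ranked[:keep]
    # Assumption: rank-based weights keep, keep-1, ..., 1 (not given in the claims).
    weights = [keep - rank for rank in range(keep)]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


def adaptive_window_scan(long_length: int, window: int,
                         score_at: Callable[[int], float],
                         initial_step: int, fine_step: int,
                         threshold: float) -> List[Tuple[int, float]]:
    """Claim 9 sketch: coarse steps over weak regions, fine steps over strong ones."""
    if not 0 < fine_step < initial_step:
        raise ValueError("fine_step must be positive and smaller than initial_step")
    visited = []
    start = 0
    while start + window <= long_length:
        score = score_at(start)
        visited.append((start, score))
        # Below the threshold: keep the initial step. Otherwise: switch to the fine step.
        start += initial_step if score < threshold else fine_step
    return visited


# Toy values (hypothetical): one score per possible window start.
profile = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1]
visited = adaptive_window_scan(long_length=10, window=3,
                               score_at=lambda i: profile[i],
                               initial_step=2, fine_step=1, threshold=0.5)
final = top_percent_weighted_average([s for _, s in visited], percent=0.5)
```

With these toy values the scan skips quickly through the low-scoring starts and samples consecutive positions once a score reaches the threshold, so fewer windows are evaluated than with a fixed step size of 1.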
Description
Sliding window-based heterogeneous text matrix local matching association degree calculation method
Technical Field
The application relates to the technical field of computers, in particular to text information processing, and more particularly to a sliding window-based method for calculating the local matching association degree of heterogeneous text matrices.
Background
In the field of text information processing, text association degree calculation is a core technology for information retrieval, content recommendation, duplicate detection, and other applications. In the prior art, text is generally converted into fixed-dimension vectors by a bag-of-words model, a term frequency-inverse document frequency (TF-IDF) algorithm, or a deep learning model, and the similarity between the vectors is then calculated with cosine similarity or a similar measure to represent the association degree of the texts. However, when the two texts to be compared differ in length, the text representations generated in this manner (for example, word vector sequences) also differ in length. To make the subsequent association degree calculation possible, the prior art generally uses padding or truncation to force the two text representations to the same size. For example, the shorter text representation is zero-padded to the length of the longer one, or the longer text representation is truncated to the length of the shorter one. This approach has clear drawbacks: padding introduces a large amount of meaningless noise data that dilutes the effective features of the original text, and truncation may discard key text information. Both can seriously reduce the accuracy of the final association degree calculation.
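The drawback of padding and truncation described above can be illustrated with a toy case. The sketch below is an illustration under assumptions, not a measurement from the application: it assumes one-hot word vectors and a position-wise comparison in which each representation is flattened into a single vector before cosine similarity is computed. All names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def flatten(rep: Sequence[Vector]) -> List[float]:
    """Concatenate a sequence of word vectors into one long vector."""
    return [x for vec in rep for x in vec]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def zero_pad(rep: Sequence[Vector], length: int) -> List[Vector]:
    """Append all-zero vectors until the representation has `length` rows."""
    dim = len(rep[0])
    return list(rep) + [[0.0] * dim for _ in range(length - len(rep))]


# Four distinct "words" as one-hot vectors (toy data).
a, b, c, d = ([1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0])
longer = [a, b, c, d]   # the longer text ends with the words c, d
shorter = [c, d]        # the shorter text is exactly c, d

# Prior-art style 1: zero-pad the shorter text to the longer length.
padded_score = cosine(flatten(longer), flatten(zero_pad(shorter, len(longer))))
# Prior-art style 2: truncate the longer text to the shorter length.
truncated_score = cosine(flatten(longer[:len(shorter)]), flatten(shorter))
# For contrast: compare against the slice where the shorter text actually occurs.
aligned_score = cosine(flatten(longer[2:4]), flatten(shorter))
```

In this toy case both `padded_score` and `truncated_score` are zero even though the shorter text appears verbatim at the end of the longer one, while `aligned_score` is approximately 1. Real embeddings would not behave this starkly, but the example shows how position-dependent padding or truncation can hide a local match.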
In addition, some prior-art schemes process text with sliding windows, but they typically divide a long text into multiple independent segments that are processed separately, or use fixed-size windows to extract local features. These methods neither address how to compare two complete texts of different lengths end to end, nor avoid the inconsistent comparison baseline caused by differing text lengths, so the accuracy of association degree calculation for heterogeneous texts remains limited.
Disclosure of Invention
The application aims to provide a sliding window-based method for calculating the local matching association degree of heterogeneous text matrices, in order to solve the technical problem that association degree calculation accuracy is reduced when the prior art pads or truncates vectors to process texts of different lengths.
The method comprises: first, obtaining a first text and a second text to be compared, and converting them into a first text representation with a first length and a second text representation with a second length respectively; determining a longer text representation and a shorter text representation according to the first length and the second length, and defining a sliding window whose size is the same as the length of the shorter text representation; then moving the sliding window over the longer text representation according to a preset step size to obtain a plurality of local representations of the longer text representation; then, for each local representation, calculating a local association degree value between that local representation and the shorter text representation; and finally selecting, according to a preset rule, one or more representative values from all calculated local association degree values as the final association degree of the first text and the second text.
Optionally, the step of converting the first text and the second text into text representations specifically includes: performing word segmentation on the first text and the second text respectively; extracting, for each segmented word, multidimensional features comprising at least word class, part of speech, word frequency, and word meaning; and vectorizing the multidimensional features and constructing the text representation in the form of a three-dimensional array matrix.
Optionally, the text representation is a sequence of word vectors generated by a pre-trained language model.
Further, when the text representation is a word vector sequence, the step of calculating the local association degree value specifically includes: calculating the average vector of the word vector sequence contained in the local representation and the average vector of the word vector sequence contained in the shorter text representation, respectively; and calculating the cosine similarity between the two average vectors as the local association degree value.
Optionally, the local association degree value includes at least one of a similarity value obtained by cosine similarity calculation and a distance