CN-121996794-A - Deduplication method for mixed Chinese-English text based on density clustering and semantic verification
Abstract
The application discloses a deduplication method for mixed Chinese-English text based on density clustering and semantic verification. The method comprises: converting each text into a semantic vector and obtaining first clusters with a density clustering algorithm; iteratively optimizing the first clusters, using a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and deduplicating each second cluster with a preset semantic screening algorithm and a preset expression screening algorithm to obtain a deduplicated text list. In this way, the optimal clustering threshold is found automatically by iterative optimization and does not need to be set manually, so the method adapts to mixed Chinese-English texts of different semantic densities and achieves semantic consistency within each cluster. The preset semantic screening algorithm and the preset expression screening algorithm preserve semantic uniqueness and expression diversity, which avoids misjudging expression similarity because of differences in language structure.
Inventors
- LI JIAXIANG
- GUO JIANLIN
- JIAN WEIDONG
- WANG CHAO
Assignees
- 深圳市有方科技股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251211
Claims (10)
- 1. A deduplication method for mixed Chinese-English text based on density clustering and semantic verification, characterized by comprising the following steps: converting each text into a semantic vector, and obtaining first clusters based on a density clustering algorithm; iteratively optimizing the first clusters, based on a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and deduplicating each second cluster based on a preset semantic screening algorithm and a preset expression screening algorithm to obtain a deduplicated text list.
- 2. The method for performing text de-duplication based on density clustering and semantic verification according to claim 1, wherein the converting each text into each semantic vector, based on a density clustering algorithm, obtains a first cluster, includes: converting each text into a corresponding semantic vector based on a text embedding model; normalizing each semantic vector, and calculating cosine distances between the normalized semantic vectors; And obtaining a sample access sequence, a plurality of first clusters and corresponding reachable distances based on each cosine distance and density clustering algorithm.
- 3. The method for performing text de-duplication based on density clustering and semantic verification according to claim 2, wherein the performing iterative optimization on the first cluster based on semantic verification of a binary search framework combined with a large model to obtain a cluster threshold value comprises: based on the binary search framework, performing iterative search in a preset neighborhood radius range, and calculating an intermediate threshold value in each iteration; updating each of the first clusters based on the intermediate threshold, the sample access order, and the reachable distance; Taking the intermediate value of the cluster diameter corresponding to the updated first cluster as an intermediate diameter, and inputting two samples corresponding to the intermediate diameter into a large model to obtain a semantic verification result; And optimizing the neighborhood radius range based on the semantic verification result until iteration is completed, and obtaining the clustering threshold value.
- 4. The method for performing text de-duplication based on density clustering and semantic verification according to claim 3, wherein optimizing the neighborhood radius range based on the semantic verification result until iteration is completed, to obtain the clustering threshold, includes: in response to the semantic verification result being semantically consistent, updating the minimum neighborhood radius of the neighborhood radius range to the intermediate threshold; in response to the semantic verification result being semantically inconsistent, updating the maximum neighborhood radius of the neighborhood radius range to the intermediate threshold; and iterating until the neighborhood radius range does not exceed a preset precision threshold or a preset number of iterations is reached, and taking the minimum neighborhood radius as the clustering threshold after iteration is completed.
- 5. The method for removing duplication of Chinese and English mixed text based on density clustering and semantic verification according to claim 1, wherein deduplicating each second cluster based on a preset semantic screening algorithm and a preset expression screening algorithm to obtain a deduplicated text list comprises: for each second cluster, calculating the average of all its samples and normalizing it to obtain a cluster center; screening a plurality of candidate samples from all the second clusters based on the preset semantic screening algorithm and the cluster centers; calculating the expression similarity between the candidate samples based on the preset expression screening algorithm, and screening out a plurality of final samples; and converting the original texts corresponding to the plurality of final samples into a structured list to obtain the deduplicated text list.
- 6. The method for performing text de-duplication based on density clustering and semantic verification according to claim 5, wherein the screening a plurality of candidate samples from all the second clusters based on the preset semantic screening algorithm and the clustering center comprises: For each sample in each second cluster, calculating cosine distances between the sample and the cluster centers of the rest of the second clusters, and calculating average cosine distances; And in each second cluster, screening out the first samples with the largest average cosine distance as the candidate samples.
- 7. The method for performing text de-duplication based on density clustering and semantic verification according to claim 5, wherein calculating the expression similarity between the candidate samples based on the preset expression screening algorithm and screening out a plurality of final samples includes: calculating the expression similarity between the candidate samples based on a character-level n-gram model; in response to there being candidate samples whose expression similarity does not exceed a preset similarity threshold, taking the candidate samples whose expression similarity does not exceed the preset similarity threshold as the final samples; and in response to there being no sample whose expression similarity does not exceed the preset similarity threshold, taking the candidate sample with the largest average cosine distance as the final sample.
- 8. A deduplication device for mixed Chinese-English text based on density clustering and semantic verification, characterized by comprising: a first clustering module, used for converting each text into a semantic vector and obtaining first clusters based on a density clustering algorithm; a threshold optimization module, used for iteratively optimizing the first clusters, based on a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; a second clustering module, used for updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and a screening and deduplication module, used for deduplicating each second cluster based on a preset semantic screening algorithm and a preset expression screening algorithm to obtain a deduplicated text list.
- 9. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the density clustering and semantic verification based hybrid text deduplication method of any of claims 1-7.
- 10. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the density clustering and semantic verification based mixed Chinese-English text deduplication method of any of claims 1-7.
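The threshold search of claims 3 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the default precision and iteration limit, the simplified cut of the reachable-distance sequence (no noise label), the use of the lower median, and the rule for an iteration with no multi-sample cluster are all assumptions made for illustration. The large-model check is replaced by a hypothetical callable `is_consistent(i, j)`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Cosine distance for vectors that are already unit-normalized."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def clusters_from_reachability(order: Sequence[int], reach: Sequence[float],
                               eps: float) -> List[List[int]]:
    """Cut the ordered reachable-distance sequence at eps.

    Simplification (assumption): every sample joins a cluster, none is noise.
    """
    clusters: List[List[int]] = []
    for idx in order:
        if not clusters or reach[idx] > eps:
            clusters.append([])  # a jump above eps starts a new cluster
        clusters[-1].append(idx)
    return clusters


def farthest_pair(cluster: Sequence[int],
                  vectors: Sequence[Vector]) -> Tuple[float, int, int]:
    """Return (diameter, i, j) for the two most distant samples of a cluster."""
    return max(
        (cosine_distance(vectors[i], vectors[j]), i, j)
        for k, i in enumerate(cluster)
        for j in cluster[k + 1:]
    )


def search_threshold(vectors: Sequence[Vector], order: Sequence[int],
                     reach: Sequence[float],
                     is_consistent: Callable[[int, int], bool],
                     lo: float, hi: float,
                     precision: float = 1e-3, max_iter: int = 20) -> float:
    """Binary search over the neighborhood radius range [lo, hi]."""
    for _ in range(max_iter):
        if hi - lo <= precision:
            break
        mid = (lo + hi) / 2  # intermediate threshold
        clusters = [c for c in clusters_from_reachability(order, reach, mid)
                    if len(c) >= 2]
        if not clusters:
            lo = mid  # assumption: nothing to verify, so widen the radius
            continue
        pairs = sorted(farthest_pair(c, vectors) for c in clusters)
        _, i, j = pairs[(len(pairs) - 1) // 2]  # lower-median cluster diameter
        if is_consistent(i, j):
            lo = mid  # consistent: raise the minimum radius
        else:
            hi = mid  # inconsistent: lower the maximum radius
    return lo


# Toy data: two tight groups of unit vectors, {0, 1} and {2, 3}.
vectors = [(math.cos(t), math.sin(t)) for t in (0.0, 0.1, 1.0, 1.1)]
order = [0, 1, 2, 3]
reach = [math.inf] + [cosine_distance(vectors[i], vectors[i + 1]) for i in range(3)]
# Stand-in for the large-model verification (hypothetical).
eps = search_threshold(vectors, order, reach, lambda i, j: i // 2 == j // 2,
                       lo=0.0, hi=1.0)
print(eps, clusters_from_reachability(order, reach, eps))
```

In this toy run the search settles just below the reachable distance that separates the two groups, so cutting at the returned threshold keeps them as separate clusters. In practice the ordering and reachable distances would come from the density clustering step of claim 2, and `is_consistent` would wrap a call to the large model.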
Description
Deduplication method for mixed Chinese-English text based on density clustering and semantic verification
Technical Field
The application relates to the field of natural language processing, in particular to a deduplication method for mixed Chinese-English text based on density clustering and semantic verification.
Background
With the explosive growth of internet text data, especially text containing mixed Chinese and English expressions such as technical questions and answers, product descriptions, and user comments, repeated or highly similar text not only occupies redundant storage but also reduces the efficiency and accuracy of subsequent text analysis (such as meaning identification or knowledge graph construction). However, existing deduplication techniques for mixed Chinese-English text have the following defects.
The limitation of pure algorithmic clustering: a traditional pure density clustering algorithm depends on a manually set distance threshold, which makes it hard to adapt to the semantic differences of mixed Chinese-English text. Chinese is built from single characters while English is built from letters or words, so the same meaning can be expressed very differently, and a fixed threshold easily causes "over-clustering" (the same semantics split into multiple clusters) or "under-clustering" (different semantics merged into one cluster).
The limitation of the pure large-model method: although existing large models have longer context windows, corpus deduplication tasks in scenarios such as dataset construction have token requirements that prevent a large model from deduplicating the whole corpus. Even for an ultra-small-scale corpus whose context requirement fits within the window, Context Rot of the large model can make the deduplication result unusable.
Imbalance of expression diversity and semantic consistency: existing methods either retain only a single sample and lose expression diversity, or cannot effectively filter text that is "semantically consistent but repeated in expression" (e.g., "how to use API" and "how to use API. Most deduplication methods are designed for pure Chinese or pure English text; on Chinese sentences containing English vocabulary (such as API and user_id), differences in character structure easily cause clustering deviation, so expression similarity cannot be judged accurately.
Disclosure of Invention
The application mainly provides a deduplication method for mixed Chinese-English text based on density clustering and semantic verification. It aims to solve two problems of existing deduplication techniques: the threshold is difficult to set in a mixed Chinese-English scenario, and expression diversity and semantic consistency easily fall out of balance. To solve these technical problems, the application adopts a technical scheme that provides a deduplication method for mixed Chinese-English text based on density clustering and semantic verification.
The method comprises the following steps: converting each text into a semantic vector, and obtaining first clusters based on a density clustering algorithm; iteratively optimizing the first clusters, based on a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and deduplicating each second cluster based on a preset semantic screening algorithm and a preset expression screening algorithm to obtain a deduplicated text list.
In an optional implementation of the embodiment of the present application, converting each text into a semantic vector and obtaining first clusters based on a density clustering algorithm includes: converting each text into a corresponding semantic vector based on a text embedding model; normalizing each semantic vector, and calculating cosine distances between the normalized semantic vectors; and obtaining a sample access order, a plurality of first clusters, and corresponding reachable distances based on the cosine distances and the density clustering algorithm.
In an optional implementation of the embodiment of the present application, iteratively optimizing the first clusters based on the binary search framework combined with semantic verification by the large model to obtain a clustering threshold includes: based on the binary search framework, performing iterative search in a preset neighborhood radius range, and calculating an intermediate threshold value in each iteration; updating each of the first clusters based on the intermediate threshold, the sample access order, and the reachable di