CN-121029964-B - Text matching method, device, equipment and storage medium based on probability distribution
Abstract
The application discloses a probability-distribution-based text matching method, apparatus, device, and storage medium in the technical field of natural language processing. The method comprises the following steps: the semantic feature distribution of each expertise text is obtained from an expertise base to yield a knowledge probability distribution set, so that the inherent features of each text are abstracted as a probability distribution and the knowledge base gains a standardized representation. A user input text is then acquired and its corresponding probability distribution is calculated. The similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set is calculated; because overall semantic similarity is measured by a distance between distributions rather than by local matching, distribution deviations can be tolerated during comparison. Finally, the text matching result is determined by the minimum similarity distance, so that semantically related expertise can be retrieved stably under heavy noise, markedly improving both fault tolerance and matching precision.
Inventors
- YI XIAOLIN
- YANG HONGBING
- Fu Zhongqiong
- WANG AIHUA
- Yuan Fangqi
Assignees
- Hubei Taiyue Satellite Technology Development Co., Ltd. (湖北泰跃卫星技术发展股份有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20250807
Claims (7)
- 1. A text matching method based on probability distribution, the method comprising: acquiring each expertise text from an expertise base, and calculating a word frequency or word vector of the expertise text; calculating the probability distribution of each knowledge text according to the word frequency or the word vector to obtain a knowledge probability distribution set; acquiring a user input text and calculating the corresponding user text probability distribution; calculating the similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set; and determining the text matching result according to the minimum of the similarity distances; wherein the step of calculating the knowledge text probability distribution according to the word frequency or word vector to obtain the knowledge probability distribution set comprises: if the probability distribution is calculated according to word frequency, acquiring each expertise text from the expertise base and extracting the specialized vocabulary of each expertise text to obtain the word frequency of the specialized vocabulary; weighting the word frequency of the specialized vocabulary by the inverse document frequency to obtain a weighted result; and normalizing the weighted result to obtain the knowledge probability distribution set; if the probability distribution is calculated according to word vectors, numbering each knowledge text in the expertise base in sequence to obtain a knowledge number set, the knowledge number set comprising the mapping between each knowledge text and its knowledge number; selecting a target knowledge text and vectorizing it to obtain a word vector; performing a probability distribution calculation on each dimension of the word vector according to a probability distribution formula to obtain the probability distribution of the target knowledge text; and establishing the mapping between each knowledge number and the probability distribution of the target knowledge text to obtain the knowledge probability distribution set.
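The word-frequency branch of claim 1 (specialized-vocabulary frequency, inverse-document-frequency weighting, normalization) can be sketched as follows. Tokenization and specialized-vocabulary extraction are assumed to happen upstream, and the claim does not fix an exact weighting formula, so the standard TF-IDF form stands in for it:

```python
import math
from collections import Counter

def knowledge_distributions(texts):
    """Build one probability distribution per knowledge text via TF-IDF.

    `texts` is a list of token lists; each resulting dict maps a term to
    its normalized weight, so every dict sums to 1.
    """
    n_docs = len(texts)
    # Document frequency of each term across the whole knowledge base.
    df = Counter()
    for tokens in texts:
        df.update(set(tokens))

    distributions = []
    for tokens in texts:
        tf = Counter(tokens)
        # Weight raw term frequency by inverse document frequency; the tiny
        # epsilon keeps terms present in every document from zeroing out.
        weights = {t: tf[t] * math.log((1 + n_docs) / (1 + df[t])) + 1e-12
                   for t in tf}
        total = sum(weights.values())
        # Normalize so the weights form a probability distribution.
        distributions.append({t: w / total for t, w in weights.items()})
    return distributions
```

Terms that are rare across the base receive more probability mass than terms shared by every document, which matches the claim's intent of emphasizing specialized vocabulary.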
- 2. The method of claim 1, wherein the step of selecting the target knowledge text comprises: if a knowledge text is being selected for the first time, selecting the knowledge text corresponding to the initial number in the knowledge number set as the target knowledge text; otherwise, incrementing the current number to obtain a target number, and selecting the knowledge text corresponding to the target number as the target knowledge text.
- 3. The method of claim 1, wherein the step of establishing a mapping between each knowledge number and the probability distribution of the target knowledge text to obtain the knowledge probability distribution set comprises: traversing the knowledge number set; and if an unprocessed knowledge text exists, taking that knowledge text as the target knowledge text and returning to the step of vectorizing the target knowledge text to obtain a word vector.
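The selection-and-traversal loop of claims 2 and 3 amounts to a sequential walk over the knowledge number set. A minimal sketch, where the per-text processing step is a placeholder for the vectorize-and-distribute work of claim 1:

```python
def process_all(numbered_texts, process):
    """Walk the knowledge number set as in claims 2-3: start at the
    initial number, and after each text increment the number to pick
    the next target, until no unprocessed text remains."""
    results = {}
    number = min(numbered_texts)          # initial number on first selection
    while number in numbered_texts:
        target = numbered_texts[number]   # knowledge text for this number
        results[number] = process(target) # stands in for vectorize + distribute
        number += 1                       # increment to get the next target
    return results
```

The returned dict is the number-to-distribution mapping the claims describe, assuming `process` produces a distribution.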
- 4. The method of claim 1, wherein the step of calculating the similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set comprises: calculating the similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set through the Bhattacharyya distance formula; wherein the Bhattacharyya distance formula is expressed as: $d_k = -\ln\left(\sum_{i=1}^{n} \sqrt{p_k(w_i)\, q(w_i)}\right)$; where $i$ is the sequence number of a specialized word in the specialized vocabulary set, $n$ is the number of specialized words in the set, $p_k$ is the probability distribution of the $k$-th knowledge text (derived from the knowledge text word vector), $w_i$ is the $i$-th specialized word, $q$ is the probability distribution of the user input text (derived from the word vector of the user input text), and $d_k$ is the similarity distance between the probability distribution of the $k$-th knowledge text and the probability distribution of the user input text.
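The distance of claim 4 can be computed directly; a minimal sketch of the Bhattacharyya distance over two term-to-probability dicts:

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability
    distributions given as term -> probability dicts."""
    # Bhattacharyya coefficient: terms absent from either dict contribute 0.
    bc = sum(math.sqrt(p[t] * q[t]) for t in p.keys() & q.keys())
    # Disjoint supports give a coefficient of 0; report infinite distance.
    return -math.log(bc) if bc > 0 else float("inf")
```

Identical distributions give a coefficient of 1 and hence a distance of 0; the distance grows as the distributions diverge, which is why the minimum over the knowledge base identifies the best match.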
- 5. A text matching device based on probability distribution, the device comprising: a first probability calculation module for acquiring each expertise text from an expertise base and calculating a word frequency or word vector of the expertise text; a second probability calculation module for acquiring the text input by the user and calculating the corresponding user text probability distribution; a similarity distance calculation module for calculating the similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set; and a text matching determining module for determining the text matching result according to the minimum of the similarity distances; wherein the first probability calculation module is further used for: if the probability distribution is calculated according to word frequency, acquiring each expertise text from the expertise base, extracting the specialized vocabulary of each expertise text to obtain the word frequency of the specialized vocabulary, weighting the word frequency of the specialized vocabulary by the inverse document frequency to obtain a weighted result, and normalizing the weighted result to obtain the knowledge probability distribution set; and if the probability distribution is calculated according to word vectors, numbering each knowledge text in the expertise base in sequence to obtain a knowledge number set comprising the mapping between each knowledge text and its knowledge number, selecting a target knowledge text, vectorizing it to obtain a word vector, performing a probability distribution calculation on each dimension of the word vector according to a probability distribution formula to obtain the probability distribution of the target knowledge text, and establishing the mapping between each knowledge number and the probability distribution of the target knowledge text to obtain the knowledge probability distribution set.
- 6. Text matching equipment based on probability distribution, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the probability-distribution-based text matching method according to any one of claims 1 to 4.
- 7. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the probability-distribution-based text matching method according to any one of claims 1 to 4.
Description
Text matching method, device, equipment and storage medium based on probability distribution

Technical Field

The present application relates to the field of natural language processing technologies, and in particular to a method, an apparatus, a device, and a storage medium for text matching based on probability distribution.

Background

With the application of RAG (Retrieval-Augmented Generation) methods to large language models, in order to enhance the generation quality of the models in specific fields or specialized tasks, the most relevant content is usually retrieved from a specialized knowledge base after the user inputs a text, and submitted to the large language model together with the user input for processing. However, when the user input contains mistakes (e.g., misspellings, wrong key terms, grammatical confusion) or noise (e.g., irrelevant information, ambiguous expressions), the matching accuracy of such methods tends to drop significantly. How to improve the fault tolerance of text matching when noise exists in the user input is therefore a problem that needs to be solved.

Disclosure of Invention

The application mainly aims to provide a text matching method, device, equipment, and storage medium based on probability distribution, so as to solve the technical problem of low fault tolerance of text matching when noise exists in the user input.
In order to achieve the above object, the present application provides a text matching method based on probability distribution, the method comprising: acquiring the semantic feature distribution of each expertise text from an expertise base to obtain a knowledge probability distribution set; acquiring a user input text and calculating the corresponding user text probability distribution; calculating the similarity distance between the user text probability distribution and each distribution in the knowledge probability distribution set; and determining the text matching result according to the minimum of the similarity distances. In an embodiment, the step of obtaining the semantic feature distribution of each expertise text from the expertise base to obtain the knowledge probability distribution set comprises: acquiring each expertise text from the expertise base and calculating a word frequency or word vector of the expertise text; and calculating the probability distribution of each knowledge text according to the word frequency or the word vector to obtain the knowledge probability distribution set.
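The overall method above (user distribution, per-knowledge distance, minimum selection) can be sketched end to end. Tokenization is assumed to happen upstream, and raw word frequencies stand in for the IDF-weighted branch:

```python
import math
from collections import Counter

def normalize(counts):
    """Turn raw term counts into a term -> probability distribution."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def bhattacharyya(p, q):
    """Bhattacharyya distance; disjoint supports yield infinity."""
    bc = sum(math.sqrt(p[t] * q[t]) for t in p.keys() & q.keys())
    return -math.log(bc) if bc > 0 else float("inf")

def best_match(user_tokens, knowledge_dists):
    """Index of the knowledge distribution nearest to the user text,
    i.e. the minimum-similarity-distance match of the method."""
    user_dist = normalize(Counter(user_tokens))
    distances = [bhattacharyya(user_dist, kd) for kd in knowledge_dists]
    return min(range(len(distances)), key=distances.__getitem__)
```

Because the comparison is between whole distributions, a few noisy or misspelled tokens shift the user distribution only slightly, which is the fault-tolerance property the application claims.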
In one embodiment, the step of calculating the knowledge text probability distribution according to the word frequency or word vector to obtain the knowledge probability distribution set comprises: if the probability distribution is calculated according to word vectors, numbering each knowledge text in the expertise base in sequence to obtain a knowledge number set, the knowledge number set comprising the mapping between each knowledge text and its knowledge number; selecting a target knowledge text and vectorizing it to obtain a word vector; performing a probability distribution calculation on each dimension of the word vector according to a probability distribution formula to obtain the probability distribution of the target knowledge text; and establishing the mapping between each knowledge number and the probability distribution of the target knowledge text to obtain the knowledge probability distribution set. In one embodiment, the step of selecting the target knowledge text comprises: if a knowledge text is being selected for the first time, selecting the knowledge text corresponding to the initial number in the knowledge number set as the target knowledge text; otherwise, incrementing the current number to obtain a target number, and selecting the knowledge text corresponding to the target number as the target knowledge text. In an embodiment, the step of establishing the mapping between each knowledge number and the probability distribution of the target knowledge text to obtain the knowledge probability distribution set comprises: traversing the knowledge number set; and if an unprocessed knowledge text exists, taking that knowledge text as the target knowledge text and returning to the step of vectorizing the target knowledge text to obtain a word vector.
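The "probability distribution formula" applied to each dimension of the word vector is not specified in the text; a softmax over the dimensions is one common way to map an embedding to a probability distribution and is sketched here under that assumption:

```python
import math

def vector_to_distribution(vec):
    """Map an embedding to a probability distribution over its dimensions
    via softmax (an assumed stand-in for the unspecified formula)."""
    m = max(vec)                          # subtract max for numeric stability
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]
```

The output always sums to 1 and preserves the ordering of the input dimensions, so it can feed directly into the distribution-distance comparison.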
In one embodiment, the step of calculating a knowledge text probability distribution according to word frequency or word vector to obtain a knowledge probability distribution set includes: if the probability distribution of the knowledge text is calculated according to the word frequency, acquiring each expertise text from the expertise base, and extracting the specialized vocabulary of each expertise text to obtain