CN-113704420-B - Character recognition method and device in text, electronic equipment and storage medium
Abstract
The application provides a character recognition method, a character recognition device, electronic equipment and a computer readable storage medium in a text; the method comprises the steps of extracting a plurality of character candidate words from a text, obtaining at least one matching parameter corresponding to each character candidate word, selecting at least one character candidate word from the plurality of character candidate words to serve as a first candidate character entity according to the at least one matching parameter corresponding to each character candidate word, carrying out fusion processing on a question sentence corresponding to the text and the text to obtain a fusion text, carrying out entity identification processing on the fusion text to obtain at least one second candidate character entity, and carrying out character classification processing on the basis of the at least one first candidate character entity and the at least one second candidate character entity to obtain characters in the text. According to the character recognition method and device, characters can be accurately and efficiently recognized from the text.
Inventors
- LI CHENXI
- JING NING
Assignees
- 腾讯科技(深圳)有限公司
Dates
- Publication Date: 20260421
- Application Date: 20210319
- Priority Date: 20210319
Claims (17)
- 1. A method for character recognition in text, the method comprising: extracting a plurality of character candidate words from the text, and acquiring at least one matching parameter corresponding to each character candidate word; selecting, according to the at least one matching parameter corresponding to each character candidate word, at least one character candidate word from the plurality of character candidate words to serve as a first candidate character entity, wherein the first candidate character entity is obtained through an unsupervised first candidate character entity recognition model; performing fusion processing on the text and a question that corresponds to the text and does not include the entity type, to obtain a fusion text; performing entity recognition processing on the fusion text to obtain at least one second candidate character entity, wherein the at least one second candidate character entity is obtained through a second candidate character entity recognition model based on machine reading comprehension; filtering out duplicate candidate character entities among the at least one first candidate character entity and the at least one second candidate character entity; performing the following processing for each candidate character entity obtained after the filtering: combining the sentence in which the candidate character entity is located with the candidate character entity to obtain an entity-sentence pair, performing fusion processing on an entity feature and a text feature extracted from the entity-sentence pair to obtain a second fusion feature, and mapping the second fusion feature to a probability of belonging to a character entity; and when the probability of belonging to a character entity is greater than a character probability threshold and the word frequency of the candidate character entity in the text is greater than a character word frequency threshold, determining that the candidate character entity is a character in the text.
- 2. The method of claim 1, wherein the types of the matching parameters comprise word frequency, solidification degree, and degree of freedom; and the acquiring at least one matching parameter corresponding to each character candidate word comprises performing the following processing for each character candidate word: determining the word frequency of the character candidate word in the text; dividing the character candidate word into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text, wherein the types of the morphemes comprise individual written characters and words; determining the solidification degree of the character candidate word according to the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text; and determining left information entropy and right information entropy of the character candidate word, and determining the degree of freedom of the character candidate word according to the left information entropy and the right information entropy of the character candidate word.
- 3. The method of claim 2, wherein the determining left information entropy and right information entropy of the character candidate word, and determining the degree of freedom of the character candidate word according to the left information entropy and the right information entropy of the character candidate word, comprises: determining a plurality of left adjacent words and a plurality of right adjacent words of the character candidate word in the text; determining the sub-information entropy corresponding to each left adjacent word, and determining the sub-information entropy corresponding to each right adjacent word; determining the negative of the sum of the sub-information entropies corresponding to the left adjacent words as the left information entropy, and determining the negative of the sum of the sub-information entropies corresponding to the right adjacent words as the right information entropy; and determining the right information entropy as the degree of freedom when the left information entropy is greater than the right information entropy, and determining the left information entropy as the degree of freedom when the left information entropy is not greater than the right information entropy.
- 4. The method of claim 3, wherein the determining the sub-information entropy corresponding to each left adjacent word comprises performing the following processing for each left adjacent word: determining, as a first ratio, the ratio between the number of occurrences of the left adjacent word in the text and the number of occurrences of all adjacent words of the character candidate word in the text; and performing a logarithmic operation on the first ratio, and determining the product of the logarithmic operation result and the first ratio as the sub-information entropy corresponding to the left adjacent word; and the determining the sub-information entropy corresponding to each right adjacent word comprises performing the following processing for each right adjacent word: determining, as a second ratio, the ratio between the number of occurrences of the right adjacent word in the text and the number of occurrences of all adjacent words of the character candidate word in the text; and performing a logarithmic operation on the second ratio, and determining the product of the logarithmic operation result and the second ratio as the sub-information entropy corresponding to the right adjacent word.
- 5. The method of claim 2, wherein the determining the probability of occurrence of each morpheme in the text and the probability of occurrence of the character candidate word in the text comprises: determining, for each morpheme, a ratio between the number of occurrences of the morpheme in the text and the number of occurrences of all character candidate words in the text as a probability of occurrence of the morpheme in the text; and determining the ratio between the occurrence frequency of the character candidate words in the text and the occurrence frequency of all the character candidate words in the text as the occurrence probability of the character candidate words in the text.
- 6. The method of claim 2, wherein the determining the degree of solidification of the character candidate word according to the probability of occurrence of each morpheme in the text and the probability of occurrence of the character candidate word in the text comprises: performing product processing on the occurrence probability of each morpheme in the text to obtain a product result; determining the ratio of the occurrence probability of the character candidate words in the text to the product result as a third ratio; And carrying out logarithmic operation processing on the third ratio, and determining a logarithmic operation result as the solidification degree of the character candidate word.
- 7. The method of claim 2, wherein selecting at least one character candidate word among the plurality of character candidate words as the first candidate character entity according to the at least one matching parameter corresponding to each character candidate word comprises: selecting, from the plurality of character candidate words, a character candidate word satisfying at least one of the following conditions as the first candidate character entity: the word frequency of the character candidate word in the text exceeds a word frequency threshold; the solidification degree of the character candidate word exceeds a solidification degree threshold; and the degree of freedom of the character candidate word exceeds a degree-of-freedom threshold.
- 8. The method of claim 1, wherein performing the entity recognition process on the fused text to obtain at least one second candidate character entity comprises: Performing feature extraction processing on the fusion text to obtain a feature sequence; Mapping the characteristic sequence to obtain at least one position set; The following is performed for each of the sets of locations: And combining the characters corresponding to the starting position in the position set, the characters between the starting position and the ending position in the position set and the characters corresponding to the ending position in the position set, and determining the combined result as the second candidate character entity.
- 9. The method of claim 8, wherein mapping the feature sequence to obtain at least one position set comprises: dividing the feature sequence into a plurality of sub-features, wherein the plurality of sub-features are in one-to-one correspondence with a plurality of words in the text; mapping each sub-feature to a start probability of belonging to a start position and an end probability of belonging to an end position; selecting at least one sub-feature whose start probability is greater than a start probability threshold as a start sub-feature, and selecting at least one sub-feature whose end probability is greater than an end probability threshold as an end sub-feature; constructing at least one candidate start-stop feature set based on the selected at least one start sub-feature and at least one end sub-feature, wherein each candidate start-stop feature set comprises one start sub-feature and one end sub-feature; determining a target start-stop feature set among the at least one candidate start-stop feature set; and determining the character corresponding to the start sub-feature in the target start-stop feature set as the start position in the position set, and determining the character corresponding to the end sub-feature in the target start-stop feature set as the end position in the position set.
- 10. The method of claim 9, wherein the determining a target start-stop feature set from the at least one candidate start-stop feature set comprises: performing fusion processing on the starting sub-feature and the ending sub-feature in the candidate start-stop feature set to obtain a first fusion feature, and mapping the first fusion feature to probability of belonging to the same entity; And determining the candidate start-stop feature set with the probability of the same entity being greater than an entity probability threshold value as the target start-stop feature set in the at least one candidate start-stop feature set.
- 11. The method of claim 1, wherein after performing the entity recognition processing on the fusion text to obtain at least one second candidate character entity, the method further comprises: performing the following processing for each second candidate character entity: dividing the second candidate character entity into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate character entity in the text, wherein the types of the morphemes comprise individual written characters and words; determining the solidification degree of the second candidate character entity according to the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate character entity in the text; and determining left information entropy and right information entropy of the second candidate character entity, and determining the degree of freedom of the second candidate character entity according to the left information entropy and the right information entropy of the second candidate character entity; and filtering out, from the plurality of second candidate character entities, any second candidate character entity satisfying at least one of the following conditions: the word frequency of the second candidate character entity in the text does not exceed a word frequency threshold; the solidification degree of the second candidate character entity does not exceed a solidification degree threshold; and the degree of freedom of the second candidate character entity does not exceed a degree-of-freedom threshold.
- 12. A character recognition apparatus in a text, the apparatus comprising: a first entity recognition module, configured to extract a plurality of character candidate words from the text and acquire at least one matching parameter corresponding to each character candidate word; the first entity recognition module being further configured to select, according to the at least one matching parameter corresponding to each character candidate word, at least one character candidate word from the plurality of character candidate words to serve as a first candidate character entity, wherein the first candidate character entity is obtained through an unsupervised first candidate character entity recognition model; a second entity recognition module, configured to perform fusion processing on the text and a question that corresponds to the text and does not include the entity type, to obtain a fusion text; the second entity recognition module being further configured to perform entity recognition processing on the fusion text to obtain at least one second candidate character entity, wherein the at least one second candidate character entity is obtained through a second candidate character entity recognition model based on machine reading comprehension; and a classification module, configured to: filter out duplicate candidate character entities among the at least one first candidate character entity and the at least one second candidate character entity; for each candidate character entity obtained after the filtering, combine the sentence in which the candidate character entity is located with the candidate character entity to obtain an entity-sentence pair, perform fusion processing on an entity feature and a text feature extracted from the entity-sentence pair to obtain a second fusion feature, and map the second fusion feature to a probability of belonging to a character entity; and determine that the candidate character entity is a character in the text when the probability of belonging to a character entity is greater than a character probability threshold and the word frequency of the candidate character entity in the text is greater than a character word frequency threshold.
- 13. The apparatus of claim 12, wherein the types of the matching parameters comprise word frequency, solidification degree, and degree of freedom, and the first entity recognition module is further configured to perform the following processing for each character candidate word: determining the word frequency of the character candidate word in the text; dividing the character candidate word into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text; determining the solidification degree of the character candidate word according to the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text; and determining left information entropy and right information entropy of the character candidate word, and determining the degree of freedom of the character candidate word according to the left information entropy and the right information entropy of the character candidate word.
- 14. The apparatus of claim 13, wherein the first entity recognition module is further configured to: determine a plurality of left adjacent words and a plurality of right adjacent words of the character candidate word in the text; determine the sub-information entropy corresponding to each left adjacent word, and determine the sub-information entropy corresponding to each right adjacent word; determine the negative of the sum of the sub-information entropies corresponding to the left adjacent words as the left information entropy, and determine the negative of the sum of the sub-information entropies corresponding to the right adjacent words as the right information entropy; and determine the right information entropy as the degree of freedom when the left information entropy is greater than the right information entropy, and determine the left information entropy as the degree of freedom when the left information entropy is not greater than the right information entropy.
- 15. An electronic device, comprising: a memory for storing computer executable instructions; a processor for implementing the method for character recognition in text according to any one of claims 1 to 11 when executing computer executable instructions stored in said memory.
- 16. A computer-readable storage medium, characterized in that computer-executable instructions are stored, which when executed are adapted to implement the character recognition method in text according to any one of claims 1 to 11.
- 17. A computer program product comprising computer instructions which, when executed by a processor, implement the method of character recognition in text as claimed in any one of claims 1 to 11.
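The matching parameters recited in claims 2 to 7 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, each adjacent "word" is approximated by a single neighbouring character, each side's counts are normalised by the total number of neighbours on that same side, and the normalising count `total` (the number of occurrences of all character candidate words, per claim 5) is passed in by the caller.

```python
import math
from collections import Counter


def occurrences(text: str, piece: str) -> int:
    """Count occurrences of `piece` in `text`, overlapping ones included."""
    return sum(text.startswith(piece, i) for i in range(len(text) - len(piece) + 1))


def solidification(text: str, candidate: str, morphemes: list[str], total: int) -> float:
    """Claims 5 and 6: log of P(candidate) over the product of P(morpheme).

    `total` is the occurrence count of all candidate words (supplied by the
    caller). The candidate and every morpheme must occur in `text`.
    """
    p_candidate = occurrences(text, candidate) / total
    p_morphemes = math.prod(occurrences(text, m) / total for m in morphemes)
    return math.log(p_candidate / p_morphemes)


def side_entropy(neighbours: Counter) -> float:
    """Claims 3 and 4: negative of the summed ratio * log(ratio) terms."""
    total = sum(neighbours.values())
    return -sum((n / total) * math.log(n / total) for n in neighbours.values())


def degree_of_freedom(text: str, candidate: str) -> float:
    """Claim 3: the smaller of the left and right information entropies."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1  # assumption: one character = one adjacent word
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)
    return min(side_entropy(left), side_entropy(right))


def passes_thresholds(freq: float, solid: float, freedom: float,
                      freq_min: float, solid_min: float, freedom_min: float) -> bool:
    """Claim 7: keep a candidate that satisfies at least one condition."""
    return freq > freq_min or solid > solid_min or freedom > freedom_min
```

A candidate that always appears next to the same neighbour gets an entropy of zero on that side, so its degree of freedom is zero; taking the smaller of the two entropies therefore penalises fragments that are really part of a longer word. All threshold values are left to the caller because the claims do not fix them.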
Description
Character recognition method and device in text, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and apparatus for character recognition in text, an electronic device, and a computer readable storage medium.
Background
Artificial intelligence (AI) comprises the theories, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. With continued research and progress, artificial intelligence technology has been developed and applied in a variety of fields.
Taking character recognition in text (e.g., novels) as an example, character names (hereinafter simply referred to as character names or characters) in text generally include various names, titles, and forms of address, and differ from common person names. When characters in text are identified with a general-purpose person-name recognition model from the related art, the particular nature of the character type leads to incomplete recognition; that is, the accuracy of the identified characters is low and unnecessary computing resources are consumed.
The related art therefore offers no effective solution for accurately and efficiently identifying characters from text.
Disclosure of Invention
The embodiments of the present application provide a character recognition method in text, a character recognition apparatus, an electronic device, and a computer readable storage medium, which can accurately and efficiently recognize characters from text.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a character recognition method in text, comprising: extracting a plurality of character candidate words from the text, and acquiring at least one matching parameter corresponding to each character candidate word; selecting, according to the at least one matching parameter corresponding to each character candidate word, at least one character candidate word from the plurality of character candidate words to serve as a first candidate character entity; performing fusion processing on the question corresponding to the text and the text to obtain a fusion text; performing entity recognition processing on the fusion text to obtain at least one second candidate character entity; and performing character classification processing based on the at least one first candidate character entity and the at least one second candidate character entity to obtain the characters in the text.
In the above scheme, the extracting the entity feature and the text feature from the entity-sentence pair comprises: extracting a plurality of word vectors from the entity-sentence pair, and determining the average of the plurality of word vectors as the entity feature; encoding the entity-sentence pair in the direction from the start position to the end position to obtain a forward encoding vector; encoding the entity-sentence pair in the direction from the end position to the start position to obtain a backward encoding vector; and performing fusion processing on the forward encoding vector and the backward encoding vector to obtain the text feature.
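The final classification stage, which claim 1 spells out as de-duplication followed by a two-threshold decision, can be sketched as below. This is only an illustrative sketch: `recognize_characters`, the `role_probability` callable, and the default threshold values are hypothetical, and the callable stands in for the trained classifier that, in the patent, scores an entity-sentence pair rather than the whole text.

```python
from typing import Callable, Iterable


def recognize_characters(
    text: str,
    first_candidates: Iterable[str],
    second_candidates: Iterable[str],
    role_probability: Callable[[str, str], float],
    prob_threshold: float = 0.5,  # hypothetical value; not given in the source
    freq_threshold: int = 1,      # hypothetical value; not given in the source
) -> list[str]:
    """Merge both candidate lists, drop duplicates, and apply the two thresholds."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    merged = list(dict.fromkeys([*first_candidates, *second_candidates]))
    characters = []
    for candidate in merged:
        probability = role_probability(candidate, text)  # stand-in for the classifier
        frequency = text.count(candidate)                # word frequency in the text
        # Both conditions must hold for the candidate to count as a character.
        if probability > prob_threshold and frequency > freq_threshold:
            characters.append(candidate)
    return characters
```

Requiring both a high classifier probability and a minimum word frequency means that a candidate the classifier likes but that appears only once in the text is still rejected.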
In the above scheme, the extracting a plurality of character candidate words from the text comprises: acquiring the text and preprocessing the text, wherein the preprocessing comprises dividing the text into a plurality of sentences according to a symbol list and filtering out the symbols in each sentence; and extracting a plurality of character candidate words from each sentence of the preprocessed text.
An embodiment of the present application provides a character recognition apparatus in text, comprising: a first entity recognition module, configured to extract a plurality of character candidate words from the text and acquire at least one matching parameter corresponding to each character candidate word; the first entity recognition module being further configured to select, according to the at least one matching parameter corresponding to each character candidate word, at least one character candidate word from the plurality of character candidate words to serve as a first candidate character entity; a second entity recognition module, configured to perform fusion processing on the question corresponding to the text and the text to obtain a fusion text; the second entity recognition module being further configured to perform entity recognition processing on the fusion text to obtain at least one second candidate character entity; and a classification module, configured to perform character classification processing based on at least one fi