CN-115329752-B - Pinyin error correction method

Abstract

The invention provides a pinyin error correction method that uses real-time frequency counting to simplify the noisy channel model on which most current pinyin error correction algorithms rely, which improves the efficiency of error correction and makes the algorithm more lightweight. In addition, the method replaces the traditional edit distance calculation with direct letter substitution, so the algorithm no longer computes edit distances repeatedly. The dictionaries used by the method are built and managed per user, so they are highly personalized, targeted, and small, and can provide accurate candidate words. Compared with existing algorithms, the method greatly improves the error detection rate, candidate word precision, and execution efficiency.

Inventors

  • DENG BIAO

Assignees

  • 中科凡语(武汉)科技有限公司

Dates

Publication Date
20260505
Application Date
20220509

Claims (7)

  1. A pinyin error correction method, characterized by comprising the following steps:
     S1) inputting a common Chinese character pinyin dictionary D;
     S2) establishing a pinyin hot word dictionary G;
     S3) establishing a Bayesian dictionary S;
     S4) inputting a pinyin character string x;
     S5) matching x against the pinyin dictionary D, and if x is contained in D, returning directly to step S4 without error correction;
     S6) denoting l as the length of x and x_i as the i-th letter of the pinyin character string x, where 1 ≤ i ≤ l;
     S7) establishing a candidate pinyin character string set Y, which is initially an empty set;
     S8) repeating the following steps S81 to S85 from i = 1 to i = l:
         S81) denoting Q as the set of letters on the keys adjacent to the letter x_i on the keyboard, and m as the size of the set Q;
         S82) denoting T as the set of new pinyin character strings obtained by replacing the i-th letter x_i of x with each element of Q in turn, so that T has m elements;
         S83) deleting from T the elements that are not in the pinyin dictionary D;
         S84) merging the set T into the set Y;
         S85) increasing i by 1;
     S9) establishing an extended candidate pinyin character string set Z, initialized as an empty set;
     S10) denoting n as the size of the set Y and y_j as the j-th element of Y, where y_j ∈ Y and 1 ≤ j ≤ n, and repeating the following steps S101 to S117 from j = 1 to j = n:
         S101) if the first letter of y_j is 'c', letting t be y_j with the first letter 'c' replaced by the string 'ch', and if t is a string in the pinyin dictionary D, adding t to Z;
         S102) if the first two letters of y_j are 'ch', letting t be y_j with the first two letters 'ch' replaced by the string 'c', and if t is a string in the pinyin dictionary D, adding t to Z;
         S103) if the first letter of y_j is 's', letting t be y_j with the first letter 's' replaced by the string 'sh', and if t is a string in the pinyin dictionary D, adding t to Z;
         S104) if the first two letters of y_j are 'sh', letting t be y_j with the first two letters 'sh' replaced by the string 's', and if t is a string in the pinyin dictionary D, adding t to Z;
         S105) if the first letter of y_j is 'z', letting t be y_j with the first letter 'z' replaced by the string 'zh', and if t is a string in the pinyin dictionary D, adding t to Z;
         S106) if the first two letters of y_j are 'zh', letting t be y_j with the first two letters 'zh' replaced by the string 'z', and if t is a string in the pinyin dictionary D, adding t to Z;
         S107) if the first letter of y_j is 'f', letting t be y_j with the first letter 'f' replaced by the string 'h', and if t is a string in the pinyin dictionary D, adding t to Z;
         S108) if the first letter of y_j is 'h', letting t be y_j with the first letter 'h' replaced by the string 'f', and if t is a string in the pinyin dictionary D, adding t to Z;
         S109) if the last two letters of y_j are 'an', letting t be y_j with the last two letters 'an' replaced by the string 'ang', and if t is a string in the pinyin dictionary D, adding t to Z;
         S110) if the last two letters of y_j are 'en', letting t be y_j with the last two letters 'en' replaced by the string 'eng', and if t is a string in the pinyin dictionary D, adding t to Z;
         S111) if the last two letters of y_j are 'in', letting t be y_j with the last two letters 'in' replaced by the string 'ing', and if t is a string in the pinyin dictionary D, adding t to Z;
         S112) if the last two letters of y_j are 'on', letting t be y_j with the last two letters 'on' replaced by the string 'ong', and if t is a string in the pinyin dictionary D, adding t to Z;
         S113) if the last three letters of y_j are 'ang', letting t be y_j with the last three letters 'ang' replaced by the string 'an', and if t is a string in the pinyin dictionary D, adding t to Z;
         S114) if the last three letters of y_j are 'eng', letting t be y_j with the last three letters 'eng' replaced by the string 'en', and if t is a string in the pinyin dictionary D, adding t to Z;
         S115) if the last three letters of y_j are 'ing', letting t be y_j with the last three letters 'ing' replaced by the string 'in', and if t is a string in the pinyin dictionary D, adding t to Z;
         S116) if the last three letters of y_j are 'ong', letting t be y_j with the last three letters 'ong' replaced by the string 'on', and if t is a string in the pinyin dictionary D, adding t to Z;
         S117) increasing j by 1;
     S11) merging the set Z into the set Y;
     S12) sorting the set Y in descending order and outputting the top three candidate pinyin words;
     S13) recording the candidate pinyin word selected by the user and updating it into the dictionary G and the dictionary S.
  2. The pinyin error correction method of claim 1, wherein the dictionary D in step S1) includes 10,000 common pinyin words of homophones, each pinyin word being a string of letters in which case is not distinguished.
  3. The pinyin error correction method of claim 1, wherein the dictionary G in step S2) is created by recording each pinyin word entered by the user together with its frequency of occurrence, and sorting the pinyin words in descending order of frequency.
  4. The pinyin error correction method of claim 1, wherein each record of the Bayesian dictionary S in step S3) contains the pinyin strings of two Chinese characters and their occurrence frequency; the current pinyin string is denoted A and the string preceding A is denoted B; and the Bayesian dictionary S is established as follows: if B is an empty string, A is ignored; if B is not an empty string, BA is treated as one pinyin string, and if BA is in the Bayesian dictionary S, the frequency of BA is increased by 1, whereas if BA is not in the Bayesian dictionary S, BA is added to S with its frequency recorded as 1.
  5. The pinyin error correction method of claim 1, wherein in step S11), after the set Z is merged into the set Y: if x is the first pinyin string of the input sentence, the set Y is sorted in descending order according to the pinyin hot word dictionary G; if x is not the first pinyin string of the input sentence, w denotes the pinyin string preceding x, u denotes the string formed by appending the pinyin string x to the end of the pinyin string w, f denotes the frequency of occurrence of u in the Bayesian dictionary S, r_1, r_2, ..., r_h denote the frequencies of occurrence of the pinyin strings of x in the Bayesian dictionary S, and m denotes the cumulative sum of the frequencies of all words in the Bayesian dictionary S; the conditional probability is calculated, with the value α therein taken in (0.5, 0.9); and the set Y is sorted in descending order of this value, the division by m being skipped in the actual calculation because the denominator of the conditional probability is the same for every candidate.
  6. The pinyin error correction method of claim 1, wherein updating the dictionary G in step S13) comprises: if the selected word is not in G, adding the selected word to G and recording its frequency as 1; and if the selected word is in G, increasing its frequency by 1.
  7. The pinyin error correction method of claim 1, wherein updating the dictionary S in step S13) comprises: if the selected word is not in S, adding the selected word to S and recording its frequency as 1; and if the selected word is in S, increasing its frequency by 1.
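
Steps S5) to S11) of claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the QWERTY adjacency rule in `adjacent_letters`, the function names, and the toy dictionary in the usage note are all illustrative, and the last suffix rule reads S116 as 'ong' to 'on', the mirror image of S112.

```python
# Keyboard rows of a standard QWERTY layout (an assumption; the claims only
# say "adjacent keys of the keyboard").
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Fuzzy-pinyin rewrites of S101-S116 as (old, new) pairs.
PREFIX_RULES = [("c", "ch"), ("ch", "c"), ("s", "sh"), ("sh", "s"),
                ("z", "zh"), ("zh", "z"), ("f", "h"), ("h", "f")]
SUFFIX_RULES = [("an", "ang"), ("en", "eng"), ("in", "ing"), ("on", "ong"),
                ("ang", "an"), ("eng", "en"), ("ing", "in"), ("ong", "on")]


def adjacent_letters(ch):
    """Return the set Q of letters on keys next to `ch` (S81)."""
    for r, row in enumerate(ROWS):
        c = row.find(ch)
        if c < 0:
            continue
        # Left/right neighbours plus the staggered keys above and below.
        cells = [(r, c - 1), (r, c + 1), (r - 1, c), (r - 1, c + 1),
                 (r + 1, c - 1), (r + 1, c)]
        return {ROWS[i][j] for i, j in cells
                if 0 <= i < len(ROWS) and 0 <= j < len(ROWS[i])}
    return set()


def generate_candidates(x, dictionary):
    """Return the unsorted candidate list Y for input x (S5-S11)."""
    x = x.lower()
    if x in dictionary:            # S5: already a valid pinyin string
        return []
    Y = []
    for i, ch in enumerate(x):     # S8: substitute one letter at a time
        for q in sorted(adjacent_letters(ch)):
            t = x[:i] + q + x[i + 1:]
            if t in dictionary and t not in Y:   # S83, S84
                Y.append(t)
    Z = []
    for y in Y:                    # S10: fuzzy-pinyin expansion
        for old, new in PREFIX_RULES:
            if y.startswith(old):
                t = new + y[len(old):]
                if t in dictionary:
                    Z.append(t)
        for old, new in SUFFIX_RULES:
            if y.endswith(old):
                t = y[:-len(old)] + new
                if t in dictionary:
                    Z.append(t)
    for t in Z:                    # S11: merge Z into Y
        if t not in Y:
            Y.append(t)
    return Y
```

With the toy dictionary `{"shang", "sang", "shan", "zhang"}`, the mistyped input `"shsng"` yields `["shang", "sang", "shan"]`: the substitution step recovers `shang`, and the expansion step adds its fuzzy variants. Ranking (S12) is left out because it depends on the dictionaries G and S.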

Description

Pinyin error correction method

Technical Field

The invention relates to the technical field of machine learning models, and in particular to a pinyin error correction method.

Background

Text input is indispensable in all kinds of human-computer interaction, and a system that automatically detects errors and offers accurate candidate words as people type is a great convenience. Owing to the nature of the Chinese language, many error correction schemes are available for pinyin input of Chinese characters, such as schemes based on the BK-tree (Burkhard-Keller tree), on keyboard layout, on binary search trees, on the noisy channel model, and so forth. These pinyin error correction methods either start from the structure of pinyin or evolve from English spelling correction. The noisy channel model (Shannon, 1948) has been applied successfully in a wide range of fields, particularly communications, and also in pinyin error correction. Pinyin error correction based on the noisy channel model regards a misspelling or mistyped key as an error caused by noise interference in the channel that carries the input from the keyboard to the system. For error correction, the mathematical principle underlying the noisy channel model is the likelihood function in Bayes' formula. Each of these methods thus starts from a model for solving the problem, and subsequent work evolves the original model to improve error correction capability and candidate word accuracy. These improvements are significant and valuable, and the various methods adapt to the spelling and input habits of different people.
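
For reference, the noisy channel formulation mentioned above is usually written as follows. This is the standard textbook form, not a formula taken from this patent; the symbols are chosen for illustration, with x the typed string, w a candidate intended string, and D the dictionary of valid pinyin strings.

```latex
\hat{w} \;=\; \arg\max_{w \in D} P(w \mid x)
        \;=\; \arg\max_{w \in D} \frac{P(x \mid w)\, P(w)}{P(x)}
        \;=\; \arg\max_{w \in D} P(x \mid w)\, P(w)
```

Here P(x | w) is the likelihood (the channel or error model) and P(w) is the prior over intended strings. The denominator P(x) is the same for every candidate, so it can be dropped when ranking.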
Disclosure of Invention

In view of the above practical needs, the present invention aims to provide a pinyin error correction method that improves the noisy channel pinyin error correction model: real-time frequency counting simplifies the noisy channel model used by most current pinyin error correction algorithms, which improves the efficiency of error correction and makes the algorithm itself more lightweight. In addition, the method replaces the edit distance calculation of traditional pinyin error correction algorithms with direct letter substitution, so the algorithm avoids computing edit distances repeatedly. The dictionaries used by the method are built and managed locally for each individual user, so they are highly personalized, targeted, and small, and can provide accurate candidate words. Compared with existing algorithms, the method greatly improves the error detection rate, candidate word precision, and execution efficiency.
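
The real-time frequency counting behind the per-user dictionaries can be sketched in Python as follows. This is a minimal sketch, assuming both dictionaries are in-memory counters: the hot word dictionary G counts each pinyin string the user enters, and the Bayesian dictionary S counts each pair of consecutive strings, stored concatenated as BA. The class and method names are hypothetical.

```python
from collections import Counter


class UserDictionaries:
    """Per-user frequency tables (hypothetical helper, for illustration)."""

    def __init__(self):
        self.G = Counter()  # hot words: pinyin string -> frequency
        self.S = Counter()  # pairs: previous + current ("BA") -> frequency

    def record(self, current, previous=""):
        """Count one entered pinyin string and, if any, its preceding string."""
        self.G[current] += 1
        if previous:        # the first string of a sentence has no pair
            self.S[previous + current] += 1

    def record_sentence(self, words):
        """Feed a whole sentence of pinyin strings in input order."""
        previous = ""
        for word in words:
            self.record(word, previous)
            previous = word

    def hot_words(self, k=3):
        """Return the k most frequent pinyin strings, most frequent first."""
        return [word for word, _ in self.G.most_common(k)]
```

For example, after `record_sentence(["ni", "hao"])` and `record_sentence(["ni", "men"])`, G holds `ni` with frequency 2, and S holds `nihao` and `nimen` with frequency 1 each. Because a missing key in a `Counter` reads as 0, the "add with frequency 1" and "increase by 1" cases collapse into a single `+= 1`.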
In order to achieve the above and other related objects, the present invention adopts the following technical scheme: a pinyin error correction method comprising the following steps: S1) inputting a common Chinese character pinyin dictionary D, wherein the dictionary D comprises about 10,000 common pinyin words of homophones, each pinyin word is a string of letters, case is not distinguished, and each pinyin word is hereinafter simply called a pinyin character string; S2) establishing a pinyin hot word dictionary G, wherein the dictionary G is established by recording each pinyin word input by the user together with its occurrence frequency, and sorting the pinyin words in descending order of occurrence frequency; S3) establishing a Bayesian dictionary S, wherein each record of the Bayesian dictionary S comprises the pinyin character strings of two Chinese characters and their occurrence frequency, the current pinyin character string is denoted A, the character string preceding A is denoted B, and the Bayesian dictionary S is established as follows: if B is an empty string, A is ignored; if B is not an empty string, BA is treated as one pinyin character string, and if BA is in the Bayesian dictionary S, the frequency of BA is increased by 1, whereas if BA is not in the Bayesian dictionary S, BA is added to the Bayesian dictionary S with its frequency recorded as 1; S4) inputting a pinyin character string x; S5) matching x against the pinyin dictionary D, and if x is contained in the pinyin dictionary D, returning directly to step S4 without error correction; S6) denoting l as the length of x and x_i as the i-th letter of the pinyin character string x, where 1 ≤ i ≤ l; S7) establishing a candidate pinyin character string set Y, which is initially an empty set; S8) repeating the following steps S81 to S85 from i = 1 to i = l; S81) denoting Q as the set of letters on the keys adjacent to the letter x_i on the keyboard, and m