CN-120743976-B - Intelligent phone book searching method based on TF-IDF pinyin vector model
Abstract
The invention discloses an intelligent phone book searching method based on a TF-IDF pinyin vector model, belonging to the field of communication. The method comprises: first converting the contact names stored on a communication device into pinyin character strings; calculating the IDF value and the word frequency (TF) value of each character to obtain TF-IDF values, and normalizing them; when a user inputs the information of a contact M to be queried, computing the normalized TF-IDF vector of contact M and calculating the cosine similarity between that vector and the saved TF-IDF vector of each contact; traversing the saved contacts and, for each one, calculating the normalized edit distance between the Chinese names and the normalized edit distance between the pinyin of contact M and that contact, and taking the larger of the two as the final edit distance similarity; weighting the cosine similarity and the final edit distance similarity to form a similarity score between contact M and each contact; and sorting the scores in descending order and displaying the top K names to the user as the search result. The invention significantly improves search precision.
Inventors
- LIU JIANBING
- FENG BO
- LI GUANGXU
- SHANG YINZHONG
- GAO FENG
- ZHU HAIBO
- JIANG RUI
- SONG JUPO
- LIU YONGHUI
Assignees
- 北京方位智联科技有限公司
Dates
- Publication Date: 20260512
- Application Date: 20250617
Claims (8)
- 1. An intelligent phone book searching method based on a TF-IDF pinyin vector model, characterized by comprising the following specific steps:
  Step one, storing the contact information of an embedded communication device in a phone book CSV file;
  Step two, converting each contact name in the phone book CSV file to pinyin to obtain pinyin character strings, establishing a character index table, and calculating the IDF value of each character; the IDF value is calculated as $\mathrm{IDF}(*) = \log\frac{m}{n+1} + 1$, where * denotes any of the 26 characters a to z, m is the total number of contacts, and n is the number of contacts containing the character;
  Step three, performing word frequency statistics on the pinyin character string of each contact to obtain the word frequency TF value of each character, calculating TF-IDF values by combining the IDF values of the characters, and normalizing them to obtain the TF-IDF vector of the contact; the word frequency TF value is calculated as $\mathrm{TF}(*) = t(*)/Z$, where t(*) is the number of times the character appears and Z is the total number of pinyin characters; the TF-IDF value is calculated as $\text{TF-IDF}(*) = \mathrm{TF}(*) \times \mathrm{IDF}(*)$;
  Step four, serializing and persistently storing the character index table, the IDF value matrix and the TF-IDF vectors of all contacts in a binary format;
  Step five, when the user inputs the information of a contact M to be queried, converting the name of contact M to pinyin and calculating the TF-IDF vector of contact M as in the preceding steps;
  Step six, calculating the cosine similarity between the TF-IDF vector of contact M and the saved TF-IDF vector of each contact;
  Step seven, for the currently stored contact N, calculating two normalized edit distances, one between the Chinese names of contact M and contact N and one between their pinyin, and selecting the larger of the two as the final edit distance similarity of contact M and contact N;
  Step eight, constructing the final similarity score of contacts M and N by weighting the cosine similarity and the final edit distance similarity: $\text{final similarity} = \alpha \times \text{cosine similarity} + (1 - \alpha) \times \text{final edit distance similarity}$, where $\alpha$ is a weight coefficient with value range [0, 1];
  Step nine, returning to step seven and traversing all stored contacts one by one, calculating the final similarity score of contact M and each contact, sorting all contacts in descending order of final similarity score, and selecting the top K contacts as the search result, where K is an integer set manually according to actual needs;
  Step ten, displaying the search result to the user, supporting selection of a contact by the user, and providing detailed view and call functions.
- 2. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein the phone book CSV file includes the name, department, mailbox and landline number of every contact.
- 3. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein step two specifically comprises: first, converting the names of all contacts in the phone book to pinyin to obtain pinyin character strings; then, establishing a character-level index table that assigns an index position to each unique pinyin character, where, because Chinese pinyin uses the 26 English letters, the index table has size 26 and each letter corresponds to one index position: 'a' corresponds to index 0, 'b' to index 1, and so on up to 'z' at index 25; and then, counting the document frequency of each character across the pinyin of all contacts, which serves as the corpus, and calculating the IDF value of each character.
- 4. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein step three specifically comprises: first, for the pinyin character string of each contact, counting the number of occurrences of each character and dividing it by the total number of pinyin characters to obtain the word frequency TF value of the character; then, calculating the TF-IDF value of each character; then, constructing the TF-IDF vector of each contact according to the character index table, filling 0 at the positions of characters that do not appear, so that the TF-IDF vector comprises 26 elements, each being either the TF-IDF value of the corresponding character or 0; and finally, applying L2 normalization to the TF-IDF vector, namely dividing each element of the vector by the Euclidean norm of the vector, and storing the normalized vector in association with the contact information.
- 5. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein step four specifically comprises: serializing and storing the character index table and the IDF value matrix in a binary format; and serializing and storing the TF-IDF vectors and contact information of all contacts in a binary format, wherein the contact vectors are stored in a sparse representation that keeps only the indices and values of the non-zero elements.
- 6. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein in step six, the cosine similarity calculation is simplified to the dot product of the two normalized vectors, namely: similarity = (TF-IDF vector of the query) · (TF-IDF vector of the contact).
- 7. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein in step seven, the edit distance is computed with the Levenshtein distance algorithm, and the normalization formulas are: normalized Chinese edit distance = $1 - d_1/L_1$, where $d_1$ is the Chinese edit distance, that is, the number of characters by which the Chinese names of contacts M and N differ, and $L_1$ is the length of the longer of the two Chinese names; normalized pinyin edit distance = $1 - d_2/L_2$, where $d_2$ is the pinyin edit distance, that is, the number of characters by which the pinyin of the names of contacts M and N differ, and $L_2$ is the length of the longer of the two pinyin strings.
- 8. The intelligent phone book searching method based on the TF-IDF pinyin vector model of claim 1, wherein in step nine, when the final similarity scores of two adjacent contacts in the ranking are close, the contact whose final edit distance similarity to contact M is higher is ranked first.
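The scoring pipeline recited in claims 1, 4, 6 and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: all function names are hypothetical, the logarithm is taken as the natural log (the claims do not state the base), pinyin strings are supplied directly rather than produced by a converter, and the default `alpha=0.5` is an arbitrary choice within [0, 1]. The binary persistence of step four and the tie-breaking of claim 8 are omitted.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # index table: 'a' -> 0 ... 'z' -> 25


def idf_table(pinyins):
    """IDF(*) = log(m / (n + 1)) + 1 per letter (natural log is an assumption)."""
    m = len(pinyins)
    return [math.log(m / (sum(c in p for p in pinyins) + 1)) + 1 for c in ALPHABET]


def tfidf_vector(pinyin, idf):
    """26-element TF-IDF vector, L2-normalized; letters that do not occur stay 0."""
    letters = [c for c in pinyin if c in ALPHABET]
    vec = [0.0] * len(ALPHABET)
    if not letters:
        return vec
    for c in set(letters):
        i = ALPHABET.index(c)
        vec[i] = letters.count(c) / len(letters) * idf[i]  # TF(*) x IDF(*)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def levenshtein(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Normalized edit distance 1 - d / L, with L the longer string length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def search(query_name, query_pinyin, contacts, alpha=0.5, k=3):
    """contacts is a list of (chinese_name, pinyin) pairs; returns top-k (score, name)."""
    idf = idf_table([p for _, p in contacts])
    q = tfidf_vector(query_pinyin, idf)
    scored = []
    for name, pinyin in contacts:
        v = tfidf_vector(pinyin, idf)
        cosine = sum(x * y for x, y in zip(q, v))  # dot product of unit vectors
        edit = max(edit_similarity(query_name, name),
                   edit_similarity(query_pinyin, pinyin))
        scored.append((alpha * cosine + (1 - alpha) * edit, name))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

For example, `search("李明", "liming", [("李明", "liming"), ("王芳", "wangfang"), ("李小明", "lixiaoming")], k=2)` ranks 李明 first and 李小明 second. The sketch recomputes contact vectors on every query for brevity; the method as claimed computes them once and stores them.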
Description
Intelligent phone book searching method based on TF-IDF pinyin vector model
Technical Field
The invention belongs to the field of communication, and particularly relates to an intelligent phone book searching method based on a TF-IDF pinyin vector model.
Background
With the development of communication technology, embedded communication devices such as IP phones and conference phones are widely used by enterprises and individuals. These devices generally store a large amount of contact information, and users often need to find a specific contact quickly during daily use; because of the large amount of information, searching is slow.
At present, the phone book searching methods commonly used in embedded telephone products are mainly the following:
1. Prefix matching: only contacts that begin with the characters entered by the user are matched. For example, entering "li" matches only contacts whose pinyin begins with "li", so the matching range is limited.
2. Substring matching: the name or pinyin of each contact is searched for a substring equal to the user input. The matching range is wider, but the importance of a match cannot be distinguished, and computational efficiency is low.
3. Edit distance: similarity is evaluated by calculating the edit distance (such as the Levenshtein distance) between the user input and the contact. This method tolerates spelling errors, but it does not consider character importance and is computationally expensive for long character strings.
The above methods share the following problems:
1) Weak semantic understanding: the importance of different characters cannot be identified; for example, rare characters should contribute more to a match than common characters.
2) Low computational efficiency: response is slow, particularly when the number of contacts is large, so real-time search requirements are difficult to meet.
3) High memory usage: efficient storage is difficult to achieve, particularly on resource-constrained embedded devices.
4) Limited fuzzy matching capability: user input errors and partial matches are difficult to handle, which degrades the user experience.
With the development of artificial intelligence technology, vectorization methods from natural language processing provide a new approach to these problems. However, conventional word embedding models (such as Word2Vec and BERT) require training on large corpora, have high computational complexity, and are difficult to implement on resource-constrained embedded devices.
Therefore, there is a need for an intelligent phone book searching method that is computationally efficient, compact in storage, and accurate in search.
Disclosure of Invention
The invention provides an intelligent phone book searching method based on a TF-IDF pinyin vector model. It addresses the low efficiency, poor accuracy, large memory footprint and weak support for fuzzy search of prior-art phone book searching, is suitable for embedded communication devices with limited processing capacity and storage space, and achieves efficient and accurate contact search with minimal consumption of system resources.
The intelligent phone book searching method based on the TF-IDF pinyin vector model comprises the following specific steps:
Step one, storing the contact information of an embedded communication device in a phone book CSV file; the phone book CSV file includes information such as the name, department, mailbox and landline number of every contact.
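As a rough illustration of step one, the phone book CSV file could be read with Python's standard `csv` module. The column names (`name`, `department`, `mailbox`, `landline`) and the sample rows below are made up for this sketch; the source names the kinds of fields but does not specify a file layout.

```python
import csv
import io

# Hypothetical layout and sample rows; neither comes from the source document.
SAMPLE_CSV = """name,department,mailbox,landline
李明,Sales,liming@example.com,8001
王芳,Finance,wangfang@example.com,8002
"""


def load_contacts(text):
    """Parse phone book CSV text into a list of dicts, one per contact."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


contacts = load_contacts(SAMPLE_CSV)
```

Each dict then carries the fields that the later steps attach to a contact's TF-IDF vector.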
Step two, converting each contact name in the phone book CSV file to pinyin to obtain pinyin character strings, and establishing a character index table.
First, the names of all contacts in the phone book are converted to pinyin to obtain pinyin character strings.
Then, a character-level index table is built, assigning an index position to each unique pinyin character. Chinese pinyin uses the 26 English letters, so the index table has size 26 and each letter corresponds to one index position: 'a' corresponds to index 0, 'b' to index 1, and so on up to 'z' at index 25.
Then, the document frequency of each character is counted across the pinyin of all contacts, which serves as the corpus, and the IDF value of each character is calculated: $\mathrm{IDF}(*) = \log\frac{m}{n+1} + 1$, where * denotes any of the 26 characters a to z, m is the total number of contacts, and n is the number of contacts containing the character.
Step three, performing word frequency statistics on the pinyin character string of each contact to obtain the word frequency TF value of each character, calculating TF-IDF values by combining the IDF values of the characters,