CN-115048565-B - Sensitive information retrieval method and device
Abstract
The invention provides a sensitive information retrieval method and device. The method comprises: obtaining first information of a user publishing a document to be retrieved, and obtaining, according to the first information, second information of the user associated with the first information, the first information and the second information being associated in advance; matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved; if the initial character segment of the first information or the second information is matched in the document to be retrieved, reading, from the starting position of the initial character segment in the document to be retrieved, a character string whose length equals the character string length of the first information or the second information; and matching the character string with the first information or the second information, and, if the matching is successful, taking the character string as the retrieved sensitive information. The invention realizes quick retrieval of sensitive information.
Inventors
- WANG JINGJING
- CHEN JIE
- SONG XIAO
- WEN TAO
- WANG CHUNHUA
Assignees
- 中国移动通信集团江苏有限公司
- 中国移动通信集团有限公司
Dates
- Publication Date
- 20260421
- Application Date
- 20210308
- Priority Date
- 20210308
Claims (9)
- 1. A sensitive information retrieval method, comprising: acquiring first information of a user publishing a document to be retrieved, and acquiring, according to the first information, second information of the user associated with the first information, wherein the first information and the second information are associated in advance; matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved; if the initial character segment of the first information or the second information is matched in the document to be retrieved, reading, from the starting position of the initial character segment in the document to be retrieved, a character string whose length equals the character string length of the first information or the second information; matching the character string with the first information or the second information, and taking the character string as the retrieved sensitive information if the matching is successful; and, according to the starting position and the ending position of the sensitive information in the document to be retrieved, performing one or more of data replacement, invalidation, randomization, offset, masking, and encoding on the sensitive information in the document to be retrieved to obtain a desensitization result of the sensitive information.
- 2. The sensitive information retrieval method according to claim 1, wherein matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved comprises: if the first information or the second information is a telephone number, using the first three digits of the first information or the second information as the initial character segment to be matched with the document to be retrieved; and if the first information or the second information is not a telephone number, using the first character of the first information or the second information as the initial character segment to be matched with the document to be retrieved.
- 3. The sensitive information retrieval method according to claim 1, wherein matching the character string with the first information or the second information comprises: if the first information or the second information is of a first preset type, judging whether the ending character of the character string is the same as the ending character of the first information or the second information; if the ending characters are the same, judging whether the character string is completely identical to the first information or the second information; and if the character string is completely identical to the first information or the second information, determining that the character string is successfully matched with the first information or the second information.
- 4. The sensitive information retrieval method according to claim 1, wherein matching the character string with the first information or the second information comprises: if the first information or the second information is of a second preset type, converting the character string and the first information or the second information into word vectors; calculating the similarity between the word vector of the character string and the word vector of the first information or the second information; and if the similarity is greater than a preset threshold, determining that the character string is successfully matched with the first information or the second information.
- 5. The sensitive information retrieval method according to any one of claims 1 to 4, wherein acquiring, according to the first information, the second information of the user associated with the first information comprises: constructing an information map relation of the user based on the information table of the user in the CRM system; and searching the information map relation according to the first information to acquire the second information associated with the first information.
- 6. The sensitive information retrieval method according to any one of claims 1 to 4, wherein matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved comprises: copying, in a main thread, the document to be retrieved into a plurality of copies, wherein the number of copies is equal to the total number of pieces of the first information and the second information; and taking each copy as the input of a corresponding sub-thread of the main thread, each sub-thread matching the initial character segment of its corresponding first information or second information with its copy.
- 7. A sensitive information retrieval apparatus, comprising: an acquisition module, used for acquiring first information of a user publishing a document to be retrieved and acquiring, according to the first information, second information of the user associated with the first information, wherein the first information and the second information are associated in advance; a first matching module, used for matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved; a reading module, used for reading, if the initial character segment of the first information or the second information is matched in the document to be retrieved, a character string from the starting position of the initial character segment in the document to be retrieved, the length of the character string being equal to the character string length of the first information or the second information; a second matching module, used for matching the character string with the first information or the second information and, if the matching is successful, taking the character string as the retrieved sensitive information; and a desensitization module, used for performing, according to the starting position and the ending position of the sensitive information in the document to be retrieved, one or more of data replacement, invalidation, randomization, offset, masking, and encoding on the sensitive information in the document to be retrieved to obtain a desensitization result of the sensitive information.
- 8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the sensitive information retrieval method according to any one of claims 1 to 6.
- 9. A non-transitory computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the sensitive information retrieval method according to any one of claims 1 to 6.
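The per-item threading of claim 6 and the position-based desensitization step of claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names (`find_item`, `find_all_parallel`, `mask_spans`) are hypothetical, plain `str.find` stands in for the initial-segment matching, and masking is the only desensitization operation shown.

```python
from concurrent.futures import ThreadPoolExecutor


def find_item(document_copy: str, item: str) -> list[tuple[int, int]]:
    """Return (start, end) spans where item occurs exactly in this copy.

    Assumption: plain str.find replaces the initial-segment matching here.
    """
    spans = []
    pos = document_copy.find(item)
    while pos != -1:
        spans.append((pos, pos + len(item)))
        pos = document_copy.find(item, pos + 1)
    return spans


def find_all_parallel(document: str, items: list[str]) -> list[tuple[int, int]]:
    """Search for every item in its own worker thread, one document copy per item."""
    items = [item for item in items if item]
    if not items:
        return []
    # One input per worker. Python strings are immutable, so these "copies"
    # can safely share the same underlying object.
    copies = [document for _ in items]
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        results = pool.map(find_item, copies, items)
    return sorted(span for spans in results for span in spans)


def mask_spans(document: str, spans: list[tuple[int, int]], mask_char: str = "*") -> str:
    """Overwrite each (start, end) span with mask_char, keeping the document length."""
    chars = list(document)
    for start, end in spans:
        chars[start:end] = [mask_char] * (end - start)
    return "".join(chars)


doc = "Call 13812345678 or 13812345678, ask for Li Lei."
spans = find_all_parallel(doc, ["13812345678", "Li Lei"])
print(mask_spans(doc, spans))
```

Because masking works from recorded start and end positions, the document is not rescanned during desensitization. In CPython, threads running pure-Python string searches share the interpreter lock, so this sketch shows the structure of the approach rather than a speedup.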
Description
Sensitive information retrieval method and device
Technical Field
The present invention relates to the field of information security technologies, and in particular to a method and apparatus for retrieving sensitive information.
Background
At present, complaint work order information comes from multiple channels and takes many forms. Complaint work order content is entered and filed by business personnel and contains sensitive information, including but not limited to names, ID card numbers, mobile phone numbers, and addresses. When a work order flows to the vendor stage, all of its content is displayed in plaintext, so the risk of information leakage is high. During circulation, sensitive information in the complaint content, such as user privacy data, needs to be desensitized, and it must be retrieved before it can be desensitized.
Operators hold data for a very large number of users, and existing sensitive-word filtering algorithms often ignore the associations among sensitive words, which leads to repeated matching scans. In addition, the DFA (Deterministic Finite Automaton) and Aho-Corasick algorithms need to build all keywords into a dictionary tree, i.e. a state transition table, in advance; when the number of keywords is huge, the dictionary tree becomes unacceptably large and cannot be held in memory. Regular expression schemes are inefficient and require multiple matching passes, so they are not suitable for massive keyword sets. Full-text search algorithms usually handle only a small number of keywords, i.e. one or a few keywords per search, and generally rely on preprocessing and word segmentation of the original text, so they cannot be used for efficient retrieval of massive keywords.
Disclosure of Invention
The invention provides a sensitive information retrieval method and device, which are used for overcoming the low efficiency of sensitive information retrieval in the prior art and improving sensitive information retrieval efficiency.
The invention provides a sensitive information retrieval method, which comprises the following steps:
- acquiring first information of a user publishing a document to be retrieved, and acquiring, according to the first information, second information of the user associated with the first information, wherein the first information and the second information are associated in advance;
- matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved;
- if the initial character segment of the first information or the second information is matched in the document to be retrieved, reading, from the starting position of the initial character segment in the document to be retrieved, a character string whose length equals the character string length of the first information or the second information;
- matching the character string with the first information or the second information, and taking the character string as the retrieved sensitive information if the matching is successful.
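The retrieval steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `initial_segment` and `find_sensitive` are hypothetical, a telephone number is assumed to be an 11-digit string, and the ending-character pre-check is applied to every item rather than only to a preset type. The word-vector similarity branch is omitted.

```python
import re

# Assumption: a telephone number is modelled as exactly 11 digits.
PHONE_RE = re.compile(r"\d{11}")


def initial_segment(item: str) -> str:
    """First three digits for a telephone number, otherwise the first character."""
    return item[:3] if PHONE_RE.fullmatch(item) else item[:1]


def find_sensitive(document: str, items: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for each exact occurrence of a known item.

    items holds the user's first information and associated second information.
    """
    hits = []
    for item in items:
        if not item:
            continue
        segment = initial_segment(item)
        pos = document.find(segment)
        while pos != -1:
            # Read exactly len(item) characters from the segment's start position.
            candidate = document[pos:pos + len(item)]
            # Cheap ending-character check first, then the full comparison.
            if candidate[-1] == item[-1] and candidate == item:
                hits.append((pos, pos + len(item), candidate))
            pos = document.find(segment, pos + 1)
    return sorted(hits)


doc = "User Zhang San called from 13812345678 about billing."
for start, end, text in find_sensitive(doc, ["13812345678", "Zhang San"]):
    print(start, end, text)
```

Only the items known to belong to the publishing user are searched, so no dictionary tree over all keywords is built, which is the memory problem the Background attributes to DFA-style approaches.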
According to the sensitive information retrieval method provided by the invention, matching the initial character segments of the first information and the second information, respectively, with the document to be retrieved comprises the following steps:
- if the first information or the second information is a telephone number, using the first three digits of the first information or the second information as the initial character segment to be matched with the document to be retrieved;
- if the first information or the second information is not a telephone number, using the first character of the first information or the second information as the initial character segment to be matched with the document to be retrieved.
According to the sensitive information retrieval method provided by the invention, the step of matching the character string with the first information or the second information comprises the following steps:
- if the first information or the second information is of a first preset type, judging whether the ending character of the character string is the same as the ending character of the first information or the second information;
- if the ending characters are the same, judging whether the character string is completely identical to the first information or the second information;
- if the character string is completely identical to the first information or the second information, determining that the character string is successfully matched with the first information or the second information.
According to the sensitive information retrieval method provided by the invention