
KR-102962877-B1 - Wordbreak algorithm using offset mapping


Abstract

A computer system is provided comprising a processor coupled to a mass storage device for storing instructions, wherein when the instructions are executed by the processor, the processor stores an original string composed of multiple characters, performs a word break algorithm on the original string, and tokenizes the original string to generate a processed string containing multiple word tokens separated by spaces. The processor is further configured to generate an offset map between a position within a word token of the processed string and a corresponding position in the original string, and to classify a portion of the processed string as a target. The processor may be further configured to use the offset map to identify a target character in the original string corresponding to the target, and to perform a predetermined action on the target character in the original string.
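The flow summarized in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the regex-based `word_break` is a toy stand-in for a real word breaker, the offset map is kept as a simple per-character list, the target is found with a regular expression, and the predetermined action is masking with asterisks.

```python
import re

def word_break(original):
    """Toy stand-in for a word breaker (assumption): keep runs of ASCII
    letters and digits, drop everything else. Returns (start, text) pairs."""
    return [(m.start(), m.group()) for m in re.finditer(r"[A-Za-z0-9]+", original)]

def process(original):
    """Build the processed string and a per-character offset map.
    offset_map[i] is the index in `original` of processed[i], or None
    for a separator space that the tokenization inserted."""
    chars, offset_map = [], []
    for start, text in word_break(original):
        if chars:  # single space between consecutive tokens
            chars.append(" ")
            offset_map.append(None)
        for k, ch in enumerate(text):
            chars.append(ch)
            offset_map.append(start + k)
    return "".join(chars), offset_map

def redact(original, pattern):
    """Classify a target in the processed string, map it back to the
    original string, and mask the corresponding original characters."""
    processed, offset_map = process(original)
    match = re.search(pattern, processed)
    if match is None:
        return original
    positions = [p for p in offset_map[match.start():match.end()] if p is not None]
    if not positions:
        return original
    lo, hi = positions[0], positions[-1] + 1
    return original[:lo] + "*" * (hi - lo) + original[hi:]

original = "pin:  4321, ok"
processed, offset_map = process(original)
print(processed)                   # pin 4321 ok
print(redact(original, r"\d{4}"))  # pin:  ****, ok
```

In this example the processed string is shorter than the original, so the offsets of the match in the processed string (4 to 8) would point at the wrong characters if applied to the original directly; the map translates them to 6 to 10.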

Inventors

  • 굽타 마노즈
  • 모트라니 카빈

Assignees

  • 마이크로소프트 테크놀로지 라이센싱, 엘엘씨

Dates

Publication Date: 2026-05-12
Application Date: 2022-05-05
Priority Date: 2021-05-28

Claims (20)

  1. A computer system comprising a processor coupled to a mass storage device that stores instructions, wherein the instructions, when executed by the processor, cause the processor to: store an original string consisting of a plurality of characters; perform a word break algorithm on the original string; tokenize the original string to generate a processed string containing a plurality of word tokens separated by one or more spaces; generate an offset map between a position within a word token of the processed string and a corresponding position within the original string, wherein the offset map includes a mapping between a first data structure containing character offset index values in the original string and a second data structure containing character offset index values in the processed string; classify a portion of the processed string as a target, wherein the classification is performed by determining a start character offset index value and an end character offset index value of the target in the processed string from among the character offset index values in the second data structure; identify, using the mapping between the character offset index values of the first and second data structures in the offset map, a target character in the original string corresponding to the target; and perform a predetermined action on the target character in the original string.
  2. The computer system of claim 1, wherein, to perform the predetermined action, the processor is further configured to display the target character with highlighting, obfuscate the target character, and/or extract the target character.
  3. The computer system of claim 1, wherein, to identify the target character in the original string, the processor is further configured to identify a start character offset index value and a character length of the target character, and/or identify a start character offset index value and an end character offset index value of the target character.
  4. The computer system of claim 1, wherein, to identify the target character in the original string, the processor is configured to: determine token index values for each of the start character offset index value and the end character offset index value of the target in the processed string; determine a start character offset index value of a start token in the original string using the start character offset index value stored in the first data structure of the offset map; and determine an end character offset index value of an end token using the end character offset index value stored in the second data structure of the offset map.
  5. The computer system of claim 1, wherein the target is sensitive information of a predetermined sensitive information data type.
  6. The computer system of claim 1, wherein the original string contains Japanese, Chinese, Korean, or Thai characters.
  7. The computer system of claim 1, wherein the original string is extracted from an electronic document or an electronic message.
  8. The computer system of claim 1, wherein the original string includes characters that are omitted from the processed string after the word break algorithm is performed.
  9. The computer system of claim 1, wherein the first data structure stores, for each token word detected in the original string during the word break algorithm, a start character offset index value and a character length in the original string; the second data structure stores an end character offset index value of each token in the processed string; and the first and second data structures each have the same number of elements.
  10. A computerized method comprising: storing an original string composed of a plurality of characters; performing a word break algorithm on the original string; tokenizing the original string to generate a processed string containing a plurality of word tokens separated by one or more spaces; generating an offset map between a position within a word token of the processed string and a corresponding position within the original string, wherein the offset map includes a mapping between a first data structure containing character offset index values in the original string and a second data structure containing character offset index values in the processed string; classifying a portion of the processed string as a target, wherein the classification is performed by determining a start character offset index value and an end character offset index value of the target in the processed string from among the character offset index values in the second data structure; identifying a target character in the original string corresponding to the target using the mapping between the character offset index values of the first and second data structures in the offset map; and performing a predetermined action on the target character in the original string.
  11. The computerized method of claim 10, wherein performing the predetermined action includes one or more of: displaying the target character with highlighting, obfuscating the target character, and/or extracting the target character.
  12. The computerized method of claim 10, wherein identifying the target character in the original string includes one or more of: identifying a start character offset index value and a character length of the target character, and/or identifying a start character offset index value and an end character offset index value of the target character.
  13. The computerized method of claim 10, wherein identifying the target character in the original string is achieved at least in part by: determining token index values for each of the start character offset index value and the end character offset index value of the target in the processed string; determining a start character offset index value of a start token in the original string using the start character offset index value stored in the first data structure of the offset map; and determining an end character offset index value of an end token using the end character offset index value stored in the second data structure of the offset map.
  14. The computerized method of claim 10, wherein the target is sensitive information of a predetermined sensitive information data type.
  15. The computerized method of claim 10, wherein the original string contains Japanese, Chinese, Korean, or Thai characters.
  16. The computerized method of claim 10, wherein the original string is extracted from an electronic document or an electronic message.
  17. The computerized method of claim 10, wherein the original string includes characters that are omitted from the processed string after the word break algorithm is performed.
  18. The computerized method of claim 10, wherein the first data structure stores, for each token word detected in the original string during the word break algorithm, a start character offset index value and a character length in the original string; the second data structure stores an end character offset index value of each token in the processed string; and the first and second data structures each have the same number of elements.
  19. The computerized method of claim 18, wherein the end character offset index value of each token of the processed string is calculated using the previous end character offset index value in the processed string and the length of the corresponding token of the original string.
  20. A computer system configured to classify words, comprising a server computing device configured to run a search program, wherein the search program receives a sensitive data definition and one or more policies as input and is configured to search a data set containing a plurality of original strings for sensitive data according to the sensitive data definition, and wherein the server computing device is configured to: perform a word break algorithm on an original string selected from among the plurality of original strings; tokenize the selected original string to generate a processed string containing a plurality of word tokens separated by one or more spaces; generate an offset map between a position within a word token of the processed string and a corresponding position within the original string, wherein the offset map includes a mapping between a first data structure containing character offset index values in the original string and a second data structure containing character offset index values in the processed string; classify a portion of the processed string as a target, wherein the classification is performed by determining a start character offset index value and an end character offset index value of the target in the processed string from among the character offset index values in the second data structure; identify a target character in the original string corresponding to the target using the mapping between the character offset index values of the first and second data structures in the offset map; and perform a predetermined action on the target character in the original string.
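The two parallel data structures recited in claims 9, 18, and 19 and the token lookup of claims 4 and 13 might be sketched as follows. This is one possible reading of the claim language, not the patented implementation. The function names, the use of binary search, the exclusive end offsets, the single space between tokens, and the hand-written token spans (standing in for the output of a real word breaker) are all assumptions.

```python
from bisect import bisect_left, bisect_right

def build_offset_map(spans):
    """spans: (start, length) of each token in the original string, as a
    hypothetical word breaker might report them.
    first[i]  = (start, length) of token i in the original string.
    second[i] = end offset (exclusive) of token i in the processed string."""
    first, second = [], []
    prev_end = -1  # chosen so that the first token begins at offset 0
    for start, length in spans:
        first.append((start, length))
        # previous end + one separating space + length of this token
        prev_end = prev_end + 1 + length
        second.append(prev_end)
    return first, second

def to_original_span(first, second, p_start, p_end):
    """Map a target [p_start, p_end) in the processed string to a span
    [o_start, o_end) in the original string. Assumes the target begins
    and ends inside tokens rather than on separator spaces."""
    i = bisect_right(second, p_start)  # token index holding p_start
    j = bisect_left(second, p_end)     # token index holding p_end - 1
    tok_i_start = second[i] - first[i][1]  # where token i starts in processed
    tok_j_start = second[j] - first[j][1]
    o_start = first[i][0] + max(p_start - tok_i_start, 0)
    o_end = first[j][0] + (p_end - tok_j_start)
    return o_start, o_end

original = "私の番号は1234です。"
# Assumed word breaker output; the final "。" is dropped from the tokens.
spans = [(0, 1), (1, 1), (2, 2), (4, 1), (5, 4), (9, 2)]
first, second = build_offset_map(spans)
processed = " ".join(original[s:s + n] for s, n in spans)

p_start = processed.find("1234")
o_start, o_end = to_original_span(first, second, p_start, p_start + 4)
print(processed)                # 私 の 番号 は 1234 です
print(second)                   # [1, 3, 6, 8, 13, 16]
print(original[o_start:o_end])  # 1234
```

Storing one entry per token rather than one per character keeps both structures the same length, as claims 9 and 18 require, and lets the lookup run in time logarithmic in the number of tokens.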

Description

Wordbreak algorithm using offset mapping

Wordbreak algorithms are used in various computing contexts. One specific application is Data Loss Prevention (DLP). DLP systems are designed to protect against data loss, such as theft and accidental disclosure, when sensitive data is stored or transmitted, for example on computers and computer networks. Word classification programs used by these DLP systems can monitor and detect sensitive information contained in electronic communications, such as emails and messages, to prevent that information from being transmitted outside the corporate network.

DLP technology supports multibyte characters found in languages such as Chinese, Korean, and Japanese. For strings composed of these multibyte characters, a wordbreak algorithm can be used to separate the original string into individual words, typically separated by spaces, and to generate a processed string containing the tokenized words. However, for such multibyte languages, the processed string generated by the wordbreak algorithm may in many situations have a different length from the original string, may omit some characters present in the original string, such as commas and other punctuation, and may contain spaces, tabs, and other whitespace characters that differ from those in the original string. When the original string and the processed string differ in this way, the word classification program of the DLP system may fail to correctly identify sensitive information in the original string, which could lead to important sensitive information being stolen or accidentally disclosed.

According to one aspect of the present disclosure, a computer system comprising a processor coupled to a mass storage device for storing instructions is provided.
When executed by the processor, the instructions cause the processor to store an original string composed of a plurality of characters, perform a word break algorithm on the original string, and tokenize the original string to generate a processed string comprising a plurality of space-separated word tokens. The processor may be further configured to generate an offset map between a location within a word token of the processed string and a corresponding location in the original string, classify a portion of the processed string as a target, use the offset map to identify a target character in the original string corresponding to the target, and perform a predetermined action on the target character in the original string.

One potential advantage of this configuration is that the wordbreak algorithm can accurately identify target characters in the original string, which consists of multibyte language characters, based on the processed string, even if there is a discrepancy between the original string and the processed string. Therefore, a word classification program that applies this algorithm can accurately identify sensitive information and prevent the theft or accidental leakage of such information.

This summary is provided to introduce, in a simplified form, some concepts described in detail in the detailed description below. This summary is not intended to identify the principal or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to any implementation that addresses any or all disadvantages mentioned in any part of this disclosure.

FIG. 1 is a schematic diagram of a computing system comprising a computing device configured to generate a processed string that can be searched by a search program, and to perform a word break algorithm to identify a target character of the original string corresponding to a matching target in the processed string by utilizing an offset map when a match is found by the search program.

FIG. 2 is a schematic diagram of another configuration of the computing system of FIG. 1, comprising a first computing device configured with a compliance and security program that sets policies and sensitive data definitions, and a server system comprising a second computing device and a third computing device that each execute a search program according to the definitions and policies set by the first computing device.

FIG. 3 is a schematic diagram of a plurality of data structures manipulated by the computing system of FIG. 1 when performing a word break algorithm with offset mapping.

FIG. 4 is a schematic diagram of an exemplary GUI used to set up the compliance and security program policy of FIG. 2.

FIGS. 5a-5d show four different GUI examples used to perform a predetermined action on a target character in the original string.

FIG. 6 is a schematic diagram of a plurality of data structures manipulated by the computing system of FIG. 1 when performing a word break algorithm having offset mapping for another example of the or