US-20260127202-A1 - UNIQUE DOCUMENT VARIANTS OF A SOURCE DOCUMENT FOR IDENTIFYING A USER ASSOCIATED THEREWITH
Abstract
A method and computer program product provide various operations including generating a plurality of unique document variants of a source document, wherein each unique document variant is generated by causing a language model to paraphrase at least a portion of the source document and providing, for each of a plurality of users, a different one of the unique document variants to the user rather than the source document. A plurality of records are stored, wherein each record includes a unique identifier for a particular user and one or more uniquely paraphrased portions of the unique document variant provided to the particular user. If a target document is used in an unauthorized manner, the records may be searched to identify a record in which a uniquely paraphrased portion of the unique document variant is found within the target document and the identity of the user may be output.
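The record search summarized above can be illustrated with a minimal Python sketch. It is not taken from the application: the names `VariantRecord`, `normalize`, and `find_leaker` are hypothetical, the language model paraphrasing step is omitted, and a match is assumed to be a case-insensitive, whitespace-insensitive substring test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical record: a user identifier plus the passages unique to that user's variant."""
    user_id: str
    paraphrased_portions: tuple[str, ...]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting does not hide a match."""
    return " ".join(text.lower().split())


def find_leaker(records: list[VariantRecord], target_document: str) -> Optional[str]:
    """Return the user_id of the first record with a portion found in the target, else None."""
    haystack = normalize(target_document)
    for record in records:
        if any(normalize(p) in haystack for p in record.paraphrased_portions):
            return record.user_id
    return None


# Illustrative data only; the portions stand in for language model paraphrases.
records = [
    VariantRecord("user-a", ("Quarterly revenue climbed by twelve percent.",)),
    VariantRecord("user-b", ("Revenue for the quarter rose twelve percent.",)),
]
leaked = "Internal memo.  Revenue for the quarter rose\ntwelve percent. Do not share."
print(find_leaker(records, leaked))
```

A plain substring test fails as soon as the leaked text is reworded, so a practical system would likely need fuzzier matching; the sketch only shows the shape of the record lookup.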
Inventors
- GEORGE-ANDREI STANESCU
- Ajay Dholakia
- EDUARD PAVEL
Assignees
- Lenovo Global Technology (United States) Inc.
Dates
- Publication Date
- 20260507
- Application Date
- 20241107
Claims (20)
- 1 . A computer program product comprising a non-transitory computer readable medium and program instructions embodied therein, the program instructions being configured to be executable by a processor to cause the processor to perform operations comprising: generating a plurality of unique document variants of a source document, wherein each unique document variant is generated by causing a language model to paraphrase at least a portion of the source document; providing, for each of a plurality of users that are authorized to access the source document, a different one of the unique document variants to the user rather than the source document; storing a plurality of records, wherein each record includes a unique identifier for a particular one of the users and one or more uniquely paraphrased portions of the unique document variant provided to the particular user; obtaining a target document that has been used in an unauthorized manner; searching the plurality of records to identify one of the records in which at least one of the one or more uniquely paraphrased portions of the unique document variant is found within the target document; and outputting identifying information for the user associated with the unique identifier included in the identified record.
- 2 . The computer program product of claim 1 , wherein the operation of searching the plurality of records to identify one of the records in which the one or more uniquely paraphrased portions of the unique document variant are found within the target document includes: searching the plurality of records to identify one of the records in which the unique document variant matches the target document.
- 3 . The computer program product of claim 1 , further comprising: automatically modifying a network or server access permission or setting for the user associated with the unique identifier included in the identified record.
- 4 . The computer program product of claim 1 , further comprising: storing the source document on a data storage device; presenting, via a user interface, an indicator suggesting that the source document is accessible to the plurality of users; and receiving user input from a given user among the plurality of users, wherein the user input is a request to access the source document, and where the one of the unique document variants is provided to the given user rather than the source document in response to receiving the user input.
- 5 . The computer program product of claim 4 , wherein the operation of generating the plurality of unique document variants of the source document includes generating one unique document variant for each of the plurality of users that are authorized to access the source document without receiving further user input from each of the plurality of users.
- 6 . The computer program product of claim 4 , wherein the one of the unique document variants that is provided to the given user is generated in response to receiving the user input that requests to access the source document.
- 7 . The computer program product of claim 4 , the operations further comprising: forming a user interface enabling any of the plurality of users to submit the user input, wherein the one of the unique document variants is provided to the given user by downloading the unique document variant to a computing device operated by the given user.
- 8 . The computer program product of claim 1 , further comprising: storing the source document on a data storage device; and receiving user input requesting that the source document be sent to one or more of the plurality of users, wherein, for each of the one or more users, a different one of the unique document variants is provided to the user by sending the one of the unique document variants in an electronic message.
- 9 . The computer program product of claim 1 , wherein each record includes a complete copy of the unique document variant provided to the particular user.
- 10 . The computer program product of claim 1 , wherein each record includes the one or more uniquely paraphrased portions of the unique document variant but not a complete copy of the unique document variant provided to the particular user.
- 11 . The computer program product of claim 10 , wherein the one or more uniquely paraphrased portions of the unique document variant includes only the difference between the unique document variant and the source document.
- 12 . The computer program product of claim 1 , wherein the operation of generating the plurality of unique document variants of the source document includes providing a seed value to an algorithm that uses the seed value to produce an output that controls how the language model will paraphrase the at least a portion of the source document, and wherein the one or more uniquely paraphrased portions of the unique document variant is included in each record by including the seed value.
- 13 . The computer program product of claim 12 , wherein the operation of searching the plurality of records to identify one of the records in which the one or more uniquely paraphrased portions of the unique document variant is found within the target document includes regenerating the one or more uniquely paraphrased portions of the unique document variant by inputting the seed value of a record into the algorithm and providing the output and the source document to the language model.
- 14 . The computer program product of claim 1 , further comprising: storing, in each of the plurality of records, a hash of the unique document variant provided to the user identified in the record; and calculating a hash of the target document, wherein the operation of searching the plurality of records to identify one of the records in which the one or more uniquely paraphrased portions of the unique document variant is found within the target document includes searching the plurality of records to identify one of the records in which the hash of the unique document variant matches the hash of the target document.
- 15 . The computer program product of claim 1 , further comprising: tracking, for any of the plurality of users that contributed to the source document, one or more portions of the source document that are contributed by the user, wherein generating the plurality of unique document variants of the source document includes generating, for each of the plurality of users that contributed to the source document, the unique document variant provided to the user by causing the language model to paraphrase only those portions of the source document that were contributed by another user.
- 16 . The computer program product of claim 1 , further comprising: forming a digital signature for each unique document variant, wherein the operation of providing, for each of a plurality of users that are authorized to access the source document, a different one of the unique document variants to the user rather than the source document includes providing the digital signature to the user along with the unique document variant.
- 17 . The computer program product of claim 1 , further comprising: determining an extent to which the source document will be paraphrased during generating of the one or more of the document variants based upon the number of users that are authorized to access the source document, a level of confidentiality assigned to the source document, and/or a probability that the source document will be used in an unauthorized manner, wherein the operation of causing the language model to paraphrase at least a portion of the source document includes instructing the language model to paraphrase the at least a portion of the source document to the determined extent.
- 18 . The computer program product of claim 1 , further comprising: determining an extent to which the source document will be paraphrased during generating of the one or more of the document variants based upon a level of interaction among the plurality of users that are authorized to access the source document, wherein the operation of causing the language model to paraphrase at least a portion of the source document includes instructing the language model to paraphrase the at least a portion of the source document to the determined extent.
- 19 . The computer program product of claim 1 , further comprising: determining the total number of words in the source document; and randomly generating a number (N) between one and the total number of words in the source document, wherein causing the language model to paraphrase at least a portion of the source document includes causing the language model to replace the Nth word in the source document with a synonym, paraphrase the sentence that includes the Nth word in the source document, and/or paraphrase the paragraph that includes the Nth word in the source document.
- 20 . The computer program product of claim 1 , further comprising: establishing a similarity threshold between the unique document variants; determining a similarity index between a plurality of pairings of the unique document variants; and causing, in response to one of the pairings of the unique document variants having a similarity index greater than the similarity threshold, the language model to further paraphrase one or more of the unique document variants in the pairing.
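As a non-authoritative illustration of the hash lookup in claim 14 and the similarity check in claim 20, the following Python sketch uses only the standard library. The claims do not name a hash function or a similarity measure; SHA-256 and the `difflib.SequenceMatcher` ratio are assumptions made here, and the function names and sample variants are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations
from typing import Optional


def variant_hash(text: str) -> str:
    """SHA-256 hex digest of a variant's exact text (assumed hash choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_user_by_hash(hash_to_user: dict[str, str], target_document: str) -> Optional[str]:
    """Return the user whose stored variant hash equals the target's hash, else None."""
    return hash_to_user.get(variant_hash(target_document))


def pairs_above_threshold(variants: dict[str, str], threshold: float) -> list[tuple[str, str]]:
    """Return user pairs whose variants have a similarity index greater than the threshold."""
    flagged = []
    for (user_a, text_a), (user_b, text_b) in combinations(variants.items(), 2):
        # Assumed similarity index: difflib ratio in [0, 1], where 1.0 means identical.
        if SequenceMatcher(None, text_a, text_b).ratio() > threshold:
            flagged.append((user_a, user_b))
    return flagged


# Illustrative variants standing in for language model output.
variants = {
    "user-a": "The launch is planned for early March.",
    "user-b": "The launch is planned for early in March.",
    "user-c": "We expect to ship the product at the start of March.",
}
hash_to_user = {variant_hash(text): user for user, text in variants.items()}

print(find_user_by_hash(hash_to_user, variants["user-c"]))
print(pairs_above_threshold(variants, threshold=0.9))
```

The hash lookup only succeeds when the target document is a byte-for-byte copy of a stored variant, which is why it complements rather than replaces a search over paraphrased portions. Pairs returned by `pairs_above_threshold` would be the candidates sent back to the language model for further paraphrasing.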
Description
BACKGROUND
The present disclosure relates to a method of determining the source of a document leak or other unauthorized use of a document.
BACKGROUND OF THE RELATED ART
Unauthorized sharing or exposing of a private, confidential, and/or sensitive document of a company or other organization can have severe negative impacts on that organization. These negative impacts may include financial loss, legal liability, damage to reputation, or loss of a competitive advantage. In some instances, a document may be unintentionally leaked due to some careless act of a person working within the organization. In other instances, a person within the organization may intentionally leak a document to advance their own agenda or cause damage to the organization. Either way, it may be important to identify the person who leaked the document in order to discover a motive behind the leak and the means and extent of the leak. If this information about the leak can be obtained, then perhaps a future document leak can be mitigated or prevented.
Unfortunately, there is no practical way to guarantee that a person working within an entity, group or organization, such as an employee of a company, will not intentionally or unintentionally expose a confidential document outside the organization or a defined group within the organization. In fact, it could very well be counterproductive to fully restrict the workers within the organization from having access to the document since the information within the document may be necessary for furthering the purpose of the organization. Still, promptly identifying the person who caused the document to be leaked may help limit the impact of the leak and prevent future leaks.
BRIEF SUMMARY
Some embodiments provide a computer program product comprising a non-volatile computer readable medium and non-transitory program instructions embodied therein, the program instructions being configured to be executable by a processor to cause the processor to perform one or more operations. The operations comprise generating a plurality of unique document variants of a source document, wherein each unique document variant is generated by causing a language model to paraphrase at least a portion of the source document. The operations further comprise providing, for each of a plurality of users that are authorized to access the source document, a different one of the unique document variants to the user rather than the source document, and storing a plurality of records, wherein each record includes a unique identifier for a particular one of the users and one or more uniquely paraphrased portions of the unique document variant provided to the particular user. Still further, the operations comprise obtaining a target document that has been used in an unauthorized manner, searching the plurality of records to identify one of the records in which at least one of the one or more uniquely paraphrased portions of the unique document variant is found within the target document, and outputting identifying information for the user associated with the unique identifier included in the identified record.
Some embodiments provide a method comprising generating a plurality of unique document variants of a source document, wherein each unique document variant is generated by causing a language model to paraphrase at least a portion of the source document.
The method further comprises providing, for each of a plurality of users that are authorized to access the source document, a different one of the unique document variants to the user rather than the source document, and storing a plurality of records, wherein each record includes a unique identifier for a particular one of the users and one or more uniquely paraphrased portions of the unique document variant provided to the particular user. Still further, the method comprises obtaining a target document that has been used in an unauthorized manner, searching the plurality of records to identify one of the records in which at least one of the one or more uniquely paraphrased portions of the unique document variant is found within the target document, and outputting identifying information for the user associated with the unique identifier included in the identified record.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
FIG. 1 is a diagram of a system in which unique document variants are provided to users according to some embodiments.
FIG. 2 is a diagram illustrating the generation of three unique document variants of a single source document for sharing with three authorized users according to some embodiments.
FIG. 3 is a diagram of a computing device according to some embodiments.
FIG. 4 is a flowchart of operations according to some embodiments.
DETAILED DESCRIPTION
Some embodiments provide a computer program product comprising a non-volatile computer readable medium and non-transitory program instructions embodied therein, the program instructions being configured