CN-121256861-B - User privacy data protection method and system based on data security
Abstract
The invention relates to the technical field of privacy protection, and in particular to a user privacy data protection method and system based on data security. The method determines the core sentences in a user's text through phrase association analysis of the candidate keywords in the text; screens keyword phrases according to the sensitivity of the candidate keywords in the core sentences, thereby determining the positions at which noise is added; divides the text into text units according to the semantic association relations among the keyword phrases; analyzes the confidentiality requirement of each text unit according to the sensitivity of its vocabulary; clusters the text units by similarity; and adds noise according to the confidentiality requirements of the different clusters to obtain noisy text data. By combining fine-grained semantic association analysis of sentences and vocabulary in the text, the invention applies adaptive noise to sensitive parts according to their risk while preserving the basic semantics, so that user privacy is protected and the accuracy and reliability of subsequent data analysis are maintained.
Inventors
- GUO QI
- XU ZHONGWANG
- DAI LEI
- FAN ANG
Assignees
- 北京美数信息科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-04
Claims (9)
- 1. A user privacy data protection method based on data security, the method comprising: acquiring the phrases and candidate keywords of each sentence in the information text data through word segmentation; obtaining the sensitivity probability of each phrase with respect to each candidate keyword from the co-occurrence rate of the phrase and the candidate keyword in the information text data of all users; obtaining a key evaluation index for each candidate keyword by combining the sensitivity evaluation index with the occurrence frequency of the candidate keyword in the user's core sentences and its phrase association; analyzing the sensitivity of all words in each text unit against a preset sensitive dictionary to obtain the confidentiality requirement degree of each text unit, clustering the text units by similarity to obtain similar clusters, and adding noise to the core sentences according to the confidentiality requirement degree of the text units in each similar cluster to obtain the user's noisy text data; wherein the sensitivity evaluation index is obtained as follows: for a sentence of any user, taking the proportion of candidate keywords in the sentence as the density weight of the sentence; accumulating the sensitivity probabilities of the phrases in the sentence with respect to the candidate keywords in the sentence to obtain the sensitivity accumulated value of the sentence; and combining the density weight and the sensitivity accumulated value of the sentence to obtain the sensitivity evaluation index of the sentence.
- 2. The user privacy data protection method based on data security according to claim 1, wherein the sensitivity probability is obtained as follows: within all user information texts of each social topic, taking each candidate keyword in turn as the analysis word; for any phrase, taking the number of sentences in which the analysis word and the phrase occur together as the co-occurrence count; and taking the ratio of the co-occurrence count to the distribution total quantity as the sensitivity probability of the phrase with respect to the analysis word.
- 3. The user privacy data protection method based on data security according to claim 1, wherein screening out the core sentences of the user based on the sensitivity evaluation index comprises: among all sentences of each user, taking the sentences whose sensitivity evaluation index is higher than a preset core threshold as the core sentences of that user.
- 4. The user privacy data protection method based on data security according to claim 1, wherein the key evaluation index is obtained as follows: for any candidate keyword, marking each core sentence in which the candidate keyword appears as a target sentence; taking the mean sensitivity probability of all phrases in a target sentence with respect to the candidate keyword as the phrase association degree of the candidate keyword in that target sentence, and combining the phrase association degrees over all target sentences to obtain the association index of the candidate keyword; multiplying the number of occurrences of the candidate keyword in each target sentence by the sensitivity evaluation index of that target sentence to obtain the sensitivity occurrence degree of the candidate keyword in that target sentence; and multiplying the importance index of the candidate keyword by the association index to obtain the key evaluation index of the candidate keyword.
- 5. The user privacy data protection method based on data security according to claim 1, wherein screening out the keyword phrases based on the key evaluation index comprises: when the normalized key evaluation index of a candidate keyword is greater than a preset key threshold, taking that candidate keyword as a keyword phrase.
- 6. The user privacy data protection method based on data security according to claim 1, wherein the text units are obtained as follows: dividing all keyword phrases into a content word set and an emotion word set through semantic model analysis; calculating, with a word embedding model, the cosine similarity between keyword phrases of the content word set and keyword phrases of the emotion word set, and normalizing it to obtain the cross-cluster vocabulary similarity; and when the cross-cluster vocabulary similarity is greater than a preset similarity threshold, merging the sentences containing the corresponding keyword phrases into one text unit.
- 7. The user privacy data protection method based on data security according to claim 1, wherein the confidentiality requirement degree is obtained as follows: for any text unit, acquiring each phrase in the text unit; when a phrase exists in the preset sensitive dictionary, obtaining the sensitivity score of that phrase from the preset sensitive dictionary; and taking the accumulated sensitivity scores of all phrases in the text unit as the confidentiality requirement degree of the text unit.
- 8. The user privacy data protection method based on data security according to claim 1, wherein adding noise to the core sentences according to the confidentiality requirement degree of the text units in each similar cluster to obtain the user's noisy text data comprises: taking the mean confidentiality requirement degree of all text units in each similar cluster as the scale parameter of the text units in that cluster; and adding Laplace noise to the keyword phrases of the text units according to the scale parameter to obtain the user's noisy text.
- 9. A user privacy data protection system based on data security, the system comprising: a data acquisition module, configured to acquire the information text data of each user under each social topic and to obtain the phrases and candidate keywords of each sentence in the information text data through word segmentation; a sentence screening module, configured to obtain the sensitivity probability of each phrase with respect to each candidate keyword from the co-occurrence rate of the phrase and the candidate keyword in the information text data of all users, wherein the sensitivity evaluation index is obtained as follows: for a sentence of any user, taking the proportion of candidate keywords in the sentence as the density weight of the sentence; accumulating the sensitivity probabilities of the phrases in the sentence with respect to the candidate keywords in the sentence to obtain the sensitivity accumulated value of the sentence; and combining the density weight and the sensitivity accumulated value of the sentence to obtain the sensitivity evaluation index of the sentence; a merging analysis module, configured to obtain a key evaluation index for each candidate keyword by combining the sensitivity evaluation index with the occurrence frequency of the candidate keyword in the user's core sentences and its phrase association; and a noise adding protection module, configured to analyze the sensitivity of all words in each text unit against a preset sensitive dictionary to obtain the confidentiality requirement degree of each text unit, to cluster the text units by similarity to obtain similar clusters, and to add noise to the core sentences according to the confidentiality requirement degree of the text units in each similar cluster to obtain the user's noisy text data.
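The scoring and noise steps of claims 7 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claims do not say what numeric representation of a keyword phrase receives the noise, so the sketch assumes each keyword phrase is already represented as a vector of floats (for example a word embedding). The function names and the dictionary layout are hypothetical.

```python
import random

def confidentiality_requirement(unit_phrases, sensitive_scores):
    """Claim 7: sum the dictionary scores of the phrases found in the sensitive dictionary."""
    return sum(sensitive_scores[p] for p in unit_phrases if p in sensitive_scores)

def cluster_scale(requirements):
    """Claim 8: the mean requirement of a cluster's text units is its Laplace scale."""
    return sum(requirements) / len(requirements)

def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def add_noise(vector, scale, rng):
    """Perturb every component of a keyword-phrase vector (assumed representation)."""
    return [x + laplace_sample(scale, rng) for x in vector]

# Hypothetical sensitive dictionary and two text units in the same cluster.
scores = {"diagnosis": 3.0, "street": 2.0}
unit_a = ["diagnosis", "street", "lunch"]
unit_b = ["street", "weather"]
scale = cluster_scale([
    confidentiality_requirement(unit_a, scores),
    confidentiality_requirement(unit_b, scores),
])
noisy = add_noise([0.12, -0.40, 0.88], scale, random.Random(0))
```

Because the scale parameter grows with the mean confidentiality requirement, clusters with more sensitive vocabulary receive wider noise, which matches the adaptive behavior the claims describe.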
Description
User privacy data protection method and system based on data security
Technical Field
The invention relates to the technical field of privacy protection, and in particular to a user privacy data protection method and system based on data security.
Background
With the rapid development of internet technology, social media platforms rely on multi-source data sharing of content such as user comments, status updates, and private messages to deliver personalized recommendation and targeted service optimization. However, such unstructured text often contains sensitive content such as user identity information, location data, and health records, and its disclosure seriously threatens user privacy. To address this problem, conventional privacy protection techniques (e.g., differential privacy) often adopt a global noise addition strategy, blurring sensitive information by injecting random perturbations into the full dataset. However, unstructured text lacks a fixed format, so global noise easily breaks the semantic consistency of the text, causing key information to be lost or the usability of the data to be reduced. A uniform noise intensity is also difficult to adapt to text fragments of differing sensitivity: low-risk information may be perturbed excessively while high-risk information remains insufficiently protected.
Disclosure of Invention
In the prior art, global noise easily destroys the semantic consistency of text, so that key information is lost or data usability is reduced, and a uniform noise intensity is difficult to adapt to text fragments of differing sensitivity. To solve these technical problems, the invention provides a user privacy data protection method and system based on data security, and adopts the following technical scheme:
A first aspect of the application provides a user privacy data protection method based on data security, comprising: acquiring the phrases and candidate keywords of each sentence in the information text data through word segmentation; obtaining the sensitivity probability of each phrase with respect to each candidate keyword from the co-occurrence rate of the phrase and the candidate keyword in the information text data of all users; obtaining a key evaluation index for each candidate keyword by combining the sensitivity evaluation index with the occurrence frequency of the candidate keyword in the user's core sentences and its phrase association; and analyzing the sensitivity of all words in each text unit against a preset sensitive dictionary to obtain the confidentiality requirement degree of each text unit, clustering the text units by similarity to obtain similar clusters, and adding noise to the core sentences according to the confidentiality requirement degree of the text units in each similar cluster to obtain the user's noisy text data.
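The co-occurrence-based sensitivity probability, and the sentence-level sensitivity evaluation index that claim 1 builds on it, can be sketched in Python as follows. Two points are assumptions rather than statements from the source: the "distribution total quantity" is taken to be the number of sentences that contain the analysis word, and "combining" the density weight with the accumulated value is taken to be a product. Sentences are assumed to be already segmented into lists of phrases, and all function names are hypothetical.

```python
from collections import defaultdict

def sensitivity_probabilities(sentences, keywords):
    """Return {(phrase, keyword): probability} over all segmented sentences.

    Assumption: the denominator is the number of sentences containing the keyword.
    """
    keyword_sentences = defaultdict(int)  # sentences containing each keyword
    co_occurrence = defaultdict(int)      # sentences containing both phrase and keyword
    for sentence in sentences:
        phrases = set(sentence)
        for kw in phrases & keywords:
            keyword_sentences[kw] += 1
            for ph in phrases:
                if ph != kw:
                    co_occurrence[(ph, kw)] += 1
    return {pair: n / keyword_sentences[pair[1]] for pair, n in co_occurrence.items()}

def sensitivity_evaluation_index(sentence, keywords, probs):
    """Density weight times accumulated sensitivity probability (product is assumed)."""
    if not sentence:
        return 0.0
    present = {p for p in sentence if p in keywords}
    # Density weight: share of the sentence's phrases that are candidate keywords.
    density = sum(1 for p in sentence if p in keywords) / len(sentence)
    # Accumulate each phrase's probability toward each keyword in the sentence.
    accumulated = sum(
        probs.get((ph, kw), 0.0) for ph in sentence for kw in present if ph != kw
    )
    return density * accumulated

# Hypothetical segmented corpus with a single candidate keyword.
corpus = [["flu", "hospital", "visit"], ["hospital", "parking"], ["flu", "recover"]]
candidate_keywords = {"hospital"}
probs = sensitivity_probabilities(corpus, candidate_keywords)
index = sensitivity_evaluation_index(corpus[0], candidate_keywords, probs)
```

Sentences whose index exceeds a preset core threshold would then be kept as core sentences; the threshold value itself is not given in the source.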
Further, the sensitivity probability is obtained as follows: within all user information texts of each social topic, taking each candidate keyword in turn as the analysis word; for any phrase, taking the number of sentences in which the analysis word and the phrase occur together as the co-occurrence count; and taking the ratio of the co-occurrence count to the distribution total quantity as the sensitivity probability of the phrase with respect to the analysis word.
Further, the sensitivity evaluation index is obtained as follows: for a sentence of any user, taking the proportion of candidate keywords in the sentence as the density weight of the sentence; accumulating the sensitivity probabilities of the phrases in the sentence with respect to the candidate keywords in the sentence to obtain the sensitivity accumulated value of the sentence; and combining the density weight and the sensitivity accumulated value of the sentence to obtain the sensitivity evaluation index of the sentence.
Further, screening out the core sentences of the user based on the sensitivity evaluation index comprises: among all sentences of each user, taking the sentences whose sensitivity evaluation index is higher than a preset core threshold as the core sentences of that user.
Further, the key evaluation index is obtained as follows: for any candidate keyword, marking each core sentence in which the candidate keyword appears as a target sentence; taking the mean sensitivity probability of all phrases in a target sentence with respect to the candidate keyword as the phrase association degree of the candidate keyword in that target sentence, and combining the phrase association degrees over all target sentences to obtain the association index of the candidate keyword; Multiplying the occurrence number of the candidate keywords in each target sentence wit