CN-122021899-A - Method for extracting and verifying subject-privacy association relations in unstructured text


Abstract

The invention provides a method for extracting and verifying subject-privacy association relations in unstructured text, and relates to the technical field of privacy protection. The method comprises: receiving an input unstructured text document, a subject set, and a set of privacy value candidate segments; extracting explicit subject-privacy association triplets with a discriminative model and predicting potential privacy attributes of each subject, thereby generating an explicit triplet set and a candidate set to be verified, wherein the explicit triplet set comprises associations that appear directly in the unstructured text document and the candidate set to be verified comprises potential associations predicted from global semantics; performing semantic reasoning verification and implicit attribute completion on the candidate set to be verified with a generative model to generate a reasoning verification result; and fusing the explicit triplet set and the reasoning verification result using a probability fusion strategy, and outputting a high-precision, interpretable list of subject-privacy association triplets. The invention improves the accuracy of attribution among multiple subjects and provides an interpretable, highly robust solution for privacy compliance applications.

Inventors

ZHU NAFEI
WANG ZIQING
HE JINGSHA
LIU YUE
ZHANG CHUNHUI

Assignees

Beijing University of Technology (北京工业大学)

Dates

Publication Date: 20260512
Application Date: 20260129

Claims (10)

1. A method for extracting and verifying subject-privacy association relations in unstructured text, characterized by comprising the following steps: receiving an input unstructured text document, a subject set, and a set of privacy value candidate segments, wherein the document is document-level text containing cross-sentence semantic dependencies; in first-stage processing based on a discriminative model, extracting explicit subject-privacy association triplets from the unstructured text document based on the subject set and the set of privacy value candidate segments, and predicting potential privacy attributes of each subject from the full-text semantics of the unstructured text document, thereby generating an explicit triplet set and a candidate set to be verified, wherein the explicit triplet set comprises associations that appear directly in the unstructured text document, and the candidate set to be verified comprises potential associations predicted from global semantics; in second-stage processing based on a generative model, performing semantic reasoning verification and implicit attribute completion on the candidate set to be verified against the unstructured text document to generate a reasoning verification result, wherein the generative model performs controlled reasoning using knowledge injection and a chain-of-thought mechanism; and fusing the explicit triplet set and the reasoning verification result using a probability fusion strategy calibrated on a development set, and outputting a high-precision, interpretable list of subject-privacy association triplets, with a fusion score and a source mark attached to each result.
2. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 1, wherein the first-stage processing based on the discriminative model comprises two subtasks executed in parallel: subtask A, which computes, based on a document-level relation extraction model, the association probability between each subject in the subject set and each privacy value candidate segment in the set of privacy value candidate segments, extracts explicit subject-privacy association triplets, and generates the explicit triplet set; and subtask B, which predicts, based on a multi-label classification model, the privacy attribute types that each subject in the subject set may be associated with in the global context, and generates the candidate set to be verified; wherein subtask A and subtask B share the same deep context encoder and are optimized together through a joint loss function.
3. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 2, characterized in that the document-level relation extraction model adopts an ATLOP or DREEAM architecture to capture cross-sentence semantic dependencies in the document-level context, and the multi-label classification model adopts a fully connected layer or a bilinear classifier to output a probability distribution over the potential privacy attribute types of each subject.
4. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 1, wherein the second-stage processing based on the generative model comprises: the generative model is a lightweight large language model, and parameter-efficient fine-tuning is used to transfer model capability so that the lightweight large language model is adapted to the privacy domain; the generative model receives the candidate set to be verified and, for each subject-attribute candidate to be verified, retrieves the metadata configuration of the attribute, comprising an inference permission flag, a standard value format, and few-shot examples, and dynamically assembles an injected instruction according to the inference permission flag, wherein if the inference permission flag is True, an instruction is generated that permits logical inference from implicit contextual clues, and if the inference permission flag is False, an instruction is generated that strictly restricts the model to extracting explicit textual evidence only; and a step-by-step reasoning template is hard-coded into the instruction, so that the model is forced to generate a natural-language reasoning path before outputting the reasoning verification result, and constrained decoding is applied to ensure that the output conforms to a predefined JSON Schema, wherein the JSON Schema comprises at least existence, privacy value, evidence segment, and reasoning analysis fields.
5. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 4, characterized in that the lightweight large language model is a generative model with 0.6B to 7B parameters selected from the Qwen series or the Llama 3 series; the parameter-efficient fine-tuning uses QLoRA, training only the low-rank adapter weights; and the inference stage uses the vLLM inference framework, with GPU memory management optimized through PagedAttention.
6. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 1, wherein the probability fusion strategy comprises: computing a first-stage statistical prior probability by mapping the explicit extraction result to a continuous probability value based on a precision prior and a miss-rate prior measured on the development set; parsing the multidimensional indicators output by the second stage, the multidimensional indicators comprising a presence indicator, a format compliance indicator, and a candidate-set hit indicator, and obtaining an inference confidence through weighted summation; and fusing the first-stage and second-stage scores with a linear weighting formula to obtain a fusion score, and executing conditional rollback logic according to a global decision threshold, whereby a high-confidence second-stage result is adopted preferentially and the method otherwise rolls back to the first-stage explicit result.
7. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 6, wherein executing the conditional rollback logic according to the global decision threshold, preferentially adopting a high-confidence second-stage result and otherwise rolling back to the first-stage explicit result, comprises: setting the global decision threshold as δ, and denoting the privacy values given by the first stage and the second stage as v1 and v2 respectively; when the fusion score F(t) ≥ δ and the second stage outputs a non-empty privacy value v2, adopting v2; when the fusion score F(t) < δ and the first stage outputs a non-empty privacy value v1, adopting v1; otherwise, determining that the association does not exist.
8. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 6, wherein the linear weighting formula is defined as follows: for any candidate triplet t in the explicit triplet set and the reasoning verification result, the final fusion score is F(t) = α·r(t) + (1 − α)·q(t), where r(t) is the statistical prior probability of the first stage, q(t) is the overall confidence of the second stage, and α is a global weight parameter calibrated on the development set.
9. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 1, further comprising a privacy knowledge base, wherein the privacy knowledge base stores metadata of privacy attributes in JSON format, the metadata comprising an inference permission flag, a standard value format, and few-shot examples, and the inference permission flag controls the inference boundary of the generative model, the permitted operation types comprising extraction only and extraction plus inference.
10. The method for extracting and verifying subject-privacy association relations in unstructured text according to claim 1, wherein each result in the output list of subject-privacy association triplets is marked as either explicitly extracted or inferentially completed, and a fusion score and an inference path are provided to support auditing and interpretability.
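The fusion and rollback steps of claims 6 to 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `StageTwoOutput` record, the indicator weights, and the way the stage-one prior is derived from the development-set precision and miss rate are all hypothetical. The claims themselves fix only the linear formula F(t) = α·r(t) + (1 − α)·q(t) and the comparison against the threshold δ, and the non-empty checks reflect one reading of claim 7.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StageTwoOutput:
    """Hypothetical record of the indicators parsed from the stage-two output."""
    exists: bool           # presence indicator
    format_ok: bool        # format compliance indicator
    in_candidates: bool    # candidate-set hit indicator
    value: Optional[str]   # privacy value proposed by stage two, if any


def stage_one_prior(extracted: bool, precision: float, miss_rate: float) -> float:
    """r(t): map the explicit extraction result to a probability.

    Assumption: an extracted triplet gets the development-set precision,
    and a non-extracted one gets the development-set miss rate.
    """
    return precision if extracted else miss_rate


def stage_two_confidence(out: StageTwoOutput,
                         weights: Tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """q(t): weighted sum of the three indicators (weights are illustrative)."""
    w_exists, w_format, w_hit = weights
    return (w_exists * float(out.exists)
            + w_format * float(out.format_ok)
            + w_hit * float(out.in_candidates))


def fuse(r: float, q: float, alpha: float) -> float:
    """F(t) = alpha * r(t) + (1 - alpha) * q(t), as in claim 8."""
    return alpha * r + (1.0 - alpha) * q


def decide(score: float, delta: float,
           v1: Optional[str], v2: Optional[str]) -> Tuple[Optional[str], str]:
    """Conditional rollback: prefer stage two above the threshold,
    roll back to the stage-one value below it, otherwise no association."""
    if score >= delta and v2 is not None:
        return v2, "stage_two"
    if score < delta and v1 is not None:
        return v1, "stage_one"
    return None, "none"


if __name__ == "__main__":
    out = StageTwoOutput(exists=True, format_ok=True,
                         in_candidates=False, value="major depression")
    r = stage_one_prior(True, precision=0.9, miss_rate=0.2)
    q = stage_two_confidence(out)
    score = fuse(r, q, alpha=0.5)
    print(score, decide(score, 0.6, "depression", out.value))
```

Returning a source tag alongside the value mirrors the requirement in claims 1 and 10 that each output carries a fusion score and a source mark. In practice α and δ would be calibrated on the development set rather than set by hand.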

Description

Method for extracting and verifying subject-privacy association relations in unstructured text

Technical Field

The invention relates to the technical field of privacy protection, and in particular to a method for extracting and verifying subject-privacy association relations in unstructured text.

Background

As digital transformation deepens, massive amounts of natural-language text have accumulated in fields such as healthcare, judicial litigation, and financial auditing, and a considerable share of it consists of unstructured documents with complex contextual information. These texts (e.g., electronic medical records, court judgments, chat records) typically exhibit cross-sentence semantic dependencies and the co-occurrence of multiple subjects (e.g., patients, family members, healthcare workers, parties to a case, witnesses). Under data compliance requirements (e.g., GDPR, personal information protection laws), merely identifying which sensitive words a text contains is not sufficient. Practical applications urgently need to solve the problem of subject-privacy attribution, namely, in a document involving several natural persons, determining accurately which subject a given piece of private information (such as "diagnosed with depression") belongs to, so that a structured privacy profile can be constructed.

At present, research related to identifying subject-privacy associations in text mainly involves sensitive information detection, document-level relation extraction, and the application of large language models.
The first category is sensitive information detection based on pattern matching and sequence labeling. This is currently the most widely used approach in industry; it relies mainly on regular expressions to match strings of specific formats, or on named entity recognition models such as BERT-CRF to label private information as specific entities. The second category is general-domain document-level relation extraction, which uses graph neural networks or fully connected attention mechanisms to capture semantic dependencies between entity pairs in a document and thereby extract the semantic relation between two entities. The third category is end-to-end generation based on large language models: a prompt is constructed so that the generative model reads the whole text and directly outputs a structured result through sequence generation.

These prior approaches provide a theoretical basis and methodological support for the invention, but for complex subject-privacy association tasks, particularly in document-level contexts, the following significant technical bottlenecks remain:

First, traditional discriminative models lack deep semantic reasoning capability. They cannot understand privacy attributes that are not explicitly mentioned in the text but are logically implied (e.g., inferring identity from a description of behavior), which results in a large number of missed detections of implicit privacy.

Second, generative large models carry hallucination risk and high deployment cost. Using an end-to-end large model directly tends to produce associations that do not exist, and document-level input in particular runs into context window limits and high computational cost, making it difficult to meet industrial-scale data processing requirements.
Third, cross-sentence semantic dependencies and multi-subject confusion pose a challenge. In complex texts with cross-sentence coreference and multiple co-occurring subjects, existing methods struggle to handle long-distance semantic dependencies and are prone to misattribution, that is, assigning one person's information to another.

In view of these shortcomings of the prior art, the technical problem the invention aims to solve is how to construct a hybrid architecture that both exploits the efficiency of a discriminative model to achieve full coverage of explicit privacy and uses the reasoning capability of a generative model to mine implicit privacy and reject erroneous associations, thereby achieving high-precision, low-cost, and interpretable subject-privacy association identification.

Disclosure of Invention

To address the problems described in the background, the invention provides a method for extracting and verifying subject-privacy association relations in unstructured text. By introducing an architecture of 'two-stage cooperation and probability fusion', it combines the high precision of a discriminative model with the deep semantic understanding of a generative model, overcomes the bottleneck of traditional methods in implicit privacy mining, suppresses hallucination risk, optimizes the calculation ef