CN-122021637-A - Policy text matching method and system based on semantic understanding and contrastive learning


Abstract

The invention provides a policy text matching method and system based on semantic understanding and contrastive learning, in the technical field of computer information processing. The method comprises: step S1, acquiring a large number of historical policy texts and preprocessing each historical policy text to obtain a plurality of policy paragraphs; step S2, labeling each policy paragraph, inputting each policy paragraph into a policy semantic vector model to obtain a corresponding policy semantic paragraph vector, and constructing a data set from the policy semantic paragraph vectors and the labels; step S3, creating a policy semantic matching model and setting its loss function; and step S4, training the policy semantic matching model with the data set and the loss function, and performing policy text matching based on the trained policy semantic matching model. The method has the advantage of greatly improving the accuracy and efficiency of policy text matching.

Inventors

LI WEI
LIN XING

Assignees

福州市数字产业互联科技有限责任公司

Dates

Publication Date: 20260512
Application Date: 20251218

Claims (10)

1. A policy text matching method based on semantic understanding and contrastive learning, characterized by comprising the following steps: step S1, acquiring a large number of historical policy texts, and performing preprocessing comprising at least semantic segmentation, word segmentation and stop word removal on each historical policy text to obtain a plurality of policy paragraphs; step S2, labeling the policy paragraphs, the labels comprising positive sample pairs and negative sample pairs, inputting each policy paragraph into a pre-trained policy semantic vector model to obtain a corresponding policy semantic paragraph vector, and constructing a data set from the policy semantic paragraph vectors and the labels; step S3, creating a policy semantic matching model and setting a loss function of the policy semantic matching model; and step S4, training the policy semantic matching model with the data set and the loss function, and performing policy text matching based on the trained policy semantic matching model.
2. The policy text matching method based on semantic understanding and contrastive learning according to claim 1, wherein step S1 specifically comprises: acquiring a large number of historical policy texts, and performing semantic segmentation on each historical policy text by combining a preset segmentation rule with a pre-trained language model to obtain a plurality of segmented paragraphs; and performing word segmentation on each segmented paragraph based on a preset policy dictionary, removing stop words from the resulting words based on a preset stop word list, and constructing a policy paragraph from the remaining words of each segmented paragraph, thereby completing the preprocessing of each historical policy text.
3. The policy text matching method based on semantic understanding and contrastive learning according to claim 1, wherein step S2 specifically comprises: associating the policy paragraphs to construct paragraph pairs, and labeling each paragraph pair as a positive sample pair or a negative sample pair; inputting each policy paragraph into a pre-trained policy semantic vector model to obtain a corresponding policy semantic paragraph vector; and pairing the policy semantic paragraph vectors according to the labels to construct a data set.
4. The policy text matching method based on semantic understanding and contrastive learning according to claim 1, wherein step S3 specifically comprises: creating a policy semantic matching model for policy text matching based on a Siamese network or a Triplet network; wherein, when the policy semantic matching model adopts the Siamese network, its loss function is set to a contrastive loss function, and when the policy semantic matching model adopts the Triplet network, its loss function is set to a triplet loss function.
5. The policy text matching method based on semantic understanding and contrastive learning according to claim 1, wherein step S4 specifically comprises: setting hyperparameters for training the policy semantic matching model, the hyperparameters comprising at least a learning rate, a batch size, a number of training epochs and a margin (boundary value); dividing the data set into a training set, a validation set and a test set according to a preset ratio; training the policy semantic matching model on the training set, and optimizing the model parameters of the policy semantic matching model during training by combining the loss function with a negative sampling mechanism; monitoring and evaluating the performance of the policy semantic matching model during training on the validation set, and iteratively optimizing the hyperparameters based on the evaluation results; after training is completed, testing the policy semantic matching model on the test set; and deploying the policy semantic matching model that passes the test, and performing policy text matching with the deployed policy semantic matching model.
6. A policy text matching system based on semantic understanding and contrastive learning, characterized by comprising the following modules: a historical policy text preprocessing module, configured to acquire a large number of historical policy texts and perform preprocessing comprising at least semantic segmentation, word segmentation and stop word removal on each historical policy text to obtain a plurality of policy paragraphs; a data set construction module, configured to label the policy paragraphs, the labels comprising positive sample pairs and negative sample pairs, input each policy paragraph into a pre-trained policy semantic vector model to obtain a corresponding policy semantic paragraph vector, and construct a data set from the policy semantic paragraph vectors and the labels; a policy semantic matching model creation module, configured to create a policy semantic matching model and set a loss function of the policy semantic matching model; and a policy text matching module, configured to train the policy semantic matching model with the data set and the loss function and perform policy text matching based on the trained policy semantic matching model.
7. The policy text matching system based on semantic understanding and contrastive learning according to claim 6, wherein the historical policy text preprocessing module is specifically configured to: acquire a large number of historical policy texts, and perform semantic segmentation on each historical policy text by combining a preset segmentation rule with a pre-trained language model to obtain a plurality of segmented paragraphs; and perform word segmentation on each segmented paragraph based on a preset policy dictionary, remove stop words from the resulting words based on a preset stop word list, and construct a policy paragraph from the remaining words of each segmented paragraph, thereby completing the preprocessing of each historical policy text.
8. The policy text matching system based on semantic understanding and contrastive learning according to claim 6, wherein the data set construction module is specifically configured to: associate the policy paragraphs to construct paragraph pairs, and label each paragraph pair as a positive sample pair or a negative sample pair; input each policy paragraph into a pre-trained policy semantic vector model to obtain a corresponding policy semantic paragraph vector; and pair the policy semantic paragraph vectors according to the labels to construct a data set.
9. The policy text matching system based on semantic understanding and contrastive learning according to claim 6, wherein the policy semantic matching model creation module is specifically configured to: create a policy semantic matching model for policy text matching based on a Siamese network or a Triplet network; wherein, when the policy semantic matching model adopts the Siamese network, its loss function is set to a contrastive loss function, and when the policy semantic matching model adopts the Triplet network, its loss function is set to a triplet loss function.
10. The policy text matching system based on semantic understanding and contrastive learning according to claim 6, wherein the policy text matching module is specifically configured to: set hyperparameters for training the policy semantic matching model, the hyperparameters comprising at least a learning rate, a batch size, a number of training epochs and a margin (boundary value); divide the data set into a training set, a validation set and a test set according to a preset ratio; train the policy semantic matching model on the training set, and optimize the model parameters of the policy semantic matching model during training by combining the loss function with a negative sampling mechanism; monitor and evaluate the performance of the policy semantic matching model during training on the validation set, and iteratively optimize the hyperparameters based on the evaluation results; after training is completed, test the policy semantic matching model on the test set; and deploy the policy semantic matching model that passes the test, and perform policy text matching with the deployed policy semantic matching model.
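
Claims 4 and 9 name a contrastive loss for the Siamese variant and a triplet loss for the Triplet variant without giving formulas. The following is a minimal sketch, assuming the widely used margin-based forms over Euclidean distance. The function names, the 0.5 scaling and the default margin of 1.0 are illustrative assumptions and are not taken from the patent. The `margin` parameter plays the role of the boundary value listed among the training parameters in claims 5 and 10.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two equal-length paragraph vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u: Vector, v: Vector, is_positive: bool,
                     margin: float = 1.0) -> float:
    """Assumed pairwise form for the Siamese variant.

    Positive pairs are penalised by their squared distance; negative pairs
    are penalised only while they are closer than `margin`.
    """
    d = euclidean(u, v)
    if is_positive:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Assumed form for the Triplet variant.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for policy semantic paragraph vectors.
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=True))
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0]))
```

In both functions the margin matters only for samples that are not yet separated enough, which is why it is typically tuned together with the learning rate and batch size.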

Description

Policy text matching method and system based on semantic understanding and contrastive learning

Technical Field

The invention relates to the technical field of computer information processing, and in particular to a policy text matching method and system based on semantic understanding and contrastive learning.

Background

The number of policy documents of all kinds continues to grow rapidly, forming a multi-level, multi-type body of text that covers a wide range of content, is updated frequently, and spans cross-regional collaboration, multi-industry guidance, special-purpose support and the like. Against this background, policy-making departments, academic institutions and related enterprises face increasingly significant technical challenges in policy research, content comparison and intelligence extraction. Specifically, with respect to supporting efficient and accurate policy text analysis and retrieval, the prior art mainly has the following key problems:

First, policy texts are rigorously structured and highly standardized in language, which makes automatic parsing and semantic matching difficult. Policy documents are generally written in the language of legal provisions or administrative documents and have complex logical hierarchies, often containing many qualifying conditions, citation relationships and formulaic expressions. Such texts are not only long but also contain a great deal of repetition, paraphrase and synonymous but differently worded expression. For example, the same policy concept may be expressed differently in different documents or paragraphs, and conventional methods based on word matching are prone to misjudgment because of these differences in expression. In addition, the nested sentence patterns, compound structures and dense technical terminology common in policy texts make syntactic analysis and key information extraction harder for natural language processing models, which impairs the accurate parsing and comparison of policy provisions.

Second, the traditional retrieval mechanism based on keyword matching is semantically limited and struggles with vocabulary mismatch. Existing policy retrieval systems rely on keyword queries and literal matching; although they meet basic retrieval needs, they have difficulty capturing policy content that is semantically similar but worded differently. For example, "motivate enterprise technology innovation" is semantically related to "support enterprise development of technology research and development activities," but traditional retrieval models struggle to associate the two because their keywords do not align. This limitation restricts the recall and accuracy of retrieval results and cannot satisfy users' need to efficiently obtain semantically related content.

Third, existing policy text analysis techniques generally lack the ability to model deep semantic logic and the relationships between provisions, which harms the relevance and accuracy of retrieval results. Provisions in policy texts are typically connected through semantic links, logical references or conditional constraints, forming an internally consistent normative hierarchy. However, most current systems still rely on shallow statistics such as word frequency and co-occurrence, and fail to identify deep semantic information such as policy intent, applicable circumstances and exception clauses. For example, when determining whether one policy links to or supplements another, surface text similarity alone often cannot yield a correct conclusion; deeper reasoning that combines semantic roles with the policy context is needed.
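
The vocabulary mismatch described under the second problem above can be made concrete with a small calculation. The sketch below uses Jaccard overlap of whitespace tokens as a stand-in for literal keyword matching; this scoring choice and the function name `jaccard` are illustrative assumptions, since the text does not specify how existing systems score matches. The two strings are the English renderings of the example phrases quoted above.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lowercase whitespace-token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty inputs
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


query = "motivate enterprise technology innovation"
clause = ("support enterprise development of technology "
          "research and development activities")

# Only "enterprise" and "technology" are shared: 2 of 10 distinct tokens.
print(jaccard(query, clause))  # 0.2
```

A literal-overlap score this low would rank the clause poorly even though the two phrases describe closely related policy goals, which is the gap a semantic matching model is meant to close.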
The lack of semantic logic modeling capability also causes existing systems to perform poorly in policy comparison, conflict detection and compliance analysis in complex scenarios.

Therefore, how to provide a policy text matching method and system based on semantic understanding and contrastive learning, so as to improve the accuracy and efficiency of policy text matching, is a technical problem to be solved urgently.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a policy text matching method and system based on semantic understanding and contrastive learning that improve the accuracy and efficiency of policy text matching.

In a first aspect, the invention provides a policy text matching method based on semantic understanding and contrastive learning, comprising the following steps:

Step S1, acquiring a large number of historical policy texts, and performing preprocessing comprising at least semantic segmentation, word segmentation and stop word removal on each historical policy text to obtain a plurality of policy paragraphs;

Step S2, labeling each policy paragraph comprises a positive sample pair and a negative sample