CN-122019797-A - Semantically enhanced fusion retrieval method and system for multi-modal digital archive data
Abstract
The invention relates to the technical field of digital archive management and information retrieval, and discloses a semantically enhanced fusion retrieval method and system for multi-modal digital archive data. The method constructs a policy period timeline and a policy term evolution map, performs temporal logic reasoning on archive seals to derive how their authority efficacy evolves, fuses temporal authority feature vectors with content semantic vectors to generate multi-modal representation vectors, and combines query expansion with temporal authority filtering to achieve semantically enhanced retrieval across policy periods. This addresses the missed and erroneous retrieval of policy and regulation archives caused by the historical evolution of seal authority and by terms that change across policy periods.
Inventors
- LIN JINZHONG
- WANG YIZHEN
- Xu Jiaoxian
- Guan Haoyao
- YANG HUAILIANG
- LIN JINFENG
Assignees
- 中档信息(广东)集团有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-26
Claims (10)
- 1. A semantically enhanced fusion retrieval method for multi-modal digital archive data, characterized by comprising the following steps:
  acquiring a scanned archive image and its formation-time metadata, extracting the release time, scope of application, and revocation time of each policy document associated with the archive, generating a policy period timeline based on the time attributes of the policy documents, and matching the archive formation time against the policy period timeline to determine the policy period to which the archive belongs;
  extracting terms from the archives in each policy period, identifying policy proper nouns and authority-related terms, calculating semantic similarity between the term sets of the policy periods, identifying term evolution pairs having a semantic continuation relation, and generating a policy term evolution map with the terms of each policy period as nodes and the term evolution pairs as edges;
  performing seal detection and seal recognition on the scanned archive image, extracting the position coordinates and seal features of each seal, and matching the seal features against a seal knowledge base to generate an identity label and a basic authority level for each seal;
  analyzing the spatial relationship among multiple seals on the same archive to identify a seal combination mode, and matching each seal identity label against the authority rule set of the current policy period to obtain the authority level definition of each seal in that policy period;
  deriving a composite authority efficacy label for the seal combination in the current policy period based on the seal combination mode and the current authority level definitions, and generating an authority efficacy evolution record;
  encoding the authority efficacy evolution record into a temporal authority feature vector, semantically encoding the archive text content to generate a content semantic vector, and fusing the temporal authority feature vector with the content semantic vector to generate a temporal-authority-enhanced multi-modal representation vector;
  receiving a user query statement, identifying the policy term and the authority state constraint in the query statement, performing graph traversal in the policy term evolution map starting from the identified policy term to retrieve the equivalent terms of that term in each historical policy period, replacing the policy term in the original query statement with each historical equivalent term to generate an expanded query set, semantically encoding the expanded query set to generate expanded query semantic vectors, and generating a temporal authority filtering condition according to the authority state constraint;
  screening the archives using the temporal authority filtering condition to obtain a candidate archive set satisfying the authority state constraint, calculating similarity between the expanded query semantic vectors and the temporal-authority-enhanced multi-modal representation vector of each archive in the candidate archive set, ranking the candidate archives by comprehensive similarity, and outputting the retrieval results.
- 2. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein generating the policy period timeline comprises: sorting the policy documents by release time, taking the release time of each policy document as the start of a new period, and taking its revocation time or the release time of the next policy document as the end of that period, thereby forming a continuous sequence of policy periods.
- 3. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein identifying term evolution pairs having a semantic continuation relation comprises: calculating the semantic similarity between terms of the current policy period and terms of adjacent policy periods; determining that two terms form a term evolution pair when their similarity exceeds a preset threshold; and adding an evolution type label to the term evolution pair; wherein the evolution type labels include name substitution, concept merging, and concept splitting.
- 4. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein identifying the seal combination mode comprises: classifying the seal combination mode as parallel stamping, overlapping stamping, or cross-page seam stamping according to the relative positions and overlap among the seals; in the parallel stamping mode, the seals are arranged in sequence in the horizontal or vertical direction without overlapping; in the overlapping stamping mode, the seals have partially overlapping areas; and in the cross-page seam stamping mode, a seal is stamped continuously across the boundary between archive pages.
- 5. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein deriving the composite authority efficacy label for the seal combination in the current policy period based on the seal combination mode and the current authority level definitions comprises: acquiring the authority level definition of each seal in the seal combination and the seal combination mode; retrieving the corresponding authority superposition rule from the authority combination rules of the current policy period according to the seal combination mode; substituting the authority level definitions of the seals into the authority superposition rule to calculate a composite authority efficacy value; and generating the composite authority efficacy label by comparing the composite authority efficacy value with an authority threshold.
- 6. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 5, wherein the composite authority efficacy label is one of full efficacy, partial efficacy, and no efficacy; and the authority efficacy evolution record includes a seal combination identifier, a source policy period identifier, a target policy period identifier, a source period authority efficacy label, a target period authority efficacy label, and an efficacy change type.
- 7. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein fusing the temporal authority feature vector with the content semantic vector comprises: one-hot encoding each field of the authority efficacy evolution record, mapping the seal combination identifier, the authority efficacy labels, and the efficacy change type to vectors of corresponding dimensions, and concatenating them to form the temporal authority feature vector; and multiplying the temporal authority feature vector and the content semantic vector by their respective weight coefficients and concatenating the results to generate the temporal-authority-enhanced multi-modal representation vector.
- 8. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein the authority state constraint includes an authority efficacy type constraint and a policy period range constraint; the authority efficacy type constraint specifies the authority efficacy label that a target archive of the query should have, and the policy period range constraint specifies the range of policy periods within which the formation time of a target archive of the query should fall.
- 9. The semantically enhanced fusion retrieval method for multi-modal digital archive data according to claim 1, wherein calculating similarity between the expanded query semantic vectors and the temporal-authority-enhanced multi-modal representation vector of each archive in the candidate archive set comprises: calculating the cosine similarity between each expanded query semantic vector in the expanded query set and the temporal-authority-enhanced multi-modal representation vector of the candidate archive, multiplying each cosine similarity by the corresponding expanded query weight, and summing the products to obtain a comprehensive similarity score.
- 10. A semantically enhanced fusion retrieval system for multi-modal digital archive data, for executing the semantically enhanced fusion retrieval method for multi-modal digital archive data according to any one of claims 1 to 9, comprising:
  a policy period determining module, used for acquiring a scanned archive image and its formation-time metadata, extracting the release time, scope of application, and revocation time of each policy document associated with the archive, generating a policy period timeline based on the time attributes of the policy documents, and matching the archive formation time against the policy period timeline to determine the policy period to which the archive belongs;
  a policy term evolution map construction module, used for extracting terms from the archives in each policy period, identifying policy proper nouns and authority-related terms, calculating semantic similarity between the term sets of the policy periods, identifying term evolution pairs having a semantic continuation relation, and generating a policy term evolution map with the terms of each policy period as nodes and the term evolution pairs as edges;
  a seal identification and authority matching module, used for performing seal detection and seal recognition on the scanned archive image, extracting the position coordinates and seal features of each seal, and matching the seal features against a seal knowledge base to generate an identity label and a basic authority level for each seal;
  a temporal authority efficacy deduction module, used for analyzing the spatial relationship among multiple seals on the same archive to identify a seal combination mode, matching each seal identity label against the authority rule set of the current policy period to obtain the authority level definition of each seal in that policy period, deriving a composite authority efficacy label for the seal combination in the current policy period based on the seal combination mode and the current authority level definitions, and generating an authority efficacy evolution record;
  a temporal authority feature fusion module, used for encoding the authority efficacy evolution record into a temporal authority feature vector, semantically encoding the archive text content to generate a content semantic vector, and fusing the temporal authority feature vector with the content semantic vector to generate a temporal-authority-enhanced multi-modal representation vector;
  a query expansion module, used for receiving a user query statement, identifying the policy term and the authority state constraint in the query statement, performing graph traversal in the policy term evolution map starting from the identified policy term to retrieve the equivalent terms of that term in each historical policy period, replacing the policy term in the original query statement with each historical equivalent term to generate an expanded query set, semantically encoding the expanded query set to generate expanded query semantic vectors, and generating a temporal authority filtering condition according to the authority state constraint;
  and a retrieval output module, used for screening the archives using the temporal authority filtering condition to obtain a candidate archive set satisfying the authority state constraint, calculating similarity between the expanded query semantic vectors and the temporal-authority-enhanced multi-modal representation vector of each archive in the candidate archive set, ranking the candidate archives by comprehensive similarity, and outputting the retrieval results.
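The timeline construction recited in claim 2, together with the period matching of claim 1, can be illustrated with a minimal Python sketch. The class and function names (`PolicyDocument`, `PolicyPeriod`, `build_timeline`, `find_period`) and the sample regulations are hypothetical and do not come from the patent. The sketch also assumes that, to keep the sequence continuous, the release time of the next policy document closes a period whenever one exists, and that the revocation time is used only for the last document; the claim itself does not state which of the two takes priority.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PolicyDocument:
    name: str
    release: date
    revocation: Optional[date] = None  # None: not yet revoked


@dataclass
class PolicyPeriod:
    policy: str
    start: date
    end: Optional[date]  # None: open-ended (current) period


def build_timeline(docs: List[PolicyDocument]) -> List[PolicyPeriod]:
    """Sort documents by release time; each release opens a new period."""
    ordered = sorted(docs, key=lambda d: d.release)
    periods = []
    for i, doc in enumerate(ordered):
        next_release = ordered[i + 1].release if i + 1 < len(ordered) else None
        # Assumption: the next release closes the period when one exists;
        # otherwise the document's own revocation time (if any) does.
        end = next_release if next_release is not None else doc.revocation
        periods.append(PolicyPeriod(doc.name, doc.release, end))
    return periods


def find_period(timeline: List[PolicyPeriod], formed: date) -> Optional[PolicyPeriod]:
    """Return the period whose half-open interval [start, end) contains `formed`."""
    for p in timeline:
        if p.start <= formed and (p.end is None or formed < p.end):
            return p
    return None


# Invented sample data: two regulations, deliberately given out of order.
docs = [
    PolicyDocument("Regulation B", date(2015, 7, 1)),
    PolicyDocument("Regulation A", date(2008, 1, 1), date(2015, 7, 1)),
]
timeline = build_timeline(docs)
period = find_period(timeline, date(2012, 3, 15))
print(period.policy if period else "no matching period")
```

Treating each period as a half-open interval is a design choice of this sketch: an archive formed exactly on the day a new policy is released is assigned to the new period rather than to both.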
Description
Semantically enhanced fusion retrieval method and system for multi-modal digital archive data

Technical Field

The invention relates to the technical field of digital archive management and information retrieval, and in particular to a semantically enhanced fusion retrieval method and system for multi-modal digital archive data.

Background

In the management of digital archives for administrative approval, a seal serves as the carrier of approval authority, and the approval efficacy it represents changes as policies and regulations change. At the same time, the official term names used in policy and regulation archives evolve with policy changes. Existing archive retrieval generally relies on keyword matching or semantic vector search, but these methods handle seal authority identification and policy term evolution separately: they establish neither associative reasoning between seal authority semantics and policy periods nor a mapping of the historical evolution of policy terms.

The prior art therefore has the following defect. When a user searches for "historical archives with independent approval efficacy" or searches using a current policy term, the system cannot determine the actual legal efficacy of the same seal combination in different historical periods, and the current policy term cannot match historical archives that are worded with revoked terms. The system thus cannot respond accurately to retrieval needs that depend on the historical authority state or that span policy periods, which leads to missed and erroneous retrieval results.

Disclosure of Invention

The invention provides a semantically enhanced fusion retrieval method and system for multi-modal digital archive data, which solve the technical problems in the related art of missed and erroneous retrieval caused by the historical evolution of seal authority and the cross-period change of policy terms.
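To make the cross-period term mismatch concrete, the following Python sketch illustrates the query-expansion idea recited in claim 1: starting from the policy term found in a query, traverse the term evolution map and substitute each historical equivalent into the query. Everything in it is an assumption for illustration: the example terms are invented, the map is treated as an undirected graph so that traversal reaches both older and newer terms, breadth-first search stands in for the unspecified traversal strategy, and the evolution type labels (name substitution, concept merging, concept splitting) are ignored.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical term evolution pairs: each edge links a term to its
# counterpart in an adjacent policy period (all names are invented).
EVOLUTION_EDGES = [
    ("business license review", "enterprise registration approval"),
    ("enterprise registration approval", "market entity registration"),
]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from term evolution pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def equivalent_terms(graph: Dict[str, Set[str]], start: str) -> List[str]:
    """Breadth-first traversal collecting every term connected to `start`."""
    seen = {start}
    queue = deque([start])
    found: List[str] = []
    while queue:
        term = queue.popleft()
        for neighbour in sorted(graph.get(term, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append(neighbour)
    return found


def expand_query(query: str, term: str, graph: Dict[str, Set[str]]) -> List[str]:
    """Return the original query plus one variant per historical equivalent."""
    return [query] + [query.replace(term, t) for t in equivalent_terms(graph, term)]


graph = build_graph(EVOLUTION_EDGES)
queries = expand_query(
    "archives on market entity registration", "market entity registration", graph
)
for q in queries:
    print(q)
```

In a full system each expanded query would then be semantically encoded and scored against the archive vectors; this sketch stops at producing the expanded query set.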
The invention provides a semantically enhanced fusion retrieval method for multi-modal digital archive data, comprising the following steps:
acquiring a scanned archive image and its formation-time metadata, extracting the release time, scope of application, and revocation time of each policy document associated with the archive, generating a policy period timeline based on the time attributes of the policy documents, and matching the archive formation time against the policy period timeline to determine the policy period to which the archive belongs;
extracting terms from the archives in each policy period, identifying policy proper nouns and authority-related terms, calculating semantic similarity between the term sets of the policy periods, identifying term evolution pairs having a semantic continuation relation, and generating a policy term evolution map with the terms of each policy period as nodes and the term evolution pairs as edges;
performing seal detection and seal recognition on the scanned archive image, extracting the position coordinates and seal features of each seal, and matching the seal features against a seal knowledge base to generate an identity label and a basic authority level for each seal;
analyzing the spatial relationship among multiple seals on the same archive to identify a seal combination mode, and matching each seal identity label against the authority rule set of the current policy period to obtain the authority level definition of each seal in that policy period;
deriving a composite authority efficacy label for the seal combination in the current policy period based on the seal combination mode and the current authority level definitions, and generating an authority efficacy evolution record;
encoding the authority efficacy evolution record into a temporal authority feature vector, semantically encoding the archive text content to generate a content semantic vector, and fusing the temporal authority feature vector with the content semantic vector to generate a temporal-authority-enhanced multi-modal representation vector;
receiving a user query statement, identifying the policy term and the authority state constraint in the query statement, performing graph traversal in the policy term evolution map starting from the identified policy term to retrieve the equivalent terms of that term in each historical policy period, replacing the policy term in the original query statement with each historical equivalent term to generate an expanded query set, semantically encoding the expanded query set to generate expanded query semantic vectors, and generating a temporal authority filtering condition according to the authority state constraint;
screening the archives using the temporal authority filtering condition to obtain a candidate archive set satisfying the authority state constraint, calculating similarity between the expanded query semantic vectors and the temporal-authority-enhanced multi-modal representation vectors of all arch