CN-121996627-A - Intelligent classification and clustering management method and system for science and technology archives
Abstract
The invention relates to the technical field of archive management and discloses an intelligent classification and clustering management method and system for science and technology archives. The method comprises: performing multi-modal parsing on the original data of a science and technology archive to obtain structured archive metadata, text content data and non-text content description data; performing semantic association analysis on the topic concepts of the structured archive metadata and the text content data to obtain co-occurrence frequency and grammatical dependency relationships, so as to construct a concept association network; performing dense semantic injection on the non-text content description data to obtain semantic enhancement information; performing similarity measurement on the semantic enhancement information and the topic concepts to obtain semantic similarity; calculating a similarity matrix and, taking each science and technology archive as an initial cluster, performing iterative aggregation to obtain a tree-shaped cluster structure; and performing lineage archiving on the tree-shaped cluster structure based on a preset granularity threshold to obtain an archive classification result. The method improves the efficiency of intelligent classification and clustering management of science and technology archives.
Inventors
- XING SHUANGYAN
- KONG LINGLIAN
- LI LUKE
- XU RUHAN
- XU CHUNHUI
Assignees
- Binzhou Academy of Agricultural Sciences (滨州市农业科学院)
Dates
- Publication Date
- 20260508
- Application Date
- 20260127
Claims (10)
- 1. An intelligent classification and clustering management method for science and technology archives, characterized by comprising the following steps: S1, performing multi-modal parsing on the original data of a science and technology archive to obtain structured archive metadata, text content data and non-text content description data of the science and technology archive; S2, performing semantic association analysis on the topic concepts of the structured archive metadata and the text content data to obtain the co-occurrence frequency and grammatical dependency relationships of the topic concepts, so as to construct a concept association network of the science and technology archive; S3, performing dense semantic injection on the non-text content description data based on the concept association network to obtain semantic enhancement information of the non-text content description data; S4, performing similarity measurement on the semantic enhancement information and the topic concepts to obtain the semantic similarity of the science and technology archive; S5, calculating a similarity matrix of the science and technology archives based on the semantic similarity and, based on the similarity matrix, taking each science and technology archive as an initial cluster and iteratively merging the most similar cluster pair to obtain a tree-shaped cluster structure of the science and technology archives; S6, performing lineage archiving on the tree-shaped cluster structure based on a preset granularity threshold to obtain an archive classification result of the science and technology archives.
- 2. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the performing multi-modal parsing on the original data of the science and technology archive to obtain the structured archive metadata, text content data and non-text content description data comprises: acquiring a text report, a design drawing file and a structured data table of the science and technology archive; performing optical character recognition on the text report to obtain the text content data of the science and technology archive; performing graphic element analysis on the design drawing file to obtain the non-text content description data of the science and technology archive; and structurally integrating the key attribute information of the structured data table, the text report and the design drawing file to obtain the structured archive metadata of the science and technology archive.
- 3. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the performing semantic association analysis on the topic concepts of the structured archive metadata and the text content data to obtain the co-occurrence frequency and grammatical dependency relationships of the topic concepts, so as to construct a concept association network of the science and technology archive, comprises: performing semantic concept mining on the structured archive metadata and the text content data to obtain their topic concepts; performing co-occurrence measurement on the topic concepts within the text content data to obtain the co-occurrence frequency of the topic concepts; performing dependency structure analysis on the topic concepts based on the sentence structure of the text content data to obtain the grammatical dependency relationships of the topic concepts; fusing the co-occurrence frequency and the grammatical dependency relationships into an association strength to obtain the semantic association strength of the topic concepts; and constructing the concept association network of the science and technology archive with the topic concepts as nodes and the semantic association strengths as connecting edges.
- 4. The intelligent classification and clustering management method for science and technology archives according to claim 3, wherein the fusing the co-occurrence frequency and the grammatical dependency relationships into an association strength to obtain the semantic association strength of the topic concepts comprises: performing logarithmic normalization on the co-occurrence frequency to obtain a statistical association strength value of the topic concepts; performing quantization mapping on the grammatical dependency relationships to obtain a grammatical association strength value of the topic concepts; and performing weighted fusion of the statistical association strength value and the grammatical association strength value based on a preset fusion weight coefficient to obtain the semantic association strength of the topic concepts.
- 5. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the performing dense semantic injection on the non-text content description data based on the concept association network to obtain semantic enhancement information of the non-text content description data comprises: extracting entity terms from the non-text content description data and identifying the corresponding non-text core term nodes in the concept association network; traversing, in the concept association network, the neighbor nodes of each non-text core term node, with that node as the query starting point, to obtain a semantic expansion context of the non-text core term node; and directionally fusing the semantic expansion context into the non-text content description data to obtain the semantic enhancement information of the non-text content description data.
- 6. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the performing similarity measurement on the semantic enhancement information and the topic concepts to obtain the semantic similarity of the science and technology archive comprises: projecting the semantic enhancement information and the archive concept features of the topic concepts into the same high-dimensional semantic space to obtain space coordinate vectors of the science and technology archive; performing cosine similarity evaluation on the space coordinate vectors to obtain the semantic association strength of the space coordinate vectors; and aggregating the semantic association strengths as the semantic similarity of the science and technology archive.
- 7. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the calculating a similarity matrix of the science and technology archives based on the semantic similarity and, based on the similarity matrix, taking each science and technology archive as an initial cluster and iteratively merging the most similar cluster pair to obtain a tree-shaped cluster structure comprises: calculating the similarity matrix of the science and technology archives based on the semantic similarity; taking each science and technology archive as an initial cluster; performing a distance measure transformation on the similarity matrix to obtain an inter-cluster distance matrix of the initial clusters; querying the inter-cluster distance matrix for the two clusters with the smallest current distance, and taking them as the most similar cluster pair; merging the most similar cluster pair into a parent cluster, and updating the inter-cluster distance matrix with the distances between the parent cluster and the remaining clusters; taking the most similar cluster pair as child nodes and the parent cluster as the parent node to construct an initial tree-shaped cluster structure of the science and technology archives; and iteratively expanding the initial tree-shaped cluster structure to obtain the tree-shaped cluster structure of the science and technology archives.
- 8. The intelligent classification and clustering management method for science and technology archives according to claim 7, wherein the calculating the similarity matrix of the science and technology archives based on the semantic similarity comprises: calculating the similarity matrix by a formula [formula and symbols not reproduced in this text], in which the following quantities appear: the element in a given row and column of the similarity matrix; the total number of science and technology archives; the semantic similarity between pairs of archives (three such pairwise terms are defined); a summation operation; and a square-root operation.
- 9. The intelligent classification and clustering management method for science and technology archives according to claim 1, wherein the performing lineage archiving on the tree-shaped cluster structure based on a preset granularity threshold to obtain the archive classification result of the science and technology archives comprises: performing hierarchical positioning and cutting on the tree-shaped cluster structure according to the preset granularity threshold to obtain a tree segmentation layer of the tree-shaped cluster structure; assigning a unified category identifier to the science and technology archives contained in the nodes located in the same tree segmentation layer; and normalizing the science and technology archives into target categories according to the unified category identifier to obtain the archive classification result of the science and technology archives.
- 10. An intelligent classification and clustering management system for science and technology archives, characterized in that the system is used for implementing the intelligent classification and clustering management method for science and technology archives according to claim 1, and comprises: a multi-modal parsing module, used for performing multi-modal parsing on the original data of a science and technology archive to obtain structured archive metadata, text content data and non-text content description data of the science and technology archive; a concept network construction module, used for performing semantic association analysis on the topic concepts of the structured archive metadata and the text content data to obtain the co-occurrence frequency and grammatical dependency relationships of the topic concepts, so as to construct a concept association network of the science and technology archive; a semantic enhancement module, used for performing dense semantic injection on the non-text content description data based on the concept association network to obtain semantic enhancement information of the non-text content description data; a similarity calculation module, used for performing similarity measurement on the semantic enhancement information and the topic concepts to obtain the semantic similarity of the science and technology archive; a hierarchical clustering module, used for calculating a similarity matrix of the science and technology archives based on the semantic similarity and, based on the similarity matrix, taking each science and technology archive as an initial cluster and iteratively merging the most similar cluster pair to obtain a tree-shaped cluster structure of the science and technology archives; and a lineage archiving module, used for performing lineage archiving on the tree-shaped cluster structure based on a preset granularity threshold to obtain an archive classification result of the science and technology archives.
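The association strength fusion of claims 3 and 4 and the neighbor traversal of claim 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the DEPENDENCY_SCORES mapping, the fusion weight alpha = 0.6, the min_strength cutoff and the sample concepts are all hypothetical, and the logarithmic normalization is assumed here to be log(1 + count) divided by log(1 + maximum count), since the claims do not fix a formula.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical quantization of dependency relations to [0, 1];
# the claims do not specify this mapping.
DEPENDENCY_SCORES = {"nsubj": 1.0, "dobj": 0.8, "nmod": 0.5}


def cooccurrence_counts(sentences):
    """Count how often each unordered concept pair shares a sentence."""
    counts = Counter()
    for concepts in sentences:
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def fuse_association_strength(counts, dependencies, alpha=0.6):
    """Weighted fusion of a log-normalized co-occurrence value and a
    quantized dependency value (alpha is an assumed fusion weight)."""
    max_count = max(counts.values(), default=1)
    edges = {}
    for pair, count in counts.items():
        statistical = math.log1p(count) / math.log1p(max_count)
        grammatical = DEPENDENCY_SCORES.get(dependencies.get(pair), 0.0)
        edges[pair] = alpha * statistical + (1 - alpha) * grammatical
    return edges


def expand_context(edges, term, min_strength=0.3):
    """One-hop neighbor traversal from a core term, strongest edges first."""
    neighbors = []
    for (a, b), weight in edges.items():
        if term in (a, b) and weight >= min_strength:
            neighbors.append((b if a == term else a, weight))
    return [name for name, _ in sorted(neighbors, key=lambda item: -item[1])]


# Invented sample data: topic concepts found in three sentences.
sentences = [["soil", "wheat", "yield"], ["wheat", "yield"], ["soil", "irrigation"]]
dependencies = {("wheat", "yield"): "nsubj"}  # assumed parser output

counts = cooccurrence_counts(sentences)
edges = fuse_association_strength(counts, dependencies)
context = expand_context(edges, "wheat")
```

The edges dictionary plays the role of the concept association network (topic concepts as nodes, fused strengths as edge weights), and the list returned by expand_context stands in for the semantic expansion context that would be fused into a drawing's description.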
Description
Intelligent classification and clustering management method and system for science and technology archives
Technical Field
The invention relates to the technical field of archive management, and in particular to an intelligent classification and clustering management method and system for science and technology archives.
Background
In the prior art, the multi-modal original data of science and technology archives is difficult to parse systematically: structured archive metadata, text content data and non-text content description data cannot all be extracted accurately at the same time, and the various kinds of data are not effectively integrated, so they remain fragmented. Meanwhile, the topic concepts of the structured archive metadata and the text content data are not fully mined, their co-occurrence frequency and grammatical dependency relationships cannot be fully captured, and a comprehensive concept association network cannot be constructed, so the semantic associations within the archive data are difficult to represent effectively. The prior art also has obvious shortcomings in the semantic processing of non-text content description data: its semantic information cannot be supplemented through effective semantic injection, so the similarity measurement between non-text data and topic concepts lacks accuracy. In addition, the clustering process lacks a sound similarity matrix calculation method and iterative aggregation mechanism, the tree-shaped cluster structure is constructed unreasonably, and the granularity control of lineage archiving lacks precision. As a result, the classification results of science and technology archives are disordered and management efficiency is low. How to achieve deep parsing of multi-modal archives, accurate mining of semantic associations, and efficient classification, clustering and archiving has therefore become an urgent problem.
Disclosure of Invention
The invention provides an intelligent classification and clustering management method and system for science and technology archives, which are used for solving the problems described in the background. To achieve the above purpose, the invention provides an intelligent classification and clustering management method for science and technology archives, which comprises the following steps:
S1, performing multi-modal parsing on the original data of a science and technology archive to obtain structured archive metadata, text content data and non-text content description data of the science and technology archive;
S2, performing semantic association analysis on the topic concepts of the structured archive metadata and the text content data to obtain the co-occurrence frequency and grammatical dependency relationships of the topic concepts, so as to construct a concept association network of the science and technology archive;
S3, performing dense semantic injection on the non-text content description data based on the concept association network to obtain semantic enhancement information of the non-text content description data;
S4, performing similarity measurement on the semantic enhancement information and the topic concepts to obtain the semantic similarity of the science and technology archive;
S5, calculating a similarity matrix of the science and technology archives based on the semantic similarity and, based on the similarity matrix, taking each science and technology archive as an initial cluster and iteratively merging the most similar cluster pair to obtain a tree-shaped cluster structure of the science and technology archives;
S6, performing lineage archiving on the tree-shaped cluster structure based on a preset granularity threshold to obtain an archive classification result of the science and technology archives.
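Steps S5 and S6 can be illustrated with a small agglomerative clustering sketch. It rests on assumptions the text does not state: the distance transformation is taken to be 1 minus similarity, cluster distances use average linkage, and the function names, the sample similarity matrix and the threshold values are hypothetical. It is a sketch of the general technique, not the method as claimed.

```python
def build_dendrogram(similarity):
    """Repeatedly merge the closest cluster pair, starting from one cluster
    per archive. Returns the merges as (left_members, right_members, distance)."""
    n = len(similarity)
    clusters = {i: (i,) for i in range(n)}  # cluster id -> archive indices

    def linkage(a, b):
        # Average-linkage distance with distance = 1 - similarity (assumed).
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(1.0 - similarity[i][j] for i, j in pairs) / len(pairs)

    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(
            ((x, y) for k, x in enumerate(ids) for y in ids[k + 1:]),
            key=lambda pair: linkage(*pair),
        )
        merges.append((clusters[a], clusters[b], linkage(a, b)))
        # The merged pair becomes a parent cluster with a fresh id.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


def cut_by_threshold(n, merges, threshold):
    """Keep only merges at or below the granularity threshold and return one
    category label per archive. Relies on average linkage producing
    non-decreasing merge distances."""
    label = list(range(n))
    for left, right, distance in merges:
        if distance > threshold:
            break
        target = label[left[0]]
        for i in right:
            label[i] = target
    remap = {}
    return [remap.setdefault(l, len(remap)) for l in label]


# Invented similarity matrix for four archives.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
merges = build_dendrogram(similarity)
categories = cut_by_threshold(len(similarity), merges, threshold=0.5)
```

A lower threshold yields finer categories and a higher one yields coarser categories, which is the role the preset granularity threshold plays in S6.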
In a preferred embodiment, the multi-modal parsing of the original data of the science and technology archive to obtain the structured archive metadata, the text content data and the non-text content description data of the science and technology archive includes: acquiring a text report, a design drawing file and a structured data table of the science and technology archive; performing optical character recognition on the text report to obtain the text content data of the science and technology archive; performing graphic element analysis on the design drawing file to obtain the non-text content description data of the science and technology archive; and structurally integrating the key attribute information of the structured data table, the text report and the design drawing file to obtain the structured archive metadata of the science and technology archive.
In a preferred embodiment, the semantic association analysis is performed on the topic concepts of the structured archive metadata and the text content data to obtain co-occurrence frequency and grammatical dependency relationship of the topic concepts, so as to construct a concept association network of the science and