CN-121980030-A - Document intelligent classification and retrieval system based on deep learning

Abstract

The invention relates to the technical field of intelligent document classification and retrieval and discloses an intelligent document classification and retrieval system based on deep learning. The system extracts the core concepts of a document and their semantic vectors with a deep learning model and constructs a document structure tree that reflects the concept association network. It then semantically enhances the nodes of the structure tree and analyzes how the enhanced nodes change over consecutive time segments to obtain a dynamic evolution sequence. Based on this sequence, the system computes the spatiotemporal distribution of the core concepts and extracts state transition paths. Finally, it generates a tag multidimensional probability vector for classification from the state transition paths and uses that vector to drive an adaptive retrieval operation. By modeling the dynamic evolution of the document's internal logical structure, the scheme captures the deep semantics and the developmental thread of the document content, and thereby improves the accuracy and relevance of classification and retrieval in complex scenarios.

Inventors

XU LIHUA
ZHANG CHUNYAN

Assignees

Baotou Iron and Steel Vocational Technical College (包头钢铁职业技术学院)

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. An intelligent document classification and retrieval system based on deep learning, characterized by comprising: a document stream analysis module, which receives and parses an input original document stream and separates text content data and document metadata from the original document stream; a document tree structure construction module, which performs concept extraction on the text content data based on a deep learning model to obtain a concept set consisting of a plurality of core concepts and their semantic vectors, performs relevance evaluation on the core concepts in the concept set to generate a concept association network, and constructs, based on the concept association network, a document structure tree reflecting the internal logical structure of the document; a structure evolution analysis module, which performs semantic enhancement on the tree nodes of the document structure tree to obtain enhanced semantic nodes, and analyzes the changes of the enhanced semantic nodes over consecutive time segments to obtain a dynamic evolution sequence of the tree structure; a state transition path extraction module, which calculates, according to the dynamic evolution sequence, the state distribution of each core concept over the concept space and the time dimension to obtain a concept spatiotemporal distribution map, and extracts state transition paths between core concepts from the concept spatiotemporal distribution map; and an adaptive retrieval execution module, which generates, based on the state transition paths, a tag multidimensional probability vector for classifying the document and performs an adaptive retrieval operation for a target document according to the tag multidimensional probability vector.
2. The intelligent document classification and retrieval system based on deep learning according to claim 1, wherein performing concept extraction on the text content data based on the deep learning model to obtain a concept set consisting of a plurality of core concepts and their semantic vectors comprises: segmenting the text content data into a plurality of initial semantic segments; encoding each initial semantic segment with the deep learning model to obtain a segment semantic vector; performing cluster analysis on all the segment semantic vectors to form a plurality of concept clusters; generating a core concept expression for each concept cluster, and taking the center vector of the concept cluster as the semantic vector of the core concept; and collecting all core concepts and their semantic vectors to form the concept set.
3. The intelligent document classification and retrieval system based on deep learning according to claim 2, wherein performing relevance evaluation on the core concepts in the concept set to generate the concept association network comprises: calculating the similarity between the semantic vectors of any two core concepts in the concept set, and taking the similarity as an initial association strength; correcting the initial association strength based on co-occurrence information recorded in the document metadata to obtain a final association strength; constructing an initial relation graph with the core concepts as nodes and the final association strengths as edge weights; and performing a semantic path search on the initial relation graph to find indirect association relations between nodes, and adding the indirect association relations to the initial relation graph to form the concept association network.
4. The intelligent document classification and retrieval system based on deep learning according to claim 3, wherein constructing the document structure tree reflecting the internal logical structure of the document based on the concept association network comprises: selecting, from the concept association network, the connections whose association strength meets a preset threshold to form a backbone association network; in the backbone association network, determining a core concept node with zero degree or minimum degree as the root node of the document structure tree; starting from the root node, expanding child nodes layer by layer in the backbone association network in descending order of association strength to form the branches of the tree; and attaching each remaining core concept node of the concept association network that is not included in the backbone association network as a leaf node of the backbone node closest to it, thereby completing the construction of the document structure tree.
5. The intelligent document classification and retrieval system based on deep learning according to claim 4, wherein performing semantic enhancement on the tree nodes of the document structure tree to obtain enhanced semantic nodes comprises: matching the core concept corresponding to each tree node of the document structure tree against a preset external knowledge base to obtain related expanded concepts and description texts; encoding the expanded concepts and the description texts with a deep learning model to generate expanded semantic vectors; fusing the original semantic vector of the tree node with all related expanded semantic vectors to generate an enhanced semantic vector of the tree node; updating the semantic weight of the tree node according to the source authority of the expanded concepts; and forming the enhanced semantic node from the enhanced semantic vector together with the updated semantic weight; wherein updating the semantic weight of the tree node according to the source authority of the expanded concepts comprises: predefining an authority level mapping table in which the different sources of expanded concepts are divided into a plurality of authority levels according to credibility and recognition, and a base authority score is assigned to each level; looking up the source of each expanded concept acquired by the tree node, and obtaining its base authority score from the authority level mapping table; calculating a semantic relevance score between each expanded concept and the original core concept of the tree node; performing weighted fusion of the base authority score and the semantic relevance score to obtain a comprehensive contribution factor of the expanded concept to the current tree node; normalizing the comprehensive contribution factors of all expanded concepts related to the current tree node to obtain a normalized weight vector; and performing a weighted summation of the original semantic weight of the tree node and the corresponding weight values of the normalized weight vector to obtain the updated semantic weight.
6. The intelligent document classification and retrieval system based on deep learning according to claim 5, wherein analyzing the changes of the enhanced semantic nodes over consecutive time segments to obtain the dynamic evolution sequence of the tree structure comprises: sampling the original document stream at a plurality of consecutive timestamps to obtain a plurality of document snapshots; repeating the steps of concept extraction, document structure tree construction, and semantic enhancement for each document snapshot to obtain an instantaneous document structure tree corresponding to that timestamp; arranging all the instantaneous document structure trees in chronological order, and recording the appearance, disappearance, or attribute change of each tree node and of its connection relations across the different instantaneous document structure trees; and generating, based on this record, a dynamic evolution sequence describing how the document structure tree changes over time.
7. The intelligent document classification and retrieval system based on deep learning according to claim 6, wherein calculating, according to the dynamic evolution sequence, the state distribution of each core concept over the concept space and the time dimension to obtain the concept spatiotemporal distribution map comprises: establishing a three-dimensional coordinate system in which two dimensions represent the concept space and one dimension represents time; mapping the node position of each core concept in the instantaneous document structure trees at the different timestamps of the dynamic evolution sequence to concept space coordinates; mapping the timestamp corresponding to the core concept to a time coordinate; in the three-dimensional coordinate system, plotting a state point for each timestamp at which the core concept occurs, wherein the density attribute of the state point is determined by the semantic weight of the enhanced semantic node at that timestamp; and connecting the state points of the same core concept at consecutive timestamps to form a state trajectory of the core concept; wherein the state points and state trajectories of all core concepts together constitute the concept spatiotemporal distribution map.
8. The intelligent document classification and retrieval system based on deep learning according to claim 7, wherein extracting state transition paths between core concepts from the concept spatiotemporal distribution map comprises: identifying, in the concept spatiotemporal distribution map, events in which the state trajectories of any two different core concepts approach or cross each other in space-time; counting, over the full time range of the dynamic evolution sequence, the frequency of such approach or crossing events between every two core concepts; selecting the core concept pairs whose event frequency exceeds a preset threshold as valid state transition pairs; and, for each valid state transition pair, extracting the complete sequence of spatial coordinate changes of the state trajectory from the starting core concept to the target core concept to form a state transition path.
9. The intelligent document classification and retrieval system based on deep learning according to claim 8, wherein generating the tag multidimensional probability vector for classifying the document based on the state transition paths comprises: presetting a tag set covering all possible classes; for each leaf node in the document structure tree, analyzing all state transition paths that start or end at the core concept corresponding to the leaf node; counting the occurrence frequency of the target class tags associated with those state transition paths; generating a probability vector for the leaf node based on the occurrence frequencies, wherein each dimension of the vector corresponds to one class in the tag set and its value represents the occurrence probability of that class tag; and aggregating the probability vectors of all leaf nodes in the document structure tree and normalizing the result to obtain the tag multidimensional probability vector representing the whole document.
10. The intelligent document classification and retrieval system based on deep learning according to claim 8, wherein performing the adaptive retrieval operation for the target document according to the tag multidimensional probability vector comprises: receiving a search request from a user, wherein the search request comprises keywords, a natural language question, or an example document; parsing and vectorizing the search request to generate a query semantic vector; obtaining, from a library of processed documents, the tag multidimensional probability vector of each candidate document, and extracting from it the sub-vector corresponding to the dimensions related to the search intent; calculating a matching degree score between the query semantic vector and the tag multidimensional probability sub-vector of each candidate document; introducing a time decay factor to boost the matching degree scores of documents that were recently generated or whose state recently evolved; and sorting all candidate documents by the weighted final matching degree score and returning the sorted list as the retrieval result.
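
The retrieval steps of claim 10 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the query has already been projected onto the tag dimensions as per-tag weights, uses a plain dot product as the matching degree score, and models the time decay factor as an exponential half-life boost. The names `Candidate`, `rank_candidates`, `half_life_days`, and `boost` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    doc_id: str
    tag_probs: Dict[str, float]  # tag multidimensional probability vector
    age_days: float              # days since creation or last state evolution


def rank_candidates(
    query_weights: Dict[str, float],
    candidates: List[Candidate],
    half_life_days: float = 30.0,  # assumed decay half-life (not from the source)
    boost: float = 0.5,            # assumed maximum recency boost (not from the source)
) -> List[Tuple[float, str]]:
    """Return (final_score, doc_id) pairs, best match first."""
    # Dimensions related to the search intent: tags the query puts weight on.
    dims = sorted(d for d, w in query_weights.items() if w > 0)
    ranked = []
    for cand in candidates:
        # Sub-vector of the candidate's tag probabilities on those dimensions.
        sub = [cand.tag_probs.get(d, 0.0) for d in dims]
        # Matching degree: dot product of query weights and tag probabilities.
        match = sum(query_weights[d] * p for d, p in zip(dims, sub))
        # Time decay factor: 1.0 for a new document, halving every half-life.
        decay = 0.5 ** (cand.age_days / half_life_days)
        ranked.append((match * (1.0 + boost * decay), cand.doc_id))
    # Highest score first; ties are broken by document id for determinism.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked
```

With this weighting, two documents with identical tag vectors are separated only by recency, while a document that matches the query tags poorly stays low even if it is new.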

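The semantic weight update of claim 5 can be illustrated with a small worked sketch. The claim does not fix the fusion coefficients or the exact form of the final weighted summation, so everything numeric here is an assumption: the authority table entries, the fusion coefficient `alpha`, the mixing coefficient `beta`, and the reading of the last step as a blend of the original weight with the normalized-weight average of the contribution factors. The names `AUTHORITY_SCORES` and `update_semantic_weight` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical authority level mapping table: source type -> base authority score.
AUTHORITY_SCORES: Dict[str, float] = {
    "standard": 1.0,
    "textbook": 0.8,
    "encyclopedia": 0.6,
    "web": 0.3,
}


def update_semantic_weight(
    original_weight: float,
    expansions: List[Tuple[str, float]],  # (source, semantic relevance score in [0, 1])
    alpha: float = 0.5,  # assumed share of authority in the fusion
    beta: float = 0.7,   # assumed share of the original weight in the update
) -> Tuple[float, List[float]]:
    """Return (updated_weight, normalized_weight_vector) for one tree node."""
    # Comprehensive contribution factor: weighted fusion of authority and relevance.
    # Unknown sources fall back to a base authority score of 0.0.
    factors = [
        alpha * AUTHORITY_SCORES.get(source, 0.0) + (1.0 - alpha) * relevance
        for source, relevance in expansions
    ]
    total = sum(factors)
    if total == 0:
        # No usable expanded concepts: keep the original weight unchanged.
        return original_weight, []
    # Normalized weight vector over the node's expanded concepts (sums to 1).
    normalized = [f / total for f in factors]
    # Assumed final step: blend the original weight with the weighted
    # average contribution of the expanded concepts.
    support = sum(n * f for n, f in zip(normalized, factors))
    return beta * original_weight + (1.0 - beta) * support, normalized
```

For a node with original weight 0.5 and two expanded concepts, one from a "standard" source with relevance 0.8 and one from a "web" source with relevance 0.4, the contribution factors are 0.9 and 0.35, the normalized weights are 0.72 and 0.28, and the updated weight rises to about 0.574 under these assumed coefficients.
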
Description

Document intelligent classification and retrieval system based on deep learning

Technical Field

The invention relates to the technical field of intelligent document classification and retrieval, and in particular to an intelligent document classification and retrieval system based on deep learning.

Background

Document classification and retrieval systems based on deep learning are now widely used. The prior art generally treats a document as a static whole: a neural network model directly extracts global semantic vectors or keyword features of the document, and similarity calculation and classification are performed on those features. Such methods handle document content as a flattened feature representation, focus on capturing the final semantic state of the document, and use that state as the sole basis for classification and retrieval.

These conventional solutions have drawbacks. They ignore the logical associations and organizational structure among the concepts inside a document, and cannot characterize the hierarchical, networked internal structure formed by the document content. In addition, the prior art processes documents in an instantaneous and isolated manner and lacks the ability to analyze how the document content or its core concepts evolve over time. As a result, such systems struggle to understand the deep logical context of a document and cannot cope with complex retrieval and classification tasks that require insight into the development trajectories and trends of the content.

Therefore, how to construct a dynamic model that expresses the internal logical structure of a document, and how to use information about the evolution of the content over time to improve the accuracy and interpretability of classification and retrieval, has become a problem to be solved.

Disclosure of Invention

The invention aims to provide an intelligent document classification and retrieval system based on deep learning so as to solve the problems identified in the background section.

To achieve the above object, the invention provides an intelligent document classification and retrieval system based on deep learning, the system comprising: a document stream analysis module, which receives and parses an input original document stream and separates text content data and document metadata from the original document stream; a document tree structure construction module, which performs concept extraction on the text content data based on a deep learning model to obtain a concept set consisting of a plurality of core concepts and their semantic vectors, performs relevance evaluation on the core concepts in the concept set to generate a concept association network, and constructs, based on the concept association network, a document structure tree reflecting the internal logical structure of the document; a structure evolution analysis module, which performs semantic enhancement on the tree nodes of the document structure tree to obtain enhanced semantic nodes, and analyzes the changes of the enhanced semantic nodes over consecutive time segments to obtain a dynamic evolution sequence of the tree structure; a state transition path extraction module, which calculates, according to the dynamic evolution sequence, the state distribution of each core concept over the concept space and the time dimension to obtain a concept spatiotemporal distribution map, and extracts state transition paths between core concepts from the concept spatiotemporal distribution map; and an adaptive retrieval execution module, which generates, based on the state transition paths, a tag multidimensional probability vector for classifying the document and performs an adaptive retrieval operation for a target document according to the tag multidimensional probability vector.
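
The network-to-tree step of the document tree structure construction module can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, the co-occurrence correction and semantic path search are omitted, the root is taken to be the backbone node of minimum degree with ties broken alphabetically, and the threshold value is arbitrary. The function names `cosine` and `build_structure_tree` are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_structure_tree(
    concepts: Dict[str, List[float]], threshold: float = 0.5
) -> Tuple[str, Dict[str, List[str]]]:
    """Return (root, children) where children maps each concept to its child list."""
    names = sorted(concepts)
    if not names:
        raise ValueError("at least one core concept is required")
    # Association strength between every pair of core concepts.
    strength: Dict[Tuple[str, str], float] = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            strength[(u, v)] = strength[(v, u)] = cosine(concepts[u], concepts[v])
    # Backbone association network: keep only edges that meet the threshold.
    backbone = {
        n: [v for v in names if v != n and strength[(n, v)] >= threshold]
        for n in names
    }
    in_backbone = [n for n in names if backbone[n]]
    # Assumed root rule: backbone node with the smallest degree, then by name.
    root = min(in_backbone, key=lambda n: (len(backbone[n]), n)) if in_backbone else names[0]
    children: Dict[str, List[str]] = {n: [] for n in names}
    placed = [root]
    queue = deque([root])
    while queue:
        # Layer-by-layer expansion, strongest associations first.
        u = queue.popleft()
        for v in sorted(backbone[u], key=lambda v: (-strength[(u, v)], v)):
            if v not in placed:
                placed.append(v)
                children[u].append(v)
                queue.append(v)
    # Attach every remaining concept as a leaf of its closest tree node.
    attach_points = list(placed)
    for n in names:
        if n not in placed:
            parent = max(attach_points, key=lambda t: (strength[(n, t)], t))
            children[parent].append(n)
            placed.append(n)
    return root, children
```

The breadth-first expansion mirrors the layer-by-layer growth described for the structure tree, and concepts that share no sufficiently strong edge still end up in the tree as leaves rather than being dropped.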

Preferably, performing concept extraction on the text content data based on the deep learning model to obtain a concept set consisting of a plurality of core concepts and their semantic vectors comprises: segmenting the text content data into a plurality of initial semantic segments; encoding each initial semantic segment with the deep learning model to obtain a segment semantic vector; performing cluster analysis on all the segment semantic vectors to form a plurality of concept clusters; generating a core concept expression for each concept cluster, and taking the center vector of the concept cluster as the semantic vector of the core concept; and collecting all core concepts and their semantic vectors to form the concept set.

Preferably, performing relevance evaluation on the core concepts in the concept set to generate the concept association network comprises: calculating the similarity between the semantic vectors of any two core concepts in the concept set, and taking the similarity as an initial association strength; correcting the initial association strength based on the co-occurrence information recorded in the document m