CN-122019730-A - Mixed vector-based normative file intelligent question-answering system
Abstract
The invention belongs to the technical field of data processing and discloses a mixed vector-based normative file intelligent question-answering system comprising a structure analysis module and a knowledge base construction module. The structure analysis module splits a high school rule normative file into a plurality of clause atomic fragments by extracting natural hierarchy information, generates a corresponding hierarchy path identifier and an original position index for each clause atomic fragment, and associates and encapsulates them into a clause atomic fragment set. The knowledge base construction module performs multidimensional semantic modeling on the clause atomic fragment set to generate a dense vector representation and a sparse vector representation, and constructs a mixed vector index based on the two representations. The system thereby achieves fine-grained structural modeling of the high school rule normative file and improves clause retrieval accuracy.
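As an illustration of the splitting step summarized above, the following is a minimal sketch and not the patented implementation. The marker patterns (`Chapter N`, `Article N`, `(N)`), the `ClauseFragment` type, the slash-separated path format, and the use of the line number as the original position index are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical hierarchy markers, ordered from outermost to innermost level.
LEVEL_PATTERNS = [
    ("chapter", re.compile(r"^Chapter\s+(\d+)")),
    ("article", re.compile(r"^Article\s+(\d+)")),
    ("item", re.compile(r"^\((\d+)\)")),
]


@dataclass
class ClauseFragment:
    path_id: str   # hierarchy path identifier, e.g. "chapter-1/article-2/item-1"
    position: int  # original position index (0-based line number, an assumption)
    text: str      # clause text


def split_into_fragments(document: str) -> List[ClauseFragment]:
    fragments: List[ClauseFragment] = []
    path: List[Tuple[int, str]] = []  # current stack of (depth, label)
    for line_no, raw in enumerate(document.splitlines()):
        line = raw.strip()
        if not line:
            continue
        for depth, (name, pattern) in enumerate(LEVEL_PATTERNS):
            match = pattern.match(line)
            if match:
                # A marker at this depth closes every level at the same or a deeper depth.
                path = [entry for entry in path if entry[0] < depth]
                path.append((depth, f"{name}-{match.group(1)}"))
                path_id = "/".join(label for _, label in path)
                fragments.append(ClauseFragment(path_id, line_no, line))
                break
        else:
            # Unmarked line: treat it as a continuation of the latest fragment.
            if fragments:
                fragments[-1].text += " " + line
    return fragments
```

In this sketch every marked line, including a chapter heading, opens its own fragment; a real system might instead keep only the innermost levels as atomic fragments.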
Inventors
- WANG PING
- TANG WEI
- XU GUOMING
Assignees
- 南京苏迪科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260408
Claims (10)
- 1. A mixed vector-based normative file intelligent question-answering system, characterized by comprising: a structure analysis module, configured to split a high school rule normative file into a plurality of clause atomic fragments by extracting natural hierarchy information, generate a corresponding hierarchy path identifier and an original position index for each clause atomic fragment, and construct a clause atomic fragment set by association encapsulation; a knowledge base construction module, configured to perform multidimensional semantic modeling on the clause atomic fragment set to generate a dense vector representation and a sparse vector representation respectively; a dual-stream hybrid retrieval module, configured to receive and parse a user query instruction, extract a query intention and key terms to generate a query feature vector, and perform parallel retrieval on a mixed vector index based on the query feature vector to obtain an initial candidate clause set; a conflict resolution module, configured to detect clauses in the initial candidate clause set that have an old-versus-new conflict or a mutual exclusion relation, automatically remove clauses that are invalid or have been replaced, and float the currently effective clauses to the highest position to form a conflict-resolved clause sequence; and a structured generation module, configured to select a target clause from the conflict-resolved clause sequence and generate a structured intelligent question-answering result by combining the natural hierarchy information of the target clause.
- 2. The hybrid vector-based normative file intelligent question-answering system according to claim 1, wherein the method of extracting natural hierarchy information comprises: acquiring the text of a high school rule normative file to be processed, and performing a standardized preprocessing operation on its content, including removing non-text content, unifying the numbering format, and preserving the original text order, thereby forming parsable standard text data; and traversing the identified natural hierarchy marks, determining the start and end positions of the different levels by combining the nesting relations and the order of appearance of the hierarchy marks, and establishing a tree-shaped hierarchy path representing the membership relations among the levels, thereby completing the acquisition of the natural hierarchy information.
- 3. The hybrid vector-based normative file intelligent question-answering system according to claim 2, wherein the method for obtaining the clause atomic fragment set comprises: traversing the standard text data based on the tree-shaped hierarchy path, and splitting the clause content of the high school rule normative file into a plurality of clause atomic fragments using the natural hierarchy marks as segmentation units, wherein each clause atomic fragment contains the corresponding text content of the high school rule normative file; and, for each clause atomic fragment, extracting the corresponding complete hierarchy path from the tree-shaped hierarchy path, encoding the hierarchy path in a preset format to generate the corresponding hierarchy path identifier, recording the position of the clause atomic fragment in the high school rule normative file to form the original position index, and associating each clause atomic fragment one-to-one with its hierarchy path identifier and original position index, thereby encapsulating them into the clause atomic fragment set.
- 4. The hybrid vector-based normative file intelligent question-answering system according to claim 3, wherein the method of generating the dense vector representation and the sparse vector representation comprises: for any clause atomic fragment in the clause atomic fragment set, performing text encoding on the corresponding text content of the high school rule normative file to generate a text semantic vector, and constructing a hierarchy path embedding vector based on the hierarchy path identifier of the clause atomic fragment; and, based on a preset normative term set, performing term identification and weighting on the text content corresponding to the clause atomic fragment, detecting the normative terms that appear in the clause atomic fragment and their frequencies of occurrence, and assigning corresponding weight values to those normative terms, thereby constructing the sparse vector representation used for keyword retrieval.
- 5. The hybrid vector-based normative file intelligent question-answering system according to claim 4, wherein the method of constructing the mixed vector index comprises: for each clause atomic fragment, storing the corresponding dense vector representation and sparse vector representation in association with each other, constructing a separate vector index structure for the dense vector representation and for the sparse vector representation, and associating the two vector index structures with the same clause atomic fragment through an index mapping relation, thereby forming a mixed vector index that supports parallel execution of semantic retrieval and keyword retrieval.
- 6. The hybrid vector-based normative file intelligent question-answering system according to claim 5, wherein the method of obtaining the query feature vector comprises: receiving a user query instruction, parsing it through word segmentation, stop-word removal, part-of-speech tagging, and lemmatization, and using natural language processing techniques to identify the core elements of the query, the core elements comprising target terms, related legal concepts, and the requirements of the query, so as to extract the query intention and the key terms of the query; and mapping each key term of the query to a corresponding word vector by a TF-IDF algorithm, and combining the word vectors of the query by weighted averaging or vector concatenation to generate the query feature vector used for retrieval.
- 7. The hybrid vector-based normative file intelligent question-answering system according to claim 6, wherein the method of obtaining the initial candidate clause set comprises: performing, based on the query feature vector, parallel retrieval on the mixed vector index, with semantic retrieval and keyword retrieval executed simultaneously, generating a supporting clause set in each query process, and calculating a stability support degree for each clause from the supporting clause sets generated in the query processes; wherein the stability support degree is based on the frequency with which the clause appears in the supporting clause sets, a stability support degree threshold is preset, and a clause whose stability support degree is greater than or equal to the preset threshold is regarded as a stable clause and included in the final retrieval result, thereby obtaining the initial candidate clause set after stability support degree screening.
- 8. The hybrid vector-based normative file intelligent question-answering system according to claim 7, wherein the method for detecting clauses in the initial candidate clause set that have an old-versus-new conflict or a mutual exclusion relation comprises: performing conflict detection on all candidate clauses in the initial candidate clause set based on the aging information and evolution information of each candidate clause, and identifying old-versus-new conflict relations, replacement relations, or mutual exclusion relations among the candidate clauses by determining whether each candidate clause has become invalid, has been replaced, or is mutually exclusive with another candidate clause.
- 9. The hybrid vector-based normative file intelligent question-answering system according to claim 8, wherein the method of forming the conflict-resolved clause sequence comprises: removing from the candidate clause set the invalid candidate clauses in old-versus-new conflict relations and the replaced candidate clauses in replacement relations; floating, according to the aging information of the candidate clauses, the candidate clauses that took effect most recently and remain in force to the highest position and giving them higher sorting priority; and removing the remaining candidate clauses that are in mutual exclusion relations, thereby forming the conflict-resolved clause sequence.
- 10. The hybrid vector-based normative file intelligent question-answering system according to claim 9, wherein the method of generating the structured intelligent question-answering result comprises: selecting, from the conflict-resolved clause sequence, the clause most relevant to the query as the target clause according to the degree of matching between the user's query intention and the clause content, wherein the degree of matching is calculated from the semantic similarity between the query feature vector and the clause content; and analyzing the hierarchical structure of the target clause within the file using the natural hierarchy information of the clause, obtaining the hierarchy path of the target clause and the corresponding context information to determine the position and effect of the clause in the file, and generating the structured intelligent question-answering result by combining the hierarchy path, number, and content of the target clause.
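The conflict-resolution steps recited in claims 8 and 9 can be pictured with the following minimal sketch. It is an illustration under stated assumptions rather than the claimed method: the `CandidateClause` fields (`expired`, `replaced_by`, `excludes`) are hypothetical encodings of the aging and evolution information, and keeping only the highest-ranked clause of a mutually exclusive group is one possible reading of the claim language.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional, Set


@dataclass
class CandidateClause:
    clause_id: str
    effective_from: date                     # aging information: when the clause took effect
    expired: bool = False                    # aging information: clause is no longer in force
    replaced_by: Optional[str] = None        # evolution information: id of the superseding clause
    excludes: FrozenSet[str] = frozenset()   # ids of clauses that are mutually exclusive with this one


def resolve_conflicts(candidates: List[CandidateClause]) -> List[CandidateClause]:
    # Step 1: remove clauses that are invalid or have been replaced.
    valid = [c for c in candidates if not c.expired and c.replaced_by is None]
    # Step 2: float the most recently effective clauses to the top.
    valid.sort(key=lambda c: c.effective_from, reverse=True)
    # Step 3: within a mutually exclusive group, keep only the highest-ranked clause
    # (an assumed interpretation; the exclusion check is applied in both directions).
    kept: List[CandidateClause] = []
    kept_ids: Set[str] = set()
    blocked: Set[str] = set()
    for clause in valid:
        if clause.clause_id in blocked or clause.excludes & kept_ids:
            continue
        kept.append(clause)
        kept_ids.add(clause.clause_id)
        blocked.update(clause.excludes)
    return kept
```

Sorting before the exclusion pass means that, when two in-force clauses exclude each other, the one with the later effective date survives.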
Description
Mixed vector-based normative file intelligent question-answering system
Technical Field
The invention relates to the technical field of data processing, and in particular to a mixed vector-based normative file intelligent question-answering system.
Background
Existing intelligent question-answering systems for high school rule normative files mainly have the following problems:
Full-text or paragraph-level retrieval is used, the natural hierarchical structure of the high school rule normative file is not parsed, and the text cannot be split into clause atomic fragments, the minimum semantic units. Retrieval is therefore bound to whole passages, a large amount of irrelevant, redundant information is easily pulled in, retrieval deviates, and the core need behind the user's query is hard to match accurately.
No hierarchy path identifier or original position index is configured for the retrieval unit, so the logical membership of a clause within the normative system cannot be represented. This makes structured answer generation difficult to support, prevents retrieval results from being traced precisely back to the original file, fails to meet the core requirements of high school rule normative file queries for source authority and positional certainty, and limits the reliability and practicality of the retrieval results.
A single vector representation model is used for retrieval. If only dense vectors are used, key terms and literal features are hard to match precisely and terms are easily missed; if only sparse vectors are used, the deep semantics and contextual associations of the text cannot be captured and semantic deviation easily occurs. Moreover, retrieval based on a single vector index cannot improve precision while maintaining high recall, and struggles to identify the user's query intention accurately and match the most relevant clauses.
In view of the above, the present invention proposes a mixed vector-based normative file intelligent question-answering system to solve the above problems.
Disclosure of Invention
To overcome the deficiencies of the prior art and achieve the above purpose, the invention provides the following technical solution: a mixed vector-based normative file intelligent question-answering system, comprising:
a structure analysis module, configured to split a high school rule normative file into a plurality of clause atomic fragments by extracting natural hierarchy information, generate a corresponding hierarchy path identifier and an original position index for each clause atomic fragment, and construct a clause atomic fragment set by association encapsulation;
a knowledge base construction module, configured to perform multidimensional semantic modeling on the clause atomic fragment set to generate a dense vector representation and a sparse vector representation respectively;
a dual-stream hybrid retrieval module, configured to receive and parse a user query instruction, extract a query intention and key terms to generate a query feature vector, and perform parallel retrieval on a mixed vector index based on the query feature vector to obtain an initial candidate clause set;
a conflict resolution module, configured to detect clauses in the initial candidate clause set that have an old-versus-new conflict or a mutual exclusion relation, automatically remove clauses that are invalid or have been replaced, and float the currently effective clauses to the highest position to form a conflict-resolved clause sequence; and
a structured generation module, configured to select a target clause from the conflict-resolved clause sequence and generate a structured intelligent question-answering result by combining the natural hierarchy information of the target clause.
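To make the dual-stream retrieval over the mixed vector index more concrete, the following is a minimal sketch under stated assumptions, not the disclosed implementation. The toy `DENSE_INDEX` and `SPARSE_INDEX` contents, the clause ids, cosine similarity as the dense score, summed term weights as the sparse score, and merging the two result lists by ordered union are all assumptions; the text above does not specify how the two streams are fused.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Toy mixed index: both representations are keyed by the same clause id,
# standing in for the mapping that ties them to one clause atomic fragment.
DENSE_INDEX: Dict[str, List[float]] = {
    "art-1": [0.9, 0.1, 0.0],
    "art-2": [0.1, 0.8, 0.1],
    "art-3": [0.0, 0.2, 0.9],
}
SPARSE_INDEX: Dict[str, Dict[str, float]] = {
    "art-1": {"enrollment": 2.0},
    "art-2": {"leave": 1.5, "attendance": 1.0},
    "art-3": {"discipline": 2.0, "attendance": 0.5},
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_search(query_vec: List[float], top_k: int) -> List[str]:
    # Semantic stream: rank clauses by cosine similarity to the query vector.
    ranked = sorted(DENSE_INDEX, key=lambda cid: cosine(query_vec, DENSE_INDEX[cid]), reverse=True)
    return ranked[:top_k]


def sparse_search(terms: List[str], top_k: int) -> List[str]:
    # Keyword stream: score each clause by the summed weights of matching terms.
    scores = {cid: sum(w.get(t, 0.0) for t in terms) for cid, w in SPARSE_INDEX.items()}
    hits = [cid for cid, score in scores.items() if score > 0]
    return sorted(hits, key=lambda cid: scores[cid], reverse=True)[:top_k]


def hybrid_search(query_vec: List[float], terms: List[str], top_k: int = 2) -> List[str]:
    # Run both streams in parallel, then merge by ordered union (dense hits first).
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search, query_vec, top_k)
        sparse_future = pool.submit(sparse_search, terms, top_k)
        dense_hits, sparse_hits = dense_future.result(), sparse_future.result()
    return list(dict.fromkeys(dense_hits + sparse_hits))
```

For example, `hybrid_search([0.0, 0.9, 0.1], ["enrollment"])` returns the two semantically closest clauses followed by the clause that only the keyword stream finds, which is the complementary behaviour the two representations are meant to provide.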
Preferably, the method for extracting natural hierarchy information comprises: acquiring the text of a high school rule normative file to be processed, and performing a standardized preprocessing operation on its content, including removing non-text content, unifying the numbering format, and preserving the original text order, thereby forming parsable standard text data; and traversing the identified natural hierarchy marks, determining the start and end positions of the different levels by combining the nesting relations and the order of appearance of the hierarchy marks, and establishing a tree-shaped hierarchy path representing the membership relations among the levels, thereby completing the acquisition of the natural hierarchy information.
Preferably, the method for acquiring the clause atomic fragment set comprises: traversing the standard text data based on the tree-shaped hierarchy path, and splitting the clause content of the high school rule normative file into a plurality of clause atomic fragments according to the natura