CN-121980284-A - Heterogeneous configuration text quick matching method based on skeleton index and frozen semantic anchor points


Abstract

The invention discloses a fast matching method for heterogeneous configuration text based on a skeleton index and frozen semantic anchor points. First, an adaptive structure dictionary tree (trie) and an inverted index are built over a multi-vendor configuration template library, and each template is parsed into structural operators and parameter placeholders. In the real-time detection stage, a scanning mechanism based on maximum prefix matching and a fuzzy automaton performs automatic error correction and extracts a mixed feature sequence from the heterogeneous text. Dual-channel vectorization is then executed: position-weighted N-gram hash coding generates sparse vectors for the structure sequences, while a pre-trained language model with frozen parameters, combined with prompt words, generates dense semantic vectors for the parameter instances, and a type semantic anchor space is constructed from natural language descriptions. Finally, the best match is determined by a cascade of fast structural pre-screening, fine ranking on parameter semantics, and dynamically weighted comprehensive scoring. By decoupling the command skeleton from the parameter semantics, the method balances retrieval efficiency with the generalization of semantic understanding and markedly improves the accuracy and robustness of heterogeneous configuration text matching.
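
As a rough illustration of the index-building step summarized in the abstract, the following Python sketch splits each template into structural operators and placeholders, stores the operator skeleton in a trie, and keeps an inverted index from structure ID to template numbers. The angle-bracket placeholder syntax, the `$sid` marker key, the class and function names, and the sample templates are all assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict

# Hypothetical placeholder set. The source names IPV4 and PORT as example
# types; the angle-bracket spelling and <NAME> are assumptions of this sketch.
PLACEHOLDERS = {"<IPV4>", "<PORT>", "<NAME>"}


def split_template(template):
    """Split a template into (structural operator skeleton, placeholder types)."""
    words = template.split()
    skeleton = tuple(w for w in words if w not in PLACEHOLDERS)
    types = [w for w in words if w in PLACEHOLDERS]
    return skeleton, types


class SkeletonIndex:
    """Trie over command skeletons plus an inverted index SID -> template numbers."""

    def __init__(self):
        self.trie = {}                     # nested dicts; "$sid" marks a complete skeleton
        self.inverted = defaultdict(list)  # structure ID -> list of template numbers
        self._next_sid = 0

    def add(self, template_no, template):
        """Insert a template and return the structure ID of its skeleton."""
        skeleton, _ = split_template(template)
        node = self.trie
        for word in skeleton:
            node = node.setdefault(word, {})
        if "$sid" not in node:             # first template with this skeleton
            node["$sid"] = self._next_sid
            self._next_sid += 1
        self.inverted[node["$sid"]].append(template_no)
        return node["$sid"]

    def lookup(self, skeleton):
        """Return the template numbers sharing exactly this skeleton."""
        node = self.trie
        for word in skeleton:
            node = node.get(word)
            if node is None:
                return []
        sid = node.get("$sid")
        return [] if sid is None else list(self.inverted[sid])


index = SkeletonIndex()
index.add(0, "interface <NAME>")
index.add(1, "ip route <IPV4> <IPV4>")
index.add(2, "ip route <IPV4> <NAME>")
print(index.lookup(("ip", "route")))
```

In this sketch, templates 1 and 2 share the skeleton `ip route` and therefore the same structure ID; telling them apart is left to the parameter-type stage that the abstract describes.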

Inventors

XU YANG
SUN ZHIXIN
XU YUHUA
CHEN SONGLE

Assignees

Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date: 20260505
Application Date: 20260124

Claims (10)

1. A method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors, characterized by comprising the following steps: Step S1, constructing a structure dictionary and an inverted index based on a multi-vendor configuration template library: performing lexical analysis on preset templates, distinguishing structural operators from parameter type placeholders, constructing a globally unique adaptive structure dictionary tree, and establishing a structure ID inverted index; Step S2, adaptive scanning and feature extraction of the text to be detected: receiving a real-time configuration text stream, performing maximum prefix matching scanning with the structure dictionary, performing real-time spelling correction in combination with a fuzzy automaton, and converting the text into a mixed feature sequence consisting of structure IDs and parameter instances; Step S3, heterogeneous feature decoupling and dual-channel vectorization coding: separating the mixed feature sequence, applying position-weighted N-gram hash coding to the structure ID sequence to generate a sparse structure vector, and applying prompt-based pre-trained language model inference to the parameter instances to generate dense semantic vectors; Step S4, constructing a type feature space based on semantic anchors: constructing a natural language description for each parameter type placeholder defined by the templates, extracting its feature vector as a semantic anchor, and establishing a global type reference matrix; and Step S5, multi-stage cascade matching with structure-type cooperation: performing fast candidate pre-screening with the sparse structure vector and the inverted index, performing type verification by calculating the similarity between the parameter instance vectors and the semantic anchors, and finally introducing a dynamic weight mechanism based on the length of the parameter sequence to calculate a global score and determine the best matching template.
2. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein constructing the structure dictionary and the inverted index in step S1 comprises: step S11, defining a parameter type placeholder set and traversing the word sequence of each template, wherein a word that does not belong to the set is judged to be a structural operator and a word that belongs to the set is judged to be a type placeholder; step S12, extracting the structural operator subsequences of all templates to construct a dictionary tree, and assigning a globally unique structure identifier (SID) to each leaf node that represents a complete command skeleton; step S13, establishing an inverted index table whose keys are SIDs and whose values are lists of the numbers of the templates containing that structure.
3. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein the adaptive scanning and feature extraction in step S2 comprises: step S21, using a scanning pointer to perform a depth-first traversal of the structure dictionary, and outputting the corresponding standard structure ID if a path exactly matches the input segment; step S22, if the input segment cannot be matched exactly, calculating the edit distance between the input segment and the candidate nodes, and if the minimum edit distance is smaller than a preset threshold, judging the segment to be a spelling error and correcting it to the standard structure ID; step S23, if both exact matching and fuzzy correction fail, judging the segment to be a parameter instance, cutting out the character string at the delimiter and buffering it, and inserting a parameter slot marker at the corresponding position of the structure sequence.
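4. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 3, wherein the edit distance in step S22 is the Levenshtein distance, computed recursively as follows: let the input segment be $a$ and the candidate node string be $b$; the distance function $\operatorname{lev}(a, b)$ is defined as
$$\operatorname{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0, \\ |b| & \text{if } |a| = 0, \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\ 1 + \min\{\operatorname{lev}(\operatorname{tail}(a), b),\ \operatorname{lev}(a, \operatorname{tail}(b)),\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))\} & \text{otherwise,} \end{cases}$$
where $\operatorname{tail}(x)$ denotes the substring of string $x$ after its first character is removed.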
5. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein the dual-channel vectorization coding in step S3 is implemented as follows: channel one, defining an N-gram sliding window over the structure ID sequence, generating a high-dimensional sparse vector by hash function mapping, and introducing a position decay weight function of the starting position of the window; channel two, combining each parameter instance with its preceding structure words to construct a prompt word input sequence, inputting the sequence into a general-purpose pre-trained language model with frozen parameters, and extracting the final-layer output at the [CLS] position as the parameter semantic vector.
6. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein constructing the semantic-anchor-based type feature space in step S4 comprises: step S41, constructing a corresponding natural language description text for each predefined placeholder, such as IPV4 and PORT; step S42, inputting each description text into the same frozen pre-trained language model as in step S3, extracting the output vector as the semantic anchor of that type, and constructing a global type feature anchor matrix.
7. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein the first stage of the structure-type cooperative multi-stage cascade matching in step S5 is: using the structural sparse vector to retrieve candidate templates from the inverted index, calculating the weighted Jaccard similarity coefficient between the input vector and each template vector, and admitting to the candidate set the templates whose similarity coefficient is greater than a threshold.
8. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein the second stage of the cascade matching in step S5 is parameter consistency verification: step S51, selecting from the anchor matrix the semantic anchors indexed by the placeholder type sequence of the candidate template; step S52, calculating the cosine similarity between each parameter instance vector and its corresponding semantic anchor; step S53, combining a length penalty factor to calculate the overall parameter score.
9. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 1, wherein the third stage of the cascade matching in step S5 is dynamically weighted comprehensive scoring: defining a dynamic weight coefficient based on the parameter sequence length, calculating a final comprehensive matching score, and selecting the template that has the highest score and exceeds a confidence threshold as the final result.
10. The method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchors according to claim 5, wherein the prompt word input sequence of channel two is constructed in a format that combines the preceding structure words with the parameter instance; the sequence uses context to enhance the semantic representation of parameter instances, so as to disambiguate homonymous parameters under different command structures.
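
The fuzzy correction of claims 3 and 4 and the structural pre-screening of claims 5 and 7 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the claimed implementation: the exponential decay `exp(-decay * i)`, the CRC32 hash, the vector dimension, the threshold value, and all function names are hypothetical choices, since the source text does not preserve the corresponding formulas. The weighted Jaccard coefficient uses the common definition (sum of element-wise minima divided by sum of element-wise maxima).

```python
import math
import zlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a, b):
    """Levenshtein distance via the head/tail recursion (memoized)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return lev(a[1:], b[1:])
    return 1 + min(lev(a[1:], b), lev(a, b[1:]), lev(a[1:], b[1:]))


def correct(word, vocabulary, threshold=2):
    """Return the closest structural word if its distance is below the
    threshold, otherwise None (the word is then treated as a parameter)."""
    best = min(vocabulary, key=lambda v: lev(word, v))
    return best if lev(word, best) < threshold else None


def sparse_vector(sids, n=2, dim=1024, decay=0.1):
    """Hash N-grams of structure IDs into buckets, weighting each N-gram by an
    assumed exponential decay of its starting position i."""
    if not sids:
        return {}
    vec = {}
    for i in range(max(len(sids) - n + 1, 1)):
        gram = ",".join(str(s) for s in sids[i:i + n])
        bucket = zlib.crc32(gram.encode("utf-8")) % dim
        vec[bucket] = vec.get(bucket, 0.0) + math.exp(-decay * i)
    return vec


def weighted_jaccard(u, v):
    """Weighted Jaccard similarity of two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


vocab = ["interface", "ip", "route"]
print(correct("interfac", vocab))   # misspelled structural word
print(correct("10.0.0.1", vocab))   # too far from every word: a parameter
print(weighted_jaccard(sparse_vector([3, 7, 9]), sparse_vector([3, 7, 9])))
```

A production scanner would bound the edit-distance search with the trie or a Levenshtein automaton rather than comparing against every vocabulary word; the linear scan here only keeps the sketch short.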

Description

Heterogeneous configuration text quick matching method based on skeleton index and frozen semantic anchor points

Technical Field

The invention relates to the intersection of automated network operation and maintenance and natural language processing, and in particular to a method for quickly matching heterogeneous configuration text based on skeleton indexes and frozen semantic anchor points, in the field of network device configuration management. The invention is suitable for configuration verification, compliance checking and automated parsing in multi-vendor hybrid networking environments, and can parse and accurately match, in real time, heterogeneous configuration text streams containing spelling noise, abbreviations and nonstandard formats, without manually labeled training data.

Background

With the evolution of large-scale data centers and cloud computing networks, modern network architectures have grown increasingly complex and are usually built from devices of multiple vendors such as Huawei, Cisco and H3C; the accuracy of device configuration management bears directly on the stability and security of the network. The configuration text of network devices usually takes the form of unstructured command line interface (CLI) commands, whose syntax and parameter formats differ greatly between vendors, and in actual operation and maintenance, manually entered configuration instructions often carry noise such as misspellings, nonstandard abbreviations or reversed parameter order.
Traditional configuration parsing methods rely mainly on expert experience to write large numbers of regular expressions or static templates. Their maintenance cost is extremely high, and the rule base expands exponentially as device models multiply. Such methods perform rigid matching, lack fault tolerance for misspellings or ambiguous semantics, and are difficult to adapt to the automated processing of massive heterogeneous configuration text.

In the prior art, network configuration checking and parsing follows two main technical paths: network logic verification and rules learned by traditional machine learning. For example, patent CN116170304B discloses a network device configuration file checking method that generates a network link table from the startup and current configuration files, checks whether each MAC address is in a trust certificate, and tracks the consistency of routing node links to verify that the configuration is acceptable. Although this scheme has certain advantages for checking logical correctness and link connectivity after the configuration takes effect, it presupposes that the key parameter information in the configuration file is parsed accurately, which requires the input configuration text to be standard and well formed. The method lacks deep semantic understanding and automatic error correction at the front-end text layer: when the original configuration text contains spelling errors (for example, interface mistyped as interfac) or vendor-specific heterogeneous syntax, it cannot normalize the text in the parsing stage, so the subsequent link logic check fails because valid parameters cannot be extracted.

On the other hand, to reduce the maintenance cost of manual rules, some prior art has begun to attempt rule extraction using machine learning.
For example, patent CN119854127B discloses a self-learning algorithm for network configuration auditing rules, which automatically generates auditing rules by collecting historical traffic data and device running logs, extracting key features and training a decision tree model with supervised learning. Although this method automates rule construction to a certain extent, it depends heavily on large amounts of fault-labeled historical data for supervised training; in a real production network, fault samples are often scarce and costly to label, so the model faces a cold-start problem. In addition, a decision-tree rule model essentially processes discrete features and lacks the ability to understand the deep semantics of natural language. When faced with unseen configuration parameter descriptions or the command structures of new devices, it cannot, as a large language model can, perform zero-shot generalized matching through semantic association, and it struggles to resolve the semantic ambiguity of parameter types.

In summary, existing configuration text matching techniques struggle to reconcile processing speed, fault tolerance and zero-shot parameter verification. While the traditional logic verification method is too focused on the