CN-122021628-A - Graph structure data word segmentation processing method based on byte pair coding

CN122021628ACN 122021628 ACN122021628 ACN 122021628ACN-122021628-A

Abstract

The invention provides a graph structure data word segmentation processing method based on byte pair coding, and relates to the technical field of graph data processing, wherein the method is to obtain a global structure frequency mapping table of graph data by global statistics processing by utilizing a graph data set with labels; the method comprises the steps of obtaining a target graph example and a global structure frequency mapping table, processing by adopting a reversible traversing algorithm covering the whole edge to obtain a discrete symbol sequence with aligned structure, aggregating the discrete symbol sequence by utilizing a data-driven compression algorithm to obtain a graph semantic word segmentation vocabulary and a corresponding encoding-decoding rule set, and converting input graph data by utilizing the graph semantic word segmentation vocabulary and the corresponding encoding-decoding rule set to obtain a word segmentation processing result. The invention solves the problem that the existing processing sequence data model is difficult to obtain accurate word segmentation processing effect under the condition of no need of modifying model architecture.

Inventors

SHI CHUAN
YANG CHENG
GUO ZEYUAN

Assignees

北京邮电大学

Dates

Publication Date: 20260512
Application Date: 20260122

Claims (6)

1. The method for processing the word segmentation of the coded graph structure data based on the byte pair is characterized by comprising the following steps of: s1, obtaining a global structure frequency mapping table of graph data through global statistics processing by using a graph data set with labels; s2, processing the coding word segmentation device by adopting a reversible traversal algorithm and bytes covering the whole edge based on the target graph instance and the global structure frequency mapping table to obtain a discrete symbol sequence with aligned structures; s3, aggregating the discrete symbol sequences by using a data-driven compression algorithm to obtain a graph semantic word segmentation vocabulary; and S4, converting the input image data by utilizing the image semantic word segmentation vocabulary to obtain word segmentation processing results, and completing word segmentation processing of the image structure data.
2. The method for processing the word segmentation of the encoded graph structure data based on the byte pair according to claim 1, wherein S1 comprises: Traversing the data set of the map with the label, counting each pattern book, and identifying all the local structure modes and the original count values; processing the labeled graph data set based on the local structure mode and the original count value to obtain the relative frequency of each local mode in the global range; And converting the discrete labeled graph data set into probability distribution information reflecting the topological characteristics of the data by using the relative frequency of each local mode in the global range, so as to obtain a global structure frequency mapping table of the graph data.
3. The byte-based coded graph structure data word segmentation method according to claim 2, wherein the expression of the relative frequency of the local pattern in the global range is: ; ; ; Wherein, the Representing the technique of pattern p for a single graph data, The data of a single figure is represented, A diagram illustrating a structural mode is shown, The edges are represented as such, Representing the first node of the constituent edges, Representing a second node of the group edge, Representing a set of all the edges in the graph, Tags representing points or edges (noted as constants or default values for space), The original count is represented by a representation of the original count, Representing the entire set of graph datasets, Indicating the relative frequency of the normalization, Representing any one of all the legal patterns, The meta-character set representing the diagram contains labels for all points and edges.
4. The method for processing the word segmentation of the encoded graph structure data based on the byte pair according to claim 1, wherein the step S2 comprises: inquiring a global structure frequency mapping table, traversing the target graph example from a fixed initial node, and acquiring a relative frequency value of a local mode formed by each candidate edge; and carrying out priority decision calculation by using the relative frequency value, selecting a candidate edge with the highest relative frequency value as a path of the next hop, and encoding the encoding word segmentation device by using bytes to obtain a discrete symbol sequence with aligned structures.
5. The method for word segmentation of encoded graph structure data based on byte according to claim 4, wherein the expression of the path of the next hop is: ; Wherein, the The path of the next hop is indicated, Representing that for all possible next hop paths, the weighting function is taken such that The next modulation path with the largest value of (c) is, Representing node u with an unrecessed outgoing edge, Representing the calculation of weights for a particular next hop path based on structural statistics, Representing a candidate next-hop path, Indicating the normalized relative frequency.
6. The method for processing the word segmentation of the encoded graph structure data based on the byte pair according to claim 1, wherein the step S3 comprises: counting the co-occurrence frequency of all adjacent symbols in the discrete symbol sequence to obtain adjacent symbol pairs; selecting a pair of adjacent symbols with highest global frequency, merging the adjacent symbols into a new composite mark, and recording the new composite mark into a codebook by utilizing a merging rule; And carrying out iterative replacement on adjacent symbols corresponding to the discrete symbol sequence through the composite mark to obtain a graph semantic word segmentation vocabulary and a corresponding encoding-decoding rule set.

Description

Graph structure data word segmentation processing method based on byte pair coding Technical Field The invention relates to the technical field of graph data processing, in particular to a graph structure data word segmentation processing method based on byte pair coding. Background The large pre-training transducer model, represented by the Large Language Model (LLMs), has achieved the most advanced results in different fields. The key component of this success is the word segmenter, which converts the original input into a sequence of discrete symbols. By structuring the information into learnable units, the segmenter provides an interface between complex data and the transducer architecture, supporting the scalability and performance of these models. Two main strategies, each with its inherent limitations, have been explored with respect to the development of transformers to graph structure data. One strategy is to modify the architecture by incorporating an attention mechanism into the Graph Neural Network (GNNs) to create a specialized graph Transformer. These methods require deviations from the standard sequence model and specific graph designs of its ecosystem. Another strategy is to convert the graph into a continuous embedding (continuous embeddings) for use by the transducer, but this typically results in a missing or unstable representation of the information, which may degrade the model performance. Developing a principle graph segmenter requires review of the segmentation concepts in the context of view structure data. In particular, text may be modeled as a path graph (PATH GRAPH) in which a linear sequence of token provides a fixed neighborhood structure and canonical order, making the word relatively straightforward. In contrast, general graphs present additional challenges because their neighbors may branch into multiple directions rather than follow a simple linear sequence. They also lack permutation invariance (permutation invariance), i.e., the graph under node permutation is considered equivalent. Furthermore, co-occurrence statistics widely used in text (e.g., n-gram frequency based on successive token) are not directly applicable to graphs. Disclosure of Invention Aiming at the defects in the prior art, the invention provides the graph structure data word segmentation processing method based on byte pair coding, which solves the problem that the existing processing sequence data model is difficult to obtain accurate word segmentation processing effect under the condition of no need of modifying a model framework. In order to achieve the aim of the invention, the technical scheme adopted by the invention is that the graph structure data word segmentation processing method based on byte pair coding comprises the following steps: s1, obtaining a global structure frequency mapping table of graph data through global statistics processing by using a graph data set with labels; s2, processing the coding word segmentation device by adopting a reversible traversal algorithm and bytes covering the whole edge based on the target graph instance and the global structure frequency mapping table to obtain a discrete symbol sequence with aligned structures; S3, aggregating the discrete symbol sequences by using a data-driven compression algorithm to obtain a graph semantic word segmentation vocabulary and a corresponding encoding-decoding rule set; s4, converting the input image data by utilizing the image semantic word segmentation vocabulary and the corresponding encoding-decoding rule set to obtain a word segmentation processing result, and completing word segmentation processing of the image structure data. The invention has the beneficial effects that the invention provides the graph structure data word segmentation processing method based on byte pair coding, which combines reversible and structure-guided graph serialization and a byte pair coding word segmentation device (BPE) to construct a faithful and efficient interface to code the graph into a discrete symbol (Token) sequence. The processing of the graph data is not dependent on designing a specific graph neural network architecture, or the existing sequence data model architecture is modified to process the graph data, but the graph data is decoupled and processed without modification, so that the graph task is completed. The method can directly benefit from the latest hardware acceleration technology (such as an algorithm-FlashAttention of a high-efficiency computing attention mechanism) and architecture breakthrough (such as an ultra-long context window) in the field of sequence modeling, thereby having the expandability potential of processing millions of node-level ultra-large scale graph data. Meanwhile, the invention enables the mature text or image generation model which completes the classical generation type task (based on diffusion model-diffusion, stream matching-flowmatching) to directly generate the graph data by representi