CN-122020112-A - Text vector generation method and system based on multidimensional feature matrixing characterization

Abstract

The application provides a text vector generation method and system based on multidimensional feature matrixing characterization. It belongs to the field of natural language processing and aims to solve the loss of text structure information caused by simple fusion in the prior art. The method comprises: dividing an input text into a sequence of text units; extracting and quantifying features of each unit from multiple dimensions such as grammar, semantics, and statistics; constructing, from the sequence and the features, a multidimensional data matrix that preserves both the sequence order and the relation between feature dimensions; and fusing the matrix to generate a fixed-length vector representing the whole text. By constructing the multidimensional matrix, the application retains the complete structural information of the text, avoids the defects of the traditional method, and significantly improves the representation capability and accuracy of the text vector.

Inventors

LI MING
YUAN YE
KONG FEI

Assignees

北京中绿讯科科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251226

Claims (10)

1. A text vector generation method based on multidimensional feature matrixing characterization, characterized by comprising the following steps: step one, dividing an input text into a sequence containing a plurality of text units; step two, extracting and quantifying features of each text unit in the sequence from a plurality of preset feature dimensions, wherein the feature dimensions comprise at least a grammar feature dimension representing grammatical attributes, a semantic feature dimension representing semantic attributes, and a statistical feature dimension representing statistical attributes; step three, constructing, based on the text unit sequence and its corresponding multidimensional quantized features, a multidimensional data matrix that preserves the order of the text units and the relation between the feature dimensions; and step four, performing fusion processing on the multidimensional data matrix to generate a fixed-length overall text vector representing the entire input text.
2. The method of claim 1, further comprising, prior to step one: performing at least one preprocessing operation on the input text, the preprocessing operation including at least one of deduplication, stop-word filtering, and special symbol cleaning.
3. The method of claim 1, wherein the grammar feature dimension is a part-of-speech feature; the semantic feature dimension comprises a part-of-speech polarity feature and a word sense category feature; and the statistical feature dimension is a word frequency feature.
4. The method of claim 3, wherein quantifying the word frequency feature comprises mapping the raw word frequency of the text unit to a preset, nonlinear, hierarchical normalized value.
5. The method of claim 3, wherein the feature dimensions further include a temporal dimension or an emotional intensity dimension.
6. The method of claim 1, wherein the fusion processing is a weighted fusion in which the weights are determined based on the term frequency-inverse document frequency (TF-IDF) value of the text unit or the position of the text unit in the sequence.
7. The method of claim 1, wherein the fusion processing is implemented by a convolutional neural network.
8. The method of claim 7, wherein implementing the fusion processing by the convolutional neural network comprises: taking the multidimensional data matrix as input and applying at least one convolution kernel along the direction of the text unit sequence to extract local features; and performing a max pooling operation on the feature map obtained from the convolution to generate the overall text vector.
9. The method of claim 1, wherein the multidimensional data matrix is a three-dimensional matrix comprising a text unit sequence axis representing the order of the text units, a feature dimension axis representing the different feature dimensions, and a feature quantization value axis representing the feature quantization values.
10. A text vector generation system based on multidimensional feature matrixing characterization, comprising: a text segmentation module for dividing an input text into a sequence containing a plurality of text units; a feature extraction module for extracting and quantifying features of each text unit in the sequence from a plurality of preset feature dimensions, wherein the feature dimensions comprise at least a grammar feature dimension representing grammatical attributes, a semantic feature dimension representing semantic attributes, and a statistical feature dimension representing statistical attributes; a matrix construction module for constructing, based on the text unit sequence and its corresponding multidimensional quantized features, a multidimensional data matrix that preserves the order of the text units and the relation between the feature dimensions; and a vector fusion module for performing fusion processing on the multidimensional data matrix to generate a fixed-length overall text vector representing the entire input text.
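
As an illustration only, the four steps of claim 1 might be sketched as follows in plain Python. Everything concrete here is an assumption and not taken from the application: the toy lexicons, the numeric part-of-speech codes, the log2 bucketing used for the "nonlinear hierarchical" word frequency level of claim 4, and the 1/(i+1) position weights used for the weighted fusion of claim 6. The matrix is kept two-dimensional (text units by feature dimensions) for brevity.

```python
import math
from collections import Counter

# Hypothetical toy lexicons; the application does not specify any of these values.
POS_CODES = {"noun": 1.0, "verb": 2.0, "adj": 3.0, "other": 0.0}
TOY_POS = {"court": "noun", "ruling": "noun", "rejects": "verb", "unfair": "adj"}
TOY_POLARITY = {"unfair": -1.0, "rejects": -0.5}
TOY_SENSE = {"court": 1.0, "ruling": 1.0}  # 1.0 = "legal" category, 0.0 = unknown


def segment(text):
    """Step one: split the text into a sequence of units (whitespace split here)."""
    return text.lower().split()


def freq_level(count):
    """Assumed form of claim 4: map a raw count (>= 1) to one of four preset levels."""
    level = min(int(math.log2(count)) + 1, 4)  # log2 buckets, capped at level 4
    return level / 4.0                         # scaled to 0.25, 0.5, 0.75 or 1.0


def extract_features(units):
    """Step two: quantify each unit along grammar, semantic and statistical dimensions."""
    counts = Counter(units)
    rows = []
    for u in units:
        rows.append([
            POS_CODES[TOY_POS.get(u, "other")],  # grammar: part of speech
            TOY_POLARITY.get(u, 0.0),            # semantic: polarity
            TOY_SENSE.get(u, 0.0),               # semantic: word sense category
            freq_level(counts[u]),               # statistical: word frequency level
        ])
    return rows


def build_matrix(units, rows):
    """Step three: one row per unit in text order, one column per feature dimension."""
    assert len(units) == len(rows)
    return rows


def fuse_position_weighted(matrix):
    """Step four (claim 6 variant): position-weighted average over the sequence axis.

    Assumes a non-empty matrix; the 1/(i+1) decay is an illustrative choice.
    """
    weights = [1.0 / (i + 1) for i in range(len(matrix))]
    total = sum(weights)
    dims = len(matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, matrix)) / total
            for d in range(dims)]


units = segment("Court rejects unfair ruling")
matrix = build_matrix(units, extract_features(units))
vector = fuse_position_weighted(matrix)
```

The fused vector has one entry per feature dimension, so its length stays fixed however long the input text is, which is the property step four requires.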

Description

Text vector generation method and system based on multidimensional feature matrixing characterization

Technical Field

The application relates to the technical field of computers, and in particular to a text vector generation method and system based on multidimensional feature matrixing characterization.

Background

In natural language processing and artificial intelligence, text vectorization is a key step in converting unstructured text into numerical vectors that a computer can process. The prior art has evolved from the early bag-of-words model and the term frequency-inverse document frequency (TF-IDF) model to modern distributed representation methods. Early techniques such as the bag-of-words model rely mainly on word frequency statistics; they ignore word order and semantic association entirely and tend to produce high-dimensional, sparse vectors. Distributed representation methods, typified by word embeddings, later learned the semantic similarity of words through neural networks, but the resulting word vectors are static: they cannot resolve polysemy, nor can they effectively incorporate rich grammatical or emotional features such as part of speech and sentiment. In recent years, pre-trained language models based on deep learning have used attention mechanisms to generate context-dependent dynamic vectors, significantly deepening semantic understanding. However, the prior art still shares a general drawback when fusing multidimensional features into a vector that represents the entire text.
For example, although some methods consider several features such as part of speech (a grammatical attribute), semantic category (a semantic attribute), and word frequency (a statistical attribute), they typically generate the final text vector by directly weighting and summing, or concatenating, the feature vectors of each word. This simple linear fusion destroys the inherent structural information of the text: it loses the order of the words in the text, and it also loses the structural association between the different feature dimensions within each word. Because of this loss, the generated vector cannot completely and accurately represent the original text, and its representation capability and accuracy are limited, particularly in complex scenarios that place high demands on text structure and logical relations, such as legal documents and medical records.

Disclosure of Invention

The application aims to provide a text vector generation method and system based on multidimensional feature matrixing characterization. It addresses the technical problem that, when generating a text vector, the simple feature fusion used in the prior art loses both the inherent sequence structure of the text and the dimensional structure within the features, which degrades the accuracy of the final vector representation.
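
The order-loss problem described in the Background can be illustrated with a small sketch. Summing per-unit feature vectors gives the same result for two sequences that contain the same units in different order, whereas convolving along the sequence axis and then max-pooling, as in the convolutional fusion of claim 8, can tell them apart. The two-dimensional toy features and the hand-written kernels below are hypothetical; a real convolutional network would learn its kernels.

```python
def sum_fuse(matrix):
    """Simple linear fusion: add the per-unit feature vectors, ignoring position."""
    return [sum(col) for col in zip(*matrix)]


def conv_maxpool_fuse(matrix, kernels):
    """Slide each kernel along the sequence axis, then max-pool each feature map.

    Assumes the sequence is at least as long as every kernel.
    """
    fused = []
    for kernel in kernels:
        width = len(kernel)
        feature_map = []
        for start in range(len(matrix) - width + 1):
            window = matrix[start:start + width]
            # Elementwise product of kernel and window, summed to one value.
            feature_map.append(sum(k * x
                                   for krow, xrow in zip(kernel, window)
                                   for k, x in zip(krow, xrow)))
        fused.append(max(feature_map))
    return fused


# Same three units, opposite order (rows = text units, columns = feature dimensions).
a = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
b = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# Hypothetical hand-written kernels spanning two consecutive units.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],  # fires on "dimension-0 unit, then dimension-1 unit"
    [[0.0, 1.0], [1.0, 0.0]],  # fires on the reverse order
]

print(sum_fuse(a) == sum_fuse(b))      # True: summation cannot see the order
print(conv_maxpool_fuse(a, kernels))   # [2.0, 3.0]
print(conv_maxpool_fuse(b, kernels))   # [3.0, 2.0]
```

The output of `conv_maxpool_fuse` has one entry per kernel, so the vector length is fixed by the number of kernels rather than by the length of the text.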
The application provides a text vector generation method based on multidimensional feature matrixing characterization, comprising the following steps: step one, dividing an input text into a sequence containing a plurality of text units; step two, extracting and quantifying features of each text unit in the sequence from a plurality of preset feature dimensions, wherein the feature dimensions comprise at least a grammar feature dimension representing grammatical attributes, a semantic feature dimension representing semantic attributes, and a statistical feature dimension representing statistical attributes; step three, constructing, based on the text unit sequence and its corresponding multidimensional quantized features, a multidimensional data matrix that preserves the order of the text units and the relation between the feature dimensions; and step four, performing fusion processing on the multidimensional data matrix to generate a fixed-length overall text vector representing the entire input text. Optionally, before step one, at least one preprocessing operation is performed on the input text, the preprocessing operation including at least one of deduplication, stop-word filtering, and special symbol cleaning. Optionally, the grammar feature dimension is a part-of-speech feature, the semantic feature dimension comprises a part-of-speech polarity feature and a word sense category feature, and the statistical feature dimension is a word frequency feature. Further, quantifying the word frequency feature comprises mapping the raw word frequency of the text unit to a preset, nonlinear, hierarchical normalized value. Further, the feature dimensions further include a temporal dimension or an emotional intensity dimension. Optionally, the fusion process is