CN-121980276-A - Tibetan text small sample learning method based on matching network
Abstract
The invention provides a Tibetan text small sample learning method based on a matching network, relating to the fields of natural language processing and deep learning. The method comprises: data preprocessing, including text cleaning, word segmentation, and labeling; design of a matching network composed of a global feature extractor and a local feature extractor that capture language features at different levels; dynamic integration of the global and local features through a feature fusion mechanism; similarity judgment using an improved measurement method; and model training and optimization in which a meta-learning mechanism and an adaptive learning rate adjust the small sample learning strategy. The method improves the accuracy and efficiency of text processing for low-resource languages such as Tibetan, and provides an effective solution for natural language processing of resource-scarce languages.
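The abstract states that global (sentence-level) and local (lexical/context) features are dynamically integrated through a feature fusion mechanism, but does not give the fusion formula. The sketch below assumes one common choice, a learned sigmoid gate interpolating between the two feature vectors; the function name `gated_fusion` and parameters `W`, `b` are illustrative, not from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_global, h_local, W, b):
    # Gate in (0, 1) decides per dimension how much of the global
    # (sentence-level) vs. local (lexical/context) feature to keep.
    gate = sigmoid(np.concatenate([h_global, h_local]) @ W + b)
    return gate * h_global + (1.0 - gate) * h_local

rng = np.random.default_rng(0)
d = 8
h_g = rng.normal(size=d)              # toy global feature vector
h_l = rng.normal(size=d)              # toy local feature vector
W = rng.normal(size=(2 * d, d)) * 0.1
b = np.zeros(d)
fused = gated_fusion(h_g, h_l, W, b)
print(fused.shape)  # → (8,)
```

Because the gate is a per-dimension convex combination, each fused value stays between the corresponding global and local feature values.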
Inventors
- YU YONGBIN
- LI CHENBO
- WANG XIANGXIANG
- Fan Manping
- Renqing Dongzhu
- Lausanne Garden
- Renzeng duojie
- QUN NUO
- ZHANG ZIYUE
- LIU YUTONG
- NIMA ZHAXI
- CHEN JUAN
- Ding Jiaheng
- Toudan just let
- Ban Mabao
- ZHENG ZHIWEN
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-07
Claims (8)
- 1. A Tibetan text small sample learning method based on a matching network, characterized by comprising the following steps: Step S1, data preprocessing: cleaning, labeling, and word segmentation are performed on a Tibetan text raw data set to construct a data set suitable for small sample learning; the data set is divided into a training set and a test set, and a support set and a query set are formed by random sampling from the training set. Step S2, construction of a matching network model: the matching network model comprises a global feature extractor, responsible for capturing sentence-level semantic information, and a local feature extractor, responsible for extracting detailed information about vocabulary and its context; query sample dynamic embedding and support set sample embedding are carried out through the matching network. Step S3, feature fusion and similarity measurement using the matching network model. Step S4, attention weight allocation and category label prediction: similarity scores are converted into attention weights through a softmax function, and the category labels in the support set are weighted and summed using the attention weights to obtain the category prediction of the query sample. Step S5, meta-learning-driven model training: parameters of the embedding functions are adjusted by minimizing the loss between predicted and true category labels; an episode-based meta-learning training strategy is adopted, multiple tasks are formed by repeatedly sampling support and query sets, and the parameters of the matching network model are iteratively optimized using a cross-entropy loss.
- 2. The Tibetan text small sample learning method based on the matching network according to claim 1, wherein the query sample dynamic embedding in step S2 comprises: the dynamic embedding function of a query sample x̂, after incorporating support set context information, becomes: f(x̂, S) = attLSTM(f′(x̂), g(S), K); wherein attLSTM is a combination of an LSTM structure and an attention mechanism, f′(x̂) represents the features of the query sample x̂, S is the support set, g(S) is the embedded representation g(x_i) of each element x_i in the support set S, and K is the number of processing time steps; the support set S is represented as S = {(x_i, y_i)}_{i=1}^{n}, wherein x_i is a support set sample, y_i is the category label corresponding to the i-th support set sample x_i, and n represents the number of samples in the support set; the LSTM states after k processing steps are: ĥ_k, c_k = LSTM(f′(x̂), [h_{k−1}, r_{k−1}], c_{k−1}); h_k = ĥ_k + f′(x̂); r_{k−1} = Σ_{i=1}^{n} a(h_{k−1}, g(x_i)) · g(x_i); a(h_{k−1}, g(x_i)) = softmax(h_{k−1}ᵀ g(x_i)); wherein ĥ_k is the raw hidden state, h_k is the final hidden state, c_k is the memory cell of step k, r_{k−1} is the weighted feature sum of context information extracted from the support set S at step k−1 through the attention mechanism, and g is the support set embedding function.
- 3. The Tibetan text small sample learning method based on the matching network according to claim 1, wherein the support set sample embedding in step S2 comprises: the support set embedding function g, which introduces support set context information, is implemented by a bidirectional LSTM and expressed as: h_i^fwd, c_i^fwd = LSTM(g′(x_i), h_{i−1}^fwd, c_{i−1}^fwd); h_i^bwd, c_i^bwd = LSTM(g′(x_i), h_{i+1}^bwd, c_{i+1}^bwd); g(x_i) = h_i^fwd + h_i^bwd + g′(x_i); wherein h_i^fwd and h_i^bwd are the forward and backward hidden states, c_i^fwd and c_i^bwd are the corresponding cell states, and g′(x_i) represents the initial features of the support set sample x_i, obtained by encoding the support set sample x_i with a preset feature extraction network.
- 4. The method according to claim 1, wherein the similarity measurement in step S3 uses a plurality of distance measures, including cosine distance, Euclidean distance, Poincaré distance, Minkowski distance, and Pearson correlation distance.
- 5. The Tibetan text small sample learning method based on the matching network according to claim 2, wherein the attention weight allocation in step S4 comprises: through an attention mechanism, the query sample embedding is aligned with the embeddings of the support set samples sharing the same true category, and the sample-level attention weight a(x̂, x_i) is obtained through similarity evaluation: a(x̂, x_i) = exp(−d(f(x̂), g(x_i))) / Σ_{j=1}^{n} exp(−d(f(x̂), g(x_j))); wherein x_i represents the i-th sample of the support set, f is the query embedding function and all query sample embeddings f(·) are obtained from the same embedding network, g is the support set embedding function and all support set embeddings g(·) are likewise obtained from the same embedding network, and d(f(x̂), g(x_i)) represents the Euclidean distance between f(x̂) and g(x_i), i.e., the straight-line distance between the two points.
- 6. The Tibetan text small sample learning method based on the matching network according to claim 5, wherein in step S4 the matching network model obtains the predicted category label ŷ of the query sample by: ŷ = Σ_{i=1}^{n} a(x̂, x_i) · y_i; wherein a(x̂, x_i) is the attention weight and x̂ is the new query sample.
- 7. The Tibetan text small sample learning method based on the matching network according to claim 6, wherein the training goal in step S5 is to optimize the matching network model parameters θ such that, given a support set S, the model maximizes the probability of correctly classifying the samples in a batch B: θ = argmax_θ E_{L∼T}[ E_{S∼L, B∼L}[ Σ_{(x̂, y) ∈ B} log P_θ(y | x̂, S) ] ]; wherein E denotes expectation, a task is sampled from the training set data by selecting N categories and k sentences for each category, L is a set of class labels sampled from the task distribution T, and P_θ(y | x̂, S) is the probability that the model predicts the correct class label y given the support set S and the query sample x̂; this probability is obtained by aggregating, per category, the attention weights produced by softmax normalization of the similarities between the query sample and the support set samples.
- 8. The Tibetan text small sample learning method based on the matching network according to claim 7, wherein the cross-entropy loss in step S5 is defined by taking the negative logarithm of the predicted probability corresponding to the true class label of each query sample and summing or averaging over the query set of one task: L = −Σ_{(x̂, y) ∈ Q} log P_θ(y | x̂, S); wherein Q represents the query set of one task.
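The prediction and loss pipeline described in claims 5, 6, and 8 — softmax attention over (negative) Euclidean distances between the query embedding and the support embeddings, an attention-weighted sum of one-hot support labels, and the negative log-probability of the true label — can be sketched in a few lines of NumPy. The embedding networks f and g of claims 2 and 3 are out of scope here, so the sketch assumes precomputed embedding vectors; the toy data and function names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(query_emb, support_embs, support_labels, n_classes):
    # Claim 5: attention weights from softmax over negative Euclidean distances.
    dists = np.linalg.norm(support_embs - query_emb, axis=1)
    attn = softmax(-dists)
    # Claim 6: weighted sum of one-hot support labels gives the class distribution.
    onehot = np.eye(n_classes)[support_labels]
    return attn @ onehot

def cross_entropy(probs, true_label):
    # Claim 8: negative log-probability of the true class label.
    return -np.log(probs[true_label] + 1e-12)

# 2-way toy episode: the query sits near the two class-0 support points.
support_embs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([0.05, 0.0])
probs = predict(query_emb, support_embs, support_labels, 2)
loss = cross_entropy(probs, 0)
print(np.argmax(probs))  # → 0
```

Because the attention weights sum to one, the weighted sum of one-hot rows is itself a valid probability distribution over categories, which is what makes the cross-entropy of claim 8 directly applicable.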
Description
Tibetan text small sample learning method based on matching network

Technical Field
The invention relates to the technical field of natural language processing and deep learning, and in particular to a Tibetan text small sample learning method based on a matching network.

Background
With the rapid development of artificial intelligence technology, natural language processing (NLP) has become an important branch of machine learning and is widely applied in scenarios such as machine translation, sentiment analysis, and speech recognition. In this context, deep learning models, with their powerful feature extraction capabilities, have achieved significant results in processing large-scale text data. However, these models generally depend on large amounts of annotated data, and their performance remains unsatisfactory for data-scarce languages such as Tibetan. High-quality Tibetan parallel corpora are scarce and expensive to annotate, which makes traditional deep learning methods difficult to apply directly to Tibetan natural language processing tasks. Small sample learning is an emerging learning paradigm that aims to enable models to achieve good performance with only a small amount of annotated data by designing more efficient learning algorithms. In recent years, small sample learning has made some progress in fields such as image recognition and speech recognition, but in text processing for resource-scarce languages such as Tibetan, research and application remain relatively limited; how to train a model by effectively exploiting limited data is therefore an urgent problem.
Disclosure of Invention
Aiming at the problem of data scarcity in existing Tibetan natural language processing, the invention provides a Tibetan text small sample learning method based on a matching network. The method makes full use of the small amount of available Tibetan text resources and, through an innovative network structure and learning strategy, effectively improves the model's ability to process Tibetan text in a small sample learning scenario.

A Tibetan text small sample learning method based on a matching network comprises: Step S1, data preprocessing: cleaning, labeling, and word segmentation are performed on a Tibetan text raw data set to construct a data set suitable for small sample learning; the data set is divided into a training set and a test set, and a support set and a query set are formed by random sampling from the training set. Step S2, construction of a matching network model: the matching network model comprises a global feature extractor, responsible for capturing sentence-level semantic information, and a local feature extractor, responsible for extracting detailed information about vocabulary and its context; query sample dynamic embedding and support set sample embedding are carried out through the matching network. Step S3, feature fusion and similarity measurement using the matching network model. Step S4, attention weight allocation and category label prediction: similarity scores are converted into attention weights through a softmax function, and the category labels in the support set are weighted and summed using the attention weights to obtain the category prediction of the query sample. Step S5, meta-learning-driven model training: parameters of the embedding functions are adjusted by minimizing the loss between predicted and true category labels; an episode-based meta-learning training strategy is adopted, multiple tasks are formed by repeatedly sampling support and query sets, and the parameters of the matching network model are iteratively optimized using a cross-entropy loss.

Further, the query sample dynamic embedding in step S2 comprises: the dynamic embedding function of a query sample x̂, after incorporating support set context information, becomes: f(x̂, S) = attLSTM(f′(x̂), g(S), K); wherein attLSTM is a combination of an LSTM structure and an attention mechanism, f′(x̂) represents the features of the query sample x̂, S is the support set, g(S) is the embedded representation g(x_i) of each element x_i in the support set S, and K is the number of processing time steps; the support set S is represented as S = {(x_i, y_i)}_{i=1}^{n}, wherein x_i is a support set sample, y_i is the category label corresponding to the i-th support set sample x_i, and n represents the number of samples in the support set; the LSTM states after k processing steps are: ĥ_k, c_k = LSTM(f′(x̂), [h_{k−1}, r_{k−1}], c_{k−1}); h_k = ĥ_k + f′(x̂); r_{k−1} = Σ_{i=1}^{n} a(h_{k−1}, g(x_i)) · g(x_i); a(h_{k−1}, g(x_i)) = softmax(h_{k−1}ᵀ g(x_i)); wherein ĥ_k is the raw hidden state, h_k is the final hidden state, c_k is the memory cell of step k, r_{k−1} is the weighted feature sum of context information extracted from the support set S at step k−1 through the attention mechanism, and g is the support set embedding function.
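Step S1 and the episode-based strategy of step S5 repeatedly sample tasks of N categories with k sentences per category from the training set, splitting each task into a support set and a query set. A minimal sketch of such an episode sampler follows; the data layout (`texts_by_class` as a list of per-class sentence lists) and the function name are assumptions for illustration, not from the patent:

```python
import numpy as np

def sample_episode(texts_by_class, n_way, k_shot, q_queries, rng):
    # Pick n_way distinct classes, then k_shot support and q_queries query
    # sentences per class, relabelled 0..n_way-1 within the episode.
    classes = rng.choice(len(texts_by_class), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(len(texts_by_class[c]))[: k_shot + q_queries]
        for i in idx[:k_shot]:
            support.append((texts_by_class[c][i], episode_label))
        for i in idx[k_shot:]:
            query.append((texts_by_class[c][i], episode_label))
    return support, query

# Toy corpus: 5 classes with 10 sentences each.
texts = [[f"class{c}_sent{i}" for i in range(10)] for c in range(5)]
rng = np.random.default_rng(0)
support, query = sample_episode(texts, n_way=3, k_shot=2, q_queries=4, rng=rng)
print(len(support), len(query))  # → 6 12
```

Drawing support and query sentences from a single shuffled index per class guarantees the two sets are disjoint within an episode, so the query loss never measures memorization of support sentences.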