CN-122024705-A - Voice text matching method, device, equipment and medium based on reinforcement learning
Abstract
The invention discloses a voice text matching method based on reinforcement learning. The method obtains a plurality of training samples to be matched and a matching training sample for each training sample to be matched, and constructs a semantic feature space from the training samples to be matched and the matching training samples. Within the semantic feature space, a target optimal matching path is determined for each candidate training sample to be matched according to a plurality of key anchor points, the plurality of candidate training samples to be matched, and a preset cumulative reward function. The voice text matching model is updated according to the cumulative reward value of the target optimal matching path, the relevant matching data of the current candidate training samples to be matched are stored in an experience pool, and the method returns to the step of obtaining the training samples and constructing the semantic feature space, until a training termination condition is reached. A trained target voice text matching model is thereby obtained, improving the accuracy of the matching results the model produces.
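The training loop summarized above can be sketched in Python. This is a minimal illustration only: the injected helpers (`get_samples`, `build_space`, `find_paths`) and the scalar-reward termination check are hypothetical stand-ins for the patented steps, not APIs from the disclosure.

```python
from typing import Callable


def train_matching_model(model,
                         get_samples: Callable,
                         build_space: Callable,
                         find_paths: Callable,
                         max_rounds: int = 100,
                         reward_threshold: float = 0.99):
    """Sketch of the reinforcement-learning training loop from the abstract.
    Every injected helper is a hypothetical stand-in for a patented step."""
    experience_pool = []
    for _ in range(max_rounds):
        to_match, matching = get_samples()          # obtain samples to be matched
        space = build_space(to_match, matching)     # construct the semantic feature space
        paths, reward = find_paths(space)           # target optimal paths + cumulative reward
        model.update(reward)                        # update the voice text matching model
        experience_pool.append((paths, reward))     # store matching data in the experience pool
        if reward >= reward_threshold:              # training termination condition
            break
    return model, experience_pool
```

The loop mirrors the abstract's structure: build the space, search for the best-rewarded path, update the model, bank the experience, and repeat until termination.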
Inventors
- LIU QUAN
- DU XIAOXIANG
Assignees
- 北京云上曲率科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260130
Claims (10)
- 1. A reinforcement learning-based voice text matching method, comprising: obtaining a plurality of training samples to be matched and a matching training sample corresponding to each training sample to be matched, wherein each training sample to be matched is any one of a voice feature to be matched and a text feature to be matched, each matching training sample is the other one of the voice feature to be matched and the text feature to be matched, and the voice features to be matched are in one-to-one correspondence with the text features to be matched; constructing a semantic feature space according to the plurality of training samples to be matched and the plurality of matching training samples, wherein the semantic feature space comprises a plurality of semantic feature candidate spaces, and each semantic feature candidate space comprises a key anchor point, a plurality of candidate training samples to be matched corresponding to the key anchor point, and a plurality of candidate matching training samples; determining a target optimal matching path corresponding to each candidate training sample to be matched according to the plurality of key anchor points, the plurality of candidate training samples to be matched and the plurality of candidate matching training samples corresponding to the plurality of key anchor points, and a preset cumulative reward function, wherein the target optimal matching path is used for determining the candidate matching training sample of each candidate training sample to be matched; and updating a voice text matching model according to the cumulative reward value corresponding to the target optimal matching path, storing the relevant matching data of the current plurality of candidate training samples to be matched into an experience pool, and returning to the step of obtaining a plurality of training samples to be matched and the matching training sample corresponding to each training sample to be matched and constructing the semantic feature space, until a training termination condition is reached, thereby obtaining a trained target voice text matching model, wherein the target voice text matching model is used for obtaining a matching result of a target object to be matched.
- 2. The method of claim 1, wherein constructing a semantic feature space from the plurality of training samples to be matched and the plurality of matching training samples comprises: constructing an initial semantic feature space according to the training samples to be matched and the matching training samples; dividing the initial semantic feature space according to the distribution densities of the training samples to be matched and the matching training samples to obtain an intermediate semantic feature space, wherein the intermediate semantic feature space comprises a plurality of intermediate semantic feature subspaces, and each intermediate semantic feature subspace comprises a plurality of intermediate training samples to be matched and a plurality of intermediate matching training samples; initializing an initial anchor point position for each intermediate semantic feature subspace according to the semantic center vector of that subspace, wherein the semantic center vector is determined from the plurality of intermediate training samples to be matched and the plurality of intermediate matching training samples; adjusting the initial anchor point position according to the initial anchor point position, a preset radius, and the matching success rate of the training samples to be matched and the matching training samples in the intermediate semantic feature subspace, to obtain a key anchor point, and verifying the key anchor point; determining a plurality of semantic feature candidate spaces among the plurality of intermediate semantic feature subspaces according to the verification result, wherein each semantic feature candidate space comprises a key anchor point, a plurality of candidate training samples to be matched corresponding to the key anchor point, and a plurality of candidate matching training samples; and constructing the semantic feature space from the plurality of semantic feature candidate spaces.
- 3. The method of claim 1, wherein determining the target optimal matching path corresponding to each candidate training sample to be matched according to the plurality of key anchor points, the plurality of candidate training samples to be matched and the plurality of candidate matching training samples corresponding to the plurality of key anchor points, and the preset cumulative reward function comprises: determining a plurality of candidate matching paths corresponding to each candidate training sample to be matched according to the key anchor points and the candidate training samples to be matched and candidate matching training samples corresponding to each key anchor point, wherein the candidate matching paths are used for determining candidate matching training samples corresponding to the candidate training samples to be matched; and acquiring the cumulative reward value corresponding to each candidate matching path according to the preset cumulative reward function, and determining the candidate matching path with the largest cumulative reward value as the target optimal matching path.
- 4. The method of claim 3, wherein determining a plurality of candidate matching paths corresponding to each candidate training sample to be matched according to the plurality of candidate training samples to be matched and the plurality of candidate matching training samples corresponding to the plurality of key anchor points comprises: sequentially taking each key anchor point as a starting point and matching among the plurality of candidate training samples to be matched and the plurality of candidate matching training samples corresponding to that key anchor point, to obtain a plurality of initial candidate matching paths, wherein each initial candidate matching path comprises a first initial candidate matching sub-path corresponding to the candidate training sample to be matched and a second initial candidate matching sub-path corresponding to the candidate matching training sample; performing path expansion for the first and second initial candidate matching sub-paths among the candidate training samples to be matched and candidate matching training samples corresponding to all key anchor points, to obtain a first candidate matching sub-path corresponding to the first initial candidate matching sub-path and a second candidate matching sub-path corresponding to the second initial candidate matching sub-path; and synthesizing the first candidate matching sub-path and the second candidate matching sub-path to obtain a candidate matching path.
- 5. The method of claim 4, wherein the candidate matching paths comprise a plurality of matching connection nodes, each matching connection node being a candidate training sample to be matched or a candidate matching training sample, and wherein acquiring the cumulative reward values corresponding to the plurality of candidate matching paths according to the preset cumulative reward function comprises: for each candidate matching path, obtaining a local excitation value of each matching connection node according to cosine similarity; acquiring a global excitation value of the candidate matching path; carrying out a weighted summation of the local excitation values and the global excitation value to obtain a path reward value; and substituting the path reward value into the preset cumulative reward function to calculate the cumulative reward value corresponding to the candidate matching path.
- 6. The method of claim 5, wherein the voice text matching model comprises a policy network and a value network, and updating the voice text matching model according to the cumulative reward value corresponding to the target optimal matching path comprises: determining a policy gradient according to the cumulative reward value corresponding to the target optimal matching path; adjusting the weight parameters of the policy network according to the policy gradient; determining a value error according to the cumulative reward value and a preset value-error loss function; and adjusting the weight parameters of the value network according to the value error.
- 7. The method of claim 6, further comprising: obtaining a target object to be matched, wherein the target object to be matched is any one of a voice feature target object to be matched and a text feature target object to be matched; and inputting the target object to be matched into the target voice text matching model, and acquiring a matching result of the target object to be matched through the policy network in the target voice text matching model.
- 8. A reinforcement learning-based voice text matching device, comprising: a training sample acquisition module, configured to acquire a plurality of training samples to be matched and a matching training sample corresponding to each training sample to be matched, wherein each training sample to be matched is any one of a voice feature to be matched and a text feature to be matched, each matching training sample is the other one of the voice feature to be matched and the text feature to be matched, and the voice features to be matched are in one-to-one correspondence with the text features to be matched; a semantic feature space construction module, configured to construct a semantic feature space according to the plurality of training samples to be matched and the plurality of matching training samples, wherein the semantic feature space comprises a plurality of semantic feature candidate spaces, and each semantic feature candidate space comprises a key anchor point, a plurality of candidate training samples to be matched corresponding to the key anchor point, and a plurality of candidate matching training samples; a target optimal matching path determining module, configured to determine a target optimal matching path corresponding to each candidate training sample to be matched according to the plurality of key anchor points, the plurality of candidate training samples to be matched and the plurality of candidate matching training samples, and a preset cumulative reward function, wherein the target optimal matching path is used for determining the candidate matching training sample of each candidate training sample to be matched; and a target voice text matching model obtaining module, configured to update a voice text matching model according to the cumulative reward value corresponding to the target optimal matching path, store the relevant matching data of the current plurality of candidate training samples to be matched into an experience pool, and return to the step of acquiring a plurality of training samples to be matched and the matching training sample corresponding to each training sample to be matched and constructing the semantic feature space, until a training termination condition is reached, so that a trained target voice text matching model is obtained, wherein the target voice text matching model is used for obtaining a matching result of a target object to be matched.
- 9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the reinforcement learning-based voice text matching method of any one of claims 1 to 7.
- 10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the reinforcement learning-based voice text matching method of any one of claims 1 to 7.
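The path scoring described in claims 3 to 5 can be sketched as follows, under stated assumptions: the local/global weighting (0.7/0.3), the discount factor, and the definition of the global excitation as the mean similarity of the path's nodes to the key anchor point are illustrative choices, not values specified by the claims.

```python
import numpy as np


def local_excitation(node_a: np.ndarray, node_b: np.ndarray) -> float:
    """Local excitation of a matching connection node: cosine similarity
    between consecutive node embeddings along the path (claim 5)."""
    denom = np.linalg.norm(node_a) * np.linalg.norm(node_b)
    return float(node_a @ node_b / denom) if denom > 0 else 0.0


def path_reward(nodes: list, anchor: np.ndarray,
                w_local: float = 0.7, w_global: float = 0.3) -> float:
    """Weighted sum of local and global excitation values (claim 5).
    The global excitation is illustrated here as the mean similarity of
    all nodes to the key anchor point -- an assumption."""
    local = np.mean([local_excitation(a, b) for a, b in zip(nodes, nodes[1:])])
    global_ = np.mean([local_excitation(n, anchor) for n in nodes])
    return w_local * local + w_global * global_


def cumulative_reward(path_rewards: list, gamma: float = 0.95) -> float:
    """Discounted cumulative reward over the steps of one candidate path."""
    return sum(gamma ** t * r for t, r in enumerate(path_rewards))


def best_path(candidate_paths: list, anchor: np.ndarray):
    """Claim 3: the candidate path with the largest cumulative reward
    value is selected as the target optimal matching path."""
    return max(candidate_paths,
               key=lambda nodes: cumulative_reward(
                   [path_reward(nodes[: i + 2], anchor)
                    for i in range(len(nodes) - 1)]))
```

Each node embedding stands for a candidate training sample to be matched or a candidate matching training sample; `best_path` realizes the argmax-over-cumulative-reward selection that the claims describe.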
Description
Voice text matching method, device, equipment and medium based on reinforcement learning

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a voice text matching method, device, equipment and medium based on reinforcement learning.

Background

With the rapid development of multimedia technology, cross-modal voice text matching technology is increasingly widely applied in fields such as voice recognition, intelligent human-machine interaction, and multimedia retrieval. The technology aims at constructing semantic associations between voice data and text data and realizing intelligent matching across modalities. However, traditional cross-modal voice text matching methods mainly rely on static feature extraction and simple similarity measures, and struggle to fully capture the complex semantic associations between voice and text. Against this background, deep learning has made significant breakthroughs in feature extraction and pattern recognition in recent years: efficient cross-modal matching models are constructed by applying models such as convolutional neural networks and recurrent neural networks to the feature extraction of voice and text, and by introducing methods such as metric learning. At the same time, reinforcement learning, as a method for learning optimal decision strategies through continuous interaction with an environment, has great potential for complex decision problems, but its application in the cross-modal matching field is still at an early exploratory stage.
For example, metric learning-based methods can use a Siamese network or triplet network to embed voice and text into a shared semantic space through contrastive learning and evaluate the matching degree by Euclidean distance or cosine similarity; alternatively, an alignment model based on an attention mechanism can realize dynamic alignment between voice frames and text tokens through a cross-modal Transformer and its multi-head attention, so as to complete voice text matching. However, the prior art lacks adaptive feature-matching capability because it relies on a fixed feature alignment mechanism. Meanwhile, it focuses too much on the similarity of local features during matching and ignores global semantic consistency, so long-distance semantic dependencies between voice and text cannot be effectively recognized and processed. Together, these limitations reduce the accuracy of voice text matching.

Disclosure of Invention

The invention aims to provide a voice text matching method based on reinforcement learning, so as to solve the problems in the prior art that adaptive feature-matching capability is lacking due to reliance on a fixed feature alignment mechanism, and that the over-emphasis on local feature similarity and neglect of global semantic consistency during voice text matching prevent long-distance semantic dependencies between voice and text from being effectively recognized and processed, thereby reducing the accuracy of voice text matching.
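The metric-learning baseline described above can be sketched as follows. The two random linear projections stand in for trained Siamese-style speech and text encoders, and all dimensions and weights are illustrative assumptions rather than details from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained speech and text encoders projecting into a
# shared d-dimensional semantic space (prior-art metric learning).
W_speech = rng.normal(size=(16, 8))   # speech features (16-d) -> shared space (8-d)
W_text = rng.normal(size=(32, 8))     # text features (32-d)  -> shared space (8-d)


def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project a feature vector into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)


def match_score(speech_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Cosine similarity in the shared space: a fixed similarity measure
    with no adaptive alignment -- the limitation the patent targets."""
    return float(embed(speech_feat, W_speech) @ embed(text_feat, W_text))
```

In such a scheme the alignment between modalities is fixed once the encoders are trained, which is precisely the limitation the disclosed reinforcement-learning path search is designed to address.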
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a reinforcement learning-based voice text matching method, comprising: obtaining a plurality of training samples to be matched and a matching training sample corresponding to each training sample to be matched, wherein each training sample to be matched is any one of a voice feature to be matched and a text feature to be matched, each matching training sample is the other one of the voice feature to be matched and the text feature to be matched, and the voice features to be matched are in one-to-one correspondence with the text features to be matched; constructing a semantic feature space according to the plurality of training samples to be matched and the plurality of matching training samples, wherein the semantic feature space comprises a plurality of semantic feature candidate spaces, and each semantic feature candidate space comprises a key anchor point, a plurality of candidate training samples to be matched corresponding to the key anchor point, and a plurality of candidate matching training samples; determining a target optimal matching path corresponding to each candidate training sample to be matched according to the plurality of key anchor points, the plurality of candidate training samples to be matched and the plurality of candidate matching training samples corresponding to the plurality of key anchor points, and a preset cumulative reward function, wherein the target optimal matching path is used for determining the candidate matching training sample of each candidate training sample to be matched; updating a voice text