CN-121565141-B - Large-scale speech synthesis task processing method based on concurrency optimization scheduling

CN 121565141 B

Abstract

The invention discloses a large-scale speech synthesis task processing method based on concurrency-optimized scheduling, comprising the following steps: S1, receiving a plurality of speech synthesis task requests from an external system and constructing a task set to be scheduled; S2, collecting the system running state of the current speech synthesis platform and constructing a system state information set; S3, inputting the system state information set into a structure-improved deep Q network model to form a fusion state representation vector; S4, inputting the fusion state representation vector into a Q value estimation main network and dispatching target tasks in the task set to be scheduled to designated computing resource nodes; S5, collecting the execution feedback information of the speech synthesis tasks and constructing a task execution state transition sample set; S6, inputting the state transition sample set into an experience replay pool and updating the parameters of the structure-improved deep Q network model. The invention adopts an improved deep Q network model to realize intelligent scheduling optimization of large-scale speech synthesis tasks.

Inventors

  • WEI YAQIAN
  • MA YILIN
  • ZHANG CHAO
  • LIU YUJIE

Assignees

  • Guoneng Shenwan Hefei Power Generation Co., Ltd. (国能神皖合肥发电有限责任公司)

Dates

Publication Date
2026-05-12
Application Date
2025-11-26

Claims (10)

  1. A large-scale speech synthesis task processing method based on concurrency-optimized scheduling, characterized by comprising the following steps: S1, receiving a plurality of speech synthesis task requests from an external system, and constructing a task set to be scheduled; S2, collecting the system running state of the current speech synthesis platform, and constructing a system state information set; S3, inputting the system state information set into a structure-improved deep Q network model, wherein the structure-improved deep Q network model introduces a multi-channel state input module that respectively extracts features from the state information of each corresponding channel and splices all channel features to form a fusion state representation vector; S4, inputting the fusion state representation vector into a Q value estimation main network, outputting an optimal scheduling strategy, dispatching a target task in the task set to be scheduled to a designated computing resource node according to the optimal scheduling strategy, and calling a target speech model to execute the speech synthesis generation operation; S5, collecting execution feedback information of the speech synthesis task, and constructing a task execution state transition sample set; S6, inputting the state transition sample set into an experience replay pool, periodically sampling sample batches from the experience replay pool, and updating the parameters of the structure-improved deep Q network model using the target network and the loss function to optimize the speech task scheduling strategy.
  2. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein each target speech synthesis task in the task set to be scheduled comprises an input text, a speech model type, a speech style identifier, a response time limit parameter and an allocation status identifier, and the allocation status identifier comprises unscheduled, scheduled and completed.
  3. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein the system state information set comprises task queue state information, resource node state information and historical scheduling feedback information.
  4. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein the step S3 specifically comprises: S31, inputting the system state information set into the structure-improved deep Q network model, wherein the structure-improved deep Q network model comprises a multi-channel state input module, a feature fusion module, a Q value estimation main network and a strategy updating module; S32, the multi-channel state input module comprises a task state channel coding sub-network, a resource state channel coding sub-network and a historical track state channel coding sub-network; S33, inputting the task queue state information into the task state channel coding sub-network, wherein the sub-network comprises at least one layer of fully connected neural network with an activation function, performing normalization processing and nonlinear feature mapping to generate a task feature representation vector; S34, inputting the resource node state information into the resource state channel coding sub-network, wherein the sub-network comprises a normalization layer and a two-layer multi-layer perceptron structure, extracting high-dimensional features of the resource states and performing dimensional compression to generate a resource feature representation vector; S35, inputting the historical scheduling feedback information into the historical track state channel coding sub-network, wherein the sub-network comprises a sliding window aggregation module and a feature transformation layer, modeling the trend of the historical data to generate a historical behavior feature vector; S36, in the feature fusion module, splicing the task feature representation vector, the resource feature representation vector and the historical behavior feature vector along the feature dimension to generate the fusion state representation vector.
  5. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 4, wherein the step S33 specifically comprises: S331, inputting the task queue state information into the task state channel coding sub-network, wherein each task structured vector consists of the input text length, the speech model type identifier, the speech style identifier, the task response time limit parameter and the task priority grade; S332, performing linear normalization processing on each dimensional attribute of each task structured vector to generate a normalized task input matrix; S333, inputting the normalized task input matrix into a feature transformation module comprising one or more layers of fully connected neural networks, applying a nonlinear activation function after each layer to perform nonlinear feature mapping and obtain a task feature representation matrix; S334, performing a global average pooling operation on the task feature representation matrix, averaging by columns over the sample dimension to generate the task channel feature representation vector.
  6. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 4, wherein the step S34 specifically comprises: S341, inputting the resource node state information into the resource state channel coding sub-network, wherein each resource node structured vector consists of the GPU occupancy rate, the remaining video memory capacity, the current number of queued tasks, the bandwidth occupancy rate and the node temperature; S342, performing linear normalization processing on each dimensional attribute of each resource node structured vector to generate a normalized resource input matrix; S343, inputting the normalized resource input matrix into a feature transformation module comprising a two-layer multi-layer perceptron structure, wherein the first layer outputs a high-dimensional intermediate feature matrix and the second layer performs dimension compression mapping to generate a resource node feature matrix; S344, performing a global average pooling operation on the resource node feature matrix, averaging by columns over the node dimension to generate the resource channel feature representation vector.
  7. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 4, wherein the step S35 specifically comprises: S351, inputting the historical scheduling feedback information into the historical track state channel coding sub-network, wherein each historical feedback record vector consists of the previous round's task completion time delay, the scheduling success rate of the corresponding node, the task execution failure rate, the average task execution duration and the resource utilization fluctuation coefficient; S352, performing normalization processing on each dimensional attribute of each historical feedback record to generate a normalized historical feedback input matrix; S353, inputting the normalized historical feedback input matrix into a time sequence modeling structure comprising a gating mechanism, performing time dimension modeling with a one-layer GRU network and outputting a hidden state sequence; S354, performing a time-weighted average operation on the hidden state sequence to generate the feature representation vector of the historical track channel.
  8. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein the step S4 specifically comprises: S41, inputting the fusion state representation vector into the Q value estimation main network, wherein the Q value estimation main network is a multi-layer perceptron structure comprising an input layer, a hidden layer and an output layer, the dimension of the output layer being the action space dimension, to generate the action value estimation result corresponding to the candidate scheduling action set; S42, selecting the action with the maximum Q value from the action value estimation result as the scheduling action corresponding to the current optimal scheduling strategy; S43, executing the scheduling operation on the target speech synthesis tasks to be allocated in the task set to be scheduled according to the optimal scheduling strategy, allocating the target tasks to the optimal resource nodes; S44, calling the speech model type and speech style identification information corresponding to the target speech synthesis task, loading and executing the designated speech synthesis model, generating the synthesized speech and pushing it to the user or calling-party interface.
  9. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein the step S5 specifically comprises: S51, collecting feedback information during the execution of a speech synthesis task, wherein the feedback information comprises the task completion time delay, the generated speech quality score and the real-time resource occupation state of the resource node; S52, taking the task completion time delay, the generated speech quality score and the real-time resource occupation state as a feedback vector; S53, extracting the fusion state representation vector corresponding to the current moment and the executed scheduling action, and combining them with the feedback vector to form a state transition sample triplet; S54, repeatedly executing steps S51 to S53, continuously collecting state transition samples for a plurality of speech synthesis tasks, and aggregating them to form the task execution state transition sample set.
  10. The method for processing large-scale speech synthesis tasks based on concurrency-optimized scheduling according to claim 1, wherein the step S6 specifically comprises: S61, storing the task execution state transition sample set into an experience replay pool contained in the strategy updating module of the structure-improved deep Q network model; S62, periodically sampling a fixed-size sample batch from the experience replay pool, wherein each sample comprises a fusion state representation vector, a historical scheduling action and a feedback vector; S63, calculating the action value estimation result of the current strategy network using the Q value estimation main network of the structure-improved deep Q network, and defining a loss function based on the temporal-difference error in combination with the delayed target value of the target network:

     L = (1/B) · Σ_{i=1}^{B} [ r_i + γ · max_{a'} Q(s'_i, a'; θ⁻) − Q(s_i, a_i; θ) ]²

     wherein L is the loss function optimized for the scheduling policy, B is the size of the sample batch, i is the index variable of samples in the experience replay pool during training, r_i is the instant reward value, γ is the discount factor, a' ranges over all schedulable actions in the target network, Q(s'_i, a'; θ⁻) is the Q value of the target network, Q(s_i, a_i; θ) is the current Q value prediction result, s'_i is the next fusion state representation vector, θ⁻ is the parameter set of the target Q network, s_i is the fusion state representation vector, a_i is the historical scheduling action, and θ is the parameter set of the Q value estimation main network; S64, minimizing the loss function L using a gradient descent algorithm, updating the parameter set of the strategy updating module in the structure-improved deep Q network model, and realizing continuous optimization and adaptive enhancement of the speech task scheduling strategy.
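The multi-channel state encoding and feature splicing described in claims 4 to 6 can be sketched in plain NumPy. All layer sizes, weights, record counts and the choice of a shared two-layer MLP shape per channel below are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def make_params(d_in, d_hid, d_out):
    # Random weights for a two-layer MLP encoder (illustrative only).
    return (rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid),
            rng.normal(size=(d_hid, d_out)) * 0.1, np.zeros(d_out))

def encode_channel(x, w1, b1, w2, b2):
    # High-dimensional mapping followed by dimension compression (cf. S343).
    return relu(relu(x @ w1 + b1) @ w2 + b2)

# Assumed shapes: 5 normalized attributes per record, 16 hidden, 8 output.
task_params = make_params(5, 16, 8)
res_params = make_params(5, 16, 8)
hist_params = make_params(5, 16, 8)

task_mat = rng.random((4, 5))   # 4 queued tasks (already min-max normalized)
res_mat = rng.random((3, 5))    # 3 resource nodes
hist_mat = rng.random((6, 5))   # 6 historical feedback records

# Encode each channel, then global-average-pool over the row (sample)
# dimension as in S334/S344; finally splice along the feature dimension (S36).
channel_vecs = [encode_channel(m, *p).mean(axis=0)
                for m, p in ((task_mat, task_params),
                             (res_mat, res_params),
                             (hist_mat, hist_params))]
fused_state = np.concatenate(channel_vecs)
```

The pooling step makes each channel vector independent of how many tasks, nodes or records are currently present, which is what lets the fused state keep a fixed dimension under varying load.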
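The history-channel modeling of claim 7 (one GRU layer over a sliding window, then a time-weighted average) can be sketched as follows. The window length, hidden size, and the linear recency weighting are assumptions; the patent does not specify the weighting scheme, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_hid = 5, 8                       # 5 feedback attributes, 8 hidden units
Wz = rng.normal(size=(d_in + d_hid, d_hid)) * 0.1
Wr = rng.normal(size=(d_in + d_hid, d_hid)) * 0.1
Wh = rng.normal(size=(d_in + d_hid, d_hid)) * 0.1

def gru_step(h, x):
    # One GRU cell: update gate z, reset gate r, candidate state h_tilde.
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)
    r = sigmoid(xh @ Wr)
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
    return (1.0 - z) * h + z * h_tilde

T = 6                                    # sliding window of 6 history records
seq = rng.random((T, d_in))              # normalized feedback input matrix (S352)
h = np.zeros(d_hid)
hidden = []
for x in seq:                            # time-dimension modeling (S353)
    h = gru_step(h, x)
    hidden.append(h)
H = np.stack(hidden)                     # hidden state sequence, shape (T, d_hid)

# Time-weighted average (S354): later records weighted more heavily
# (linear weights are an assumption).
w = np.arange(1, T + 1, dtype=float)
w /= w.sum()
hist_channel_vec = (w[:, None] * H).sum(axis=0)
```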
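The greedy action selection of claim 8 (S41 and S42) amounts to one forward pass through an MLP whose output width equals the action space, followed by an argmax. Dimensions and weights below are illustrative; in practice each action index would map to a concrete resource node.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(0.0, x)

d_state, d_hid, n_actions = 24, 32, 3    # 3 candidate resource nodes (assumed)

W1 = rng.normal(size=(d_state, d_hid)) * 0.1
b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_hid, n_actions)) * 0.1
b2 = np.zeros(n_actions)

def q_values(state):
    # MLP main network: input layer -> hidden layer -> one Q value per action.
    return relu(state @ W1 + b1) @ W2 + b2

fused_state = rng.random(d_state)
q = q_values(fused_state)
best_action = int(np.argmax(q))          # greedy choice of resource node (S42)
```

During training, a DQN-style scheduler would typically replace the pure argmax with epsilon-greedy exploration; the claim only describes the greedy evaluation path.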
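The transition collection of claim 9 and the replay-pool sampling of S61/S62 can be sketched with a capped deque. The pool capacity and the scalarization of the feedback vector into a single reward are assumptions; the patent leaves the exact reward shaping unspecified.

```python
import random
from collections import deque

# Capped experience replay pool (capacity is an assumed value).
replay_pool = deque(maxlen=10_000)

def reward_from_feedback(latency_s, quality_score):
    # Toy scalarization of the feedback vector: reward quality, penalize delay.
    return quality_score - 0.1 * latency_s

def record_transition(state, action, reward, next_state):
    # One (s, a, r, s') sample per executed scheduling action (S53).
    replay_pool.append((state, action, reward, next_state))

# Simulate collecting transitions for several synthesis tasks (S54).
for i in range(5):
    s, s_next = [float(i)], [float(i + 1)]
    r = reward_from_feedback(latency_s=2.0, quality_score=4.5)
    record_transition(s, i % 3, r, s_next)

# Periodic fixed-size batch sampling from the pool (S62).
batch = random.sample(list(replay_pool), k=3)
```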
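The temporal-difference loss and gradient step of claim 10 (S63 and S64) can be verified numerically. A linear Q function is used here only so the gradient has a closed form; all dimensions, the learning rate and the random batch are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

d_state, n_actions, gamma = 4, 3, 0.9     # assumed toy dimensions

theta = rng.normal(size=(d_state, n_actions)) * 0.1   # main-network parameters
theta_target = theta.copy()                           # delayed target copy

def q(states, params):
    # Linear Q function: one value per action column.
    return states @ params

B = 8                                     # sample batch size
S = rng.random((B, d_state))              # fused states s_i
A = rng.integers(0, n_actions, size=B)    # historical scheduling actions a_i
R = rng.random(B)                         # instant rewards r_i
S2 = rng.random((B, d_state))             # next fused states s'_i

def td_loss_and_grad(theta):
    # L = (1/B) * sum_i [ r_i + gamma * max_a' Q(s'_i, a'; theta_target)
    #                     - Q(s_i, a_i; theta) ]^2
    target = R + gamma * q(S2, theta_target).max(axis=1)
    pred = q(S, theta)[np.arange(B), A]
    err = pred - target
    grad = np.zeros_like(theta)
    for i in range(B):
        grad[:, A[i]] += 2.0 * err[i] * S[i] / B
    return float(np.mean(err ** 2)), grad

loss_before, g = td_loss_and_grad(theta)
theta -= 0.05 * g                          # one gradient-descent step (S64)
loss_after, _ = td_loss_and_grad(theta)
```

Because the target network is held fixed during the step, the loss is quadratic in the main-network parameters and a small gradient step is guaranteed to reduce it, which is the behavior S64 relies on.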

Description

Large-scale speech synthesis task processing method based on concurrency optimization scheduling

Technical Field

The invention relates to the technical field of speech processing and intelligent scheduling, and in particular to a large-scale speech synthesis task processing method based on concurrency-optimized scheduling.

Background

In existing speech synthesis platforms, as the demands of multi-task speech synthesis applications continue to grow, how to effectively schedule diverse tasks so as to guarantee the response performance and speech quality of the system has become an important direction for system optimization. Currently, mainstream speech synthesis systems generally employ static task allocation policies or rule-based scheduling algorithms, such as a first-in first-out policy based on task arrival time, a fixed resource allocation mechanism based on priority weights, or a round-robin load balancing scheme based on a resource occupancy threshold. These methods are feasible to a degree in small-scale task environments, but exhibit low scheduling efficiency in large-scale scenarios with high concurrency, heterogeneous task inputs and frequent resource state fluctuations, and easily lead to task congestion, reduced resource utilization and fluctuating speech synthesis quality. Speech synthesis tasks are highly heterogeneous at the parameter level: the speech model types, speech style settings and response time limit requirements of different tasks directly affect the model loading path, computational load intensity and synthesis delay. In addition, each computing node in the platform is in a dynamic resource state at runtime, including GPU occupancy, video memory capacity, current number of queued tasks and bandwidth occupancy, which is difficult to adapt to effectively through static configuration.
Meanwhile, historical feedback information from the task execution process, such as scheduling success rate, task failure rate and resource utilization fluctuation, also contains latent behavior evolution trends and has reference value for subsequent task scheduling. Existing scheduling schemes are insufficiently expressive in state modeling and cannot fully account for the multidimensional interactions among task attributes, resource features and historical behaviors. Most schedulers use only the current task length or node load as the scheduling basis, do not model long-term behavior information, and can hardly predict accurately how different scheduling strategies affect system performance. Moreover, traditional methods lack an adaptive learning mechanism: when the platform's running environment or task structure changes, the scheduling strategy is difficult to adjust dynamically, which reduces the scheduling efficiency of the system. Deep reinforcement learning has been explored to some extent for resource scheduling, learning scheduling strategies from environmental feedback through a state-action-reward model. However, most existing methods process all input states with a unified structure and do not model tasks, resources and historical feedback differentially, which limits state expression capability. Some methods also fail to exploit historical feedback sequences effectively and fall short in modeling long-term scheduling effects. Therefore, how to provide a large-scale speech synthesis task processing method based on concurrency-optimized scheduling is a problem to be solved by those skilled in the art.
Disclosure of Invention

The invention aims to provide a large-scale speech synthesis task processing method based on concurrency-optimized scheduling, which adopts a structure-improved deep Q network model, integrates the task queue state, resource node state and historical scheduling feedback information, and, through multi-channel state coding, feature fusion and reinforcement learning strategy iteration, details an intelligent scheduling mechanism oriented to high-concurrency speech synthesis environments. According to an embodiment of the invention, the large-scale speech synthesis task processing method based on concurrency-optimized scheduling comprises the following steps: S1, receiving a plurality of speech synthesis task requests from an external system, and constructing a task set to be scheduled; S2, collecting the system running state of the current speech synthesis platform, and constructing a system state information set; S3, inputting the system state information set into a structure-improved deep Q network model, wherein the structure-improved deep Q network model introduces a multi-channel state input module, respectively extracting features from the state information of the corresponding channels, and