CN-121999755-A - End-side speech synthesis and play control method, device, equipment and storage medium

Abstract

Embodiments of the invention disclose an end-side speech synthesis and playback control method, apparatus, device, and storage medium. In this technical scheme, a structure with fixed capacity is constructed in advance, and each structure represents one speech segment file instance. The speech segment state defines a state mechanism for each speech segment, which gives structured management of the full life cycle of the speech segments. The streaming text output by a large language model is segmented according to a segmentation rule, and each structure corresponds to one text segment and to the speech segment synthesized from it. Because speech is synthesized and played segment by segment, playback control avoids the delay of full-text synthesis, and an abnormal playback condition in one speech segment does not affect the playback of the other speech segments. After the speech segment files have been played, each structure is reset, which prevents speech field data from occupying the processes of the end-side device for a long time and thereby reduces the consumption of the device's memory resources.
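The per-segment structure and its state mechanism described above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the state names, the `advance` helper, and the strictly linear transition order are assumptions inferred from the states named in the claims (idle, to be synthesized, synthesizing, to be played, playing, playback complete).

```python
from dataclasses import dataclass
from enum import Enum, auto


class SegmentState(Enum):
    """Lifecycle states of one speech segment (names are illustrative)."""
    IDLE = auto()
    PENDING_SYNTHESIS = auto()
    SYNTHESIZING = auto()
    PENDING_PLAYBACK = auto()
    PLAYING = auto()
    PLAYED = auto()


# Assumption: states advance strictly in definition order, and PLAYED wraps
# back to IDLE when the structure is reset.
_ORDER = list(SegmentState)
NEXT_STATE = {s: _ORDER[(i + 1) % len(_ORDER)] for i, s in enumerate(_ORDER)}


@dataclass
class SpeechSegment:
    """One structure: index, state, output file path, and source text."""
    index: int
    state: SegmentState = SegmentState.IDLE
    wav_path: str = ""
    text: str = ""

    def advance(self) -> SegmentState:
        """Move to the next lifecycle state; clear the text on reset to IDLE."""
        self.state = NEXT_STATE[self.state]
        if self.state is SegmentState.IDLE:
            self.text = ""
        return self.state
```

Keeping the state inside each structure means that a failure while one segment is playing leaves the recorded state of every other segment untouched, which is the isolation property the abstract describes.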

Inventors

Request for anonymity

Assignees

广东大同世界磁电科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260206

Claims (10)

1. An end-side speech synthesis and playback control method, characterized by comprising the following steps:
pre-constructing a structure with fixed capacity, wherein parameter information is encapsulated in the structure, and the parameter information at least comprises a speech segment index, a speech segment state, a synthesized speech segment file path, and an original text segment corresponding to the speech segment;
defining a dynamically sized structure array based on the pre-constructed structure, wherein the structure array comprises at least one structure, and each structure represents one speech segment file instance;
initializing and configuring the structure array;
obtaining streaming text output by a preset large language model, segmenting the streaming text according to a preset segmentation rule to obtain at least one text segment, traversing the structure array to search for target structures in the structure array, and storing the text segments obtained by segmentation into different target structures in traversal order;
performing speech synthesis on the text segment in each target structure with a preset speech synthesis engine to obtain synthesized speech segment files, wherein each synthesized speech segment file is stored at the speech segment file path in the corresponding target structure;
if the speech segment index in a target structure indicates that the synthesized speech segment file is the first speech segment file, playing the synthesized speech segment file, and if the speech segment state in the target structure indicates that the synthesized speech segment file is in a playback-complete state, triggering playback of the next synthesized speech segment file; and
if the synthesized speech segment files of all target structures in the structure array have completed playback, resetting all target structures in the structure array.
2. The end-side speech synthesis and playback control method according to claim 1, wherein initializing and configuring the structure array comprises:
determining the number of structures in the structure array, and allocating memory for the structure array based on the number of structures; and
traversing each structure in the structure array, and configuring the parameter information encapsulated in each structure in traversal order, wherein configuring the parameter information encapsulated in each structure specifically comprises configuring the speech segment index, the speech segment state, the synthesized speech segment file path, and the original text segment corresponding to the speech segment of each structure, wherein the speech segment index of each structure is configured to indicate which speech segment, by ordinal position, the current structure corresponds to; the speech segment state of each structure is configured to indicate that the current structure is in an idle state; the synthesized speech segment file path of each structure is configured to indicate the preset storage path of the synthesized speech segment file of the current structure; and the original text segment corresponding to the speech segment of each structure is configured to be empty text while the speech segment state of the current structure is the idle state.
3. The end-side speech synthesis and playback control method according to claim 1, wherein the preset large language model is an LLM, and the preset segmentation rule specifically comprises:
traversing the streaming text output by the LLM; and
if a preset Chinese punctuation mark exists in the streaming text, segmenting the streaming text with the preset Chinese punctuation mark as the boundary point of a text segment to obtain at least one text segment, wherein the preset Chinese punctuation mark comprises a Chinese full stop, a Chinese exclamation mark, or a Chinese question mark.
4. The end-side speech synthesis and playback control method according to claim 1, wherein traversing the structure array to search for target structures in the structure array, and storing the text segments obtained by segmentation into different target structures in traversal order, specifically comprises:
traversing the structure array, and searching for a structure whose speech segment state indicates an idle state;
if a structure whose speech segment state indicates the idle state exists in the structure array, determining that structure as a target structure, and storing the text segments into different target structures in traversal order; and
updating the speech segment state in each target structure to indicate that the speech segment in that target structure is in a to-be-synthesized state.
5. The end-side speech synthesis and playback control method according to claim 1, wherein the preset speech synthesis engine is a Sherpa-ONNX TTS engine, and performing speech synthesis on the text segment in the target structure with the preset speech synthesis engine to obtain a synthesized speech segment file specifically comprises:
updating the speech segment state in the target structure to indicate that the speech segment in the target structure is in a synthesizing state;
calling the Sherpa-ONNX TTS engine to perform speech synthesis on the text segment in the target structure to obtain the synthesized speech segment file; and
updating the speech segment state in the target structure to indicate that the speech segment in the target structure is in a to-be-played state.
6. The end-side speech synthesis and playback control method according to claim 1, wherein playing the synthesized speech segment file if the speech segment index in the target structure indicates that it is the first speech segment file, and triggering playback of the next synthesized speech segment file if the speech segment state in the target structure indicates a playback-complete state, specifically comprises:
if the speech segment index in the target structure indicates that the synthesized speech segment file is the first speech segment file, updating the speech segment state in the target structure to indicate that the speech segment in the target structure is in a playing state;
playing the synthesized speech segment file with the aplay command;
updating the speech segment state in the target structure to indicate that the synthesized speech segment file in the target structure is in a playback-complete state;
traversing the structure array, and searching for the next target structure whose speech segment state indicates that its synthesized speech segment file is in a to-be-played state; and
triggering playback of the next synthesized speech segment file in traversal order.
7. The end-side speech synthesis and playback control method according to claim 1, wherein resetting all target structures in the structure array if the synthesized speech segment files of all target structures in the structure array have completed playback comprises:
if the speech segment state of each target structure in the structure array indicates that its synthesized speech segment file is in a playback-complete state, traversing each target structure in the structure array, updating the speech segment state in each target structure to indicate an idle state, and clearing the original text segment corresponding to the speech segment in each target structure.
8. An end-side speech synthesis and playback control apparatus, the apparatus comprising:
a construction unit, configured to pre-construct a structure with fixed capacity, wherein parameter information is encapsulated in the structure, and the parameter information at least comprises a speech segment index, a speech segment state, a synthesized speech segment file path, and an original text segment corresponding to the speech segment;
a definition unit, configured to define a dynamically sized structure array based on the pre-constructed structure, wherein the structure array comprises at least one structure, and each structure represents one speech segment file instance;
an initialization configuration unit, configured to initialize and configure the structure array;
an execution unit, configured to obtain streaming text output by a preset large language model, segment the streaming text according to a preset segmentation rule to obtain at least one text segment, traverse the structure array to search for target structures in the structure array, and store the text segments obtained by segmentation into different target structures in traversal order;
a speech synthesis unit, configured to perform speech synthesis on the text segment in each target structure with a preset speech synthesis engine to obtain synthesized speech segment files, wherein each synthesized speech segment file is stored at the speech segment file path in the corresponding target structure;
a playback unit, configured to play the synthesized speech segment file if the speech segment index in the target structure indicates that the synthesized speech segment file is the first speech segment file, and to trigger playback of the next synthesized speech segment file if the speech segment state in the target structure indicates that the synthesized speech segment file is in a playback-complete state; and
a reset processing unit, configured to reset all target structures in the structure array if the synthesized speech segment files of all target structures in the structure array have completed playback.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory having stored thereon a computer program, the processor implementing the end-side speech synthesis and playback control method according to any one of claims 1 to 7 when executing the computer program.
10. A storage medium storing a computer program which, when executed by a processor, implements the end-side speech synthesis and playback control method of any one of claims 1 to 7.
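The initialization, segmentation, and slot-allocation steps of claims 2 to 4 can be illustrated with a short Python sketch. It is a simplified reading of the claims, not the applicant's code: the function names, the string state labels, the `/tmp` storage directory, and the decision to drop trailing text that has no closing punctuation mark are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

IDLE, PENDING_SYNTHESIS = "idle", "pending_synthesis"  # illustrative labels
# Chinese full stop, exclamation mark, and question mark (full-width forms).
BOUNDARIES = "\u3002\uff01\uff1f"


@dataclass
class SpeechSegment:
    index: int
    state: str
    wav_path: str
    text: str


def init_segments(count: int, directory: str = "/tmp") -> List[SpeechSegment]:
    """Claim 2: allocate `count` structures and configure each one as idle."""
    return [
        SpeechSegment(i, IDLE, f"{directory}/segment_{i}.wav", "")
        for i in range(count)
    ]


def split_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Claim 3: yield a text segment whenever a boundary mark arrives."""
    buffer = ""
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if ch in BOUNDARIES:
                yield buffer
                buffer = ""
    # Assumption: trailing text without a boundary mark is dropped here,
    # because the claims do not say how an unterminated tail is flushed.


def store_segment(
    segments: List[SpeechSegment], text: str
) -> Optional[SpeechSegment]:
    """Claim 4: put the text in the first idle structure, in array order."""
    for seg in segments:
        if seg.state == IDLE:
            seg.text = text
            seg.state = PENDING_SYNTHESIS
            return seg
    return None  # no idle structure is available
```

Because `split_stream` buffers across chunk boundaries, a sentence that the language model emits over several streaming chunks is still stored as a single text segment.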

Description

End-side speech synthesis and play control method, device, equipment and storage medium

Technical Field

The present invention relates to the field of speech synthesis technologies, and in particular to an end-side speech synthesis and playback control method, apparatus, device, and storage medium.

Background

Current speech synthesis schemes mostly rely on TTS (Text To Speech) technology. However, their control of speech synthesis and playback lacks structured management of the full life cycle of speech segments, from "not synthesized" through "synthesized" to "playback complete". After receiving text that requires speech synthesis, a current scheme must synthesize the speech for the full text before playing it. Although full-text synthesis and playback preserves the coherence of the text to some extent, it introduces delay in speech playback. In practice, once playback of the full speech is abnormally interrupted, the part of the speech that has not yet been played is also affected by the interruption. In addition, in current speech synthesis and playback control, the speech field data is not effectively handled after the synthesized speech has been played, so it occupies the processes of the end-side device for a long time and consumes a large amount of the device's memory resources.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an end-side speech synthesis and playback control method, apparatus, device, and storage medium.
To achieve the above purpose, the present invention adopts the following technical scheme: an end-side speech synthesis and playback control method, comprising: pre-constructing a structure with fixed capacity, wherein parameter information is encapsulated in the structure, and the parameter information at least comprises a speech segment index, a speech segment state, a synthesized speech segment file path, and an original text segment corresponding to the speech segment; defining a dynamically sized structure array based on the pre-constructed structure; initializing and configuring the structure array; obtaining streaming text output by a preset large language model, segmenting the streaming text according to a preset segmentation rule to obtain at least one text segment, traversing the structure array to search for target structures in the structure array, and storing the text segments obtained by segmentation into different target structures in traversal order; performing speech synthesis on the text segment in each target structure with a preset speech synthesis engine to obtain synthesized speech segment files, each of which is stored at the speech segment file path in the corresponding target structure; if the speech segment index in a target structure indicates that the synthesized speech segment file is the first speech segment file, playing the synthesized speech segment file, and if the speech segment state in the target structure indicates a playback-complete state, triggering playback of the next synthesized speech segment file; and if the synthesized speech segment files of all target structures in the structure array have completed playback, resetting all target structures in the structure array.
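The synthesis, playback, and reset steps of this scheme can be sketched in Python as below. This is a single-threaded simplification under stated assumptions, not the patented implementation: the `synthesize` callable is a hypothetical wrapper around whatever TTS engine is used (its signature is invented here), the state names are illustrative, and a real end-side device would presumably synthesize later segments while earlier ones are playing. The only external command used is ALSA's `aplay`, invoked with a file path.

```python
import subprocess
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List


class State(Enum):
    IDLE = auto()
    PENDING_SYNTHESIS = auto()
    SYNTHESIZING = auto()
    PENDING_PLAYBACK = auto()
    PLAYING = auto()
    PLAYED = auto()


@dataclass
class SpeechSegment:
    index: int
    state: State
    wav_path: str
    text: str


def play_with_aplay(wav_path: str) -> None:
    """Play a WAV file with the ALSA `aplay` command (must be installed)."""
    subprocess.run(["aplay", wav_path], check=True)


def synthesize_pending(
    segments: List[SpeechSegment], synthesize: Callable[[str, str], None]
) -> None:
    """Synthesize every pending segment into its own file path."""
    for seg in segments:
        if seg.state is State.PENDING_SYNTHESIS:
            seg.state = State.SYNTHESIZING
            synthesize(seg.text, seg.wav_path)  # hypothetical engine wrapper
            seg.state = State.PENDING_PLAYBACK


def play_in_order(
    segments: List[SpeechSegment],
    play: Callable[[str], None] = play_with_aplay,
) -> None:
    """Play the first segment, then each next to-be-played one in array order."""
    for seg in segments:
        if seg.state is State.PENDING_PLAYBACK:
            seg.state = State.PLAYING
            play(seg.wav_path)
            seg.state = State.PLAYED


def reset_if_all_played(segments: List[SpeechSegment]) -> bool:
    """Reset used structures to idle once all of them have finished playing."""
    used = [s for s in segments if s.state is not State.IDLE]
    if not used or any(s.state is not State.PLAYED for s in used):
        return False
    for seg in used:
        seg.state = State.IDLE
        seg.text = ""  # clear the original text segment
    return True
```

Passing the engine and the player in as callables keeps the state handling independent of any particular TTS library and makes it possible to exercise the lifecycle without audio hardware.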
The method comprises the steps of initializing and configuring a structure array, specifically comprising the steps of determining the number of structures in the structure array, performing memory allocation on the structure array based on the number of structures, traversing each structure in the structure array, configuring parameter information packaged in each structure according to traversing sequence, configuring the parameter information packaged in each structure, specifically comprising configuring a voice segment index of each structure, a voice segment state of each structure, a synthesized voice segment file path of each structure and an original text segment corresponding to a voice segment of each structure, wherein the voice segment index of each structure is used for indicating a number of voice segments corresponding to a current structure, the voice segment state of each structure is configured for indicating that the voice segment state of the current structure is in an idle state, configuring a storage path preset by a synthesized voice segment file of each structure, configuring the voice segment file path of each structure for indicating that the synthesized voice segment file of the current structure is in the idle state, and configurin