CN-121983056-A - Voice editing processing method and device
Abstract
Embodiments of this specification provide a voice editing processing method and a voice editing processing device. During voice editing processing, the method first acquires interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program, and detects, according to the interactive action data and interactive environment data, whether a voice editing intention exists. If a voice editing intention is detected, the method determines the intention type of that intention, displays a voice editing identifier corresponding to the intention type on the interactive component, acquires a voice editing segment collected after the identifier is triggered, and performs editing adaptation processing on the voice text based on the segment. Voice text editing is thus realized by sensing the user's voice editing intention.
Inventors
- ZHOU CHUNXIAN
Assignees
- 支付宝(杭州)数字服务技术有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-16
Claims (14)
- 1. A voice editing processing method, comprising: acquiring interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; detecting, according to the interactive action data and interactive environment data, whether a voice editing intention exists; if so, determining an intention type of the voice editing intention, and displaying, on the interactive component, a voice editing identifier corresponding to the intention type; and acquiring a voice editing segment collected after the voice editing identifier is triggered, and performing editing adaptation processing on the voice text based on the voice editing segment.
- 2. The voice editing processing method according to claim 1, wherein performing the editing adaptation processing on the voice text based on the voice editing segment comprises: performing text editing on the voice text based on a segment text of the voice editing segment to obtain an edited text; and performing text continuity detection on the edited text, and performing text editing display according to a detection result.
- 3. The voice editing processing method according to claim 2, wherein performing the text continuity detection on the edited text and performing the text editing display according to the detection result comprises: extracting semantic key features of the edited text, and performing semantic association calculation based on the semantic key features; and if a semantic association calculation result meets a preset condition, updating the display of the voice text on the interactive component according to the edited text.
- 4. The voice editing processing method according to claim 3, wherein performing the text continuity detection on the edited text and performing the text editing display according to the detection result further comprises: if the semantic association calculation result does not meet the preset condition, determining an abnormal field in the edited text, displaying the edited text on the interactive component, and marking the abnormal field; and after an abnormality prompt mark of the abnormal field is triggered, performing text correction processing on the edited text.
- 5. The voice editing processing method according to claim 2, wherein performing the text editing on the voice text based on the segment text of the voice editing segment to obtain the edited text comprises: writing the segment text into the voice text in a text editing mode corresponding to the intention type to obtain the edited text, wherein the text editing mode comprises text insertion, text modification, and/or text deletion.
- 6. The voice editing processing method according to claim 1, further comprising, after determining the intention type of the voice editing intention and before acquiring the voice editing segment collected after the voice editing identifier is triggered: performing structural feature extraction on the voice text to obtain text structure features, and performing feature extraction on the intention type to obtain editing intention features; and performing editing detection on the text structure features based on the editing intention features, and issuing an editing prompt for editing fields obtained by the detection.
- 7. The voice editing processing method according to claim 6, wherein displaying, on the interactive component, the voice editing identifier corresponding to the intention type comprises: determining an editing position according to the interactive action data and displaying the voice editing identifier at the editing position; wherein the voice editing identifier is used for inputting a voice editing segment corresponding to an editing instruction for editing the voice text and/or a voice editing segment of voice content for editing the voice text.
- 8. The voice editing processing method according to claim 1, wherein detecting, according to the interactive action data and the interactive environment data, whether the voice editing intention exists comprises: inputting the interactive action data and the interactive environment data into a lightweight detection model deployed in the application program to perform voice editing intention detection, and outputting an intention detection result; wherein the interactive environment data comprises context data of the interactive component and page data of an interactive page to which the interactive component belongs.
- 9. The voice editing processing method according to claim 8, wherein the voice editing intention detection is implemented by: performing feature mapping on interactive action features of the interactive action data and environment features of the interactive environment data to obtain a mapped feature set, and performing quantitative feature conversion on the mapped feature set to obtain converted features; and screening, from the converted features, editing dimension features of each editing dimension, performing matching calculation against detection rules of each editing dimension, and determining the intention detection result according to matching calculation results.
- 10. The voice editing processing method according to claim 1, wherein determining the intention type of the voice editing intention comprises: performing feature filtering and feature association on text features of the voice text and interactive action features of the interactive action data to obtain associated features; and performing matching calculation on the associated features against intention detection rules of candidate intention types, and determining the intention type according to a matching calculation result.
- 11. The voice editing processing method according to claim 1, wherein the application program comprises a client for access interaction with an agent and/or a client for dialogue interaction with a large language model.
- 12. A voice editing processing apparatus, comprising: a data acquisition module configured to acquire interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; an intention detection module configured to detect, according to the interactive action data and interactive environment data, whether a voice editing intention exists, and if so, to invoke an identifier display module; the identifier display module configured to determine an intention type of the voice editing intention and display, on the interactive component, a voice editing identifier corresponding to the intention type; and an editing adaptation processing module configured to acquire a voice editing segment collected after the voice editing identifier is triggered, and to perform editing adaptation processing on the voice text based on the voice editing segment.
- 13. A voice editing processing apparatus, comprising: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to: acquire interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; detect, according to the interactive action data and interactive environment data, whether a voice editing intention exists; if so, determine an intention type of the voice editing intention, and display, on the interactive component, a voice editing identifier corresponding to the intention type; and acquire a voice editing segment collected after the voice editing identifier is triggered, and perform editing adaptation processing on the voice text based on the voice editing segment.
- 14. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the steps of the method of claim 1.
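The detection pipeline of claim 9 (feature mapping, quantitative conversion, per-dimension rule matching) is described only abstractly. The following minimal Python sketch shows one way such a pipeline could be realized; every function name, feature name, and threshold here is an illustrative assumption, not part of the patent.

```python
# Hypothetical sketch of the claim-9 flow: merge interactive action features
# and environment features into one mapped set, quantize them to [0, 1],
# then match each editing dimension's features against that dimension's rule.

def detect_edit_intent(action_feats, env_feats, rules):
    """Return (has_intent, matched_dimensions)."""
    # Feature mapping: combine the two feature dicts into one mapped set.
    mapped = {**action_feats, **env_feats}
    # Quantitative feature conversion: normalize every value by the maximum.
    hi = max(mapped.values(), default=0.0) or 1.0
    converted = {k: v / hi for k, v in mapped.items()}
    # Screen the features of each editing dimension and match its rule.
    matched = []
    for dim, rule in rules.items():
        feats = [v for k, v in converted.items() if k in rule["features"]]
        score = sum(feats) / len(feats) if feats else 0.0
        if score >= rule["threshold"]:
            matched.append(dim)
    return bool(matched), matched


action = {"long_press": 8.0, "cursor_move": 2.0}
env = {"mic_open": 1.0, "page_is_chat": 1.0}
rules = {"insert": {"features": {"long_press", "mic_open"}, "threshold": 0.5}}
print(detect_edit_intent(action, env, rules))  # -> (True, ['insert'])
```

A production system would replace the hand-written rules with the "lightweight detection model" of claim 8; this sketch only illustrates the rule-matching branch of claim 9.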
Description
Voice editing processing method and device

Technical Field

The present document relates to the field of data processing technologies, and in particular to a method and an apparatus for voice editing processing.

Background

With the continuous development of voice recognition and natural language processing technologies, voice input has become a convenient and efficient interaction mode that is widely applied in various application programs. Through voice input, a user can convert voice information into text content, which significantly improves the efficiency of information input and meets the needs of scenarios such as communication and note-taking. In actual use, however, the user often needs to edit and adjust the generated voice text to ensure the accuracy of the input content. Against this background, with the continuous development of artificial intelligence technology and users' growing demand for convenient voice interaction, how to better realize voice interaction processing has become a focus of attention.

Disclosure of Invention

One or more embodiments of the present specification provide a voice editing processing method comprising: acquiring interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; detecting, according to the interactive action data and interactive environment data, whether a voice editing intention exists; if so, determining an intention type of the voice editing intention, and displaying, on the interactive component, a voice editing identifier corresponding to the intention type; and acquiring a voice editing segment collected after the voice editing identifier is triggered, and performing editing adaptation processing on the voice text based on the voice editing segment.
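The final step of the method writes the recognized segment text into the voice text in an intention-dependent editing mode; claim 5 names text insertion, text modification, and text deletion. A minimal sketch of what those three modes could look like follows, assuming an anchor substring marks the edit position; the function name, parameters, and anchoring behavior are illustrative assumptions.

```python
# Illustrative sketch of the claim-5 text editing modes: write a segment
# text into the voice text by insertion after an anchor, modification
# (replacement) of the anchor, or deletion of the anchor.

def apply_edit(text, mode, segment="", anchor=""):
    if mode == "insert":
        # Insert the segment immediately after the anchor substring.
        i = text.find(anchor) + len(anchor)
        return text[:i] + segment + text[i:]
    if mode == "modify":
        # Replace the first occurrence of the anchor with the segment.
        return text.replace(anchor, segment, 1)
    if mode == "delete":
        # Remove the first occurrence of the anchor.
        return text.replace(anchor, "", 1)
    raise ValueError(f"unknown edit mode: {mode}")


print(apply_edit("send the report", "modify", "invoice", "report"))
# -> send the invoice
```

In the patented method the edited text produced here would then go through the continuity detection of claims 2 to 4 before being displayed.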
One or more embodiments of the present specification provide a voice editing processing apparatus comprising: a data acquisition module configured to acquire interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; an intention detection module configured to detect, according to the interactive action data and interactive environment data, whether a voice editing intention exists; an identifier display module configured to determine an intention type of the voice editing intention and display, on the interactive component, a voice editing identifier corresponding to the intention type; and an editing adaptation processing module configured to acquire a voice editing segment collected after the voice editing identifier is triggered, and to perform editing adaptation processing on the voice text based on the voice editing segment.

One or more embodiments of the present specification provide a voice editing processing device comprising a processor and a memory configured to store computer-executable instructions that, when executed, cause the processor to: acquire interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; detect, according to the interactive action data and interactive environment data, whether a voice editing intention exists; if so, determine an intention type of the voice editing intention, and display, on the interactive component, a voice editing identifier corresponding to the intention type; and acquire a voice editing segment collected after the voice editing identifier is triggered, and perform editing adaptation processing on the voice text based on the voice editing segment.
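Claims 2 to 4 describe a continuity check on the edited text: extract semantic key features, compute a semantic association, and flag an abnormal field when the association falls below a preset condition. One simple stand-in for that association, shown below purely as a hedged sketch, is cosine similarity over word counts of adjacent sentences; the tokenization, threshold, and function names are all assumptions, not the patent's actual feature extraction.

```python
# Hedged sketch of the claims 2-4 continuity check: score each pair of
# adjacent sentences by bag-of-words cosine similarity, and flag sentences
# weakly associated with the sentence before them as abnormal fields.
import math

def association(a, b):
    """Cosine similarity over word counts of two sentences."""
    wa, wb = a.lower().split(), b.lower().split()
    vocab = sorted(set(wa) | set(wb))
    va = [wa.count(w) for w in vocab]
    vb = [wb.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0

def find_abnormal(sentences, threshold=0.1):
    """Return indices of sentences whose association with the previous
    sentence falls below the threshold (the 'abnormal fields')."""
    return [i for i in range(1, len(sentences))
            if association(sentences[i - 1], sentences[i]) < threshold]
```

If `find_abnormal` returns an empty list, the preset condition is met and the display is updated (claim 3); otherwise the flagged sentences would be marked for correction (claim 4).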
One or more embodiments of the present specification provide a computer-readable storage medium storing computer-executable instructions that, when executed, implement: acquiring interactive action data of a user for a voice text of dialogue voice input at an interactive component of an application program; detecting, according to the interactive action data and interactive environment data, whether a voice editing intention exists; if so, determining an intention type of the voice editing intention, and displaying, on the interactive component, a voice editing identifier corresponding to the intention type; and acquiring a voice editing segment collected after the voice editing identifier is triggered, and performing editing adaptation processing on the voice text based on the voice editing segment.

Drawings

For a clearer description of one or more embodiments of the present description or of the prior-art solutions, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below are only some of the embodiments of the present description, from which a person skilled in the art could obtain other drawings without inventive effort. FIG. 1 is a schematic diagram of an implementation environme