US-20260126954-A1 - METHOD FOR INTERACTING VOICE, ELECTRONIC DEVICE AND STORAGE MEDIUM
Abstract
A method for voice interaction is provided. The method includes: determining a user present in a physical environment, and a first position of the user in the physical environment, based on a real-time audio stream collected in the physical environment; presenting a user indicator corresponding to the user in association with a target indicator in a voice interaction interface rendered for the physical environment, where the relative positional relationship between the user indicator and the target indicator is determined based on the relative positional relationship between the first position and a second position corresponding to the target indicator in the physical environment; and adjusting a visual presentation attribute of the user indicator based on the portion of the real-time audio stream corresponding to the user.
Inventors
- Zhiheng Xu
- Pengfei Zhong
- Xiaohua Ren
- Xiaolin Huang
- Huibin Zhao
Assignees
- BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-19
- Priority Date: 2025-03-07
Claims (20)
- 1 . A method for voice interaction based on a large language model, comprising: determining a user included in a physical environment and a first position of the user in the physical environment based on a real-time audio stream collected in the physical environment; presenting a user indicator corresponding to the user in association with a target indicator in a voice interaction interface rendered for the physical environment, wherein a relative positional relationship between the user indicator and the target indicator is determined based on a relative positional relationship between the first position and a second position corresponding to the target indicator in the physical environment; and adjusting a visual presentation attribute of the user indicator based on a portion of the real-time audio stream corresponding to the user.
- 2 . The method according to claim 1 , wherein the determining a user included in a physical environment and a first position of the user in the physical environment based on a real-time audio stream collected in the physical environment comprises: determining a source direction and a distance of a voice signal in the real-time audio stream collected in the physical environment using a time difference of arrival positioning algorithm; determining the user included in the physical environment based on the source direction; and determining the first position of the user in the physical environment based on the distance.
- 3 . The method according to claim 1 , further comprising: parsing text information of the portion of the real-time audio stream corresponding to the user; determining identity information corresponding to the user based on the text information; and presenting an identity prompt in association with the user indicator corresponding to the user based on the identity information corresponding to the user.
- 4 . The method according to claim 3 , further comprising: in response to determining that a same user is identified with at least two different pieces of identity information, merging the at least two different pieces of identity information and portions of the real-time audio stream used to determine the at least two different pieces of identity information.
- 5 . The method according to claim 4 , further comprising: combining text information corresponding to a first user and text information corresponding to a second user to obtain combined text information, wherein the identity information corresponding to the first user is different from the identity information corresponding to the second user; and performing context analysis on the combined text information using a large language model to obtain a context analysis result, wherein the context analysis result indicates whether the first user and the second user are the same user with different identity information.
- 6 . The method according to claim 1 , further comprising: presenting the target indicator at a center of the voice interaction interface.
- 7 . The method according to claim 1 , further comprising: forming the voice interaction interface based on a planar layout of the physical environment.
- 8 . The method according to claim 7 , further comprising: determining a planar position of the second position in the planar layout; and presenting the target indicator in the voice interaction interface based on the planar position.
- 9 . The method according to claim 1 , wherein the adjusting the visual presentation attribute of the user indicator based on the portion of the real-time audio stream corresponding to the user comprises: continuously adjusting a size of the user indicator based on an accumulated number of words of the text information of the portion of the real-time audio stream corresponding to the user, wherein the size is positively correlated with the accumulated number of words.
- 10 . The method according to claim 9 , further comprising: in response to the accumulated number of words being greater than or equal to an accumulation threshold, stopping continuously adjusting the size of the user indicator.
- 11 . The method according to claim 1 , wherein the adjusting the visual presentation attribute of the user indicator based on the portion of the real-time audio stream corresponding to the user comprises: adjusting a size of the user indicator based on a volume of the portion of the real-time audio stream corresponding to the user, wherein the size is positively correlated with the volume.
- 12 . The method according to claim 1 , wherein the adjusting the visual presentation attribute of the user indicator based on the portion of the real-time audio stream corresponding to the user comprises: continuously adjusting a presentation position of the user indicator to move toward the target indicator based on an accumulated number of words of the text information of the portion of the real-time audio stream corresponding to the user, wherein a moving distance of the presentation position of the user indicator is positively correlated with the accumulated number of words.
- 13 . The method according to claim 12 , further comprising: in response to the presentation position of the user indicator being the same as a presentation position of the target indicator, stopping continuously adjusting the presentation position of the user indicator to move toward the target indicator.
- 14 . The method according to claim 12 , further comprising: in response to determining that the user indicator overlaps the target indicator, stopping continuously adjusting the presentation position of the user indicator to move toward the target indicator.
- 15 . The method according to claim 1 , wherein the adjusting the visual presentation attribute of the user indicator based on the portion of the real-time audio stream corresponding to the user comprises: in response to determining that the user is currently speaking based on the portion of the real-time audio stream corresponding to the user, adjusting a visual style of the user indicator to a dynamic icon.
- 16 . The method according to claim 1 , further comprising: in response to determining that the user is currently speaking based on the portion of the real-time audio stream corresponding to the user, presenting a dynamic indicator starting from a presentation position of the user indicator and pointing to a presentation position of the target indicator, between the user indicator and the target indicator.
- 17 . The method according to claim 16 , wherein the dynamic indicator comprises a dynamic text stream generated based on text information of the content currently being spoken by the user.
- 18 . The method according to claim 1 , wherein the second position comprises a position where a sound collection device is arranged in the physical environment or a position where a target user is located in the physical environment.
- 19 . An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform operations comprising: determining a user included in a physical environment and a first position of the user in the physical environment based on a real-time audio stream collected in the physical environment; presenting a user indicator corresponding to the user in association with a target indicator in a voice interaction interface rendered for the physical environment, wherein a relative positional relationship between the user indicator and the target indicator is determined based on a relative positional relationship between the first position and a second position corresponding to the target indicator in the physical environment; and adjusting a visual presentation attribute of the user indicator based on a portion of the real-time audio stream corresponding to the user.
- 20 . A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform operations comprising: determining a user included in a physical environment and a first position of the user in the physical environment based on a real-time audio stream collected in the physical environment; presenting a user indicator corresponding to the user in association with a target indicator in a voice interaction interface rendered for the physical environment, wherein a relative positional relationship between the user indicator and the target indicator is determined based on a relative positional relationship between the first position and a second position corresponding to the target indicator in the physical environment; and adjusting a visual presentation attribute of the user indicator based on a portion of the real-time audio stream corresponding to the user.
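For illustration only (not part of the claims), the time-difference-of-arrival localization recited in claim 2 can be sketched as follows. The function name, microphone spacing, and far-field approximation are assumptions made for this sketch, and it covers only the source direction; a real system would use multiple microphone pairs and cross-correlation to estimate the arrival delays, and would estimate distance as well.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 degrees C


def estimate_direction(tdoa_s: float, mic_spacing_m: float) -> float:
    """Estimate the source direction as an angle (degrees) from the axis
    of a two-microphone pair, given the time difference of arrival
    (TDOA) between the microphones.

    Uses the far-field approximation: the path-length difference is
    SPEED_OF_SOUND * tdoa_s, and cos(theta) equals that difference
    divided by the microphone spacing.
    """
    path_difference = SPEED_OF_SOUND * tdoa_s
    # Clamp to [-1, 1] so measurement noise cannot push acos() out of domain.
    cos_theta = max(-1.0, min(1.0, path_difference / mic_spacing_m))
    return math.degrees(math.acos(cos_theta))


# A 0.1 ms arrival difference across a 10 cm pair puts the speaker
# roughly 70 degrees off the microphone axis.
angle = estimate_direction(1e-4, 0.10)
```

A zero delay corresponds to a source broadside to the pair (90 degrees); the maximum delay, spacing divided by the speed of sound, corresponds to a source on the axis itself.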
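Claims 9 and 10 describe an indicator size that grows with the accumulated number of spoken words and stops growing once an accumulation threshold is reached. A minimal sketch of that behavior follows; the base size, growth rate, and threshold values are illustrative assumptions, not values taken from the disclosure.

```python
def indicator_size(word_count: int,
                   base_size: float = 24.0,
                   growth_per_word: float = 0.5,
                   word_threshold: int = 200) -> float:
    """Return a user-indicator size that is positively correlated with
    the accumulated number of spoken words, and that freezes once the
    accumulation threshold is reached (as in claims 9 and 10)."""
    # Capping the count at the threshold stops further growth.
    capped_count = min(word_count, word_threshold)
    return base_size + growth_per_word * capped_count
```

An analogous function driven by the instantaneous volume of the user's portion of the audio stream, rather than the word count, would correspond to claim 11.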
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority from Chinese Patent Application No. 202510272830.8, filed on Mar. 7, 2025, the entire disclosure of which is hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as speech recognition, audio processing, computer vision, and large language models, and more particularly to a method for voice interaction based on a large language model, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
BACKGROUND
In work and life, when people handle complex tasks or matters that require multi-person collaboration, they usually hold meetings to communicate and discuss them; centralized discussion through meetings can improve the quality and efficiency with which such tasks and matters are handled. In this context, how to help people conduct meetings more efficiently and with a better experience, and how to make it easier to track and review the communication and interaction that occur during meetings, is a matter worthy of attention and an urgent demand.
SUMMARY
Embodiments of the present disclosure propose a method for voice interaction based on a large language model, an electronic device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure proposes a method for voice interaction based on a large language model, including: determining a user included in a physical environment and a first position of the user in the physical environment based on a real-time audio stream collected in the physical environment; presenting a user indicator corresponding to the user in association with a target indicator in a voice interaction interface rendered for the physical environment, where the relative positional relationship between the user indicator and the target indicator is determined based on the relative positional relationship between the first position and a second position corresponding to the target indicator in the physical environment; and adjusting a visual presentation attribute of the user indicator based on a portion of the real-time audio stream corresponding to the user.
In a second aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the method for voice interaction based on a large language model according to any implementation of the first aspect.
In a third aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions, when executed, are configured to cause a computer to implement the method for voice interaction based on a large language model according to any implementation of the first aspect.
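As a rough illustration of the positional mapping in the first aspect, the sketch below draws the target indicator at the interface center (as in one described embodiment) and offsets each user indicator so that its position relative to the target mirrors the user's physical offset from the second position. The coordinate convention and pixel scale are assumptions made for this sketch, not details from the disclosure.

```python
from typing import Tuple

Point = Tuple[float, float]


def place_user_indicator(user_pos: Point,
                         target_pos: Point,
                         canvas_center: Point,
                         pixels_per_metre: float = 40.0) -> Point:
    """Map a user's physical position (the first position) to an
    interface position whose offset from the target indicator, drawn at
    the canvas center, mirrors the physical offset from the second
    position."""
    dx = (user_pos[0] - target_pos[0]) * pixels_per_metre
    dy = (user_pos[1] - target_pos[1]) * pixels_per_metre
    return (canvas_center[0] + dx, canvas_center[1] + dy)
```

A user standing at the second position itself would thus be drawn directly on the target indicator, which is the end state of the word-count-driven movement described in claims 12 to 14.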
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become readily apparent from the following description.
BRIEF DESCRIPTION OF THE DRAWINGS
By reading the detailed description of non-limiting embodiments with reference to the following drawings, other features, purposes, and advantages of the present disclosure will become more apparent:
FIG. 1 is an exemplary system architecture to which the present disclosure may be applied;
FIG. 2 is a flowchart of a voice interaction process based on a large language model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a process for determining identity information corresponding to a user according to an embodiment of the present disclosure;
FIGS. 4a-4h are schematic diagrams of effects of a voice interaction interface according to embodiments of the present disclosure, respectively;
FIG. 5 is a schematic diagram of an effect achieved by a voice interaction process based on a large language model in an application scenario according to an embodiment of the present disclosure;
FIG. 6 is a structural block diagram of an apparatus for voice interaction based on a large language model according to an embodiment of the present disclosure;
FIG. 7 is a structural schematic diagram of an electronic device adapted to execute the method for voice interaction based on a large language model according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
The exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings