CN-122027762-A - Conference information processing method, device, system and medium
Abstract
The application relates to the technical field of data processing, and in particular to a conference information processing method, device, system and medium, which address the problems in the related art that a conference assistant judges job instructions inaccurately and that its job results are not targeted. The method is applied to a conference assistance system and comprises: identifying the intention of a target participant in a conference based on multimodal data in the conference; upon determining that the target participant has issued a valid job instruction, determining a plurality of target data sources required for completing the valid job instruction based on a text segment associated with the valid job instruction, the conference context, and an associated conference record; and fusing, based on the data corresponding to the target data sources, the multimodal conference information associated with the valid job instruction to obtain a job result, wherein the target participant is a local participant or a remote participant. The conference context is thereby understood more comprehensively, the job result is more targeted, and the quality and persuasiveness of the job result are improved.
Inventors
- CHENG NAN
- DING JIANXIN
- WEI XIAOWANG
- YANG JIANBO
Assignees
- 北京可以科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260122
Claims (13)
- 1. A conference information processing method, applied to a conference assistance system, comprising: acquiring multimodal data in a conference, and identifying an intention of a target participant in the conference based on the multimodal data to determine whether the target participant has issued a valid job instruction, wherein the target participant is a local participant or a remote participant; upon determining that a valid job instruction has been issued, determining a plurality of target data sources required for completing the valid job instruction based on a text segment associated with the valid job instruction, the conference context, and an associated conference record; and fusing multimodal conference information associated with the valid job instruction based on the data respectively corresponding to the plurality of target data sources, to obtain a job result.
- 2. The method of claim 1, wherein the conference assistance system includes an entity terminal located in an offline conference environment and a virtual server located in an online conference environment, the multimodal data includes voice data and image data, and the acquiring the multimodal data in the conference includes: collecting voice data in the offline conference environment through a microphone array of the entity terminal; acquiring image data in the offline conference environment through a camera acquisition device of the entity terminal; and receiving voice data of a remote participant in the online conference environment through the virtual server.
- 3. The method of claim 2, wherein the identifying the intention of the target participant in the conference based on the multimodal data to determine whether the target participant has issued a valid job instruction comprises: converting voice data in the multimodal data into text data; performing keyword recognition on text segments in the text data and, upon determining that a preset keyword is recognized, determining, based on the target voice data corresponding to the text segment containing the preset keyword, that the target participant to whom the target voice data belongs has issued a job instruction; and inputting the text data of the target participant into an intention recognition model, and determining whether the target participant has issued a valid job instruction.
- 4. The method of claim 3, wherein, if target image data of the target participant exists in the multimodal data, the determining, based on the target voice data corresponding to the text segment containing the preset keyword, that the target participant to whom the target voice data belongs has issued a job instruction includes: extracting non-content features of the target voice data, and performing visual recognition on the target image data; and upon determining that the extracted non-content features are contained in a preset non-content feature set and that the visual recognition result is contained in a preset visual recognition result set, determining that the target participant has issued a job instruction.
- 5. The method of claim 3, wherein the inputting the text data of the target participant into an intention recognition model and determining whether the target participant has issued a valid job instruction comprises: inputting the text data of the target participant into the intention recognition model, extracting semantic information of the text data of the target participant based on the intention recognition model, and recognizing the intention of the target participant based on the semantic information to obtain an intention recognition result; and determining, based on the intention recognition result, whether the target participant has issued a valid job instruction.
- 6. The method of claim 3, wherein the intention recognition result includes whether a job instruction is issued and a confidence, and wherein the determining whether the target participant has issued a valid job instruction based on the intention recognition result comprises: if the intention recognition result indicates that a job instruction is issued and the confidence is greater than a confidence threshold, determining that a valid job instruction has been issued; and if the intention recognition result indicates that a job instruction is issued and the confidence is not greater than the confidence threshold, or the intention recognition result indicates that no job instruction is issued, determining that no valid job instruction has been issued.
- 7. The method of any of claims 2-6, wherein the determining a plurality of target data sources required for completing the valid job instruction based on the text segment associated with the valid job instruction, the conference context, and the associated conference record comprises: determining a plurality of candidate data sources associated with the valid job instruction based on the text segment associated with the valid job instruction, the conference context, and the associated conference record; and upon determining that at least two of the plurality of candidate data sources support completing the valid job instruction, removing from the at least two candidate data sources any data source that satisfies a data source rejection strategy, to obtain the plurality of target data sources, wherein the data source rejection strategy comprises a response delay greater than a first threshold and/or a data processing overhead greater than a second threshold.
- 8. The method of claim 7, wherein the fusing the multimodal conference information associated with the valid job instruction based on the data respectively corresponding to the plurality of target data sources, to obtain a job result, comprises: acquiring the data respectively corresponding to the plurality of target data sources; and fusing, based on a multimodal fusion model, the multimodal conference information associated with the valid job instruction in the data respectively corresponding to the plurality of target data sources, to obtain the job result.
- 9. The method of claim 8, wherein the data respectively corresponding to the plurality of target data sources comprises at least two of: network data associated with the valid job instruction, the network data being data retrieved over a network and extracted through a large language model; internal data associated with the valid job instruction, the internal data being data retrieved from an internal database and/or the associated conference record; screen sharing picture data in the online conference environment, acquired through the virtual server; text data obtained by converting voice data in the conference; and area image data analyzed by the entity terminal and associated with the valid job instruction, the area image data being image data of a preset area in the offline conference environment acquired by the camera acquisition device of the entity terminal, and the preset area being at least one of a whiteboard, a display and a projection area.
- 10. The method of any one of claims 1-9, wherein after the job result is obtained, the method further comprises: sending the job result, in audio or another preset format, to the target participant and/or to the related personnel mentioned by the target participant.
- 11. A conference information processing apparatus, characterized by being applied to a conference assistance system, comprising: an acquisition unit, configured to acquire multimodal data in a conference and identify an intention of a target participant in the conference based on the multimodal data to determine whether the target participant has issued a valid job instruction, wherein the target participant is a local participant or a remote participant; a determining unit, configured to, upon determining that a valid job instruction has been issued, determine a plurality of target data sources required for completing the valid job instruction based on a text segment associated with the valid job instruction, the conference context, and an associated conference record; and an information fusion unit, configured to fuse multimodal conference information associated with the valid job instruction based on the data respectively corresponding to the plurality of target data sources, to obtain a job result.
- 12. A conference assistance system, characterized by comprising an entity terminal located in an offline conference environment and a virtual server located in an online conference environment; the conference assistance system comprising a memory for storing a computer program or instructions and a processor for executing the computer program or instructions in the memory such that the method according to any one of claims 1-10 is performed.
- 13. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor, enable the processor to perform the method of any one of claims 1-10.
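The decision rules in claims 6 and 7 can be illustrated with a minimal Python sketch. All names (`IntentResult`, `DataSource`, `is_valid_instruction`, `select_target_sources`) and all default threshold values are hypothetical and do not appear in the claims. The sketch also assumes that a candidate is rejected when either limit is exceeded, which is one reading of the "and/or" wording, and that candidates are returned unfiltered when fewer than two support the instruction, a case the claims do not address.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IntentResult:
    """Hypothetical output of an intention recognition model (claim 6)."""
    issues_instruction: bool  # whether a job instruction is issued
    confidence: float         # model confidence for that decision


def is_valid_instruction(result: IntentResult, threshold: float = 0.8) -> bool:
    """Valid only if an instruction is issued AND confidence exceeds the threshold.

    The threshold value 0.8 is an assumption for illustration.
    """
    return result.issues_instruction and result.confidence > threshold


@dataclass(frozen=True)
class DataSource:
    """Hypothetical description of a candidate data source (claim 7)."""
    name: str
    supports_instruction: bool  # can this source help complete the instruction?
    response_delay_ms: float    # compared against the "first threshold"
    processing_cost: float      # compared against the "second threshold"


def select_target_sources(
    candidates: List[DataSource],
    max_delay_ms: float = 500.0,  # assumed first threshold
    max_cost: float = 1.0,        # assumed second threshold
) -> List[DataSource]:
    """Keep supporting sources, then drop those matching the rejection strategy."""
    supporting = [s for s in candidates if s.supports_instruction]
    if len(supporting) < 2:
        # Assumption: claim 7 only describes filtering when at least two
        # candidates support the instruction, so nothing is rejected here.
        return supporting
    return [
        s for s in supporting
        if not (s.response_delay_ms > max_delay_ms or s.processing_cost > max_cost)
    ]
```

Under these assumed defaults, a source with a 900 ms response delay would be dropped while sources at 50 ms and 120 ms would be kept, and an intention result with confidence exactly equal to the threshold would not count as valid, matching the "not greater than" branch of claim 6.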
Description
Conference information processing method, device, system and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a system, and a medium for processing conference information.
Background
With the development and popularization of internet technology, teleconferencing is favored in various fields as a new conference modality. A teleconference enables multiple persons at different geographic locations to participate in the same conference at the same time through a network. A conference assistant is a digital tool that improves conference efficiency by using artificial intelligence (Artificial Intelligence, AI) technology; its core functions include voice transcription, intelligent summarization, conference management and collaboration support.
In the related art, a conference assistant collects the audio of conference participants and converts the audio into text to obtain a conference context. During the conference, the conference assistant mostly relies on text keywords, such as the name of the conference assistant, to judge whether the user has issued a job instruction to the conference assistant. After judging that the user has issued a job instruction, it triggers execution of the job flow corresponding to the job instruction based on the conference context, and obtains a job result and feeds it back to the user. For example, the user asks the conference assistant "what has just been decided," and the conference assistant derives a reply based on the conference context and outputs it to the user in audio form.
However, judgment based on text keyword recognition alone lacks a uniform and robust criterion, which results in low accuracy in judging whether the user has issued a job instruction. Moreover, the job flow is executed only on the basis of the transcribed conference context, so the job result returned to the user is poorly targeted and of low quality.
Disclosure of Invention
The embodiment of the application provides a conference information processing method, device, system and medium, which are used for realizing a more comprehensive understanding of the conference context, so that the obtained job result is more targeted, and the quality and persuasiveness of the job result are improved.
The specific technical scheme provided by the embodiment of the application is as follows:
In a first aspect, an embodiment of the present application provides a method for processing conference information, which is applied to a conference assistance system, where the method includes:
acquiring multimodal data in a conference, and identifying an intention of a target participant in the conference based on the multimodal data to determine whether the target participant has issued a valid job instruction, wherein the target participant is a local participant or a remote participant;
upon determining that a valid job instruction has been issued, determining a plurality of target data sources required for completing the valid job instruction based on a text segment associated with the valid job instruction, the conference context, and an associated conference record;
and fusing multimodal conference information associated with the valid job instruction based on the data respectively corresponding to the plurality of target data sources, to obtain a job result.
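The control flow of the first-aspect method can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described by the application: the `Utterance` type, the `process_utterance` function, and the three pluggable callables are hypothetical names, and the toy stand-ins operate on text only, whereas the method identifies intention from multimodal data and fuses data with a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Utterance:
    speaker: str
    text: str
    is_remote: bool  # True for a remote participant, False for a local one


# Hypothetical interfaces; the source does not define them.
ValidityCheck = Callable[[Utterance, List[str]], bool]
SourcePicker = Callable[[Utterance, List[str], List[str]], Dict[str, Callable[[], str]]]
Fuser = Callable[[Utterance, Dict[str, str]], str]


def process_utterance(
    utterance: Utterance,
    context: List[str],   # transcribed conference context
    records: List[str],   # associated conference records
    is_valid: ValidityCheck,
    pick_sources: SourcePicker,
    fuse: Fuser,
) -> Optional[str]:
    # Step 1: decide whether a valid job instruction was issued.
    if not is_valid(utterance, context):
        return None
    # Step 2: determine the target data sources for this instruction.
    sources = pick_sources(utterance, context, records)
    # Step 3: fetch the data of each source and fuse it into a job result.
    data = {name: fetch() for name, fetch in sources.items()}
    return fuse(utterance, data)


# Toy stand-ins so the sketch runs end to end (all assumptions).
def toy_is_valid(u: Utterance, context: List[str]) -> bool:
    return "assistant" in u.text.lower()


def toy_pick_sources(u: Utterance, context: List[str], records: List[str]):
    return {
        "transcript": lambda: " ".join(context),
        "records": lambda: " ".join(records),
    }


def toy_fuse(u: Utterance, data: Dict[str, str]) -> str:
    return "; ".join(f"{name}: {value}" for name, value in sorted(data.items()))
```

Passing the three stages in as callables keeps the sketch independent of any particular intention recognition model, data source selection rule, or fusion model, none of which are specified at this level of the description.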
In a possible implementation, the conference assistance system includes an entity terminal located in an offline conference environment and a virtual server located in an online conference environment, the multimodal data includes voice data and image data, and the acquiring the multimodal data in the conference includes:
collecting voice data in the offline conference environment through a microphone array of the entity terminal;
acquiring image data in the offline conference environment through a camera acquisition device of the entity terminal;
and receiving voice data of a remote participant in the online conference environment through the virtual server.
In one possible implementation, the identifying, based on the multimodal data, an intention of a target participant in the conference to determine whether the target participant has issued a valid job instruction includes:
converting voice data in the multimodal data into text data;
performing keyword recognition on text segments in the text data and, upon determining that a preset keyword is recognized, determining, based on the target voice data corresponding to the text segment containing the preset keyword, that the target participant to whom the target voice data belongs has issued a job instruction;
and inputting the text data of the target participant into an intention recognition model, and determining whether the target participant has issued a valid job instruction.
In one possible implementation method