
CN-122019827-A - Index information generation method, device, electronic equipment and medium

CN122019827A

Abstract

The application discloses an index information generation method and apparatus, an electronic device, and a medium, relating to the technical field of artificial intelligence. The method comprises: obtaining audio-video features of a first audio-video; determining a tag sequence from the audio-video features, wherein the tag sequence comprises a tag corresponding to each audio-video frame of the first audio-video, and the tag comprises at least one of a speaker-related tag and an interference-information tag; and generating structured index information of the first audio-video according to the tag sequence and a first text corresponding to the first audio-video.

Inventors

  • Huang Ranxi
  • Miao Feng

Assignees

  • 维沃移动通信有限公司 (Vivo Mobile Communication Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (15)

  1. An index information generation method, characterized in that the method comprises: acquiring audio-video features of a first audio-video, wherein the audio-video features comprise at least one of spectral features and prosodic features; determining a tag sequence from the audio-video features, wherein the tag sequence comprises a tag corresponding to each audio-video frame of the first audio-video, and the tag comprises at least one of a speaker-related tag and an interference-information tag; and generating structured index information of the first audio-video according to the tag sequence and a first text corresponding to the first audio-video.
  2. The method of claim 1, wherein the speaker-related tag comprises a speaker identity tag, and wherein the step of determining the speaker identity tag from the audio-video features comprises: obtaining, through a speaker-related tag generation model, an embedding vector of at least one speech segment based on the spectral features, wherein each speech segment comprises at least one consecutive audio-video frame of the first audio-video; and clustering the at least one embedding vector to obtain the speaker identity tag corresponding to each audio-video frame.
  3. The method according to claim 1 or 2, wherein the step of determining the interference-information tag from the audio-video features comprises: performing context-aware silence detection, filler-word recognition, and noise recognition on the audio-video features through an interference-information tag generation model to obtain the interference-information tag corresponding to each audio-video frame.
  4. The method of claim 1, wherein the first text comprises N semantically complete content units, N being a positive integer; and wherein the step of generating the structured index information of the first audio-video according to the tag sequence and the first text corresponding to the first audio-video comprises: performing time-domain alignment and text encoding on the N content units and the tag sequence to obtain N multidimensional feature encoding sequences in one-to-one correspondence with the N content units; performing proposition-structured decoding and association mapping on the N multidimensional feature encoding sequences through an argument extraction unit to obtain a proposition tensor matrix and a proposition-to-source-text association matrix, wherein each proposition tensor in the proposition tensor matrix indicates the semantic information of at least one argument and its corresponding supporting statement in the first text, and the proposition-to-source-text association matrix indicates the association strength between each proposition tensor and its corresponding supporting statement in the first text; and performing structured fusion and organization on the first text through a structured content generation unit according to the proposition tensor matrix and the proposition-to-source-text association matrix to obtain the structured index information.
  5. The method of claim 4, wherein the structured index information comprises a summary; and wherein the step of performing structured fusion and organization on the first text through the structured content generation unit according to the proposition tensor matrix and the proposition-to-source-text association matrix to obtain the structured index information comprises: locating and extracting, through a summary generation module in the structured content generation unit, key chunks corresponding to key propositions from the first text based on the proposition-to-source-text association matrix; and generating coherent text connecting the key chunks according to the proposition tensor matrix to form the summary.
  6. The method of claim 4, wherein the structured index information comprises a hierarchical directory; and wherein the step of performing structured fusion and organization on the first text according to the proposition tensor matrix and the proposition-to-source-text association matrix to obtain the structured index information comprises: performing semantic clustering on the N content units based on the proposition tensor matrix to obtain at least two topic clusters; obtaining a generalized title for each topic cluster according to the core argument of that topic cluster; organizing, based on the proposition-to-source-text association matrix, the arguments corresponding to proposition tensors belonging to the same topic cluster as child nodes under that topic cluster; and organizing the subject of the first text, the generalized titles of the at least two topic clusters, and the child nodes under each topic cluster into a hierarchical directory comprising at least three logical levels.
  7. The method of claim 6, wherein the semantic clustering of the N content units based on the proposition tensor matrix to obtain the at least two topic clusters comprises: performing semantic clustering on the N content units based on the proposition tensor matrix in a time-constrained mode to obtain the at least two topic clusters; or performing semantic clustering on the N content units based on the proposition tensor matrix in a cross-time mode to obtain the at least two topic clusters.
  8. The method of claim 4, wherein the structured index information comprises a relation mapping network; and wherein the step of performing structured fusion and organization on the first text according to the proposition tensor matrix and the proposition-to-source-text association matrix to obtain the structured index information comprises: identifying question-type chunks in the first text based on syntactic features of the sentences in the first text; identifying, based on the proposition-to-source-text association matrix and the proposition tensor matrix, argument-type chunks and their corresponding support-type chunks in the first text, and identifying the answer-type chunk corresponding to each question-type chunk in the first text; and establishing a support-relation link between each argument-type chunk and its corresponding support-type chunk, establishing a question-answer-relation link between each question-type chunk and its corresponding answer-type chunk, and generating hyperlink path indexes through graph embedding to obtain the relation mapping network.
  9. The method of claim 4, wherein before generating the structured index information of the first audio-video according to the tag sequence and the first text corresponding to the first audio-video, the method further comprises: performing multimodal fusion decision processing on the prosodic features, a second text corresponding to the first audio-video, and the tag sequence through a multimodal segmentation model to obtain a content segmentation result corresponding to the first audio-video, wherein the content segmentation result is used for segmenting the first audio-video into N semantically complete audio-video segments in one-to-one correspondence with the N content units.
  10. The method according to claim 9, wherein performing, through the multimodal segmentation model, the multimodal fusion decision processing on the prosodic features, the second text corresponding to the first audio-video, and the tag sequence to obtain the content segmentation result corresponding to the first audio-video comprises: performing temporal modeling on the prosodic features through a speech feature analysis unit in the multimodal segmentation model to obtain a temporal acoustic tensor matrix corresponding to the first audio-video; performing sentence-similarity analysis on the second text through a semantic understanding unit in the multimodal segmentation model to obtain a sentence-level semantic similarity matrix corresponding to the first audio-video; determining, through a multimodal fusion unit in the multimodal segmentation model, a globally optimal sequence of segmentation points according to the acoustic mutation intensity measured by the first derivative of the acoustic tensor matrix, the semantic coherence changes represented by the sentence-level semantic similarity matrix, and the speaker identity, emotion, and prosody changes represented by the tag sequence; and taking the globally optimal sequence of segmentation points as the content segmentation result.
  11. The method of claim 1, wherein after generating the structured index information of the first audio-video, the method further comprises: displaying the structured index information, and constructing a bidirectional mapping between text elements in the index information and the time axis of the first audio-video through a dynamic-programming shortest-path matching algorithm, wherein the bidirectional mapping is used for bidirectional jumping between text elements and audio-video segments and for synchronized highlighting of the corresponding text while the audio-video is playing.
  12. The method of claim 11, wherein the method further comprises: collecting interaction behavior data between a user and the structured index information; and dynamically optimizing the generation strategy of the structured index information based on the interaction behavior data.
  13. An index information generation system, characterized by comprising an audio feature extraction model, a tag generation model, and a structured content generation model, wherein the audio feature extraction model is configured to acquire audio-video features of a first audio-video, the audio-video features comprising at least one of spectral features and prosodic features; the tag generation model is configured to determine a tag sequence from the audio-video features, the tag sequence comprising a tag corresponding to each audio-video frame of the first audio-video, and the tag comprising at least one of a speaker-related tag and an interference-information tag; and the structured content generation model is configured to generate structured index information of the first audio-video according to the tag sequence and a first text corresponding to the first audio-video.
  14. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which, when executed by the processor, implement the steps of the index information generation method according to any one of claims 1 to 12.
  15. A computer-readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the index information generation method according to any one of claims 1 to 12.
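The diarization step of claim 2 (per-segment speaker embeddings clustered into per-frame speaker identity tags) can be illustrated with a minimal pure-Python sketch. Everything here is an illustrative assumption rather than the patent's model: the greedy cosine-threshold clustering, the `cluster_speakers` name, the toy two-dimensional embeddings, and the 0.85 threshold are all hypothetical stand-ins for the claimed tag generation model.

```python
# Sketch of claim 2: cluster per-segment speaker embeddings into per-frame
# speaker identity tags. Names, threshold, and embeddings are illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_speakers(segments, threshold=0.85):
    """segments: list of ((start_frame, end_frame), embedding). Greedy
    agglomerative clustering: a segment joins the first cluster whose
    running centroid it matches above the cosine threshold."""
    clusters = []   # each: {"centroid": [...], "size": int}
    labels = []     # (frame_range, speaker tag) per segment
    for frames, emb in segments:
        for idx, c in enumerate(clusters):
            if cosine(emb, c["centroid"]) >= threshold:
                c["size"] += 1
                n = c["size"]
                c["centroid"] = [(cv * (n - 1) + ev) / n
                                 for cv, ev in zip(c["centroid"], emb)]
                labels.append((frames, f"speaker_{idx}"))
                break
        else:
            clusters.append({"centroid": list(emb), "size": 1})
            labels.append((frames, f"speaker_{len(clusters) - 1}"))
    # expand segment-level labels to per-frame speaker identity tags
    frame_tags = {}
    for (start, end), tag in labels:
        for f in range(start, end):
            frame_tags[f] = tag
    return frame_tags

tags = cluster_speakers([
    ((0, 3), [1.0, 0.0]),   # frames 0-2
    ((3, 5), [0.0, 1.0]),   # frames 3-4, dissimilar -> new speaker
    ((5, 8), [0.9, 0.1]),   # frames 5-7, close to the first embedding
])
```

A real system would use learned embeddings (e.g. from a speaker verification network) and a proper clustering criterion, but the frame-expansion step at the end mirrors the claim's requirement that every audio-video frame receives a speaker identity tag.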
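The multimodal fusion decision of claim 10 can likewise be sketched. The claim describes a globally optimal segmentation-point search; the sketch below substitutes a simple greedy threshold over a weighted score combining the acoustic first derivative, the semantic-coherence drop, and tag changes. All names, weights, the threshold, and the toy inputs are illustrative assumptions, not the patent's multimodal segmentation model.

```python
# Sketch of claim 10: fuse acoustic change, semantic-coherence drop, and
# tag changes into per-boundary scores, then pick segmentation points.
# Weights, threshold, and inputs are illustrative assumptions.

def segmentation_points(acoustic, similarity_next, tags,
                        w=(0.4, 0.4, 0.2), threshold=0.5):
    """acoustic: per-frame scalar acoustic feature (stand-in for the
    temporal acoustic tensor matrix); similarity_next[i]: semantic
    similarity between frame i and i+1 (stand-in for the sentence-level
    similarity matrix); tags: per-frame speaker/emotion/prosody tag."""
    points = []
    for i in range(1, len(acoustic)):
        acoustic_jump = abs(acoustic[i] - acoustic[i - 1])   # first derivative
        semantic_drop = 1.0 - similarity_next[i - 1]         # coherence change
        tag_change = 1.0 if tags[i] != tags[i - 1] else 0.0  # e.g. new speaker
        score = w[0] * acoustic_jump + w[1] * semantic_drop + w[2] * tag_change
        if score >= threshold:
            points.append(i)
    return points

pts = segmentation_points(
    acoustic=[0.1, 0.1, 0.9, 0.9, 0.2],
    similarity_next=[0.9, 0.2, 0.9, 0.3],
    tags=["s0", "s0", "s1", "s1", "s1"],
)
```

The greedy threshold is only a stand-in: the claimed "globally optimal sequence of segmentation points" implies a joint optimization (e.g. dynamic programming over candidate boundaries), but the per-boundary score above shows which signals the fusion unit combines.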
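Claim 11's bidirectional mapping between text elements and the audio-video time axis can be sketched as a dynamic-programming alignment. The cost function (absolute time difference), the monotonicity constraint, and all names and inputs below are illustrative assumptions; the patent only specifies that a dynamic-programming shortest-path matching algorithm builds the mapping.

```python
# Sketch of claim 11: align text elements to timeline segments with a
# dynamic-programming shortest-path search, then derive both directions
# of the mapping. Cost function and inputs are illustrative.

def align(element_times, segment_times):
    """Monotonically assign each text element (estimated timestamp) to one
    audio-video segment (start time), minimizing total |element - segment|
    with non-decreasing segment indices."""
    n, m = len(element_times), len(segment_times)
    INF = float("inf")
    # dp[i][j]: min cost with element i assigned to segment j
    dp = [[INF] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):
        dp[0][j] = abs(element_times[0] - segment_times[j])
    for i in range(1, n):
        for j in range(m):
            best = min(range(j + 1), key=lambda k: dp[i - 1][k])
            dp[i][j] = dp[i - 1][best] + abs(element_times[i] - segment_times[j])
            back[i][j] = best
    j = min(range(m), key=lambda k: dp[n - 1][k])
    mapping = [0] * n
    for i in range(n - 1, -1, -1):
        mapping[i] = j
        j = back[i][j]
    # bidirectional: element -> segment, and segment -> elements
    reverse = {}
    for elem, seg in enumerate(mapping):
        reverse.setdefault(seg, []).append(elem)
    return mapping, reverse

mapping, reverse = align([0.0, 10.0, 31.0], [0.0, 12.0, 30.0])
```

With the mapping in both directions, clicking a text element can seek the player to its segment, and the currently playing segment can highlight its text, as the claim describes.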
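The three-level hierarchical directory of claim 6 (subject at the top, generalized topic titles below it, argument child nodes below those) is essentially a tree-building step once the topic clusters exist. The sketch below stubs out the clustering and shows only the data structure; all names and example content are illustrative assumptions.

```python
# Sketch of claim 6: organize topic clusters and their arguments into a
# directory with at least three logical levels. Clustering is assumed
# done upstream; names and example content are illustrative.

def build_directory(subject, clusters):
    """clusters: list of {"title": str, "arguments": [str, ...]},
    i.e. each cluster's generalized title plus its argument child nodes."""
    return {
        "subject": subject,                        # level 1: subject of the text
        "topics": [
            {
                "title": c["title"],               # level 2: generalized title
                "children": list(c["arguments"]),  # level 3: argument nodes
            }
            for c in clusters
        ],
    }

directory = build_directory(
    "Episode 12: on-device AI",
    [
        {"title": "Latency", "arguments": ["Edge inference cuts round trips"]},
        {"title": "Privacy", "arguments": ["Raw audio never leaves the device"]},
    ],
)
```

The claim's two clustering modes (time-constrained in claim 7's first branch, cross-time in its second) would change which content units land in each cluster, but not this directory shape.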

Description

Index information generation method, device, electronic equipment and medium

Technical Field

The application belongs to the technical field of artificial intelligence, and in particular relates to an index information generation method and apparatus, an electronic device, and a medium.

Background

With the popularity of podcasts, interviews, training sessions, and the like, users increasingly need to obtain key information from audio and video quickly. In the related art, audio content may be converted to text based on automatic speech recognition (Automatic Speech Recognition, ASR) techniques, and a text summary is then generated based on rules or shallow statistical models. However, index information such as summaries and catalogs refined by rules or shallow statistical models lacks contextual relevance and logic, and can hardly support indexing and knowledge acquisition over long text. Users therefore acquire key information inefficiently in audio-video scenarios.

Disclosure of Invention

Embodiments of the present application aim to provide an index information generation method and apparatus, an electronic device, and a medium, which can solve the problem of users' low efficiency in acquiring key information in audio-video scenarios.
In a first aspect, an embodiment of the present application provides an index information generation method, where the method includes: obtaining audio-video features of a first audio-video, where the audio-video features include at least one of spectral features and prosodic features; determining a tag sequence according to the audio-video features, where the tag sequence includes a tag corresponding to each audio-video frame of the first audio-video, and the tag includes at least one of a speaker-related tag and an interference-information tag; and generating structured index information of the first audio-video according to the tag sequence and a first text corresponding to the first audio-video.

In a second aspect, an embodiment of the present application provides an index information generation system, which may include an audio feature extraction model, a tag generation model, and a structured content generation model. The audio feature extraction model is configured to acquire audio-video features of a first audio-video, the audio-video features including at least one of spectral features and prosodic features; the tag generation model is configured to determine a tag sequence according to the audio-video features, the tag sequence including a tag corresponding to each audio-video frame of the first audio-video, and the tag including at least one of a speaker-related tag and an interference-information tag; and the structured content generation model is configured to generate structured index information of the first audio-video according to the tag sequence and a first text corresponding to the first audio-video.

In a third aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, perform the steps of the method according to the first aspect.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.

In the embodiments of the application, the audio-video features of the first audio-video can be obtained, where the audio-video features include at least one of spectral features and prosodic features; a tag sequence is determined according to the audio-video features, where the tag sequence includes a tag corresponding to each audio-video frame of the first audio-video, and the tag includes at least one of a speaker-related tag and an interference-information tag; and the structured index information of the first audio-video is generated according to the tag sequence and a first text corresponding to the first audio-video. According to this scheme, the tag sequence containing speaker-related and interference information in the first audio-video can be determined according to the audio
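The three-stage pipeline restated above (audio feature extraction, tag generation, structured content generation) can be sketched end to end. Every function below is an illustrative placeholder under toy assumptions (one text unit per frame, silence as the only interference tag, mean amplitude as the only feature), not the patent's models.

```python
# Sketch of the three-stage pipeline: feature extraction -> tag
# generation -> structured content generation. All functions are
# illustrative placeholders, not the patent's models.

def extract_features(frames):
    # Stand-in for spectral/prosodic extraction: mean amplitude per frame.
    return [sum(f) / len(f) for f in frames]

def generate_tags(features, silence_threshold=0.1):
    # Stand-in interference-information tag: mark near-silent frames.
    return ["silence" if x < silence_threshold else "speech" for x in features]

def generate_index(tags, first_text):
    # Stand-in structured index: keep only text units aligned to speech
    # frames (assumes one text unit per frame for this toy example).
    units = first_text.split(". ")
    kept = [u for u, t in zip(units, tags) if t == "speech"]
    return {"summary": " / ".join(kept), "n_units": len(kept)}

frames = [[0.5, 0.7], [0.0, 0.05], [0.6, 0.4]]
index = generate_index(generate_tags(extract_features(frames)),
                       "Intro. Pause. Main point")
```

Even in this toy form, the structure matches the claims: the tag sequence filters interference out of the first text before the structured index is assembled from what remains.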