CN-121996813-A - Agent-based video retrieval method


Abstract

The invention discloses an Agent-based video retrieval method comprising two steps: video warehousing and retrieval. During warehousing, the original video is segmented in time sequence to obtain video clips; a video description module generates, for each clip, a comprehensive description comprising a picture description and an audio description; a large language model then summarizes the comprehensive descriptions into an overall description of the original video; and the two types of description are stored separately in a database, with vector indexes generated for them. During retrieval, a Query processing Agent analyzes the user query to determine the retrieval intention, and a database retrieval Agent optimizes the query and obtains retrieval results based on the vector indexes. Within the video description module, a prompt word Agent generates a targeted prompt word from the picture description and uses it to guide an audio understanding large model to extract audio information associated with the picture, so that audio and picture information are tightly combined. The invention addresses the insufficient association between audio and picture information and the difficulty of resolving user intention in the prior art, improves the accuracy of video retrieval, reduces interaction obstacles between the user and the system, and improves the user experience.
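The warehousing flow summarized in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the fixed-length `segment` function stands in for the video splitting model, every `stub_*` function is a hypothetical placeholder for a model or Agent that the patent names but does not specify, and a clip is represented only by its start and end times.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Clip:
    start: float  # seconds from the beginning of the original video
    end: float


def segment(duration: float, clip_len: float) -> List[Clip]:
    """Split [0, duration) into consecutive clips of at most clip_len seconds."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    clips: List[Clip] = []
    start = 0.0
    while start < duration:
        clips.append(Clip(start, min(start + clip_len, duration)))
        start += clip_len
    return clips


def describe_clip(clip: Clip, video_model: Callable, prompt_agent: Callable,
                  audio_model: Callable, merge_llm: Callable) -> Dict[str, str]:
    """Picture description -> prompt word -> audio description -> merged text."""
    picture = video_model(clip)
    prompt = prompt_agent(picture)     # the prompt is derived from the picture
    audio = audio_model(clip, prompt)  # audio extraction is guided by the prompt
    return {"picture": picture, "audio": audio,
            "comprehensive": merge_llm(picture, audio)}


def ingest(duration: float, clip_len: float, video_model: Callable,
           prompt_agent: Callable, audio_model: Callable, merge_llm: Callable,
           summarize_llm: Callable) -> Tuple[List[Dict[str, str]], str]:
    """Return per-clip records plus one overall description of the video."""
    records = [describe_clip(c, video_model, prompt_agent, audio_model, merge_llm)
               for c in segment(duration, clip_len)]
    overall = summarize_llm([r["comprehensive"] for r in records])
    return records, overall


# Hypothetical stand-ins for the large models and the prompt word Agent.
def stub_video_model(clip: Clip) -> str:
    return f"picture of {clip.start:.0f}-{clip.end:.0f}s"


def stub_prompt_agent(picture: str) -> str:
    return f"Focus on sounds related to: {picture}"


def stub_audio_model(clip: Clip, prompt: str) -> str:
    return f"audio relevant to <{prompt}>"


def stub_merge_llm(picture: str, audio: str) -> str:
    return f"{picture} | {audio}"


def stub_summarize_llm(descriptions: List[str]) -> str:
    return " / ".join(descriptions)
```

In a real system the two outputs of `ingest` would be written to the database and embedded into separate vector indexes; that storage layer is omitted here because the abstract does not name a specific database.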

Inventors

ZHANG YUANTONG

Assignees

新国脉数字文化股份有限公司

Dates

Publication Date: 20260508
Application Date: 20251208

Claims (7)

1. An Agent-based video retrieval method, characterized by comprising the following steps: a video warehousing step, comprising performing time sequence segmentation on an original video to obtain video clips, and generating a comprehensive description of each video clip through a video description module, wherein the comprehensive description comprises a picture description and an audio description of the video clip; and a retrieval step, comprising receiving a user query, judging through a Query processing Agent whether the user needs to retrieve a video and, if so, whether the complete video or a video clip is to be retrieved, optimizing the search query and calling a search tool through a database retrieval Agent according to the judgment result, and acquiring a corresponding retrieval result from the database and returning it.
2. The Agent-based video retrieval method according to claim 1, wherein the step of the video description module generating the comprehensive description of the video clip comprises: inputting the video clip into a video understanding large model to obtain the picture description of the video clip; inputting the picture description into a prompt word Agent, which generates a prompt word for an audio understanding large model according to the picture description; inputting the prompt word into the audio understanding large model to obtain the audio description of the video clip; and inputting the picture description and the audio description into a large language model, which merges them to obtain the comprehensive description of the video clip.
3. The Agent-based video retrieval method according to claim 1, wherein the time sequence segmentation comprises segmenting the original video into video clips of a preset length through a video splitting model.
4. The Agent-based video retrieval method according to claim 1, wherein the step of the database retrieval Agent optimizing the search query comprises performing semantic expansion and vector conversion on the user query, and performing similarity retrieval based on the vector index.
5. The Agent-based video retrieval method according to claim 1, wherein the retrieval result comprises a video clip, a complete video link or a corresponding answer, as determined by the judgment result of the Query processing Agent.
6. The Agent-based video retrieval method according to claim 1, wherein the prompt word generated by the prompt word Agent is related to the picture content of the video clip and is used to guide the audio understanding large model to extract the audio information associated with the picture.
7. The Agent-based video retrieval method according to claim 1, wherein the vector index is generated based on semantic vectors of the comprehensive descriptions of the video clips and of the overall description of the original video.
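The retrieval step of claims 1, 4, 5 and 7 can be illustrated with a short sketch. Everything below is an assumption made for illustration: the keyword rules in `classify_intent` stand in for the Query processing Agent (a large model in the patent), the `SYNONYMS` table stands in for semantic expansion, and a bag-of-words `Counter` with cosine similarity stands in for a learned semantic vector and a real vector index.

```python
import math
from collections import Counter
from enum import Enum
from typing import Dict, Optional, Tuple


class Intent(Enum):
    NO_RETRIEVAL = "no_retrieval"
    FULL_VIDEO = "full_video"
    CLIP = "clip"


def classify_intent(query: str) -> Intent:
    """Toy stand-in for the Query processing Agent."""
    q = query.lower()
    if "video" not in q and "clip" not in q:
        return Intent.NO_RETRIEVAL
    if any(word in q for word in ("clip", "moment", "scene")):
        return Intent.CLIP
    return Intent.FULL_VIDEO


# Hypothetical synonym table used for semantic expansion.
SYNONYMS = {"puppy": ["dog"], "dog": ["puppy"], "car": ["vehicle"]}


def expand(query: str) -> str:
    """Append known synonyms of each query word to the query."""
    words = query.lower().split()
    extra = [syn for w in words for syn in SYNONYMS.get(w, [])]
    return " ".join(words + extra)


def embed(text: str) -> Counter:
    """Bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(descriptions: Dict[str, str]) -> Dict[str, Counter]:
    """Map each stored description key to its vector."""
    return {key: embed(text) for key, text in descriptions.items()}


def retrieve(query: str, clip_index: Dict[str, Counter],
             video_index: Dict[str, Counter]) -> Tuple[Intent, Optional[str]]:
    """Route by intent, then return the key with the highest similarity."""
    intent = classify_intent(query)
    if intent is Intent.NO_RETRIEVAL:
        return intent, None
    index = clip_index if intent is Intent.CLIP else video_index
    query_vec = embed(expand(query))
    best = max(index, key=lambda key: cosine(query_vec, index[key]))
    return intent, best


clip_index = build_index({
    "v1#0": "a dog runs on the beach with waves sound",
    "v1#1": "a car drives through the city with engine noise",
})
video_index = build_index({
    "v1": "a dog at the beach then a car in the city",
    "v2": "cooking pasta in a kitchen",
})
```

Keeping the clip-level and video-level indexes separate is what lets the intent decision select the search space: a request for a clip is matched only against comprehensive clip descriptions, and a request for a complete video only against overall descriptions.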

Description

Agent-based video retrieval method

Technical Field

The invention relates to the technical fields of large model prompt word engineering, AI Agents and image processing, and in particular to an Agent-based video retrieval method.

Background

In existing video retrieval methods, the video is segmented during warehousing, and picture descriptions, text information, speech transcripts and audio modality information are extracted and stored separately; during retrieval, keyword extraction and vector conversion are performed on the user query, and retrieval results are then matched. However, the prior art has the following defects: first, it is difficult to distinguish whether the user needs a complete video or a video clip, so the interaction experience is poor; second, audio information extraction is limited to extracting speech or other audio parts in isolation, without close association with the picture information, which reduces retrieval accuracy.

Disclosure of Invention

The invention aims to provide an Agent-based video retrieval method that solves the problems of weak association between audio and picture information and ambiguous user retrieval intention, and improves video retrieval accuracy and user experience.

To solve the above technical problems, the invention provides an Agent-based video retrieval method comprising a video warehousing step and a retrieval step, specifically as follows:

(1) Video warehousing:

a. Time sequence segmentation is performed on the original video to obtain video clips. The segmentation can be implemented by a video splitting model, which segments the original video into video clips of suitable length to facilitate subsequent processing and retrieval;

b. Each video clip is input into a video description module to generate a comprehensive description comprising a picture description and an audio description. Specifically, a video understanding large model analyzes the video clip and outputs the picture description; a prompt word Agent receives the picture description and generates a targeted prompt word that guides an audio understanding large model to extract audio information related to the picture; the audio understanding large model outputs the audio description according to the prompt word; and a large language model merges the picture description and the audio description to obtain the comprehensive description;

c. The large language model summarizes the comprehensive descriptions of all video clips of the same original video to generate an overall description of the original video;

d. The comprehensive descriptions of the video clips and the overall description of the original video are stored separately in a database, and corresponding vector indexes are generated for the two types of description to facilitate subsequent fast retrieval.

(2) Retrieval:

a. A Query input by the user is received and passed to a Query processing Agent;

b. The Query processing Agent analyzes the Query and judges whether the user has a video retrieval requirement and, if so, whether the user needs to retrieve the complete video or a video clip;

c. The database retrieval Agent receives the judgment result of the Query processing Agent and optimizes the user Query, including semantic expansion, vector conversion and similar processing; it then invokes a search tool and performs a similarity search in the database based on the vector indexes to obtain the retrieval result with the highest matching degree;

d. The retrieval result is returned to the user; depending on the judgment result, it can be a video clip, a complete video link or a related answer.

In summary, by adopting the above technical solution, the invention has the following beneficial effects:

1. The prompt word Agent generates the prompt word for the audio understanding large model according to the picture description, so that audio information extraction is closely tied to the picture content, extraction of irrelevant audio information is avoided, and retrieval accuracy and efficiency are improved;

2. The Query processing Agent clarifies the user's retrieval intention and distinguishes between retrieval of complete videos and of video clips, which reduces interaction obstacles between the user and the system and improves the user experience;

3. The comprehensive descriptions of the video clips and the overall description of the original video are stored separately, each with its own vector index; combined with the query optimization of the database retrieval Agent, this further improves the accuracy of the retrieval results and the retrieval efficiency.

Drawings

The accompanying drawings, which ar