CN-121996815-A - Video content retrieval method, device and terminal based on television system voice interaction
Abstract
The invention discloses a video content retrieval method, device, and terminal based on television system voice interaction, relating to the technical field of video processing. When a video is played for the first time, picture frames are extracted from it at a preset frequency, converted into multidimensional image feature vectors with corresponding video timestamps, and stored hierarchically to construct vectorized data containing visual semantic information. A voice retrieval instruction is then acquired; intention recognition and semantic understanding are performed, detection keywords are extracted, and a multidimensional retrieval feature vector is generated. The matching degree between the retrieval feature vector and the stored image feature vectors of the video picture frames is calculated, picture frames whose similarity exceeds a preset similarity threshold are screened out to form a retrieval candidate matching set, and the matched picture is determined and played. The invention provides an efficient, accurate, and highly interactive video content retrieval method that remarkably improves the user's retrieval experience and operation efficiency while watching video.
Inventors
- FAN YANBO
- GUO SHANGFENG
- NIE HAI
- YIN SHUANGSHUANG
Assignees
- 深圳市酷开网络科技股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260119
Claims (10)
- 1. A video content retrieval method based on television system voice interaction, comprising: when a video is played for the first time, extracting picture frames from the video at a preset frequency, performing multidimensional feature analysis to convert the picture frames into multidimensional image feature vectors, storing the multidimensional image feature vectors and their corresponding video timestamps hierarchically in a database, and constructing vectorized data containing visual semantic information; acquiring a voice retrieval instruction, decoding it into a text instruction, performing intention recognition and semantic understanding on the text instruction in combination with the current video playing scene, extracting detection keywords, and generating a multidimensional retrieval feature vector; calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database, and screening out picture frames whose similarity exceeds a preset similarity threshold to form a retrieval candidate matching set; and, based on the retrieval candidate matching set, receiving a user operation instruction and playing a selected matching picture from the set, or by default jumping to the first matching picture for playing.
- 2. The video content retrieval method based on television system voice interaction according to claim 1, wherein extracting picture frames from the video at a preset frequency when the video is played for the first time, performing multidimensional feature analysis, converting the picture frames into multidimensional image feature vectors, storing the multidimensional image feature vectors and corresponding video timestamps hierarchically in a database, and constructing vectorized data containing visual semantic information comprises: when the video is about to be played, judging whether historical processing data exists for the current video; when no historical data exists, starting picture frame extraction, decoding the current video stream in real time with an open-source multimedia processing framework through a multithreading mechanism, and, using a time-sliced sampling mechanism, extracting picture frames from the decoded video stream at a preset time interval; performing color space conversion and size normalization preprocessing on the extracted picture frames; performing multidimensional feature analysis on the preprocessed picture frames with computer vision and deep learning algorithms, vectorizing the picture frames, and converting the picture frame content into multidimensional image feature vectors; and storing the multidimensional image feature vectors and corresponding video timestamps hierarchically in a database, constructing vectorized data containing visual semantic information, and generating a retrieval index.
- 3. The video content retrieval method based on television system voice interaction according to claim 1, wherein acquiring a voice retrieval instruction, decoding it into a text instruction, performing intention recognition and semantic understanding on the text instruction in combination with the current video playing scene, extracting detection keywords, and generating a multidimensional retrieval feature vector comprises: acquiring an analog voice signal carrying a search keyword and converting it into a digital audio signal to obtain the voice retrieval instruction; decoding the voice retrieval instruction into a text instruction; and performing intention recognition and semantic understanding on the decoded text instruction with a natural language processing module, extracting detection keywords in combination with the current video playing scene, and generating a multidimensional retrieval feature vector from the extracted keywords.
- 4. The video content retrieval method based on television system voice interaction according to claim 1, wherein calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database, and screening out picture frames whose similarity exceeds the preset similarity threshold to form the retrieval candidate matching set comprises: receiving the generated multidimensional retrieval feature vector; calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database with a cosine similarity algorithm; screening out picture frames whose similarity exceeds the similarity threshold by setting a dynamic similarity threshold, to form the retrieval candidate matching set; and, if no picture matching the multidimensional retrieval feature vector is found, generating a no-match result signal.
- 5. The video content retrieval method based on television system voice interaction according to claim 1, wherein calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database, and screening out picture frames whose similarity exceeds the preset similarity threshold to form the retrieval candidate matching set further comprises: calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database with a cosine similarity algorithm, the similarity being calculated as: similarity(A, B) = (A · B) / (‖A‖‖B‖); wherein A is the multidimensional retrieval feature vector, B is the multidimensional image feature vector of a video picture frame, and both are 768-dimensional vectors after L2 normalization; and setting a dynamic similarity threshold and screening out the first K picture frames whose similarity exceeds the preset similarity threshold, wherein K is a natural number greater than 1.
- 6. The video content retrieval method based on television system voice interaction according to claim 1, wherein before acquiring the voice retrieval instruction, the method further comprises: pre-entering and registering voiceprint information corresponding to user IDs, setting corresponding user preferences and instruction priorities for the voiceprint information of each user ID, and storing the voiceprint information.
- 7. The video content retrieval method based on television system voice interaction according to claim 6, wherein acquiring a voice retrieval instruction, decoding it into a text instruction, performing intention recognition and semantic understanding on the text instruction in combination with the current video playing scene, extracting detection keywords, and generating a multidimensional retrieval feature vector further comprises: when a voice retrieval instruction is acquired, analyzing the voiceprint characteristics in the instruction, comparing them with the registered voiceprint information corresponding to the user IDs, and identifying the user ID of the user issuing the instruction; decoding the voice retrieval instruction into a text instruction tagged with the user ID; placing the identified text instructions tagged with user IDs into an instruction queue in time order; evaluating the priority of the text instructions in the instruction queue according to a preset rule; when multiple concurrent or nearly concurrent text instructions exist in the queue, selecting, according to the priority evaluation result, the text instruction with the highest priority and passing it to the natural language processing module for intention recognition and semantic understanding; and, for text instructions that do not have the highest priority, temporarily suspending intention recognition and semantic understanding, prompting the user to re-issue the instruction, or processing it after the current instruction is completed.
- 8. A video content retrieval device based on television system voice interaction, the device comprising: a video content processing module, configured to extract picture frames from the video at a preset frequency when the video is played for the first time, perform multidimensional feature analysis, convert the picture frames into multidimensional image feature vectors, and store the multidimensional image feature vectors and corresponding video timestamps hierarchically in a database to construct vectorized data containing visual semantic information; a voice instruction acquisition and semantic analysis module, configured to acquire a voice retrieval instruction, decode it into a text instruction, perform intention recognition and semantic understanding on the text instruction in combination with the current video playing scene, extract detection keywords, and generate a multidimensional retrieval feature vector; a video content retrieval matching module, configured to calculate the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database, and screen out picture frames whose similarity exceeds a preset similarity threshold to form a retrieval candidate matching set; and a playback control module, configured to receive a user operation instruction based on the retrieval candidate matching set and play a selected matching picture from the set, or by default jump to the first matching picture for playing.
- 9. An intelligent terminal comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the steps of the method according to any one of claims 1-7.
- 10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes an electronic device to perform the steps of the method according to any one of claims 1-7.
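The playback-control step of claim 1 (play a user-selected match, or by default jump to the first match) can be sketched as follows. This is a minimal illustration, not the patent's implementation; `Candidate`, `select_playback_target`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    timestamp: float   # video timestamp of the matched frame, in seconds
    similarity: float  # similarity to the multidimensional retrieval vector

def select_playback_target(candidates: list[Candidate],
                           user_choice: Optional[int] = None) -> Optional[float]:
    """Return the timestamp to jump to.

    If the user issued a selection instruction, honour it; otherwise
    default to the first (best-matching) candidate, as in claim 1.
    Returns None when the candidate set is empty (the no-match case).
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    if user_choice is not None and 0 <= user_choice < len(ranked):
        return ranked[user_choice].timestamp
    return ranked[0].timestamp
```

For example, with candidates at 12.0 s and 95.5 s, the default jump goes to the higher-similarity frame, while an explicit user choice overrides it.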
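The time-sliced sampling mechanism of claim 2 amounts to choosing one frame per preset time interval from the decoded stream. A pure-Python sketch of the sample-point computation is shown below; the actual decoding would be done by an open-source multimedia framework (the patent does not name one), and the function name is an assumption.

```python
def time_sliced_sample_points(duration_s: float, interval_s: float) -> list[float]:
    """Timestamps (seconds) at which picture frames are extracted from
    the decoded video stream, one per preset time interval (claim 2)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    points, t = [], 0.0
    while t < duration_s:
        points.append(round(t, 3))
        t += interval_s
    return points
```

Each returned timestamp would then be paired with the feature vector of the frame extracted there, giving the (vector, timestamp) pairs that are stored hierarchically in the database.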
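Claim 3's path from text instruction to retrieval vector can be illustrated with a toy pipeline. Everything here is a stand-in: the stopword list, `extract_keywords`, and the hashing-based `embed` are hypothetical placeholders for the patent's natural language processing module and real text encoder; only the L2-normalised 768-dimensional output shape follows claim 5.

```python
import hashlib
import math

# Illustrative stopword list; a real NLP module would do intention
# recognition and semantic understanding instead.
STOPWORDS = {"jump", "to", "the", "scene", "where", "show", "me", "a", "an"}

def extract_keywords(text_instruction: str) -> list[str]:
    """Crude keyword extraction standing in for the NLP module."""
    return [w for w in text_instruction.lower().split() if w not in STOPWORDS]

def embed(keywords: list[str], dim: int = 768) -> list[float]:
    """Hypothetical hashing embedding: maps keywords to a 768-d
    retrieval vector, L2-normalised as required by claim 5."""
    vec = [0.0] * dim
    for kw in keywords:
        h = int(hashlib.sha256(kw.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
```

A real system would replace `embed` with the same visual-semantic text encoder used to vectorize the picture frames, so that text and image vectors live in a shared space.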
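The matching step of claims 4-5 can be sketched directly from the cosine-similarity formula: with L2-normalised vectors the similarity reduces to a plain dot product, and the candidate set is the top-K frames above the (dynamic) threshold. Function names and the (timestamp, vector) tuple layout are illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """similarity(A, B) = (A . B) / (||A|| ||B||); for L2-normalised
    vectors this reduces to the plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def screen_top_k(query: list[float],
                 frames: list[tuple[float, list[float]]],
                 threshold: float, k: int) -> list[tuple[float, float]]:
    """Keep frames whose similarity exceeds the dynamic threshold and
    return the top-K as (timestamp, similarity) pairs (claims 4-5).
    An empty result corresponds to the no-match signal of claim 4."""
    scored = [(ts, cosine_similarity(query, vec)) for ts, vec in frames]
    hits = [(ts, s) for ts, s in scored if s > threshold]
    hits.sort(key=lambda p: p[1], reverse=True)
    return hits[:k]
```

In practice the database side would use a vector index rather than a linear scan, but the scoring and threshold screening are the same.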
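The concurrent-instruction handling of claims 6-7 (pre-registered per-user priorities, a time-ordered queue, highest priority served first) maps naturally onto a priority queue. This sketch uses Python's `heapq`; the user IDs, priority values, and function names are invented for illustration, and the voiceprint comparison itself is assumed to have already yielded the user ID.

```python
import heapq
import itertools
from typing import Optional

# Pre-registered instruction priorities per user ID (claim 6);
# lower number = higher priority. Values are illustrative.
USER_PRIORITY = {"alice": 0, "bob": 1, "guest": 9}

_counter = itertools.count()  # FIFO tie-break for near-concurrent instructions
_queue: list[tuple[int, int, str, str]] = []

def enqueue(user_id: str, text_instruction: str) -> None:
    """Place a decoded text instruction, tagged with its user ID,
    into the instruction queue in arrival order (claim 7)."""
    prio = USER_PRIORITY.get(user_id, 9)
    heapq.heappush(_queue, (prio, next(_counter), user_id, text_instruction))

def next_instruction() -> Optional[tuple[str, str]]:
    """Pop the highest-priority instruction for the NLP module;
    lower-priority concurrent instructions stay queued until the
    current one is processed (or the user is prompted to re-issue)."""
    if not _queue:
        return None
    _, _, user_id, text = heapq.heappop(_queue)
    return user_id, text
```

The arrival counter ensures that two instructions from the same user (equal priority) are processed in time order, matching the queue's time-sequenced placement in claim 7.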
Description
Video content retrieval method, device and terminal based on television system voice interaction

Technical Field

The present invention relates to the field of video processing technologies, and in particular to a video content retrieval method and apparatus based on voice interaction of a television system, an intelligent terminal, and a storage medium.

Background

With the popularization of smart televisions, users' demands for searching video content are increasingly diverse and convenience-oriented. Existing television video retrieval technology has three shortcomings. First, target video content is located by manually operating a remote controller (fast forward, rewind, frame-by-frame playing, and the like); this is extremely inefficient, costs the user considerable time and effort, and makes it difficult to position the target picture accurately. Second, retrieval through a video's chapter index is possible, but chapter indexes are usually preset by the uploader or producer and are coarse-grained, so they cannot satisfy the user's need to retrieve specific detail pictures (such as a particular scene or a person's action). Third, some smart televisions support voice retrieval, but it searches across different video resources in a video library rather than retrieving content fragments of the currently playing video, so deep interaction with the currently playing video content is not achieved. Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention

In order to solve the above technical problems, the invention provides a video content retrieval method, device, intelligent terminal, and storage medium based on voice interaction of a television system, which address the prior art's lack of voice-interactive retrieval within the currently playing video and its insufficient handling of multiple matching pictures, and provide an efficient, accurate, and highly interactive video content retrieval method, thereby remarkably improving the user's retrieval experience and operation efficiency while watching video.

The technical scheme of the application is as follows: a video content retrieval method based on television system voice interaction comprises the following steps: when a video is played for the first time, extracting picture frames from the video at a preset frequency, performing multidimensional feature analysis, converting the picture frames into multidimensional image feature vectors, and storing the multidimensional image feature vectors and corresponding video timestamps hierarchically in a database to construct vectorized data containing visual semantic information; acquiring a voice retrieval instruction, decoding it into a text instruction, performing intention recognition and semantic understanding on the text instruction in combination with the current video playing scene, extracting detection keywords, and generating a multidimensional retrieval feature vector; calculating the matching degree between the multidimensional retrieval feature vector and the multidimensional image feature vectors of the video picture frames stored in the database, and screening out picture frames whose similarity exceeds a preset similarity threshold to form a retrieval candidate matching set; and, based on the retrieval candidate matching set, receiving a user operation instruction and playing a selected matching picture from the set, or by default jumping to the first matching picture for playing.

In the method, extracting picture frames from the video at a preset frequency when the video is played for the first time, performing multidimensional feature analysis, converting the picture frames into multidimensional image feature vectors, and storing the multidimensional image feature vectors and corresponding video timestamps hierarchically in a database to construct vectorized data containing visual semantic information comprises: when the video is about to be played, judging whether historical processing data exists for the current video; when no historical data exists, starting picture frame extraction, decoding the current video stream in real time with an open-source multimedia processing framework through a multithreading mechanism, and, using a time-sliced sampling mechanism, extracting picture frames from the decoded video stream at a preset time interval; performing color space conversion and size normalization preprocessing