CN-121982618-A - Intelligent video analysis system based on multi-mode AI

Abstract

The invention discloses an intelligent video analysis system based on multi-modal AI, belonging to the technical field of video analysis. The system comprises a video preprocessing module, an IFS and Tracking analysis module, a task arrangement and coordination module, a logo recognition module, a face recognition module, and a structured video abstract generation module. The video preprocessing module extracts analysis frames from an original video segment. The IFS and Tracking analysis module performs cross-frame tracking on the analysis frames and screens important frames. The task arrangement and coordination module relies on Redis for state management and progress quantification; if a parcel or a person appears in an analysis frame, it triggers the corresponding module to perform parcel logo recognition or face recognition. The logo recognition module screens an optimal frame from the analysis frames in which a parcel appears and determines the parcel logo type by combining target detection with GPT dual recognition. The face recognition module performs face detection on the analysis frames in which a person appears and identifies the person. The structured video abstract generation module generates a structured abstract of the original video segment. The system processes video efficiently, improves the recognition accuracy of targets in the video, and facilitates subsequent retrieval and search.

Inventors

LIU XIN
TIAN RUJUN
ZHANG CHENGCAI
WU ERGANG

Assignees

南京迈特望科技股份有限公司

Dates

Publication Date: 20260505
Application Date: 20260402

Claims (10)

1. An intelligent video analysis system based on multi-modal AI, comprising: a video preprocessing module, configured to acquire an original video segment to be analyzed and extract analysis frames from the original video segment; an IFS and Tracking analysis module, configured to perform target detection and cross-frame tracking on each analysis frame, screen important frames from the analysis frames, construct an important frame list, and combine the important frame list, the analysis frame list, and the cross-frame tracking results to generate an analysis file; a task arrangement and coordination module, configured to implement state management and progress quantification by relying on Redis, determine all detection target types of each analysis frame according to the analysis file stored in Redis, and, if a detection target of the analysis frame is a parcel or a person, trigger the corresponding module to perform parcel logo recognition or face recognition; a logo recognition module, configured to perform grouped screening on the analysis frames in which a parcel appears, determine the parcel logo type group by group by combining target detection with GPT dual recognition and conflict processing, and finally integrate and output a logo recognition result file; a face recognition module, configured to perform face detection on the analysis frames in which a person appears so as to recognize the identity of the person, and combine the recognized identity with the face picture to generate a person file; and a structured video abstract generation module, configured to generate a structured abstract of the original video segment according to the analysis file, the logo recognition result file, and the person file.
2. The intelligent video analysis system based on multi-modal AI according to claim 1, wherein the video preprocessing module, in acquiring the original video segment to be analyzed and extracting its analysis frames, is specifically configured to: name the original video segment according to a set format and create a directory structure to store each file; and preprocess the original video segment, extract analysis frames at the configured frame interval, and record the sequence number and timestamp of each analysis frame.
3. The intelligent video analysis system based on multi-modal AI according to claim 1, wherein the IFS and Tracking analysis module, in performing target detection and cross-frame tracking on each analysis frame, screening important frames from the analysis frames, and constructing the important frame list, is specifically configured to: perform target detection on each analysis frame through a first target detection model, and output the type, detection box size, center coordinates, and confidence of each detection target; calculate the overlap between the detection boxes of the current frame and the predicted boxes of the historical tracks by an IoU matching algorithm to construct a cost matrix, and perform target association by a global assignment strategy that minimizes the cost matrix; and convert each analysis frame into a mathematical vector by using a ViT model, calculate the importance score of each analysis frame according to the following formula, screen the analysis frames whose scores exceed a set threshold as important frames, and construct the important frame list; [formula not reproduced in the source text]; wherein the formula is defined over the following quantities, whose symbols are likewise not reproduced: the importance score of the i-th analysis frame; the weight coefficients of similarity and confidence, respectively; the mathematical vector of the i-th analysis frame; the cosine similarity between the mathematical vectors of the i-th analysis frame and a second analysis frame; the set maximum value of the similarity; the average confidence of all detection targets in the i-th analysis frame; and the set maximum value of the confidence.
4. The intelligent video analysis system based on multi-modal AI according to claim 1, wherein the logo recognition module, in performing grouped screening of an optimal frame on the analysis frames in which a parcel appears, determining the parcel logo type group by group by combining target detection with GPT dual recognition and conflict processing, and finally integrating and outputting the logo recognition result file, is specifically configured to: group all analysis frames in which a parcel appears according to a set window; for each group, recognize each analysis frame in the group by using a second target detection model, and output the detection box size, center coordinates, and recognition confidence of each logo region; select an optimal analysis frame from each group as the target frame of the current group by combining the detection box information and recognition confidence of the recognized logo regions with the parcel detection box information, and output the optimal logo region of the target frame; for each target frame, use a GPT model to recognize the parcel detection box region and the optimal logo region separately, and perform conflict processing on the two recognition results to obtain the parcel logo type of the current target frame; and combine the parcel logo types output for all target frames to form the logo recognition result file.
5. The intelligent video analysis system based on multi-modal AI according to claim 4, wherein selecting an optimal analysis frame from each group as the target frame of the current group and outputting the optimal logo region of the target frame comprises: for each analysis frame, analyzing in turn the position of every logo region recognized in the analysis frame; if the logo region overlaps a parcel, taking it as a parcel logo region; if it overlaps a person, taking it as a person logo region; and if it overlaps another target, taking it as an other-target logo region; and judging whether the current group has a parcel logo region; if so, selecting the analysis frame with the highest confidence as the target frame of the current group and taking the parcel logo region as the optimal logo region of the target frame; if not, judging whether the current group has a person logo region; if so, selecting the analysis frame with the highest confidence as the target frame of the current group and taking the person logo region as the optimal logo region of the target frame; and if not, selecting the analysis frame whose other-target logo region has the highest confidence as the target frame of the current group and taking that other-target logo region as the optimal logo region of the target frame.
6. The intelligent video analysis system based on multi-modal AI according to claim 4, wherein, for each target frame, using a GPT model to recognize the parcel detection box region and the optimal logo region separately and performing conflict processing on the two recognition results to obtain the parcel logo type of the current target frame comprises: performing logo type recognition on the parcel detection box region and on the optimal logo region by using the GPT model, taking the recognition result for the parcel detection box region as a first recognition type and the recognition result for the optimal logo region as a second recognition type; and if the first recognition type is unknown, taking the second recognition type as the parcel logo type of the current target frame; otherwise, taking the first recognition type as the parcel logo type of the current target frame.
7. The intelligent video analysis system based on multi-modal AI according to claim 1, wherein the face recognition module, in performing face detection on the analysis frames in which a person appears so as to recognize the identity of the person, is specifically configured to: perform face detection on the analysis frames in which a person appears by using an InsightFace model, and extract face bounding boxes and key points to form a face feature vector; and perform similarity matching between the face feature vector and the known face feature vectors in a known face library, and output the identity of the person when the matching succeeds.
8. The intelligent video analysis system based on multi-modal AI according to claim 1, further comprising a highlight video generation module, configured to screen key frames from the important frame list, extract a video clip for each key frame, and generate a highlight video by merging overlapping clips.
9. The intelligent video analysis system based on multi-modal AI according to claim 8, wherein the highlight video generation module, in screening key frames from the important frame list, extracting a video clip for each key frame, and generating the highlight video by merging overlapping clips, is specifically configured to: sort all important frames in the important frame list in descending order of importance score, select the top-K frames as key frames, and place the key frames in a key frame set; for each key frame in the key frame set, extract from the original video segment a video clip of fixed duration centered on the timestamp $t_k$ of the key frame, the time range of the video clip being $[t_k - T/2,\ t_k + T/2]$, where $T$ is the set extraction duration; and judge whether adjacent video clips overlap according to the following formulas, and if so, merge the two adjacent video clips into a new video clip and repeat the overlap judgment between the new video clip and the other video clips until no video clips overlap, and output all the non-overlapping video clips as the highlight video of the original video segment; $O_{ij} = \min(e_i, e_j) - \max(s_i, s_j)$; $O_{ij} > \tau$; wherein $O_{ij}$ is the overlap duration of video clip $i$ and video clip $j$, $s_i$ and $e_i$ are the start timestamp and end timestamp of video clip $i$, $s_j$ and $e_j$ are the start timestamp and end timestamp of video clip $j$, and $\tau$ is the set overlap duration threshold.
10. The intelligent video analysis system based on multi-modal AI according to claim 9, wherein the highlight video generation module is further configured to correct the key frame set, specifically by: calculating the timestamp coverage rate $C$ of the current key frame set according to the following formula; if the timestamp coverage rate $C$ exceeds a set threshold, leaving the current key frame set unchanged; otherwise, continuing to select a specified number of non-key frames from the sorted important frame list and adding them to the key frame set until the timestamp coverage rate exceeds the set threshold; $C = (t_{\max} - t_{\min}) / T_{\text{total}}$; wherein $T_{\text{total}}$ is the total duration of the original video segment, and $t_{\max}$ and $t_{\min}$ are the maximum timestamp and the minimum timestamp in the key frame set, respectively.
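Illustrative sketch (not part of the claims): the cross-frame association step of claim 3 builds a cost matrix from IoU overlaps and picks the assignment with the lowest total cost. The Python below is a minimal sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the cost is taken as `1 - IoU`, and the `min_iou` gate and all function names are hypothetical. The exhaustive search stands in for a global assignment solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) and is only practical for a handful of boxes.

```python
from itertools import permutations


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou=0.3):
    """Return (track_index, detection_index) pairs minimising total 1 - IoU.

    min_iou is an assumed gate: pairs with less overlap are dropped, so the
    track is treated as unmatched in this frame.
    """
    overlap = [[iou(t, d) for d in det_boxes] for t in track_boxes]
    n_t, n_d = len(track_boxes), len(det_boxes)
    tracks_are_fewer = n_t <= n_d
    small, large = (n_t, n_d) if tracks_are_fewer else (n_d, n_t)

    best_pairs, best_cost = [], float("inf")
    # Try every one-to-one assignment of the smaller side to the larger side.
    for perm in permutations(range(large), small):
        if tracks_are_fewer:
            pairs = [(i, perm[i]) for i in range(small)]
        else:
            pairs = [(perm[i], i) for i in range(small)]
        total = sum(1.0 - overlap[t][d] for t, d in pairs)
        if total < best_cost:
            best_cost, best_pairs = total, pairs
    return sorted((t, d) for t, d in best_pairs if overlap[t][d] >= min_iou)
```

For example, `associate([(0, 0, 10, 10), (20, 20, 30, 30)], [(21, 21, 31, 31), (1, 1, 11, 11)])` pairs each track with the detection that has shifted by one pixel.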
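Illustrative sketch (not part of the claims): claims 5 and 6 describe a priority rule for choosing each group's target frame (parcel logo region first, then person, then other targets) and a rule for resolving the two GPT results. The Python below is one possible reading, not the patented implementation. The data layout, the function names, and the literal `"unknown"` are hypothetical, "highest confidence" is assumed to mean the logo region's recognition confidence, and logo regions that overlap no detected target are assumed to be ignored.

```python
PRIORITY = ("parcel", "person", "other")


def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes intersect with positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def classify_logo(logo_box, targets):
    """Label a logo region by the kind of detected target it overlaps."""
    kinds = {t["type"] for t in targets if boxes_overlap(logo_box, t["box"])}
    if "parcel" in kinds:
        return "parcel"
    if "person" in kinds:
        return "person"
    # Assumption: a region overlapping no target at all is ignored.
    return "other" if kinds else None


def select_target_frame(group):
    """Pick (frame_id, kind, logo) for one group of analysis frames.

    Each frame is a dict with "frame_id", "targets" (dicts with "type" and
    "box") and "logos" (dicts with "box" and "conf").
    """
    candidates = {kind: [] for kind in PRIORITY}
    for frame in group:
        for logo in frame["logos"]:
            kind = classify_logo(logo["box"], frame["targets"])
            if kind is not None:
                candidates[kind].append((frame["frame_id"], logo))
    # Walk the priority order and take the most confident region in the
    # first non-empty class.
    for kind in PRIORITY:
        if candidates[kind]:
            frame_id, logo = max(candidates[kind], key=lambda c: c[1]["conf"])
            return frame_id, kind, logo
    return None


def resolve_logo_type(first_type, second_type):
    """Claim 6 rule: keep the parcel-box result unless it is unknown."""
    return second_type if first_type == "unknown" else first_type
```

Under this reading a low-confidence parcel logo region still wins over a high-confidence person logo region, because the class priority is checked before confidence.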
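Illustrative sketch (not part of the claims): the highlight steps of claims 9 and 10 (top-K key frames, a coverage check, fixed-length clips, and merging of overlapping clips) can be sketched in Python as follows. Several details are assumptions: the clip is taken as half the extraction duration on each side of the key-frame timestamp, clips are clamped to the video bounds, coverage is the key-frame timestamp span divided by the total duration, and the overlap test is strict. All function and parameter names are hypothetical.

```python
def coverage(timestamps, video_length):
    """Span of the key-frame timestamps as a fraction of the video length."""
    if not timestamps:
        return 0.0
    return (max(timestamps) - min(timestamps)) / video_length


def select_key_frames(important, k, video_length, min_coverage, step=1):
    """Top-K by score, then add `step` more frames at a time until the
    coverage exceeds min_coverage or the list is exhausted.

    important: list of (timestamp, importance_score) tuples.
    """
    ranked = sorted(important, key=lambda f: f[1], reverse=True)
    count = min(k, len(ranked))
    while (count < len(ranked)
           and coverage([t for t, _ in ranked[:count]], video_length) <= min_coverage):
        count = min(count + step, len(ranked))
    return [t for t, _ in ranked[:count]]


def clip_ranges(timestamps, duration, video_length):
    """One fixed-duration clip centred on each timestamp, clamped to the video."""
    half = duration / 2.0
    return sorted((max(0.0, t - half), min(video_length, t + half))
                  for t in timestamps)


def merge_clips(clips, overlap_threshold=0.0):
    """Merge neighbouring clips whose overlap duration exceeds the threshold.

    Clips here share one duration, so after sorting by start time the most
    recently merged clip always has the latest end time and a single pass
    is enough.
    """
    merged = []
    for start, end in sorted(clips):
        if merged:
            prev_start, prev_end = merged[-1]
            overlap = min(prev_end, end) - max(prev_start, start)
            if overlap > overlap_threshold:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

A typical call chain would be `merge_clips(clip_ranges(select_key_frames(frames, k, length, 0.5), 6.0, length), 1.0)`, where the numeric values are placeholders rather than values taken from the patent.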

Description

Intelligent video analysis system based on multi-modal AI

Technical Field

The invention belongs to the technical field of video analysis, and particularly relates to an intelligent video analysis system based on multi-modal AI.

Background

Video analysis in current access control systems struggles to meet the practical requirements of intelligence, efficiency, and accuracy, and mainly has the following defects:

(1) Low processing efficiency. Existing video analysis usually adopts uniform frame extraction or simple frame differencing, and traversing all video frames produces a large amount of invalid computation, which wastes resources, delays responses, and cannot meet the real-time requirements of access control.

(2) Single analysis strategy. Existing approaches rely on a single model or on full-frame analysis and lack a windowed screening strategy, so invocation costs are high and function control is inflexible; the lack of an explicit confidence fusion method also degrades recognition accuracy.

(3) Insufficient coordination between modules. Each functional module runs independently, data does not flow smoothly between modules, and unified state management and progress tracking are lacking, so data reuse and process automation cannot be achieved.

Disclosure of Invention

The invention provides an intelligent video analysis system based on multi-modal AI, which can process video efficiently, improve the recognition accuracy of targets in the video, and facilitate subsequent retrieval and search.

The invention provides the following technical solution: an intelligent video analysis system based on multi-modal AI, comprising: a video preprocessing module, configured to acquire an original video segment to be analyzed and extract analysis frames from the original video segment; an IFS and Tracking analysis module, configured to perform target detection and cross-frame tracking on each analysis frame, screen important frames from the analysis frames, construct an important frame list, and combine the important frame list, the analysis frame list, and the cross-frame tracking results to generate an analysis file; a task arrangement and coordination module, configured to implement state management and progress quantification by relying on Redis, determine all detection target types of each analysis frame according to the analysis file stored in Redis, and, if a detection target of the analysis frame is a parcel or a person, trigger the corresponding module to perform parcel logo recognition or face recognition; a logo recognition module, configured to perform grouped screening on the analysis frames in which a parcel appears, determine the parcel logo type group by group by combining target detection with GPT dual recognition and conflict processing, and finally integrate and output a logo recognition result file; a face recognition module, configured to perform face detection on the analysis frames in which a person appears so as to recognize the identity of the person, and combine the recognized identity with the face picture to generate a person file; and a structured video abstract generation module, configured to generate a structured abstract of the original video segment according to the analysis file, the logo recognition result file, and the person file.
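As an illustration of the task arrangement and coordination module described above, the Python sketch below reads the detection target types of each analysis frame, triggers logo or face recognition accordingly, and records state and progress. It is a sketch under stated assumptions, not the patented implementation: `InMemoryStore` is a hypothetical in-process stand-in for Redis (its `hset`/`hget` methods only imitate the shape of Redis hash commands), and the analysis-file layout, key names, and handler interface are invented for the example.

```python
class InMemoryStore:
    """Hypothetical stand-in for a Redis hash store, used only in this sketch."""

    def __init__(self):
        self.data = {}

    def hset(self, key, field, value):
        self.data.setdefault(key, {})[field] = value

    def hget(self, key, field):
        return self.data.get(key, {}).get(field)


def plan_tasks(analysis):
    """Map each recognition module to the analysis frames that should trigger it."""
    tasks = {"logo": [], "face": []}
    for frame in analysis["frames"]:
        types = {d["type"] for d in frame["detections"]}
        if "parcel" in types:
            tasks["logo"].append(frame["frame_id"])
        if "person" in types:
            tasks["face"].append(frame["frame_id"])
    return tasks


def run_pipeline(video_id, analysis, handlers, store):
    """Run the triggered tasks and record state and progress in the store.

    handlers maps "logo" and "face" to callables that take a frame id.
    """
    tasks = plan_tasks(analysis)
    total = sum(len(frame_ids) for frame_ids in tasks.values())
    done = 0
    store.hset(video_id, "state", "running")
    store.hset(video_id, "progress", 0.0)
    for module, frame_ids in tasks.items():
        for frame_id in frame_ids:
            handlers[module](frame_id)
            done += 1
            # Progress is quantified as the fraction of finished tasks.
            store.hset(video_id, "progress", done / total)
    store.hset(video_id, "state", "done")
    store.hset(video_id, "progress", 1.0)
    return tasks
```

Keeping the store behind two small methods means a real Redis client could replace the stand-in without changing `run_pipeline`, provided it exposes equivalent hash operations.
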
Optionally, the video preprocessing module, in acquiring the original video segment to be analyzed and extracting its analysis frames, is specifically configured to: name the original video segment according to a set format and create a directory structure to store each file; and preprocess the original video segment, extract analysis frames at the configured frame interval, and record the sequence number and timestamp of each analysis frame.

Optionally, the IFS and Tracking analysis module, in performing target detection and cross-frame tracking on each analysis frame, screening important frames from the analysis frames, and constructing the important frame list, is specifically configured to: perform target detection on each analysis frame through a first target detection model, and output the type, detection box size, center coordinates, and confidence of each detection target; calculate the overlap between the detection boxes of the current frame and the predicted boxes of the historical tracks by an IoU matching algorithm to construct a cost matrix, and perform target association by a global assignment strategy that minimizes the cost matrix; and convert each analysis frame into a mathematical vector by using a ViT model, calculate the importance score of each analysis frame according to the following formula, screen the analysis frames whose scores exceed a set threshold as important frames, and construct the important frame list;