
CN-122027863-A - Automatic video processing system based on a multimodal large model

CN122027863A

Abstract

The invention discloses a system and method for automatically processing marathon event video based on a multimodal AI large model. The system adopts a distributed microservice architecture comprising a coordination service, a recognition service, and a composition service. The coordination service receives event information and video data and distributes tasks through a message queue; the recognition service performs runner bib-number recognition on the video stream using the Qwen2.5-VL multimodal model and extracts per-runner video clips; the composition service combines these clips with event materials into personalized Vlog videos, overlaying the runner's result information at the end of each clip. The system supports two modes, finish-line camera triggering and post-race centralized composition, balancing real-time processing with completeness. The invention achieves intelligent recognition, automatic editing, and personalized composition of marathon event video, markedly improving video processing efficiency and user experience.
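The abstract's two trigger modes (finish-line camera triggering vs. post-race centralized composition) amount to a routing decision per recognition event. A minimal sketch under assumed conventions; `route_recognition`, `compose_now`, `defer`, and the camera IDs are illustrative names, not from the patent:

```python
from dataclasses import dataclass

# Camera IDs treated as finish-line positions; hypothetical example values.
FINISH_LINE_CAMERAS = {"cam-finish-1", "cam-finish-2"}

@dataclass
class Recognition:
    bib_number: str   # runner bib recognized in the video stream
    camera_id: str    # camera position that produced the clip
    clip_path: str    # extracted clip for this runner

def route_recognition(rec: Recognition, compose_now, defer):
    """Route one recognition event to one of the two composition modes:
    finish-line camera -> trigger per-runner composition immediately;
    any other camera   -> defer until post-race centralized composition."""
    if rec.camera_id in FINISH_LINE_CAMERAS:
        compose_now(rec.bib_number)
        return "immediate"
    defer(rec.bib_number, rec.clip_path)
    return "deferred"
```

In the patented system this decision would be made by the recognition service before publishing a composition task to the message queue.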

Inventors

  • TANG YING
  • LI SHANHE

Assignees

  • 江苏寒武纪信息科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (10)

  1. A distributed AI video processing system based on a microservice architecture, comprising: (1) a coordination service module for receiving event initialization information and video files from an external system through a REST API, distributing video processing tasks to a message queue, and providing task-state query functions; (2) a recognition service module for consuming video processing tasks from the message queue, analyzing the video frame by frame with a multimodal AI model to identify a target object, and generating video clips containing the target object; (3) a composition service module for consuming video composition tasks from the message queue and composing personalized target videos from the clips generated by the recognition service module and preset event materials; (4) a message queue module for asynchronous message transmission and task distribution among the coordination, recognition, and composition service modules; (5) a cache database module for storing event information, task states, video clip metadata, and system configuration.
  2. The system according to claim 1, wherein the recognition service module is specifically configured to: (1) load and run the Qwen2.5-VL multimodal large model; (2) perform dynamic frame skipping on the input video and extract video frames; (3) input the extracted frames into the Qwen2.5-VL multimodal large model to recognize the runner bib numbers they contain; (4) extract, based on each recognized bib number and its timestamp, the corresponding runner video clip from the original video.
  3. The system according to claim 1, wherein the composition service module is specifically configured to: (1) obtain, according to the composition task, all video clip information for the designated runner from the cache database module; (2) obtain pre-stored event materials, including an opening video, a closing video, and background music; (3) use FFmpeg to splice the opening video, the designated runner's video clips, and the closing video in time order, mix in the background music, and generate a preliminary composite video; (4) overlay the runner's personal information and/or an event watermark on the preliminary composite video to generate the final personalized target video.
  4. The system of claim 3, wherein the composition service module is further configured to: (1) obtain the designated runner's race result during composition by calling an external result-query API; (2) overlay the result information as text at a designated position and during a designated time interval of the final personalized target video.
  5. The system according to claim 1, characterized in that the system further comprises a finish-line trigger mechanism, specifically: (1) the finish-line camera positions of the event are predefined; (2) when the recognition service module processes video from a finish-line camera and recognizes a target object, a video composition flow for that target object is triggered.
  6. The system according to claim 1, characterized in that the system further comprises a centralized composition mechanism, specifically: (1) after an event-end signal is detected or a batch composition instruction is received, the coordination service module scans the cache database module and filters out all target objects whose video composition is incomplete; (2) a video composition task is generated for each such target object, and the tasks are sent to the message queue module in batches; (3) the composition service module processes the batched composition tasks in parallel and generates personalized target videos for all target objects.
  7. The system of claim 1, wherein the event configuration information stored in the cache database module includes a list of camera positions allowed for half-marathon runner composition and a list of camera positions allowed for full-marathon runner composition, and wherein, when composing a video, the composition service module selects only video clips from the positions allowed for the corresponding race type.
  8. A distributed AI video processing method based on a microservice architecture, applied to the system of any of claims 1-7, the method comprising: (1) receiving, through the coordination service, event initialization information and video files sent by an external system; (2) publishing video processing tasks to the message queue through the coordination service; (3) consuming video processing tasks from the message queue through the recognition service, performing AI recognition, generating video clips, and publishing video composition tasks to the message queue; (4) consuming video composition tasks from the message queue through the composition service, performing video composition, and outputting the final target videos; (5) the coordination, recognition, and composition services completing state synchronization and data transfer through the cache database and the message queue.
  9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of claim 8 when executing the program.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method according to claim 8.
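Claims 3 and 4 describe composition with FFmpeg: splice an opening video, the runner's clips, and a closing video in time order, mix in background music, and overlay the result as text. A minimal sketch of building such an `ffmpeg` command line, assuming all parts share compatible codecs and resolutions (file names and the `build_compose_command` helper are illustrative; a production pipeline would also normalize inputs and escape the overlay text):

```python
def build_compose_command(opening, clips, closing, bgm, out_path, score_text=None):
    """Build an ffmpeg argv that concatenates opening + clips + closing,
    uses the background music as the audio track, and optionally overlays
    the runner's result as text near the bottom of the frame."""
    parts = [opening, *clips, closing]
    n = len(parts)
    cmd = ["ffmpeg", "-y"]
    for p in parts + [bgm]:
        cmd += ["-i", p]
    # Concatenate the n video inputs into one stream labelled [v].
    fc = "".join(f"[{i}:v]" for i in range(n)) + f"concat=n={n}:v=1:a=0[v]"
    out_label = "[v]"
    if score_text is not None:
        # drawtext overlay at bottom-center; restricting it to a designated
        # time interval (claim 4) would add enable='between(t,start,end)'.
        fc += f";[v]drawtext=text='{score_text}':x=(w-text_w)/2:y=h-80[vout]"
        out_label = "[vout]"
    cmd += ["-filter_complex", fc,
            "-map", out_label,    # composed video stream
            "-map", f"{n}:a",     # background music as the audio track
            "-shortest", out_path]
    return cmd
```

The command is returned rather than executed, so the composition service could log it, queue it, or run it with `subprocess.run`.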

Description

Automatic video processing system based on a multimodal large model

Technical Field

The invention relates to the fields of computer technology and artificial-intelligence applications, and in particular to a distributed AI video processing system and method based on a microservice architecture and AI technology, particularly suited to automatic runner recognition and personalized Vlog video generation at large-scale marathon events.

Background

With the popularization of large-scale sports events such as marathons, contestants increasingly demand personal video records of their races. Traditional video production relies mainly on manual shooting and post-event editing; it is inefficient and costly, and cannot deliver personalization at scale. Existing automated video processing schemes typically suffer from the following drawbacks:

  1. Limited processing capacity: a monolithic architecture struggles to meet the concurrent processing demands of the massive video data generated during an event, and scales poorly.
  2. Insufficient recognition accuracy and efficiency: traditional image recognition delivers low bib-number recognition accuracy and low processing speed in complex scenes (e.g. lighting changes, runner occlusion, soiled bibs).
  3. Rigid workflow: the composition process is fixed, cannot be triggered intelligently by race progress (e.g. whether a runner has reached the finish line), and cannot dynamically integrate personalized information such as a runner's race result.
  4. High coupling: the processing modules are tightly coupled, so a failure in any one link can bring down the whole system, reducing reliability.
  5. Low resource utilization: computing resources (e.g. GPU and CPU) cannot be flexibly scheduled and efficiently used according to the differing demands of tasks such as recognition and composition.

Thus, there is a strong need in the art for an automatic video processing solution that is accurate, efficient, highly available, scalable, and intelligent.

Disclosure of Invention

1. Object of the invention

The invention aims to overcome the defects of the prior art by providing a distributed AI video processing system and method that address the problems of low video processing efficiency, poor recognition accuracy, insufficient system scalability, and the inability to provide personalized video content in large-scale event scenarios.

2. Technical solution

To achieve this object, the invention adopts the following technical solution:

  1. A distributed AI video processing system, characterized in that the system adopts a microservice architecture comprising a coordination service module, a recognition service module, and a composition service module, which are decoupled and communicate through a message queue and a cache service.
  2. The coordination service module serves as the system's unified entry point: it receives event initialization information and video data from external systems, is responsible for scheduling and distributing tasks and monitoring their state, and exposes an external interface through a REST API.
  3. The recognition service module is deployed on computing nodes equipped with GPUs; it consumes video processing tasks from the message queue, loads a multimodal AI model to analyze the input video frame by frame, recognizes runners and their bib numbers, and extracts video clips containing specific runners.
  4. The composition service module consumes video composition tasks from the message queue; from the recognized runner video clips, preset event materials, and runner result information obtained from an external system, it composes personalized runner Vlog videos through the video processing engine and uploads them to the object storage system.

The system further comprises:

  1. Message middleware, which uses Kafka to transfer task messages asynchronously among the coordination, recognition, and composition service modules, achieving service decoupling and load balancing.
  2. A database, which uses Redis to store event metadata, task status, runner information, video clip metadata, and system configuration, providing high-speed data access.
  3. A dynamic triggering and batch composition mechanism, characterized in that the system is provided with a list of finish-line camera positions, and when the recognition service processes video from a finish-line camera and recognizes a runner, the
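The pipeline this section describes (coordination publishing tasks to a queue, recognition consuming them and emitting composition tasks, composition producing final videos) can be sketched with Python's in-memory `queue.Queue` standing in for the Kafka topics; all function and field names here are illustrative, not from the patent:

```python
import queue

# In-memory queues standing in for the two Kafka topics implied by the text.
recognition_tasks = queue.Queue()   # coordination -> recognition
composition_tasks = queue.Queue()   # recognition  -> composition

def coordination_publish(video_file, camera_id):
    """Coordination service: accept an uploaded video and dispatch a task."""
    recognition_tasks.put({"video": video_file, "camera": camera_id})

def recognition_worker(recognize_bibs):
    """Recognition service: consume one task, run bib recognition on the
    video (the model call is injected for this sketch), and emit one
    composition task per recognized runner."""
    task = recognition_tasks.get()
    for bib in recognize_bibs(task["video"]):
        composition_tasks.put({"bib": bib, "clip": task["video"]})

def composition_worker(compose):
    """Composition service: consume one task and produce the final video."""
    task = composition_tasks.get()
    return compose(task["bib"], task["clip"])
```

Because the services only share the queues (and, in the real system, the Redis cache), any one of them can be restarted or scaled out independently, which is the decoupling property the description claims for the Kafka-based design.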