
CN-121603748-B - Virtual-real fusion short video generation method and system

CN121603748B

Abstract

The application discloses a virtual-real fusion short video generation method and system. The method comprises: obtaining a live data stream of a shooting site; for any item of live data, determining corresponding personalized content based on its image frames and audio frames; generating corresponding virtual picture information and event labels based on the character objects shown in the image frames in combination with the personalized content; for each virtual view angle, determining a corresponding virtual video stream based on the virtual pictures corresponding to all live data over the whole shooting process; screening corresponding video clips from each virtual video stream according to the event labels and configuring corresponding background music and special effects for them; and generating a personalized short video from the obtained video clips. Because the virtual video streams and event labels are generated from personalized content, and the event labels are used to screen the matching video clips, the method can generate personalized short videos at low cost and high quality.

Inventors

  • LI DAPING
  • LI JUNJUN
  • SONG SHIEN

Assignees

  • 湖南快乐阳光互动娱乐传媒有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (8)

  1. A virtual-real fusion short video generation method, characterized by comprising the following steps: obtaining a live data stream of a shooting site, wherein any item of live data comprises an image frame and an audio frame; for any item of live data, identifying character object information and prop object information in the image frame; determining corresponding user intention information based on the user voice shown in the audio frame; determining corresponding personalized content based on the character object information, the prop object information and the user intention information; generating corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content, wherein the virtual picture information comprises virtual pictures under a plurality of virtual view angles, the plurality of virtual view angles comprise a plurality of shooting lenses of a virtual camera, and the plurality of shooting lenses comprise a host-position lens, a close-up lens and a panoramic lens; for each virtual view angle, determining a corresponding virtual video stream based on the virtual pictures corresponding to all live data over the whole shooting process, wherein the audio playing content of the virtual video streams of the virtual view angles is consistent; screening corresponding video clips from each virtual video stream according to the event labels, and configuring corresponding background music and special effects for the video clips, wherein the time stamps of the screened video clips differ from one another and the video clips can be combined into a video stream; and generating a personalized short video based on the video clips; wherein generating corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content comprises: generating a corresponding real-person character in a virtual scene based on the character object shown in the image frame; driving the real-person character to execute corresponding actions based on the character object information shown in the personalized content, and generating a corresponding virtual special effect from the virtual scene; adding a corresponding virtual prop to the real-person character based on the prop object information shown in the personalized content; determining the scene type and effects of the virtual scene based on the user intention information shown in the personalized content; generating the corresponding virtual picture information based on the virtual scene, the real-person character, the virtual special effect and the virtual prop; and determining a corresponding event label based on the character object information and the user intention information.
  2. The method of claim 1, wherein identifying the character object information in the image frame comprises: obtaining skeletal key points of the character object shown in the current image frame; judging whether the current image frame is a key frame according to the skeletal key points of the previous image frame; if the current image frame is a key frame, determining an action tag of the current image frame based on a preset indexed action library, and determining bone data of the current image frame based on the skeletal key points of the current image frame; if the current image frame is not a key frame, determining the action tag of the current image frame based on the action tag of the previous key frame, and determining the bone data of the current image frame based on the skeletal key points of the current image frame; and determining the character object information of the current image frame based on the action tag and the bone data of the current image frame.
  3. The method of claim 2, wherein determining the action tag of the current image frame based on a preset indexed action library comprises: converting the coordinates of the skeletal key points of the current image frame from the image coordinate system into a coordinate system that takes the human hip as the origin and the body height as a normalization scale, to obtain a corresponding feature; calculating the distance between the feature and each of a plurality of product quantization indexes in the preset indexed action library, wherein the indexed action library comprises product quantization indexes of a plurality of sample action tags; and determining the action tag of the current image frame based on the sample action tag to which the target product quantization index with the minimum distance belongs.
  4. The method of claim 1, wherein identifying the prop object information in the image frame comprises: judging whether the previous image frame has corresponding prop object information; if the previous image frame has no corresponding prop object information, performing target detection on the current image frame to obtain a prop area, and extracting the image block corresponding to the prop area; using the image block as input to a classification model to obtain a corresponding prop type; determining the prop object corresponding to the current image frame based on the standard prop image corresponding to the prop type; determining the pose of the prop object based on the feature-point matching result between the standard prop image and the image block; determining the prop object information of the current image frame based on the prop object and its pose; and if the previous image frame does have corresponding prop object information, determining the prop object information of the current image frame from the prop object information of the previous image frame in combination with a filtering tracking algorithm.
  5. The method of claim 1, wherein generating corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content comprises: acquiring in advance the shooting mode adopted at the shooting site; if the shooting mode is a professional mode, determining a transparency mask of the image frame based on the chromaticity distance between each pixel in the image frame and a background reference color, in combination with a transparency calculation function; if the shooting mode is a common mode, inputting the image frame into a preset semantic segmentation network model to obtain the transparency mask of the image frame; extracting the character object in the image frame from the background by using the transparency mask; and generating the corresponding virtual picture information and event labels based on the character object in combination with the personalized content.
  6. The method of claim 1, wherein screening corresponding video clips from each virtual video stream according to the event labels, and configuring corresponding background music and special effects for the video clips, comprises: determining basic information of a plurality of interaction events based on the event labels of the image frames, wherein the basic information comprises an occurrence time stamp, an event type and event parameters; acquiring, based on the occurrence time stamp of each interaction event, a candidate segment set corresponding to that interaction event from the virtual video streams, wherein the candidate segment set comprises candidate segments under different virtual view angles in the same time period; for each interaction event, acquiring from the candidate segment set the candidate segment matching its event type, based on a preset mapping relation between event types and virtual view angles, and taking that candidate segment as the video clip corresponding to the interaction event; and configuring corresponding background music and special effects for the video clip based on the event parameters of the interaction event.
  7. A virtual-real fusion short video generation system, characterized by comprising: a shooting data acquisition unit, configured to acquire a live data stream of a shooting site; a personalized content determination unit, configured to determine, for any item of live data, corresponding personalized content based on the image frame and the audio frame; a virtual picture generation unit, configured to generate corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content, wherein the virtual picture information comprises virtual pictures under a plurality of virtual view angles; a virtual video determination unit, configured to determine, for each virtual view angle, a corresponding virtual video stream based on the virtual pictures corresponding to all live data over the whole shooting process, wherein the audio playing content of the virtual video streams of the virtual view angles is consistent; a video clip processing unit, configured to screen corresponding video clips from each virtual video stream according to the event labels and to configure corresponding background music and special effects for the video clips, wherein the time stamps of the screened video clips differ from one another and the video clips can be combined into a video stream; and a short video synthesis unit, configured to generate a personalized short video based on the obtained video clips; wherein the personalized content determination unit is specifically configured to identify character object information and prop object information in the image frame, determine corresponding user intention information based on the user voice shown in the audio frame, and determine corresponding personalized content based on the character object information, the prop object information and the user intention information; and the virtual picture generation unit is specifically configured to generate a corresponding real-person character in a virtual scene based on the character object shown in the image frame, drive the real-person character to execute corresponding actions based on the character object information shown in the personalized content, generate a corresponding virtual special effect from the virtual scene, add a corresponding virtual prop to the real-person character based on the prop object information shown in the personalized content, determine the scene type and effects of the virtual scene based on the user intention information shown in the personalized content, generate the corresponding virtual picture information based on the virtual scene, the real-person character, the virtual special effect and the virtual prop, and determine a corresponding event label based on the character object information and the user intention information.
  8. An electronic device, characterized by comprising a processor, a memory and a bus, wherein the processor is connected with the memory through the bus; the memory is configured to store a program, and the processor is configured to run the program, wherein when the program is executed by the processor, the virtual-real fusion short video generation method according to any one of claims 1 to 6 is performed.
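As an illustration of the action-tag lookup in claim 3, the sketch below re-expresses skeleton key points in a coordinate system with the hip as origin and body height as normalization scale, then matches the resulting feature against an action library. This is a minimal sketch under assumptions: the patent uses product quantization indexes, while this version approximates the lookup with a brute-force nearest-neighbour search over flat index vectors, and all function and parameter names are hypothetical.

```python
import numpy as np

def action_tag(keypoints, hip_idx, head_idx, index_vectors, index_tags):
    """Return the sample action tag whose index vector is closest to the
    normalized skeleton feature of the current frame (illustrative only)."""
    pts = np.asarray(keypoints, dtype=float)          # (N, 2) image coordinates
    hip = pts[hip_idx]
    height = np.linalg.norm(pts[head_idx] - hip) or 1.0
    feature = ((pts - hip) / height).ravel()          # hip origin, height-normalized
    # Brute-force stand-in for the product-quantization index search
    dists = np.linalg.norm(index_vectors - feature, axis=1)
    return index_tags[int(np.argmin(dists))]          # tag with minimum distance
```

A real implementation would replace the brute-force search with a product-quantization index (subvector codebooks plus an asymmetric distance computation) so the library scales to many sample actions.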

Description

Virtual-real fusion short video generation method and system

Technical Field

The application relates to the technical field of video, and in particular to a virtual-real fusion short video generation method and system.

Background

With the vigorous development of the tourism industry, check-in with new technologies has become an important way for tourists to experience and share. However, the existing check-in technologies have obvious limitations: (1) The form is single and personalization is insufficient: current mainstream technologies focus on generating static photos, such as simple composites or filters, and cannot generate dynamic, narrative short video content, so the user experience is severely homogenized and lacks uniqueness. (2) The video production threshold and hardware cost are high: producing high-quality virtual-real fusion video generally requires a professional green-screen studio, motion capture equipment, and expensive post-production editing software and labor, which ordinary tourists and small scenic spots can hardly afford. (3) Interaction is weak and the experience is passive: most prior art is one-way shoot-then-watch; users cannot interact with the virtual content in real time and in depth (for example, driving scene changes by actions or voice), so immersion is poor. (4) Reliance on manpower is heavy and efficiency is low: even in some high-end experiences, the creative design, shooting and editing of the video still depend heavily on professional staff, so automated, large-scale personalized content generation cannot be achieved and real-time requirements cannot be met.

Disclosure of Invention

The application provides a virtual-real fusion short video generation method and system, aiming to generate low-cost, high-quality short videos for tourist shooting sites.
In order to achieve the above object, the application provides the following technical solutions. A virtual-real fusion short video generation method comprises the following steps: obtaining a live data stream of a shooting site, wherein any item of live data comprises an image frame and an audio frame; for any item of live data, determining corresponding personalized content based on the image frame and the audio frame; generating corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content, wherein the virtual picture information comprises virtual pictures under a plurality of virtual view angles; for each virtual view angle, determining a corresponding virtual video stream based on the virtual pictures corresponding to all live data over the whole shooting process; screening corresponding video clips from each virtual video stream according to the event labels, and configuring corresponding background music and special effects for the video clips; and generating a personalized short video based on the obtained video clips.
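The clip-screening step in the method above, matching each interaction event to a candidate segment through a preset mapping from event type to virtual view angle (as elaborated in claim 6), could be sketched as follows. The event types, view-angle names and data layout are illustrative assumptions, not values taken from the patent.

```python
# Preset mapping from event type to preferred virtual view angle
# (hypothetical entries; the patent does not enumerate concrete types).
EVENT_VIEW_MAP = {
    "group_dance": "panoramic",
    "solo_gesture": "close_up",
    "spoken_line": "host_position",
}

def screen_clips(events, candidates):
    """events: list of {'type', 'timestamp'} dicts, one per interaction event.
    candidates: {timestamp: {view_angle: clip_id}} -- the candidate segments
    under different virtual view angles for the same time period.
    Returns the clip matching each event's preferred view angle."""
    selected = []
    for event in events:
        view = EVENT_VIEW_MAP.get(event["type"])
        segment_set = candidates.get(event["timestamp"], {})
        if view in segment_set:
            selected.append(segment_set[view])
    return selected
```

Because every virtual view angle carries the same audio content, clips selected from different streams at disjoint time stamps can still be concatenated into one coherent video, which is what makes this per-event view switching possible.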
A virtual-real fusion short video generation system comprises: a shooting data acquisition unit, configured to acquire a live data stream of a shooting site; a personalized content determination unit, configured to determine, for any item of live data, corresponding personalized content based on the image frame and the audio frame; a virtual picture generation unit, configured to generate corresponding virtual picture information and event labels based on the character object shown in the image frame in combination with the personalized content, wherein the virtual picture information comprises virtual pictures under a plurality of virtual view angles; a virtual video determination unit, configured to determine, for each virtual view angle, a corresponding virtual video stream based on the virtual pictures corresponding to all live data over the whole shooting process; a video clip processing unit, configured to screen corresponding video clips from the virtual video streams according to the event labels and to configure corresponding background music and special effects for the video clips; and a short video synthesis unit, configured to generate a personalized short video based on the obtained video clips. An electronic device comprises a processor, a memory and a bus, wherein the processor is connected with the memory through the bus; the memory is configured to store a program, and the processor is configured to run the program, wherein when the program is executed by the processor, the virtual-real fusion short video generation method is performed. According to the above technical scheme, a live data stream of a shooting site is obtained; for any item of live data, corresponding personalized content is determined based on the image frame and the audio frame; corresponding virtual picture information and event labels are generated based on the character objects shown in the image frames in combination with the personalized content; for each virtual view angle, a corresponding virtual video stream is determined based on the virtual pictures corresponding to all live data over the whole shooting process; corresponding video clips are screened from each virtual video stream according to the event labels and configured with background music and special effects; and a personalized short video is generated from the obtained video clips, so that low-cost, high-quality short video generation can be achieved.
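For the professional-mode matting step mentioned above (claim 5 determines a transparency mask from the chromaticity distance between each pixel and a background reference color), a minimal sketch could look like this. The linear soft/hard thresholding is an assumption standing in for the patent's unspecified transparency calculation function, and all names are illustrative.

```python
import numpy as np

def chroma_key_mask(frame, bg_color, hard=10.0, soft=30.0):
    """frame: (H, W, 3) array; bg_color: background reference color (RGB).
    Pixels within `hard` of the key color become fully transparent (alpha 0),
    pixels beyond `soft` fully opaque (alpha 1), with a linear ramp between.
    Threshold values and the ramp shape are assumptions for illustration."""
    diff = frame.astype(float) - np.asarray(bg_color, dtype=float)
    dist = np.linalg.norm(diff, axis=-1)              # per-pixel chromaticity distance
    return np.clip((dist - hard) / (soft - hard), 0.0, 1.0)
```

In common mode the patent replaces this with a semantic segmentation network, which avoids the green-screen requirement at the cost of model inference per frame; the resulting mask is used the same way to lift the character object out of the background.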