
CN-121561139-B - Method and system for automatically generating movie script and movie abstract

CN 121561139 B

Abstract

The application provides a method and a system for automatically generating a movie script and a movie abstract, relating to the technical field of computers. After a target movie video is acquired, a key frame set, a speech transcript and a character action recognition result are extracted from it, and visual feature vectors are extracted from the key frame set. A scene segmentation scheme with the minimum total bit cost is searched for under a preset scene segmentation principle, segmenting the target movie video into a plurality of scenes. Target actor information images are extracted from the video segment containing the cast list to generate a first correspondence (actor names to scenes) and a second correspondence (actor names to speakers), from which a target format script and a scene list are generated for each target scene. A tree diagram is generated from the scenes and processed to obtain a plot summary and a database; a story outline is generated from the target format script and the database, and the movie script and movie abstract are generated in combination with the scene list. This realizes the automatic generation of movie scripts and plot summaries that are logically coherent, information-rich and matched to the visual content.

Inventors

  • LI JIANAN
  • YUAN HAIJIE

Assignees

  • 立安智通(北京)科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-23

Claims (9)

  1. A method for automatically generating a movie script and a movie abstract, comprising: acquiring a target movie video, extracting a key frame set, a speech transcript and a character action recognition result from the target movie video, and extracting a plurality of visual feature vectors from the key frame set; determining the scene segmentation scheme with the minimum total bit cost based on a preset scene segmentation principle and the visual feature vectors, so as to segment the target movie video into a plurality of scenes; constructing a distance matrix based on the target movie video, and generating, based on the distance matrix, a first correspondence between actor names and scenes and a second correspondence between actor names and speakers; generating a target format script and a scene list for each target scene based on the first correspondence and the character position parameters in the target movie video, combined with the speech transcript and the character action recognition result; generating a tree diagram from the scenes, processing the tree diagram to obtain a plot summary, and identifying the target movie video frames matching the plot summary to form a database; classifying the scene number, actor names, updated position information and character action states of each scene in the target format script as scene basic elements, and classifying the event development information of each plot fragment in the database as plot development elements; according to the second correspondence, associating the actor names in the scene basic elements with target speaker identifiers, determining the action state and position information of the actor corresponding to each target speaker identifier in the corresponding scene, integrating them into character behavior elements, and selecting actor-related event information from the plot development elements as target plot elements; integrating the scene basic elements, plot development elements, character behavior elements and target plot elements into a story outline according to the fourth time sequence of each scene in the target movie video; associating the scene numbers in the story outline with the scene sequence numbers in the scene list and the target format scripts to form a first draft of the movie script; and extracting the core events, key character behaviors and plot turning points of each scene from the first draft, generating a preliminary movie abstract combined with the plot summary in the database, and generating the movie script and the movie abstract combined with the first draft. (A non-normative pipeline skeleton appears at the end of the Description.)
  2. The method of claim 1, wherein determining the scene segmentation scheme with the minimum total bit cost based on a preset scene segmentation principle and the visual feature vectors comprises: dividing the visual feature vectors into a plurality of vector segment combinations based on the preset scene segmentation principle; calculating the probability density values of all visual feature vectors in each vector segment combination relative to the corresponding average feature vector, and calculating the total segmentation cost of each vector segment combination from those probability density values; calculating the first bit number required to encode all visual feature vectors in each vector segment combination and the second bit number required to encode the corresponding average feature vector, and from these calculating the total bit cost of each vector segment combination; and adjusting the division range of the vector segment combinations according to the second time sequence of the target movie video to form a plurality of adjusted vector segment combinations, calculating the total bit cost of each adjusted vector segment combination, and selecting the vector segment combination or adjusted vector segment combination with the minimum total bit cost as the scene segmentation scheme. (See the segmentation sketch after the claims.)
  3. The method of claim 1, wherein constructing a distance matrix based on the target movie video and generating the first correspondence between actor names and scenes and the second correspondence between actor names and speakers based on the distance matrix comprises: extracting a plurality of first actor information images and second actor information images in different scenes from the target movie video; extracting facial feature information from the first actor information images, calculating the facial feature similarity between different first actor information images, and merging the first actor information images whose facial feature similarity exceeds a preset threshold into target actor information images; calculating the feature difference between any two target actor information images from their facial feature information to construct the distance matrix; calculating the matching degree between each target actor information image and the second actor information images and, combined with the distance matrix, assigning the actor name of the target actor information image with the highest matching degree to the corresponding scene, forming the first correspondence between actor names and scenes; and extracting the speakers' voice features and text identifiers from the target movie video and, combined with the actor names in the target actor information images and the distance matrix, determining the second correspondence between actor names and speakers. (See the face-merging sketch after the claims.)
  4. The method of claim 1, wherein generating the target format script and the scene list for each target scene based on the first correspondence and the character position parameters in the target movie video, combined with the speech transcript and the character action recognition result, comprises: extracting character position parameters from the key frame set, and determining the target scene of each actor name based on the temporal association between the scenes and the key frames in the target movie video and on the first correspondence; binding each actor name to the character position parameters in the corresponding target scene to obtain each actor's initial position in that scene; according to the first time sequence of the key frames in the target movie video, calculating each actor's position change between adjacent key frames from the time interval and position parameter difference of the same actor across those frames, so as to update the actor's initial position and obtain each actor's updated position in the corresponding target scene (see the position-update sketch after the claims); generating the target format script of each target scene according to a preset script format, combining that scene's actor names, updated positions, speech transcript and character action recognition result; and arranging the target format scripts of all target scenes according to the third time sequence of the target scenes in the target movie video to form the scene list.
  5. The method of claim 4, wherein generating the target format script of each target scene according to the preset script format, combining that scene's actor names, updated positions, speech transcript and character action recognition result, comprises: setting a first script format; arranging all actor names in each target scene into an actor list according to the first correspondence; taking the identifier of each target scene as its scene number and the time span of its key frames in the target movie video as its scene duration, and filling the scene number, scene duration and actor list of each target scene into the first script format to obtain a second format script for that scene; converting the updated positions of all actors in the target scene into position descriptions and filling them into the second format script to obtain a third format script; extracting the action records of all actors in the target scene from the character action recognition result and filling them into the third format script to obtain a fourth format script; and determining, according to the second correspondence, the target speaker identifier of the actors in each target scene, screening from the speech transcript the text content associated with that identifier and with the duration of the target key frames, and filling it into the fourth format script to obtain the target format script of that scene. (See the staged-formatting sketch after the claims.)
  6. The method of claim 1, wherein generating a tree diagram from the scenes, processing the tree diagram to obtain a plot summary, identifying the target movie video frames matching the plot summary and forming a database comprises: identifying scene environment information and character interactions from the key frames of each scene to generate a target content description of each scene, and determining the temporal associations between different scenes according to the fourth time sequence of each scene in the target movie video; dividing the plurality of scenes into scene groups according to those temporal associations, taking the story thread of the target movie as the root node of the tree diagram, each scene group as a first-level child node under the root, and each single scene within a group as a second-level child node under its first-level node, where each second-level node binds the target content description of its scene; merging, under each first-level node, the target content descriptions of second-level nodes whose repetitiveness exceeds a preset repetition threshold into a target plot description, and integrating the target plot descriptions of all first-level nodes into the plot summary; splitting the plot summary into a plurality of plot fragments, matching each plot fragment with its corresponding target scene, and extracting target movie video frames from that scene to form the plot fragment's matching frame set; and constructing the database from the plot summary, each plot fragment and its matching frame set. (See the tree-diagram sketch after the claims.)
  7. A system for automatically generating a movie script and a movie abstract, for use in the method of any one of claims 1 to 6, comprising: an acquisition module for acquiring a target movie video, extracting a key frame set, a speech transcript and a character action recognition result from the target movie video, and extracting a plurality of visual feature vectors from the key frame set; a search module for determining the scene segmentation scheme with the minimum total bit cost based on a preset scene segmentation principle and the visual feature vectors, so as to segment the target movie video into a plurality of scenes; a screening module for extracting an initial actor information image set from the video segment of the target movie video containing the cast list, screening target actor information images from that set to construct a distance matrix, and generating, based on the distance matrix, the first correspondence between actor names and scenes and the second correspondence between actor names and speakers; a generation module for generating the target format script and the scene list of each target scene based on the first correspondence and the character position parameters in the target movie video, combined with the speech transcript and the character action recognition result; an identification module for generating a tree diagram from the scenes, processing the tree diagram to obtain a plot summary, and identifying the target movie video frames matching the plot summary to form a database; and an extraction module for extracting, from the target format script and the database respectively, the character behavior elements and target plot elements corresponding to the second correspondence, generating a story outline, and generating the movie script and the movie abstract combined with the scene list.
  8. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the method for automatically generating a movie script and a movie abstract according to any one of claims 1 to 6 when executing the computer program.
  9. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for automatically generating a movie script and a movie abstract according to any one of claims 1 to 6.
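
The bit-cost criterion in claims 1 and 2 is a minimum-description-length style trade-off: every candidate segmentation pays bits to encode each segment's average feature vector, plus bits to encode each frame's feature vector given that average, and the cheapest partition wins. The sketch below is a minimal illustration under assumed choices: a fixed per-value bit budget for means, Gaussian-coded residuals standing in for the "probability density values", and an exhaustive dynamic-programming search. None of these constants or the DP formulation come from the patent itself.

```python
# Minimal MDL-style scene segmentation sketch; SIGMA, BITS_PER_VALUE and
# the DP search are illustrative assumptions, not the patent's values.
import numpy as np

BITS_PER_VALUE = 16   # assumed cost of encoding one value of a mean vector
SIGMA = 1.0           # assumed spread of features around the segment mean

def segment_cost(features: np.ndarray) -> float:
    """Total bits for one segment: mean vector + Gaussian-coded residuals."""
    mean = features.mean(axis=0)
    mean_bits = BITS_PER_VALUE * mean.size          # "second bit number"
    # log2 density of each frame's features under N(mean, SIGMA^2 * I)
    sq = ((features - mean) ** 2).sum(axis=1)
    log2_density = -(sq / (2 * SIGMA ** 2)) / np.log(2) \
        - features.shape[1] * np.log2(SIGMA * np.sqrt(2 * np.pi))
    frame_bits = float(-log2_density.sum())         # "first bit number"
    return mean_bits + frame_bits

def best_segmentation(features: np.ndarray, max_segments: int = 8):
    """Dynamic programming over cut points; returns (total_bits, cuts)."""
    n = len(features)
    INF = float("inf")
    # best[k][i] = min bits to encode frames [0, i) using k segments
    best = [[INF] * (n + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = best[k - 1][j] + segment_cost(features[j:i])
                if cost < best[k][i]:
                    best[k][i], back[k][i] = cost, j
    k = min(range(1, max_segments + 1), key=lambda k: best[k][n])
    cuts, i = [], n
    for kk in range(k, 0, -1):
        cuts.append(back[kk][i])
        i = back[kk][i]
    return best[k][n], sorted(cuts)[1:]  # interior scene boundaries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two synthetic "scenes" with different mean appearance
    feats = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(5, 1, (30, 8))])
    bits, cuts = best_segmentation(feats)
    print(f"total bits {bits:.1f}, scene boundaries at frames {cuts}")
```

In this reading, adding a scene boundary only pays off when the residual bits it saves exceed the cost of encoding one more average vector, which is what keeps the scheme from over-segmenting.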
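
Claim 3 merges first actor information images whose facial similarity exceeds a threshold into target actor information images, then builds a distance matrix from their pairwise feature differences. A minimal sketch, assuming precomputed face embeddings, cosine similarity, and a greedy merge; the threshold value and the merge strategy are illustrative assumptions.

```python
# Minimal face-merging and distance-matrix sketch; embeddings, the 0.8
# threshold and greedy grouping are assumptions, not the patent's method.
import numpy as np

SIM_THRESHOLD = 0.8  # assumed facial-feature similarity threshold

def merge_actor_images(embeddings: np.ndarray) -> np.ndarray:
    """Greedily group embeddings whose cosine similarity to an existing
    group centroid exceeds the threshold; one centroid per actor."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for idx, emb in enumerate(embeddings):
        for gi, c in enumerate(centroids):
            if float(emb @ c) > SIM_THRESHOLD:
                groups[gi].append(idx)
                merged = embeddings[groups[gi]].mean(axis=0)
                centroids[gi] = merged / np.linalg.norm(merged)
                break
        else:
            groups.append([idx])
            centroids.append(emb)
    return np.stack(centroids)

def distance_matrix(centroids: np.ndarray) -> np.ndarray:
    """Pairwise feature-difference values between target actor images."""
    return np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # ten face crops of two synthetic actors
    faces = np.vstack([rng.normal(m, 0.05, (5, 16)) for m in (1.0, -1.0)])
    c = merge_actor_images(faces)
    print(c.shape[0], "actors\n", distance_matrix(c).round(2))
```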
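
Claim 4 updates an actor's initial position from the time interval and position-parameter difference of the same actor across adjacent key frames. A minimal sketch, assuming 2-D positions per key frame and a simple velocity reading of "position change data"; the linear rule is an assumption, not the patent's formula.

```python
# Minimal position-update sketch; 2-D points and the velocity rule are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class KeyframeObservation:
    timestamp: float               # seconds into the target movie video
    position: tuple[float, float]  # character position parameters (x, y)

def updated_positions(track):
    """Yield (timestamp, position, velocity) along one actor's track,
    where velocity is the position difference over the time interval
    between adjacent key frames."""
    for prev, cur in zip(track, track[1:]):
        dt = cur.timestamp - prev.timestamp
        if dt <= 0:
            continue  # skip out-of-order or duplicated key frames
        vx = (cur.position[0] - prev.position[0]) / dt
        vy = (cur.position[1] - prev.position[1]) / dt
        yield cur.timestamp, cur.position, (vx, vy)

track = [KeyframeObservation(0.0, (10.0, 5.0)),
         KeyframeObservation(2.0, (14.0, 5.0))]
print(list(updated_positions(track)))  # [(2.0, (14.0, 5.0), (2.0, 0.0))]
```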
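
Claim 5 fills one script template in stages: scene number, duration and actor list give the second format script; position descriptions the third; action records the fourth; and speaker-attributed dialogue yields the target format script. A minimal sketch with hypothetical field names; the patent does not prescribe this representation.

```python
# Minimal staged-formatting sketch; all field names are hypothetical.
def build_scene_script(scene_no, duration_s, actors, positions, actions, dialogue):
    # second format script: scene number, scene duration, actor list
    script = {"scene": scene_no, "duration_s": duration_s, "cast": actors}
    # third format script: updated positions rendered as descriptions
    script["positions"] = positions
    # fourth format script: action records from action recognition
    script["actions"] = actions
    # target format script: speaker-attributed transcript text
    script["dialogue"] = dialogue
    return script

print(build_scene_script(
    scene_no=3, duration_s=42.0, actors=["A", "B"],
    positions={"A": "stage left", "B": "in the doorway"},
    actions={"A": "sits down", "B": "enters"},
    dialogue=[("A", "You're late."), ("B", "Traffic.")],
))
```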
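
Claim 6's tree diagram puts the story thread at the root, scene groups at the first level and single scenes at the second level, then merges repetitive sibling descriptions into the plot summary. A minimal sketch, assuming plain-string content descriptions and token overlap as the "repeatability" measure; both are assumptions.

```python
# Minimal scene-tree and summary sketch; string descriptions and the
# Jaccard-overlap repeatability measure are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)

def build_tree(story_label: str, scene_groups: list[list[str]]) -> Node:
    """Root = story thread; level 1 = scene groups; level 2 = scenes."""
    root = Node(story_label)
    for i, group in enumerate(scene_groups):
        g = Node(f"group {i}")
        g.children = [Node(desc) for desc in group]
        root.children.append(g)
    return root

def overlap(a: str, b: str) -> float:
    """Token-set Jaccard overlap between two scene descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def summarize(root: Node, repeat_threshold: float = 0.5) -> str:
    """Merge highly repetitive sibling descriptions, one line per group."""
    parts = []
    for group in root.children:
        kept: list[str] = []
        for child in group.children:
            if not any(overlap(child.label, k) > repeat_threshold for k in kept):
                kept.append(child.label)
        parts.append("; ".join(kept))
    return " -> ".join(parts)

tree = build_tree("heist thread", [
    ["crew plans the vault job", "crew plans the vault job again"],
    ["alarm trips during the escape"]])
print(summarize(tree))  # repetitive planning scenes collapse to one line
```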

Description

Method and system for automatically generating a movie script and a movie abstract

Technical Field

The application relates to the technical field of computers, and in particular to a method and a system for automatically generating movie scripts and movie abstracts.

Background

In the modern movie industry, with rising demand for high-quality content and intensifying competition for viewers, rapidly and accurately generating movie scripts and summaries has become an important technical requirement. It improves production efficiency and gives marketing teams accurate plot summaries with which to attract audiences. This matters especially in independent filmmaking and platform-produced series, where time and resources are limited and existing material must be used efficiently. The more advanced current schemes automatically analyze video content with deep learning and extract key information to generate scripts and abstracts: computer vision identifies characters, scene transitions, actions and other elements in the video, and natural language processing converts this visual information into text descriptions. Such schemes still have limitations. Because they depend on pre-trained models, characters and scenes in films with unique styles or heavy special effects may not be identified accurately, so the generated content is imprecise or even wrong. When handling complex, changeable narrative structures, deep emotional threads and plot turns are hard to capture, so the generated summaries lack depth and continuity. And because features are extracted in a single modality, interactions between different modalities are generally ignored, limiting the quality of the final output.

Disclosure of Invention

The application aims to provide a method and a system for automatically generating a movie script and a movie abstract, solving the prior-art problems of inaccurate character and scene identification caused by dependence on pre-trained models, insufficient abstract depth caused by single-modality feature extraction, and limited output quality when processing complex narratives.
To solve the above technical problems, in a first aspect the application provides a method for automatically generating a movie script and a movie abstract, including: acquiring a target movie video, extracting a key frame set, a speech transcript and a character action recognition result from the target movie video, and extracting a plurality of visual feature vectors from the key frame set; determining the scene segmentation scheme with the minimum total bit cost based on a preset scene segmentation principle and the visual feature vectors, so as to segment the target movie video into a plurality of scenes; constructing a distance matrix based on the target movie video, and generating, based on the distance matrix, a first correspondence between actor names and scenes and a second correspondence between actor names and speakers; generating a target format script and a scene list for each target scene based on the first correspondence and the character position parameters in the target movie video, combined with the speech transcript and the character action recognition result; generating a tree diagram from the scenes, processing the tree diagram to obtain a plot summary, and identifying the target movie video frames matching the plot summary to form a database; and extracting, from the target format script and the database respectively, the character behavior elements and target plot elements corresponding to the second correspondence, generating a story outline, and generating the movie script and the movie abstract combined with the scene list.

Optionally, determining the scene segmentation scheme with the minimum total bit cost based on a preset scene segmentation principle and the visual feature vectors includes: dividing the visual feature vectors into a plurality of vector segment combinations based on the preset scene segmentation principle; calculating the probability density values of all visual feature vectors in each vector segment combination relative to the corresponding average feature vector, and calculating the total segmentation cost of each vector segment combination from those probability density values; calculating the first bit number required to encode all visual feature vectors in each vector segment combination and the second bit number required to encode the corresponding average feature vector, and from these calculating the total bit cost of each vector segment combination; and adjusting the division range of the vector segment combinations according to the second time sequence of the target movie video to form a plurality of adjusted vector segment combinations, calculating the total bit cost of each adjusted vector segment combination, and selecting the vector segment combination or adjusted vector segment combination with the minimum total bit cost as the scene segmentation scheme.
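
To make the data flow of this first aspect concrete, here is a runnable skeleton in which every stage is a hypothetical stub: the function names, return shapes and placeholder values are assumptions standing in for the claimed steps, not the patent's implementation.

```python
# Runnable pipeline skeleton; every function below is a hypothetical stub.
def extract_inputs(video):        # key frames, speech transcript, actions
    return ["kf0", "kf1"], "hello there", {"A": "walks"}

def min_bit_cost_segmentation(key_frames):       # claims 1-2
    return [key_frames]                          # one scene for the stub

def build_correspondences(video):                # claim 3
    return {"A": 0}, {"A": "speaker_1"}          # actor->scene, actor->speaker

def format_scripts(scenes, actor_scene, transcript, actions):  # claims 4-5
    scripts = [{"scene": i, "cast": list(actor_scene),
                "dialogue": transcript, "actions": actions}
               for i, _ in enumerate(scenes)]
    return scripts, [s["scene"] for s in scripts]

def plot_summary_database(scenes):               # claim 6
    return {"summary": f"{len(scenes)} scene(s)", "fragments": []}

def run(video="movie.mp4"):
    key_frames, transcript, actions = extract_inputs(video)
    scenes = min_bit_cost_segmentation(key_frames)
    actor_scene, actor_speaker = build_correspondences(video)
    scripts, scene_list = format_scripts(scenes, actor_scene, transcript, actions)
    db = plot_summary_database(scenes)
    outline = {"scenes": scene_list, "speakers": actor_speaker, **db}
    return scripts, outline   # first-draft script plus abstract inputs

print(run())
```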