CN-122002096-A - Cooking video editing method, device and equipment based on large model
Abstract
The invention relates to the technical field of large models and provides a large-model-based cooking video editing method, device and equipment. The method comprises: obtaining original video data including a cooking scene; preprocessing the original video data to obtain video clips to be edited; inputting the video clips into a preset multimodal large model to obtain activity information for each clip; scoring the clips according to the activity information to obtain scoring results; and editing the clips according to the scoring results to generate a target video highlight collection. The invention can automatically record warm scenes such as cooking and interaction, and solves the prior-art problems that users find it difficult to manage video recording while cooking and that smoke interferes with video clarity.
Inventors
- XIONG ZHANG
- CHEN HUI
- AI WEI
- ZHANG ZHI
- DU PEILI
- HU GUOHU
Assignees
- 宁波星巡智能科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-18
Claims (10)
- 1. A large-model-based cooking video editing method, the method comprising: acquiring original video data including a cooking scene; preprocessing the original video data to obtain video clips to be edited; inputting each video clip to be edited into a preset multimodal large model to obtain activity information for each clip; scoring each clip according to the activity information to obtain scoring results; and editing each clip according to the scoring results to generate a target video highlight collection.
- 2. The method of claim 1, wherein preprocessing the raw video data to obtain the video clips to be edited comprises: decomposing the original video frame by frame to obtain video images of each frame; performing contrast adjustment on each video image to obtain a corresponding intermediate video image for each frame; performing smoke detection on each intermediate video image according to a preset dynamic-change detection algorithm, and judging whether smoke is present; if smoke is present, repairing the color-converted image with a preset image-repair algorithm to obtain smoke-processed video data; and segmenting the smoke-processed video data by a preset time period to obtain the video clips to be edited.
- 3. The method of claim 2, wherein performing smoke detection on each intermediate video image according to the preset dynamic-change detection algorithm and judging whether smoke is present comprises: performing color-space conversion on each intermediate video image to obtain a color-converted image for each frame; calculating pixel difference values between adjacent color-converted images in the continuous time sequence; and judging from the pixel difference values whether smoke is present.
- 4. The method of claim 2, wherein, if smoke is present, repairing the color-converted image with the preset image-repair algorithm to obtain smoke-processed video data comprises: if smoke is present, detecting the position of the smoke region in the color-converted image to obtain smoke-region position information; dividing the color-converted image according to the smoke-region position information into a smoke-region image and a non-smoke-region image; repairing the smoke-region image with the preset image-repair algorithm to obtain a repaired smoke-region image; and obtaining the smoke-processed video data from the non-smoke-region image and the repaired smoke-region image.
- 5. The method of claim 1, wherein inputting each video clip to be edited into the preset multimodal large model to obtain activity information for each clip comprises: inputting the dish type and the video clips into the pre-trained multimodal large model, and acquiring key cooking-step information according to a preset dish knowledge base; inputting each video clip into a human-body target-detection model, and obtaining interaction information from the detection results; and obtaining the activity information from the key cooking-step information and the interaction information.
- 6. The method of claim 5, wherein the interaction information includes the number of people in each video clip, and scoring each clip according to the activity information comprises: scoring each clip according to the key cooking-step information to obtain a cooking score, the key cooking-step information comprising a judgment of whether a key cooking step is present in the clip, the key cooking-step type, and a hit confidence; scoring each clip according to the number of people in it to obtain an interaction score; and summing the cooking score and the interaction score to obtain the scoring result.
- 7. The method of claim 6, wherein scoring each clip according to the key cooking-step information comprises: if the judgment indicates that no key cooking step is present, assigning the clip a first base score and taking it as the cooking score; if a key cooking step is present and there is one key-step type, assigning a second base score greater than the first base score, and correcting the second base score by the hit confidence to obtain the cooking score; and if key cooking steps are present and there are two or more key-step types, assigning a third base score greater than the second base score, and correcting the third base score by the hit confidence to obtain the cooking score.
- 8. The method of claim 1, wherein editing each clip according to the scoring results to generate a target video highlight collection comprises: inputting a preset user instruction into the multimodal large model to obtain user-requirement information; and editing each clip according to the user-requirement information and the scoring results to generate the target video highlight collection.
- 9. A large-model-based cooking video editing apparatus, the apparatus comprising: a video acquisition module for acquiring original video data including a cooking scene; a preprocessing module for preprocessing the original video data to obtain video clips to be edited; a video analysis module for inputting each clip into a preset multimodal large model to obtain activity information for each clip; a video scoring module for scoring each clip according to the activity information to obtain scoring results; and a video generation module for editing each clip according to the scoring results to generate a target video highlight collection.
- 10. An electronic device comprising at least one processor, at least one memory, and computer program instructions stored in the memory which, when executed by the processor, implement the method of any one of claims 1 to 8.
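Claims 2 and 3 describe smoke detection as color-space conversion followed by pixel differencing of adjacent frames in a continuous time sequence. The patent gives no code; the following plain-Python sketch only illustrates that adjacent-frame differencing structure, and the luminance conversion, threshold value, and all function names are illustrative assumptions.

```python
# Illustrative sketch of the dynamic-change smoke detection of claims 2-3.
# A frame is a list of rows, each row a list of (r, g, b) tuples.

def to_luma(frame):
    """Convert an RGB frame to a single luminance-like channel
    (a stand-in for the patent's unspecified color-space conversion)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two same-sized luma images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def smoke_present(frames, threshold=12.0):
    """Flag smoke if any adjacent frame pair differs by more than
    `threshold` on average; the threshold is a placeholder value."""
    lumas = [to_luma(f) for f in frames]
    return any(mean_abs_diff(p, q) > threshold
               for p, q in zip(lumas, lumas[1:]))
```

In practice the conversion and threshold would be tuned per camera and scene; this sketch only shows the structure of differencing adjacent color-converted frames.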
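Claims 6 and 7 describe a tiered scoring scheme: a base score chosen by how many key cooking-step types a clip contains, corrected by the hit confidence, plus a headcount-based interaction score, with the two summed. A minimal sketch of that arithmetic follows; every constant (the three base scores, the per-person weight) and the confidence correction (simple multiplication) are illustrative assumptions, since the patent does not specify them.

```python
def cooking_score(has_step, step_types, confidence,
                  base1=1.0, base2=3.0, base3=5.0):
    """Tiered base score per claim 7: no key step -> first base score;
    one key-step type -> second; two or more -> third. The base score
    is then corrected by the hit confidence (here: multiplied)."""
    if not has_step:
        return base1
    base = base2 if step_types == 1 else base3
    return base * confidence

def segment_score(has_step, step_types, confidence, num_people,
                  per_person=0.5):
    """Claim 6: sum of the cooking score and an interaction score
    proportional to the number of people in the clip."""
    interaction = per_person * num_people
    return cooking_score(has_step, step_types, confidence) + interaction
```

The claims only require that the third base score exceed the second, which exceeds the first; the values above merely satisfy that ordering.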
Description
Cooking video editing method, device and equipment based on a large model

Technical Field
The invention relates to the technical field of large models, in particular to a cooking video editing method, device and equipment based on a large model.

Background
With the development of artificial-intelligence large models, large models are becoming popular in smart-kitchen and smart-living scenarios. Meanwhile, activities in family life, especially cooking and sharing meals, are increasingly important carriers of affection among relatives and friends; people pay more attention to recording precious moments in life, and with the rise of the short-video industry they also prefer to share scenes from daily life as short videos. At present, cooking videos are usually recorded manually: the shooting timing is hard to judge and editing is cumbersome. Recording triggered by person detection exists, but it is ill-suited to cooking scenes, which are full of dynamic change and demand high concentration. The prior Chinese patent CN116095363A (mobile-terminal short-video highlight-moment clipping method based on key behavior identification, published 2023-05-09) discloses a method that arranges highlight frames and extracted key-behavior video slices in time order to produce a video. It can identify human actions in video, but owing to the complexity of cooking scenes it cannot be applied to them well, and it cannot automatically score clips or produce highlight collections.
Applying large models to cooking scenes still faces several challenges. The multi-source heterogeneous data of a cooking scene is complex, comprising video, audio, cooking fumes and other information; smoke lowers picture clarity and reduces the accuracy with which the large model identifies key events such as ingredient preparation, stir-frying and changes in food state, yet in some cases the smoke is visually appealing and should not be uniformly suppressed. In summary, how to combine cooking-video analysis with the semantic-reasoning capability of a large model, ensure the accuracy of large-model recognition, preserve necessary cooking-atmosphere footage, and automatically select key steps to generate a cooking-video highlight collection has become a key technical problem.

Disclosure of the Invention
In view of the above, the present invention provides a large-model-based cooking video editing method, apparatus and device, to solve the prior-art problems that users find it difficult to manage recording while cooking and that smoke interferes with video clarity. The technical scheme adopted by the invention is as follows. In a first aspect, the invention provides a large-model-based cooking video editing method, the method comprising: acquiring original video data including a cooking scene; preprocessing the original video data to obtain video clips to be edited; inputting each video clip into a preset multimodal large model to obtain activity information for each clip; scoring each clip according to the activity information to obtain scoring results; and editing each clip according to the scoring results to generate a target video highlight collection.
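The final clipping step described above, selecting segments by their scores to assemble the highlight collection, can be sketched as a top-k selection that preserves chronological order. This is a simplification for illustration only: the patent additionally routes a user instruction through the multimodal large model to obtain requirement information, which is omitted here, and the function name and `top_k` parameter are assumptions.

```python
def make_highlight(segments, scores, top_k=3):
    """Pick the top-k scored segments, then restore chronological
    order so the highlight plays in the sequence it was cooked."""
    ranked = sorted(range(len(segments)),
                    key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])  # back to time order
    return [segments[i] for i in chosen]
```

For example, with four segments scored 1, 5, 3 and 4, a top-2 selection keeps the second and fourth segments, in that order.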
Preferably, preprocessing the original video data to obtain the video clips to be edited includes: decomposing the original video frame by frame to obtain video images of each frame; performing contrast adjustment on each video image to obtain a corresponding intermediate video image for each frame; performing smoke detection on each intermediate video image according to a preset dynamic-change detection algorithm, and judging whether smoke is present; if smoke is present, repairing the color-converted image with a preset image-repair algorithm to obtain smoke-processed video data; and segmenting the smoke-processed video data by a preset time period to obtain the video clips to be edited. Preferably, performing smoke detection on each intermediate video image according to the preset dynamic-change detection algorithm and judging whether smoke is present includes: performing color-space conversion on each intermediate video image to obtain a color-converted image for each frame; calculating pixel difference values between adjacent color-converted images in the continuous time sequence; and judging from the pixel difference values whether smoke is present. Preferably, if smoke is present, repairing the color-converted image with the image-repair algorithm to obtain smoke-processed video data includes: if smoke is present, detecting the position of the smoke region in the color-converted image to obtain smoke-region position information; dividing the col