EP-4742187-A1 - SYSTEMS AND METHODS FOR ANNOTATING CONTENT

Abstract

A computer system obtains a plurality of annotated short segments of content. The computer system trains a model for summarizing longer segments of content using training data comprising the plurality of annotated short segments of content, including: (i) applying a prompt and the plurality of the annotated short segments of content to a first language model to produce a summary of the plurality of annotated short segments of content; (ii) evaluating the summary of the plurality of annotated short segments of content against predefined criteria; and (iii) applying the evaluation of the summary and the prompt to a second language model to produce an updated version of the prompt; and iteratively performing (i), (ii), and (iii) at least two times.

Inventors

  • KORKINOF, Dimitrios
  • HU, Jian
  • BEGUERISSE DIAZ, Mariano

Assignees

  • Spotify AB

Dates

Publication Date
2026-05-13
Application Date
2025-11-12

Claims (15)

  1. A method, comprising: obtaining a plurality of annotated short segments of content; determining training data comprising the plurality of annotated short segments of content, the training data for use in training a model for summarizing longer segments of content, including: (i) applying a prompt and the plurality of the annotated short segments of content to a first language model distinct from the model to produce a summary of the plurality of annotated short segments of content; (ii) evaluating the summary of the plurality of annotated short segments of content against predefined criteria; and (iii) applying the evaluation of the summary and the prompt to a second language model distinct from the model to produce an updated version of the prompt; and iteratively performing (i), (ii), and (iii) at least two times; and after iteratively performing (i), (ii), and (iii) for a final iteration, applying the updated version of the prompt to the first language model to produce a final summary, wherein the final summary is used as an annotation of the plurality of annotated short segments to train the model for summarizing longer segments of content.
  2. The method of claim 1, wherein evaluating the summary of the plurality of annotated short segments of content against predefined criteria includes determining a score representing a quality of the summary produced by the prompt; and the method includes determining the final iteration of performing (i), (ii), and (iii) based on the score.
  3. The method of claim 1, including determining the final iteration of performing (i), (ii), and (iii) based on a maximum number of iterations to be performed.
  4. The method of any preceding claim, wherein obtaining the plurality of annotated short segments of content includes captioning short segments of a content item.
  5. The method of any preceding claim, wherein the prompt includes a fixed portion and a non-fixed portion, wherein the non-fixed portion is updated between iterations and the fixed portion is maintained between iterations.
  6. The method of any preceding claim, wherein the longer segments of content correspond to one or more hours-long content items.
  7. The method of any preceding claim, wherein the model for summarizing longer segments of content comprises a second model in a system, the system further including a first model, wherein the output of the first model is provided as an input to the second model.
  8. A computer system comprising: one or more processors; and memory storing one or more programs, the one or more programs including instructions for: obtaining a plurality of annotated short segments of content; determining training data comprising the plurality of annotated short segments of content, the training data for use in training a model for summarizing longer segments of content, including: (i) applying a prompt and the plurality of the annotated short segments of content to a first language model distinct from the model to produce a summary of the plurality of annotated short segments of content; (ii) evaluating the summary of the plurality of annotated short segments of content against predefined criteria; and (iii) applying the evaluation of the summary and the prompt to a second language model distinct from the model to produce an updated version of the prompt; and iteratively performing (i), (ii), and (iii) at least two times; and after iteratively performing (i), (ii), and (iii) for a final iteration, applying the updated version of the prompt to the first language model to produce a final summary, wherein the final summary is used as an annotation of the plurality of annotated short segments to train the model for summarizing longer segments of content.
  9. The computer system of claim 8, wherein evaluating the summary of the plurality of annotated short segments of content against predefined criteria includes determining a score representing a quality of the summary produced by the prompt; and the one or more programs include instructions for determining the final iteration of performing (i), (ii), and (iii) based on the score.
  10. The computer system of claim 8, the one or more programs including instructions for determining the final iteration of performing (i), (ii), and (iii) based on a maximum number of iterations to be performed.
  11. The computer system of any of claims 8 to 10, wherein obtaining the plurality of annotated short segments of content includes captioning short segments of a content item.
  12. The computer system of any of claims 8 to 11, wherein the prompt includes a fixed portion and a non-fixed portion, wherein the non-fixed portion is updated between iterations and the fixed portion is maintained between iterations.
  13. The computer system of any of claims 8 to 12, wherein the longer segments of content correspond to one or more hours-long content items.
  14. The computer system of any of claims 8 to 13, wherein the model for summarizing longer segments of content comprises a second model in a system, the system further including a first model, wherein the output of the first model is provided as an input to the second model.
  15. A non-transitory computer-readable storage medium storing one or more programs for execution by a computer system with one or more processors, the one or more programs comprising instructions for: obtaining a plurality of annotated short segments of content; determining training data comprising the plurality of annotated short segments of content, the training data for use in training a model for summarizing longer segments of content, including: (i) applying a prompt and the plurality of the annotated short segments of content to a first language model distinct from the model to produce a summary of the plurality of annotated short segments of content; (ii) evaluating the summary of the plurality of annotated short segments of content against predefined criteria; and (iii) applying the evaluation of the summary and the prompt to a second language model distinct from the model to produce an updated version of the prompt; and iteratively performing (i), (ii), and (iii) at least two times; and after iteratively performing (i), (ii), and (iii) for a final iteration, applying the updated version of the prompt to the first language model to produce a final summary, wherein the final summary is used as an annotation of the plurality of annotated short segments to train the model for summarizing longer segments of content.
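The iterative loop recited in the independent claims can be sketched as follows. This is an illustrative Python sketch, not an implementation from the patent: `summarize_llm`, `evaluate`, and `refine_llm` are hypothetical stand-ins for the first language model, the evaluation against the predefined criteria, and the second language model, and the two stopping conditions mirror claims 2 and 3 (a quality score and a maximum number of iterations).

```python
# Hypothetical sketch of the claimed iterative prompt-refinement loop.
# summarize_llm, evaluate, and refine_llm are placeholder callables; none
# of these names appear in the patent itself.

def optimize_prompt(prompt, captions, summarize_llm, evaluate, refine_llm,
                    max_iters=5, target_score=0.9):
    """Refine `prompt` until the summary it produces scores well enough."""
    for _ in range(max_iters):
        summary = summarize_llm(prompt, captions)   # step (i)
        score, feedback = evaluate(summary)         # step (ii)
        if score >= target_score:
            break
        prompt = refine_llm(prompt, feedback)       # step (iii)
    # Final pass: the last updated prompt yields the summary that serves
    # as the training annotation for the longer-segment model.
    return summarize_llm(prompt, captions)
```

Note that the final summary is produced by a fresh call with the last prompt, matching the claim language in which the updated prompt is applied to the first language model after the final iteration.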

Description

TECHNICAL FIELD

The disclosed embodiments relate generally to annotating content items, and more particularly, to generating data for training a model for summarizing longer segments of content.

BACKGROUND

Summarization of image and/or video content is a compelling field, with current captioning models achieving remarkable results on single images or seconds-long videos. However, many videos are much longer, extending to hour(s)-long durations. Current research on long-form video captioning mostly focuses on minutes-long videos, with little exploration of hour(s)-long videos. Additionally, manually annotating hour(s)-long videos (e.g., for the purpose of training models) is challenging due to their length. Despite this, such videos are quite common, making it necessary to develop a model capable of captioning hour(s)-long videos.

SUMMARY

One approach to annotating longer videos is to perform the annotations recursively. A long video is divided into short segments, which are captioned by a model. The captions of the short segments are then used to summarize a longer portion of the video (e.g., by the same model or a different model), and so on, until the full-length video is summarized. Existing methods of recursively summarizing video use a supervised training approach at every level, with human annotations being used to train the model(s). The disclosed embodiments use an unsupervised approach to at least partially train a model to generate a summary of a longer portion of video using captions of shorter portions of video. The unsupervised approach generates summaries using an iterative process. In this way, high-quality training data can be generated for training a model for summarizing longer segments of content, thereby enabling a model to be trained to provide a more accurate output in a more efficient manner.

To that end, in accordance with some embodiments, a method is provided.
The method includes obtaining a plurality of annotated short segments of content. The method further includes determining training data comprising the plurality of annotated short segments of content, the training data for use in training a model for summarizing longer segments of content, including: (i) applying a prompt and the plurality of the annotated short segments of content to a first large language model to produce a summary of the plurality of annotated short segments of content; (ii) evaluating the summary of the plurality of annotated short segments of content against predefined criteria; and (iii) applying the evaluation of the summary and the prompt to a second large language model to produce an updated version of the prompt; and iteratively performing (i), (ii), and (iii) at least two times. The method further includes, after iteratively performing (i), (ii), and (iii) for a final iteration, applying the updated version of the prompt to the first language model to produce a final summary, wherein the final summary is used as an annotation of the plurality of annotated short segments to train the model for summarizing longer segments of content.

In accordance with some embodiments, an electronic device is provided. The electronic device includes one or more processors and memory storing one or more programs. The one or more programs include instructions for performing any of the methods described herein.

In accordance with some embodiments, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores one or more programs for execution by an electronic device with one or more processors. The one or more programs comprise instructions for performing any of the methods described herein.

Thus, systems are provided with improved methods of generating training data and training a model for summarizing longer segments of content.
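The recursive annotation strategy outlined in the summary (caption short segments, then summarize the captions level by level until the full item is covered) can be illustrated with a short sketch. This is an assumption-laden illustration, not code from the disclosure; `caption_model` and `summary_model` are hypothetical placeholders for the captioning and summarization models.

```python
# Illustrative sketch (not from the patent) of recursive summarization:
# a long content item is split into short segments, each segment is
# captioned, and captions are merged level by level until a single
# summary covers the whole item.

def recursive_summary(segments, caption_model, summary_model, group_size=4):
    # Caption each short segment of the long content item.
    level = [caption_model(seg) for seg in segments]
    while len(level) > 1:
        # Summarize each group of adjacent captions into one
        # higher-level summary, halving (or better) the list each pass.
        level = [summary_model(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    return level[0]
```

A trained longer-segment model would replace `summary_model` at the upper levels; the patent's contribution is generating the annotations needed to train that model without human labeling at every level.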
BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments disclosed herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to corresponding parts throughout the drawings and specification.

FIG. 1 is a block diagram illustrating a media content delivery system, in accordance with some embodiments.
FIG. 2 is a block diagram illustrating an electronic device, in accordance with some embodiments.
FIG. 3 is a block diagram illustrating a media content server, in accordance with some embodiments.
FIGS. 4A-4C are example block diagrams for optimizing a prompt to generate a summary of content using captions, in accordance with some embodiments.
FIGS. 5A-5B are a block diagram illustrating a system for generating summaries for a full content item using captions, in accordance with some embodiments.
FIGS. 6A-6B are flow diagrams illustrating a method for generating summaries for a full content item using captions, in accordance with some embodiments.

DETAILED DESCRIPTION

Reference will now be made to embodiments, examples o