CN-116508315-B - Multimodal game video summary
Abstract
Video (416) and audio (414) from a computer simulation are processed by a machine learning engine (202) to identify (204) candidate segments of the simulation for use in a video summary of the simulation. Text input (410) is then used to refine the determination of whether a candidate segment should be included in the video summary.
Inventors
- L. Kaushik
- S. Kumar
- J. Yu
- K. Zhang
- S. Horam
- S. Rao
- C. R. Sandalam
Assignees
- Sony Interactive Entertainment Inc.
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-09-03
- Priority Date: 2020-11-25
Claims (11)
- 1. An apparatus for providing video summaries, comprising: at least one processor programmed with instructions to: receive audio video (AV) data; provide a video summary of the AV data that is shorter than the AV data at least in part by: inputting first modality data comprising audio from the AV data to a machine learning (ML) engine to identify a plurality of first candidate segments of the AV data; inputting second modality data comprising computer simulation chat text related to the AV data to the ML engine to extract at least a first parameter from the second modality data and provide the first parameter to an event relevance detector (ERD); and selecting at least some of the plurality of first candidate segments based at least in part on the first parameter, wherein a first candidate segment is excluded if the first candidate segment is identified as being of interest based on the first modality data but is not identified as being of interest based on the second modality data; and receive the video summary of the AV data from the ML engine in response to the input of the first modality data and the second modality data, the video summary comprising the selected plurality of first candidate segments of the AV data.
- 2. The apparatus of claim 1, wherein the second modality data comprises computer simulation video from the AV data.
- 3. The apparatus of claim 1, wherein the instructions are executable to execute the ML engine to extract at least a second parameter from the first modality data and provide the second parameter to the ERD.
- 4. The apparatus of claim 3, wherein the instructions are executable to execute the ERD to output the video summary based at least in part on the first parameter and the second parameter.
- 5. A method for providing video summaries, comprising: receiving audio video (AV) data; identifying, using audio from the AV data, a plurality of first candidate segments of the AV data for use in building a summary of the AV data using a machine learning (ML) engine; identifying, using the ML engine, at least one parameter associated with a chat related to the AV data; selecting at least some of the plurality of first candidate segments based at least in part on the parameter, wherein a first candidate segment is excluded if the first candidate segment is identified as being of interest based on the audio but is not identified as being of interest based on the chat; and generating a video summary of the AV data, shorter than the AV data, using at least some of the plurality of first candidate segments.
- 6. The method of claim 5, further comprising: identifying, using video from the AV data, a plurality of second candidate segments of the AV data for use in summarizing the AV data using the ML engine; and selecting at least some of the plurality of second candidate segments based at least in part on the parameter, wherein generating the video summary of the AV data also uses at least some of the plurality of second candidate segments.
- 7. The method of claim 5, comprising presenting the video summary on a display.
- 8. The method of claim 6, wherein identifying the plurality of second candidate segments of the AV data using video from the AV data comprises identifying one or more selected from a list comprising: a scene change in the AV data; and text in the video of the AV data.
- 9. The method of claim 5, wherein identifying the plurality of first candidate segments of the AV data using audio from the AV data comprises identifying one or more selected from a list comprising: an acoustic event in the audio; a pitch and/or amplitude of at least one voice in the audio; emotion in the audio; and words in speech in the audio.
- 10. The method of claim 9, wherein identifying the parameter associated with the chat related to the AV data comprises identifying one or more selected from a list comprising: an emotion of the chat; a topic of the chat; at least one grammar class of at least one word in the chat; and a summary of the chat.
- 11. An assembly for providing video summaries, comprising: at least one display device configured to present an audio video (AV) computer game; and at least one processor associated with the display device and configured with instructions for performing the method of any of claims 5-10.
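The exclusion rule recited in claims 1 and 5 can be illustrated as follows. This is a minimal, hypothetical sketch of that rule only; the segment representation and function names are illustrative assumptions, not taken from the patent.

```python
# Sketch of the claimed selection rule: a candidate segment identified
# as of interest from the audio modality is excluded unless the chat
# modality also identifies it as of interest. Names are illustrative.

def select_segments(audio_candidates, chat_interesting):
    """Keep only audio-derived candidate segments that the chat
    modality also flags as being of interest."""
    return [seg for seg in audio_candidates if seg in chat_interesting]

# Segments as (start_sec, end_sec) tuples for illustration.
audio_candidates = [(0, 10), (30, 45), (70, 80)]
chat_interesting = {(30, 45), (70, 80)}

summary_segments = select_segments(audio_candidates, chat_interesting)
# (0, 10) is excluded: of interest per audio, but not per chat.
```

A production system would compare segments by time overlap rather than exact tuple identity; exact matching is used here only to keep the sketch short.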
Description
Multimodal game video summary

Technical Field

The present application relates generally to multimodal game video summarization in computer simulations and other applications.

Background

Video summarization of computer simulation video, or of other video, generates a succinct video for quick viewing of highlights, for example on a viewing platform or online gaming platform, to enhance the viewing experience. As understood herein, automatically generating a useful summary video is difficult, and manually generating a summary is time consuming.

Disclosure of Invention

An apparatus includes at least one processor programmed with instructions to receive audio video (AV) data and provide a video summary of the AV data that is shorter than the AV data, at least in part by inputting first modality data and second modality data to a machine learning (ML) engine. The instructions are executable to receive the video summary of the AV data from the ML engine in response to the input of the first modality data and the second modality data. In an exemplary embodiment, the first modality data includes audio from the AV data, and the second modality data includes computer simulation video from the AV data. In other implementations, the second modality data may include computer simulation chat text related to the AV data. In a non-limiting example, the instructions are executable to execute the ML engine to extract at least a first parameter from the second modality data and provide the first parameter to an event relevance detector (ERD). In these examples, the instructions may be executable to execute the ML engine to extract at least a second parameter from the first modality data and provide the second parameter to the ERD. The instructions may be further executable to execute the ERD to output the video summary based at least in part on the first parameter and the second parameter.
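The per-modality parameters and the ERD described above can be sketched as a simple fusion step. This is an illustrative assumption of how such a detector might combine one parameter per modality; the class names, scores, and threshold are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass

# Hypothetical sketch: each candidate segment carries one parameter per
# modality, and an "event relevance detector" keeps segments whose
# combined score clears a threshold. Scores/threshold are assumptions.

@dataclass
class Segment:
    start: float        # segment start time, seconds
    end: float          # segment end time, seconds
    audio_score: float  # first-modality parameter (e.g., excitement in audio)
    chat_score: float   # second-modality parameter (e.g., chat reaction)

def event_relevance_detector(segments, threshold=1.0):
    """Output segments whose combined modality scores pass the threshold."""
    return [s for s in segments if s.audio_score + s.chat_score >= threshold]

candidates = [
    Segment(0, 10, 0.9, 0.0),   # loud, but no chat reaction: dropped
    Segment(30, 45, 0.7, 0.6),  # both modalities agree: kept
]
summary = event_relevance_detector(candidates)
```

A trained model would learn how to weight the modalities; the additive threshold here only shows the shape of the fusion step.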
In another aspect, a method includes identifying an audio video (AV) entity, such as a computer game audio-video stream. The method includes identifying a first plurality of candidate segments of the AV entity for use in building a summary of the entity using audio from the AV entity, and identifying a second plurality of candidate segments of the AV entity for use in building the summary also using video from the AV entity. The method further includes identifying at least one parameter associated with a chat related to the AV entity and selecting at least some of the first and second pluralities of candidate segments based at least in part on the parameter. The method uses at least some of the first and second candidate segments to generate a video summary of the AV entity that is shorter than the AV entity. In an exemplary implementation, the method may include presenting the video summary on a display. In a non-limiting embodiment, identifying the plurality of second candidate segments of the AV entity using video from the AV entity includes identifying a scene change in the AV entity. Additionally or alternatively, identifying the plurality of second candidate segments of the AV entity using video from the AV entity may include identifying text in the video of the AV entity. In some implementations, identifying the plurality of first candidate segments of the AV entity using audio from the AV entity can include identifying an acoustic event in the audio. Additionally or alternatively, identifying the plurality of first candidate segments of the AV entity using audio from the AV entity may include identifying a pitch and/or an amplitude of at least one voice in the audio. Additionally or alternatively, identifying the plurality of first candidate segments of the AV entity using audio from the AV entity may include identifying emotion in the audio.
Additionally or alternatively, identifying the plurality of first candidate segments of the AV entity using audio from the AV entity may include identifying words in speech in the audio. In an exemplary implementation, identifying the parameter associated with a chat related to the AV entity may include identifying an emotion of the chat. Additionally or alternatively, identifying the parameter associated with a chat related to the AV entity may include identifying a topic of the chat. Additionally or alternatively, identifying the parameter associated with a chat related to the AV entity may include identifying at least one grammar class of at least one word in the chat. Additionally or alternatively, identifying the parameter associated with a chat related to the AV entity may include identifying a summary of the chat. In another aspect, an assembly includes at least one display device configured to present an audio video (AV) computer game. At least one processor is associated with the display device and configured with instructions for executing a
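The chat parameters enumerated above (emotion, topic, and so on) could be extracted in many ways; a toy keyword-based sketch follows. Everything here is an illustrative assumption: a real system would use trained emotion and topic models, and the word list and scoring are hypothetical.

```python
# Toy sketch of extracting crude chat parameters (an excitement score
# and a dominant topic word) from chat messages. Heuristics are
# illustrative only; the patent contemplates ML-based extraction.

EXCITED_WORDS = {"wow", "gg", "insane", "clutch"}  # hypothetical lexicon

def chat_parameters(messages):
    """Return a crude excitement score (fraction of excited words)
    and the most frequent word as a stand-in for the chat topic."""
    words = [w.lower().strip("!?.") for m in messages for w in m.split()]
    excitement = sum(w in EXCITED_WORDS for w in words) / max(len(words), 1)
    topic = max(set(words), key=words.count) if words else None
    return {"excitement": excitement, "topic": topic}

params = chat_parameters(["wow insane play", "insane"])
```

Such parameters would then be fed to the ERD alongside the audio- and video-derived parameters when scoring candidate segments.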