US-12621544-B2 - Dynamically altering preprocessing of streaming video data by a large language model dependent upon review by the large language model

US 12621544 B2

Abstract

A method of preprocessing incoming video data can include preprocessing incoming video data, by a computer processor, according to preprocessing parameters, wherein the preprocessing includes formatting the incoming video data to create first video data of a first region of interest. The method can also include providing the first video data to a first transformer model along with a first prompt requesting the first transformer model to describe the first video data, describing the first video data by the first transformer model to generate at least one description of the first video data, providing the at least one description to a second transformer model with a second prompt requesting the second transformer model to review the at least one description and determine altered preprocessing parameters for preprocessing the incoming video data, and generating, by the second transformer model, altered preprocessing parameters based upon the at least one description.
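The feedback loop summarized in the abstract — preprocess a frame, have a first model describe it, have a second model review that description and return altered preprocessing parameters — can be pictured with a minimal Python sketch. The describe_model and review_model callables, the JSON parameter format, and the specific OpenCV operations below are illustrative assumptions, not the implementation required by the patent.

    import json
    import cv2  # OpenCV supplies the frame-level edits (crop, grayscale, resize)

    def preprocess(frame, params):
        # Apply preprocessing parameters to one frame of the incoming video data.
        x, y, w, h = params["crop"]          # region of interest within the full field of view
        roi = frame[y:y + h, x:x + w]
        if params.get("grayscale"):
            roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        if "resize" in params:
            roi = cv2.resize(roi, tuple(params["resize"]))
        return roi

    def describe_and_review(frame, params, describe_model, review_model):
        # describe_model and review_model are hypothetical callables standing in for the
        # first and second transformer models (e.g., multimodal LLM endpoints).
        first_video_data = preprocess(frame, params)
        description = describe_model("Describe this video data.", first_video_data)
        altered = review_model(
            "Review this description and return altered preprocessing parameters "
            "as JSON with the same keys (crop, grayscale, resize): " + description
        )
        return json.loads(altered)  # fed back into preprocess() to create second video data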

Inventors

  • Amol Ajgaonkar

Assignees

  • INSIGHT DIRECT USA, INC.

Dates

Publication Date
2026-05-05
Application Date
2024-09-12

Claims (20)

  1. A method of preprocessing incoming video data having at least one region of interest, the method comprising: accessing the incoming video; preprocessing the incoming video data, by a computer processor, according to preprocessing parameters, wherein the preprocessing includes formatting the incoming video data to create first video data of a first region of interest; providing the first video data to a first transformer model along with a first prompt requesting the first transformer model to describe the first video data; describing the first video data by the first transformer model to generate at least one description of the first video data; providing the at least one description to a second transformer model with a second prompt requesting the second transformer model to review the at least one description and determine altered preprocessing parameters for preprocessing the incoming video data; generating, by the second transformer model, altered preprocessing parameters based upon the at least one description; providing the altered preprocessing parameters to the computer processor; and preprocessing the incoming video data according to the altered preprocessing parameters to create second video data.
  2. The method of claim 1, wherein the altered preprocessing parameters change at least one of the following video edits of the incoming video data as compared to the unaltered preprocessing parameters: crop, grayscale, contrast, brightness, color threshold, resize, blur, hue saturation value, sharpen, erosion, dilation, Laplacian image processing, Sobel image processing, pyramid up, and pyramid down.
  3. The method of claim 1, further comprising: formatting, by the second transformer model, the altered preprocessing parameters so as to be accepted and applied by the computer processor.
  4. The method of claim 1, further comprising: providing the altered preprocessing parameters to a formatting module; and formatting the altered preprocessing parameters, by the formatting module, into a format that is acceptable by the computer processor, wherein the formatting module provides the altered preprocessing parameters in an acceptable format to the computer processor.
  5. The method of claim 1, wherein the first transformer model and the second transformer model each include a large language model.
  6. The method of claim 1, wherein the first transformer model is different from the second transformer model.
  7. The method of claim 1, wherein the first transformer model and the second transformer model are the same transformer model.
  8. The method of claim 1, further comprising: publishing the first video data to an endpoint, wherein accessing the first video data includes subscribing to the endpoint.
  9. The method of claim 8, wherein the endpoint is hosted by a gateway.
  10. The method of claim 1, wherein the incoming video data is received from a camera.
  11. A method of preprocessing incoming video data having at least one region of interest, the method comprising: preprocessing the incoming video data, by a computer processor, according to preprocessing parameters, wherein the preprocessing includes formatting the incoming video data to create first video data of a first region of interest; accessing the first video data by an AI model; processing the first video data by the AI model to determine a first output that is indicative of a first inference dependent upon the first video data; providing the first video data, the first output, and a first prompt to a first transformer model with the first prompt requesting the first transformer model to describe the first video data; generating, by the first transformer model, at least one description of the first video data from the first video data; providing the at least one description and a second prompt to a second transformer model with the second prompt requesting the second transformer model to determine altered preprocessing parameters that alter the incoming video data to create second video data; and determining, by the second transformer model, the altered preprocessing parameters dependent upon the at least one description with the altered preprocessing parameters altering the incoming video data to create the second video data.
  12. The method of claim 11, further comprising: changing the preprocessing parameters in a configuration file to be the altered preprocessing parameters; accessing the altered preprocessing parameters by the computer processor; and preprocessing the incoming video data according to the altered preprocessing parameters to create the second video data.
  13. The method of claim 12, further comprising: providing the second video data and a third prompt to the first transformer model with the third prompt requesting the first transformer model to describe the second video data; and describing, by the first transformer model, the second video data to create at least one description of the second video data.
  14. The method of claim 13, further comprising: compiling the at least one description of the first video data and the at least one description of the second video data into an overall video data description.
  15. The method of claim 11, further comprising: providing the altered preprocessing parameters to a formatting module; formatting the altered preprocessing parameters, by the formatting module, into a format that is acceptable by the computer processor; and providing the altered preprocessing parameters in an acceptable format to the computer processor.
  16. The method of claim 11, wherein the first transformer model and the second transformer model each include a large language model.
  17. The method of claim 11, wherein the first transformer model and the second transformer model are the same transformer model.
  18. The method of claim 11, further comprising: generating multiple vector embeddings corresponding to multiple descriptions of the at least one description of the first video data.
  19. The method of claim 18, wherein the at least one description of the first video data includes multiple descriptions with each description being generated by the first transformer model and corresponding to one frame of multiple frames that form the first video data.
  20. The method of claim 19, further comprising: searching the multiple descriptions to find at least one relevant frame of the first video data.
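Claims 18-20 describe generating one vector embedding per per-frame description and then searching those descriptions to locate relevant frames. The sketch below shows one way such a search could look; the embed() function stands in for whatever text-embedding model is used and is an assumption, not part of the claims.

    import numpy as np

    def cosine_similarity(a, b):
        # similarity between two embedding vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def search_frame_descriptions(frame_descriptions, query, embed):
        # frame_descriptions: one description string per frame of the first video data
        # embed: hypothetical function mapping text to a numpy embedding vector
        query_vector = embed(query)
        scored = [
            (cosine_similarity(embed(text), query_vector), index)
            for index, text in enumerate(frame_descriptions)
        ]
        # highest-scoring frames are the most relevant to the query
        return [index for _, index in sorted(scored, reverse=True)]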

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application is a nonprovisional application claiming the benefit of U.S. provisional Ser. No. 63/553,204, filed on Feb. 14, 2024, entitled “DYNAMICALLY PREPROCESSING OF STREAMING VIDEO DATA AND REVIEW OF THE VIDEO DATA BY A LARGE LANGUAGE MODEL” by Amol Ajgaonkar.

TECHNICAL FIELD

The disclosure relates generally to processing of video data and, more specifically, to the selection/extraction, preprocessing/processing, and publishing of video data of a region of interest and the subsequent review/analysis of that video data by a large language model.

BACKGROUND

Cameras are beneficial for use in all areas of commercial and personal practice. For example, security cameras are used within (and outside) commercial warehouses and on private personal property. Other applications use cameras along assembly lines for quality control purposes. With the increased capabilities of cameras having higher quality imagery (i.e., resolution) and a wider field of view, more area can be shown in the streaming video by the camera. A large portion of the frame/field of view may be of no interest to the consumer (e.g., a security or manufacturing company). However, current practices relay the entirety of the streaming video (i.e., the entire frame/field of view) to the consumer, which can be time- and resource-consuming due to the need to transfer large-frame (i.e., wide field of view), high-resolution video data.

SUMMARY

A system and method for selection/extraction, preprocessing, and publishing of video data of a region of interest (i.e., a scene) that is a subset of a field of view of streaming video is disclosed herein. The system and method can also include processing the video data by a consumer/subscriber after the video data has been published. Additionally and/or alternatively, the system and method can include processing, reviewing, etc. the video data by a large language model (hereinafter referred to as an “LLM”).

Streaming video data is received from a camera with a first field of view. The video data is then preprocessed, by a computer processor such as a gateway or digital/virtual container, according to preprocessing parameters defined within a runtime configuration file that is pushed down to the computer processor. The runtime configuration file can be stored and/or edited distant from the computer processor, and any edits/revisions to the runtime configuration file can be pushed to and applied by the computer processor to the streaming video data in real time to alter the preprocessing applied to the video data. The preprocessing can include formatting/cropping the streaming video data received from the camera to create first video data of a first region of interest (i.e., a scene) having a second field of view that is less than (shows less area than) the first field of view shown by the entirety of the streaming video data from the camera. The preprocessing as defined by the preprocessing parameters in the runtime configuration file can also include altering the video data's grayscale, contrast, brightness, color threshold, size, blur, hue saturation value (HSV), sharpening, erosion, dilation, Laplacian image processing, Sobel image processing, pyramid up, and pyramid down (among others). The video data/frame can then be published to an endpoint (such as a topic on an asynchronous messaging library like ZeroMQ) for subscription and use by a first subscriber/consumer.
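As a concrete illustration of the preprocessing and publishing stage described above, the following sketch reads preprocessing parameters from a runtime-configuration-style dictionary, applies them to a frame from a camera, and publishes the result to a ZeroMQ topic. The configuration keys, topic name, and port are assumptions chosen for the example; the patent does not prescribe a specific file layout.

    import cv2
    import zmq  # pyzmq: the asynchronous messaging library mentioned above

    # Stand-in for the runtime configuration file pushed down to the gateway/container.
    runtime_config = {
        "crop": [100, 50, 640, 360],   # x, y, width, height of the region of interest
        "grayscale": True,
        "resize": [320, 180],
        "topic": "scene-1",
        "endpoint": "tcp://*:5555",
    }

    context = zmq.Context()
    publisher = context.socket(zmq.PUB)
    publisher.bind(runtime_config["endpoint"])

    capture = cv2.VideoCapture(0)       # incoming video data from a camera
    ok, frame = capture.read()
    if ok:
        x, y, w, h = runtime_config["crop"]
        roi = frame[y:y + h, x:x + w]                    # first region of interest
        if runtime_config["grayscale"]:
            roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        roi = cv2.resize(roi, tuple(runtime_config["resize"]))
        ok, encoded = cv2.imencode(".jpg", roi)
        # publish the preprocessed frame to the topic for any subscriber/consumer
        publisher.send_multipart([runtime_config["topic"].encode(), encoded.tobytes()])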
The first video data can then be viewed, used, and/or processed by the first subscriber, which can be a large language model. The preprocessing as defined in the runtime configuration file can be tailored to the subscriber, the needs/uses of the subscriber, and the processing to be performed by the subscriber. For example, the processing performed by the subscriber after publishing of the first video data/frame may use an artificial intelligence (AI) model to analyze scenarios occurring on/in the first video data/frame. The AI model may require the first video data/frame to be of a particular size, format, etc., which can be selected and applied during the preprocessing as set out in the runtime configuration file so that the subscriber does not need to perform this preprocessing before applying the AI model. The subscriber's processing of the first video data, by a computer processor, can be performed distant from the camera, from the location at which the runtime configuration file is stored and/or edited, and from the gateway/container upon which the preprocessing is performed.

The first subscriber can perform the processing of the video data to determine at least one output, with the output being indicative of an inference dependent on the first video data. For example, the first video data can be processed by an AI model to determine the amount of a particular product that has passed by on an assembly line (i.e., the amount of the product being an inference dependent on the first video data). The processing can include other opera