CN-117223035-B - Efficient test-time adaptation for improved temporal consistency in video processing
Abstract
A method for processing video includes receiving the video as input at a first layer of an artificial neural network (ANN). A first frame of the video is processed to generate a first label. The artificial neural network is then updated based on the first label, with the update performed while a second frame of the video is concurrently processed. In so doing, temporal inconsistencies between labels are reduced.
Inventors
- Y. Zhang
- S. M. Boser
- F. M. Polykeri
Assignees
- Qualcomm Incorporated
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-03-09
- Priority Date: 2021-03-10
Claims (17)
- 1. A processor-implemented method for processing video, the processor-implemented method being performed by at least one processor and comprising: receiving the video as input at a first layer of an artificial neural network (ANN); processing a first frame of the video to generate a first output portion; generating a pseudo-label for the first frame by applying an argmax function to the first output portion; and updating the artificial neural network based on a cross-entropy loss calculated as a function of the first output portion and the pseudo-label of the first frame of the video.
- 2. The processor-implemented method of claim 1, further comprising applying the cross-entropy loss in a backward pass of the artificial neural network.
- 3. The processor-implemented method of claim 1, further comprising generating a second output portion based on a second frame of the video.
- 4. The processor-implemented method of claim 1, wherein the artificial neural network is updated at test time.
- 5. A processor-implemented method for processing video, the processor-implemented method being performed by at least one processor and comprising: receiving the video as input at a first layer of a first artificial neural network (ANN) and a second artificial neural network, the first artificial neural network having fewer channels than the second artificial neural network; processing a first frame of the video via the first artificial neural network to generate a first label, the first artificial neural network providing intermediate features extracted from the first frame of the video to the second artificial neural network; processing the first frame of the video via the second artificial neural network to generate a second label based on the intermediate features and the first frame; and updating the first artificial neural network based on the first label while the second artificial neural network concurrently processes a second frame of the video.
- 6. The processor-implemented method of claim 5, further comprising generating a third label via the second artificial neural network based on the second frame, and wherein the concurrent processing is performed such that temporal inconsistencies between the second label and the third label are reduced.
- 7. The processor-implemented method of claim 5, wherein the first artificial neural network operates at a lower resolution than the second artificial neural network.
- 8. An apparatus for processing video, comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: receive the video as input at a first layer of a first artificial neural network (ANN) and a second artificial neural network, the first artificial neural network having fewer channels than the second artificial neural network; process a first frame of the video via the first artificial neural network to generate a first label, the first artificial neural network providing intermediate features extracted from the first frame of the video to the second artificial neural network; process the first frame of the video via the second artificial neural network to generate a second label based on the intermediate features and the first frame; and update the first artificial neural network based on the first label while the second artificial neural network concurrently processes a second frame of the video.
- 9. The apparatus of claim 8, wherein the at least one processor is further configured to generate a third label via the second artificial neural network based on the second frame, and wherein the concurrent processing is performed such that temporal inconsistencies between the second label and the third label are reduced.
- 10. The apparatus of claim 8, wherein the first artificial neural network operates at a lower resolution than a resolution of the second artificial neural network.
- 11. An apparatus for processing video, comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: receive the video as input at a first layer of an artificial neural network (ANN); process a first frame of the video to generate a first output portion; generate a pseudo-label for the first frame by applying an argmax function to the first output portion; and update the artificial neural network based on a cross-entropy loss calculated as a function of the first output portion and the pseudo-label of the first frame of the video.
- 12. The apparatus of claim 11, the at least one processor further configured to apply the cross-entropy loss in a backward pass of the artificial neural network.
- 13. The apparatus of claim 11, the at least one processor being further configured to generate a second output portion based on a second frame of the video.
- 14. The apparatus of claim 11, wherein the artificial neural network is updated at test time.
- 15. A non-transitory computer-readable medium having encoded thereon program code for processing video, the program code being executed by a processor and comprising: program code for receiving the video as input at a first layer of a first artificial neural network (ANN) and a second artificial neural network, the first artificial neural network having fewer channels than the second artificial neural network; program code for processing a first frame of the video via the first artificial neural network to generate a first label, the first artificial neural network providing intermediate features extracted from the first frame of the video to the second artificial neural network; program code for processing the first frame of the video via the second artificial neural network to generate a second label based on the intermediate features and the first frame; and program code for updating the first artificial neural network based on the first label while the second artificial neural network concurrently processes a second frame of the video.
- 16. The non-transitory computer-readable medium of claim 15, further comprising program code for generating a third label via the second artificial neural network based on the second frame, and wherein the concurrent processing is performed such that temporal inconsistencies between the second label and the third label are reduced.
- 17. The non-transitory computer-readable medium of claim 15, wherein the first artificial neural network operates at a lower resolution than a resolution of the second artificial neural network.
Description
Efficient test-time adaptation for improved temporal consistency in video processing

Cross Reference to Related Applications

The present application claims priority from U.S. patent application Ser. No. 17/198,147, filed on March 10, 2021, entitled "EFFICIENT TEST-TIME ADAPTATION FOR IMPROVED TEMPORAL CONSISTENCY IN VIDEO PROCESSING", the disclosure of which is expressly incorporated herein by reference in its entirety.

FIELD

Aspects of the present disclosure relate generally to neural networks, and more particularly to video processing.

Background

An artificial neural network may include groups of interconnected artificial neurons (e.g., neuron models). The artificial neural network may be a computing device or may be represented as a method to be performed by a computing device. The neural network may include operations that consume tensors and operations that generate tensors. Neural networks may be used to solve complex problems; however, because the network size and the amount of computation performed to produce a solution may be very large, a network may take a long time to complete a task. Furthermore, the computational cost of deep neural networks can be problematic because these tasks may be performed on mobile devices, which may have limited computational power.

Convolutional neural networks are one type of feedforward artificial neural network. A convolutional neural network may include a collection of neurons, each having a receptive field and collectively tiling an input space. Convolutional neural networks (CNNs), such as deep convolutional neural networks (DCNs), have numerous applications. In particular, these neural network architectures are used in various technologies such as image recognition, pattern recognition, speech recognition, autonomous driving, object segmentation in video streams, video processing, and other classification tasks.
Modern deep-learning-based video processing methods or models may generate inconsistent output over time. In some cases, inconsistencies may be observed in the form of display flicker or other misalignments. Temporally inconsistent outputs (e.g., flicker) may degrade the user's experience and enjoyment, as well as system stability and performance.

SUMMARY

In one aspect of the disclosure, a method for processing video is provided. The method includes receiving the video as input at a first layer of an artificial neural network (ANN). The method also includes processing a first frame of the video to generate a first label. Additionally, the method includes updating the artificial neural network based on the first label. The updating of the artificial neural network is performed while a second frame of the video is concurrently processed.

In another aspect of the present disclosure, an apparatus for processing video is provided. The apparatus includes a memory and one or more processors coupled to the memory. The processor(s) is configured to receive the video as input at a first layer of an artificial neural network (ANN). The processor(s) is also configured to process a first frame of the video to generate a first label. Further, the processor(s) is configured to update the artificial neural network based on the first label. The updating of the artificial neural network is performed while a second frame of the video is concurrently processed.

In one aspect of the present disclosure, an apparatus for processing video is provided. The apparatus includes means for receiving the video as input at a first layer of an artificial neural network (ANN). The apparatus also includes means for processing a first frame of the video to generate a first label. Additionally, the apparatus includes means for updating the artificial neural network based on the first label. The updating of the artificial neural network is performed while a second frame of the video is concurrently processed.
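The single-network aspect described above and in claims 1 and 11 (apply an argmax to the model's output to obtain a pseudo-label, then update the model with a cross-entropy loss computed from the output and that pseudo-label) can be sketched with a toy linear per-pixel classification head. This is a minimal illustration, not the patented implementation: the feature dimensions, learning rate, and plain-NumPy gradient step are all assumptions chosen for self-containment.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tta_step(W, frame_feats, lr=0.1):
    """One test-time adaptation step on a linear classification head.

    frame_feats: (num_pixels, d) features extracted from one frame.
    W:           (d, num_classes) head weights.
    Returns the updated weights and the cross-entropy loss against
    the frame's own argmax pseudo-labels.
    """
    logits = frame_feats @ W                     # (num_pixels, C)
    pseudo = logits.argmax(axis=-1)              # pseudo-labels via argmax
    probs = softmax(logits)
    n = frame_feats.shape[0]
    # Cross-entropy against the pseudo-labels.
    loss = -np.log(probs[np.arange(n), pseudo] + 1e-12).mean()
    onehot = np.eye(W.shape[1])[pseudo]
    grad = frame_feats.T @ (probs - onehot) / n  # dCE/dW for softmax + CE
    return W - lr * grad, loss

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3)) * 0.1        # 8-dim features, 3 classes (toy sizes)
feats = rng.normal(size=(100, 8))        # one frame's per-pixel features
W1, loss1 = tta_step(W, feats)
W2, loss2 = tta_step(W1, feats)          # loss against pseudo-labels decreases
```

Because the pseudo-label is the argmax of the model's own output, this update behaves like entropy minimization: repeated steps sharpen the model's predictions on the current frame, which is the mechanism the disclosure uses to stabilize labels across frames.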
In a further aspect of the disclosure, a non-transitory computer-readable medium is provided. The computer-readable medium has encoded thereon program code for processing video. The program code is executed by a processor and includes code for receiving the video as input at a first layer of the artificial neural network. The program code also includes code for processing a first frame of the video to generate a first label. Further, the program code includes code for updating the artificial neural network based on the first label. The updating of the artificial neural network is performed while a second frame of the video is concurrently processed.

In one aspect of the disclosure, a method for processing video is provided. The method includes receiving the video as input at a first layer of a first artificial neural network and a second artificial neural network. The first artificial neural network has fewer channels than the second artificial neural network. The method also includes processing a first frame of the video via the first artificial neural network to generate a first label.
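The dual-network aspect (claims 5, 8, and 15) has a lightweight network that is adapted online while a larger network, which also consumes the lightweight network's intermediate features, concurrently processes the next frame. The concurrency structure can be sketched with threads and stand-in linear "networks"; every name, shape, and the placeholder update rule here is a hypothetical stand-in for the networks and the cross-entropy update the disclosure describes.

```python
import threading
import numpy as np

# Stand-in "networks": the small net has fewer channels than the large one.
SMALL_CH, LARGE_CH = 8, 32

def small_net(frame, w):
    feats = frame @ w               # intermediate features (fewer channels)
    label = feats.argmax(axis=-1)   # first label, used to adapt the small net
    return feats, label

def large_net(frame, feats, w2):
    # The second network consumes the frame plus the small net's features.
    logits = np.concatenate([frame, feats], axis=-1) @ w2
    return logits.argmax(axis=-1)

rng = np.random.default_rng(1)
w = rng.normal(size=(16, SMALL_CH))
w2 = rng.normal(size=(16 + SMALL_CH, LARGE_CH))
frame1, frame2 = rng.normal(size=(2, 64, 16))  # two toy 64-pixel frames

feats1, label1 = small_net(frame1, w)  # first frame, first label
results = {}

def update_small():
    # Placeholder for adapting the small net from its own first label
    # (standing in for the cross-entropy pseudo-label step).
    results["w"] = w + 1e-3 * rng.normal(size=w.shape)

def process_second():
    # Meanwhile, the large net processes the second frame.
    feats2, _ = small_net(frame2, w)
    results["label2"] = large_net(frame2, feats2, w2)

t1 = threading.Thread(target=update_small)
t2 = threading.Thread(target=process_second)
t1.start(); t2.start()
t1.join(); t2.join()   # adaptation and second-frame inference ran concurrently
```

The design point this illustrates is that only the small, cheap network is updated at test time, so adaptation does not stall the larger network's per-frame inference.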