US-20260127751-A1 - Systems and Methods for Performing Optical Flow Using GPU Tensor Processing Cores
Abstract
The present disclosure relates to machine vision systems and methods for performing optical flow calculations. Machine vision systems in accordance with many embodiments of the invention use GPU tensor processing cores to perform one-dimensional Discrete Fourier Transform (DFT) calculations using real DFT matrices, enabling efficient separable window correlation for optical flow. In one embodiment, the machine vision system includes: a camera; a processor; a processor comprising tensor processing cores; and a memory containing instructions. Executing the instructions using the processors causes the machine vision system to: obtain a pair of sequential images from the camera; identify windows in the images; perform optical flow calculations using separable window correlation, wherein the separable window correlation calculations comprise performing one-dimensional discrete Fourier transform (DFT) calculations using the tensor processing cores, and wherein the one-dimensional DFT calculations are performed on the tensor processing cores using real DFT matrices; and output optical flow information.
Inventors
- Samuel H. Foxman
- Scott A. Bollt
- Morteza Gharib
Assignees
- CALIFORNIA INSTITUTE OF TECHNOLOGY
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-11-03
Claims (20)
- 1 . A machine vision system, comprising: a camera; a processor; a processor comprising tensor processing cores; and a memory containing instructions that, when executed by the processor, cause the machine vision system to: obtain a pair of sequential input images from the camera; identify windows in the input images; perform optical flow calculations using separable window correlation, where: the separable window correlation calculations comprise performing one-dimensional Discrete Fourier Transform (DFT) calculations using the tensor processing cores, and the one-dimensional DFT calculations are performed on the tensor processing cores using real Discrete Fourier Transform matrices; and output optical flow information for the input images.
- 2 . The machine vision system of claim 1 , wherein the instructions further cause the machine vision system to generate the real Discrete Fourier Transform matrices by: expanding a complex Discrete Fourier Transform matrix into an expanded matrix; removing redundant rows from the expanded matrix; and scaling DC and Nyquist rows of the resulting matrix.
- 3 . The machine vision system of claim 2 , wherein one of the real Discrete Fourier Transform matrices R is defined by: $R_{r,c} := \begin{cases} \sqrt{2}/2 & \text{if } r = 0 \\ (\sqrt{2}/2)\cos(\pi c) & \text{if } r = 1 \\ \cos(\alpha_{\lfloor r/2 \rfloor}\, c) & \text{if } r \geq 2 \text{ and } r \text{ is even} \\ \sin(\alpha_{\lfloor r/2 \rfloor}\, c) & \text{if } r \geq 3 \text{ and } r \text{ is odd} \end{cases}$ where $\alpha_k$ represents a frequency component associated with each row.
- 4 . The machine vision system of claim 1 , wherein the instructions further cause the machine vision system to reconstruct complex Fourier space values for the 2D discrete Fourier transform from outputs of the real Discrete Fourier Transform matrices by: removing a DC×DC component; handling top-left corner values; processing top two rows and left two columns; and reconstructing remaining complex values using 2×2 submatrices.
- 5 . The machine vision system of claim 4 , wherein reconstructing the remaining complex values comprises: for a 2×2 submatrix with top-left corner (u,v), calculating: $F_{u/2,\,v/2} = (S_{u,v} - S_{u+1,v+1}) + i(S_{u,v+1} + S_{u+1,v})$ and $F_{W-u/2,\,v/2} = (S_{u,v} + S_{u+1,v+1}) + i(S_{u,v+1} - S_{u+1,v})$, where F represents complex Fourier space values for the 2D discrete Fourier transform, S represents outputs of the real Discrete Fourier Transform matrices, and W is the window size.
- 6 . The machine vision system of claim 1 , wherein the instructions further cause the machine vision system to accelerate an ArgMax calculation by: bit-casting float16 values to int16 format; packing maximum values and their indices into single 32-bit integers; performing a warp-wide int32 max reduction; and extracting a maximum value and its index from the reduction result.
- 7 . The machine vision system of claim 6 , wherein accelerating the ArgMax calculation further comprises: comparing float16 values with zero and bit-casting the maximum to int16 format; left-shifting the bit-cast value by 16 bits and combining it with an index; applying a warp-wide int32 max function to the combined value; and extracting a maximum value and its index from the reduction result using bit masking and shifting operations.
- 8 . The machine vision system of claim 1 , wherein the instructions further cause the machine vision system to accelerate matrix transposition by: executing a nested loop structure in parallel across GPU threads; determining a permutation index for each value in a first matrix (C); and reassigning values to a matrix transpose ($B^T$) based on the determined permutation indices.
- 9 . The machine vision system of claim 8 , wherein determining the permutation index comprises calculating: $L_{B^T}^{-1}(L_C(t, v))$ where $L_C$ and $L_{B^T}$ are layout functions that define how register indices map to positions in the full matrices C and $B^T$, respectively, t is a thread index, and v is a value index within the thread.
- 10 . The machine vision system of claim 1 , wherein the windows in the input images are 32 pixels by 32 pixels.
- 11 . A method for performing optical flow in a machine vision system, the method comprising: obtaining, using a processor, a pair of sequential input images; identifying, using the processor, windows in the input images; performing, using a processor incorporating at least one tensor processing core, optical flow calculations using separable window correlation, wherein the separable window correlation calculations comprise: obtaining, using the processor incorporating the at least one tensor processing core, one-dimensional discrete Fourier transforms (DFTs) of columns and rows of a window in a first image using a real DFT matrix; obtaining, using the processor incorporating the at least one tensor processing core, one-dimensional DFTs of columns and rows of a corresponding window in a second image using the real DFT matrix; reconstructing, using the processor incorporating the at least one tensor processing core, complex Fourier space values for the 2D discrete Fourier transform from outputs obtained using the real DFT matrices; performing, using the processor incorporating the at least one tensor processing core, elementwise multiply-conjugate operations with respect to the reconstructed complex Fourier space values; converting, using the processor incorporating the at least one tensor processing core, complex products to real values to obtain a real value matrix; obtaining, using the processor incorporating the at least one tensor processing core, one-dimensional inverse discrete Fourier transforms (IDFTs) of rows and columns of the real value matrix; and determining, using the processor incorporating the at least one tensor processing core, subpixel peaks based upon output of the one-dimensional IDFTs; and outputting, using the processor, optical flow information for the input images.
- 12 . The method of claim 11 , wherein obtaining the one-dimensional discrete Fourier transforms using the real DFT matrix comprises: expanding a complex Discrete Fourier Transform matrix into an expanded matrix; removing redundant rows from the expanded matrix; and scaling DC and Nyquist rows of the resulting matrix to generate the real DFT matrix.
- 13 . The method of claim 12 , wherein the real DFT matrix R is defined by: $R_{r,c} := \begin{cases} \sqrt{2}/2 & \text{if } r = 0 \\ (\sqrt{2}/2)\cos(\pi c) & \text{if } r = 1 \\ \cos(\alpha_{\lfloor r/2 \rfloor}\, c) & \text{if } r \geq 2 \text{ and } r \text{ is even} \\ \sin(\alpha_{\lfloor r/2 \rfloor}\, c) & \text{if } r \geq 3 \text{ and } r \text{ is odd} \end{cases}$ where $\alpha_k$ represents a frequency component associated with each row.
- 14 . The method of claim 11 , wherein reconstructing complex Fourier space values from outputs obtained using the real DFT matrices comprises: removing a DC×DC component; handling top-left corner values; processing top two rows and left two columns; and reconstructing remaining complex values using 2×2 submatrices.
- 15 . The method of claim 14 , wherein reconstructing the remaining complex values comprises: for a 2×2 submatrix with top-left corner (u,v), calculating: $F_{u/2,\,v/2} = (S_{u,v} - S_{u+1,v+1}) + i(S_{u,v+1} + S_{u+1,v})$ and $F_{W-u/2,\,v/2} = (S_{u,v} + S_{u+1,v+1}) + i(S_{u,v+1} - S_{u+1,v})$, where F represents complex Fourier space values for the 2D discrete Fourier transform, S represents outputs of the real Discrete Fourier Transform matrices, and W is the window size.
- 16 . The method of claim 11 , further comprising accelerating an ArgMax calculation by: bit-casting float16 values to int16 format; packing maximum values and their indices into single 32-bit integers; performing a warp-wide int32 max reduction; and extracting a maximum value and its index from the reduction result.
- 17 . The method of claim 16 , wherein accelerating the ArgMax calculation further comprises: comparing float16 values with zero and bit-casting the maximum to int16 format; left-shifting the bit-cast value by 16 bits and combining it with an index; applying a warp-wide int32 max function to the combined value; and extracting a maximum value and its index from the reduction result using bit masking and shifting operations.
- 18 . The method of claim 11 , further comprising accelerating matrix transposition by: executing a nested loop structure in parallel across GPU threads; determining a permutation index for each value in a first matrix (C); and reassigning values to a matrix transpose ($B^T$) based on the determined permutation indices.
- 19 . The method of claim 18 , wherein determining the permutation index comprises calculating: $L_{B^T}^{-1}(L_C(t, v))$ where $L_C$ and $L_{B^T}$ are layout functions that define how register indices map to positions in the full matrices C and $B^T$, respectively, t is a thread index, and v is a value index within the thread.
- 20 . The method of claim 11 , wherein the windows in the input images are 32 pixels by 32 pixels.
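As an illustrative aside, the real DFT matrix of claims 2-3 can be checked numerically against a standard FFT. The sketch below assumes $\alpha_k = 2\pi k/N$ (the claims only call $\alpha_k$ "a frequency component") and follows claim 2's row layout (scaled DC row, scaled Nyquist row, then interleaved cosine/sine rows); it is a NumPy illustration, not the patented GPU implementation:

```python
import numpy as np

def real_dft_matrix(n):
    """Real DFT matrix per claim 3 (n assumed even, alpha_k = 2*pi*k/n assumed)."""
    c = np.arange(n)
    R = np.empty((n, n))
    R[0] = np.sqrt(2) / 2                      # scaled DC row
    R[1] = np.sqrt(2) / 2 * np.cos(np.pi * c)  # scaled Nyquist row
    for r in range(2, n):
        a = 2 * np.pi * (r // 2) / n           # frequency index floor(r/2)
        R[r] = np.cos(a * c) if r % 2 == 0 else np.sin(a * c)
    return R

n = 16
x = np.random.default_rng(0).standard_normal(n)
s = real_dft_matrix(n) @ x   # packed real spectrum
X = np.fft.rfft(x)           # reference complex spectrum
# s[2k] holds Re(X_k) and s[2k+1] holds -Im(X_k) for 1 <= k < n/2;
# s[0] and s[1] hold the DC and Nyquist bins scaled by sqrt(2)/2.
```

The sqrt(2)/2 scaling of the DC and Nyquist rows keeps the packed matrix orthogonal up to a uniform factor, which is what allows the same construction to serve for the inverse transform.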
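Claim 5's 2×2 reconstruction can likewise be sanity-checked. The sketch assumes $S = R\,A\,R^T$ (separable 1-D real DFTs of a window A over columns, then rows) and that F follows the $e^{+i}$ sign convention, i.e. the conjugate of NumPy's `fft2`; both are assumptions read off the claims rather than statements of the patented implementation:

```python
import numpy as np

def real_dft_matrix(n):
    # same construction as claim 3, with alpha_k = 2*pi*k/n assumed
    c = np.arange(n)
    R = np.empty((n, n))
    R[0] = np.sqrt(2) / 2
    R[1] = np.sqrt(2) / 2 * np.cos(np.pi * c)
    for r in range(2, n):
        a = 2 * np.pi * (r // 2) / n
        R[r] = np.cos(a * c) if r % 2 == 0 else np.sin(a * c)
    return R

W = 8
A = np.random.default_rng(1).standard_normal((W, W))
S = real_dft_matrix(W) @ A @ real_dft_matrix(W).T  # 1-D real DFTs of columns, then rows
F = np.conj(np.fft.fft2(A))                        # e^{+i} convention (assumption)

def reconstruct(u, v):
    """Claim 5's two complex values for the 2x2 submatrix with top-left corner (u, v)."""
    f1 = (S[u, v] - S[u + 1, v + 1]) + 1j * (S[u, v + 1] + S[u + 1, v])
    f2 = (S[u, v] + S[u + 1, v + 1]) + 1j * (S[u, v + 1] - S[u + 1, v])
    return f1, f2
```

Each 2×2 block of S holds the four cosine/sine cross-products for one frequency pair (k, l), so two complex bins, $F_{k,l}$ and $F_{W-k,l}$, fall out of the same four real values.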
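The ArgMax acceleration in claims 6-7 relies on non-negative float16 values having the same ordering as their int16 bit patterns, so value and index can be packed into one int32 and reduced with a single integer max. A NumPy simulation of the trick (with the warp-wide reduction stood in for by a plain `max`, and made-up example scores) might look like:

```python
import numpy as np

vals = np.float16([0.25, 3.5, -1.0, 3.25, 0.7])   # example correlation scores
clamped = np.maximum(vals, np.float16(0))          # compare with zero first ...
bits = clamped.view(np.int16).astype(np.int32)     # ... then bit-cast to int16
packed = (bits << 16) | np.arange(len(vals), dtype=np.int32)  # value high, index low
best = packed.max()                 # stands in for the warp-wide int32 max reduction
best_idx = int(best & 0xFFFF)                      # extract the index by masking
best_val = np.int16(best >> 16).view(np.float16)   # and the value by shifting back
```

The clamp matters: negative float16 bit patterns do not order correctly as int16, so forcing scores non-negative first is what makes the integer comparison agree with the floating-point one.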
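Claims 8-9 describe transposition as re-laying-out per-thread register values via $L_{B^T}^{-1}(L_C(t, v))$: the value a thread holds at register index v occupies full-matrix position $L_C(t, v)$, and the inverse of the target layout says which (thread, register) slot that position occupies in $B^T$. The toy layouts below (row fragments for C, column fragments for $B^T$) are invented for the sketch; real tensor-core fragment layouts are architecture-specific:

```python
import numpy as np

M = 4                       # toy 4x4 matrix: 4 "threads" holding 4 register values each
mat = np.arange(M * M).reshape(M, M)

def L_C(t, v):              # layout of C: thread t holds row t, value v is column v
    return (t, v)

def L_BT(t, v):             # layout of B^T: thread t holds column t, value v is row v
    return (v, t)

def L_BT_inv(pos):          # inverse layout: full-matrix position -> (thread, value)
    r, c = pos
    return (c, r)

regC = {(t, v): mat[L_C(t, v)] for t in range(M) for v in range(M)}
regBT = {}
for t in range(M):          # nested loop; each (t, v) runs on its own GPU thread in practice
    for v in range(M):
        regBT[L_BT_inv(L_C(t, v))] = regC[(t, v)]
```

After the reassignment, every register slot of the $B^T$ layout holds exactly the value its full-matrix position demands, with no round-trip through shared or global memory.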
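Stripped of the tensor-core specifics (real DFT matrices, packed reconstruction, subpixel peak fitting), the separable window correlation of claim 11 reduces to a familiar FFT cross-correlation: forward transforms of both windows, an elementwise multiply-conjugate, an inverse transform, and a peak search. The NumPy sketch below illustrates that flow; the window size and the (dy, dx) shift are made-up test data:

```python
import numpy as np

W = 32
base = np.random.default_rng(2).standard_normal((W + 8, W + 8))
dy, dx = 3, -2
win1 = base[4:4 + W, 4:4 + W]                       # window from the first image
win2 = base[4 + dy:4 + dy + W, 4 + dx:4 + dx + W]   # same scene shifted by (dy, dx)

# elementwise multiply-conjugate in Fourier space, then inverse transform
corr = np.fft.ifft2(np.fft.fft2(win1) * np.conj(np.fft.fft2(win2))).real
peak = np.unravel_index(np.argmax(corr), corr.shape)
flow = [int((p + W // 2) % W - W // 2) for p in peak]  # wrap peak to signed displacement
```

The correlation peak sits at the circular lag equal to the window displacement, which is the integer-pixel flow estimate the claimed subpixel peak step then refines.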
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The current application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 63/715,303, entitled “Portable Real-Time Optical Flow-Field Sensor”, filed Nov. 1, 2024, and U.S. Provisional Patent Application Ser. No. 63/823,578, entitled “Systems and Methods for Performing Optical Flow Using GPU Tensor Processing Cores”, filed Jun. 13, 2025. The disclosures of U.S. Provisional Patent Application Ser. Nos. 63/715,303 and 63/823,578 are incorporated herein by reference in their entireties.
FIELD OF INVENTION
The present disclosure relates to machine vision systems and methods, and more particularly to systems and methods for performing optical flow calculations using tensor processing cores within graphics processing units (GPUs).
BACKGROUND
Optical flow is a computer vision technique that estimates the motion of objects, surfaces, and edges between consecutive frames in a video sequence. Optical flow processes can calculate the apparent movement of pixels or features from one image to the next, providing valuable information about the dynamics of a scene. Optical flow processes have numerous applications across various fields. In computer vision and robotics, optical flow processes can aid in tasks such as motion detection, object tracking, and navigation. For video compression algorithms, optical flow processes can enable efficient encoding by predicting frame-to-frame changes. In autonomous vehicles, optical flow processes can contribute to obstacle avoidance and path planning. Medical imaging applications can also utilize optical flow for analyzing organ movements and blood flow. Implementing optical flow algorithms on Graphics Processing Units (GPUs) has become increasingly common due to the parallel processing capabilities of these specialized hardware components.
GPUs are designed to handle multiple computations simultaneously, making them well-suited for the pixel-level operations involved in optical flow calculations. Adapting optical flow algorithms for GPU architectures typically involves restructuring the computations to exploit parallel processing and optimize memory access patterns. GPU-based optical flow implementations can offer advantages such as improved processing speed and the ability to handle larger datasets. However, challenges exist in efficiently utilizing GPU resources, managing memory bandwidth, and balancing workload distribution across processing units. Additionally, achieving high accuracy while maintaining real-time performance remains an ongoing area of research and development.
The architectures of GPUs typically differ from those of Central Processing Units (CPUs) in several ways. GPUs typically contain a large number of smaller, more specialized processing cores optimized for performing many calculations in parallel. This design can allow GPUs to execute certain types of algorithms faster than conventional CPUs, particularly those involving matrix operations and floating-point arithmetic. The parallel processing capabilities of GPUs can make them particularly effective for tasks that can be broken down into many independent calculations. Image processing, including optical flow computations, often falls into this category as operations can be performed on multiple pixels or regions simultaneously. This parallelism enables GPUs to achieve significant speedups compared to sequential processing on CPUs for many computer vision and image analysis tasks.
Recent advancements in GPU technology have introduced tensor processing cores, which are specialized hardware units designed to accelerate specific types of mathematical operations commonly used in machine learning and scientific computing.
Tensor cores are optimized for matrix multiplication and accumulation operations, which form the basis of many deep learning algorithms and other computationally intensive tasks. Tensor processing cores achieve computational efficiencies through several mechanisms. They operate on lower precision data types, such as 16-bit floating-point numbers, which allows for faster calculations and reduced memory bandwidth usage. Tensor cores also employ specialized matrix multiply-accumulate operations to perform multiple fused multiply-add computations. This hardware-level optimization enables tensor cores to achieve significantly higher throughput for certain types of calculations compared to traditional GPU cores.
SUMMARY
Systems and methods in accordance with various embodiments of the invention accelerate optical flow calculations by leveraging tensor processing cores within graphics processing units (GPUs). This approach can enable significant performance improvements in terms of throughput and latency compared to traditional implementations. The acceleration of optical flow calculations can enhance real-time processing capabilities for high-resolution image streams, potenti