CN-121564802-B - Gesture recognition method and system based on video stream analysis
Abstract
The invention belongs to the technical field of computer vision and human-machine interaction, and relates to a gesture recognition method and system based on video stream analysis. The method comprises: acquiring the video acquisition equipment configuration and video stream frame images; writing frame timestamps, sampling at fixed intervals, and extracting consecutive key frames; binding key frame indexes to generate a key frame index structure; performing size normalization and contrast enhancement to obtain an effective gesture image content structure; assembling training samples and performing dual-model training to construct a dual-model alignment result structure; performing multi-frame gesture category consistency comparison, multi-frame gesture description consistency comparison, and average confidence threshold comparison to generate a gesture judgment result; and constructing a circulation control configuration structure. Through multi-frame consistency comparison and a dynamic circulation control mechanism, the invention effectively improves the accuracy and robustness of gesture recognition, reduces the influence of environmental noise and transient errors, and realizes efficient, adaptive gesture recognition.
Inventors
- CHEN HAITAO
- OUYANG FEN
- WANG MING
Assignees
- 湖南真通智用人工智能科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260116
Claims (9)
- 1. A gesture recognition method based on video stream analysis, comprising: acquiring the video acquisition equipment configuration and video stream frame images, performing frame timestamp writing, fixed-interval sampling, consecutive three-frame key frame extraction, and key frame index binding to generate a key frame index structure; based on the key frame index structure, performing YOLOv s hand region positioning, followed by hand region cropping and expansion, size normalization, and contrast enhancement to obtain an effective gesture image content structure; based on the effective gesture image content structure, performing training sample assembly and dual-model training, then three-frame parallel inference and gesture description generation, to construct a dual-model alignment result structure; based on the dual-model alignment result structure, performing three-frame gesture category consistency comparison, three-frame gesture description consistency comparison, and average confidence threshold comparison to generate a gesture judgment result, and constructing a circulation control configuration structure; wherein performing the three-frame parallel inference and gesture description generation and constructing the dual-model alignment result structure specifically comprises: the three-frame parallel inference comprises feeding the three normalized gesture frames in parallel into a multi-modal classification model inference pipeline and a joint point recognition model inference pipeline, and outputting three-frame gesture categories, three-frame confidence levels, and three-frame joint point coordinates; the gesture description generation comprises extracting fingertip relative distance features, inter-finger included angle features, and horizontal displacement features from the three frames of joint point coordinates, and performing denoising and smoothing on the joint point coordinates; computing each feature according to the joint point index relations and binding the feature results to their frames; and performing same-frame alignment and intra-group summarization of the three-frame gesture categories, confidence levels, joint point coordinates, and gesture descriptions, together with the corresponding frame sequence numbers and frame timestamps, to generate the dual-model alignment result structure.
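The gesture description features of claim 1 (fingertip relative distance, inter-finger included angle, horizontal displacement) can be sketched from per-frame joint point coordinates as below. This is a minimal illustration, not the patented implementation: the landmark indices (wrist = 0, thumb tip = 4, index fingertip = 8, as in common 21-point hand models) and the normalization choices are assumptions, and the denoising/smoothing step is omitted.

```python
import math

# Hypothetical landmark indices, borrowed from common 21-point hand models;
# the claim does not fix a specific index layout.
WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8

def fingertip_distance(landmarks):
    """Thumb-tip-to-index-tip distance, normalized by the
    wrist-to-index-tip distance so it is scale-invariant."""
    (tx, ty), (ix, iy) = landmarks[THUMB_TIP], landmarks[INDEX_TIP]
    (wx, wy) = landmarks[WRIST]
    tip = math.hypot(ix - tx, iy - ty)
    ref = math.hypot(ix - wx, iy - wy) or 1.0
    return tip / ref

def inter_finger_angle(landmarks):
    """Included angle (degrees) at the wrist between the thumb tip
    and the index fingertip."""
    (wx, wy) = landmarks[WRIST]
    v1 = (landmarks[THUMB_TIP][0] - wx, landmarks[THUMB_TIP][1] - wy)
    v2 = (landmarks[INDEX_TIP][0] - wx, landmarks[INDEX_TIP][1] - wy)
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2) or 1.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def horizontal_displacement(prev_landmarks, cur_landmarks):
    """Frame-to-frame horizontal displacement of the index fingertip."""
    return cur_landmarks[INDEX_TIP][0] - prev_landmarks[INDEX_TIP][0]
```

Each result would then be bound to its frame sequence number and timestamp before the same-frame alignment step.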
- 2. The method of claim 1, wherein the frame timestamp writing further comprises: providing time marks from a monotonically increasing system clock or from an acquisition hardware clock; when both hardware clock and system clock sources exist, completing time source mapping through a clock alignment table to convert the hardware time marks into uniform system time marks; meanwhile, generating a video frame sequence as an intermediate product, the video frame sequence comprising a frame record set enqueued in increasing frame-sequence-number order, each frame record comprising frame content, a frame sequence number, a frame timestamp, an acquisition channel identifier, and an abnormality mark; and maintaining the latest contiguous frame window with a ring buffer.
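The ring buffer of claim 2, holding the latest contiguous window of frame records enqueued in increasing sequence order, might look like the sketch below. The record field names (`seq`, `ts`, `channel`, `abnormal`) are illustrative, and the clock alignment table is omitted.

```python
from collections import deque

class FrameWindow:
    """Ring buffer keeping the most recent contiguous run of frame
    records; oldest records drop automatically at capacity."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)
        self._last_seq = None

    def enqueue(self, seq, ts, channel, content, abnormal=False):
        # Enforce the claim's increasing frame-sequence-number order.
        if self._last_seq is not None and seq <= self._last_seq:
            raise ValueError("frame sequence numbers must increase")
        self._last_seq = seq
        self._buf.append({"seq": seq, "ts": ts, "channel": channel,
                          "content": content, "abnormal": abnormal})

    def latest(self, n):
        """Return up to the n newest records, oldest first."""
        return list(self._buf)[-n:]
```

With capacity 3, enqueuing five frames leaves only the three newest in the window, which is the behaviour the "latest contiguous frame window" wording suggests.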
- 3. The method of claim 1, wherein the fixed-interval sampling further comprises: selecting a sampling start point from a candidate frame window of the video frame sequence at a sampling interval, the sampling interval being given by a fixed-interval sampling configuration comprising a sampling interval value, a start offset, and an allowed frame-skip count; and a sampling start reset condition that triggers a reset operation upon detecting a switch of the sampling channel identifier, a clock back-off flag raised by the continuity check, or consecutive occurrences of the abnormality mark.
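Claim 3's fixed-interval sampling with a reset condition can be sketched as follows. This is a simplified interpretation: only the channel-switch and clock-back-off resets are shown (the consecutive-abnormality reset is omitted), and the field names are assumptions.

```python
def select_sampling_starts(frames, interval, start_offset=0,
                           reset_on_channel_switch=True):
    """Pick candidate sampling start indices on a fixed-interval grid.

    `frames` is a list of dicts with a 'channel' identifier and an
    optional 'clock_backoff' flag. A detected channel switch or clock
    back-off restarts the sampling grid at that frame, mirroring the
    reset condition of claim 3."""
    starts, anchor, prev_channel = [], start_offset, None
    for i, f in enumerate(frames):
        reset = (
            (reset_on_channel_switch and prev_channel is not None
             and f["channel"] != prev_channel)
            or f.get("clock_backoff", False)
        )
        if reset:
            anchor = i  # restart the grid at the reset point
        if i >= anchor and (i - anchor) % interval == 0:
            starts.append(i)
        prev_channel = f["channel"]
    return starts
```

On an uninterrupted stream this degenerates to plain strided sampling; a mid-stream channel switch re-anchors the grid at the switch point.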
- 4. The method of claim 1, wherein the consecutive three-frame key frame extraction and key frame index binding further comprise: reading three frame records consecutively forward from the sampling start point, the three records being required to share the same acquisition channel identifier and have consecutive frame sequence numbers; upon encountering a frame-missing mark, searching forward for the next frame within the allowed frame-skip count; and performing key frame index binding to generate the key frame index structure, the key frame index structure comprising the three key frame image references, the corresponding frame sequence numbers, the corresponding frame timestamps, the acquisition channel identifier, an abnormality mark summary, and an incompleteness mark.
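The forward read of claim 4 — three usable frames from one channel, tolerating a bounded number of missing frames — might be sketched like this. The record fields are illustrative, and the strict consecutive-sequence-number check is simplified to the channel and skip constraints.

```python
def extract_three_keyframes(records, start, allowed_skips):
    """Read three frame records forward from `start` on a single
    acquisition channel, skipping up to `allowed_skips` records that
    carry a frame-missing mark.

    Returns (frames, incomplete): `incomplete` is True when fewer than
    three frames could be gathered, matching the claim's
    incompleteness mark."""
    channel = records[start]["channel"]
    picked, skips, i = [], 0, start
    while i < len(records) and len(picked) < 3:
        rec = records[i]
        if rec["channel"] != channel:
            break  # claim 4: all three frames share one channel
        if rec.get("missing", False):
            skips += 1
            if skips > allowed_skips:
                break  # skip budget exhausted
        else:
            picked.append(rec)
        i += 1
    return picked, len(picked) < 3
```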
- 5. The method of claim 1, wherein the YOLOv s hand region positioning further comprises: processing the three key frame images frame by frame with a YOLOv s target detection model, the model comprising a model weight file, a category mapping table, an input size configuration, and an inference threshold configuration, the inference threshold configuration comprising a candidate box confidence threshold and an overlap suppression threshold; feeding each key frame image into the inference pipeline after pixel format alignment and scale adjustment, and outputting a candidate bounding box set; performing overlap suppression on the candidate bounding box set to form a valid bounding box set; writing an undetected mark and ending the current round of processing when the valid bounding box set is empty; and selecting a primary bounding box by bounding box confidence combined with a bounding box area constraint when multiple valid bounding boxes exist.
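The post-detection logic of claim 5 — confidence filtering, overlap suppression, and primary-box selection under an area constraint — can be sketched independently of any particular detector. This is a generic greedy non-maximum-suppression sketch, not the patent's exact procedure; the `min_area` constraint is an assumed interpretation of the "bounding box area constraint".

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def select_primary_box(candidates, conf_thresh, overlap_thresh, min_area=0):
    """Greedy overlap suppression, then primary-box selection.

    `candidates` is a list of ((x1, y1, x2, y2), confidence) pairs;
    the two thresholds mirror claim 5's inference threshold
    configuration. Returns None when nothing survives, corresponding
    to the claim's undetected mark."""
    kept = []
    for box, conf in sorted(candidates, key=lambda c: -c[1]):
        if conf < conf_thresh:
            continue  # below the candidate box confidence threshold
        if all(iou(box, k[0]) <= overlap_thresh for k in kept):
            kept.append((box, conf))
    valid = [(b, c) for b, c in kept
             if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
    return max(valid, key=lambda c: c[1]) if valid else None
```

The area constraint lets a tiny but very confident spurious box lose to a plausible hand-sized box, which is the behaviour the claim appears to target.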
- 6. The method of claim 1, wherein the hand region cropping and expansion, size normalization, and contrast enhancement to obtain the effective gesture image content structure further comprise: the hand region cropping and expansion operation comprises expanding the bounding box width and height outward from the primary bounding box, the expansion ratio being defined as ten percent of the bounding box width and height, with boundary clipping correction applied after expansion; for frames carrying a displacement abnormality mark or a scale abnormality mark, further performing cropping consistency correction by smoothly fusing the expanded bounding box coordinates of the frame with those of adjacent frames; the size normalization operation comprises scaling the cropped gesture image to a uniform spatial size of 384 pixels in width and height using an aspect-ratio-preserving scaling strategy, with edge padding applied in the longer-side direction when the aspect ratio is inconsistent; the contrast enhancement operation comprises applying block histogram equalization to the size-normalized images to obtain local contrast improvement; and finally generating the effective gesture image content structure, which comprises the three normalized gesture images, three image quality marks, the corresponding frame sequence numbers, and the corresponding frame timestamps, the image quality marks comprising a sharpness description, a brightness description, a motion blur description, and a blur summary description and mark.
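The geometry of claim 6 — ten-percent box expansion with boundary clipping, then an aspect-preserving resize onto a 384-pixel square canvas — reduces to simple arithmetic, sketched below without any image library. The letterbox interpretation (scale the longer side to 384, pad the shorter dimension) is an assumption about the claim's "edge padding" wording; the block histogram equalization step is not shown.

```python
def expand_and_clip(box, img_w, img_h, ratio=0.10):
    """Expand an (x1, y1, x2, y2) box outward by `ratio` of its width
    and height (claim 6 fixes this at ten percent), then clip the
    result to the image bounds."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(img_w, x2 + dw), min(img_h, y2 + dh))

def letterbox_shape(w, h, target=384):
    """Aspect-preserving resize onto a `target`-square canvas: scale so
    the longer side becomes `target`, then pad the remainder.
    Returns (new_w, new_h, pad_w, pad_h)."""
    scale = target / max(w, h)
    nw, nh = round(w * scale), round(h * scale)
    return nw, nh, target - nw, target - nh
```

For a 640x480 crop this yields a 384x288 image with 96 rows of padding, preserving the hand's aspect ratio as the claim requires.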
- 7. The method of claim 1, wherein the three-frame gesture category consistency comparison further comprises: comparing the three gesture categories within the same group, checking in frame-sequence order whether adjacent frame categories are identical and whether all three categories are identical, and outputting a consistency mark; taking the common category as the candidate gesture category when the three categories are consistent, and writing a category conflict mark when they are inconsistent.
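The comparison of claim 7 is a small decision rule, sketched below; the conflict-mark value is an illustrative name, not the patent's exact field.

```python
def category_consistency(categories):
    """Compare three per-frame gesture categories in frame order.

    Returns (consistent, result): the common category becomes the
    candidate gesture when all three agree, otherwise a category
    conflict mark is written."""
    assert len(categories) == 3
    adjacent_equal = [categories[i] == categories[i + 1] for i in range(2)]
    if all(adjacent_equal):  # adjacent pairs equal => all three equal
        return True, categories[0]
    return False, "category_conflict"  # illustrative conflict mark
```

Requiring agreement across three sampled frames is what filters out the single-frame false positives described in the Background section.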
- 8. The method of claim 1, wherein the three-frame gesture description consistency comparison and the average confidence threshold comparison further comprise: the three-frame gesture description consistency comparison comprises extracting fingertip relative distance features, inter-finger included angle features, and horizontal displacement features from the three gesture descriptions, and performing a same-frame consistency check and a cross-frame consistency check on each sub-feature set, wherein the same-frame consistency check comprises verifying whether the joint point indexes are complete, the cross-frame consistency check comprises judging, via a preset threshold set, whether the feature changes satisfy the constraints, and a gesture consistency mark is output; the average confidence threshold comparison comprises validity-screening the three frame confidence levels, calculating an average confidence level, and comparing the average confidence level against a preset confidence threshold.
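The average confidence branch of claim 8 might be implemented as below. The validity-screening rule (drop values outside [0, 1]) is an assumed interpretation, since the claim does not define "validity" precisely.

```python
def average_confidence_check(confidences, threshold, valid_range=(0.0, 1.0)):
    """Validity-screen the per-frame confidences, average the
    survivors, and compare against a preset confidence threshold.

    Returns (passed, average); (False, None) when no confidence
    survives screening."""
    lo, hi = valid_range
    valid = [c for c in confidences if lo <= c <= hi]
    if not valid:
        return False, None
    avg = sum(valid) / len(valid)
    return avg >= threshold, avg
```

Averaging over three frames, rather than thresholding each frame alone, damps the transient confidence dips that single-frame classifiers show under illumination changes.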
- 9. A gesture recognition system based on video stream analysis, applied to the method of any one of claims 1-8, comprising: a video stream acquisition and key frame interception module, configured to acquire the video acquisition equipment configuration and video stream frame images, write frame timestamps, sample at fixed intervals, extract three consecutive key frames, and bind key frame indexes to generate a key frame index structure; a hand region positioning and effective gesture image generation module, configured to receive the key frame index structure, perform YOLOv s hand region positioning, and perform hand region cropping and expansion, size normalization, and contrast enhancement to obtain an effective gesture image content structure; a dual-model training and parallel inference module, configured to receive the effective gesture image content structure, access the invalid image content set and the joint point labeling data set, perform training sample assembly and dual-model training, perform three-frame parallel inference and gesture description generation, and construct a dual-model alignment result structure; and a consistency joint judgment and circulation control module, configured to receive the dual-model alignment result structure, perform three-frame gesture category consistency comparison, three-frame gesture description consistency comparison, and average confidence threshold comparison, generate a gesture judgment result, construct a circulation control configuration structure, and write the circulation control configuration structure back to the video stream acquisition and key frame interception module.
Description
Gesture recognition method and system based on video stream analysis
Technical Field
The invention belongs to the technical field of computer vision and human-machine interaction, and particularly relates to a gesture recognition method and system based on video stream analysis.
Background
Gesture recognition technology is currently widely applied in fields such as smart home, virtual reality, augmented reality, and vehicle-mounted systems, and has become an important mode of human-machine interaction. Conventional gesture recognition methods typically rely on specific sensors, such as depth cameras or data gloves; while highly accurate, these are expensive and significantly limited in use, making them hard to popularize. Gesture recognition based on ordinary camera (RGB) video streams is receiving attention because of its low cost and easy integration (for example, into mobile phones and notebook computers). However, this technology faces many challenges in practical applications. First, the background of a real scene is complex and changeable, gestures may occupy only a small proportion of the image, and they are easily interfered with by factors such as illumination changes and partial occlusion. Second, gesture actions themselves are highly flexible and mutually similar: different gestures may differ only slightly (such as the numbers "1" and "2"), and individual differences exist when different users make the same gesture, all of which make recognition difficult. Most existing video-stream-based recognition methods adopt a single deep learning model (such as a CNN) to classify images end to end. Although this approach works well on specific, clean data sets, it is prone to false positives in practical applications.
For example, an action similar in shape to but unrelated to the target gesture, or a static non-hand object that coincidentally resembles the gesture, may be misidentified by the model as a valid gesture, resulting in poor system robustness and a poor interaction experience. In addition, in the field of computer vision, existing schemes for gesture recognition based on video acquisition equipment configuration and video stream frame images are generally built around frame timestamp writing, fixed-interval sampling, consecutive three-frame key frame extraction, key frame index binding, hand region positioning, hand region cropping and expansion, size normalization and contrast enhancement, training sample assembly, dual-model training, and other links. Their limitations include incomplete association records between the key frame index structure and the video stream frame images, unclear field connections between the invalid image content set and the joint point labeling data set in the training sample assembly link, and a non-uniform record caliber of the dual-model alignment result structure in the same-frame alignment and intra-group summarization link.
In existing methods, gesture category output and joint point coordinate output are completed along multiple single paths; three-frame gesture category consistency comparison and three-frame gesture description consistency comparison tend to diverge under complex backgrounds, illumination changes, and irrelevant-action interference; the trigger conditions of the average confidence threshold comparison and of the gesture judgment result are hard to keep consistent; the write-back link of the circulation control configuration structure lacks index records consistent with the key frame index structure; and the calling relations among the key frame index structure, the effective gesture image content structure, and the dual-model alignment result structure are hard to close continuously. For the joint processing of video acquisition equipment configuration and video stream frame images across the sampling, alignment, judgment, control, and recording links, the prior art generally lacks a through data structure and call path covering frame timestamp writing through circulation control configuration structure write-back; it is difficult to complete, within the same processing link, the state recording from hand region positioning through training sample assembly and dual-model training, or to form a consistent closed loop between the gesture judgment result and the upstream sampling configuration update, so that the gesture judgment result and the generation and updating of the circulation control configuration structure lack consistent processing-link support.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a gesture recognition method based on video stream analysis, which comprises the following steps: acquiring configuration of video acquisition equipment and video stream frame images, performing frame time s