CN-122024285-A - Gesture recognition system based on self-adaptive single-pixel imaging and optimized ResNet network
Abstract
The invention provides a gesture recognition method and device based on self-adaptive Fourier single-pixel imaging and optimized ResNet deep learning. The method comprises: constructing a self-adaptive Fourier single-pixel imaging system from a single-pixel detector, a DMD and a scene perception control module; identifying the gesture state by monitoring changes in key Fourier coefficients, dynamically adjusting the sampling strategy, modulation parameters and imaging mode, and obtaining compressed sensing measurements; reconstructing gesture images with a multi-constraint inverse Fourier transform algorithm that integrates a contour sparsity constraint and frequency-domain attention weights; constructing an optimized ResNet network comprising a hierarchical attention mechanism, multi-scale feature extraction, depthwise separable convolution and pyramid pooling modules; training the network with an improved strategy that uses a combined loss function, the AdamW optimizer and OneCycleLR learning rate scheduling; and finally performing real-time gesture recognition. The invention features lightweight hardware, low power consumption, strong environmental adaptability and high recognition accuracy, and can be widely applied in fields such as smart home, virtual reality and intelligent driving.
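As a rough illustration of the learning rate scheduling named in the abstract, the sketch below computes the rate at a given training step using the numbers stated in claim 8: a linear rise from 1e-4 to 1e-2 over the first 30% of training steps, followed by cosine decay from 1e-2 to 1e-6 over the remaining 70%. The function name `one_cycle_lr` and its parameters are hypothetical, and this is a plain-Python stand-in for the described schedule, not PyTorch's `OneCycleLR` class.

```python
import math


def one_cycle_lr(step, total_steps, lr_start=1e-4, lr_peak=1e-2,
                 lr_end=1e-6, warmup_frac=0.3):
    """Return the learning rate at `step` (0 <= step <= total_steps).

    Hypothetical helper: linear warm-up for the first `warmup_frac` of
    training, then cosine decay, using the values quoted in claim 8.
    """
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        # Linear increase from lr_start to lr_peak.
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    # Cosine decay from lr_peak to lr_end over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))
```

Calling `one_cycle_lr(step, total_steps)` once per optimizer step and writing the result into the optimizer's learning rate would reproduce the shape of the schedule; how the patent's implementation wires this into AdamW is not stated in the text.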
Inventors
- JI ZHONG
- WANG HAO
- LIU YUJIN
- CHEN XUELI
Assignees
- 西安电子科技大学广州研究院
Dates
- Publication Date
- 20260512
- Application Date
- 20260203
Claims (10)
- 1. A gesture recognition method based on self-adaptive single-pixel imaging and optimized ResNet deep learning, characterized by comprising: step 1, constructing a self-adaptive Fourier single-pixel imaging system from a single-pixel detector, a digital micromirror device (DMD) and a scene perception control module, recognizing the gesture state by monitoring changes in key Fourier coefficients, dynamically adjusting the sampling strategy, DMD modulation parameters and imaging mode, and obtaining compressed sensing measurements of the gesture area; step 2, based on compressed sensing theory, reconstructing the gesture image with a multi-constraint inverse Fourier transform algorithm that fuses a contour sparsity constraint and frequency-domain attention weights, and preprocessing the image; step 3, constructing a deep convolutional neural network model based on optimized ResNet, the network comprising a dynamic hierarchical attention fusion module, a cross-scale feature interaction module, a lightweight hybrid convolution module and a global-local joint pooling module; step 4, training the network model with a customized training strategy, the customized training strategy adopting a focal loss function and three-stage self-adaptive learning rate scheduling; and step 5, acquiring single-pixel imaging data in real time and recognizing gestures.
- 2. The method according to claim 1, wherein the gesture state recognition in step 1 is implemented by monitoring the total variation of 6 key Fourier coefficients (3 on the u-axis, 3 on the v-axis), the total variation being expressed as [formula omitted]; when the total variation is greater than a threshold T (0.0005-0.1, adjusted adaptively to the scene), the gesture is judged to be a dynamic gesture, and otherwise a static gesture.
- 3. The method of claim 1, wherein the dynamic adjustment strategy in step 1 comprises: a dynamic gesture activates the dynamic imaging mode, in which an undersampling strategy is adopted and temporal resolution is prioritized; and a static gesture activates the static imaging mode, in which the sampling ratio is gradually increased, the signal-to-noise ratio is optimized by averaging multiple rounds of full sampling, and spatial resolution and image quality are prioritized.
- 4. The method of claim 1, wherein the DMD modulation parameter adjustment in step 1 comprises: using the highest refresh rate of 22727 Hz in the dynamic imaging mode, and adaptively adjusting the refresh rate between 10309 Hz and 22727 Hz according to the sampling ratio in the static imaging mode; and dividing the structured patterns into N+1 sequences, wherein seq.0 contains 18 scene detection patterns, and seq.1 to seq.N are imaging patterns divided equally along a circular sampling path.
- 5. The method of claim 1, wherein the spatial information encoding in step 1 uses 3-step phase-shift Fourier basis patterns expressed as [formula omitted], and the Fourier coefficients are calculated by [formula omitted] (j is the imaginary unit).
- 6. The method of claim 1, wherein the objective function of the multi-constraint inverse Fourier transform algorithm in step 2 is [formula omitted], wherein [symbol omitted] is a morphological gradient operator, [symbol omitted] is the frequency-domain attention weight matrix, and λ1 (0.01) and λ2 (0.005) are adaptively adjusted regularization parameters.
- 7. The method of claim 1, wherein the hierarchical attention mechanism module in step 3 employs CBAM modules in the shallow layers of the network and ECA modules in the deep layers; the multi-scale feature extraction module has four parallel branches based on an Inception structure; the depthwise separable convolution module comprises depthwise convolution and pointwise convolution; and the pyramid pooling module comprises adaptive average pooling layers at four scales: 1×1, 2×2, 3×3 and 6×6.
- 8. The method of claim 1, wherein the combined loss function in step 4 is 0.7 × Focal Loss + 0.3 × Label Smoothing Loss, and wherein, in the OneCycleLR learning rate scheduling strategy, the learning rate is linearly increased from 1e-4 to 1e-2 (first 30% of training steps) and then cosine-decayed from 1e-2 to 1e-6 (remaining 70% of training steps).
- 9. A gesture recognition device based on self-adaptive Fourier single-pixel imaging and optimized ResNet deep learning, characterized by comprising a self-adaptive single-pixel imaging module, a multi-constraint image reconstruction module, an optimized ResNet neural network module and a real-time recognition module, wherein: the self-adaptive single-pixel imaging module comprises a single-pixel detector, a DMD and a scene perception control module, and is used for monitoring the gesture state through key Fourier coefficients, dynamically adjusting the sampling strategy, modulation parameters and imaging mode, and collecting compressed sensing measurements of the gesture area; the multi-constraint image reconstruction module is used for reconstructing and preprocessing gesture images with a multi-constraint inverse Fourier transform algorithm that fuses a contour sparsity constraint and frequency-domain attention weights; the optimized ResNet neural network module is used for constructing a deep convolutional neural network comprising a hierarchical attention mechanism, multi-scale feature extraction, depthwise separable convolution and pyramid pooling modules, and for training it with an improved training strategy; and the real-time recognition module is used for receiving single-pixel imaging data in real time, recognizing and classifying gestures with the trained model, and supporting deployment on mobile devices and embedded systems through a pipelined processing architecture and a self-adaptive inference strategy.
- 10. The device of claim 9, wherein the light source of the self-adaptive single-pixel imaging module covers the visible to near-infrared band (500 nm-1100 nm) and comprises a 3 W broadband white LED and a 2.6 W near-infrared LED (1050 nm), and the single-pixel detector is a Thorlabs PDA100A2 photodiode amplifier with a response wavelength range of 400 nm-1100 nm.
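Claim 5 refers to 3-step phase-shift Fourier basis patterns, but the formulas themselves do not survive in this text. The sketch below simulates the commonly used three-step phase-shifting formulation of Fourier single-pixel imaging, which may differ in notation or scaling from the patent's own expressions. The function name `three_step_coefficient`, the pattern offset `a` and the contrast `b` are assumptions made for illustration; the single-pixel measurement is simulated as the sum of the scene multiplied by each pattern.

```python
import numpy as np


def three_step_coefficient(image, u, v, a=0.5, b=0.5):
    """Estimate the (u, v) Fourier coefficient of `image` from three
    simulated single-pixel measurements (phase shifts 0, 2pi/3, 4pi/3).

    Assumed pattern model: P_phi = a + b * cos(theta + phi), where
    theta = 2*pi*(u*y/H + v*x/W). Hypothetical helper, not the patent's code.
    """
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w]
    theta = 2.0 * np.pi * (u * y / h + v * x / w)
    phases = (0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0)
    # Each measurement is the total light reaching the single-pixel detector.
    d0, d1, d2 = (np.sum(image * (a + b * np.cos(theta + phi))) for phi in phases)
    # The three-step combination cancels the offset `a`; dividing by 3*b
    # removes the contrast so the result matches the DFT coefficient.
    return ((2.0 * d0 - d1 - d2) + np.sqrt(3.0) * 1j * (d1 - d2)) / (3.0 * b)
```

With this normalization the result should agree with `np.fft.fft2(image)[u, v]`, so the 6 key coefficients of claim 2 would need 6 × 3 = 18 patterns, which matches the 18 scene detection patterns in seq.0 of claim 4.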
Description
Gesture recognition system based on self-adaptive single-pixel imaging and optimized ResNet network
Technical Field
The invention relates to the technical field of computational imaging and computer vision, in particular to a gesture recognition system based on self-adaptive single-pixel imaging and an optimized ResNet network. The technology integrates frontier techniques such as computational imaging, compressed sensing theory and deep learning, and provides a new technical solution for the field of human-computer interaction.
Background
Gesture recognition, an important research direction in human-computer interaction, has evolved from early contact-based data gloves to the non-contact optical recognition schemes that are mainstream today. This evolution has been accompanied by breakthroughs in computer vision and deep learning, which have significantly improved the accuracy and practicality of vision-based gesture recognition systems. Current mainstream schemes collect gesture information with optical equipment such as RGB cameras, depth cameras or infrared sensors, and recognize gestures with image processing algorithms.
However, conventional gesture recognition systems based on array cameras have a number of technical bottlenecks. In terms of hardware architecture, these systems require complex sensor arrays, lens assemblies and support structures, which makes the overall equipment bulky and overweight and raises manufacturing costs. Such a hardware configuration not only increases system complexity but also limits application in mobile devices and embedded systems.
In terms of data processing, conventional schemes must transmit and process high-resolution image data, which places high demands on communication bandwidth and computing resources; the large data volume causes noticeable processing delay and makes real-time response difficult. Environmental adaptability is another prominent problem. The performance of existing systems drops sharply under strong-light interference, low illumination or occlusion; the field of view is limited and the constraints on the user's posture are strict. In particular, the systems perform poorly under non-frontal viewing, large tilt angles or complex backgrounds, which seriously affects user experience and practical value.
With the spread of the Internet of Things and mobile computing devices, gesture recognition technology faces stricter requirements for miniaturization, low power consumption and high reliability. These emerging application scenarios require the system to operate stably in resource-constrained environments while maintaining high recognition accuracy. Conventional solutions struggle to meet these growing demands, so new technical solutions are needed.
In addition, with the development of deep learning, convolutional neural networks (CNNs) have achieved significant success in image recognition tasks. The residual network ResNet, a classical deep convolutional architecture, effectively addresses the vanishing-gradient problem of deep networks by introducing residual connections, and performs well on many visual tasks.
However, for gesture recognition the standard ResNet still lacks a task-specific attention mechanism, extracts multi-scale features insufficiently, leaves room for improvement in computational efficiency, and relies on training strategies that are not sufficiently optimized.
In the prior art, some schemes disclose a Fourier single-pixel imaging system and an image reconstruction method but lack self-adaptive optimization for dynamic gesture scenes. Some schemes propose self-adaptive single-pixel imaging but focus on pattern design or sampling-path optimization, and do not address the specific requirements of gesture recognition to jointly optimize imaging and recognition. Some schemes combine simplified imaging techniques with gesture recognition but do not solve the problem of extracting features from low-quality single-pixel imaging data. Other schemes integrate modules such as attention mechanisms and multi-scale feature extraction into a deep learning network, but only in fixed combinations, without a design customized to the characteristics of single-pixel imaging data. Therefore, a simple combination of the prior art cannot meet the requirements of lightweight hardware, low power consumption, high precision and strong environmental adaptability for gesture recognition, and a new technical scheme is needed.
Disclosure of In