CN-114170546-B - Recognition model training method and recognition method for video target state
Abstract
The invention discloses a recognition model training method and a recognition method for a video target state. The model training method comprises: inputting image frames into a feature extraction module to obtain high-level features; up-sampling the high-level features through a spatial reasoning module, comparing the result with saliency labels for training, calculating a first loss function, and training until the first loss function converges to a preset degree; and sequentially inputting the high-level features into a ConvLSTM network, a fully connected layer and a Softmax layer in a time reasoning module to obtain a predicted state of the current target, comparing the predicted state with a state label for training, calculating a second loss function, and training until the second loss function converges to the preset degree, thereby obtaining a video target state recognition model. The recognition model can be used to recognize the target state in video images. By combining time reasoning and spatial reasoning, the method achieves higher state recognition precision and enables automatic feature event detection in video image sequences.
Inventors
- JIA TAO
- LI LING
- MA LEI
- CHEN JIAZHONG
- ZHONG JIAN
- JIN YI
- DONG YUAN
- ZHANG YANBIN
- LIU YANG
- LIU XIAOPENG
- CUI TIECHENG
Assignees
- 中国人民解放军63861部队
- 华中科技大学
Dates
- Publication Date
- 20260421
- Application Date
- 20211116
- Priority Date
- 20211116
Claims (7)
- 1. A recognition model training method for a video target state, characterized by comprising the following steps: inputting continuous image frames of the training set into a feature extraction module to obtain high-level features; up-sampling the high-level features through a spatial reasoning module, comparing the result with saliency labels for training, calculating a first loss function, and training until the first loss function converges to a preset degree; inputting the high-level features into a time reasoning module, wherein the time reasoning module comprises a ConvLSTM network, a fully connected layer and a Softmax layer, sequentially passing the high-level features through the ConvLSTM network, the fully connected layer and the Softmax layer to obtain a predicted state of the current target, comparing the predicted state with a state label for training, calculating a second loss function, and training until the second loss function converges to a preset degree, to obtain a video target state recognition model; wherein the image frames input into the feature extraction module are preprocessed image frames, and the preprocessing is as follows: storing the bitmap data of the bmp format file as a 16-bit unsigned integer x; normalizing the 16-bit unsigned integer x using a first conversion formula to obtain a normalized integer x', wherein max(x) represents the maximum value of the 16-bit unsigned integer x; converting the normalized integer x' into an 8-bit unsigned integer y; and performing a nonlinear conversion on the 8-bit unsigned integer y using a second conversion formula, and storing the conversion result z as a png format file.
- 2. The recognition model training method for a video target state according to claim 1, wherein the feature extraction module comprises 5 convolution layers of a VGG-16 network, and after the input image frame passes through the 5 convolution layers of the VGG-16 network, the 5th layer of VGG-16 outputs two paths of high-level features.
- 3. The recognition model training method for a video target state according to claim 1, wherein the spatial reasoning module comprises 4 deconvolution layers and 1 convolution layer; the high-level features input into the spatial reasoning module are up-sampled through the 4 deconvolution layers, a saliency pixel map is then obtained through the 1 convolution layer and output, and the output map is compared with the saliency label for training.
- 4. The recognition model training method for a video target state according to claim 3, wherein the spatial reasoning module further comprises a Sigmoid activation layer, and after the saliency pixel map is obtained through the 1 convolution layer, the saliency pixel map is passed through the Sigmoid activation and then output.
- 5. The method of claim 1, wherein the ConvLSTM network includes memory, forget gates and output gates.
- 6. The recognition model training method for a video target state according to claim 1, wherein the probability of each state is returned after passing through the Softmax layer, and the state with the highest probability is taken as the state of the current frame.
- 7. A recognition method for a video target state, comprising: acquiring a recognition model of a video target state, wherein the recognition model is obtained by the recognition model training method for a video target state according to any one of claims 1 to 6; and inputting an image frame into the recognition model, and outputting a state recognition result after the image frame passes through the feature extraction module and the time reasoning module of the recognition model; wherein, before the image frame is input into the recognition model, the method further comprises preprocessing the image frame, and the preprocessing is as follows: storing the bitmap data of the bmp format file as a 16-bit unsigned integer x; normalizing the 16-bit unsigned integer x using a first conversion formula to obtain a normalized integer x', wherein max(x) represents the maximum value of the 16-bit unsigned integer x; converting the normalized integer x' into an 8-bit unsigned integer y; and performing a nonlinear conversion on the 8-bit unsigned integer y using a second conversion formula, and storing the conversion result z as a png format file.
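The preprocessing recited in claims 1 and 7 can be illustrated with a minimal Python sketch. The two conversion formulas are not reproduced in this text, so both formulas below are assumptions made only for illustration: the normalization is assumed to scale each value by 255 / max(x), and the nonlinear conversion is assumed to be a gamma curve. The function names and the `gamma` parameter are hypothetical, and reading the bmp file and writing the png file are omitted.

```python
# Hypothetical sketch of the claimed preprocessing chain (x -> x' -> y -> z).
# ASSUMPTION: the source does not reproduce either conversion formula, so the
# scaling and the gamma curve below are stand-ins, not the claimed formulas.


def normalize_to_8bit(pixels_u16):
    """Scale 16-bit values by the frame maximum into the range 0..255."""
    peak = max(max(row) for row in pixels_u16)
    if peak == 0:
        # An all-black frame stays all black; avoids dividing by zero.
        return [[0 for _ in row] for row in pixels_u16]
    return [[round(v * 255 / peak) for v in row] for row in pixels_u16]


def nonlinear_convert(pixels_u8, gamma=0.5):
    """Apply an assumed gamma curve to 8-bit values; gamma < 1 brightens dark pixels."""
    return [[round(255 * (v / 255) ** gamma) for v in row] for row in pixels_u8]


# A tiny 2x2 frame of 16-bit unsigned values standing in for bmp pixel data.
frame_u16 = [[0, 1024], [4096, 16384]]
y = normalize_to_8bit(frame_u16)   # 8-bit unsigned values
z = nonlinear_convert(y)           # values that would be stored in the png file
```

With a gamma below 1, low-intensity pixels are stretched upward, which is one plausible reason to apply a nonlinear step to frames whose small target is dim against a large background; the source does not state the actual motivation or curve.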
Description
Recognition model training method and recognition method for video target state
Technical Field
The invention belongs to the technical field of video image data processing, and particularly relates to a video target state recognition model training method and a video target state recognition method.
Background
An image sequence records the actions and states of a target over a continuous period of time. It is continuous in both temporal and spatial information, and exhibits different spatio-temporal characteristics in different states. Exploiting this property, an image deep learning model can be built with artificial intelligence technology to detect and recognize the key-frame states of an image sequence. This facilitates automatic processing and analysis of image data, and can also be extended to various quasi-real-time applications to enable rapid, automatic grasp and evaluation of the target state.
However, video key-frame state recognition has mainly been studied for videos containing people, and state recognition of objects has received less attention. In videos containing people, a human pose detection model is often used to capture the motion information of the person; in applications such as pedestrian gait recognition, for example, the pedestrian target occupies almost the whole image frame. When recognizing the key-frame state of an object, by contrast, no joint-point information is available as an aid, and in some video images the target occupies only a small proportion of the image area. As a result, key-frame state detection accuracy is low for image sequences with small targets and large background areas.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a video target state recognition model training method and a video target state recognition method, with the aim of improving video target state recognition precision.
To achieve the above object, according to one aspect of the present invention, there is provided a recognition model training method for a video target state, comprising:
inputting continuous image frames of the training set into a feature extraction module to obtain high-level features;
up-sampling the high-level features through a spatial reasoning module, comparing the result with saliency labels for training, calculating a first loss function, and training until the first loss function converges to a preset degree;
inputting the high-level features into a time reasoning module, wherein the time reasoning module comprises a ConvLSTM network, a fully connected layer and a Softmax layer; sequentially passing the high-level features through the ConvLSTM network, the fully connected layer and the Softmax layer to obtain a predicted state of the current target; comparing the predicted state with a state label for training; calculating a second loss function; and training until the second loss function converges to a preset degree, to obtain a video target state recognition model.
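The two loss calculations and the convergence check in the steps above can be sketched in plain Python. The source names neither loss function nor the convergence criterion, so this sketch assumes a mean binary cross-entropy over a sigmoid-activated saliency map for the first loss, a softmax cross-entropy for the second loss, and "converged to a preset degree" read as the latest loss falling below a threshold. All function names are hypothetical, and the networks that would produce the raw scores are not modeled.

```python
import math


def sigmoid(value):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def saliency_loss(raw_map, saliency_label):
    """First loss (ASSUMED): mean binary cross-entropy against 0/1 saliency labels."""
    eps = 1e-12  # guards against log(0)
    total, count = 0.0, 0
    for raw_row, label_row in zip(raw_map, saliency_label):
        for raw, label in zip(raw_row, label_row):
            p = sigmoid(raw)
            total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
            count += 1
    return total / count


def softmax(scores):
    """Convert raw state scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def state_loss(scores, state_label):
    """Second loss (ASSUMED): cross-entropy of softmax(scores) for an integer label."""
    return -math.log(softmax(scores)[state_label])


def predict_state(scores):
    """Index of the most probable state."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def converged(loss_history, threshold):
    """ASSUMED reading of 'preset degree': the latest loss is below a threshold."""
    return bool(loss_history) and loss_history[-1] < threshold


# Toy values standing in for network outputs and labels.
scores = [0.5, 2.0, -1.0]
predicted = predict_state(scores)
first_loss = saliency_loss([[3.0, -3.0]], [[1, 0]])
second_loss = state_loss(scores, 1)
```

In a real training loop each loss would be minimized by gradient descent in its own stage, with `converged` deciding when to stop the first stage and begin the second; the threshold values are not given in the source.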
Preferably, the image frame input to the feature extraction module is a preprocessed image frame, and the preprocessing is as follows: storing the bitmap data of the bmp format file as a 16-bit unsigned integer x; normalizing the 16-bit unsigned integer x using a first conversion formula to obtain a normalized integer x', wherein max(x) represents the maximum value of the 16-bit unsigned integer x; converting the normalized integer x' into an 8-bit unsigned integer y; and performing a nonlinear conversion on the 8-bit unsigned integer y using a second conversion formula, and storing the conversion result z as a png format file.
Preferably, the feature extraction module comprises 5 convolution layers of the VGG-16 network, and after the input image frame passes through the 5 convolution layers of the VGG-16 network, the 5th layer of the VGG-16 outputs two paths of high-level features.
Preferably, the spatial reasoning module comprises 4 deconvolution layers and 1 convolution layer; the high-level features input into the spatial reasoning module are up-sampled through the 4 deconvolution layers, a saliency pixel map is then obtained through the 1 convolution layer and output, and the output map is compared with the saliency label for training.
Preferably, the spatial reasoning module further comprises a Sigmoid activation layer, and after the saliency pixel map is obtained through the 1 convolution layer, the saliency pixel map is passed through the Sigmoid activation and then output.
Preferably, the ConvLSTM network includes memory, forget gates, and output gates.
Preferably, the probability of each state is returned after passing through the Softmax layer, and the state with the highest probability is taken as the state of the current frame.
According to another aspect of the present