CN-116109531-B - Image processing method, device, computer equipment and storage medium


Abstract

The application provides an image processing method and apparatus, a computer device, and a storage medium, belonging to the technical field of artificial intelligence. The method comprises: mapping a first image, which contains a human body, into three intermediate images of different scales that represent image features of the first image; fusing the three intermediate images to obtain a second image; and mapping the second image into a target image in which different parts of the human body are labeled. By mapping the first image into three images of different scales and then fusing them, the scheme reduces structural complexity, computation, and inference time compared with fusing the intermediate images of every scale in a complex neural network, so that the model can be deployed on a mobile terminal.

Inventors

  • ZHANG YING
  • LI CHEN

Assignees

  • 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2021-11-10

Claims (20)

  1. An image processing method, wherein the method is implemented based on a human body analysis model used to analyze a first image and output images labeled with different parts of a human body, the method comprising: mapping the first image into three intermediate images, wherein the first image comprises a human body, and the three intermediate images have different scales and are used to represent image features of the first image; fusing the three intermediate images to obtain a second image; and mapping the second image into a target image in which different parts of the human body are labeled; wherein the method further comprises: preprocessing a first annotation image of a sample image to obtain an encoded image of the sample image, wherein the first annotation image indicates different parts of a sample human body in the sample image, and the encoded image indicates a prediction result for a previous frame of the sample image; concatenating the sample image and the encoded image to obtain an input image; and training the human body analysis model of the i-th iteration based on the input image with the first annotation image as supervision information, wherein i is a positive integer.
  2. The method of claim 1, wherein mapping the first image into three intermediate images comprises: convolving the first image to obtain a first intermediate image; convolving the first intermediate image to obtain a second intermediate image; and performing channel feature enhancement and semantic feature enhancement on the second intermediate image to obtain a third intermediate image, wherein the channel feature enhancement strengthens the importance of different channel features, and the semantic feature enhancement strengthens global semantic information.
  3. The method of claim 2, wherein fusing the three intermediate images to obtain a second image comprises: fusing the result of convolving the third intermediate image with the second intermediate image to obtain a first fused image; and fusing the result of convolving the first fused image with the first intermediate image to obtain a second fused image, and taking the second fused image as the second image.
  4. The method of claim 1, wherein mapping the second image into a target image comprises: increasing the resolution of the second image to obtain a third image, wherein the resolution of the third image is not higher than the resolution of the first image; and convolving the third image to obtain the target image.
  5. The method of claim 1, wherein preprocessing the first annotation image of the sample image to obtain the encoded image of the sample image comprises: performing an image transformation on the first annotation image of the sample image to obtain a second annotation image; and encoding the second annotation image to obtain the encoded image.
  6. The method of claim 5, wherein performing an image transformation on the first annotation image of the sample image to obtain a second annotation image comprises: performing at least one of a rigid transformation and a non-rigid transformation on the first annotation image to obtain the second annotation image.
  7. The method of claim 5, wherein encoding the second annotation image to obtain the encoded image comprises: mapping each pixel in the second annotation image to a target vector space according to the pixel class to which the pixel belongs, to obtain the encoded image.
  8. The method of claim 1, wherein concatenating the sample image with the encoded image to obtain an input image comprises: concatenating the sample image and the encoded image in the channel dimension to obtain the input image.
  9. The method of claim 1, wherein training the human body analysis model of the i-th iteration based on the input image with the first annotation image as supervision information comprises: performing human body analysis on the input image based on the human body analysis model of the i-th iteration to obtain a predicted image, wherein the predicted image indicates the predicted parts of the sample human body; determining, based on the first annotation image and the predicted image, a first loss indicating a difference between the first annotation image and the predicted image, a second loss indicating that difference after pixel weighting, and a third loss indicating that difference after dependency information is added per pixel, the dependency information indicating information contained in the pixels surrounding a pixel; and adjusting model parameters of the human body analysis model of the i-th iteration based on the first loss, the second loss, and the third loss.
  10. The method of claim 9, wherein determining the second loss based on the first annotation image and the predicted image comprises: determining a class weight for each pixel class based on the number of pixels of that class in the predicted image, wherein the class weight is inversely related to the number of pixels; and determining a weighted cross-entropy loss based on the class weight of each pixel class, and taking the weighted cross-entropy loss as the second loss.
  11. The method of claim 9, wherein determining the third loss based on the first annotation image and the predicted image comprises: determining an annotation probability distribution based on the first annotation image; determining a prediction probability distribution based on the predicted image; determining an annotation probability density function, a prediction probability density function, and a joint probability density function based on the annotation probability distribution and the prediction probability distribution; and determining a cross-entropy loss based on the annotation probability density function, the prediction probability density function, and the joint probability density function, and taking the cross-entropy loss as the third loss.
  12. An image processing apparatus, wherein the steps performed by the apparatus are implemented based on a human body analysis model used to perform human body analysis on a first image and output images labeled with different parts of a human body, the apparatus comprising: a first mapping module configured to map the first image into three intermediate images, wherein the first image comprises a human body, and the three intermediate images have different scales and are used to represent image features of the first image; an image fusion module configured to fuse the three intermediate images to obtain a second image; and a second mapping module configured to map the second image into a target image in which different parts of the human body are labeled; wherein the apparatus further comprises: a preprocessing module configured to preprocess a first annotation image of a sample image to obtain an encoded image of the sample image, wherein the first annotation image indicates different parts of a sample human body in the sample image, and the encoded image indicates a prediction result for a previous frame of the sample image; a concatenation module configured to concatenate the sample image and the encoded image to obtain an input image; and a training module configured to train the human body analysis model of the i-th iteration based on the input image with the first annotation image as supervision information, wherein i is a positive integer.
  13. The apparatus of claim 12, wherein the first mapping module is configured to: convolve the first image to obtain a first intermediate image; convolve the first intermediate image to obtain a second intermediate image; and perform channel feature enhancement and semantic feature enhancement on the second intermediate image to obtain a third intermediate image, wherein the channel feature enhancement strengthens the importance of different channel features, and the semantic feature enhancement strengthens global semantic information.
  14. The apparatus of claim 13, wherein the image fusion module is configured to: fuse the result of convolving the third intermediate image with the second intermediate image to obtain a first fused image; and fuse the result of convolving the first fused image with the first intermediate image to obtain a second fused image, and take the second fused image as the second image.
  15. The apparatus of claim 12, wherein the second mapping module is configured to: increase the resolution of the second image to obtain a third image, wherein the resolution of the third image is not higher than the resolution of the first image; and convolve the third image to obtain the target image.
  16. The apparatus of claim 12, wherein the preprocessing module is configured to: perform an image transformation on the first annotation image of the sample image to obtain a second annotation image; and encode the second annotation image to obtain the encoded image.
  17. The apparatus of claim 16, wherein the preprocessing module is configured to: perform at least one of a rigid transformation and a non-rigid transformation on the first annotation image to obtain the second annotation image.
  18. The apparatus of claim 16, wherein the preprocessing module is configured to: map each pixel in the second annotation image to a target vector space according to the pixel class to which the pixel belongs, to obtain the encoded image.
  19. The apparatus of claim 12, wherein the concatenation module is configured to: concatenate the sample image and the encoded image in the channel dimension to obtain the input image.
  20. The apparatus of claim 12, wherein the training module is configured to: perform human body analysis on the input image based on the human body analysis model of the i-th iteration to obtain a predicted image, wherein the predicted image indicates the predicted parts of the sample human body; determine, based on the first annotation image and the predicted image, a first loss indicating a difference between the first annotation image and the predicted image, a second loss indicating that difference after pixel weighting, and a third loss indicating that difference after dependency information is added per pixel, the dependency information indicating information contained in the pixels surrounding a pixel; and adjust model parameters of the human body analysis model of the i-th iteration based on the first loss, the second loss, and the third loss.
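
The inverse-frequency class weighting of claim 10 can be sketched as follows. The specific formula `total / (num_classes * count)` is an illustrative assumption; the claim only requires that each class weight be inversely related to its pixel count.

```python
from collections import Counter

def class_weights(pixel_classes, num_classes):
    """Per-class weights inversely related to pixel count (claim 10 sketch).

    pixel_classes: flat list of per-pixel class ids from the predicted image.
    """
    counts = Counter(pixel_classes)
    total = len(pixel_classes)
    # Inverse-frequency weighting: rare classes receive larger weights.
    # Classes absent from the image get weight 0 by convention here.
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

# A 4-pixel image with 3 background pixels (class 0) and 1 hand pixel (class 1):
weights = class_weights([0, 0, 0, 1], num_classes=2)  # rarer class weighs more
```

These weights would then scale the per-pixel cross-entropy terms to form the second loss, so errors on small body parts are not drowned out by the background.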

Description

Image processing method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence, and in particular to an image processing method and apparatus, a computer device, and a storage medium.

Background

Human body analysis is a technique of dividing the human body in an image or video into multiple semantically consistent regions, for example dividing the body into the head, hands, legs, and so on. Current human body analysis techniques generally use a deep neural network to predict which pixels of an image belong to the same semantic region, thereby segmenting the human body in the image and obtaining an accurate analysis result. However, the neural networks used in such schemes have complex structures, large computational cost, and long inference times, which makes them difficult to deploy on mobile terminals.

Disclosure of Invention

Embodiments of the present application provide an image processing method and apparatus, a computer device, and a storage medium that, compared with fusing the intermediate images of every scale in a complex neural network, reduce structural complexity, computation, and inference time, so that deployment on a mobile terminal becomes feasible. The technical scheme is as follows. In one aspect, an image processing method is provided, the method comprising: mapping a first image into three intermediate images, wherein the first image comprises a human body, and the three intermediate images have different scales and represent image features of the first image; fusing the three intermediate images to obtain a second image; and mapping the second image into a target image in which different parts of the human body are labeled.
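
The "channel feature enhancement" described for the third intermediate image resembles squeeze-and-excitation style gating. Below is a minimal pure-Python sketch under that assumption, using a sigmoid gate computed from each channel's global average; a trained model would replace this raw-mean gate with learned layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_enhance(feature_maps):
    """Reweight each channel by a gate derived from its global average.

    feature_maps: list of channels, each a 2-D list (H x W) of floats.
    Squeeze-and-excitation style sketch: pool each channel to a scalar
    ("squeeze"), map it to a gate in (0, 1) ("excite"), then rescale.
    """
    out = []
    for ch in feature_maps:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))  # squeeze
        gate = sigmoid(mean)                                          # excite
        out.append([[v * gate for v in row] for row in ch])           # rescale
    return out

# Two 2x2 channels: an all-zero channel (gate 0.5) and an all-2.0 channel.
enhanced = channel_enhance([[[0.0, 0.0], [0.0, 0.0]],
                            [[2.0, 2.0], [2.0, 2.0]]])
```

The gate scales every pixel of a channel uniformly, which is what lets the model strengthen or suppress whole feature channels rather than individual pixels.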
In another aspect, an image processing apparatus is provided, comprising: a first mapping module configured to map a first image into three intermediate images, wherein the first image comprises a human body and the three intermediate images have different scales; an image fusion module configured to fuse the three intermediate images to obtain a second image; and a second mapping module configured to map the second image into a target image in which different parts of the human body are labeled.

In some embodiments, the first mapping module is configured to convolve the first image to obtain a first intermediate image, convolve the first intermediate image to obtain a second intermediate image, and perform channel feature enhancement and semantic feature enhancement on the second intermediate image to obtain a third intermediate image, where the channel feature enhancement strengthens the importance of different channel features and the semantic feature enhancement strengthens global semantic information.

In some embodiments, the image fusion module is configured to fuse the result of convolving the third intermediate image with the second intermediate image to obtain a first fused image, fuse the result of convolving the first fused image with the first intermediate image to obtain a second fused image, and use the second fused image as the second image.

In some embodiments, the second mapping module is configured to increase the resolution of the second image to obtain a third image, where the resolution of the third image is not higher than the resolution of the first image, and to convolve the third image to obtain the target image.

In some embodiments, the steps performed by the image processing apparatus are implemented based on a human body analysis model, which analyzes the input image for a human body and outputs images labeled with different parts of the human body.
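
The coarse-to-fine fusion performed by the image fusion module resembles the top-down pathway of a feature pyramid. A minimal sketch under that assumption, using nearest-neighbour 2x upsampling and element-wise addition (the patent's fusion may additionally involve convolution and other operators):

```python
def upsample2x(m):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    return [[v for v in row for _ in range(2)] for row in m for _ in range(2)]

def fuse(deep, shallow):
    """Fuse a deeper (coarser) feature map into a shallower (finer) one
    by upsampling the deep map and adding it element-wise."""
    up = upsample2x(deep)
    return [[a + b for a, b in zip(ru, rs)] for ru, rs in zip(up, shallow)]

# A 1x1 deep map fused into a 2x2 shallow map:
fused = fuse([[3.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Applying this step twice, third intermediate into second and then the result into the first, mirrors the two-stage fusion that yields the second image.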
In some embodiments, the apparatus further comprises: a preprocessing module configured to preprocess a first annotation image of a sample image to obtain an encoded image of the sample image, wherein the first annotation image indicates different parts of a sample human body in the sample image and the encoded image indicates a prediction result for a previous frame of the sample image; a concatenation module configured to concatenate the sample image and the encoded image to obtain an input image; and a training module configured to train the human body analysis model of the i-th iteration based on the input image with the first annotation image as supervision information, wherein i is a positive integer.

In some embodiments, the preprocessing module is configured to perform an image transformation on the first annotation image of the sample image to obtain a second annotation image, and to encode the second annotation image to obtain the encoded image.

In some embodiments, the preprocessing module is configured to perform at least one of a rigid transformation and a non-rigid transformation on the first annotation image to obtain the second annotation image.
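
The preprocessing described above, mapping each pixel to a target vector space by class and then concatenating with the sample image in the channel dimension, can be sketched as one-hot encoding followed by per-pixel concatenation. The one-hot choice is an assumption: the claims only require some class-dependent vector mapping.

```python
def one_hot_encode(label_image, num_classes):
    """Map each pixel's class id to a vector (claim 7 sketch, one-hot assumed).

    label_image: 2-D list of integer class ids.
    Returns an H x W grid of length-num_classes vectors.
    """
    return [[[1.0 if c == px else 0.0 for c in range(num_classes)]
             for px in row] for row in label_image]

def concat_channels(image, encoded):
    """Join two per-pixel feature vectors in the channel dimension
    (claim 8 sketch)."""
    return [[list(pa) + list(pb) for pa, pb in zip(ra, rb)]
            for ra, rb in zip(image, encoded)]

# A 1x2 RGB sample image concatenated with its 3-class encoded label image:
sample = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
encoded = one_hot_encode([[0, 2]], num_classes=3)
stacked = concat_channels(sample, encoded)  # 3 + 3 = 6 channels per pixel
```

Feeding the concatenated tensor to the model is what lets training exploit the previous frame's prediction as an extra input signal.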