CN-115563334-B - Image-text data processing method and processor

CN115563334B

Abstract

The invention discloses a method and a processor for processing image-text data. The method comprises: obtaining an original image and an original text to be processed, wherein the original image and the original text are used for describing at least one same object; extracting image context features from the original image and text context features from the original text; splicing the image context features and the text context features to obtain an original feature vector; and feature encoding the original feature vector to obtain a target text corresponding to the original image and/or a target image corresponding to the original text, wherein the target text and/or the target image comprise the same object. The invention solves the technical problem of low processing efficiency of image-text data.
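The abstract's pipeline (extract per-region image context features and per-token text context features, then splice them into one original feature vector) can be sketched as follows. This is a minimal illustration, not the patent's actual models: the extractor functions, feature dimension, region count, and sample caption are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_image_context_features(image, num_regions=4, dim=8):
    # Placeholder for a vision encoder: one feature vector per image region.
    return rng.normal(size=(num_regions, dim))

def extract_text_context_features(text, dim=8):
    # Placeholder for a text encoder: one feature vector per token.
    return rng.normal(size=(len(text.split()), dim))

image_feats = extract_image_context_features(image=None)
text_feats = extract_text_context_features("a cat on a mat")

# "Splicing": concatenate the two feature sequences into a single
# original feature vector of shape (num_regions + num_tokens, dim).
original_feature_vector = np.concatenate([image_feats, text_feats], axis=0)
print(original_feature_vector.shape)  # (9, 8)
```

The spliced vector keeps every regional and positional feature, which is the point of contrast with the compressed global feature vector criticized in the Background section.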

Inventors

  • ZHAO LIMING
  • XIE CHENWEI
  • ZHENG BIN
  • ZHAO DELI

Assignees

  • Alibaba (China) Co., Ltd. (阿里巴巴(中国)有限公司)

Dates

Publication Date
2026-05-12
Application Date
2022-09-20

Claims (10)

  1. A method of processing image-text data, comprising: acquiring an original image and an original text to be processed, wherein the original image and the original text are used for describing at least one same object; extracting image context features from the original image and extracting text context features from the original text; splicing the image context features and the text context features to obtain an original feature vector; and performing feature encoding on the original feature vector to obtain a target text corresponding to the original image and/or a target image corresponding to the original text, wherein the target text and/or the target image comprise the same object; wherein performing feature encoding on the original feature vector to obtain the target text corresponding to the original image, and/or performing feature encoding on the original feature vector to obtain the target image corresponding to the original text, comprises: mapping the original feature vector into a target feature vector, wherein the target feature vector comprises an image feature vector and a text feature vector, the similarity between an object represented by a feature in the image feature vector and an object represented by a text context feature is larger than a first similarity threshold, and the similarity between an object represented by a feature in the text feature vector and an object represented by an image context feature is larger than a second similarity threshold; and generating the target image corresponding to the original text from the image feature vector, and/or generating the target text corresponding to the original image from the text feature vector.
  2. The method of claim 1, wherein the original feature vector comprises a plurality of features, and mapping the original feature vector into the target feature vector comprises: comparing the similarity between each feature of the plurality of features and the features in the original feature vector other than that feature, to obtain the feature in the image feature vector or the feature in the text feature vector corresponding to each feature; and generating the target feature vector based on the feature in the image feature vector or the feature in the text feature vector corresponding to each feature.
  3. The method of claim 1, wherein generating the target image corresponding to the original text from the image feature vector of the target feature vector comprises: generating the target image corresponding to the original text from the image feature vector based on an image generation model, wherein parameters of the image generation model are adjusted by a loss function between the image feature vector and an original image feature vector of the original image, and the image generation model is a machine learning model; and/or generating the target text corresponding to the original image from the text feature vector of the target feature vector based on a text generation model, wherein parameters of the text generation model are adjusted by a loss function between the text feature vector and an original text feature vector of the original text, and the text generation model is a machine learning model.
  4. The method according to claim 1, wherein the method further comprises: extracting global image features from the original image based on an image feature extraction model, wherein parameters of the image feature extraction model are adjusted by a loss function between the image feature vector and the original image feature vector of the original image, and retrieving the output text with the highest similarity to the original image from a text database based on the global image features; and/or extracting global text features from the original text based on a text feature extraction model, wherein parameters of the text feature extraction model are adjusted by a loss function between the text feature vector and the original text feature vector of the original text, and retrieving the output image with the highest similarity to the original text from an image database based on the global text features.
  5. A method of processing image-text data, comprising: acquiring an original image sample and an original text sample to be processed, wherein the original image sample and the original text sample are used for describing at least one same object; extracting an image context feature sample from the original image sample, and extracting a text context feature sample from the original text sample; splicing the image context feature sample and the text context feature sample to obtain an original feature vector sample; and outputting the original feature vector sample, wherein the text for describing the original image sample and/or the image for describing the original text sample, obtained by feature encoding the original feature vector sample, are used as training samples to train an image-text processing model; wherein the image-text processing model is used for converting an input image into a target text describing the input image, the target text comprising the same object, and/or for converting an input text into a target image describing the input text, the target image comprising the same object; and wherein feature encoding the original feature vector sample to obtain a text for describing the original image sample, and/or feature encoding the original feature vector sample to obtain an image for describing the original text sample, comprises: mapping the original feature vector sample into a target feature vector sample, wherein the target feature vector sample comprises an image feature vector sample and a text feature vector sample, the similarity between an object represented by a feature in the image feature vector sample and an object represented by the text context feature sample is larger than a first similarity threshold, and the similarity between an object represented by a feature in the text feature vector sample and an object represented by the image context feature sample is larger than a second similarity threshold; and generating a text for describing the original image sample from the text feature vector sample, and/or generating an image for describing the original text sample from the image feature vector sample.
  6. The method of claim 5, wherein the method further comprises: acquiring an image to be retrieved; converting the image to be retrieved into a corresponding target text based on the image-text processing model, and extracting global image features from the image to be retrieved based on the target text corresponding to the image to be retrieved; and retrieving the output text with the highest similarity to the image to be retrieved from a text database based on the global image features.
  7. The method of claim 6, wherein the method further comprises: acquiring a text to be retrieved; converting the text to be retrieved into a corresponding target image based on the image-text processing model, and extracting global text features from the text to be retrieved based on the target image corresponding to the text to be retrieved; and retrieving the output image with the highest similarity to the text to be retrieved from an image database based on the global text features.
  8. A method of processing image-text data, comprising: displaying an original image and an original text to be processed on a presentation picture of a virtual reality (VR) device or an augmented reality (AR) device, wherein the original image and the original text are used for describing at least one same object; extracting, by the VR device or AR device, image context features from the original image and text context features from the original text; and after splicing the image context features and the text context features to obtain an original feature vector, driving the VR device or the AR device to render and display a target text corresponding to the original image, obtained by feature encoding the original feature vector, wherein the target text comprises the same object, and/or driving the VR device or the AR device to render and display a target image corresponding to the original text, obtained by feature encoding the original feature vector, wherein the target image comprises the same object; wherein the method further comprises: mapping the original feature vector into a target feature vector, wherein the target feature vector comprises an image feature vector and a text feature vector, the similarity between an object represented by a feature in the image feature vector and an object represented by the text context features is larger than a first similarity threshold, and the similarity between an object represented by a feature in the text feature vector and an object represented by the image context features is larger than a second similarity threshold; and generating the target text corresponding to the original image from the text feature vector, and/or generating the target image corresponding to the original text from the image feature vector.
  9. A method of processing image-text data, comprising: acquiring an original image and an original text to be processed by calling a first interface, wherein the original image and the original text are used for describing at least one same object; extracting image context features from the original image and extracting text context features from the original text, wherein the image context features comprise image features in different image areas of the original image, and the text context features comprise text features at different text positions in the original text; splicing the image context features and the text context features to obtain an original feature vector, wherein the features in the original feature vector are the image context features or the text context features; performing feature encoding on the original feature vector to obtain a target text corresponding to the original image, wherein the target text comprises the same object, and/or performing feature encoding on the original feature vector to obtain a target image corresponding to the original text, wherein the target image comprises the same object; and outputting the target text and/or the target image by calling a second interface; wherein performing feature encoding on the original feature vector to obtain the target text corresponding to the original image, and/or performing feature encoding on the original feature vector to obtain the target image corresponding to the original text, comprises: mapping the original feature vector into a target feature vector, wherein the target feature vector comprises an image feature vector and a text feature vector, the similarity between an object represented by a feature in the image feature vector and an object represented by the text context features is larger than a first similarity threshold, and the similarity between an object represented by a feature in the text feature vector and an object represented by the image context features is larger than a second similarity threshold; and generating the target text corresponding to the original image from the text feature vector, and/or generating the target image corresponding to the original text from the image feature vector.
  10. A processor for running a program, wherein the program, when run, performs the method of any one of claims 1 to 9.
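The similarity-based mapping in claims 1 and 2 (compare each feature of the spliced vector against the features of the other modality; keep those exceeding a threshold) can be sketched as below. The claims do not specify the similarity measure or thresholds, so cosine similarity and a single threshold of 0.5 are assumptions for illustration, as is the toy data.

```python
import numpy as np

def cosine_sim(a, b):
    # Stand-in for the claims' unspecified similarity measure.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def map_to_target_feature_vector(original, num_image_feats, threshold=0.5):
    """Split the spliced original feature vector into its image and text
    parts, then keep each feature whose best cross-modal similarity
    exceeds the threshold (the first and second thresholds of the claims
    are collapsed into one here)."""
    image_part = original[:num_image_feats]
    text_part = original[num_image_feats:]
    image_fv = np.array([f for f in image_part
                         if max(cosine_sim(f, t) for t in text_part) > threshold])
    text_fv = np.array([f for f in text_part
                        if max(cosine_sim(f, i) for i in image_part) > threshold])
    return image_fv, text_fv

# Toy example: the first image/text pair is nearly parallel (high
# similarity); the remaining features have no close cross-modal match.
original = np.array([[1.0, 0.0],    # image feature 0
                     [0.0, 1.0],    # image feature 1
                     [0.9, 0.1],    # text feature 0 (close to image 0)
                     [-1.0, 0.0]])  # text feature 1 (dissimilar)
img_fv, txt_fv = map_to_target_feature_vector(original, num_image_feats=2)
print(len(img_fv), len(txt_fv))  # 1 1
```

Only the mutually consistent cross-modal features survive the mapping, which matches the claims' requirement that each retained feature represent the same object as the other modality's context features.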
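Claims 4, 6 and 7 each end with retrieving the database entry having the highest similarity to a global feature. A minimal nearest-neighbour sketch, assuming global features are fixed-length vectors and similarity is cosine similarity (neither is specified by the claims):

```python
import numpy as np

def retrieve_most_similar(global_feature, database):
    """Return the index of the database row with the highest cosine
    similarity to the query's global feature vector."""
    q = global_feature / np.linalg.norm(global_feature)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return int(np.argmax(db @ q))

# Hypothetical text database of three global feature vectors.
text_db = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])
query = np.array([0.6, 0.8])  # hypothetical global image feature
print(retrieve_most_similar(query, text_db))  # 2
```

A production system would replace the brute-force `argmax` with an approximate nearest-neighbour index, but the ranking criterion is the same.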

Description

Image-text data processing method and processor

Technical Field

The invention relates to the field of image processing, in particular to a method and a processor for processing image-text data.

Background

At present, information on the internet consists mainly of video, images and text, while most user requests are text-based, so establishing connections between different modalities is of great significance. When establishing a connection between different modalities, a feature model is usually learned so that it can produce a unified feature representation of images and texts, but such a feature model needs to compress the images and texts into global feature vectors. Because much image data and text data is easily lost in the process of compressing an image and a text into a global feature vector, the technical problem of low processing efficiency of image-text data arises. In view of the above problems, no effective solution has been proposed at present.

Disclosure of the Invention

The embodiment of the invention provides a processing method and a processor for image-text data, which are used for at least solving the technical problem of low processing efficiency of image-text data.
According to one aspect of the embodiment of the invention, a method of processing image-text data is provided, which comprises: obtaining an original image and an original text to be processed, wherein the original image and the original text are used for describing at least one same object; extracting image context features from the original image and text context features from the original text; splicing the image context features and the text context features to obtain an original feature vector; and performing feature encoding on the original feature vector to obtain a target text corresponding to the original image and/or a target image corresponding to the original text, wherein the target text and/or the target image comprise the same object.

According to another aspect of the embodiment of the invention, a method of processing image-text data is provided, which comprises: obtaining an original image sample and an original text sample to be processed, wherein the original image sample and the original text sample are used for describing at least one same object; extracting an image context feature sample from the original image sample and a text context feature sample from the original text sample; stitching the image context feature sample and the text context feature sample to obtain an original feature vector sample; and outputting the original feature vector sample, wherein the text for describing the original image sample and/or the image for describing the original text sample, obtained by feature encoding the original feature vector sample, are used as training samples to train an image-text processing model, wherein the image-text processing model is used for converting an input image into a target text describing the input image, the target text comprising the same object, and/or for converting an input text into a target image describing the input text, the target image comprising the same object.

According to another aspect of the embodiment of the invention, another image-text data processing method is provided, which comprises: displaying an original image and an original text to be processed on a display picture of a virtual reality (VR) device or an augmented reality (AR) device, wherein the original image and the original text are used for describing at least one same object; extracting, by the VR device or the AR device, image context features from the original image and text context features from the original text; and after splicing the image context features and the text context features to obtain an original feature vector, driving the VR device or the AR device to render and display a target text corresponding to the original image, obtained by feature encoding the original feature vector, and/or driving the VR device or the AR device to render and display a target image corresponding to the original text, obtained by feature encoding the original feature vector, the target image comprising the same object.

According to another aspect of the embodiment of the invention, another image-text data processing method is provided, which comprises: obtaining an original image and an original text to be processed by calling a first interface, wherein the original image and the original text are used for describing at least one same object; extracting image context features from the original image and text context features from the original text, wherein the image context features comprise image features in different image areas in the original