US-12626524-B2 - Method and apparatus for generating captioning device, and method and apparatus for outputting caption

US 12626524 B2

Abstract

A method and apparatus for generating a captioning device, and a method and apparatus for outputting a caption. The method for generating a captioning device comprises: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator, so as to output an object set; grouping the object set into a first object set and a second object set, wherein the first object set is an object set that is included within a preset object set, and the second object set is an object set that is excluded from the preset object set; inputting, into a sentence decoder of the sentence generator, the object set output by the image encoder, and performing a beam search in a decoding step by taking the first object set and the second object set as constraint conditions, so as to generate a pseudo-image sentence pair set; and training the sentence generator by taking the pseudo-image sentence pair set as a sample set, so as to obtain a captioning device.
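As a hedged illustration of the abstract's constrained decoding step, the sketch below shows one way a beam search could treat the two object sets: tokens from the second (excluded) set are pruned outright as a hard constraint, while coverage of the first (included) set is rewarded when ranking beams. The `step_logprobs` callable, the coverage weight of 2.0, and the `<eos>` token are hypothetical stand-ins, not details taken from the patent.

```python
from typing import Callable

def constrained_beam_search(
    step_logprobs: Callable[[list[str]], dict[str, float]],
    include_set: set[str],
    exclude_set: set[str],
    beam_width: int = 3,
    max_len: int = 20,
) -> list[str]:
    """Decode with the two object sets as constraints: tokens in exclude_set
    are never emitted, and beams covering more of include_set rank higher."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # already finished
                continue
            for tok, lp in step_logprobs(tokens).items():
                if tok in exclude_set:  # hard constraint: prune outright
                    continue
                candidates.append((tokens + [tok], score + lp))
        # Soft constraint: bonus for each include_set object mentioned so far.
        def rank(beam):
            tokens, score = beam
            return score + 2.0 * len(include_set & set(tokens))
        beams = sorted(candidates, key=rank, reverse=True)[:beam_width]
    # Prefer hypotheses that mention every include_set object, if any exist.
    full = [tokens for tokens, _ in beams if include_set <= set(tokens)]
    return full[0] if full else beams[0][0]
```

Splitting the detected objects this way lets the generator mention recognized in-vocabulary objects while never emitting objects outside the preset set, which is what makes the resulting pseudo image-sentence pairs usable as training samples.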

Inventors

  • Yingwei PAN
  • Yehao LI
  • Ting Yao
  • Tao Mei

Assignees

  • JINGDONG TECHNOLOGY HOLDING CO., LTD.

Dates

Publication Date
2026-05-12
Application Date
2022-01-06
Priority Date
2021-03-30

Claims (16)

  1. A method for generating a captioning device, comprising the steps of: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator to output an object set; dividing the object set into a first object set and a second object set, wherein the first object set is an object set included in a preset object set, and the second object set is an object set excluded from the preset object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing a beam search by using the first object set and the second object set as constraint conditions in a decoding step to generate a pseudo image-sentence pair set; and training the sentence generator by using the pseudo image-sentence pair set as a sample set to obtain the captioning device.
  2. The method according to claim 1, wherein the method further comprises the steps of: optimizing the captioning device by at least one of: optimizing the captioning device by performing an adversarial training on the captioning device through a sentence discriminator; optimizing the captioning device through an inclusion degree of an object identified by the captioning device in a sentence output by the captioning device; or optimizing the captioning device through a semantic correlation between an image triplet and a corresponding generated sentence, wherein the image triplet comprises a query image, a positive image, and a negative image.
  3. The method according to claim 2, wherein the optimizing the captioning device by performing the adversarial training on the captioning device through the sentence discriminator comprises: extracting a preset first sample set, wherein each first sample comprises an image and a corresponding true sentence; extracting a pre-established generative adversarial network, wherein the generative adversarial network comprises a captioning device and the sentence discriminator, wherein the captioning device is configured to perform an image-encoding on an input image and then perform a sentence-decoding to obtain a pseudo sentence, and the sentence discriminator is configured to determine whether the input sentence is the pseudo sentence output by the captioning device; and selecting a first sample from the first sample set based on a machine learning method, and performing first training steps of: inputting an image in the selected first sample into the captioning device to output a pseudo sentence; inputting the pseudo sentence and a true sentence in the selected first sample into the sentence discriminator to output a discrimination result; calculating an accuracy rate of the sentence discriminator according to the output discrimination result; and determining that a training of the captioning device is completed in response to the accuracy rate reaching a preset value.
  4. The method according to claim 3, wherein the method further comprises the step of: calculating, in response to the accuracy rate not reaching the preset value, an adversarial loss of the sentence discriminator, adjusting a relevant parameter of the sentence discriminator to reduce the adversarial loss, and re-selecting a first sample from the first sample set to continue performing the first training steps.
  5. The method according to claim 3, wherein the method further comprises the step of: calculating, in response to the accuracy rate not reaching the preset value, an adversarial reward of the sentence discriminator, adjusting a relevant parameter of the sentence discriminator to increase the adversarial reward, and re-selecting a first sample from the first sample set to continue performing the first training steps.
  6. The method according to claim 2, wherein the optimizing the captioning device through the inclusion degree of the object identified by the captioning device in the sentence output by the captioning device comprises: extracting a preset second sample set, wherein each second sample comprises an image; and selecting a sample from the second sample set based on a machine learning method, and performing second training steps of: inputting an image in the selected second sample into an image encoder of the captioning device to output a sample object set; inputting the sample object set into a sentence decoder of the captioning device to output a pseudo sentence; calculating a mean confidence score that the pseudo sentence contains sample objects of the sample object set, as an object inclusion reward of the pseudo sentence; and determining that a training of the captioning device is completed in response to the object inclusion reward reaching a preset inclusion reward threshold.
  7. The method according to claim 6, wherein the method further comprises the step of: adjusting, in response to the object inclusion reward not reaching the preset inclusion reward threshold, a relevant parameter of the captioning device to increase the object inclusion reward, and re-selecting a second sample from the second sample set to continue performing the second training steps.
  8. The method according to claim 2, wherein the optimizing the captioning device through the semantic correlation between the image triplet and the corresponding generated sentence comprises: extracting a preset third sample set, wherein each third sample comprises a query image, a positive image and a negative image, the positive image and the query image share at least two objects, and the negative image and the query image have no common object; and selecting a third sample from the third sample set based on a machine learning method, and performing third training steps of: inputting a query image, a positive image, and a negative image in the selected third sample into the captioning device to output a query sentence, a positive sentence, and a negative sentence, respectively; calculating a first semantic similarity of the query sentence and the positive sentence and calculating a second semantic similarity of the query sentence and the negative sentence; calculating a self-supervised triplet loss according to the first semantic similarity and the second semantic similarity; and determining that a training of the captioning device is completed in response to the self-supervised triplet loss being less than a preset loss threshold.
  9. The method according to claim 8, wherein the method further comprises the step of: adjusting, in response to the self-supervised triplet loss not being less than the preset loss threshold, a relevant parameter of the captioning device to reduce the self-supervised triplet loss, and re-selecting a third sample from the third sample set to continue performing the third training steps.
  10. The method according to claim 8, wherein the calculating the first semantic similarity of the query sentence and the positive sentence and calculating the second semantic similarity of the query sentence and the negative sentence comprises: calculating, for the query sentence, the positive sentence and the negative sentence, an object-based probability distribution of each word in the sentences, performing a maximum pooling operation, and obtaining a query sentence feature, a positive sentence feature and a negative sentence feature, respectively; and calculating a first semantic similarity of the query sentence feature and the positive sentence feature and calculating a second semantic similarity of the query sentence feature and the negative sentence feature.
  11. The method according to claim 2, wherein the method further comprises the step of: adjusting, in response to a weighted sum of the adversarial reward, the object inclusion reward, and the self-supervised triplet loss being greater than a preset target value, a relevant parameter of the captioning device to reduce the weighted sum.
  12. The method according to claim 1, wherein the image encoder comprises a two-layer LSTM with an area-level attention mechanism, wherein a first layer LSTM serves as a top-down attention module that calculates an object-level attention according to context information, and a second layer LSTM is a language model for generating a sentence.
  13. The method according to claim 1, wherein the method further comprises outputting a caption by: acquiring a to-be-processed image; and inputting the image into the obtained captioning device, and outputting the caption corresponding to the image.
  14. An apparatus for generating a captioning device, comprising: one or more processors; and a storage apparatus, storing one or more computer programs, wherein the one or more computer programs, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator to output an object set; dividing the object set into a first object set and a second object set, wherein the first object set is an object set included in a preset object set, and the second object set is an object set excluded from the preset object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing a beam search by using the first object set and the second object set as constraint conditions in a decoding step to generate a pseudo image-sentence pair set; and training the sentence generator by using the pseudo image-sentence pair set as a sample set to obtain the captioning device.
  15. The apparatus according to claim 14, wherein the operations further comprise: acquiring a to-be-processed image; and inputting the image into the obtained captioning device, and outputting the caption corresponding to the image.
  16. A non-transitory computer readable medium, storing a computer program, wherein the program, when executed by a processor, causes the processor to perform operations, the operations comprising: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator to output an object set; dividing the object set into a first object set and a second object set, wherein the first object set is an object set included in a preset object set, and the second object set is an object set excluded from the preset object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing a beam search by using the first object set and the second object set as constraint conditions in a decoding step to generate a pseudo image-sentence pair set; and training the sentence generator by using the pseudo image-sentence pair set as a sample set to obtain the captioning device.
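The following is a minimal PyTorch sketch of the self-supervised triplet objective described in claims 8 to 10: each sentence is reduced to a feature by max-pooling the object-based probability distributions of its words, and the query sentence is pulled toward the positive sentence and pushed away from the negative one. The margin value, the 100-object vocabulary, and cosine similarity as the semantic-similarity measure are illustrative assumptions; the claims do not fix these choices.

```python
import torch
import torch.nn.functional as F

def sentence_feature(word_object_probs: torch.Tensor) -> torch.Tensor:
    """word_object_probs: (num_words, num_objects) per-word probability
    distribution over the object vocabulary; max pooling over words yields
    a sentence-level feature (claim 10)."""
    return word_object_probs.max(dim=0).values  # shape: (num_objects,)

def triplet_loss(query: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    q = sentence_feature(query)
    p = sentence_feature(positive)
    n = sentence_feature(negative)
    sim_qp = F.cosine_similarity(q, p, dim=0)  # first semantic similarity
    sim_qn = F.cosine_similarity(q, n, dim=0)  # second semantic similarity
    # Zero once the positive similarity exceeds the negative one by the margin.
    return F.relu(margin - sim_qp + sim_qn)

# Illustrative call with random distributions over a 100-object vocabulary:
q = torch.rand(12, 100); p = torch.rand(9, 100); n = torch.rand(15, 100)
print(triplet_loss(q, p, n))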

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application is a U.S. National Stage of International Application No. PCT/CN2022/070476, filed on Jan. 6, 2022, which claims the priority of Chinese Patent Application No. 202110338045.X, filed on Mar. 30, 2021 and entitled “Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption,” the entire disclosures of which are hereby incorporated by reference.

TECHNICAL FIELD

Embodiments of the present disclosure relate to the field of computer technology, and specifically, to a method and apparatus for generating a captioning device, and a method and apparatus for outputting a caption.

BACKGROUND

Image captioning is an emerging and rapidly developing research topic, which is a technique for automatically describing images by using natural language sentences.

SUMMARY

Embodiments of the present disclosure propose a method and apparatus for generating a captioning device, and a method and apparatus for outputting a caption.

Embodiments of the present disclosure provide a method for generating a captioning device, and the method includes: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator to output an object set; dividing the object set into a first object set and a second object set, where the first object set is an object set included in a preset object set, and the second object set is an object set excluded from the preset object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing a beam search by using the first object set and the second object set as constraint conditions in a decoding step to generate a pseudo image-sentence pair set; and training the sentence generator by using the pseudo image-sentence pair set as a sample set to obtain a captioning device.

In some embodiments, the method further includes: optimizing the captioning device by at least one of: optimizing the captioning device by performing an adversarial training on the captioning device through a sentence discriminator; optimizing the captioning device through an inclusion degree of an object identified by the captioning device in a sentence output by the captioning device; or optimizing the captioning device through a semantic correlation between an image triplet and a corresponding generated sentence, where the image triplet includes a query image, a positive image, and a negative image.
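Claim 11 further contemplates reducing a weighted sum of the three signals just listed. As a rough sketch, under the assumption that the two rewards enter the sum with negative sign (so that reducing the sum increases the rewards and decreases the triplet loss, consistent with claims 5, 7 and 9), the combined objective might look like the following; the weights are illustrative, not values from the patent.

```python
import torch

def combined_objective(
    adversarial_reward: torch.Tensor,  # to be increased (claim 5)
    inclusion_reward: torch.Tensor,    # to be increased (claim 7)
    triplet_loss: torch.Tensor,        # to be reduced (claim 9)
    w_adv: float = 1.0,
    w_inc: float = 1.0,
    w_tri: float = 1.0,
) -> torch.Tensor:
    # Rewards are negated so that minimizing the weighted sum raises the
    # rewards while lowering the triplet loss.
    return -w_adv * adversarial_reward - w_inc * inclusion_reward + w_tri * triplet_loss

# Illustrative call with scalar tensors standing in for computed values:
print(combined_objective(torch.tensor(0.8), torch.tensor(0.6), torch.tensor(0.3)))
```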
In some embodiments, the optimizing the captioning device by performing an adversarial training on the captioning device through a sentence discriminator includes: extracting a preset first sample set, where each first sample includes an image and a corresponding true sentence; extracting a pre-established generative adversarial network, where the generative adversarial network includes a captioning device and the sentence discriminator, where the captioning device is configured to perform an image-encoding on an input image and then perform a sentence-decoding to obtain a pseudo sentence, and the sentence discriminator is configured to determine whether the input sentence is the pseudo sentence output by the captioning device; and selecting a first sample from the first sample set based on a machine learning method, and performing first training steps of: inputting an image in the selected first sample into the captioning device to output a pseudo sentence; inputting the pseudo sentence and a true sentence in the selected first sample into the sentence discriminator to output a discrimination result; calculating an accuracy rate of the sentence discriminator according to the output discrimination result; and determining that a training of the captioning device is completed in response to the accuracy rate reaching a preset value.

In some embodiments, the method further includes: calculating, in response to the accuracy rate not reaching the preset value, an adversarial loss of the sentence discriminator, adjusting a relevant parameter of the sentence discriminator to reduce the adversarial loss, and re-selecting a first sample from the first sample set to continue performing the first training steps.

In some embodiments, the method further includes: calculating, in response to the accuracy rate not reaching the preset value, an adversarial reward of the sentence discriminator, adjusting a relevant parameter of the sentence discriminator to increase the adversarial reward, and re-selecting a first sample from the first sample set to continue performing the first training steps.

In some embodiments, the optimizing the captioning device through an inclusion degree of an object identified by the captioning device in a sentence output by the captioning device includes: extracting a preset second sample set, where each second sample includes an image; and selecting a sample from the second sample set based on a machine learning method,
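As a hedged sketch of the adversarial training round described above, the toy modules below mirror only the control flow: the captioning device emits a pseudo sentence, the sentence discriminator scores it against the true sentence, an accuracy rate is computed from the discrimination results, and the discriminator's parameters are adjusted to reduce the adversarial loss until the accuracy rate reaches the preset value. `ToyDiscriminator`, the fixed 64-dimensional sentence features, and the 0.5 decision threshold are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ToyDiscriminator(nn.Module):
    """Maps a fixed-size sentence feature to the probability that the
    sentence is a true sentence rather than a captioner-generated one."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x))

def adversarial_round(captioner, discriminator, image, true_sent_feat, opt_d):
    """One discriminator update: generate a pseudo sentence feature, score it
    against the true sentence, and reduce the adversarial loss."""
    bce = nn.BCELoss()
    pseudo_feat = captioner(image)                 # pseudo sentence (as a feature)
    p_true = discriminator(true_sent_feat)
    p_pseudo = discriminator(pseudo_feat.detach())
    # Accuracy rate of the discrimination results at a 0.5 threshold.
    accuracy = ((p_true > 0.5).float().mean()
                + (p_pseudo <= 0.5).float().mean()) / 2
    # Adversarial loss: push true sentences toward 1, pseudo sentences toward 0.
    loss = bce(p_true, torch.ones_like(p_true)) + bce(p_pseudo, torch.zeros_like(p_pseudo))
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return accuracy.item(), loss.item()
```

The caller would repeat `adversarial_round` over re-selected first samples and stop once the returned accuracy reaches the preset value, matching the loop structure in the text above.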