JP-2026514279-A - Image processing method, image processing apparatus, electronic device, and computer program
Abstract
This application provides an image processing method, apparatus, electronic device, and computer-readable storage medium. The method comprises: acquiring a prompt to be processed; acquiring text features of the prompt and mapping the text features to a generation probability index and a description type of the prompt; in response to the generation probability index being greater than an index threshold, the description type indicating that the prompt does not contain a verb, and the prompt containing multiple clauses, acquiring a similar image corresponding to each of the multiple clauses; determining the degree of image difference among the similar images corresponding to the multiple clauses; and, in response to the degree of image difference being less than an image difference threshold, using the similar image corresponding to each of the multiple clauses as the illustration for that clause. [Selection Diagram] Figure 3A
Inventors
- 郭 卉 (Guo Hui)
Assignees
- 騰訊科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2024-01-22
- Priority Date
- 2023-04-04
Claims (20)
- An image processing method performed by an electronic device, comprising: acquiring a prompt to be processed; acquiring text features of the prompt, and mapping the text features to a generation probability index and a description type of the prompt, wherein the generation probability index represents a score of how suitable the prompt is for generating an illustration; in response to the generation probability index being greater than an index threshold, the description type indicating that the prompt does not contain a verb, and the prompt containing multiple clauses, acquiring a similar image corresponding to each of the multiple clauses, wherein the image-text similarity between each clause and its corresponding similar image is greater than an image-text similarity threshold; determining a degree of image difference among the similar images corresponding to the multiple clauses; and, in response to the degree of image difference being less than an image difference threshold, using the similar image corresponding to each of the multiple clauses as the illustration for that clause.
- The image processing method according to claim 1, further comprising: in response to the degree of image difference being greater than or equal to the image difference threshold, splitting the multiple clauses into multiple new prompts to be processed.
- The image processing method according to claim 1, wherein acquiring the text features of the prompt comprises: converting the prompt into a token sequence; and invoking a semantic understanding model to encode the token sequence, obtaining the text features of the prompt; and wherein mapping the text features to the generation probability index and the description type of the prompt comprises: invoking a convolutional network in a first text classifier to perform a convolution operation on the text features, obtaining a first convolutional feature; invoking a multi-class classification layer in the first text classifier to map the first convolutional feature to first probabilities of multiple candidate generation probability indexes, and using the candidate index with the highest first probability as the generation probability index of the prompt; invoking a convolutional network in a second text classifier to perform a convolution operation on the text features, obtaining a second convolutional feature; and invoking a multi-class classification layer in the second text classifier to map the second convolutional feature to second probabilities of multiple description types, and using the description type with the highest second probability as the description type of the prompt, wherein the description types include a type that contains a verb and a type that does not contain a verb.
- The image processing method according to claim 1, further comprising: deleting the prompt in response to the generation probability index being greater than the index threshold and the description type indicating that the prompt contains a verb.
- The image processing method according to claim 1 or 2, further comprising: in response to the generation probability index being less than or equal to the index threshold, storing the illustrations of the prompts in a text illustration sequence in the order in which they were generated, wherein different prompts are extracted from the text in sequence.
- The image processing method according to any one of claims 1 to 5, wherein determining the degree of image difference among the similar images corresponding to the multiple clauses comprises: for each similar image, determining the grayscale mean of the pixels in each row of the similar image, and combining the grayscale means of the rows into an image feature of the similar image; and determining the variance of the image features of the similar images corresponding to the multiple clauses, and using the variance as the degree of image difference among those similar images.
- The image processing method according to claim 6, further comprising: acquiring an image-text pair sample set, wherein the sample set includes multiple image-text pairs, each comprising a sample prompt and a sample similar image; determining, at each of multiple preset threshold points taken in ascending order, the recall rate of the image-text pair sample set at the current threshold point, wherein the recall rate is the ratio of the number of recalled image-text pairs to the total number of image-text pairs, and an image-text pair is recalled when the image-text similarity between its sample prompt and its sample similar image is greater than or equal to the current threshold point; in response to the recall rate at the current threshold point being greater than or equal to a recall threshold, determining the current threshold point as the image-text similarity threshold; and determining the variance of the image features of the sample similar images in the recalled image-text pairs, and using that variance as the image difference threshold.
- The image processing method according to any one of claims 1 to 7, further comprising: in response to the prompt being a suitable prompt, acquiring multiple generated images of the prompt, determining an illustration for the prompt from the multiple generated images, and saving the illustration of the prompt, wherein a suitable prompt is a prompt that does not satisfy an unsuitable-prompt condition, and the unsuitable-prompt condition includes at least one of the following: the generation probability index is greater than the index threshold, the description type indicates that the prompt does not contain a verb, the prompt contains multiple clauses, and the degree of image difference is greater than or equal to the image difference threshold; or the generation probability index is greater than the index threshold and the description type indicates that the prompt contains a verb.
- The image processing method according to claim 8, wherein, if the prompt is not the first prompt extracted from the text, determining the illustration for the prompt from the multiple generated images comprises: determining the image-text similarity between each of the multiple generated images and the prompt, and using the generated images whose image-text similarity is greater than the image-text similarity threshold as retained images; in response to a retained image containing a noun element of the prompt and at least one past element in a database containing the noun element, querying the database for the past element feature of the noun element; determining the element similarity between the retained image and the past element feature; querying the database for the image features of the past illustrations of past prompts, and determining the image similarity between the retained image and the past illustrations based on the image features of the past illustrations and the image feature of the retained image; obtaining a fusion total score for each retained image by weighted addition of the element similarity and the image similarity; and determining the retained image with the highest fusion total score as the illustration for the prompt.
- The image processing method according to claim 9, wherein determining the image-text similarity between each of the multiple generated images and the prompt comprises: acquiring the image feature of each of the multiple generated images; identifying the noun elements in the prompt, and encoding the noun elements to obtain prompt element features; and determining the cosine similarity between the image feature of each generated image and the prompt element features, and using the cosine similarity as the image-text similarity between that generated image and the prompt.
- The image processing method according to claim 9, wherein, if the prompt is the first prompt extracted from the text, determining the illustration for the prompt from the multiple generated images comprises: determining the image-text similarity between each of the multiple generated images and the prompt, and using the generated image with the highest image-text similarity as the illustration for the prompt.
- The image processing method according to claim 9, wherein the past element feature is a feature of a past element of a past prompt, and querying the database for the past element feature of the noun element comprises: identifying the common element between the retained image and the noun element; and querying the database for the past element feature corresponding to the common element, wherein the database includes the past element features of previously processed prompts.
- The image processing method according to claim 9, wherein determining the element similarity between the retained image and the past element features comprises: determining the element similarity between the retained image and each past element feature of a different type, wherein the types of past elements include person, environment, and tool; and obtaining the element similarity between the retained image and the past element features by weighted addition of the element similarities between the retained image and the past element features of the different types.
- The image processing method according to claim 9, further comprising: if the prompt is the first prompt extracted from the text, updating the database by treating the noun elements in the prompt as past elements and saving the past elements and their corresponding past element features to the database, and by saving the illustration of the prompt as a past illustration to the database.
- The image processing method according to claim 9, further comprising: if the prompt is not the first prompt extracted from the text, updating the database by performing a weighted addition of the element features of the noun elements appearing in the retained image and the past element features of the identically named past elements in the database and replacing the pre-update past element features with the resulting updated past element features, and by performing a weighted addition of the image feature of the retained image and the image features of the past illustrations in the database and replacing the pre-update image features with the resulting updated image features.
- The image processing method according to any one of claims 1 to 15, wherein acquiring the multiple generated images of the prompt comprises: encoding the prompt to obtain the text features of the prompt and image features corresponding to the text features; performing noise addition on the image features to obtain noisy image features; fusing the text features and the noisy image features to obtain fused features; performing noise reduction on the fused features to obtain reconstructed image features; and decoding the reconstructed image features to obtain the multiple generated images.
- An image processing apparatus comprising an acquisition module, a mapping module, and a determination module, wherein: the acquisition module is configured to acquire a prompt to be processed; the mapping module is configured to acquire text features of the prompt and map the text features to a generation probability index and a description type of the prompt, the generation probability index representing a score of how suitable the prompt is for generating an illustration; the acquisition module is further configured to acquire, in response to the generation probability index being greater than an index threshold, the description type indicating that the prompt does not contain a verb, and the prompt containing multiple clauses, a similar image corresponding to each of the multiple clauses, wherein the image-text similarity between each clause and its corresponding similar image is greater than an image-text similarity threshold; the determination module is configured to determine a degree of image difference among the similar images corresponding to the multiple clauses; and the determination module is further configured to use, in response to the degree of image difference being less than an image difference threshold, the similar image corresponding to each of the multiple clauses as the illustration for that clause.
- An electronic device comprising: a memory for storing computer-executable instructions or a computer program; and a processor that performs the image processing method according to any one of claims 1 to 16 when executing the computer-executable instructions or computer program stored in the memory.
- A computer-readable storage medium storing computer-executable instructions or a computer program that, when executed by a processor, cause the processor to perform the image processing method according to any one of claims 1 to 16.
- A computer program product comprising computer-executable instructions or a computer program that, when executed by a processor, causes the processor to perform the image processing method according to any one of claims 1 to 16.
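The two-classifier mapping described in claim 3 (a convolution over the text features, then a multi-class layer, taking the highest-probability class) can be sketched as follows. The shapes, the max-pooling step, and the random weights are illustrative assumptions, not the patent's actual model:

```python
import numpy as np

def conv1d(features, kernel):
    """Valid 1-D convolution of token features (T, D) with kernel (K, D, C),
    max-pooled over positions to a single convolutional feature (C,)."""
    T, D = features.shape
    K, _, C = kernel.shape
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        out[t] = np.einsum("kd,kdc->c", features[t:t + K], kernel)
    return out.max(axis=0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(features, kernel, weights):
    """Convolutional feature -> class probabilities -> most probable class.
    Used once for the generation probability index (first classifier) and
    once for the description type (second classifier)."""
    conv_feat = conv1d(features, kernel)   # "first/second convolutional feature"
    probs = softmax(weights @ conv_feat)   # multi-class classification layer
    return int(np.argmax(probs)), probs
```

In this reading, claim 3 simply runs `classify` twice with independently trained kernels and weights, once per classifier.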
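The degree-of-image-difference computation of claim 6 (row-wise grayscale means as the image feature, variance across the similar images) might look like the sketch below. Reducing the per-row variances to a scalar by averaging is an assumption, since the claim does not fix the reduction:

```python
import numpy as np

def image_feature(gray):
    """Row-wise grayscale means of an HxW image, combined into a feature vector."""
    return gray.mean(axis=1)   # shape (H,)

def image_difference(images):
    """Variance of the row-mean features across the similar images.

    `images` is a list of equally sized HxW grayscale arrays; the mean of the
    per-row variances serves as the scalar degree of image difference."""
    feats = np.stack([image_feature(im) for im in images])   # (N, H)
    return float(feats.var(axis=0).mean())
```

Identical images yield a difference of zero, so per claim 1 they would pass the threshold and be used as illustrations.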
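Since recall is non-increasing as the threshold point rises, the threshold-point selection of claim 7 can be read as keeping the largest candidate point whose recall still meets the target. This reading, and the helper below, are interpretive sketches; the recalled similarities it returns are the inputs to the image-difference-threshold variance of the same claim:

```python
def pick_similarity_threshold(similarities, threshold_points, recall_target):
    """Walk candidate threshold points in ascending order; keep the last point
    whose recall (fraction of pairs with similarity >= point) still meets the
    target, and return that point with the similarities it recalls."""
    chosen, recalled = None, []
    for point in sorted(threshold_points):
        hits = [s for s in similarities if s >= point]
        recall = len(hits) / len(similarities)
        if recall >= recall_target:
            chosen, recalled = point, hits
        else:
            break   # recall only shrinks from here on
    return chosen, recalled
```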
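The cosine image-text similarity of claim 10 and the weighted fusion total score of claims 9 and 13 reduce to a few lines. The weights are hypothetical parameters, not values from the patent:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between an image feature and a prompt element feature."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fusion_score(element_sims, image_sim, element_weights, image_weight):
    """Fusion total score of a retained image: weighted sum of the per-type
    element similarities (e.g. person/environment/tool) plus the weighted
    image similarity to the past illustration."""
    element_sim = sum(w * s for w, s in zip(element_weights, element_sims))
    return element_sim + image_weight * image_sim
```

The retained image with the highest `fusion_score` would then be chosen as the illustration, per claim 9.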
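The database update of claims 14 and 15 (weighted addition of new and stored features, with the result replacing the stored value) can be modeled as below. The class, its field names, and the mixing weight `alpha` are assumptions for illustration only:

```python
import numpy as np

def update_feature(past, new, alpha=0.5):
    """Weighted addition of a stored past feature and a newly observed one;
    the result replaces the stored feature (`alpha` is an assumed mixing weight)."""
    return alpha * past + (1.0 - alpha) * new

class ElementDatabase:
    """Minimal store for past element features and past illustration features."""
    def __init__(self):
        self.elements = {}        # element name -> feature vector
        self.illustration = None  # feature of the past illustration

    def update(self, element_feats, illustration_feat, alpha=0.5):
        for name, feat in element_feats.items():
            if name in self.elements:   # identically named past element: blend
                self.elements[name] = update_feature(self.elements[name], feat, alpha)
            else:                       # first occurrence (claim 14): store as-is
                self.elements[name] = feat
        if self.illustration is None:
            self.illustration = illustration_feat
        else:                           # claim 15: blend illustration features
            self.illustration = update_feature(self.illustration, illustration_feat, alpha)
```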
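Claim 16's generation pipeline (encode, add noise, fuse with text features, denoise, decode) is schematically a diffusion-style loop. The arithmetic below is a toy stand-in for a real diffusion model; the fusion and denoising coefficients and the `decode` callback are purely illustrative:

```python
import numpy as np

def generate_images(text_feat, image_feat, decode, n=3, noise_scale=1.0, seed=0):
    """Toy sketch of claim 16: per sample, add noise to the image features,
    fuse with the text features, 'denoise' back toward the original image
    features, then decode the reconstructed features into an image."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n):
        noisy = image_feat + noise_scale * rng.normal(size=image_feat.shape)
        fused = 0.5 * (text_feat + noisy)           # fuse text and noisy image features
        reconstructed = 0.5 * (fused + image_feat)  # toy noise-reduction step
        out.append(decode(reconstructed))           # decode to a generated image
    return out
```

Each iteration uses fresh noise, which is how one prompt yields multiple distinct generated images.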
Description
(Cross-reference to related applications) This application claims priority to Chinese patent application No. 202310399237.0, filed with the China National Intellectual Property Administration on April 4, 2023, the entirety of which is incorporated into this application by reference.
This application relates to image processing technology, and more particularly to an image processing method, an image processing apparatus, an electronic device, and a computer-readable storage medium.
In related technologies, in text-based image generation tasks, the user provides a text describing the desired content, such as a story or a martial arts novel passage, as a prompt. An image generation model is invoked on the prompt to generate multiple corresponding story images, which are then used as illustrations for the prompt. However, directly generating images from prompts in this way often yields poor illustrations. On the one hand, the prompt entered by the user may be abstract: the described content may involve diverse and complex actions that are difficult to represent faithfully in an image, resulting in a low degree of matching between the generated image and the prompt. On the other hand, a prompt may contain many clauses whose described elements and content differ, so even a single prompt with multiple clauses can produce multiple images with completely different content. The differences between these generated images are large, and generated elements may be missing, making them unsuitable as illustrations for the prompt and harming the overall coherence of the generated results.
The embodiments of this application provide an image processing method, apparatus, electronic device, computer-readable storage medium, and computer program product that improve the overall image-text correlation by assessing, in a scene where images are generated from text, whether a prompt is suitable for generation. The technical solution of the embodiments is realized as follows. Embodiments of this application provide an image processing method performed by an electronic device, the method comprising: acquiring a prompt to be processed; acquiring text features of the prompt, and mapping the text features to a generation probability index and a description type of the prompt, wherein the generation probability index represents a score of how suitable the prompt is for generating an illustration; in response to the generation probability index being greater than an index threshold, the description type indicating that the prompt does not contain a verb, and the prompt containing multiple clauses, acquiring a similar image corresponding to each of the multiple clauses, wherein the image-text similarity between each clause and its corresponding similar image is greater than an image-text similarity threshold; determining a degree of image difference among the similar images corresponding to the multiple clauses; and, in response to the degree of image difference being less than the image difference threshold, using the similar image corresponding to each of the multiple clauses as the illustration for that clause. Embodiments of this application further provide an image processing apparatus comprising an acquisition module, a mapping module, and a determination module.
The acquisition module is configured to acquire a prompt to be processed. The mapping module is configured to acquire text features of the prompt and map the text features to a generation probability index and a description type of the prompt, the generation probability index representing a score of how suitable the prompt is for generating an illustration. The acquisition module is further configured to acquire, in response to the generation probability index being greater than the index threshold, the description type indicating that the prompt does not contain a verb, and the prompt containing multiple clauses, a similar image corresponding to each of the multiple clauses, wherein the image-text similarity between each clause and its corresponding similar image is greater than the image-text similarity threshold. The determination module is configured to determine a degree of image difference among the similar images corresponding to the multiple clauses. The determination module is further configured to use the similar images corresponding to the multiple clauses as illustrations for the corresponding clauses in response to the degree of image difference being less than the image difference threshold.
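Taken together, the branches described above (and in claims 1, 2, 4, and 8) form a routing decision over each prompt. The sketch below is one possible reading of that flow; the action names and callback parameters are purely illustrative:

```python
def route_prompt(index, index_threshold, contains_verb, clauses,
                 fetch_similar, image_difference, difference_threshold):
    """One reading of the prompt-routing flow: 'illustrate' when per-clause
    similar images are consistent enough, 'split' when they diverge,
    'delete' for verb-describing prompts above the index threshold, and
    'generate' for prompts suitable for direct image generation."""
    if index <= index_threshold:
        return ("generate", None)       # suitable prompt: generate images directly
    if contains_verb:
        return ("delete", None)         # high index + verb: drop the prompt
    if len(clauses) <= 1:
        return ("generate", None)       # single clause: still suitable
    images = [fetch_similar(c) for c in clauses]
    if image_difference(images) < difference_threshold:
        return ("illustrate", images)   # similar images become the illustrations
    return ("split", clauses)           # re-split clauses into new prompts
```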