CN-121996836-A - Image-text recommendation method and device and electronic equipment

Abstract

The invention provides an image-text recommendation method, an image-text recommendation device, and electronic equipment. The method comprises: obtaining a browsed image-text sequence of a user; inputting the browsed image-text sequence as sequence tokens into a pre-trained image-text recommendation model, and obtaining a preset number of future sequence tokens that the model outputs in a single pass, the model having been trained in a multi-token prediction mode; and determining, based on the preset number of future sequence tokens, a preset number of future browsing image-text sequences that match the browsed image-text sequence and are recommended to the user. Because the plurality of future browsing image-text sequences are generated in a single inference pass, the latency and computational cost of generating recommendations are greatly reduced, and both the recommendation accuracy and the processing efficiency for future recommended image-text content are improved.

Inventors

WANG LIANGDONG
LIU GUANG

Assignees

Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-03

Claims (10)

1. An image-text recommendation method, characterized by comprising the following steps: acquiring a browsed image-text sequence of a user; inputting the browsed image-text sequence as sequence tokens into a pre-trained image-text recommendation model, and obtaining a preset number of future sequence tokens output by the image-text recommendation model in a single pass, wherein the image-text recommendation model is trained in a multi-token prediction mode; and determining, based on the preset number of future sequence tokens, a preset number of future browsing image-text sequences that match the browsed image-text sequence and are recommended to the user.
2. The image-text recommendation method according to claim 1, wherein the image-text recommendation model is trained by: constructing a training data set, wherein the training data set comprises a plurality of training samples, each training sample comprising a group of browsed image-text sequence samples and a group of future sequence token label samples corresponding to the browsed image-text sequence samples; predicting, based on each training sample in the training data set and through the image-text recommendation model, the future sequence token prediction results corresponding to each training sample; constructing a multi-token cross-entropy loss function based on the future sequence token prediction results and the future sequence token label samples; and completing training of the image-text recommendation model when the value of the multi-token cross-entropy loss function meets a preset requirement, to obtain the trained image-text recommendation model.
3. The image-text recommendation method according to claim 2, wherein predicting, based on each training sample in the training data set and through the image-text recommendation model, the future sequence token prediction results corresponding to each training sample specifically comprises: obtaining a prediction instruction of a user, wherein the prediction instruction indicates the preset number of future sequence tokens to be predicted; determining the number of mask marks based on the prediction instruction; and, according to a mask mechanism and based on each training sample in the training data set, predicting through the image-text recommendation model the future sequence token prediction results that correspond to each training sample and match the number of mask marks; and wherein constructing the multi-token cross-entropy loss function based on the future sequence token prediction results and the future sequence token label samples specifically comprises: constructing the multi-token cross-entropy loss function based on the preset number of future sequence token prediction results and the future sequence token label samples.
4. The image-text recommendation method according to claim 1, wherein inputting the browsed image-text sequence as sequence tokens into the pre-trained image-text recommendation model and obtaining the preset number of future sequence tokens output by the image-text recommendation model in a single pass specifically comprises: when the preset number is greater than a preset threshold, inputting, based on a sliding window strategy, the browsed image-text sequence as sequence tokens into the pre-trained image-text recommendation model, and sequentially obtaining a plurality of groups of a first number of future sequence tokens output by the image-text recommendation model, wherein the first number is smaller than the preset number; and obtaining the preset number of future sequence tokens based on the plurality of groups of the first number of future sequence tokens.
5. The image-text recommendation method according to claim 1, wherein after determining, based on the preset number of future sequence tokens, the preset number of future browsing image-text sequences that match the browsed image-text sequence and are recommended to the user, the method further comprises: sorting the preset number of future browsing image-text sequences in descending order of prediction probability to obtain a future browsing image-text sequence recommendation list; and obtaining the future browsing image-text sequences finally recommended to the user based on the top second number of future browsing image-text sequences in the future browsing image-text sequence recommendation list.
6. The image-text recommendation method according to claim 5, wherein before sorting the preset number of future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list, the method further comprises: de-duplicating the preset number of future browsing image-text sequences to obtain a plurality of de-duplicated future browsing image-text sequences; and wherein sorting the preset number of future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list specifically comprises: sorting the plurality of de-duplicated future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list.
7. The image-text recommendation method according to claim 5, wherein before sorting the preset number of future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list, the method further comprises: performing diversity adjustment on the preset number of future browsing image-text sequences to obtain a plurality of diversity-adjusted future browsing image-text sequences; and wherein sorting the preset number of future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list specifically comprises: sorting the plurality of diversity-adjusted future browsing image-text sequences in descending order of prediction probability to obtain the future browsing image-text sequence recommendation list.
8. An image-text recommendation device, characterized in that the device comprises: an acquisition module, used for acquiring a browsed image-text sequence of a user; a processing module, used for inputting the browsed image-text sequence as sequence tokens into a pre-trained image-text recommendation model, and obtaining a preset number of future sequence tokens output by the image-text recommendation model in a single pass, wherein the image-text recommendation model is trained in a multi-token prediction mode; and a recommendation module, used for determining, based on the preset number of future sequence tokens, a preset number of future browsing image-text sequences that match the browsed image-text sequence and are recommended to the user.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the image-text recommendation method according to any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image-text recommendation method according to any one of claims 1 to 7.
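
The post-processing recited in claims 5 and 6 (de-duplicate, sort by prediction probability in descending order, keep the top entries) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `build_recommendation_list`, the `(item_id, probability)` pair representation, and the choice to keep the highest probability among duplicates are hypothetical, since the claims do not specify them. The diversity adjustment of claim 7 is not shown because the claims do not describe how it is performed.

```python
from typing import Dict, Hashable, List, Tuple

# Assumed representation: (item id, prediction probability).
Candidate = Tuple[Hashable, float]


def build_recommendation_list(candidates: List[Candidate], top_n: int) -> List[Candidate]:
    """De-duplicate candidates, sort by probability (descending), keep the top_n."""
    if top_n < 0:
        raise ValueError("top_n must be non-negative")

    # De-duplication: keep one entry per item id. Keeping the highest
    # probability among duplicates is an assumption of this sketch.
    best: Dict[Hashable, float] = {}
    for item_id, prob in candidates:
        if item_id not in best or prob > best[item_id]:
            best[item_id] = prob

    # Sort in descending order of prediction probability. sorted() is stable,
    # so candidates with equal probability keep their first-seen order.
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)

    # Keep the first top_n entries (the "second number" of claim 5).
    return ranked[:top_n]


if __name__ == "__main__":
    predicted = [("post_a", 0.20), ("post_b", 0.50), ("post_a", 0.40), ("post_c", 0.10)]
    print(build_recommendation_list(predicted, top_n=2))
```

De-duplicating before sorting, as claim 6 orders the steps, means the top entries are never consumed by repeated items.
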

Description

Image-text recommendation method and device and electronic equipment

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an image-text recommendation method, an image-text recommendation device, and electronic equipment.

Background

In the field of image-text recommendation, predicting and recommending future content based on a user's historical image-text browsing behavior has become a core research direction. In the related art, sequential prediction is currently implemented with an autoregressive generation paradigm based on next token prediction (NTP), in which the recommendation model is built by predicting interaction tokens one by one. However, because NTP is autoregressive, the model is optimized with teacher forcing in the training stage, whereas in the inference stage it generates the sequence step by step from its own preceding predictions. This mismatch creates a distribution difference between the training and inference stages, which in turn leads to error accumulation.

Disclosure of Invention

The invention provides an image-text recommendation method, an image-text recommendation device, and electronic equipment, which improve the recommendation accuracy and the processing efficiency for future recommended image-text content.
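
To make the background concrete, the step-by-step NTP decoding it describes can be sketched as a loop that calls the model once per future token and feeds each prediction back as input. This is only an illustration: `autoregressive_decode` and `toy_next_token` are hypothetical names, and the toy predictor stands in for a trained model that the source does not specify.

```python
from typing import Callable, List


def autoregressive_decode(
    next_token: Callable[[List[int]], int],
    history: List[int],
    k: int,
) -> List[int]:
    """NTP-style decoding: k sequential model calls, each fed the previous output."""
    context = list(history)
    generated: List[int] = []
    for _ in range(k):
        token = next_token(context)  # one model call per future token
        generated.append(token)
        context.append(token)  # the model's own prediction becomes its next input
    return generated


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for a trained model: predicts "last item id + 1".
    return context[-1] + 1


if __name__ == "__main__":
    print(autoregressive_decode(toy_next_token, [3, 4, 5], k=3))
```

Because every step consumes the previous prediction, an early mistake changes the input of all later steps, and generating k tokens costs k sequential model calls.
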
The invention provides an image-text recommendation method, comprising: obtaining a browsed image-text sequence of a user; inputting the browsed image-text sequence as sequence tokens into a pre-trained image-text recommendation model, and obtaining a preset number of future sequence tokens output by the image-text recommendation model in a single pass, wherein the image-text recommendation model is trained in a multi-token prediction mode; and determining, based on the preset number of future sequence tokens, a preset number of future browsing image-text sequences that match the browsed image-text sequence and are recommended to the user.
According to the image-text recommendation method provided by the invention, the image-text recommendation model is trained by: constructing a training data set comprising a plurality of training samples, each training sample comprising a group of browsed image-text sequence samples and a group of future sequence token label samples corresponding to the browsed image-text sequence samples; predicting, based on each training sample in the training data set and through the image-text recommendation model, the future sequence token prediction results corresponding to each training sample; constructing a multi-token cross-entropy loss function based on the future sequence token prediction results and the future sequence token label samples; and completing training of the image-text recommendation model when the value of the multi-token cross-entropy loss function meets a preset requirement, to obtain the trained image-text recommendation model.
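
The multi-token cross-entropy loss used in training can be illustrated with a small, dependency-free sketch. It assumes that the loss is the mean of the per-position cross-entropy terms over the K future positions predicted in one pass; the source does not state whether the terms are summed, averaged, or weighted, so this aggregation and the names `log_softmax` and `multi_token_cross_entropy` are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_token_cross_entropy(
    logits_per_position: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Mean cross-entropy over K future positions (mean is an assumption)."""
    if len(logits_per_position) != len(labels) or len(labels) == 0:
        raise ValueError("need exactly one label per predicted future position")
    total = 0.0
    for logits, label in zip(logits_per_position, labels):
        # Negative log-probability assigned to the labelled future token.
        total -= log_softmax(logits)[label]
    return total / len(labels)


if __name__ == "__main__":
    # K = 3 future positions, toy vocabulary of 4 item tokens.
    logits = [[2.0, 0.5, 0.1, 0.1], [0.2, 1.5, 0.3, 0.0], [0.0, 0.0, 3.0, 0.0]]
    labels = [0, 1, 2]
    print(multi_token_cross_entropy(logits, labels))
```

All K label tokens contribute to a single loss value computed from one forward pass, which is what distinguishes this objective from a next-token loss evaluated one position at a time.
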
According to the image-text recommendation method provided by the invention, predicting, based on each training sample in the training data set and through the image-text recommendation model, the future sequence token prediction results corresponding to each training sample comprises: obtaining a prediction instruction of a user; determining the number of mask marks based on the prediction instruction; and, according to a mask mechanism and based on each training sample in the training data set, predicting through the image-text recommendation model the future sequence token prediction results that correspond to each training sample and match the number of mask marks; and constructing the multi-token cross-entropy loss function comprises constructing it based on the future sequence token prediction results and the future sequence token label samples.
According to the image-text recommendation method provided by the invention, inputting the browsed image-text sequence as sequence tokens into the pre-trained image-text recommendation model to obtain the preset number of future sequence tokens output by the image-text recommendation model in a single pass specifically comprises: when the preset number is greater than a preset threshold, inputting, based on a sliding window strategy, the browsed image-text sequence as sequence tokens into the pre-trained image-text recommendation model, and sequentially obtaining a plurality of groups of a first number of future sequence tokens output by the image-text recommendation model, wherein the first number is smaller than the preset number; and obtaining the preset number of future sequence tokens based on the plurality of groups of the first number of future sequence tokens.
According to the image-text recommendation method provided by the invention, after the preset number of fu