US-12620409-B2 - System and method for fine-tuning an existing machine learning model using out-of-domain data
Abstract
Systems, methods, and computer-readable media are provided for accessing out-of-domain training data that includes items of non-textual digital media content. Each of the items is labeled with text and background characteristic(s) that indicate an origination category of candidate origination categories for the item. A pre-trained model is used to generate vector embeddings of the out-of-domain training data and a particular vector embedding of a particular item of in-domain data that is labeled with text but is not labeled with any background characteristic(s) that indicate any origination categories. The generated vector embeddings are used to train another machine learning model to predict the background characteristic(s) based on vector embeddings of non-textual digital media content. The other machine learning model is further used to determine out-of-domain vector embeddings corresponding to the vector embeddings of the out-of-domain training data and in-domain vector embedding(s) corresponding to the in-domain data. Distances are determined between out-of-domain and in-domain vector embedding(s). Based on the distances, a textual content generation model is tuned on item(s) of the out-of-domain data. The item(s) of out-of-domain data to use for tuning may be selected and/or ordered based on the distances. A resulting model may be stored and used to transform unlabeled item(s) of non-textual content to textual content.
Inventors
- Nikolaos Lagos
- Ioan Calapodescu
Assignees
- NAVER CORPORATION
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-07-22
Claims (20)
- 1 . A computer-implemented method comprising: accessing a set of training data comprising a plurality of items of non-textual digital media content, wherein each item of the plurality of items of non-textual digital media content is labeled with corresponding textual content and one or more background characteristics that indicate an origination category of a plurality of candidate origination categories of the item of non-textual digital media content; using one or more pre-trained machine learning models to generate vector embeddings of the set of training data; using the generated vector embeddings to train another machine learning model to predict the one or more background characteristics; using the one or more pre-trained machine learning models to generate a particular vector embedding that represents one or more particular items of non-textual digital media content other than the plurality of items of non-textual digital media content, wherein each particular item of the one or more particular items is labeled with corresponding textual content but not with any background characteristics that indicate any of the plurality of candidate origination categories of the particular item of non-textual digital media content; using the other machine learning model to determine at least a first set of vector embeddings corresponding to the vector embeddings of the set of training data and a second particular vector embedding corresponding to the particular vector embedding that represents the one or more particular items of non-textual digital media content; wherein the first set of vector embeddings comprises a first vector embedding corresponding to a vector embedding of one or more first items of the plurality of items and a second vector embedding corresponding to a vector embedding of one or more second items of the plurality of items; determining a first distance between the second particular vector embedding and the first vector embedding and a second 
distance between the second particular vector embedding and the second vector embedding; based at least in part on the first distance and the second distance, generating a first tuned textual content generation model at least in part by tuning a textual content generation model on the one or more first items including first corresponding textual content of the one or more first items; storing a particular tuned textual content generation model based at least in part on the first tuned textual content generation model; using the particular tuned textual content generation model to transform one or more unlabeled items of non-textual digital media content to one or more items of corresponding textual content.
- 2 . The computer-implemented method of claim 1 , wherein generating the first tuned textual content generation model is based at least in part on the first distance being greater than the second distance, the computer-implemented method further comprising: after generating the first tuned textual content generation model, generating a second tuned textual content generation model at least in part by tuning another particular tuned textual content generation model based at least in part on the first tuned textual content generation model, wherein generating the second tuned textual content generation model uses the one or more second items including second corresponding textual content of the one or more second items; and wherein the particular tuned textual content generation model is based at least in part on the first tuned textual content generation model by being based at least in part on the second tuned textual content generation model that is based at least in part on the first tuned textual content generation model.
- 3 . The computer-implemented method of claim 1 , wherein generating the first tuned textual content generation model is based at least in part on the first distance being lesser than the second distance, wherein the particular tuned textual content generation model is not based at least in part on the one or more second items.
- 4 . The computer-implemented method of claim 1 , wherein the one or more background characteristics comprise a content purpose, a manner of content delivery, and a source of content.
- 5 . The computer-implemented method of claim 1 , wherein the first distance and the second distance are distances determined based on a comparison of numerical vector coordinates between the second particular vector embedding and corresponding vector coordinates of another vector embedding.
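As an illustration of the coordinate-wise comparison described in claim 5, the following is a minimal sketch in Python with NumPy; the Euclidean and cosine distances shown are common choices for comparing numerical vector coordinates, not functions mandated by the claim:

```python
import numpy as np

def euclidean_distance(a, b):
    # Coordinate-wise comparison: differences of corresponding vector
    # coordinates, squared, summed, and square-rooted.
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def cosine_distance(a, b):
    # 1 minus the cosine similarity of the two coordinate vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A "second particular" (in-domain) embedding compared against two
# out-of-domain embeddings, as in the first/second distances of claim 1.
in_domain = [1.0, 0.0, 0.0]
first = [0.0, 1.0, 0.0]
second = [0.9, 0.1, 0.0]
```

Under either distance, `second` here is closer to `in_domain` than `first` is, which is the kind of comparison that drives the tuning decision in claims 2 and 3.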
- 6 . The computer-implemented method of claim 1 , wherein the one or more particular items of non-textual digital media content are in a target domain and are no more than 30 seconds long and no more than 50 in number.
- 7 . The computer-implemented method of claim 1 , wherein the plurality of items of non-textual digital media content are audio files.
- 8 . The computer-implemented method of claim 1 , wherein the one or more pre-trained machine learning models comprise a multi-layer artificial neural network, and wherein using the one or more pre-trained machine learning models to generate vector embeddings of the set of training data and using the one or more pre-trained machine learning models to generate the particular vector embedding comprise extracting vector embeddings from a hidden layer of the multi-layer artificial neural network.
- 9 . The computer-implemented method of claim 8 , wherein the multi-layer artificial neural network is a feed forward artificial neural network, and wherein the hidden layer is a last hidden layer of the feed forward artificial neural network.
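Claims 8 and 9 describe extracting the embedding from a hidden layer (in particular, the last hidden layer) of a feed-forward network rather than from its output layer. A minimal sketch with random, untrained weights follows; a real system would load a pre-trained model, and the layer sizes here are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny feed-forward network standing in for a pre-trained model.
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # input -> hidden
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)    # hidden -> output

def embed(x):
    """Return the last hidden layer's activation as the vector embedding,
    instead of the network's final output."""
    h = np.tanh(x @ W1 + b1)   # last (and here, only) hidden layer
    _logits = h @ W2 + b2      # output layer: computed but not used as embedding
    return h

x = rng.normal(size=(8,))
print(embed(x).shape)  # (16,)
```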
- 10 . The computer-implemented method of claim 1 , wherein using the one or more pre-trained machine learning models to generate vector embeddings of the set of training data comprises representing parts of an individual item of the one or more first items with first separate vector embeddings, and aggregating the first separate vector embeddings, and representing parts of an individual item of the one or more second items with second separate vector embeddings, and aggregating the second separate vector embeddings; and wherein using the one or more pre-trained machine learning models to generate the particular vector embedding comprises representing parts of an individual item of the one or more particular items with particular separate vector embeddings, and aggregating the particular separate vector embeddings.
- 11 . The computer-implemented method of claim 10 , wherein aggregating the first separate vector embeddings comprises determining a mean value from the first separate vector embeddings, wherein aggregating the second separate vector embeddings comprises determining a mean value from the second separate vector embeddings, and wherein aggregating the particular separate vector embeddings comprises determining a mean value from the particular separate vector embeddings.
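The mean-based aggregation of claim 11 amounts to element-wise averaging of the separate per-part embeddings into one item-level embedding; a sketch, assuming the parts have already been embedded:

```python
import numpy as np

def aggregate(part_embeddings):
    # Claim 11's aggregation: stack the separate per-part embeddings and
    # take the element-wise mean to obtain a single item-level embedding.
    return np.mean(np.stack(part_embeddings), axis=0)

item_embedding = aggregate([[0.0, 2.0], [2.0, 0.0]])
print(item_embedding)  # [1. 1.]
```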
- 12 . The computer-implemented method of claim 1 , wherein the first distance and the second distance are determined based at least in part on a vector similarity search library.
- 13 . A system comprising: one or more processors; one or more non-transitory computer-readable media storing instructions, which, when executed by the system, cause the system to perform a set of actions comprising: accessing a set of training data comprising a plurality of items of non-textual digital media content, wherein each item of the plurality of items of non-textual digital media content is labeled with corresponding textual content and one or more background characteristics that indicate an origination category of a plurality of candidate origination categories of the item of non-textual digital media content; using one or more pre-trained machine learning models to generate vector embeddings of the set of training data; using the generated vector embeddings to train another machine learning model to predict the one or more background characteristics based on vector embeddings of non-textual digital media content; using the one or more pre-trained machine learning models to generate a particular vector embedding that represents one or more particular items of non-textual digital media content other than the plurality of items of non-textual digital media content, wherein each particular item of the one or more particular items is labeled with corresponding textual content but not with any background characteristics that indicate any of the plurality of candidate origination categories of the particular item of non-textual digital media content; using the other machine learning model to determine at least a first set of vector embeddings corresponding to the vector embeddings of the set of training data and a second particular vector embedding corresponding to the particular vector embedding that represents the one or more particular items of non-textual digital media content; wherein the first set of vector embeddings comprises a first vector embedding corresponding to a vector embedding of one or more first items of the plurality of items and a second 
vector embedding corresponding to a vector embedding of one or more second items of the plurality of items; determining a first distance between the second particular vector embedding and the first vector embedding and a second distance between the second particular vector embedding and the second vector embedding; based at least in part on the first distance and the second distance, generating a first tuned textual content generation model at least in part by tuning a textual content generation model on the one or more first items including first corresponding textual content of the one or more first items; storing a particular tuned textual content generation model based at least in part on the first tuned textual content generation model; using the particular tuned textual content generation model to transform one or more unlabeled items of non-textual digital media content to one or more items of corresponding textual content.
- 14 . A computer-implemented method for tuning a pre-existing textual content generation model, comprising: receiving out-of-domain data and in-domain seed data that comprise items of non-textual digital media content; applying one or more pre-trained machine learning models to (i) data from the out-of-domain data to extract out-of-domain embeddings, and (ii) data from the in-domain seed data to extract in-domain embeddings; grouping, into a plurality of groups, at least some out-of-domain embeddings of the out-of-domain embeddings based at least in part on distances between the at least some out-of-domain embeddings and the in-domain embeddings; and tuning the pre-existing textual content generation model using out-of-domain data associated with each group of the plurality of groups starting with those groups having out-of-domain embeddings that are further from the in-domain embeddings before progressively finetuning the model on out-of-domain data associated with other groups of the plurality of groups having out-of-domain embeddings that are closer to the in-domain embeddings.
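The far-to-near curriculum of claim 14 can be sketched as follows. The centroid-based distance and the equal-size split into groups are illustrative assumptions; the claim does not fix a particular grouping method:

```python
import numpy as np

def curriculum_groups(ood_embeddings, in_domain_embeddings, num_groups=3):
    """Group out-of-domain embeddings by distance to the in-domain centroid,
    ordered farthest-first, matching the tuning order of claim 14."""
    centroid = np.mean(in_domain_embeddings, axis=0)
    dists = np.linalg.norm(ood_embeddings - centroid, axis=1)
    order = np.argsort(dists)[::-1]            # farthest first
    return np.array_split(order, num_groups)   # index groups, far -> near

# Tuning would then proceed group by group, e.g.:
# for group in curriculum_groups(ood_emb, seed_emb):
#     model = finetune(model, ood_data[group])   # finetune() is hypothetical
```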
- 15 . The computer-implemented method of claim 14 , further comprising sampling out-of-domain embeddings, based on distance from the in-domain embeddings, up to a stopping criterion to define a tuning dataset; wherein said sampling treats in-domain embeddings as a query and matches the in-domain embeddings to a most similar out-of-domain embedding using a distance function.
- 16 . The computer-implemented method of claim 15 , wherein the distance function is one of a cosine distance, a Euclidean distance, a Pearson correlation coefficient, a Manhattan distance, a Minkowski distance, a Hamming distance, a Chebyshev distance, a Jaccard distance, a Haversine distance, a Sørensen-Dice distance, or any combination or function thereof.
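The query-style sampling of claim 15, with a pluggable distance function as enumerated in claim 16, might look like the following brute-force sketch. The greedy one-match-per-query loop and the `budget` stopping criterion are illustrative assumptions; a production system would likely delegate the search to a vector similarity search library, as in claim 12:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def sample_nearest(in_domain, out_of_domain, budget, dist=cosine_distance):
    """Treat each in-domain embedding as a query and collect the most
    similar unselected out-of-domain embedding for it, stopping once
    `budget` embeddings (the stopping criterion) have been selected."""
    selected = set()
    for q in in_domain:
        ranked = sorted(range(len(out_of_domain)),
                        key=lambda i: dist(q, out_of_domain[i]))
        for i in ranked:
            if i not in selected:
                selected.add(i)
                break
        if len(selected) >= budget:
            break
    return sorted(selected)
```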
- 17 . The computer-implemented method of claim 14 , wherein the extracted out-of-domain embeddings overlap with the in-domain embeddings on one or more characteristics.
- 18 . The computer-implemented method of claim 14 , wherein the out-of-domain data and the in-domain seed data comprises one or more of audio files, video files, image files, images of handwriting, or audiovisual files.
- 19 . The computer-implemented method of claim 14 , wherein the in-domain seed data is audio data representing one minute of audio recordings plus or minus up to 30 seconds, and wherein the out-of-domain data is audio data representing greater than six thousand hours of audio data plus or minus up to 3000 hours.
- 20 . The computer-implemented method of claim 14 , further comprising using the finetuned textual content generation model to transform one or more unlabeled items of non-textual digital media content to one or more items of corresponding textual content.
Description
FIELD

The present disclosure relates to machine learning and, more particularly, to systems and methods for using potentially out-of-domain data, selected based on limited in-domain data, to fine-tune an existing machine learning model to perform a task.

BACKGROUND

Machine learning models are often trained or tuned on data in the same subject matter or content domain as the production data (“the target domain”) to promote the best predictions or decision-making by the machine learning model in the target domain, even if the specific combinations of values provided to the model have never been seen before. In some scenarios, however, training data is not available in the target domain because the model is used to face new problems; old problems involving different actors, circumstances, or topics; or problems for which production-quality data does not exist. In many such scenarios, machine learning models are instead trained and/or tuned on large sets of training data that are not domain-specific. These general-purpose models may perform well enough in some scenarios, but they can only go so far in certain domains. Without sufficient training data, if the general-purpose model does not provide accurate-enough predictions, an organization may undergo considerable expense to generate new training data for the target domain. Even if the organization is willing to spend considerable time and resources to generate new training data, that data may suffer from incomplete coverage of the target domain (for example, by failing to address edge cases that appear more frequently than expected), from undetected quality issues that prevent it from being used effectively by models, and/or from other unintended biases introduced by the organization.
Without high-quality training data in a target domain, a poorly performing model might result in poor outcomes for the organization, with little practical opportunity for improving those outcomes.

BRIEF SUMMARY

In some embodiments, a computer-implemented method includes accessing out-of-domain training data that includes items of non-textual digital media content. Each of the items is labeled with text and background characteristic(s) that indicate an origination category of candidate origination categories for the item. A pre-trained model is used to generate vector embeddings of the out-of-domain training data and a particular vector embedding of a particular item of in-domain data that is labeled with text but is not labeled with any background characteristic(s) that indicate any origination categories. The generated vector embeddings are used to train another machine learning model to predict the background characteristic(s) based on vector embeddings of non-textual digital media content. The other machine learning model is further used to determine out-of-domain vector embeddings corresponding to the vector embeddings of the out-of-domain training data and in-domain vector embedding(s) corresponding to the in-domain data. Distances are determined between out-of-domain and in-domain vector embedding(s). Based on the distances, a textual content generation model is tuned on item(s) of the out-of-domain data. The item(s) of out-of-domain data to use for tuning may be selected and/or ordered based on the distances. A resulting model may be stored and used to transform unlabeled item(s) of non-textual content to textual content.

In one embodiment, a computer-implemented method includes accessing a set of training data comprising a plurality of items of non-textual digital media content.
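The end-to-end flow summarized above can be sketched with synthetic data. Here `embed` is a toy stand-in for a pre-trained embedding model, the data is random, and the actual fine-tuning step is elided; only the embed-then-rank-by-distance skeleton is shown:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a pre-trained embedding model (a real system would
# load a trained network and embed actual media items).
def embed(item):
    return np.tanh(item)

ood_items = rng.normal(size=(20, 4))    # out-of-domain items
seed_items = rng.normal(size=(2, 4))    # small in-domain seed set

ood_emb = np.stack([embed(x) for x in ood_items])
seed_emb = np.stack([embed(x) for x in seed_items])

# Distance of each out-of-domain embedding to the in-domain centroid;
# tuning would then visit items farthest-first, per the curriculum above.
centroid = seed_emb.mean(axis=0)
order = np.argsort(np.linalg.norm(ood_emb - centroid, axis=1))[::-1]
print(order[:3])   # indices of the farthest items, tuned on first
```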
Each item of the plurality of items of non-textual digital media content is labeled with corresponding textual content and one or more background characteristics that indicate an origination category of a plurality of candidate origination categories of the item of non-textual digital media content. The computer-implemented method further includes using one or more pre-trained machine learning models to generate vector embeddings of the set of training data. The generated vector embeddings are used to train another machine learning model to predict the one or more background characteristics based on vector embeddings of non-textual digital media content. The one or more pre-trained machine learning models are further used to generate a particular vector embedding that represents one or more particular items of non-textual digital media content other than the plurality of items of non-textual digital media content. Each particular item of the one or more particular items is labeled with corresponding textual content but not with any background characteristics that indicate any of the plurality of candidate origination categories of the particular item of non-textual digital media content. The other machine learning model is further used to determine at least a first set of vector embeddings correspond