
EP-4738195-A1 - ARCHITECTURE AND TRAINING METHOD FOR MULTIMODAL CONTENT MODERATION MODEL


Abstract

Certain aspects provide a method of performing content moderation with a multimodal machine learning (ML) architecture, wherein: the multimodal ML architecture includes: a plurality of encoders, each configured to encode content of one of a plurality of modalities; a plurality of projectors, each associated with one of the plurality of encoders and configured to process output from the one of the plurality of encoders; and a large language model configured to generate a content moderation output based on outputs from the plurality of projectors, and the method includes: processing an input including contents of the plurality of modalities to generate a plurality of embeddings associated, respectively, with the plurality of modalities; processing the plurality of embeddings to generate a plurality of projected embeddings, each including a parameter number that the large language model is configured to process; and processing the plurality of projected embeddings to generate the content moderation output.
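For illustration only (not part of the published application), the following is a minimal sketch of one modality-specific projector as described in the abstract: a small multilayer perceptron that maps an encoder's output embeddings into the embedding size the large language model is configured to process, which is one plausible reading of the recited "parameter number." All class names and dimensions are hypothetical, assuming PyTorch.

```python
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Hypothetical MLP projector: maps one modality's encoder embeddings
    into the embedding space the language model accepts."""

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, encoder_embeddings: torch.Tensor) -> torch.Tensor:
        # encoder_embeddings: (batch, tokens, encoder_dim) from a frozen modality encoder
        return self.mlp(encoder_embeddings)  # (batch, tokens, llm_dim)


# Example: project a batch of image-encoder outputs into the LLM embedding space.
image_projector = ModalityProjector(encoder_dim=1024, llm_dim=4096)
image_embeddings = torch.randn(2, 256, 1024)   # placeholder encoder output
projected = image_projector(image_embeddings)  # shape (2, 256, 4096)
```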

Inventors

  • RIMCHALA, Tharathorn
  • PENA PENA, Karelia Del Carmen

Assignees

  • Intuit Inc.

Dates

Publication Date
2026-05-06
Application Date
2025-03-26

Claims (15)

  1. A method of performing content moderation with a multimodal machine learning (ML) architecture that comprises: a plurality of encoders, each encoder configured to encode content of one of a plurality of modalities; a plurality of projectors, each projector associated with one of the plurality of encoders and configured to process output from the one of the plurality of encoders; and a large language model configured to generate a content moderation output based on outputs from the plurality of projectors, the method comprising: processing, with the plurality of encoders, an input comprising contents of the plurality of modalities to generate a plurality of embeddings associated, respectively, with the plurality of modalities; processing, with the plurality of projectors, the plurality of embeddings to generate a plurality of projected embeddings, each projected embedding comprising a parameter number that the large language model is configured to process; and processing, with the large language model, the plurality of projected embeddings to generate the content moderation output.
  2. The method of Claim 1, wherein each of the plurality of projectors comprises one or more multilayer perceptrons (MLPs) specific to one of the plurality of modalities.
  3. The method of Claim 1 or 2, wherein the large language model comprises one or more modality-specific low-rank adaptation (LoRA) layers configured for following instructions for unimodal content moderation and multimodal content moderation.
  4. The method of any of Claims 1 to 3, wherein the large language model comprises a pre-trained large language model trained for unimodal content moderation.
  5. The method of any of Claims 1 to 4, wherein: processing, with the large language model, the plurality of projected embeddings comprises generating a content moderation prompt and prompting the large language model with the content moderation prompt, and the content moderation prompt comprises: a task instruction; a customizable policy comprising a plurality of content moderation categories and associated descriptions; a multimodal content placeholder; and an output instruction.
  6. The method of Claim 5, wherein the output instruction comprises a description of an output structure, comprising: a proposed action; a content moderation category name indicative of a reason for the proposed action; a harm rating; and one or more example outputs.
  7. The method of Claim 5 or 6, wherein the customizable policy and the multimodal content placeholder are marked by a set of tokens indicating a beginning and an end of the customizable policy and a beginning and an end of the multimodal content placeholder.
  8. A method of training a multimodal machine learning (ML) architecture to perform content moderation, wherein: the multimodal ML architecture comprises: a plurality of encoders, each encoder configured to encode content of one of a plurality of modalities; a plurality of projectors, each projector associated with one of the plurality of encoders and configured to process output from the one of the plurality of encoders; and a large language model configured to generate a content moderation output based on outputs from the plurality of projectors, and the method comprises: performing a first stage of training, including training each of the plurality of projectors based on one or more unique bi-modality datasets while freezing parameters of the plurality of encoders and the large language model; performing a second stage of training, including training each of the plurality of projectors based on one or more tri-modality datasets while freezing the parameters of the plurality of encoders and the large language model; and performing a third stage of training, including training each of the plurality of projectors, one or more low-rank adaptation (LoRA) layers of each of the plurality of encoders, and one or more LoRA layers of the large language model.
  9. The method of Claim 8, wherein training each of the plurality of projectors based on the one or more unique bi-modality datasets comprises: training a first multilayer perceptron (MLP) specific to an image modality while keeping the large language model and the plurality of encoders frozen, and training a second MLP specific to an audio modality while keeping the large language model and the plurality of encoders frozen.
  10. The method of Claim 9, wherein training each of the plurality of projectors based on the one or more unique bi-modality datasets comprises training the first MLP and the second MLP separately.
  11. The method of Claim 8, 9 or 10, wherein: the one or more tri-modality datasets comprises a plurality of segmented clips of a video, and each segmented clip of the plurality of segmented clips of the video comprises a threshold level of similarity amongst a first image at a beginning of the segmented clip, a second image at a middle of the segmented clip, and a third image at an end of the segmented clip.
  12. The method of Claim 11, wherein training each of the plurality of projectors based on the one or more tri-modality datasets comprises: training a first multilayer perceptron (MLP) specific to an image modality based on the second images of the plurality of segmented clips of the video, and training a second MLP specific to an audio modality based on audio content of the plurality of segmented clips of the video.
  13. The method of Claim 12, wherein training each of the plurality of projectors based on the one or more tri-modality datasets comprises training the first MLP and the second MLP independently and simultaneously.
  14. The method of any of Claims 8 to 13, wherein the first stage of training and the second stage of training comprise a curriculum-based training of each of the plurality of projectors for aligning a plurality of parameters from a first representation associated with an encoded content from one of the plurality of encoders to a second representation associated with the large language model.
  15. The method of any of Claims 8 to 14, wherein performing the third stage of training comprises training based on a content moderation instruction fine-tuning dataset comprising a unimodal dataset and a multimodal dataset, each comprising associated content moderation instructions.
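For illustration only, a minimal sketch of how a content moderation prompt with the structure recited in Claims 5 to 7 might be assembled: a task instruction, a customizable policy of categories and descriptions, a multimodal content placeholder, and an output instruction, with begin/end tokens marking the policy and the placeholder. The delimiter tokens, category names, and output fields shown are illustrative assumptions, not taken from the application.

```python
# Hypothetical prompt builder following the structure of Claims 5-7.
# All delimiter tokens, categories, and output fields are assumptions.

POLICY_BEGIN, POLICY_END = "<policy>", "</policy>"
CONTENT_BEGIN, CONTENT_END = "<content>", "</content>"


def build_moderation_prompt(policy: dict[str, str],
                            content_placeholder: str = "<multimodal_content>") -> str:
    # Task instruction.
    task_instruction = (
        "You are a content moderation assistant. Decide whether the content "
        "below violates any category in the policy."
    )
    # Customizable policy: category names with associated descriptions.
    policy_text = "\n".join(f"- {name}: {description}" for name, description in policy.items())
    # Output instruction: proposed action, category name, harm rating, example output.
    output_instruction = (
        "Respond with: a proposed action (allow/flag/block), the policy category "
        "name explaining the action, and a harm rating from 0 to 5.\n"
        'Example: {"action": "block", "category": "hate", "harm_rating": 5}'
    )
    return "\n".join([
        task_instruction,
        f"{POLICY_BEGIN}\n{policy_text}\n{POLICY_END}",
        f"{CONTENT_BEGIN}\n{content_placeholder}\n{CONTENT_END}",
        output_instruction,
    ])


# Example with a small customizable policy.
prompt = build_moderation_prompt({
    "hate": "Content attacking a person or group based on protected attributes.",
    "harassment": "Content intended to demean, intimidate, or bully an individual.",
})
print(prompt)
```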

Description

BACKGROUND

Field

Aspects of the present disclosure relate to content moderation using a multimodal machine learning (ML) architecture and to methods for training a multimodal content moderation model.

Description of Related Art

Creation and consumption of digital content is now ubiquitous. More recently, machine learning models, such as large language models, are being used to generate content. Intentionally or unintentionally, machine-generated content may include harmful content. Harmful content includes, for example, impolite, rude, insensitive, obscene, illegal, profane, insulting, and/or otherwise offensive content. The presence of such harmful content in machine-generated content may lead to significant consequences, including legal consequences, loss of employment, etc. Content moderation is generally the process of determining whether content is harmful. One way of performing content moderation is to prompt an ML model to determine whether content is harmful. However, determining whether content is harmful with an ML model is not always straightforward, such as when the content is multimodal (e.g., including text and images). For example, text such as "13-year-old me forced by my parents to talk to a relative I never met in my life," by itself, may not necessarily be considered harmful. However, when this text is placed within an image of an animal holding a phone and making an obscene gesture, and/or is combined with audio of an obscene phrase being shouted, the combined content may be identified as harmful. Thus, identifying multimodal content as harmful poses a challenging technical problem. Accordingly, there is a need for an improved method of content moderation.

SUMMARY

Particular aspects are set out in the appended independent claims. Various optional embodiments are set out in the dependent claims.

One aspect provides a method of performing content moderation with a multimodal ML architecture, wherein: the multimodal ML architecture comprises: a plurality of encoders, each encoder configured to encode content of one of a plurality of modalities; a plurality of projectors, each projector associated with one of the plurality of encoders and configured to process output from the one of the plurality of encoders; and a large language model configured to generate a content moderation output based on outputs from the plurality of projectors, and the method comprises: processing, with the plurality of encoders, an input comprising contents of the plurality of modalities to generate a plurality of embeddings associated, respectively, with the plurality of modalities; processing, with the plurality of projectors, the plurality of embeddings to generate a plurality of projected embeddings, each projected embedding comprising a parameter number that the large language model is configured to process; and processing, with the large language model, the plurality of projected embeddings to generate the content moderation output.
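For illustration only, a minimal sketch of how the architecture summarized above might be wired end to end: per-modality encoders feed per-modality projectors, whose outputs are combined with the tokenized prompt and passed to a language model that produces the content moderation output. The encoder and LLM modules here are placeholders and the placeholder-splicing is simplified to concatenation; this is an assumption-laden sketch, not the application's implementation.

```python
import torch
import torch.nn as nn


class MultimodalModerationModel(nn.Module):
    """Hypothetical wiring of encoders -> projectors -> large language model."""

    def __init__(self, encoders: dict, projectors: dict, llm: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)      # e.g. {"image": ..., "audio": ...}
        self.projectors = nn.ModuleDict(projectors)  # one projector per modality
        self.llm = llm

    def forward(self, inputs: dict, prompt_embeddings: torch.Tensor) -> torch.Tensor:
        projected = []
        for modality, content in inputs.items():
            embeddings = self.encoders[modality](content)             # modality-specific embeddings
            projected.append(self.projectors[modality](embeddings))   # aligned to the LLM space
        # Combine prompt and projected embeddings (placeholder splicing simplified
        # to concatenation along the token dimension).
        llm_input = torch.cat([prompt_embeddings, *projected], dim=1)
        return self.llm(llm_input)  # content moderation output


# Example wiring with placeholder modules (illustrative only).
model = MultimodalModerationModel(
    encoders={"image": nn.Identity(), "audio": nn.Identity()},
    projectors={"image": nn.Linear(1024, 4096), "audio": nn.Linear(768, 4096)},
    llm=nn.Identity(),
)
output = model(
    inputs={"image": torch.randn(1, 16, 1024), "audio": torch.randn(1, 8, 768)},
    prompt_embeddings=torch.randn(1, 32, 4096),
)
```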
Another aspect provides a method of training a multimodal ML architecture to perform content moderation, wherein: the multimodal ML architecture comprises: a plurality of encoders, each encoder configured to encode content of one of a plurality of modalities; a plurality of projectors, each projector associated with one of the plurality of encoders and configured to process output from the one of the plurality of encoders; and a large language model configured to generate a content moderation output based on outputs from the plurality of projectors, and the method comprises: performing a first stage of training, including training each of the plurality of projectors based on one or more unique bi-modality datasets while freezing parameters of the plurality of encoders and the large language model; performing a second stage of training, including training each of the plurality of projectors based on one or more tri-modality datasets while freezing the parameters of the plurality of encoders and the large language model; and performing a third stage of training, including training each of the plurality of projectors, one or more low-rank adaptation (LoRA) layers of each of the plurality of encoders, and one or more LoRA layers of the large language model.

Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those described herein; and a processing system comprising means for performing the aforementioned methods as well as those described herein.

The following description and the related drawings set forth in detail certain illustrative features
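For illustration only, a minimal sketch of the three training stages summarized above: stages 1 and 2 update only the projectors while the encoders and the large language model stay frozen, and stage 3 additionally updates LoRA layers of the encoders and the LLM on an instruction fine-tuning dataset. The data loaders, loss function, model attributes, and LoRA parameter lookup are placeholder assumptions layered on the hypothetical model class sketched earlier.

```python
import torch


def run_training_stages(model, bimodal_loader, trimodal_loader, instruction_loader, loss_fn):
    """Hypothetical outline of the three-stage training procedure."""

    def freeze(module, frozen=True):
        for p in module.parameters():
            p.requires_grad_(not frozen)

    def train(parameters, loader):
        optimizer = torch.optim.AdamW([p for p in parameters if p.requires_grad], lr=1e-4)
        for batch in loader:
            loss = loss_fn(model, batch)  # placeholder loss over a batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 1: projectors only, on bi-modality data (e.g. image-text, audio-text);
    # encoders and LLM are frozen.
    freeze(model.encoders)
    freeze(model.llm)
    freeze(model.projectors, frozen=False)
    train(list(model.projectors.parameters()), bimodal_loader)

    # Stage 2: projectors only, on tri-modality (segmented video clip) data;
    # encoders and LLM remain frozen.
    train(list(model.projectors.parameters()), trimodal_loader)

    # Stage 3: projectors plus LoRA layers of the encoders and the LLM,
    # on a content moderation instruction fine-tuning dataset.
    lora_params = [p for n, p in model.named_parameters() if "lora" in n]  # assumed naming
    for p in lora_params:
        p.requires_grad_(True)
    train(list(model.projectors.parameters()) + lora_params, instruction_loader)
```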