US-12621323-B2 - Content-oblivious fraudulent email detection system
Abstract
A system supporting one or more machine learning models may receive, via a cloud-based platform that supports a multi-tenant system, metadata associated with a set of electronic communication messages for a tenant of the multi-tenant system. The system may normalize the metadata by extracting fields of the metadata into a format readable by the machine learning model to identify a set of fraudulent users associated with the set of electronic messages. The system may utilize the machine learning model to identify the set of fraudulent users based on executing a set of detection models and performing pattern matching between a set of previously authenticated user activity logs and a set of newly generated user activity logs in the metadata. Upon detection of the set of fraudulent users, the system may generate and transmit a report indicating the set of fraudulent users and the respective electronic message corresponding to the respective fraudulent user.
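The normalization step summarized above can be illustrated with a minimal sketch. It assumes each activity log arrives as a dictionary; the field names (`username`, `org_id`, `org_size`, `signup_time`, `sent_time`, `api`) are hypothetical and are not taken from the disclosure. The disclosure states that the subject line and body text are absent from the metadata, so the sketch simply never copies such keys into the output rows.

```python
from datetime import datetime

# Hypothetical column names for the tabular format; not specified by the source.
METADATA_FIELDS = ("username", "org_id", "org_size", "signup_time", "sent_time", "api")
TIME_FIELDS = {"signup_time", "sent_time"}


def normalize(raw_logs):
    """Flatten raw activity-log dicts into uniform rows, one column per field.

    Only the whitelisted metadata fields are copied, so content keys such as
    "subject" or "body" never reach the model input even if they are present.
    """
    rows = []
    for log in raw_logs:
        row = {}
        for field in METADATA_FIELDS:
            value = log.get(field)  # missing fields become None
            if value is not None and field in TIME_FIELDS:
                value = datetime.fromisoformat(value)  # ISO 8601 string -> datetime
            elif value is not None and field == "org_size":
                value = int(value)
            row[field] = value
        rows.append(row)
    return rows


rows = normalize([
    {"username": "alice", "org_size": "12",
     "signup_time": "2023-04-28T10:00:00", "subject": "Hi", "body": "secret"},
])
```

Using a whitelist of fields, rather than deleting known content keys, keeps the step content-oblivious by construction: a new content-bearing key in the input is ignored unless it is explicitly added to the column list.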
Inventors
- Xiao Zhang
- Robin Stuart
- John Seymour
Assignees
- SALESFORCE, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-04-28
Claims (19)
- 1 . A method for data processing, comprising: receiving, via a cloud-based platform supporting a plurality of tenants, metadata for a tenant of the plurality of tenants, the metadata comprising a list of fields comprising a plurality of previously authenticated user activity logs associated with a first plurality of electronic communication messages and a plurality of newly generated user activity logs associated with a second plurality of electronic communication messages different from the first plurality of electronic communication messages, wherein each respective electronic communication message comprises a respective subject line and a respective set of content body text, and wherein the respective subject line and the respective set of content body text within each electronic communication message of the first plurality of electronic communication messages and the second plurality of electronic communication messages is absent from the metadata associated with the first plurality of electronic communication messages and the second plurality of electronic communication messages; normalizing the metadata by extracting one or more fields from the list of fields and transforming into a format readable by a machine learning model, wherein the machine learning model is trained on the plurality of previously authenticated user activity logs associated with the first plurality of electronic communication messages and is configured to identify a plurality of fraudulent users associated with the second plurality of electronic communication messages based at least in part on a training of the machine learning model; executing the machine learning model for identification of the plurality of fraudulent users based at least in part on inputting the normalized metadata into the machine learning model wherein the machine learning model is executed to perform a pattern matching between the plurality of previously authenticated user activity logs and the plurality of newly generated user activity logs; and generating a report indicating the plurality of fraudulent users, wherein a respective fraudulent user of the plurality of fraudulent users is associated with a respective electronic communication message from the second plurality of electronic communication messages.
- 2 . The method of claim 1 , further comprising: training the machine learning model with a list of known fraudulent users, a list of known authenticated users, and the plurality of previously authenticated user activity logs associated with the list of known authenticated users.
- 3 . The method of claim 1 , wherein executing the machine learning model further comprises: executing a plurality of detection models using the normalized metadata to identify the plurality of fraudulent users, wherein the plurality of detection models are configured to run concurrently for each electronic communication message of the second plurality of electronic communication messages.
- 4 . The method of claim 3 , wherein a fraudulent user is detected if at least one detection model of the plurality of detection models identifies a user of the respective electronic communication message of the second plurality of electronic communication messages as a fraudulent user.
- 5 . The method of claim 3 , wherein executing the plurality of detection models comprises: executing a bulk sign up detection model configured to identify a respective fraudulent user based at least in part on a gibberish detection score; executing an impersonation detection model configured to identify a respective fraudulent user based at least in part on a detected pattern and organization size of the respective fraudulent user; and executing a mass-mail detection model configured to identify a respective fraudulent user based at least in part on a quantity of electronic communication messages.
- 6 . The method of claim 5 , wherein executing the bulk sign up detection model further comprises: generating the gibberish detection score for a respective user identifier associated with a respective newly generated user activity log of the plurality of newly generated user activity logs; detecting that the gibberish detection score satisfies a gibberish detection score threshold; and identifying that the respective user identifier is associated with a fraudulent user based at least in part on the gibberish detection score threshold being satisfied.
- 7 . The method of claim 5 , wherein executing the impersonation detection model further comprises: determining the organization size associated with a respective user identifier associated with a respective newly generated user activity log of the plurality of newly generated user activity logs, the organization size associated with the respective user identifier corresponding to an organization associated with the respective user identifier; detecting a pattern between the respective user identifier and user identifiers of a list of known fraudulent users and that the organization size satisfies an organization size threshold; and identifying the respective user identifier is associated with a fraudulent user based at least in part on the detected pattern and the organization size threshold being satisfied.
- 8 . The method of claim 5 , wherein executing the mass-mail detection model further comprises: determining the quantity of electronic communication messages associated with at least one of a respective user identifier from the second plurality of electronic communication messages, a time difference between a sign-up time associated with the respective user identifier and an electronic communication message transmission time, an organization size associated with the respective user identifier, an application programming interface used to transmit an electronic communication message, or a combination thereof, the organization size associated with the respective user identifier corresponding to an organization associated with the respective user identifier; detecting that the quantity of the electronic communication messages satisfies an electronic communication message quantity threshold, the time difference between the sign-up time and the electronic communication message transmission time satisfies a time difference threshold, the organization size satisfies an organization size threshold, that the application programming interface matches a mass-mail application programming interface, or any combination thereof; and identifying that the respective user identifier is associated with a fraudulent user based at least in part on detecting that the quantity of the electronic communication messages satisfies the electronic communication message quantity threshold, the time difference satisfies the time difference threshold, the organization size satisfies the organization size threshold, that the application programming interface matches the mass-mail application programming interface, or any combination thereof.
- 9 . The method of claim 1 , wherein normalizing the metadata further comprises: extracting the one or more fields from each respective electronic communication message of the second plurality of electronic communication messages into a tabular format, wherein the one or more fields exclude the respective subject line and the respective set of content body text from each respective electronic communication message.
- 10 . The method of claim 9 , wherein a plurality of columns of the tabular format for the normalized metadata are associated with the list of fields in the metadata.
- 11 . The method of claim 1 , wherein the list of fields in the metadata for a respective newly generated user activity log of the plurality of newly generated user activity logs include at least one of a username, an organization size, a sign-up date, or any combination thereof, and wherein the organization size for the respective newly generated user activity log corresponding to an organization associated with the respective newly generated user activity log.
- 12 . The method of claim 1 , wherein the report indicating the plurality of fraudulent users includes at least one of an organization identifier of an organization corresponding to respective fraudulent users of the plurality of fraudulent users, a user identifier, an associated organization name of the organization corresponding to respective fraudulent users of the plurality of fraudulent users, a creation date, or any combination thereof.
- 13 . An apparatus for data processing, comprising: one or more processors; one or more memories coupled with the one or more processors; and instructions stored in the one or more memories and executable by the one or more processors to cause the apparatus to: receive, via a cloud-based platform supporting a plurality of tenants, metadata for a tenant of the plurality of tenants, the metadata comprising a list of fields comprising a plurality of previously authenticated user activity logs associated with a first plurality of electronic communication messages and a plurality of newly generated user activity logs associated with a second plurality of electronic communication messages different from the first plurality of electronic communication messages, wherein each respective electronic communication message comprises a respective subject line and a respective set of content body text, and wherein the respective subject line and the respective set of content body text within each electronic communication message of the first plurality of electronic communication messages and the second plurality of electronic communication messages is absent from the metadata associated with the first plurality of electronic communication messages and the second plurality of electronic communication messages; normalize the metadata by extracting one or more fields from the list of fields and transforming into a format readable by a machine learning model, wherein the machine learning model is trained on the plurality of previously authenticated user activity logs associated with the first plurality of electronic communication messages and is configured to identify a plurality of fraudulent users associated with the second plurality of electronic communication messages based at least in part on a training of the machine learning model; execute the machine learning model for identification of the plurality of fraudulent users based at least in part on the normalized metadata being input into the machine learning model wherein the machine learning model is executed to perform a pattern matching between the plurality of previously authenticated user activity logs and the plurality of newly generated user activity logs; and generate a report indicating the plurality of fraudulent users, wherein a respective fraudulent user of the plurality of fraudulent users is associated with a respective electronic communication message from the second plurality of electronic communication messages.
- 14 . The apparatus of claim 13 , wherein the instructions are further executable by the one or more processors to cause the apparatus to: train the machine learning model with a list of known fraudulent users, a list of known authenticated users, and the plurality of previously authenticated user activity logs associated with the list of known authenticated users.
- 15 . The apparatus of claim 13 , wherein the instructions to execute the machine learning model are further executable by the one or more processors to cause the apparatus to: execute a plurality of detection models using the normalized metadata to identify the plurality of fraudulent users, wherein the plurality of detection models are configured to run concurrently for each electronic communication message of the second plurality of electronic communication messages.
- 16 . The apparatus of claim 15 , wherein a fraudulent user is detected if at least one detection model of the plurality of detection models identifies a user of the respective electronic communication message of the second plurality of electronic communication messages as a fraudulent user.
- 17 . The apparatus of claim 15 , wherein the instructions to execute the plurality of detection models are executable by the one or more processors to cause the apparatus to: executing a bulk sign up detection model configured to identify a respective fraudulent user based at least in part on a gibberish detection score; executing an impersonation detection model configured to identify a respective fraudulent user based at least in part on a detected pattern and organization size of the respective fraudulent user; and executing a mass-mail detection model configured to identify a respective fraudulent user based at least in part on a quantity of electronic communication messages.
- 18 . The apparatus of claim 13 , wherein the instructions to normalize the metadata are further executable by the one or more processors to cause the apparatus to: extract the one or more fields from each respective electronic communication message of the second plurality of electronic communication messages into a tabular format, wherein the one or more fields exclude the respective subject line and the respective set of content body text from each respective electronic communication message.
- 19 . A non-transitory computer-readable medium storing code for data processing, the code comprising instructions executable by a one or more processors to: receive, via a cloud-based platform supporting a plurality of tenants, metadata for a tenant of the plurality of tenants, the metadata comprising a list of fields comprising a plurality of previously authenticated user activity logs associated with a first plurality of electronic communication messages and a plurality of newly generated user activity logs associated with a second plurality of electronic communication messages different from the first plurality of electronic communication messages, wherein each respective electronic communication message comprises a respective subject line and a respective set of content body text, and wherein the respective subject line and the respective set of content body text within each electronic communication message of the first plurality of electronic communication messages and the second plurality of electronic communication messages is absent from the metadata associated with the first plurality of electronic communication messages and the second plurality of electronic communication messages; normalize the metadata by extracting one or more fields from the list of fields and transforming into a format readable by a machine learning model, wherein the machine learning model is trained on the plurality of previously authenticated user activity logs associated with the first plurality of electronic communication messages and is configured to identify a plurality of fraudulent users associated with the second plurality of electronic communication messages based at least in part on a training of the machine learning model; execute the machine learning model for identification of the plurality of fraudulent users based at least in part on the normalized metadata being input into the machine learning model wherein the machine learning model is executed to perform a pattern matching between the plurality of previously authenticated user activity logs and the plurality of newly generated user activity logs; and generate a report indicating the plurality of fraudulent users, wherein a respective fraudulent user of the plurality of fraudulent users is associated with a respective electronic communication message from the second plurality of electronic communication messages.
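The detection flow recited in the claims (several detection models run concurrently per message, a user is flagged if at least one model fires, and a report lists the flagged users) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the row field names, all threshold values, the toy gibberish score (share of non-vowel characters), the use of `difflib.SequenceMatcher` as a stand-in for the claimed pattern detection, and the reading of the organization size threshold as "small organization". The claims do not specify any of these, and the mass-mail check here uses only two of the claimed signals.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

# Illustrative thresholds only; none of these values come from the source.
GIBBERISH_THRESHOLD = 0.9
SIMILARITY_THRESHOLD = 0.8
SMALL_ORG_SIZE = 5
MESSAGE_COUNT_THRESHOLD = 100
SIGNUP_TO_SEND_SECONDS = 3600


def gibberish_score(username):
    """Toy stand-in score: fraction of letters/digits that are not vowels."""
    chars = [c for c in username.lower() if c.isalnum()]
    if not chars:
        return 0.0
    return sum(c not in "aeiou" for c in chars) / len(chars)


def bulk_signup(row, known_fraud):
    # Flag when the gibberish score satisfies the threshold.
    return gibberish_score(row["username"]) >= GIBBERISH_THRESHOLD


def impersonation(row, known_fraud):
    # Flag a near-match to a known fraudulent identifier from a small organization.
    similar = any(
        SequenceMatcher(None, row["username"], bad).ratio() >= SIMILARITY_THRESHOLD
        for bad in known_fraud
    )
    return similar and row["org_size"] <= SMALL_ORG_SIZE


def mass_mail(row, known_fraud):
    # Flag many messages sent shortly after sign-up.
    return (row["message_count"] >= MESSAGE_COUNT_THRESHOLD
            and row["seconds_since_signup"] <= SIGNUP_TO_SEND_SECONDS)


DETECTORS = (bulk_signup, impersonation, mass_mail)


def detect(rows, known_fraud):
    """Run all detectors concurrently per row; report rows flagged by any of them."""
    report = []
    with ThreadPoolExecutor(max_workers=len(DETECTORS)) as pool:
        for row in rows:
            futures = [pool.submit(d, row, known_fraud) for d in DETECTORS]
            flags = {d.__name__: f.result() for d, f in zip(DETECTORS, futures)}
            if any(flags.values()):
                report.append({
                    "org_id": row["org_id"],
                    "username": row["username"],
                    "flagged_by": sorted(name for name, hit in flags.items() if hit),
                })
    return report


rows = [
    {"org_id": "o1", "username": "alice", "org_size": 200,
     "message_count": 3, "seconds_since_signup": 86400},
    {"org_id": "o2", "username": "xk7qzt9w", "org_size": 1,
     "message_count": 2, "seconds_since_signup": 86400},
    {"org_id": "o3", "username": "paypa1-support", "org_size": 1,
     "message_count": 500, "seconds_since_signup": 60},
]
report = detect(rows, known_fraud=["paypal-support"])
```

In this example the second user is reported by the bulk sign-up check alone, and the third by both the impersonation and mass-mail checks, while the first is not reported. Combining the detectors with a logical OR mirrors the "at least one detection model" wording; a production system would likely replace each toy rule with a trained model.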
Description
FIELD OF TECHNOLOGY
The present disclosure relates generally to database systems and data processing, and more specifically to a content-oblivious fraudulent email detection system.
BACKGROUND
A cloud platform (i.e., a computing platform for cloud computing) may be employed by multiple users to store, manage, and process data using a shared network of remote servers. Users may develop applications on the cloud platform to handle the storage, management, and processing of data. In some cases, the cloud platform may utilize a multi-tenant database system. Users may access the cloud platform using various user devices (e.g., desktop computers, laptops, smartphones, tablets, or other computing systems).
In one example, the cloud platform may support customer relationship management (CRM) solutions. This may include support for sales, service, marketing, community, analytics, applications, and the Internet of Things. A user may utilize the cloud platform to help manage contacts of the user. For example, managing contacts of the user may include analyzing data, storing and preparing communications, and tracking opportunities and sales.
In some cases, users may frequently receive spam and phishing email messages sent in an attempt to steal data and information and gain access to secure systems. Spam messages may be messages sent to a large number of users and may have dangerous content linked inside the message, such as computer viruses. Phishing messages may be messages in which the sender pretends to be a person, brand, or user that the recipient may trust (e.g., a fraudulent user pretending to be an authentic user). In some examples, the fraudulent user may attempt to have a user give up personal or confidential information. To prevent such emails from causing harm to users, some programs may be deployed and the contents of emails (e.g., email subject line, email body text) may be scanned for text indicative of a spam or phishing email.
However, allowing such programs access to the contents of users' emails may prove to be a security risk if the program stores personal, private, or confidential data or uses the data to train such programs.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of a data processing system that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 2 shows an example of a workflow that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 3 shows an example of a machine learning model diagram that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 4 shows an example of a process flow that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 5 shows a block diagram of an apparatus that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 6 shows a block diagram of a fraudulent user detection module that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIG. 7 shows a diagram of a system including a device that supports a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
FIGS. 8 through 12 show flowcharts illustrating methods that support a content-oblivious fraudulent email detection system in accordance with aspects of the present disclosure.
DETAILED DESCRIPTION
In some examples, tenants in a multi-tenant system may frequently receive spam and phishing electronic communication messages (e.g., email, text messages, instant messaging messages) attempting to steal data and information and gain access to secure systems.
Spam and phishing emails may cause reputational damage to reputable people and brands as well as financial damage to users. In some examples, such a system may detect phishing and spam emails via natural language processing (NLP) techniques, where the system may scan the contents and subjects of emails and run them through machine learning models. For example, a system may identify spam emails by learning from the text of users' emails and classifying messages as spam or non-spam. However, training and running such machine learning models to detect spam and phishing emails may be very time consuming and computationally inefficient. Additionally, such machine learning models may rely on access to the content of emails (e.g., the subject line, the email body text), which may present a security risk as the contents of emails may include personal, private, or other confidential information. Conversely, if the contents of an email are not available for training, such machine learning models may not be able to detect whether an email is a spam or phishing email. The te