CN-115630231-B - Social media emotion prediction algorithm based on text content and user portrayal

Abstract

The invention belongs to the technical field of big data mining, and particularly relates to a social media emotion prediction algorithm based on text content and user portraits. The method obtains past emotion index features through a SKEP pre-trained model, past word frequency features through a linguistic statistical word list, and past topic distribution features through an LDA topic model. The word frequency features and the topic distribution features are input into a back propagation neural network for feature extraction to obtain text features. Users are classified according to a series of personal information to obtain historical crowd portrait features of those participating in the discussion of a given social media topic. The text features, user portrait features, quantity features and emotion features are merged and input into a time-aware long short-term memory network (T-LSTM), and the last hidden layer of the T-LSTM is taken as the input of a final back propagation neural network to obtain the social media emotion index of the next day. The algorithm of the invention has high prediction accuracy.

Inventors

RUAN HUI
CHEN YANG
GONG QINGYUAN

Assignees

Fudan University (复旦大学)

Dates

Publication Date: 20260512
Application Date: 20220924

Claims (1)

1. A social media emotion prediction algorithm based on text content and user portraits, characterized by comprising the following specific steps:

Step 1, performing data cleaning on past social media text, and calculating all daily emotion indexes for the cleaned text through a SKEP pre-trained model;

Step 2, for the cleaned text, performing word frequency statistics on the text posted on each day of the previous 6-10 days by using a linguistic statistical word list, thereby obtaining trend features of the content of the past text at the word level, namely word frequency features; performing topic distribution statistics on the cleaned text posted on each day of the previous 6-10 days by using a topic model, thereby obtaining trend features of the content of the past text at the topic level, namely topic distribution features; inputting the word frequency features and the topic distribution features into a back propagation neural network for feature extraction to obtain dimension-reduced text features; and classifying the users who posted on each day of the previous 6-10 days according to a series of personal information to obtain historical crowd portrait features of those participating in the discussion of a social media topic; wherein the quantity features are defined as the number of microblogs in the previous 6-10 days, and the emotion features are the daily emotion indexes in the previous 6-10 days;

Step 3, merging all the time-series vector features obtained in Step 2, including the word frequency features, topic distribution features, crowd portrait features, quantity features and emotion features, inputting the merged features into a time-aware long short-term memory network (T-LSTM), and taking the last hidden layer of the T-LSTM as the input of a final back propagation neural network to obtain the social media emotion index of the next day;

Step 1 specifically comprises the following steps:

Step 1-1, cleaning the data, which comprises removing all non-alphabetic characters, deleting all useless words, and deleting all lines with missing data;

Step 1-2, for the cleaned text, obtaining through the SKEP pre-trained model the probability that the emotion of the text is positive, and taking this probability as the emotion index of the text;

Step 2 specifically comprises the following steps:

Step 2-1, for the cleaned text, performing word frequency statistics on the text posted on the same day by using a linguistic statistical word list, and obtaining the tendency features of the content of the past text at the word level; the cleaned social media text is subjected to vocabulary analysis with a pre-designed linguistic statistical word list, the vocabulary analysis comprising: using the LIWC dictionary to analyze seventeen kinds of words, namely positive, negative, anxiety, anger, sadness, social, family, friends, health, space, time, work, leisure, home, money, area and death, wherein the frequency of occurrence of each kind in each text is used as a tendency and style feature of the text and is finally incorporated when training the model;

Step 2-2, for the cleaned text, computing the topic distribution of the text posted on the same day by using a topic model, and obtaining the tendency features of the content of the past text at the topic level;

Step 2-3, inputting the word frequency features and the topic distribution features into a back propagation neural network for feature extraction, so as to obtain dimension-reduced text features;

Step 2-4, classifying the users who posted content on the same day according to a series of personal information to obtain historical crowd portrait features of those participating in the discussion of a given social media topic;

Step 3 specifically comprises the following steps:

Step 3-1, merging all the time-series vector features obtained in Step 2 and inputting the merged features into the time-aware long short-term memory network;

Step 3-2, taking the last hidden layer of the time-aware long short-term memory network as the input of the final back propagation neural network to obtain the social media emotion index of the next day;

during model training, the mean squared error is used as the training loss function (the formula itself is not reproduced in this text), where s is the time step of the input features; $x = (x_1, x_2, \dots, x_n)$, $x \in \mathbb{R}^n$, represents the features input to the final back propagation neural network, i.e. the last hidden layer of the time-aware long short-term memory network; $y = (y_1, y_2, \dots, y_n)$, $y \in (0, 1)$, represents the actual daily emotion index of the day; $f(x, \theta)$ is the model function, representing the predicted daily emotion index output given $x$; and $\theta$ is the model parameter.
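The loss formula does not survive in the claim text above. For orientation only, a conventional mean-squared-error form that uses the symbols defined there is sketched below. It assumes the average runs over n pairs $(x_i, y_i)$; the source also defines s as the time step of the input features, so the original formula may index or normalize differently.

```latex
% Assumed reconstruction, not the patent's own formula:
% a standard mean squared error over n samples.
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i, \theta) \bigr)^2
```

Under this form, minimizing $L(\theta)$ pushes each predicted daily emotion index $f(x_i, \theta)$ toward the actual index $y_i \in (0, 1)$, which matches the regression framing of the claim.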

Description

Social media emotion prediction algorithm based on text content and user portrayal

Technical Field

The invention belongs to the technical field of big data mining, and particularly relates to a social media emotion prediction algorithm based on text content and user portraits.

Background

Emotion analysis of social media features in many public opinion studies, where it serves as an important measure. Predicting the emotion index helps institutions learn about changes in public emotion in a timely manner, which is important for quickly developing and continuously improving significant management measures. Academia has produced many studies on emotion prediction, but prediction of the emotion index itself is lacking. Most research uses historical emotion indexes to predict non-emotion targets such as future stock prices and deaths. In carrying out research on emotion index prediction, the inventors found that current work on emotion indexes can only predict a future state: states are classified according to the size of the emotion index (very positive, generally positive, neutral, generally negative and very negative), so what is solved is a classification problem rather than a regression problem.

Disclosure of Invention

In view of the above, the present invention aims to provide a social media emotion prediction algorithm based on text content and user portraits, which can predict a specific future emotion index with high accuracy.
The overall flow of the social media emotion prediction algorithm based on text content and user portraits is shown in Figure 1. It comprises: calculating a daily emotion index through the SKEP [1] algorithm; performing vocabulary analysis with a linguistic statistical word list to obtain, at the word level, trend features of the content of the past text (called word frequency features); performing topic distribution statistics on the text posted on the same day with an LDA topic model [2] to obtain, at the topic level, trend features of the content of the past text (called topic distribution features); inputting the word frequency features and the topic distribution features into a back propagation neural network for feature extraction to obtain dimension-reduced text features; and classifying the users who posted content on the same day according to a series of personal information to obtain historical crowd portrait features of those participating in social media discussion. The quantity features are defined as the number of microblogs in the previous several days, and the emotion features as the daily emotion indexes in the previous several days.
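As a rough illustration of the word-level step described above, the following Python sketch computes per-category word frequencies for one day's posts and concatenates them with the other daily features. Everything named here is an assumption for illustration: `TOY_LEXICON` is a hypothetical three-category stand-in for the licensed LIWC dictionary, and `word_frequency_features` and `daily_feature_vector` are hypothetical helpers. The sketch also skips the back propagation network that the text uses to reduce the word frequency and topic features before merging.

```python
from collections import Counter

# Hypothetical toy lexicon (category -> words). The real LIWC dictionary is
# licensed and far larger; these entries are placeholders for illustration.
TOY_LEXICON = {
    "positive": {"good", "happy", "great"},
    "negative": {"bad", "sad", "awful"},
    "money": {"price", "cash", "pay"},
}


def word_frequency_features(posts, lexicon=TOY_LEXICON):
    """Share of one day's tokens falling in each category (categories sorted by name)."""
    tokens = [tok for post in posts for tok in post.lower().split()]
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok in words:
                counts[category] += 1
    # An empty day yields all zeros instead of dividing by zero.
    return [counts[c] / total if total else 0.0 for c in sorted(lexicon)]


def daily_feature_vector(posts, topic_dist, crowd_portrait, emotion_index,
                         lexicon=TOY_LEXICON):
    """Concatenate one day's features in a fixed order.

    Simplification (assumption): the raw word frequency and topic features are
    concatenated directly, without the dimension-reducing network.
    """
    return (
        word_frequency_features(posts, lexicon)
        + list(topic_dist)            # e.g. output of a topic model
        + list(crowd_portrait)        # e.g. share of users per user class
        + [float(len(posts)), float(emotion_index)]  # quantity and emotion features
    )


if __name__ == "__main__":
    day_posts = ["good price", "bad bad day"]
    print(word_frequency_features(day_posts))
    print(daily_feature_vector(day_posts, [0.7, 0.3], [0.5, 0.5], 0.62))
```

Stacking one such vector per day over the previous several days would give the time-series input that the text feeds to the sequence model.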
All time-series vector features (text features, comprising word frequency features and topic distribution features; crowd portrait features; quantity features; and emotion features) are input into a time-aware long short-term memory network (T-LSTM) [3], and the last hidden layer of the T-LSTM is taken as the input of a final back propagation neural network to obtain the social media emotion index of the next day.

The specific steps are as follows:

Step 1, performing data cleaning on past social media text, and calculating all daily emotion indexes for the cleaned text through a SKEP pre-trained model;

Step 2, for the cleaned text, performing word frequency statistics on the text posted on each of the previous several days (usually 6-10 days, typically 7 days; the same below) by using a linguistic statistical word list, thereby obtaining trend features of the content of the past text at the word level; performing topic distribution statistics on the cleaned text posted on each of the previous several days by using an LDA topic model, thereby obtaining trend features of the content of the past text at the topic level; inputting the word frequency features and the topic distribution features into a back propagation neural network for feature extraction to obtain dimension-reduced text features; and classifying the users who posted on each of the previous several days according to a series of personal information to obtain historical crowd portrait features of those participating in social media discussion; wherein the quantity features are defined as the number of microblogs in the previous several days, and the emotion features are the daily emotion indexes in the previous several days;

Step 3, merging all the time-series vector features (including the word frequency features, topic distribution features, crowd portrait features, quantity features and emotion features) obtained in Step 2, inputting the merged features into a time-aware long short-term memory network (T-LSTM), and taking the last