CN-121617635-B - Microblog user depression risk identification method based on large model and remote sensing environment sensing


Abstract

The invention discloses a microblog user depression risk identification method based on a large model and remote sensing environment sensing, and aims to overcome two defects of the prior art: sparse features and neglect of environmental influence. The method comprises: preprocessing and segmenting microblog texts; constructing 13 categories of depression feature word sets with a Transformer-based large model; calculating feature weights based on information value; extracting users' geographic positions; calculating the greening rate within a 500-meter radius using the Google Earth Engine (GEE) platform and Sentinel-2 satellite data; constructing an environment adjustment coefficient; correcting the negative emotion index; and classifying risk according to thresholds based on a normal distribution. Experiments show an accuracy of 92.05% and a recall of 94.12%. By combining text and environmental features, the method improves the multidimensionality and ecological validity of identification and can accurately identify users' depression risk.

Inventors

ZHOU ZIYU

Assignees

Hohai University

Dates

Publication Date: 20260508
Application Date: 20260202

Claims (6)

1. A microblog user depression risk identification method based on a large model and remote sensing environment perception, characterized by comprising the following steps:
Step S1, microblog text preprocessing and word segmentation: acquiring microblog text samples of a depression patient group U_dep and a common user group U_norm and preprocessing them, specifically comprising:
Step S1.1, sample acquisition: clearly delineating the depression patient group U_dep and the common user group U_norm, acquiring the microblog texts published by each group as training samples, and reserving the target user's microblog texts as samples to be predicted;
Step S1.2, stop word filtering: traversing the microblog texts by text matching against a Chinese stop word library containing a plurality of irrelevant words, automatically filtering out function words, logical connectives, personal pronouns, time expressions and numeric expressions, thereby completing text cleaning and standardization;
Step S1.3, word segmentation: traversing the cleaned text with a maximum-probability path search algorithm based on a prefix dictionary and dynamic programming, and determining a preliminary segmentation from the dictionary words and their occurrence probabilities; for words in the preliminary segmentation that are not recorded in the prefix dictionary, performing state inference and recognition with a Hidden Markov Model (HMM) based on a preset state transition probability matrix, observation probability matrix and initial probability matrix; and outputting a semantically complete word sequence;
Step S1.4, total word frequency counting: traversing the word sequences of the depression patient group and the common user group respectively, and accumulating the occurrences of all words in each group to obtain the corresponding total word frequencies, denoted T_dep and T_norm;
Step S2, semantic expansion and construction of a classification system for depression feature words: encoding the preprocessed microblog text with a large language model based on the Transformer architecture, and using the model's attention mechanism for feature fusion and clustering to construct a depression feature word set V containing 13 semantic categories;
Step S3, feature word frequency and proportion statistics: for each of the two groups, calculating the frequency of the feature words of each category in the 13-category depression feature word set V, together with the proportion and the average proportion of each category of feature words in the total vocabulary of the corresponding group;
Step S4, feature discrimination quantization and weight calculation: quantifying how well each category of feature words discriminates between the two groups based on the information value (IV) formula, and introducing a scaling factor α for smoothing to obtain the final weight i(v) of each category of feature words;
Step S5, environmental factor extraction and adjustment coefficient construction: extracting geographic position information from the user's microblog data, converting it into standard longitude and latitude coordinates, calculating the greening rate GR within an area of set radius R centered on the user's coordinates based on the normalized difference vegetation index (NDVI), and calculating an environment adjustment coefficient α_env;
Step S6, individual depression risk prediction and classification: repeating the word segmentation of step S1 on the target user's microblogs, completing semantic matching and classification of feature words with the large language model, calculating the weighted depression word frequency and the original negative emotion index, correcting the index with the environment adjustment coefficient to obtain the adjusted negative emotion index, and setting risk classification standards based on a normal distribution to classify the target user's depression risk.
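The dictionary stage of step S1.3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary `TOY_FREQ`, its counts, the stop word set and the function names `segment` and `preprocess` are all hypothetical, and the HMM stage for out-of-dictionary words is omitted (unknown characters simply fall back to single-character tokens).

```python
import math

# Hypothetical toy prefix dictionary: word -> occurrence count (illustrative values only).
TOY_FREQ = {"我": 50, "今天": 40, "今": 5, "天": 8, "很": 30, "难过": 20, "难": 6, "过": 10}
TOTAL = sum(TOY_FREQ.values())
STOP_WORDS = {"我", "很"}  # stand-in for the Chinese stop word library of step S1.2


def segment(text, freq=TOY_FREQ, total=TOTAL):
    """Maximum-probability path over dictionary words via dynamic programming."""
    n = len(text)
    # best[i] = (log-probability of the best segmentation of text[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get count 1 so that a path always exists.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


def preprocess(text):
    """Segment, then drop stop words (the claim filters before segmenting;
    the order is swapped here because this toy filter works on tokens)."""
    return [w for w in segment(text) if w not in STOP_WORDS]
```

With this toy dictionary, "今天" is preferred over "今" + "天" because the single dictionary word has a higher probability than the product of the two shorter ones, which is the behavior the maximum-probability path search is meant to produce.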
2. The microblog user depression risk identification method based on a large model and remote sensing environment perception according to claim 1, wherein step S2 specifically comprises:
Step S2.1, text encoding: using a large language model based on the Transformer architecture, segmenting the preprocessed microblog text into token sequences according to the model's input format, and generating a high-dimensional semantic vector for each token through the model's embedding layer, self-attention layers and feedforward neural network layers, the vector containing the word's contextual semantic information and emotional features;
Step S2.2, feature fusion and clustering: using the self-attention mechanism of the large language model, calculating the cosine similarity between the semantic vectors of all tokens, focusing on depression-related key tokens whose similarity exceeds a preset threshold, and preliminarily clustering the key tokens with the K-means clustering algorithm;
Step S2.3, classification system construction: constructing a depression feature word set V containing 13 semantic categories by combining psychological knowledge of depression, clinical diagnostic criteria for depression and analysis of the language style of depression expression on social media; the 13 categories of feature words are specifically: pain/distress/affliction, sadness/deep sorrow/grievance, anger/violence/dysphoria, anxiety/tension, strong negative exclamation/curse, tiredness/debilitation/drowsiness, aversion/suspicion, fear/panic, self-negation/self-deprecation/shame, loneliness/separation/isolation, remorse/regret/self-blame, avoidance/social fear, and other difficult-to-classify negative expressions;
Step S2.4, new word adaptation: for a new word not recorded in the depression feature word set, calculating with the large language model the similarity between the new word's semantic vector and the center vector of each of the 13 categories, and assigning the new word to the category with the highest similarity, thereby semantically expanding the feature words.
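The new word adaptation of step S2.4 amounts to a nearest-center lookup under cosine similarity. The sketch below assumes the semantic vectors have already been produced by the language model; the three-dimensional vectors, the two category names and the function names `cosine` and `assign_category` are hypothetical and only serve to show the assignment rule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_category(vector, centers):
    """Return (category, similarity) for the most similar center vector."""
    return max(((name, cosine(vector, c)) for name, c in centers.items()),
               key=lambda pair: pair[1])


# Hypothetical 3-dimensional center vectors for two of the 13 categories.
centers = {"sadness": [0.9, 0.1, 0.0], "anxiety": [0.1, 0.9, 0.2]}
category, similarity = assign_category([0.8, 0.2, 0.1], centers)
```

Step S6.2 reuses the same comparison for the target user's words, with the extra condition that the similarity must exceed a preset threshold before a word is counted in a category.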
3. The microblog user depression risk identification method based on a large model and remote sensing environment perception according to claim 2, wherein step S3 specifically comprises:
Step S3.1, feature word frequency calculation: traversing the word sequence of each user u in the depression patient group U_dep, counting the number of occurrences count_u(v) of category-v feature words for each user u in the 13-category feature word set, and calculating the feature word frequency f_dep(v) of the depression patient group on category v according to formula (1): [formula (1) not reproduced]; likewise, traversing the word sequences of the common user group U_norm and calculating the feature word frequency f_norm(v) of the common user group on category v according to formula (2): [formula (2) not reproduced];
Step S3.2, average feature word proportion calculation: calculating over all categories of feature words in the feature word set according to formula (3) to obtain the average proportion of depression feature words for the depression patient group: [formula (3) not reproduced]; likewise, calculating the average proportion of depression feature words for the common user group according to formula (4): [formula (4) not reproduced];
Step S3.3, per-category feature word proportion calculation: calculating the proportion P_dep(v) of category-v feature words in the total vocabulary of the depression patient group according to formula (5): [formula (5) not reproduced]; likewise, calculating the proportion P_norm(v) of category-v feature words in the common user group according to formula (6): [formula (6) not reproduced];
wherein T_dep represents the total word frequency of the depression patient group and T_norm represents the total word frequency of the common user group.
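Formulas (1), (2), (5) and (6) are not legible in this text, so the sketch below rests on assumed forms read off the surrounding wording: the group frequency f(v) is taken as the sum of count_u(v) over the group's users, and the proportion P(v) as f(v) divided by the group's total word frequency T. The function name `group_stats` and the sample data are hypothetical. The average proportions of formulas (3) and (4) are not attempted because the text does not determine their form.

```python
from collections import Counter


def group_stats(user_word_seqs, category_of):
    """Return (f, P, T) for one user group.

    user_word_seqs: list of word lists, one list per user.
    category_of: dict mapping a feature word to its category label.
    """
    f = Counter()  # f[v]: occurrences of category-v feature words in the group
    total = 0      # T: total word frequency of the group (step S1.4)
    for words in user_word_seqs:
        total += len(words)
        for w in words:
            if w in category_of:
                f[category_of[w]] += 1  # accumulates this user's count_u(v)
    # Assumed form of formulas (5)/(6): P(v) = f(v) / T.
    p = {v: n / total for v, n in f.items()} if total else {}
    return dict(f), p, total


# Hypothetical two-user group and feature word mapping.
category_of = {"难过": "sadness", "痛苦": "pain", "累": "tiredness"}
dep_users = [["难过", "今天", "痛苦"], ["难过", "累"]]
f_dep, p_dep, t_dep = group_stats(dep_users, category_of)
```

Running the same function on the common user group yields f_norm, P_norm and T_norm, which are the inputs to the information value calculation of step S4.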
4. The microblog user depression risk identification method based on a large model and remote sensing environment perception according to claim 3, wherein step S4 specifically comprises:
Step S4.1, information value (IV) calculation: substituting the P_dep(v) and P_norm(v) obtained in step S3 into formula (7): [formula (7) not reproduced]; wherein, when P_dep(v) or P_norm(v) is 0, a minimum value ε is added to avoid an undefined logarithm; the IV value quantifies the difference in the distribution of category-v feature words between the two groups, and the higher the IV value, the stronger the discrimination;
Step S4.2, weight smoothing and determination: introducing a scaling factor α to obtain the final weight of each category of feature words, i(v) = IV(v) × α.
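Formula (7) is not legible in this text. The sketch below assumes the commonly used information value form, IV(v) = (P_dep(v) - P_norm(v)) × ln(P_dep(v) / P_norm(v)), which matches the claim's mention of a logarithm and of an ε guard for zero proportions; the patent's exact expression may differ. The function names and the default values of `eps` and `alpha` are hypothetical.

```python
import math


def information_value(p_dep, p_norm, eps=1e-6):
    """Assumed IV form: (P_dep - P_norm) * ln(P_dep / P_norm)."""
    # Use eps in place of a zero proportion so the logarithm stays defined.
    p_dep = p_dep if p_dep > 0 else eps
    p_norm = p_norm if p_norm > 0 else eps
    return (p_dep - p_norm) * math.log(p_dep / p_norm)


def category_weights(p_dep, p_norm, categories, alpha=1.0):
    """Final weight per category, i(v) = IV(v) * alpha (step S4.2)."""
    return {v: information_value(p_dep.get(v, 0.0), p_norm.get(v, 0.0)) * alpha
            for v in categories}


# Hypothetical proportions for two categories.
weights = category_weights({"sadness": 0.4, "pain": 0.2},
                           {"sadness": 0.1, "pain": 0.2},
                           ["sadness", "pain"], alpha=0.5)
```

Under this form a category that is equally frequent in both groups gets weight 0, and the weight grows as the two proportions diverge, which is consistent with the claim's statement that a higher IV means stronger discrimination.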
5. The microblog user depression risk identification method based on a large model and remote sensing environment perception according to claim 4, wherein step S5 specifically comprises:
Step S5.1, geographic position information extraction: extracting geographic position information from the user's microblog data through the microblog open interface or a text semantic recognition algorithm, the geographic position information comprising positioning data recorded by the microblog check-in function, geographic position labels actively added by the user, or location descriptions recognized from text content;
Step S5.2, longitude and latitude conversion: converting the extracted geographic position information into standard longitude and latitude coordinates in the WGS-84 coordinate system by geocoding;
Step S5.3, greening rate calculation: using the Google Earth Engine remote sensing data analysis platform to retrieve Level-2A image data of the Sentinel-2 satellite, constructing a circular analysis area with a radius of 500 meters centered on the user's longitude and latitude coordinates, and calculating the normalized difference vegetation index NDVI from the satellite images of the area by formula (8):
NDVI = (ρ_NIR - ρ_Red) / (ρ_NIR + ρ_Red); (8)
treating pixels with NDVI > 0.2 as effective vegetation pixels, and taking the ratio of the number of effective vegetation pixels to the total number of pixels in the analysis area as the greening rate GR of the user's surroundings;
Step S5.4, environment adjustment coefficient construction: collecting the greening rates GR of all users in the training sample, taking the median as the reference greening rate GR_baseline, setting an empirical adjustment weight parameter β, and substituting them into formula (9) to calculate the environment adjustment coefficient:
α_env = 1 - β × (GR - GR_baseline); (9)
and, when the user's geographic position information cannot be extracted, automatically using a preset default adjustment coefficient for missing locations.
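Formulas (8) and (9) and the NDVI > 0.2 threshold translate directly into code. The sketch below works on plain lists of (NIR, red) reflectance pairs rather than on Google Earth Engine imagery, so no platform calls are shown; the sample reflectances, `beta=0.5` and `missing=1.0` are hypothetical values chosen for illustration, and the function names are not from the patent.

```python
from statistics import median


def ndvi(nir, red):
    """Formula (8); returns 0.0 when both reflectances are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def greening_rate(pixels, threshold=0.2):
    """Share of (nir, red) pixels whose NDVI exceeds the threshold."""
    if not pixels:
        raise ValueError("analysis area contains no pixels")
    vegetated = sum(1 for nir, red in pixels if ndvi(nir, red) > threshold)
    return vegetated / len(pixels)


def env_coefficient(gr, gr_baseline, beta, missing=1.0):
    """Formula (9); `missing` is the preset coefficient for users without a location."""
    if gr is None:
        return missing
    return 1 - beta * (gr - gr_baseline)


# Hypothetical reflectance pairs for the pixels inside one user's 500 m circle.
pixels = [(0.5, 0.1), (0.3, 0.25), (0.4, 0.1), (0.2, 0.2)]
gr = greening_rate(pixels)
# Baseline is the median greening rate over (hypothetical) training users.
gr_baseline = median([0.2, 0.3, 0.5])
alpha_env = env_coefficient(gr, gr_baseline, beta=0.5)
```

Because of the minus sign in formula (9), a user whose surroundings are greener than the baseline gets a coefficient below 1, which lowers the adjusted negative emotion index in step S6.3.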
6. The microblog user depression risk identification method based on a large model and remote sensing environment perception according to claim 5, wherein step S6 specifically comprises:
Step S6.1, target user text processing: repeating the stop word filtering and word segmentation of step S1 on the target user's microblog texts to obtain the target user's word sequence, and counting the target user's total microblog word frequency N;
Step S6.2, feature matching and word frequency statistics: encoding each word in the target user's word sequence with the large language model of step S2 to generate a semantic vector, calculating the cosine similarity between the semantic vector and the center vector of each category v in the 13-category feature word set V, assigning words whose similarity exceeds a preset threshold to the corresponding category v, and counting the number of feature word occurrences count(v) of the target user on each category v;
Step S6.3, negative emotion index calculation: combining the final weight i(v) of each category of feature words obtained in step S4 with the target user's count(v) for the corresponding category according to formula (10) to obtain the target user's weighted depression word frequency W: [formula (10) not reproduced]; calculating the original negative emotion index NEI_original according to formula (11) and the adjusted negative emotion index NEI_adjusted according to formula (12):
NEI_original = W / N; (11)
NEI_adjusted = NEI_original × α_env; (12)
Step S6.4, risk classification: assuming that the adjusted negative emotion indexes NEI_adjusted of all users follow a normal distribution, calculating the mean μ and standard deviation σ of NEI_adjusted over all users in the training sample, and outputting the target user's depression risk class according to the following standard: when NEI_adjusted < μ - 0.5σ, the user is judged to belong to the normal population; when μ - 0.5σ ≤ NEI_adjusted < μ + 1.5σ, the user is judged to be at mild depression risk; when μ + 1.5σ ≤ NEI_adjusted < μ + 2.5σ, the user is judged to be at moderate depression risk; when NEI_adjusted ≥ μ + 2.5σ, the user is judged to be at major depression risk.
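The scoring and thresholding of steps S6.3 and S6.4 can be sketched as below. Formula (10) is not legible in this text, so W is assumed here to be the sum of i(v) × count(v) over the categories; formulas (11) and (12) and the four thresholds are taken from the claim. The function names, the sample weights and counts, and the values of μ and σ are hypothetical.

```python
def nei_adjusted(counts, weights, n_words, alpha_env):
    """Formulas (11) and (12), with W assumed to be the sum of i(v) * count(v)."""
    w = sum(weights.get(v, 0.0) * c for v, c in counts.items())
    return (w / n_words) * alpha_env if n_words else 0.0


def classify(nei, mu, sigma):
    """Risk class from the thresholds of step S6.4."""
    if nei < mu - 0.5 * sigma:
        return "normal"
    if nei < mu + 1.5 * sigma:
        return "mild"
    if nei < mu + 2.5 * sigma:
        return "moderate"
    return "major"


# Hypothetical target user: 100 words, 3 "sadness" and 1 "pain" feature words.
score = nei_adjusted({"sadness": 3, "pain": 1},
                     {"sadness": 0.4, "pain": 0.2},
                     n_words=100, alpha_env=0.9)
# Hypothetical training-sample statistics.
label = classify(score, mu=0.10, sigma=0.04)
```

The claim does not say whether σ is the population or the sample standard deviation, so `classify` takes μ and σ as inputs and leaves that choice to the caller.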

Description

Microblog user depression risk identification method based on large model and remote sensing environment sensing

Technical Field

The invention belongs to the technical field of psychological risk identification for social media users, and particularly relates to a microblog user depression risk identification method based on a large model and remote sensing environment sensing.

Background

Detecting users' depression risk from social media information such as microblogs makes it possible to find potential depression patients in time and to prevent dangerous behavior effectively. The current mainstream approach builds a text knowledge base from microblogs published by depression patients and then matches other users' microblogs against that knowledge base to predict depression risk. However, the prior art has clear defects. On the one hand, language habits differ across social groups, so a detection model built on a specific group transfers poorly and is less accurate for new users and for users with few microblog posts; the root cause is that existing methods do not systematically organize parts of speech when constructing the emotion dictionary, so feature words are sparse and easily disturbed by emotion words and extreme words, which affects judgment accuracy. On the other hand, environmental psychology research shows that contact with the natural environment is closely related to mental health, and that environments with a high greening rate can relieve stress, improve mood and reduce depression risk; yet current depression risk identification methods for social platforms ignore this key variable, which further limits recognition accuracy.

Disclosure of Invention

The invention aims to solve the technical problems of existing microblog user depression risk identification methods, namely that feature words are sparse and easily disturbed by emotion words and extreme words, that model transferability is weak, and that the influence of environmental factors is not considered, and provides a depression risk identification method that combines text and environmental features and offers high recognition accuracy and strong robustness.

To achieve the above purpose, the invention adopts the following scheme: a microblog user depression risk identification method based on a large model and remote sensing environment sensing, comprising the following steps:
Step S1, microblog text preprocessing and word segmentation: acquiring microblog text samples of a depression patient group U_dep and a common user group U_norm and preprocessing them;
Step S2, semantic expansion and construction of a classification system for depression feature words: encoding the preprocessed microblog text with a large language model based on the Transformer architecture, and using the model's attention mechanism for feature fusion and clustering to construct a depression feature word set V containing 13 semantic categories;
Step S3, feature word frequency and proportion statistics: for each of the two groups, calculating the frequency of the feature words of each of the 13 categories, together with the proportion and the average proportion of each category of feature words in the total vocabulary of the corresponding group;
Step S4, feature discrimination quantization and weight calculation: quantifying how well each category of feature words discriminates between the two groups based on the information value (IV) formula, and introducing a scaling factor α for smoothing to obtain the final weight i(v) of each category of feature words;
Step S5, environmental factor extraction and adjustment coefficient construction: extracting geographic position information from the user's microblog data, converting it into standard longitude and latitude coordinates, calculating the greening rate GR within an area of set radius R centered on the user's coordinates based on the normalized difference vegetation index (NDVI), and calculating an environment adjustment coefficient α_env;
Step S6, individual depression risk prediction and classification: repeating the word segmentation of step S1 on the target user's microblogs, completing semantic matching and classification of feature words with the large language model, calculating the weighted depression word frequency and the original negative emotion index, correcting the index with the environment adjustment coefficient to obtain the adjusted negative emotion index, and setting risk classification standards based on a normal distribution to classify the target user's depression risk.

Further preferably, step S1 specifically comprises: Step S1.1, sample