US-12620044-B2 - Systems and methods for tracking disaster footprints with social streaming data


Abstract

Various embodiments of systems and methods for tracking disaster footprints in social streaming media using nonnegative matrix factorization are disclosed herein. The system extracts a summarization output from historical data and compares the summarization output with incoming data to identify differing or similar topics within the data. The summarization output is projected to adjust a time-dependency of the summarization output to enable a more direct comparison. The system additionally uses the summarization output to encode topic data within historical data to reduce computational overhead.
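As an illustration only, the summarization step described above can be sketched with a standard NMF routine. The sketch assumes a small word-by-document count matrix and uses Lee-Seung multiplicative updates, which the text does not prescribe; the function names `nmf` and `top_words`, the toy vocabulary, and the matrix values are all hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative (words x docs) matrix X ~= W @ H.

    W (words x k) plays the role of a summarization output: each column
    is a word distribution for one topic. H (k x docs) holds the topic
    weights of each document. Multiplicative updates are an assumed,
    standard choice for the Frobenius loss, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = X.shape
    W = rng.random((n_words, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def top_words(W, vocab, n=3):
    """Return the n highest-weighted words of every topic column in W."""
    return [[vocab[i] for i in np.argsort(-W[:, t])[:n]]
            for t in range(W.shape[1])]

# Hypothetical toy data: 6 words x 4 posts, two loosely separated themes.
vocab = ["rescue", "shelter", "flood", "rumor", "spam", "fake"]
X = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 0, 3, 2],
              [0, 0, 1, 1]], dtype=float)

W, H = nmf(X, k=2)
topics = top_words(W, vocab)
print(topics)
```

In this reading, `W` from a historical window would be kept as the base matrix and compared against the factors of later windows, so the raw historical posts need not be revisited.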

Inventors

Lu Cheng
Jundong Li
Kasim Candan
Huan Liu

Assignees

Lu Cheng
Jundong Li
Kasim Candan
Huan Liu

Dates

Publication Date: 20260505
Application Date: 20211207

Claims (20)

1. A system, comprising: a processor in communication with a memory, the memory including instructions, which, when executed, cause the processor to: access historical social streaming data associated with first timesteps; conduct matrix factorization of the social streaming data to derive a base matrix defining a summarization output including a latent factor embedded in a vector space within the first timesteps, the latent factor encoding discovered topics from the historical social streaming data; access incoming social streaming data associated with second timesteps; conduct matrix factorization of the incoming social streaming data to derive an incoming matrix from the incoming social streaming data; apply a linear transformation on the base matrix to obtain a transformation matrix defining a transformed summarization output, such that in a new transformed feature space of the transformed summarization output common and distinct topics of the historical social streaming data can be found along with common and distinct topics of the incoming social streaming data, wherein the transformation matrix accommodates dynamic adjustment of dependency between the historical social streaming data and the incoming social streaming data, accommodates minimization of distances between common topic representations in the historical social streaming data and the incoming social streaming data, and accommodates maximization of distances and minimization of similarity between distinct topic representations in historical social streaming data and the incoming social streaming data; project the summarization output associated with the historical social streaming data into the transformed feature space to adaptively align topic relationships between historical and incoming social streaming data using a joint non-negative matrix factorization process, wherein matrices corresponding to the historical and incoming social streaming data are concatenated to form a unified input for the joint non-negative matrix factorization; and output common and different topics between the first timesteps and the second timesteps associated with the historical social streaming data and the incoming social streaming data leveraging the transformed feature space that accommodates comparison between topics that are more similar and distinctiveness between topics that are more likely to be different between the incoming matrix associated with the incoming social streaming data and the transformation matrix associated with the historical social streaming data.
2. The system of claim 1, wherein the memory further includes instructions, which, when executed, further cause the processor to: apply a nonnegative matrix factorization technique to the historical social streaming data, the first social streaming data defining a historic data matrix that encodes a word distribution in each topic of the plurality of topics indicated within the historic data matrix.
3. The system of claim 1, wherein the memory further includes instructions, which, when executed, further cause the processor to: iteratively apply a joint nonnegative matrix factorization technique to an incoming data matrix indicative of the set of incoming social streaming data; and identify a plurality of common topics of the plurality of topics that are shared between the incoming data matrix and a historic data matrix indicative of the set of historical social streaming data.
4. The system of claim 3, wherein the memory further includes instructions, which, when executed, further cause the processor to: jointly update a summarization output and a coefficient matrix for a current time step, the summarization output encoding a word distribution in each topic found within the incoming data matrix, and the coefficient matrix capturing a correlation between a factorized incoming topic matrix indicative of a plurality of topics within the incoming data matrix using a summarization output from a previous timestep.
5. The system of claim 4, wherein the memory further includes instructions, which, when executed, further cause the processor to: project the summarization output from the previous timestep into a new feature space to adaptively adjust a dynamic correlation between the summarization output from the previous timestep and the set of incoming social streaming data and yield a transformed summarization output.
6. The system of claim 5, wherein the memory further includes instructions, which, when executed, further cause the processor to: minimize a norm between the transformed summarization output and a matrix product of a transformation matrix with the factorized incoming topic matrix.
7. The system of claim 4, wherein the memory further includes instructions, which, when executed, further cause the processor to: minimize a reconstruction error on the set of incoming social streaming data by minimizing a difference between the incoming data matrix and a reconstructed common topic matrix and a reconstructed distinct topic matrix.
8. The system of claim 4, wherein the memory further includes instructions, which, when executed, further cause the processor to: minimize respective distances between a plurality of common topics in the transformed summarization output and the factorized incoming topic matrix.
9. The system of claim 4, wherein the memory further includes instructions, which, when executed, further cause the processor to: minimize respective similarities between a plurality of distinct topics in the transformed summarization output and the factorized incoming topic matrix.
10. A system, comprising: a processor in communication with a memory, the memory including instructions, which, when executed, cause the processor to: obtain a base matrix defining a summarization output indicative of a set of historical social streaming data, the summarization output including a plurality of topics embedded within the set of historical social streaming data; access incoming social streaming data; apply a linear transformation on the base matrix to obtain a transformation matrix defining a transformed summarization output, such that in a new transformed feature space of the transformed summarization output common and distinct topics of the historical social streaming data can be found along with common and distinct topics of incoming social streaming data, wherein the transformation matrix accommodates dynamic adjustment of dependency between the historical social streaming data and the incoming social streaming data, accommodates minimization of distances between common topic representations in the historical social streaming data and the incoming social streaming data, and accommodates maximization of distances and minimization of similarity between distinct topic representations in historical social streaming data and the incoming social streaming data; generate a transformation matrix defining a transformed summarization output, thereby accommodating, via a new transformed feature space, to adaptively adjust a dynamic correlation between the summarization output from a previous timestep and a set of incoming social streaming data; project the summarization output associated with the historical social streaming data into the transformed feature space to adaptively align topic relationships between historical and incoming social streaming data using a joint non-negative matrix factorization process; and update, at a timestep of a plurality of timesteps, a listing of common topics and a listing of distinct topics within incoming social streaming data using the transformed summarization output.
11. The system of claim 10, wherein the memory further includes instructions, which, when executed, further cause the processor to: iteratively apply, at a timestep of the plurality of timesteps, a joint nonnegative matrix factorization technique to an incoming data matrix indicative of the set of incoming social streaming data; and identify a plurality of common topics of the plurality of topics that are shared between the incoming data matrix and a historic data matrix indicative of the set of historical social streaming data.
12. A method, comprising: obtaining, by a first computer-implemented nonnegative matrix factorization module, a base matrix defining a summarization output indicative of a set of historical social streaming data, the summarization output including a plurality of topics embedded within the set of historical social streaming data; access incoming social streaming data; applying a linear transformation on the base matrix to obtain a transformation matrix defining a transformed summarization output, such that in a new transformed feature space of the transformed summarization output common and distinct topics of the historical social streaming data can be found along with common and distinct topics of the incoming social streaming data, wherein the transformation matrix accommodates dynamic adjustment of dependency between the historical social streaming data and the incoming social streaming data, accommodates minimization of distances between common topic representations in the historical social streaming data and the incoming social streaming data, and accommodates maximization of distances and minimization of similarity between distinct topic representations in historical social streaming data and the incoming social streaming data; generating a transformation matrix defining a transformed summarization output, thereby accommodating, via a new transformed feature space dynamic adjustment of dependency between the historical social streaming data and incoming social streaming data; project the summarization output associated with the historical social streaming data into the transformed feature space to adaptively align topic relationships between historical and incoming social streaming data using a joint non-negative matrix factorization process; and updating, at a timestep of a plurality of timesteps and by a second computer-implemented nonnegative matrix factorization module, a listing of common topics and a listing of distinct topics within the incoming social streaming data using the transformed summarization output.
13. The method of claim 12, further comprising: applying a nonnegative matrix factorization technique to a historic data matrix indicative of the set of historical social streaming data that encodes a word distribution in each topic of the plurality of topics indicated within the historic data matrix.
14. The method of claim 12, further comprising: iteratively applying, at a timestep of the plurality of timesteps, a joint nonnegative matrix factorization technique to an incoming data matrix indicative of the set of incoming social streaming data; and identifying a plurality of common topics of the plurality of topics that are shared between the incoming data matrix and a historic data matrix indicative of the set of historical social streaming data.
15. The method of claim 14, further comprising: jointly updating a summarization output and a coefficient matrix for a current time step, the summarization output encoding a word distribution in each topic found within the incoming data matrix, and the coefficient matrix capturing a correlation between a factorized incoming topic matrix indicative of a plurality of topics within the incoming data matrix using a summarization output from a previous timestep.
16. The method of claim 15, further comprising: projecting the summarization output from the previous timestep into a new feature space to adaptively adjust a dynamic correlation between the summarization output from the previous timestep and the set of incoming social streaming data and yield a transformed summarization output.
17. The method of claim 16, further comprising: minimizing a norm between the transformed summarization output and a matrix product of a transformation matrix with the factorized incoming topic matrix.
18. The method of claim 15, further comprising: minimizing a reconstruction error on the set of incoming social streaming data by minimizing a difference between the incoming data matrix and a reconstructed common topic matrix and a reconstructed distinct topic matrix.
19. The method of claim 15, further comprising: minimizing respective distances between a plurality of common topics in the transformed summarization output and the factorized incoming topic matrix.
20. The method of claim 15, further comprising: minimize respective similarities between a plurality of distinct topics in the transformed summarization output and the factorized incoming topic matrix.
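For orientation, the three quantities minimized in the dependent claims (reconstruction error, distance between common topics, similarity between distinct topics) can be combined into one scalar objective. The sketch below is an assumed reading, not the claimed formulation: the squared Frobenius norms, the weights `alpha` and `beta`, the column split of the transformed summarization output into common and distinct parts, and every variable name are hypothetical.

```python
import numpy as np

def joint_objective(X, W_prev, M, W_c, W_d, H_c, H_d, alpha=1.0, beta=1.0):
    """Evaluate an assumed joint NMF objective for one timestep.

    X       incoming data matrix (words x docs)
    W_prev  summarization output from the previous timestep (words x k)
    M       transformation matrix (k x k) applied to W_prev
    W_c/H_c common-topic factors, W_d/H_d distinct-topic factors
    """
    k_c = W_c.shape[1]
    W_tilde = W_prev @ M  # transformed summarization output

    # Reconstruction error against the common and distinct reconstructions.
    recon = np.linalg.norm(X - W_c @ H_c - W_d @ H_d, "fro") ** 2
    # Distance between common topics in the two time windows.
    common = np.linalg.norm(W_tilde[:, :k_c] - W_c, "fro") ** 2
    # Similarity (inner products) between distinct topics.
    distinct = np.linalg.norm(W_tilde[:, k_c:].T @ W_d, "fro") ** 2

    total = recon + alpha * common + beta * distinct
    return total, (recon, common, distinct)

# Hypothetical toy case: one common topic, one distinct topic, 3 words.
W_c = np.array([[1.0], [0.0], [0.0]])
W_d = np.array([[0.0], [1.0], [0.0]])
H_c = np.array([[1.0, 2.0]])
H_d = np.array([[3.0, 4.0]])
X = W_c @ H_c + W_d @ H_d
M = np.eye(2)

# Old distinct topic is orthogonal to the new one: every term vanishes.
W_prev = np.hstack([W_c, np.array([[0.0], [0.0], [1.0]])])
total, parts = joint_objective(X, W_prev, M, W_c, W_d, H_c, H_d)

# Old distinct topic equals the new one: only the similarity term is paid.
W_prev_overlap = np.hstack([W_c, W_d])
total_overlap, parts_overlap = joint_objective(
    X, W_prev_overlap, M, W_c, W_d, H_c, H_d, beta=0.5)
print(total, total_overlap)
```

Under these assumptions, a larger `alpha` pulls the common topics toward the historical ones, while a larger `beta` penalizes new "distinct" topics that still overlap with historical distinct topics.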

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a non-provisional application that claims benefit to U.S. Provisional Patent Application Ser. No. 63/122,287 filed 7 Dec. 2020, which is herein incorporated by reference in its entirety.

GOVERNMENT SUPPORT

This invention was made with government support under grants 1610282 and 1909555 awarded by the National Science Foundation. The government has certain rights in the invention.

FIELD

The present disclosure generally relates to tracking disaster footprints; and in particular, to systems and methods for tracking disaster footprints using social streaming media.

BACKGROUND

Social media reveals dynamic changes in discussions, with topics evolving over time. Taking the Asia tsunami disaster as an example, the major topics of the reports evolved from “financial aids” to “debt” and “reconstruct” over different stages. Online topic tracking can benefit disaster responders in the following ways: (1) For emergency managers and people affected by natural calamities, it is often of particular interest to identify topics that prevail over time, i.e., common topics such as “disaster rescue”, as well as to be alerted to any newly emerging themes of disaster-related discussions that are fast gathering in social media streams, i.e., distinct topics such as “the latest tsunami destruction”. (2) For global participants, a quick update of the disaster status quo, i.e., the commonness and distinctiveness between previous and current topics, is necessary for them to provide immediate and effective assistance.

A major obstacle to disaster-related topic tracking, however, is that social media generates massive amounts of data each day and is notorious for a sea of unwanted and noisy content such as spam and daily chatter. For example, during Hurricane Harvey, Twitter reported 21.2 million hurricane-related tweets within the first six days, and a large portion was generated in a short period of time to spread rumors.
Consequently, a new way of effectively discovering online topics from social media data during disaster response is urgently needed. It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified illustration showing the studied problem;
FIG. 2 is a simplified diagram showing a system for tracking topics from incoming social streaming data;
FIG. 3 is a process flow showing a process for tracking topics from incoming social streaming data according to the system of FIG. 2;
FIGS. 4A-4C are graphical representations showing performance comparisons of different methods using a Harvey dataset;
FIGS. 5A-5C are graphical representations showing performance comparisons of different methods using a Florence dataset;
FIGS. 6A and 6B show comparisons of computing time for the Harvey and Florence datasets, respectively;
FIGS. 7A-7C show graphical representations of parameter studies of α, β, and Kc; and
FIG. 8 is an illustration showing an exemplary computer system for executing the functionalities of the framework.

Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.

DETAILED DESCRIPTION

Social media has become an indispensable tool in the face of natural disasters due to its broad appeal and ability to quickly disseminate information. For instance, Twitter is an important source for disaster responders to search for (1) topics that have been identified as being of particular interest over time, i.e., common topics such as “disaster rescue”; and (2) new emerging themes of disaster-related discussions that are fast gathering in social media streams, i.e., distinct topics such as “the latest tsunami destruction”.
To understand the status quo and allocate limited resources to the most urgent areas, emergency managers need to quickly sift through relevant topics generated over time and investigate their commonness and distinctiveness. A major obstacle to the effective usage of social media, however, is its massive amount of noisy and undesired data. Hence, a naive method, such as using set intersection/difference to find common/distinct topics, is often not practical.

To address this challenge, the present disclosure discusses a new topic tracking problem that seeks to effectively identify the common and distinct topics in social streaming data. The problem is important because it presents a promising new way to efficiently search for accurate information during emergency response. This is achieved by an online Nonnegative Matrix Factorization (NMF) technique that conducts a faster update of latent factors, and a joint NMF technique that seeks a balance between the reconstruction error of topic identification and the losses induced by discovering common and distinct topics. Extensive experimental results on real-world datasets collected during Hurricane Harv