US-12619829-B2 - Automatic generation of scientific article metadata

Abstract

Examples of the disclosure are directed to systems and methods of using natural language processing techniques to automatically assign metadata to articles as they are published. The automatically-assigned metadata can then feed into the algorithms that calculate updated causation scores for agent-outcome hypotheses, powering live visualizations of the data that update automatically as new scientific articles become available.

Inventors

Andrea Melissa Boudreau
Lauren Caston
Naresh Chebolu
Adam Grossman
Liyang Hao
David Loughran
Michael Ragland
Robert Reville
Chun-Yuen Teng

Assignees

Praedicat, Inc.

Dates

Publication Date: 2026-05-05
Application Date: 2024-04-21

Claims (11)

1. A computer-implemented method of updating a general causation visualization for an agent and an outcome displayed on a remote device, the method comprising:
    on a regular time interval, polling a remote source of scientific literature articles for new scientific literature articles to track analysis of the agent and the outcome over time;
    in response to the polling of the remote source of scientific literature articles, downloading a first set of new scientific literature articles from the remote source of scientific literature articles;
    after downloading the first set of new scientific literature articles, generating a set of update metadata associated with the agent and the outcome for each article in the first set of new scientific literature articles, comprising:
        generating directionality data for each respective article indicating whether the respective article supports or rejects a hypothesis that the agent causes the outcome, using natural language processing on text of the respective article,
        generating evidence data for each article indicating how well methodology of the respective article can demonstrate a causal relationship between the agent and the outcome using natural language processing on the text of the respective article, and
        generating a proximity categorization for each article indicating directness of evidence in the respective article using natural language processing on the text of the respective article;
    aggregating the set of update metadata with the existing metadata to obtain aggregate metadata, wherein a first causation score has been previously computed based on the existing metadata and not based on the update metadata;
    computing a second causation score based on the aggregate metadata; and
    while the remote device is displaying a representation of the first causation score, pushing the second causation score to the remote device, wherein the remote device updates the general causation visualization to display a representation of the second causation score instead of the representation of the first causation score.
2. The method of claim 1, wherein the general causation visualization plots causation scores over time, and updating the general causation visualization to display the representation of the second causation score instead of the representation of the first causation score includes: replacing the representation of the first causation score at a location associated with a first time period with the representation of the second causation score at the location associated with the first time period.
3. The method of claim 1, wherein the general causation visualization plots causation scores over time, the representation of the first causation score is displayed at a location associated with a first time period, and updating the general causation visualization to display the representation of the second causation score instead of the representation of the first causation score includes: displaying the representation of the second causation score at a second location associated with a second time period, different from the first time period.
4. The method of claim 1, wherein the general causation visualization includes a ranked list of causation scores, and updating the general causation visualization further includes reordering the ranked list based on the second causation score instead of the first causation score.
5. The method of claim 1, wherein updating the general causation visualization further includes changing an element of the general causation visualization from a first color associated with the first causation score to a second color associated with the second causation score.
6. The method of claim 1, wherein the general causation visualization includes a plurality of causation score representations for causation scores of different agents.
7. The method of claim 1, wherein the general causation visualization includes a plurality of causation score representations for different outcomes and all associated with a single agent.
8. The method of claim 1, wherein the general causation visualization includes a plurality of causation score representations for different causation scores of a single agent over time.
9. The method of claim 1, the method further comprising determining whether a first article in the first set of new scientific literature articles is relevant to a causation hypothesis that the agent causes the outcome by:
    determining whether the first article is relevant to the agent based on a plurality of agent terms associated with the agent;
    determining whether the first article is relevant to the outcome based on a plurality of outcome terms associated with the outcome; and
    determining whether the first article is relevant to the causation hypothesis based on a plurality of causation terms associated with causation.
10. A computer-implemented method of updating a set of causation scores, each respective causation score corresponding to one of a plurality of agent-outcome pairs, the method comprising:
    on a regular time interval, polling a source of scientific literature articles for new scientific literature articles to track analysis of the agent and the outcome over time;
    in response to the polling of the source of scientific literature articles, downloading a plurality of new scientific literature articles from the source of scientific literature articles;
    for each respective article in the plurality of new scientific literature articles, classifying the respective article as relevant or not relevant to each respective agent-outcome pair in the plurality of agent-outcome pairs based on natural language processing on text of the respective article;
    aggregating, into a first set of articles, a subset of the plurality of new scientific literature articles that are classified as relevant to a first agent-outcome pair including a first agent and a first outcome;
    generating metadata for each article in the first set of articles by:
        generating directionality data for each article, indicating whether the respective article supports or rejects a hypothesis that the first agent causes the first outcome, generated using natural language processing on the text of the respective article,
        generating evidence data for each article, indicating how well methodology of the respective article can demonstrate a causal relationship between the first agent and the first outcome, generated using natural language processing on the text of the respective article, and
        generating a proximity categorization for each article, indicating directness of evidence in the respective article, generated using natural language processing on the text of the respective article;
    computing a set of causation scores based on the metadata for each article in the first set of articles, comprising:
        determining a respective magnetism score for each respective article in the first set of articles based on the directionality data and the evidence data associated with the respective article,
        aggregating the respective magnetism scores for the articles in the first set of articles to obtain a magnetism score for the first set of articles,
        weighting the magnetism score based on the proximity categorization for each article, and
        computing the causation score based on the weighted magnetism score; and
    updating the set of causation scores, including replacing a previous causation score associated with the first agent-outcome pair with the causation score computed based on the weighted magnetism score.
11. A computer-implemented method of updating a general causation visualization displayed on a remote device, the method comprising:
    displaying, in the general causation visualization on the remote device, a plurality of causation score representations associated with an agent and an outcome of the agent, each associated with a respective causation score, comprising:
        displaying a first causation score representation associated with a first causation score computed based on literature metadata relevant to the first causation score;
    while the plurality of causation score representations are displayed on the remote device, receiving user input at the remote device selecting first literature criteria to update the plurality of causation score representations associated with the agent and the outcome of the agent, wherein applying the first literature criteria to the literature metadata includes a first subset of the literature metadata and excludes a second subset of the literature metadata;
    in response to the user input at the remote device selecting the first literature criteria, computing a plurality of updated causation scores including computing an updated first causation score based on the first subset of the literature metadata and not the second subset of the literature metadata, comprising:
        aggregating respective magnetism scores for each article in the first subset to obtain a magnetism score for the first subset,
        weighting the magnetism score based on proximity categorizations of each article in the first subset, and
        computing the first causation score based on the weighted magnetism score; and
    updating the general causation visualization on the remote device to display a plurality of updated causation score representations associated with the, each associated with a respective updated causation score in the plurality of updated causation scores.
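
By way of illustration only, the score computation recited in claims 10 and 11 (per-article magnetism, aggregation, proximity weighting, causation score) might be sketched as below. The claims do not give formulas, so everything numeric here is an assumption made for the example: the direction values, the proximity category names and weights, the use of a plain sum to aggregate, the mean proximity weight, and the tanh squashing. The names `ArticleMetadata`, `magnetism`, and `causation_score` are hypothetical.

```python
from dataclasses import dataclass
from math import tanh

# ASSUMPTION: illustrative values only; the claims do not specify any of them.
DIRECTION = {"supports": 1.0, "rejects": -1.0, "neutral": 0.0}
PROXIMITY_WEIGHT = {"direct": 1.0, "intermediate": 0.6, "indirect": 0.3}


@dataclass(frozen=True)
class ArticleMetadata:
    directionality: str  # "supports", "rejects", or "neutral"
    evidence: float      # 0..1, how well the methodology can show causation
    proximity: str       # a key of PROXIMITY_WEIGHT


def magnetism(article: ArticleMetadata) -> float:
    """Per-article magnetism from directionality and evidence (assumed product)."""
    return DIRECTION[article.directionality] * article.evidence


def causation_score(articles: list[ArticleMetadata]) -> float:
    """Aggregate magnetism, weight it by proximity, and squash into (-1, 1)."""
    if not articles:
        return 0.0
    total = sum(magnetism(a) for a in articles)  # aggregate magnetism
    weight = sum(PROXIMITY_WEIGHT[a.proximity] for a in articles) / len(articles)
    return tanh(weight * total)                  # assumed squashing function


# Replace the previous score for one agent-outcome pair with a recomputed one.
scores = {("BPA", "reproductive injury"): 0.10}  # hypothetical previous score
first_set = [
    ArticleMetadata("supports", 0.8, "direct"),
    ArticleMetadata("supports", 0.5, "indirect"),
    ArticleMetadata("rejects", 0.4, "intermediate"),
]
scores[("BPA", "reproductive injury")] = causation_score(first_set)
```

Under this sketch, the literature criteria of claim 11 would amount to filtering the article list before calling `causation_score`, so that the excluded subset contributes to neither the aggregate magnetism nor the proximity weight.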

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/248,498, entitled “AUTOMATIC GENERATION OF SCIENTIFIC ARTICLE METADATA” filed Jan. 27, 2021, which is a continuation of U.S. Pat. No. 10,909,323, entitled “AUTOMATIC GENERATION OF SCIENTIFIC ARTICLE METADATA” filed Mar. 9, 2020, which is a continuation of U.S. Pat. No. 10,585,990, entitled “LIVE UPDATING VISUALIZATION OF CAUSATION SCORES BASED ON SCIENTIFIC ARTICLE METADATA” filed Mar. 15, 2019, all of which are hereby incorporated by reference in their entirety.

FIELD OF THE DISCLOSURE

This relates generally to methods of automatic generation of scientific article metadata for determining causation of an outcome by an agent.

SUMMARY OF THE INVENTION

U.S. Pat. No. 9,430,739, granted on Aug. 30, 2016, incorporated by reference herein in its entirety, is directed to a method of quantifying and visualizing general acceptance in scientific literature of a hypothesis that a particular agent causes a particular outcome. For example, based on metadata of scientific articles published regarding the hypothesis that bisphenol A (BPA) causes reproductive injury in humans, a causation score can be calculated that represents the acceptance of such a hypothesis in the literature as a whole. Such a causation score can distill a literature into a single, actionable value, enabling the visualization of general acceptance over time and comparison of diverse risks on a common scale.

However, peer-reviewed journals publish hundreds of thousands of scientific articles every year, and human analysts may not be able to keep up with the pace to code the metadata on each article that feeds into the computation of causation scores for a myriad of agent-outcome hypotheses. Manually analyzing articles to feed into such an algorithm may require limiting both the pace of updating causation scores and the number of agent-outcome hypotheses that are monitored.

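As a hedged illustration of how a causation score could track general acceptance over time, the sketch below recomputes a score cumulatively as each publication year's articles are added to the literature. It does not reproduce the scoring method of U.S. Pat. No. 9,430,739; `net_support` is a placeholder stand-in (mean of +1/-1 direction values), and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable


def cumulative_scores(
    articles: Iterable[tuple[int, float]],
    score: Callable[[list[float]], float],
) -> dict[int, float]:
    """Map each publication year to the score of all articles up to that year."""
    by_year: dict[int, list[float]] = defaultdict(list)
    for year, direction in articles:
        by_year[year].append(direction)

    seen: list[float] = []
    series: dict[int, float] = {}
    for year in sorted(by_year):
        seen.extend(by_year[year])   # literature grows; nothing is discarded
        series[year] = score(seen)   # recompute over the whole literature so far
    return series


def net_support(directions: list[float]) -> float:
    """PLACEHOLDER score: mean direction, +1 supports and -1 rejects."""
    return sum(directions) / len(directions)


# Hypothetical sample: (publication year, direction) pairs for one hypothesis.
sample = [(2010, 1.0), (2011, -1.0), (2011, 1.0), (2012, 1.0)]
series = cumulative_scores(sample, net_support)
```

Passing the scoring function as a parameter keeps the time-series logic independent of whichever score is actually used, so each plotted point is simply the score of the literature as it stood in that year.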
Examples of the disclosure are directed to systems and methods of using natural language processing techniques to automatically assign metadata to articles as they are published. The automatically-assigned metadata can then feed into the algorithms that calculate updated causation scores, powering live visualizations of the data that update automatically as new scientific articles become available. Because human intervention may not be required, the pace of updating causation scores and visualizations may be limited only by the pace of the literature itself, and any number of agent-outcome hypotheses may be monitored. For example, a company can monitor all the chemicals it produces or uses for new advances in scientific literature that suggest increased risk of bodily injury as a result of exposure to those chemicals. Further, this dynamic calculation of causation scores makes it possible to filter and slice the literature in different ways to, for example, exclude low-impact journals or give a lower weight to industry-funded studies in the causation computations.

Although examples of the disclosure are described in terms of harmful outcomes such as cancer, examples are not so limited and can be instead directed to beneficial outcomes such as vaccination against a disease or a mixture of beneficial, harmful, and/or neutral outcomes. In addition, agents/outcomes may be in the fields of health, bodily injury, energy (e.g., wastewater injection), environmental, and/or property, among other possibilities.

BRIEF DESCRIPTION OF DRAWINGS

For a better understanding of the various described embodiments, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

FIGS. 1A-1J illustrate an exemplary general causation visualization interface according to examples of the disclosure.
FIGS. 2A-2D illustrate an exemplary flow of data between devices according to examples of the disclosure.
FIGS. 3A and 3B illustrate an exemplary method of determining a causation score according to examples of the disclosure.
FIG. 4 illustrates an exemplary data flow according to examples of the disclosure.
FIGS. 5A-5D illustrate exemplary classifier structures for generating relevance data, directionality data, proximity data, and evidence data according to examples of the disclosure.
FIG. 6 is a flow diagram illustrating a method of updating a general causation visualization for an agent and an outcome in accordance with some embodiments.
FIGS. 7A-7B are flow diagrams illustrating a method of updating a set of causation scores, each respective causation score corresponding to one of a plurality of agent-outcome pairs in accordance with some embodiments.
FIG. 8 is a flow diagram illustrating a method of updating a general causation visualization in accordance with some embodiments.

DETAILED DESCRIPTION OF THE INVENTION

In the following description of embodiments, reference is mad