EP-4739803-A1 - EFFICIENT DETECTION OF DECENTRALIZED BIOMARKERS OF GROUPS OF DNA SEQUENCES IN MICROBIOME DNA

Abstract

Biomarkers are detected in microbiome DNA to predict a biological condition. A microbiome network may be generated comprising nodes representing microbiome DNA sequences and edges representing a co-occurrence of each pair of microbiome DNA sequences in a same DNA sample or sub-length. Nodes may be bundled into distinct groups based on each node's degree, which quantifies the number of its edges and thus the number of unique microbiome DNA sequences that co-occur with the sequence represented by the node in the same DNA sample or sub-length. Groups of bundled microbiome DNA sequences may be validated as having an internal connectivity that satisfies an anomaly condition, indicating that the microbiome DNA sequences of that group co-occur with a probability unlikely to arise at random, e.g., one that does not follow a power law distribution. A machine learning model may be trained with the validated groups to predict a biological condition correlated therewith.

Inventors

  • ALTSHULER, YANIV

Assignees

  • Alphabiome AI Ltd

Dates

Publication Date
20260513
Application Date
20240825

Claims (20)

  1. A method for detecting biomarkers in microbiome DNA of one or more livestock hosts to predict a biological condition, the method comprising: storing a plurality of microbiome DNA sequences, sequenced from microbiome DNA of one or more livestock hosts; generating a microbiome network comprising a plurality of nodes representing the respective plurality of microbiome DNA sequences and a plurality of edges representing a co-occurrence of each pair of microbiome DNA sequences in a same sample or sub-length of the microbiome DNA; bundling, into distinct groups, microbiome DNA sequences represented by nodes based on the node's degree, the degree of each node quantifying a number of edges that contain the node indicating a number of unique microbiome DNA sequences that co-occur with the microbiome DNA sequence represented by the node in the same sample or sub-length of the microbiome DNA; validating groups of microbiome DNA sequences that each have an internal connectivity between nodes of the same group that satisfies an anomaly condition indicating the microbiome DNA sequences of that group co-occur with a probability that is unlikely randomly statistical; generating a training dataset correlating the validated groups of microbiome DNA sequences from one or more livestock hosts with a biological condition measured in one or more livestock hosts; and training a machine learning model with the training dataset to input microbiome DNA sequences from one or more livestock hosts and predict the biological condition for one or more livestock hosts.
  2. The method of claim 1, wherein the internal connectivity is determined based on a ratio between a number of internal edges connecting two nodes internal to the same group and a total number of overall edges connecting nodes within the group to any other node internal or external to the group.
  3. The method of claim 2, wherein the internal connectivity condition for a group of nodes having a same degree d is that the group's internal connectivity is greater than or equal to: is a threshold in which the probability of observing more than the threshold is less than a value ∈ and | | is the number of nodes in the microbiome network of degree d.
  4. The method of claim 1, wherein the internal connectivity is determined based on a number of nodes in the same group, number of edges in the same group, a number of pre-defined partially or fully connected clusters in the same group or deviation thereof.
  5. The method of claim 1, wherein the internal connectivity condition is satisfied if the internal connectivity for a group of nodes deviates from an expected internal connectivity in a network that follows a power law distribution.
  6. The method of claim 1 comprising filtering the plurality of microbiome DNA sequences to include only microbiome DNA sequences that occur greater than a predefined integer number of times in the same sample or sub-length of the microbiome DNA.
  7. The method of claim 1 comprising filtering the plurality of microbiome DNA sequences to include only nodes within a predefined degree range and edges connecting those nodes.
  8. The method of claim 1 comprising super-positioning a plurality of networks each representing a distinct microbiome DNA sample to generate a composite network representing a plurality of microbiome DNA samples.
  9. The method of claim 1 comprising retraining the model based on new microbiome DNA.
  10. The method of claim 1, wherein the biological condition of the one or more livestock hosts is selected from the group consisting of: feed, additive or medicinal efficacy, methane emissions, dairy or meat quality, composition or yield, gastrointestinal or overall health, disease susceptibility, disease tolerance, likelihood of disease recovery, life expectancy, and/or fatality risk.
  11. A system for detecting biomarkers in microbiome DNA of one or more livestock hosts to predict a biological condition, the system comprising: one or more memories configured to store a plurality of microbiome DNA sequences, sequenced from microbiome DNA of one or more livestock hosts; and one or more processors configured to: generate a microbiome network comprising a plurality of nodes representing the respective plurality of microbiome DNA sequences and a plurality of edges representing a co-occurrence of each pair of microbiome DNA sequences in a same sample or sub-length of the microbiome DNA, bundle, into distinct groups, microbiome DNA sequences represented by nodes based on the node's degree, the degree of each node quantifying a number of edges that contain the node indicating a number of unique microbiome DNA sequences that co-occur with the microbiome DNA sequence represented by the node in the same sample or sub-length of the microbiome DNA, validate groups of microbiome DNA sequences that each have an internal connectivity between nodes of the same group that satisfies an anomaly condition indicating the microbiome DNA sequences of that group co-occur with a probability that is unlikely randomly statistical, generate a training dataset correlating the validated groups of microbiome DNA sequences from one or more livestock hosts with a biological condition measured in one or more livestock hosts, and train a machine learning model with the training dataset to input microbiome DNA sequences from one or more livestock hosts and predict the biological condition for one or more livestock hosts.
  12. The system of claim 11, wherein the one or more processors are configured to determine the internal connectivity based on a ratio between a number of internal edges connecting two nodes internal to the same group and a total number of overall edges connecting nodes within the group to any other node internal or external to the group.
  13. The system of claim 12, wherein the one or more processors are configured to determine that a group of nodes having a same degree d satisfy the internal connectivity condition if the group's internal connectivity is greater than or equal to: is a threshold in which the probability of observing more than the threshold is less than a value is the number of nodes in the microbiome network of degree d.
  14. The system of claim 11, wherein the one or more processors are configured to determine the internal connectivity based on a number of nodes in the same group, number of edges in the same group, a number of pre-defined partially or fully connected clusters in the same group or deviation thereof.
  15. The system of claim 11, wherein the one or more processors are configured to determine that the internal connectivity condition is satisfied if the internal connectivity for a group of nodes deviates from an expected internal connectivity in a network that follows a power law distribution.
  16. The system of claim 11, wherein the one or more processors are configured to filter the plurality of microbiome DNA sequences to include only microbiome DNA sequences that occur greater than a predefined integer number of times in the same sample or sub-length of the microbiome DNA.
  17. The system of claim 11, wherein the one or more processors are configured to filter the plurality of microbiome DNA sequences to include only nodes within a predefined degree range and edges connecting those nodes.
  18. The system of claim 11, wherein the one or more processors are configured to superposition a plurality of networks each representing a distinct microbiome DNA sample to generate a composite network representing a plurality of microbiome DNA samples.
  19. The system of claim 11, wherein the one or more processors are configured to retrain the model based on new microbiome DNA.
  20. The system of claim 11, wherein the biological condition of the one or more livestock hosts is selected from the group consisting of: feed, additive or medicinal efficacy, methane emissions, dairy or meat quality, composition or yield, gastrointestinal or overall health, disease susceptibility, disease tolerance, likelihood of disease recovery, life expectancy, and fatality risk.
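
As an informal illustration of the network steps recited in claims 1 and 2 (co-occurrence edges, bundling by node degree, and the internal-edge ratio), a minimal Python sketch follows. It is not the applicant's implementation. The function names, the representation of each sample as a list of sequence identifiers, and the toy data are assumptions made for illustration only, and the anomaly threshold of claims 3 and 13 is not reproduced because its formula does not appear in this record.

```python
from collections import defaultdict
from itertools import combinations


def build_network(samples):
    """Adjacency sets: an edge joins two sequences that occur in the same sample.

    `samples` is assumed to be an iterable of lists of sequence identifiers.
    """
    adj = defaultdict(set)
    for seqs in samples:
        for a, b in combinations(sorted(set(seqs)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bundle_by_degree(adj):
    """Group nodes by degree, i.e. by number of distinct co-occurring sequences."""
    groups = defaultdict(set)
    for node, neighbours in adj.items():
        groups[len(neighbours)].add(node)
    return groups


def internal_connectivity(adj, group):
    """Edges with both ends inside the group divided by all edges touching the group."""
    seen = set()
    internal = 0
    for u in group:
        for v in adj[u]:
            edge = frozenset((u, v))
            if edge in seen:
                continue  # count each undirected edge once
            seen.add(edge)
            if v in group:
                internal += 1
    return internal / len(seen) if seen else 0.0


# Hypothetical toy data: three samples over four sequence identifiers.
toy_samples = [["s1", "s2", "s3"], ["s1", "s2"], ["s3", "s4"]]
toy_adj = build_network(toy_samples)
toy_groups = bundle_by_degree(toy_adj)
for degree, members in sorted(toy_groups.items()):
    print(degree, sorted(members), internal_connectivity(toy_adj, members))
```

In this toy network the degree-2 group is {s1, s2}: one of the three edges touching it is internal, so its ratio is 1/3. A validation step in the sense of claim 1 would compare such a ratio against an anomaly threshold, which is left out here.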

Description

EFFICIENT DETECTION OF DECENTRALIZED BIOMARKERS OF GROUPS OF DNA SEQUENCES IN MICROBIOME DNA

FIELD OF THE INVENTION

[0001] Embodiments of the present invention relate generally to the field of microbiome DNA. In particular, some embodiments of the invention relate to analyzing microbiome DNA to predict biological conditions (e.g., methane production, additive efficacy, dairy yield) in livestock (e.g., cows).

BACKGROUND OF THE INVENTION

[0002] Ruminants, in an intricate symbiotic relationship with their resident microbiota, have the unique ability to break down complex polysaccharides such as cellulose and hemicellulose, which constitute the primary components of their plant-based diet. This process relies on the host animal providing a stable environment that allows continuous mixing, deconstruction, and fermentation of ingested plant material. This, in turn, results in the production of short-chain fatty acids, which serve as a digestible energy source for the host animal.

[0003] The assembly and development of the rumen microbiota is a multifactorial process, influenced by several host and environmental factors. These include the host's age, diet, genetic makeup, and herd origin, all of which play a pivotal role in defining the microbiota's compositional layout. Moreover, the stochastic colonization events of the rumen during early life stages can leave lasting imprints on the ruminant microbiome's structure.

[0004] While this symbiotic relationship allows ruminants to thrive on fibrous diets, it also has an environmental cost. The ruminant digestive process is a significant contributor to the emission of methane, a potent greenhouse gas, which accounts for about 14% of total greenhouse emissions and has a global warming potential 28 times higher than carbon dioxide (CO2). Notably, livestock are estimated to contribute nearly 30% of all anthropogenic methane emissions.

[0005] Efforts to mitigate the environmental impact of dairy farming have given rise to several strategies. One such approach involves the utilization of microbial biomarkers to identify cows with high methane emission rates, thereby enabling targeted management strategies aimed at reducing methane emissions and fostering environmental sustainability. Another strategy characterizes microbial gene abundances as proxies for methane emissions, focusing specifically on metabolic pathways expected to exhibit variation between low and high methane emitters.

[0006] Correlating high methane emission rates with biomarkers, particularly in livestock microbiome DNA, presents a unique challenge because the biological effects of the constituent microbiome DNA sequences are largely unknown.

[0007] Microbiome DNA samples are sequenced into a plurality (e.g., tens to millions) of "reads," each representing a continuous sequence of nucleotides (e.g., of fixed or variable length, such as 100-150). Each read is then sub-divided into a plurality of "k-mers," each representing a relatively shorter continuous fixed or variable length sequence (e.g., a fixed-length integer number k, such as 30; or variable length including k=30, 60, 72, etc.) of the read's nucleotides. Each DNA sequence position can be one of four nucleotides (A, T, C, or G), so the total number of possible k-mers of length k in the microbiome DNA is 4^k. Experiments indicate that k-mer lengths of 30-60 nucleotides associate with optimal phenotypic expression of biomarkers (e.g., shorter k-mer lengths often suffer higher false positives due to a higher likelihood of randomly appearing, and longer k-mer lengths often suffer higher false negatives as longer sequences obfuscate or dilute significant segments). With 4^30 or more possible 30-mer combinations per DNA sample, there are too many k-mer combinations to practically model correlations between k-mers and biological effect.
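
For illustration only, the sub-division of a read into k-mers and the 4^k count described in paragraph [0007] can be sketched in Python as follows. The function names and the toy read are assumptions, and the sketch uses a fixed k with a sliding window of step one, which is one common reading of "sub-divided" rather than something the text specifies.

```python
def kmers(read, k):
    """Yield every contiguous length-k substring of a read (sliding window, step 1)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_space_size(k):
    """Number of distinct k-mers over the four-letter alphabet A, T, C, G."""
    return 4 ** k


toy_read = "ATCGATCG"  # hypothetical 8-nucleotide read
print(list(kmers(toy_read, 3)))
print(kmer_space_size(30))  # size of the 30-mer space, equal to 2**60
```

A read shorter than k yields no k-mers, and 4^30 equals 2^60, which is slightly more than 10^18.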
Groups of multiple k-mers, the combination of which often correlates with biological expression, compound this problem, as there are exponentially more combinations of k-mer groups than individual k-mers (e.g., for k-mers of length k, there are 2^(2k) possible k-mers, approximately (2^(2k))^2 possible pairs, and 2^(2^(2k)) possible groups of k-mers, which for k-mers of length k=30 would be 2^60, or approximately 10^18, possible k-mers and 2^(10^18) possible k-mer groups). Such massive numbers of combinations of groups of k-mers make it realistically impossible to model their correlation to biological effect.

[0008] Accordingly, there is a longstanding need inherent in the art for, and a wealth of knowledge to be gained from, efficiently modeling the biological effect of groups of k-mer or other nucleotide sequences in microbiome DNA.

SUMMARY OF THE INVENTION

[0009] Embodiments of the invention overcome this longstanding need inherent in the art by efficiently modeling correlations between biological conditions and groups of k-mer or other nucleotide sequences in microbiome DNA.

[00010] A system, d