
US-20260127061-A1 - RELIABILITY ANALYSIS FRAMEWORK FOR NODE-LOCAL INTERMEDIARY STORAGE ARCHITECTURES


Abstract

A method, computing device, and a non-transitory computer-readable medium are provided. The computing device determines an initial node-local burst buffer content at a start of a time period T. A current node-local burst buffer content is received by the computing device during the time period. For each checkpoint/restart time interval, the computing device: estimates stochastic transition rates λ and μ; estimates input flow data rates of data entering the node-local burst buffer from a compute node; and estimates drain data rates of the data leaving the node-local burst buffer to a parallel file system. The computing device models an average statistical reliability function of the node-local burst buffer within the time period T with respect to not exceeding a predetermined threshold value. When the average statistical reliability function has a value that is less than a predetermined threshold, the computing device performs an action.
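The modeling step summarized above can be illustrated numerically. The following is only a minimal sketch and is not the closed-form model recited in the claims: it swaps in a Monte Carlo simulation of a two-state fluid model. The function names, the exponential dwell times, the choice to start in the draining state, and the averaging over equally spaced instants in the period are all assumptions made for illustration.

```python
import random


def content_at(t_end, x0, lam, mu, fill_rate, drain_rate, rng):
    """Sample the buffer content at time t_end along one simulated path.

    Assumed model (not taken from the source): the buffer starts in the
    draining state, leaves it at rate lam, and leaves the receiving state
    at rate mu, with exponentially distributed dwell times.
    """
    content, remaining, receiving = x0, t_end, False
    while remaining > 0:
        dwell = min(rng.expovariate(mu if receiving else lam), remaining)
        if receiving:
            # Net inflow from the compute node while receiving.
            content += fill_rate * dwell
        else:
            # Drain toward the parallel file system, never below empty.
            content = max(0.0, content - drain_rate * dwell)
        remaining -= dwell
        receiving = not receiving
    return content


def average_reliability(threshold, period, x0, lam, mu, fill_rate, drain_rate,
                        n_paths=2000, n_times=10, seed=0):
    """Average, over n_times instants, of the estimated P(content <= threshold)."""
    rng = random.Random(seed)
    total = 0.0
    for k in range(1, n_times + 1):
        t = period * k / n_times
        hits = sum(
            content_at(t, x0, lam, mu, fill_rate, drain_rate, rng) <= threshold
            for _ in range(n_paths)
        )
        total += hits / n_paths
    return total / n_times


def needs_action(reliability, predefined_value):
    """True when the averaged reliability falls below the predefined value."""
    return reliability < predefined_value
```

A caller would compare the returned estimate against its own predefined value, for example `needs_action(average_reliability(...), 0.9)`, and trigger whatever action the deployment defines.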

Inventors

  • Antwan CLARK
  • Nicole Fleming
  • Giovanni BERRIOS
  • Yu Shao
  • Jiawen Bai

Assignees

  • THE JOHNS HOPKINS UNIVERSITY

Dates

Publication Date
20260507
Application Date
20231012

Claims (20)

  1. A method for performing real-time reliability analysis of node-local burst buffer architectures, the method comprising: determining, by a computing device, an initial node-local burst buffer content at a start of a time period T; receiving, by the computing device, a current node-local burst buffer content during the time period; for each checkpoint/restart time interval, performing by the computing device: estimating stochastic transition rates λ=λ₁₂ and μ=λ₂₁, indicating when the node-local burst buffer is receiving and draining, respectively, estimating input flow data rates of data entering the node-local burst buffer from a compute node, and estimating drain data rates of the data leaving the node-local burst buffer to be stored to a parallel file system; modeling, by the computing device, an average statistical reliability function of the node-local burst buffer within the time period T with respect to not exceeding a predetermined threshold value; and performing an action, by the computing device, when the average statistical reliability function has a value that is less than a predefined value.
  2. The method of claim 1, wherein the estimating the stochastic transition rates comprises: estimating the λ by dividing a number of transitions to a node-local burst buffer receiving state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer receiving state; and estimating the μ by dividing a number of transitions to a node-local buffer draining state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer draining state.
  3. The method of claim 2, wherein when the λ and the μ exceed a predefined threshold, the method further comprises: performing expectation maximization to estimate final values of the λ and the μ.
  4. The method of claim 1, wherein: W₁(x, t) is equal to a probability that an amount of data in the node-local burst buffer is less than or equal to a predefined threshold given that the node-local burst buffer is in a node-local burst buffer draining state, W₂(x, t) is equal to a probability that an amount of data in the node-local burst buffer is less than or equal to the predefined threshold given that the node-local burst buffer is in a node-local burst buffer receiving state, and calculating a value of the statistical reliability function based on the sum of W₁(x, t) and W₂(x, t).
  5. 5 . The method of claim 4 , wherein: when the node-local burst buffer is initially empty at a start of the each checkpoint/restart time interval, W 1 ( x , t ) = { λ 21 + λ 12 ⁢ e - ( λ 12 + λ 21 ) ⁢ t λ 12 + λ 21 , 0 < t < x / ϕ 2 λ 21 + λ 12 ⁢ e - ( λ 12 + λ 21 ) ⁢ t λ 12 + λ 21 - ( 1 λ 21 + λ 12 ) ? ? ∫ 0 t ? ( t - v , x ) ⁢ h ⁡ ( v , x ) ⁢ dv , t > x / ϕ 2 , W 2 ( x , t ) = { λ 12 λ 12 + λ 21 ⁢ ( 1 - ? ) , 0 < t < x / ϕ 2 λ 12 λ 12 + λ 21 ⁢ ( 1 - ? ) - ( 1 λ 21 + λ 12 ) ? ∫ 0 t ? ( t - v , x ) ⁢ g ⁡ ( v , x ) ⁢ dv , t > x / ϕ 2 , where f 1 ( t , x ) = 1 - ? h ⁡ ( t , x ) = λ 12 ⁢ λ 21 ⁢ ϕ - 2 ( ϕ 2 - ϕ 1 ) ⁢ ( λ 12 + λ 21 ) ⁢ e ( - λ 12 ⁢ ϕ 2 - λ 21 ⁢ ϕ 1 ϕ 2 - ϕ 1 ⁢ t ) × { ? ( ? ( x , t ) ) - 1 ? ( x , t ) ? ( ? x , t ) ) } , g ⁡ ( t , x ) = δ ⁡ ( t ) + - λ 12 ⁢ λ 21 ⁢ ϕ 2 ⁢ ϕ 1 ϕ 2 - ϕ 1 ⁢ e ( - λ 12 ⁢ ϕ 2 - λ 21 ⁢ ϕ 1 ϕ 2 - ϕ 1 ⁢ t ) × ( ? ( x , t ) - 1 ? ( x , t ) ) ? ( ? ( x , t ) ) , ? ( x , t ) = 2 ⁢ t ⁢ - λ 12 ⁢ λ 21 ⁢ ϕ 1 ⁢ ϕ 2 ϕ 2 - ϕ 1 ? ( x , t ) , ? ( x , t ) = ( 1 - ( 1 ϕ 1 - 1 ϕ 2 ) ⁢ x t ) , ? indicates text missing or illegible when filed and I n is a modified Bessel function of order n=0, 1, 2.
  6. 6 . The method of claim 4 , wherein: an initial content of the node-local burst buffer is greater than the predefined threshold and |φ 1 |=|φ 2 |, when 0<t≤v 0 , W 1 (x, t)=W 2 (x, t)=0, when ⁢ 0 < t < v ~ 0 , W 1 ( x , t ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ? × { 2 ? sinh [ ? ( t - v 0 ) ] + ? ( λ 21 - λ 12 ) ⁢ ∫ v 0 t sinh [ ? ( v - t ) ] ⁢ ? [ ? ( v ) ] ⁢ dv - 2 ? ∫ v 0 t ? v 0 ⁢ sinh [ ? ( v - t ) ] y ⁡ ( v ) ⁢ ? [ ? ( v ) ] ⁢ dv } and W 2 ( ? t ) = ? 2 ⁢ ( λ 12 + λ 21 ) × { ∫ v 0 t ( ? + ? ) ? v 0 y ⁡ ( v ) ⁢ ? [ ? y ⁡ ( v ) ] ⁢ dv - ( λ 12 - λ 21 ) 2 ? ( ? + ? ) ⁢ ? [ ? y ⁡ ( v ) ] ⁢ dv + ( ? + ? ) - ( λ 12 + λ 21 ) ⁢ ? [ ? y ⁡ ( v ) ] } ? , when ⁢ t > v ~ 0 , W 1 ( x , ? ) = ? λ 12 + λ 21 ( λ 21 ? sinh [ ? ( t - v 0 ) ] + λ 21 2 ? 2 ⁢ λ 12 - λ 12 ? 2 + ? ( t ) 2 ⁢ λ 12 ) + ? λ 12 × ( ? y ⁡ ( t ) ? [ ? ( t ) ] + ( λ 12 - λ 21 ) ⁢ ? [ ? ( t ) ] 2 - ? ) and W 2 ( x , t ) = ? λ 12 + λ 21 × { sinh [ ? x ϕ ] ⁢ ( λ 12 ? - λ 21 ? ) + ( λ 12 + λ 21 ) ? 2 ⁢ ( ? [ ? ( t ) ] + ? [ ? ( t ) ] ) + ? ( t ) } , where ? = λ 12 + λ 21 2 , ? = λ 12 ⁢ λ 21 , v 0 = 1 ϕ ⁢ ( ? - x ) , ? = 1 ϕ ⁢ ( u + x ) , y ⁡ ( t ) = t 2 - v 0 2 , ? ( t ) = t 2 - v ~ 0 2 , ? ( t ) = α 2 ? { ( λ 21 - λ 12 ) ⁢ I 0 [ α ⁢ y ⁡ ( v ) ] - 2 ⁢ α ⁢ v 0 ⁢ ? [ ? y ⁡ ( v ) ] y ⁡ ( v ) } ? sinh [ ? ( t - v ) ] ⁢ dv + ( λ 12 - λ 21 ) ? ( λ 21 2 ? - λ 12 2 ? ) ⁢ I 0 [ ? ( v ) ] ⁢ dv + ? ? ? ( v ) ( λ 21 2 ? - λ 12 2 ? ) ⁢ I 0 [ ? ( v ) ] ⁢ dv , and ? ( t ) = 1 4 ⁢ ∫ v 0 t ( λ 21 ? + λ 12 ? ) × [ I 0 [ α ⁢ y ⁡ ( v ) ] ⁢ ( λ 21 - λ 12 ) - 2 ⁢ α ⁢ ? [ α ⁢ y ⁡ ( v ) ] y ⁡ ( v ) ] ⁢ dv + 1 4 ? ( λ 21 ? + λ 12 ? ) × [ I 0 [ α ? ( v ) ] ⁢ ( λ 21 - λ 12 ) - 2 ⁢ α ⁢ ? [ α ⁢ y ⁡ ( v ) ] ? ( v ) ] ⁢ dv , ? indicates text missing or illegible when filed and I n are modified Bessel functions of an order n=0, 1.
  7. 7 . The method of claim 4 , wherein: an initial content of the node-local burst buffer is less than or equal to the predefined threshold and |φ 1 |=|φ 2 |, when ⁢ 0 < t ≤ v 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) When ⁢ 0 < t ≤ v ~ 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) - ( g 14 ( x , t ) + g 15 ( x , t ) + g 17 ( x , t ) ) , W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) ? ( g 22 ( x , t ) + g 25 ( x , t ) + g 27 ( x , t ) + g 29 ( x , t ) + g 210 ( x , t ) + g 211 ( x , t ) + g 212 ( x , t ) ) when ⁢ t > v ~ 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) + ? ( g 110 ( x , t ) - g 113 ( x , t ) ) - ( g 14 ( x , t ) + g 15 ( x , t ) + g 17 ( x , t ) ) and W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - e - ( λ 12 + λ 21 ) ⁢ t ) - ? ( g 22 ( x , t ) + g 25 ( x , t ) + g 27 ( x , t ) + g 29 ( x , t ) + g 210 ( x , t ) + g 211 ( x , t ) + g 212 ( x , t ) ) + ? ( g 214 ( x , t ) + g 215 ( x , t ) + g 217 ( x , t ) + g 218 ( x , t ) + g 220 ( x , t ) + g 222 ( x , t ) ) where g 12 ⁢ in ( x , t , v ) = e - av ⁢ α ⁢ v 0 ⁢ ? [ α ⁢ y ⁡ ( v ) ] y ⁡ ( v ) g 13 ⁢ in ( x , t , v ) = 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v ) g 14 ( x , t , v ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ⁢ e ( x - u ) ⁢ ( λ 12 - λ 21 ) 2 ⁢ ϕ ⁢ ∫ v 0 t g 12 ⁢ in ( x , t , v ) ⁢ g 13 ⁢ in ( x , t , v ) ⁢ dv g 15 ( x , t ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ⁢ e [ ( x - u ) ⁢ ( λ 12 - λ 21 ) 2 ⁢ ϕ - av 0 ] ( 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v 0 ) ) g 16 ⁢ in ( x , v ) = 1 2 ⁢ e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] g 17 ( x , t ) = λ 21 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ? ∫ v 0 t g 16 ⁢ in ( x , t , v ) ⁢ g 13 ⁢ in ( x , t , v ) ⁢ dv g 18 ⁢ in ( x , t , v ) = 1 2 ⁢ e - av ⁢ ? [ α ? ( v ) ] g 19 ⁢ in ( x , t , v ) = 1 2 ⁢ e - ( λ 12 + λ 21 ) ⁢ ( t - v ) ⁢ e - av ⁢ ? 
[ α ⁢ y ⁡ ( v ) ] g 111 ⁢ in ( x , t , v ) = ( α + ? ) ⁢ e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] ? ( v ) g 112 ⁢ in ( x , t , v ) = e - ( λ 12 + λ 21 ) ⁢ ( t - v ) ⁢ e - av ⁢ α ⁢ ? [ α ? ( v ) ] ? ( v ) g 110 ( x , t ) = ? λ 12 ? ( t ) ? [ ? ( v ) ] + λ 12 - λ 21 2 ⁢ λ 12 ⁢ ? [ ? ( t ) ] + ( λ 12 - λ 21 ) ⁢ ? [ ? ( x , t , v ) 2 ⁢ λ 12 + λ 21 - ? ( x , t , v ) 2 ⁢ λ 12 ( λ 12 + λ 21 ) ] ⁢ dv g 113 ( x , t ) = ? λ 12 ? ( t ) ? [ ? ( t ) ] - 1 2 ⁢ ( λ 12 + λ 21 ) [ ? + λ 21 2 2 ⁢ λ 12 ? ] + 1 2 ⁢ ( λ 12 + λ 21 ) ⁢ ? [ ? ( x , t , v ) - ? ( x , t , v ) 2 ⁢ λ 12 ( λ 12 + λ 21 ) ] g 22 ( x , t ) = 1 2 ⁢ ? [ α ⁢ y ⁡ ( t ) ] g 23 ⁢ in ( x , t , v ) = 1 2 ⁢ ? [ α ⁢ y ⁡ ( v ) ] g 24 ⁢ in ( x , t , v ) = e - ( λ 12 + λ 21 ) ⁢ ( t - v ) g 25 ( x , t ) = ( λ 12 - λ 21 ) 2 ? g 23 ⁢ in ( x , t , v ) ⁢ g 24 ⁢ in ( x , t , v ) ⁢ dv g 26 ⁢ in ( x , t , v ) = 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v ) g 27 ( x , t ) = λ 12 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 - λ 21 ) ? g 23 ⁢ in ( x , t , v ) ⁢ g 26 ⁢ in ( x , t , v ) ⁢ dv g 28 ⁢ in ( x , t , v ) = ? ? [ α ⁢ y ⁡ ( v ) ] y ⁡ ( v ) g 29 ( x , t ) = λ 12 ( λ 12 - λ 21 ) ? g 28 ⁢ in ( x , t , v ) ⁢ g 26 ⁢ in ( x , t , v ) ⁢ dv g 210 ( x , t ) = λ 12 ( λ 12 + λ 21 ? ( 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v 0 ) ) g 211 ( x , t ) = 1 2 ? g 28 ⁢ in ( x , t , v ) ⁢ g 24 ⁢ in ( x , t , v ) ⁢ dv g 212 ( x , t ) = 1 2 ⁢ e [ - ( λ 12 + λ 21 ) ⁢ ( t - v 0 ) - av 0 ] g 213 ⁢ in ( x , t , v ) = ? ? [ α ? ( v ) ] y ⁡ ( v ) g 214 ( x , t ) = λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 213 ⁢ in ( x , t , v ) ⁢ dv g 215 ( x , t ) = - λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 216 ⁢ in ( x , t , v ) = ? α ⁢ ? [ α ? ( v ) ] ? ( v ) g 217 ( x , t ) = λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 216 ⁢ in ( x , t , v ) ⁢ dv + ? g 218 ( x , t ) = 1 2 ⁢ e - at ⁢ ? [ α ? ( t ) ] g 219 ⁢ in ( x , t , v ) = 1 2 ⁢ e - av ⁢ e - ( λ 12 + λ 21 ) ⁢ ( t - v ) ⁢ ? [ ? ( v ) ] g 220 ( x , t ) = λ 21 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ? g 219 ⁢ in ( x , t , v ) ⁢ dv g 221 ⁢ in ( x , t , v ) = 1 2 ⁢ e - a ⁢ v ⁢ ? [ α ? 
( v ) ] g 222 ( x , t ) = λ 12 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ? g 221 ⁢ in ( x , t , v ) ⁢ dv ? indicates text missing or illegible when filed
  8. The method of claim 4, further comprising: calculating λ̃₁₂ from Δtλ₁₂ and λ̃₂₁ from Δtλ₂₁, wherein Δt=tₖ₊₁−tₖ ∀tₖ ∈ .
  9. A computing device for performing real-time reliability analysis of node-local burst buffer architectures, the computing device comprising: at least one processor; and a memory connected with the at least one processor; and a node-local burst buffer, wherein: the at least one processor is configured to perform operations comprising: determining an initial node-local burst buffer content at a start of a time period T; receiving a current node-local burst buffer content during the time period; for each checkpoint/restart time interval, performing: estimating stochastic transition rates λ₁₂ and μ, indicating when the node-local burst buffer is receiving and draining, respectively, estimating input flow data rates of data entering the node-local burst buffer from a compute node, and estimating drain data rates of the data leaving the node-local burst buffer to a parallel file system; modeling an average statistical reliability function of the node-local burst buffer within the time period T with respect to not exceeding a predetermined threshold value; and performing an action when the average statistical reliability function has a value that is less than a predefined value.
  10. The computing device of claim 9, wherein the estimating the stochastic transition rates comprises: estimating the λ=λ₁₂ by dividing the number of transitions to a node-local buffer receiving state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer receiving state; and estimating the μ=λ₂₁ by dividing the number of transitions to the node-local buffer draining state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer draining state.
  11. The computing device of claim 10, wherein when the λ=λ₁₂ and the μ=λ₂₁ exceed a predefined threshold, the method further comprises: performing expectation maximization to estimate final values of λ and μ.
  12. The computing device of claim 9, wherein: W₁(x, t) is equal to the probability that an amount of data in the node-local burst buffer is less than or equal to a predefined threshold given that the node-local burst buffer is in a node-local burst buffer draining state; W₂(x, t) is equal to the probability that an amount of data in the node-local burst buffer is less than or equal to the predefined threshold given that the node-local burst buffer is in a node-local burst buffer receiving state; and calculating a value of the statistical reliability function based on a sum of W₁(x, t) and W₂(x, t).
  13. 13 . The computing device of claim 12 , wherein: the node-local burst buffer is initially empty at a start of the each checkpoint/restart time interval, W 1 ( x , t ) = { λ 21 + λ 12 ? λ 12 + λ 21 , 0 < t < x / ϕ 2 λ 21 + λ 12 ? λ 12 + λ 21 - ( 1 λ 21 + λ 12 ) × ? ( t - v , x ) ⁢ h ⁡ ( v , x ) ⁢ dv , ? > x / ϕ 2 W 2 ( x , t ) = { λ 12 λ 12 + λ 21 ⁢ ( 1 - ? ) , 0 < ? < x / ϕ 2 λ 12 λ 12 + λ 21 ⁢ ( 1 - ? ) - ( λ 12 λ 21 + λ 12 ) ? ( t - v , x ) ⁢ g ⁡ ( v , x ) ⁢ dv , ? > x / ϕ 2 , where ? ( t , x ) = 1 - ? h ⁡ ( t , x ) = λ 12 ⁢ λ 21 ⁢ ϕ 2 ( ϕ 2 - ϕ 1 ) ⁢ ( λ 12 + λ 21 ) ? × { ? ( ? ( x , t ) ) - 1 ? ( x , t ) ? ( ? ( x , t ) ) } ? g ⁡ ( t , x ) = δ ⁡ ( t ) + - λ 12 ⁢ λ 21 ⁢ ϕ 2 ⁢ ϕ 1 ϕ 2 - ϕ 1 ? × ( ? ( x , t ) - 1 ? ( x , t ) ) ? ( ? ( x , t ) ) , ? ( x , t ) = 2 ⁢ t ⁢ - λ 12 ⁢ λ 21 ⁢ ϕ 1 ⁢ ϕ 2 ϕ 2 - ϕ 1 ? ( x , t ) , ? = ( 1 - ( 1 ϕ 1 - 1 ϕ 2 ) ⁢ x t ) , ? indicates text missing or illegible when filed and I n is a modified Bessel function of order n=0, 1, 2.
  14. 14 . The computing device of claim 12 , wherein: an initial content of the node-local burst buffer is greater than the predefined threshold and |φ 1 |=|φ 2 |, when 0<t≤v 0 , W 1 (x, t)=W 2 (x, t)=0, W 1 ( x , t ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ? × { ? sinh [ ? - v 0 ) ] + ? ( λ 21 - λ 12 ) ? sinh [ ? ( v - t ) ] ⁢ ? [ ? ] ⁢ dv - 2 ? ? sinh [ ? ( v - t ) ] ? ⁢ ? [ ? ] ⁢ dv } and W 2 ( x , ? ) = ? 2 ⁢ ( λ 12 + λ 21 ) × { ? ? ? ? dv - ( λ 12 - λ 21 ) 2 ? ( ? + ? ) ? dv + ( ? + ? ) - ( λ 12 + ? ) ? when ⁢ t > v ~ 0 , W 1 ( x , t ) = ? λ 12 + λ 21 ( λ 21 ? sinh [ ? ] + λ 21 ? 2 ⁢ λ 12 - λ 12 ? 2 + ? 2 ⁢ λ 12 ) + ? λ 12 × ( ? ? ? + ? 2 - ? ) and W 2 ( x , t ) = ? λ 12 + λ 21 × { sinh [ ? ? ] ⁢ ( λ 12 ? - λ 21 ? ) + ( λ 12 + λ 21 ) ? 2 ? where ? = λ 12 + λ 21 2 , ? = λ 12 ⁢ λ 21 , ? = 1 ? ⁢ ( ? - x ) , ? = 1 ? ⁢ ( ? - x ) y ⁡ ( t ) = t 2 - v 0 2 , ? ( t ) = t 2 - ? , ? = ? { ( λ 21 - λ 12 ) ? - ? ? ? sinh ? dv + ( λ 12 - λ 21 ) ? ( λ 21 2 ? - λ 12 2 ? ) ? dv + ? ? ? ( λ 21 2 ? - λ 12 2 ? ) ? dv ? and ? = 1 4 ? ( λ 21 ? + λ 12 ? ) × [ ? ( λ 21 - λ 12 ) - ? ? ] ⁢ dv + 1 4 ? ( λ 21 ? + λ 12 ? ) × [ ? ( λ 21 - λ 12 ) - ? ? ] ⁢ dv , ? indicates text missing or illegible when filed and I n are modified Bessel functions of an order n=0, 1.
  15. 15 . The computing device of claim 12 , wherein: an initial content of the node-local burst buffer is less than or equal to the predefined threshold and |φ 1 |=|φ 2 |, when ⁢ 0 < t ≤ v 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) when ⁢ v 0 < t ≤ v ~ 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) - ( g 14 ( x , t ) + g 15 ( x , t ) + g 17 ( x , t ) ) W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) - ? ( g 22 ( x , t ) + g 25 ( x , t ) + g 27 ( x , t ) + g 29 ( x , t ) + g 210 ( x , t ) + g 211 ( x , t ) + g 212 ( x , t ) ) when ⁢ t > v ~ 0 , W 1 ( x , t ) = λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) + ? ( g 110 ( x , t ) - g 113 ( x , t ) ) - ( g 14 ( x , t ) + g 15 ( x , t ) + g 17 ( x , t ) ) and W 2 ( x , t ) = λ 12 λ 12 + λ 21 + λ 21 λ 12 + λ 21 ⁢ ( 1 - ? ) - ? ( g 22 ( x , t ) + g 25 ( x , t ) + g 27 ( x , t ) + g 29 ( x , t ) + g 210 ( x , t ) + g 211 ( x , t ) + g 212 ( x , t ) ) + ? ( ⁠ g 214 ( x , t ) + ( g 215 ( x , t ) + ( g 217 ( x , t ) + ( g 218 ( x , t ) + ( g 220 ( x , t ) + ( g 222 ( x , t ) ) where g 12 ⁢ in ( x , t , v ) = e - av ⁢ α ⁢ v 0 ⁢ ? [ α ⁢ y ⁡ ( v ) ] y ⁡ ( v ) g 13 ⁢ in ( x , t , v ) = 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v ) g 14 ( x , t , v ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ⁢ e ( x - u ) ⁢ ( λ 12 - λ 21 ) 2 ⁢ ϕ ⁢ ∫ v 0 t g 12 ⁢ in ( x , t , v ) ⁢ g 13 ⁢ in ( x , t , v ) ⁢ dv g 15 ( x , t ) = λ 21 2 ⁢ ( λ 12 + λ 21 ) ⁢ e [ ( x - u ) ⁢ ( λ 12 - λ 21 ) 2 ⁢ ϕ - av 0 ] ( 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v 0 ) ) g 16 ⁢ in ( x , v ) = 1 2 ⁢ e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] g 17 ( x , t ) = λ 21 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ⁢ e ( x - u ) ⁢ ( λ 12 - λ 21 ) 2 ⁢ ϕ ⁢ ∫ v 0 t g 16 ⁢ in ( x , t , v ) ⁢ g 13 ⁢ in ( x , t , v ) ⁢ dv g 18 ⁢ in ( x , t , v ) = 1 2 ⁢ e - av ⁢ ? [ α ? ( v ) ] g 19 ⁢ in ( x , t , v ) = 1 2 ? e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] g 111 ⁢ in ( x , t , v ) = ( α + ? ) ? ? ? g 112 ⁢ in ( x , t , v ) = ? e - av ⁢ α ⁢ ? [ α ? ( v ) ] ? ( v ) g 110 ( x , ? 
) = α ⁢ t λ 12 ? ( t ) ⁢ ? [ α ? ( v ) ] + λ 12 - λ 21 2 ⁢ λ 12 ⁢ ? [ α ? ( t ) ] + ( λ 12 - λ 21 ) ⁢ ? [ λ 12 ⁢ g 18 ⁢ in ( x , t , v ) 2 ⁢ λ 12 + λ 21 - λ 21 2 ⁢ g 19 ⁢ in ( x , t , v ) 2 ⁢ λ 12 ( λ 12 + λ 21 ) ] ⁢ dv g 113 ( x , t ) = ? λ 12 ? ? [ ? ] - 1 2 ⁢ ( λ 12 + λ 21 ) [ λ 12 ? + λ 21 2 2 ⁢ λ 12 ? ] + 1 2 ⁢ ( λ 12 + λ 21 ) ⁢ ? [ λ 12 ⁢ g 111 ⁢ in ( x , t , v ) - λ 21 2 ⁢ g 112 ⁢ in ( x , t , v ) 2 ⁢ λ 12 ( λ 12 + λ 21 ) ] g 22 ( x , t ) = 1 2 ⁢ ? [ α ⁢ y ⁡ ( t ) ] g 23 ⁢ in ( x , t , v ) = 1 2 ⁢ e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] g 24 ⁢ in ( x , t , v ) = e - ( λ 12 + λ 21 ) ⁢ ( t - v ) g 25 ( x , t ) = ( λ 12 - λ 21 ) 2 ? g 23 ⁢ in ( x , t , v ) ⁢ g 24 ⁢ in ( x , t , v ) ⁢ dv g 26 ⁢ in ( x , t , v ) = 1 - ? g 27 ( x , t ) = λ 12 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 - λ 21 ) ? g 23 ⁢ in ( x , t , v ) ⁢ g 26 ⁢ in ( x , t , v ) ⁢ dv g 28 ⁢ in ( x , t , v ) = e - av ⁢ ? [ α ⁢ y ⁡ ( v ) ] y ⁡ ( v ) g 29 ( x , t ) = λ 12 ( λ 12 - λ 21 ) ? g 28 ⁢ in ( x , t , v ) ⁢ g 26 ⁢ in ( x , t , v ) ⁢ dv g 210 ( x , t ) = λ 12 ( λ 12 + λ 21 ⁢ e - av 0 ( 1 - e - ( λ 12 + λ 21 ) ⁢ ( t - v 0 ) ) g 211 ( x , t ) = 1 2 ? g 28 ⁢ in ( x , t , v ) ⁢ g 24 ⁢ in ( x , t , v ) ⁢ dv g 212 ( x , t ) = 1 2 ⁢ e [ - ( λ 12 + λ 21 ) ⁢ ( ? - v 0 ) - av 0 ] g 213 ⁢ in ( x , t , v ) = ? e - av ⁢ ? [ α ? ( v ) ] y ⁡ ( v ) g 214 ( x , t ) = λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 213 ⁢ in ( x , t , v ) ⁢ dv g 215 ( x , t ) = - λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 216 ⁢ in ( x , t , v ) = e - av ⁢ α ⁢ ? [ α ? ( v ) ] ? ( v ) g 217 ( x , t ) = λ 21 2 ⁢ ( λ 12 - λ 21 ) ? g 216 ⁢ in ( x , t , v ) ⁢ dv + ? g 218 ( x , t ) = 1 2 ⁢ e - at ⁢ ? [ α ? ( t ) ] g 219 ⁢ in ( x , t , v ) = 1 2 ⁢ e - av ⁢ ? [ ? ( v ) ] g 220 ( x , t ) = λ 21 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ? g 219 ⁢ in ( x , t , v ) ⁢ dv g 221 ⁢ in ( x , t , v ) = 1 2 ⁢ e - a ⁢ v ⁢ ? [ α ? ( v ) ] g 222 ( x , t ) = λ 12 ( λ 12 - λ 21 ) 2 ⁢ ( λ 12 + λ 21 ) ? g 221 ⁢ in ( x , t , v ) ⁢ dv . ? indicates text missing or illegible when filed
  16. The computing device of claim 12, wherein the operations further comprise: calculating λ̃₁₂ from Δtλ₁₂ and λ̃₂₁ from Δtλ₂₁, wherein Δt=tₖ₊₁−tₖ ∀tₖ ∈ .
  17. A non-transitory computer-readable medium having instructions recorded thereon for a processor of a computing device to perform operations comprising: determining an initial node-local burst buffer content at a start of a time period T; receiving a current node-local burst buffer content during the time period; for each checkpoint/restart time interval, performing: estimating stochastic transition rates λ=λ₁₂ and μ=λ₂₁, indicating when the node-local burst buffer is receiving and draining, respectively, estimating input flow data rates of data entering the node-local burst buffer from a compute node, and estimating drain data rates of the data leaving the node-local burst buffer to be stored to a parallel file system; modeling an average statistical reliability function of the node-local burst buffer within the time period T with respect to not exceeding a predetermined threshold value; and performing an action when the average statistical reliability function has a value that is less than a predefined threshold value.
  18. The non-transitory computer-readable medium of claim 17, wherein the estimating the stochastic transition rates comprises: estimating the λ=λ₁₂ by dividing a number of transitions to a node-local buffer receiving state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer receiving state; and estimating the μ=λ₂₁ by dividing a number of transitions to the node-local buffer draining state by a cumulative amount of time that the node-local burst buffer is in the node-local burst buffer draining state.
  19. The non-transitory computer-readable medium of claim 18, wherein when the λ and the μ exceed a predefined threshold, the method further comprises: performing expectation maximization to estimate final values of λ and μ.
  20. The non-transitory computer-readable medium of claim 17, wherein: W₁(x, t) is equal to the probability that an amount of data in the node-local burst buffer is less than or equal to a predefined threshold given that the node-local burst buffer is in a node-local burst buffer draining state; W₂(x, t) is equal to the probability that an amount of data in the node-local burst buffer is less than or equal to the predefined threshold given that the node-local burst buffer is in a node-local burst buffer receiving state; and calculating a value of the statistical reliability function based on a sum of W₁(x, t) and W₂(x, t).
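
Claims 2, 10, and 18 recite each rate estimate as a transition count divided by a cumulative dwell time. The arithmetic can be sketched as follows, under stated assumptions: the input format (a chronological list of (state, duration) pairs), the state labels, and the function name are hypothetical, the first observed segment is not counted as a transition, and the expectation maximization refinement of claims 3, 11, and 19 is not shown.

```python
RECEIVING = "receiving"  # hypothetical state labels
DRAINING = "draining"


def estimate_rates(sojourns):
    """Return (lam, mu) from chronological (state, duration) segments.

    lam = transitions into the receiving state / total time spent receiving
    mu  = transitions into the draining state  / total time spent draining
    This follows the wording of the claims literally.
    """
    transitions = {RECEIVING: 0, DRAINING: 0}
    time_in = {RECEIVING: 0.0, DRAINING: 0.0}
    previous = None
    for state, duration in sojourns:
        if state not in time_in:
            raise ValueError(f"unknown state: {state!r}")
        if duration < 0:
            raise ValueError("durations must be non-negative")
        time_in[state] += duration
        # Count a transition only when the state actually changes.
        if previous is not None and state != previous:
            transitions[state] += 1
        previous = state
    if time_in[RECEIVING] <= 0 or time_in[DRAINING] <= 0:
        raise ValueError("both states must be observed for a positive time")
    lam = transitions[RECEIVING] / time_in[RECEIVING]
    mu = transitions[DRAINING] / time_in[DRAINING]
    return lam, mu


if __name__ == "__main__":
    log = [(DRAINING, 2.0), (RECEIVING, 1.0), (DRAINING, 3.0), (RECEIVING, 1.0)]
    print(estimate_rates(log))
```

In this example log, the buffer enters the receiving state twice over 2.0 time units of receiving and enters the draining state once over 5.0 time units of draining, so the two estimates are 2/2.0 and 1/5.0.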

Description

This application is the national stage entry of International Patent Application No. PCT/US2023/035013, filed on Oct. 12, 2023, and published as WO 2024/086054 A1 on Apr. 25, 2024, which claims the benefit of U.S. Provisional Patent Application No. 63/380,484, filed Oct. 21, 2022, and U.S. Provisional Patent Application No. 63/382,580, filed Nov. 7, 2022, which are hereby incorporated by reference in their entireties.

This invention was made with United States government support under contract S900294BAH awarded by the Army Research Laboratory. The United States government has certain rights in the invention.

BACKGROUND OF THE INVENTION

High performance computing (HPC) systems have transformed the way that information is processed and stored because they can handle vast amounts of data. However, they also face input/output (I/O) bottlenecks, for the following reasons. First, big data applications running in these environments perform many read and write operations to handle workloads and thus consume much I/O bandwidth. Additionally, application-based checkpointing and restarting (C/R) is burdensome on I/O infrastructure because checkpointing operations issue a large number of write requests to a parallel file system (PFS), which also degrades storage server bandwidth. Job heterogeneity is also an issue because job requests of various sizes and priorities compete with each other for I/O bandwidth and other resources. This results in prolonged average I/O time because processing of smaller jobs is delayed by concurrent processing of larger jobs. As a result, the application C/R process is also affected because lower-priority jobs could frequently interrupt the checkpointing of higher-priority jobs. Scientists have addressed these concerns by proposing burst buffers (BBs) as brokers, developing infrastructures and algorithms to minimize the effects of I/O contention in supercomputing infrastructures.

One approach is to create node-local BB architectures in which each burst buffer is collocated with a corresponding compute node. This approach scales well and also improves checkpoint bandwidth because aggregate bandwidth increases in proportion to the number of compute nodes. Since researchers at the San Diego Supercomputer Center (SDSC) demonstrated this proof of concept on the DASH supercomputing cluster, several current HPC systems have adopted this type of storage.

SUMMARY OF THE INVENTION

In a first embodiment, a method is provided for performing real-time reliability analysis of node-local burst buffer architectures. A computing device determines an initial node-local burst buffer content at a start of a time period T. The computing device receives a current node-local burst buffer content during the time period. For each checkpoint/restart time interval, the computing device: estimates stochastic transition rates λ and μ, indicating when the node-local burst buffer is receiving and draining, respectively; estimates input flow data rates of data entering the node-local burst buffer from a compute node; and estimates drain data rates of the data leaving the node-local burst buffer to a parallel file system. The computing device models an average statistical reliability function of the node-local burst buffer within the time period T with respect to not exceeding a predetermined threshold value. The computing device performs an action when the average statistical reliability function has a value that is less than a predefined value.

In a second embodiment, a computing device is provided for performing real-time reliability analysis of node-local burst buffer architectures. The computing device includes at least one processor, a memory connected with the at least one processor, and a node-local burst buffer. The at least one processor is configured to perform operations. According to the operations: an initial node-local burst buffer content is determined at a start of a time period T; a current node-local burst buffer content is received during the time period; and, for each checkpoint/restart time interval, stochastic transition rates λ and μ are estimated, indicating when the node-local burst buffer is receiving and draining, respectively, input flow data rates of data entering the node-local burst buffer from a compute node are estimated, and drain data rates of the data leaving the node-local burst buffer to be stored to a parallel file system are estimated. An average statistical reliability function of the node-local burst buffer within the time period T is modeled with respect to not exceeding a predetermined threshold value. An action is performed when the average statistical reliability function has a value that is less than a predefined value.

In a third embodiment, at least one non-transitory computer-readable storage medium has computer instructions stored thereon for a processor of a computing device to perform operations. According to the operations, a