EP-4740378-A1 - SYSTEM AND METHOD FOR ALARM RAISE / CLEAR SYNCHRONIZATION AND CLEAR RACE CONDITION HANDLING
Abstract
The present disclosure relates to a system (100) for alarm raise / clear synchronization and clear race condition handling. The system includes a collector component (150) to receive a data set comprising a plurality of alarms generated from FCAPS data from a network element. A parsing unit (212) is provided within the collector component, where the parsing unit (212) parses the plurality of alarms from the data set and transforms them into a standardized format. A categorizing unit (214) is provided within the collector component (150) to categorize the plurality of alarms as either a raise alarm event or a clear alarm event based on attributes associated with each alarm in the data set. A fault manager master (110) receives the plurality of alarms from the collector component, where the fault manager master assigns a unique alarm identifier (id) to each alarm.
Inventors
- BHATNAGAR, AAYUSH
- BISHT, SANDEEP
- MISHRA, RAHUL
- DIVY, Dipankar
- MISHRA, SOMYA
- KUMAR, VIPUL
- MAGU, Sameer
Assignees
- Jio Platforms Limited
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2024-07-02
Claims (17)
- 1. A method for alarm raise / clear synchronization and clear race condition handling, the method comprising the steps of: receiving, by one or more processors (202), a data set comprising a plurality of alarms generated from FCAPS (Fault, Configuration, Accounting, Performance, and Security) data from a network element at a collector component (150); parsing, by the one or more processors (202), the plurality of alarms from the data set, and further transforming the plurality of alarms from the data set into a standardized format within the collector component (150); categorizing, by the one or more processors (202), the plurality of alarms as either a raise alarm event or a clear alarm event based on attributes associated with each alarm in the data set; and sending, by the one or more processors (202), at least one of the raise alarm event or the clear alarm event to a fault manager master (110).
- 2. The method as claimed in claim 1, wherein the parsing comprises extracting, by the one or more processors (202), relevant information from the plurality of alarms, and formatting the plurality of alarms in a pre-defined set.
- 3. The method as claimed in claim 1, further comprising checking, by the one or more processors (202), the existence of at least one of the raise alarm event or the clear alarm event in a distributed I/O cache (135).
- 4. The method as claimed in claim 3, further comprising: inserting, by the one or more processors (202), at least one of the raise alarm event or the clear alarm event into the distributed I/O cache (135); updating, by the one or more processors (202), at least one of the raise alarm event or the clear alarm event by adding a new alarm timestamp to a timestamp array and incrementing an occurrence count; and assigning, by the one or more processors (202), a unique alarm identifier (id) to each of the raise alarm event or the clear alarm event.
- 5. The method as claimed in claim 4, further comprising: segregating and mapping, by the one or more processors (202), at least one of the raise alarm event or the clear alarm event having the unique alarm identifier to a raise fault manager (130) or a clear fault manager (115); running, by the one or more processors (202), the raise fault manager (130) periodically to consume the unique alarm identifiers and fetch the corresponding alarm data from the distributed I/O cache (135) and a database (140); running, by the one or more processors (202), the clear fault manager (115) periodically to consume the unique alarm identifiers and fetch the corresponding alarm data from the distributed I/O cache (135) and the database (140); and eliminating, by the one or more processors (202), the raise alarm event from an active section by the clear fault manager (115) and further adding clearance metadata to the alarm to obtain a resolved alarm.
- 6. The method as claimed in claim 5, further comprising: modifying, by the one or more processors (202), the resolved alarm and further storing it in an archived section of the database (140); adding, by the one or more processors (202), a retry count by the clear fault manager (115) in case the raise alarm event is not received or processed; running, by the one or more processors (202), a clear retry fault manager (120) periodically to consume clearance data from a retry stream, wherein, for each data item consumed, the clear retry fault manager (120) checks the database (140) for the corresponding raise alarm event; eliminating, by the one or more processors (202), the raise alarm event from the active section by the clear retry fault manager (120) and further adding clearance metadata to the raise alarm event to obtain another resolved alarm; detecting, by the one or more processors (202), whether the retry count has reached a threshold, by the clear retry fault manager (120); incrementing, by the one or more processors (202), the retry count and reproducing the clear alarm data on the retry stream for another attempt; and logging, by the one or more processors (202), an error if the retry count is exhausted.
- 7. The method as claimed in claim 1, further comprising identifying, by the one or more processors (202), stranded alarms present in the distributed I/O cache (135).
- 8. The method as claimed in claim 1, further comprising checking, by the one or more processors (202), a clearance timestamp stamped on the raise alarm event to prevent erroneous clearance of the raise alarm event, wherein the timestamp of the raise alarm event is compared with that of the clear alarm event.
- 9. A system (100) for alarm raise / clear synchronization and clear race condition handling, the system (100) comprising: a collector component (150) configured to receive a data set comprising a plurality of alarms generated from FCAPS (Fault, Configuration, Accounting, Performance, and Security) data from a network element; a parsing unit (212) provided within the collector component (150), wherein the parsing unit (212) is configured to parse the plurality of alarms from the data set and transform the plurality of alarms into a standardized format; a categorizing unit (214) provided within the collector component (150) configured to categorize the plurality of alarms as either a raise alarm event or a clear alarm event based on attributes associated with each alarm in the data set; and a fault manager master (110) configured to receive the plurality of alarms from the collector component, wherein the fault manager master (110) is configured to assign a unique alarm identifier (id) to each alarm.
- 10. The system (100) as claimed in claim 9, wherein the collector component (150) is further configured to extract relevant information from the plurality of alarms and format the plurality of alarms in a pre-defined set.
- 11. The system (100) as claimed in claim 9, wherein the fault manager master (110) is further configured to check the existence of at least one of the raise alarm event or the clear alarm event in a distributed I/O cache (135).
- 12. The system (100) as claimed in claim 11, wherein the fault manager master (110) is further configured to: insert at least one of the raise alarm event or the clear alarm event into the distributed I/O cache (135); update at least one of the raise alarm event or the clear alarm event by adding a new alarm timestamp to a timestamp array and incrementing an occurrence count; assign a unique alarm identifier (id) to each of the raise alarm event or the clear alarm event; and segregate and map at least one of the raise alarm event or the clear alarm event having the unique alarm identifier to a raise fault manager (130) or a clear fault manager (115).
- 13. The system (100) as claimed in claim 12, wherein the raise fault manager (130) is configured to run periodically to consume the unique alarm identifiers and fetch the corresponding alarm data from the distributed I/O cache (135) and a database (140).
- 14. The system (100) as claimed in claim 12, wherein the clear fault manager (115) is configured to: run periodically to consume the unique alarm identifiers and fetch the corresponding alarm data from the distributed I/O cache (135) and the database (140); eliminate the raise alarm event from an active section and further add clearance metadata to the alarm to obtain a resolved alarm; modify the resolved alarm and further store it in an archived section of the database (140); and add a retry count in case the raise alarm event is not received or processed.
- 15. The system (100) as claimed in claim 9, further comprising a clear retry fault manager (120) configured to: run periodically to consume clearance data from a retry stream, wherein, for each data item consumed, the database (140) is checked for the corresponding raise alarm event; eliminate the raise alarm event from an active section and further add clearance metadata to the raise alarm event to obtain another resolved alarm; detect whether the retry count has reached a threshold; increment the retry count and reproduce the clear alarm data on the retry stream for another attempt; and log an error if the retry count is exhausted.
- 16. The system (100) as claimed in claim 9, further comprising a fault manager auditor (125) configured to: identify stranded alarms present in the distributed I/O cache (135); and check a clearance timestamp stamped on the raise alarm event to prevent erroneous clearance of the raise alarm event, wherein the timestamp of the raise alarm event is compared with that of the clear alarm event.
- 17. A non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to: receive a data set comprising a plurality of alarms generated from FCAPS (Fault, Configuration, Accounting, Performance, and Security) data from a network element at a collector component (150); parse the plurality of alarms from the data set, and further transform the plurality of alarms from the data set into a standardized format within the collector component (150); categorize the plurality of alarms as either a raise alarm event or a clear alarm event based on attributes associated with each alarm in the data set; and send at least one of the raise alarm event or the clear alarm event to a fault manager master (110).
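The clear race handled by claims 5, 6, 14, and 15 can be illustrated with a short sketch: if a clear event arrives before its raise alarm has been stored, the clear is re-queued on a retry stream with a bounded retry count instead of being lost. This is an illustrative Python sketch, not the patented implementation; the names `ActiveStore`, `process_clear`, and `MAX_RETRIES` are assumptions standing in for the database (140), the clear / clear retry fault managers (115, 120), and the claimed threshold.

```python
import time

MAX_RETRIES = 3  # assumed threshold from claim 6's "retry count has reached a threshold"

class ActiveStore:
    """Stands in for the active and archived sections of the database (140)."""
    def __init__(self):
        self.active = {}    # alarm_id -> raise alarm record
        self.archived = {}  # alarm_id -> resolved alarm record

    def raise_alarm(self, alarm_id, record):
        self.active[alarm_id] = record

    def resolve(self, alarm_id, clear_meta):
        record = self.active.pop(alarm_id)   # eliminate from the active section
        record["clearance"] = clear_meta     # add clearance metadata
        self.archived[alarm_id] = record     # store in the archived section
        return record

def process_clear(store, retry_stream, clear_event, errors):
    """One clear (retry) fault manager step for a single clear event."""
    alarm_id = clear_event["alarm_id"]
    if alarm_id in store.active:
        store.resolve(alarm_id, {"cleared_at": clear_event["ts"]})
    elif clear_event["retries"] < MAX_RETRIES:
        clear_event["retries"] += 1          # increment the retry count
        retry_stream.append(clear_event)     # reproduce on the retry stream
    else:
        errors.append("retry count exhausted for " + alarm_id)  # log an error

# Race: the clear event arrives before its raise alarm is processed.
store, retry_stream, errors = ActiveStore(), [], []
clear = {"alarm_id": "A1", "ts": time.time(), "retries": 0}
process_clear(store, retry_stream, clear, errors)      # raise absent -> re-queued
store.raise_alarm("A1", {"raised_at": time.time()})    # raise arrives late
process_clear(store, retry_stream, retry_stream.pop(0), errors)  # now resolves
```

On the retried pass the raise alarm is found, removed from the active section, stamped with clearance metadata, and archived; only if all retries fail is an error logged.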
Description
SYSTEM AND METHOD FOR ALARM RAISE / CLEAR SYNCHRONIZATION AND CLEAR RACE CONDITION HANDLING

FIELD OF THE INVENTION

[0001] The present invention relates to the field of network monitoring and event management, specifically addressing alarm generation, correlation, and resolution for critical events occurring within a network. The invention pertains to an integrated Alarm Management System (AMS) that efficiently handles the detection, processing, and resolution of various network alarms, providing a comprehensive and streamlined approach to managing network incidents.

BACKGROUND OF THE INVENTION

[0002] Network Management Systems (NMS) are essential for monitoring and maintaining the health and performance of computer networks. These systems generate alarms or alerts when specific events occur, indicating potential issues or anomalies within the network infrastructure.

[0003] Conventional alarm systems face challenges in effectively managing the continuous influx of alarms, particularly in scenarios where rapid fluctuations or flapping events generate a significant number of raise and clear alarms. Flapping refers to the situation where alarms repeatedly alternate between raise and clear states due to unstable or intermittent network conditions.

[0004] In existing systems, when a raise alarm occurs, it is stored in the system's storage and subsequently processed. However, the challenge arises when a corresponding clear alarm is received. The system needs to associate the clear alarm with its corresponding raise alarm in order to remove the alert from an active window and move it to the archive. This association process becomes complex when multiple raise and clear events occur in quick succession.

[0005] For example, consider a situation where the CPU percentage of a device exceeds a critical threshold, such as 95%. This triggers a raise alarm indicating the high CPU usage.
Once the CPU percentage comes down below the threshold, a clear alarm is sent to indicate that the problem is resolved. The system needs to accurately associate the clear alarm with the previous raise alarm to ensure proper closure of the alert.

[0006] Another issue in conventional systems is the lack of efficient alarm correlation. When multiple devices or components are affected by the same underlying problem, such as an interface outage, each affected device generates separate alarms. This leads to the generation of multiple individual tickets for closely related incidents, causing redundancy and inefficiencies in the resolution process. It is desirable to correlate these alarms based on common conditions or criteria, grouping them together for more effective incident management.

[0007] For instance, if an interface goes down, multiple devices connected to that interface may raise similar alarms. Instead of creating separate tickets for each device, the system should correlate these alarms and group them together under a single incident, simplifying the resolution process.

[0008] Additionally, existing alarm systems often provide limited enrichment and contextual information about the alarms. Enrichment involves augmenting the basic alarm details, such as the problem description, host information, and IP address, with additional inventory data to identify the precise location of the affected device within the network infrastructure. This enrichment is crucial for efficient incident handling and troubleshooting.

[0009] Moreover, mis-associations between clear and raise alarms can occur when raise and clear events are received simultaneously or in rapid succession. Such mis-association impacts the tracking of alarm histories, making it challenging to analyze historical alarm patterns and generate accurate reports.

[0010] There is a need for a solution to address the above challenges.
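The correlation described in paragraphs [0006] and [0007] amounts to grouping alarms that share a common condition under a single incident. The following is an illustrative sketch only, not the patented mechanism; the correlation key `"interface"` and the field names are assumptions chosen to mirror the interface-outage example.

```python
from collections import defaultdict

def correlate(alarms, key="interface"):
    """Group alarms sharing a common condition (here, the same interface)
    into one incident each, instead of one ticket per device."""
    incidents = defaultdict(list)
    for alarm in alarms:
        incidents[alarm[key]].append(alarm)
    return dict(incidents)

# Three devices affected by the same interface outage raise separate alarms.
alarms = [
    {"device": "router-1", "interface": "eth0/1", "problem": "link down"},
    {"device": "switch-2", "interface": "eth0/1", "problem": "link down"},
    {"device": "host-3",   "interface": "eth0/1", "problem": "unreachable"},
]
incidents = correlate(alarms)
# All three alarms fall under one incident keyed by the shared interface.
```

Grouping on a shared key in this way collapses several closely related tickets into a single incident, which is the redundancy reduction the background section motivates.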
SUMMARY OF THE INVENTION

[0011] One or more embodiments of the present disclosure provide a system and a method for alarm raise / clear synchronization and clear race condition handling in a network.

[0012] In one aspect of the present invention, a method for alarm raise / clear synchronization and clear race condition handling is disclosed. The method includes receiving, by one or more processors, a data set comprising a plurality of alarms generated from FCAPS (Fault, Configuration, Accounting, Performance, and Security) data from a network element at a collector component. Further, the method includes parsing, by the one or more processors, the plurality of alarms from the data set, and further transforming the plurality of alarms from the data set into a standardized format within the collector component. Further, the method includes categorizing, by the one or more processors, the plurality of alarms as either a raise alarm event or a clear alarm event based on attributes associated with each alarm in the data set. Further, the method includes sending, by the one or more processors, at least one of the raise alarm event or the clear alarm event to a fault manager master.
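The collector-side steps of the summarized method (receive, parse and standardize, categorize, send) can be sketched as below. This is a minimal illustration under assumptions, not the patented implementation: the field names (`host`, `problem`, `cleared`) and the rule that a `cleared` attribute distinguishes clear events from raise events are hypothetical stand-ins for the unspecified alarm attributes.

```python
def parse_and_standardize(raw):
    """Parsing unit (212): extract relevant fields into a standardized shape."""
    return {
        "host": raw.get("host", "unknown"),
        "problem": raw.get("problem", ""),
        "cleared": bool(raw.get("cleared", False)),
    }

def categorize(alarm):
    """Categorizing unit (214): raise vs. clear, based on alarm attributes."""
    return "clear" if alarm["cleared"] else "raise"

def collect(raw_alarms, send):
    """Collector component (150): pipeline from raw FCAPS alarm records
    to events forwarded to the fault manager master (110)."""
    for raw in raw_alarms:
        alarm = parse_and_standardize(raw)
        alarm["event_type"] = categorize(alarm)
        send(alarm)  # hand off to the fault manager master

outbox = []
collect(
    [{"host": "r1", "problem": "cpu > 95%"},
     {"host": "r1", "problem": "cpu > 95%", "cleared": True}],
    outbox.append,
)
# outbox now holds one raise event followed by its clear event.
```

In the claimed system the fault manager master would then assign each forwarded event a unique alarm identifier and route it to the raise or clear fault manager.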