US-12619619-B2 - Detecting duplicate incidents using machine learning techniques

Abstract

Methods, apparatus, and processor-readable storage media for detecting duplicate incidents using machine learning techniques are provided herein. An example method includes obtaining information associated with tracking a first incident in an incident database, generating a summary of the first incident by processing at least a portion of the information using at least one first machine learning model, and generating an embedding of the first incident by processing the generated summary using at least one second machine learning model. The method also includes computing a set of similarity scores for the first incident, determining whether the first incident is a duplicate of at least one of a plurality of additional incidents in the incident database based on the set of similarity scores, and initiating an update to one or more data records in the incident database based at least in part on a result of the determining.
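
As a minimal sketch of the comparison and determination steps described above, the Python below assumes embeddings have already been generated by the second model. It scores a new incident against stored embeddings using cosine similarity and reports those at or above a threshold. The function names, the incident identifiers, and the 0.9 threshold are illustrative assumptions, not values taken from this document.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same size")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def find_duplicates(
    new_embedding: Sequence[float],
    existing: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed cut-off, not specified in the source
) -> List[Tuple[str, float]]:
    """Score the new incident against every stored incident.

    Returns (incident_id, score) pairs whose score meets the threshold,
    sorted from most to least similar.
    """
    scores = [
        (incident_id, cosine_similarity(new_embedding, embedding))
        for incident_id, embedding in existing.items()
    ]
    hits = [(incident_id, s) for incident_id, s in scores if s >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


if __name__ == "__main__":
    # Toy two-dimensional embeddings; real embeddings would be much larger.
    stored = {"INC-1": [1.0, 0.0], "INC-2": [0.0, 1.0], "INC-3": [0.9, 0.1]}
    for incident_id, score in find_duplicates([1.0, 0.0], stored):
        print(incident_id, round(score, 4))
```

A non-empty result would then drive the record update, for example flagging, commenting on, or merging the matched incidents.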

Inventors

Shreyans Jasoriya
Anil Kumar Koluguri
David C. Sydow
Sebastian Brunet
Soham Joshi

Assignees

DELL PRODUCTS L.P.

Dates

Publication Date: 20260505
Application Date: 20230929

Claims (20)

1. A computer-implemented method comprising: obtaining information associated with tracking at least a first incident in at least one incident database; processing at least a portion of the information by at least one first machine learning model, comprising a transformer-based machine learning model, in a first stage of a multi-stage machine learning pipeline, wherein the at least one first machine learning model generates a summary of the first incident comprising semantic and contextual information, wherein the at least one first machine learning model comprises an attention mechanism that learns contextual relations between parts of the at least the portion of the information; processing the generated summary by at least one second machine learning model in a second stage of the multi-stage machine learning pipeline, wherein the at least one second machine learning model generates a fixed-size embedding encoding the semantic and contextual information from the generated summary of the first incident; computing a set of one or more similarity scores for the first incident, wherein a given similarity score in the set is based at least in part on a comparison between the generated fixed-size embedding of the first incident and a fixed-size embedding generated by the at least one second machine learning model for one of a plurality of additional incidents in the at least one incident database; determining whether the first incident is a duplicate of at least one of the plurality of additional incidents based at least in part on the set of similarity scores; and initiating an update to one or more data records in the at least one incident database associated with at least one of the first incident and the at least one of the plurality of additional incidents based at least in part on a result of the determining; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
2. The computer-implemented method of claim 1, wherein the at least one first machine learning model comprises a bidirectional auto-regressive transformer model.
3. The computer-implemented method of claim 1, wherein the at least one second machine learning model comprises a robustly optimized bidirectional encoder representations from transformers approach model.
4. The computer-implemented method of claim 1, wherein at least one of: the at least one first machine learning model and the at least one second machine learning model is pretrained on at least one generic textual dataset and is further trained using a dataset that is specific to the at least one incident database.
5. The computer-implemented method of claim 4, wherein the dataset that is specific to the at least one incident database comprises a plurality of items, wherein each item comprises: a first portion comprising text associated with a first historical incident of the at least one incident database; a second portion comprising text associated with a second historical incident of the at least one incident database, wherein the second historical incident is a duplicate of the first historical incident; and a third portion comprising text associated with a third historical incident of the at least one incident database that is not a duplicate of the first historical incident.
6. The computer-implemented method of claim 1, wherein the first incident relates to a software defect in one or more software code repositories.
7. The computer-implemented method of claim 6, wherein the information associated with tracking the first incident comprises at least one of: a title of the first incident; one or more user comments related to the software defect; a location of the software defect within software code associated with the one or more software code repositories; one or more log messages related to the software defect; and one or more errors related to the software defect.
8. The computer-implemented method of claim 1, further comprising: performing a denoising operation that identifies one or more variables corresponding to at least one of: (i) a timestamp, (ii) an internet protocol address, (iii) a system name, and (iv) a process identifier in the information; and replacing one or more values corresponding to the one or more variables with a designated placeholder value.
9. The computer-implemented method of claim 1, wherein computing the similarity score between the first incident and a given additional incident of the plurality of additional incidents comprises: computing a cosine similarity of the embedding corresponding to the first incident and the embedding corresponding to the given additional incident.
10. The computer-implemented method of claim 1, wherein the initiating the update comprises at least one of: assigning a flag to at least one of the first incident and the at least one of the plurality of additional incidents in the at least one incident database; adding one or more comments related to at least one of the first incident and the at least one of the plurality of additional incidents in the at least one incident database; and performing a merge operation to combine the first incident and the at least one of the plurality of additional incidents in the at least one incident database.
11. A non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device: to obtain information associated with tracking at least a first incident in at least one incident database; to process at least a portion of the information by at least one first machine learning model, comprising a transformer-based machine learning model, in a first stage of a multi-stage machine learning pipeline, wherein the at least one first machine learning model generates a summary of the first incident comprising semantic and contextual information, wherein the at least one first machine learning model comprises an attention mechanism that learns contextual relations between parts of the at least the portion of the information; to process the generated summary by at least one second machine learning model in a second stage of the multi-stage machine learning pipeline, wherein the at least one second machine learning model generates a fixed-size embedding encoding the semantic and contextual information from the generated summary of the first incident; to compute a set of one or more similarity scores for the first incident, wherein a given similarity score in the set is based at least in part on a comparison between the generated fixed-size embedding of the first incident and a fixed-size embedding generated by the at least one second machine learning model for one of a plurality of additional incidents in the at least one incident database; to determine whether the first incident is a duplicate of at least one of the plurality of additional incidents based at least in part on the set of similarity scores; and to initiate an update to one or more data records in the at least one incident database associated with at least one of the first incident and the at least one of the plurality of additional incidents based at least in part on a result of the determining.
12. The non-transitory processor-readable storage medium of claim 11, wherein the at least one first machine learning model comprises a bidirectional auto-regressive transformer model.
13. The non-transitory processor-readable storage medium of claim 11, wherein the at least one second machine learning model comprises a robustly optimized bidirectional encoder representations from transformers approach model.
14. The non-transitory processor-readable storage medium of claim 11, wherein at least one of: the at least one first machine learning model and the at least one second machine learning model is pretrained on at least one generic textual dataset and is further trained using a dataset that is specific to the at least one incident database.
15. The non-transitory processor-readable storage medium of claim 14, wherein the dataset that is specific to the at least one incident database comprises a plurality of items, wherein each item comprises: a first portion comprising text associated with a first historical incident of the at least one incident database; a second portion comprising text associated with a second historical incident of the at least one incident database, wherein the second historical incident is a duplicate of the first historical incident; and a third portion comprising text associated with a third historical incident of the at least one incident database that is not a duplicate of the first historical incident.
16. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured: to obtain information associated with tracking at least a first incident in at least one incident database; to process at least a portion of the information by at least one first machine learning model, comprising a transformer-based machine learning model, in a first stage of a multi-stage machine learning pipeline, wherein the at least one first machine learning model generates a summary of the first incident comprising semantic and contextual information, wherein the at least one first machine learning model comprises an attention mechanism that learns contextual relations between parts of the at least the portion of the information; to process the generated summary by at least one second machine learning model in a second stage of the multi-stage machine learning pipeline, wherein the at least one second machine learning model generates a fixed-size embedding encoding the semantic and contextual information from the generated summary of the first incident; to compute a set of one or more similarity scores for the first incident, wherein a given similarity score in the set is based at least in part on a comparison between the generated fixed-size embedding of the first incident and a fixed-size embedding generated by the at least one second machine learning model for one of a plurality of additional incidents in the at least one incident database; to determine whether the first incident is a duplicate of at least one of the plurality of additional incidents based at least in part on the set of similarity scores; and to initiate an update to one or more data records in the at least one incident database associated with at least one of the first incident and the at least one of the plurality of additional incidents based at least in part on a result of the determining.
17. The apparatus of claim 16, wherein the at least one first machine learning model comprises a bidirectional auto-regressive transformer model.
18. The apparatus of claim 16, wherein the at least one second machine learning model comprises a robustly optimized bidirectional encoder representations from transformers approach model.
19. The apparatus of claim 16, wherein at least one of: the at least one first machine learning model and the at least one second machine learning model is pretrained on at least one generic textual dataset and is further trained using a dataset that is specific to the at least one incident database.
20. The apparatus of claim 19, wherein the dataset that is specific to the at least one incident database comprises a plurality of items, wherein each item comprises: a first portion comprising text associated with a first historical incident of the at least one incident database; a second portion comprising text associated with a second historical incident of the at least one incident database, wherein the second historical incident is a duplicate of the first historical incident; and a third portion comprising text associated with a third historical incident of the at least one incident database that is not a duplicate of the first historical incident.
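
The three-part training items recited in claims 5, 15, and 20 can be pictured as (anchor, duplicate, non-duplicate) triplets. The Python sketch below models one such item and evaluates a triplet margin loss on embeddings of its parts. The claims do not name a loss function; the triplet margin loss, the Euclidean distance, the margin of 1.0, and all class and function names are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TripletItem:
    """One training item built from three historical incidents (hypothetical type)."""
    anchor: str         # text of a first historical incident
    duplicate: str      # text of an incident known to duplicate the anchor
    non_duplicate: str  # text of an incident known not to duplicate the anchor


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same size")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(
    anchor_vec: Sequence[float],
    duplicate_vec: Sequence[float],
    non_duplicate_vec: Sequence[float],
    margin: float = 1.0,  # assumed value
) -> float:
    """Loss is zero once the duplicate is closer to the anchor than the
    non-duplicate by at least `margin`; otherwise it is the shortfall."""
    positive_distance = euclidean(anchor_vec, duplicate_vec)
    negative_distance = euclidean(anchor_vec, non_duplicate_vec)
    return max(0.0, positive_distance - negative_distance + margin)


if __name__ == "__main__":
    item = TripletItem(
        anchor="Login page times out after submit",
        duplicate="Timeout when submitting the login form",
        non_duplicate="Export to CSV drops the header row",
    )
    # Toy embeddings standing in for the output of the second model.
    print(item.anchor, triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0]))
```

Minimizing a loss of this kind pulls embeddings of duplicate incidents together and pushes non-duplicates apart, which is what a later cosine-similarity comparison relies on.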

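The denoising operation recited in claim 8 can be approximated with regular expressions. This Python sketch replaces timestamps, IPv4 addresses, and process identifiers with a placeholder. The patterns, the `<VAR>` placeholder, and the function name are hypothetical; system names are omitted because matching them would need a site-specific pattern.

```python
import re

PLACEHOLDER = "<VAR>"  # hypothetical placeholder value

# (pattern, replacement) pairs applied in order; all patterns are assumptions.
_RULES = [
    # ISO-8601-like timestamps, e.g. 2023-09-29T14:05:11Z or 2023-09-29 14:05:11.123
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), PLACEHOLDER),
    # Dotted-quad IPv4 addresses, e.g. 10.0.0.17 (octet ranges are not validated)
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), PLACEHOLDER),
    # Process identifiers written as pid=1234, pid:1234, or pid 1234;
    # the "pid=" prefix is kept and only the number is replaced.
    (re.compile(r"\b(pid[=: ])\d+", re.IGNORECASE), r"\g<1>" + PLACEHOLDER),
]


def denoise(text: str) -> str:
    """Replace variable values in incident text with a fixed placeholder."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    line = "2023-09-29T14:05:11Z worker pid=4312 lost connection to 10.0.0.17"
    print(denoise(line))
```

After denoising, two log lines that differ only in these values reduce to the same string, so the downstream models see the shared structure rather than the incidental values.
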
Description

BACKGROUND

Issue tracking systems generally refer to systems that can manage and maintain information related to issues or other incidents. For example, such systems are often used to track software errors and/or flaws in software development projects.

SUMMARY

Illustrative embodiments of the disclosure provide techniques for detecting duplicate incidents using machine learning techniques. An exemplary computer-implemented method includes obtaining information associated with tracking at least a first incident in at least one incident database and generating a summary of the first incident by processing at least a portion of the information using at least one first machine learning model. The method additionally includes generating an embedding of the first incident by processing the generated summary using at least one second machine learning model and computing a set of one or more similarity scores for the first incident, where a given similarity score in the set is based at least in part on a comparison between the generated embedding of the first incident and an embedding generated for one of a plurality of additional incidents in the at least one incident database. The method also includes determining whether the first incident is a duplicate of at least one of the plurality of additional incidents based at least in part on the set of similarity scores and initiating an update to one or more data records in the at least one incident database associated with at least one of the first incident and the at least one of the plurality of additional incidents based at least in part on a result of the determining.

Illustrative embodiments can provide significant advantages relative to conventional incident detection techniques. For example, technical problems associated with detecting duplicate incidents are mitigated in one or more embodiments by automatically identifying duplicate incidents in an incident database by generating summaries and embeddings for such incidents at respective stages of a machine learning framework and then comparing such embeddings to identify duplicate embeddings. Such embodiments can help reduce the amount of time and resources that are needed to detect and/or prevent duplicate incidents, for example.

These and other illustrative embodiments described herein include, without limitation, methods, apparatus, systems, and computer program products comprising processor-readable storage media.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an information processing system configured for detecting duplicate incidents using machine learning techniques in an illustrative embodiment.
FIG. 2 shows an example of a process for collecting and preprocessing incident data in an illustrative embodiment.
FIG. 3 shows an example of a machine learning pipeline for detecting duplicate incidents in an illustrative embodiment.
FIG. 4 shows an example of an integration framework in an illustrative embodiment.
FIG. 5 shows a flow diagram of a process for detecting duplicate incidents using machine learning techniques in an illustrative embodiment.
FIGS. 6 and 7 show examples of processing platforms that may be utilized to implement at least a portion of an information processing system in illustrative embodiments.

DETAILED DESCRIPTION

Illustrative embodiments will be described herein with reference to exemplary computer networks and associated computers, servers, network devices or other types of processing devices. It is to be appreciated, however, that these and other embodiments are not restricted to use with the particular illustrative network and device configurations shown. Accordingly, the term “computer network” as used herein is intended to be broadly construed, so as to encompass, for example, any system comprising multiple networked processing devices.

Effectively tracking incidents for complex projects (e.g., software projects) can be technically challenging. As an example, for a given software project, many developers can submit code to one or more code databases. During the development process, the submitted code can include errors, which can be tracked using incident or ticket tracking systems, for example. A significant portion of the tracked incidents are duplicate incidents. In this context and elsewhere herein, the term “duplicates” is intended to be broadly construed so as to encompass two or more data structures that are created for a same or a substantially similar issue or other incident. Such data structures can comprise information identifying and/or describing an issue or other incident (e.g., title, one or more descriptions, comments, and/or other information related to tracking such incidents).

Identifying and managing duplicate incidents can be time consuming and inefficient. For example, if two or more tickets are created for a single software error, then multiple developers may be assigned to triage (e.g., evaluate) the same software error. Additional computing resources ar