EP-4738374-A1 - METHOD, SYSTEM AND COMPUTER NETWORK FOR IDENTIFYING PERSONAL DATA IN A HEALTHCARE DATA STREAM USING A PRE-TRAINED MACHINE LEARNING MODEL

Abstract

A computer-implemented method of processing healthcare data is disclosed. The method comprises steps of: receiving a healthcare data stream, the data stream including technical data; screening the received data stream to identify potential inclusions of personal data, using a pre-trained machine learning model, within the received data stream; and upon identifying the potential inclusion of personal data within the received data stream, performing a mitigation action with respect to the potential personal data identified in the received data stream. The pre-trained machine learning model has been trained only on technical data so as to identify technical data in data streams, and so to identify potential personal data as anomalous data within the received data stream.

Inventors

DE LUCA, DOMENICO
Taeymans, Bert

Assignees

Roche Diagnostics International AG

Dates

Publication Date: 2026-05-06
Application Date: 2024-10-30

Claims (15)

1. A computer-implemented method of processing healthcare data, comprising steps of: receiving a healthcare data stream, the data stream including technical data; screening the received data stream to identify potential inclusions of personal data, using a pre-trained machine learning model, within the received data stream; and upon identifying the potential inclusion of personal data within the received data stream, performing a mitigation action with respect to the potential personal data identified in the received data stream; wherein the pre-trained machine learning model has been trained only on technical data so as to identify technical data in data streams, and so identify potential personal data as anomalous data within the received data stream.
2. The computer-implemented method of claim 1, wherein the pre-trained machine learning model has an architecture size parameter which is below a predetermined threshold, wherein the predetermined threshold is determined based on an amount of computing resource available.
3. The computer-implemented method of claim 1 or claim 2, the method further comprising a step of storing the stream of processed data in a database when no potential inclusions of personal data have been identified.
4. The computer-implemented method of any preceding claim, wherein the mitigation action includes any one or more of: removing and/or anonymizing any identified personal data; sending an alert to a data management component; and generating synthetic technical data based on at least the received technical data.
5. The computer-implemented method of any preceding claim, further comprising a step, performed before the mitigation action, of performing a secondary validation to identify personal data.
6. The computer-implemented method of claim 5, wherein the secondary validation includes regular expression checking for personal data, detecting anomalies, and/or verifying data integrity.
7. The computer-implemented method of any preceding claim, wherein the data stream is provided via an API exposure layer, which interfaces with the pre-trained machine learning model for screening.
8. The computer-implemented method of any preceding claim, including an initial validation step performed before screening the received data, the initial validation step checking the format of the data stream and/or the integrity of the data stream.
9. The computer-implemented method of any preceding claim, further comprising steps of obtaining feedback on the identification of potential inclusions of personal data; and adjusting the pre-trained machine learning model based on the obtained feedback.
10. The computer-implemented method of any preceding claim, wherein the pre-trained machine learning model is a large language model.
11. The computer-implemented method of any preceding claim, wherein the data stream is an encrypted data stream, which is decrypted during the screening process and re-encrypted before storage or subsequent retransmission.
12. A data monitoring component, including one or more processors and memory, the memory containing machine executable instructions which, when executed on the processor(s), cause the processor(s) to: receive a healthcare data stream, the data stream including technical data; screen the received data stream to identify potential inclusions of personal data, using a pre-trained machine learning model, within the received data stream; and upon identifying the potential inclusion of personal data within the received data stream, perform a mitigation action with respect to the potential personal data identified in the received data stream; wherein the pre-trained machine learning model has been trained only on technical data so as to identify technical data in data streams, and so identify potential personal data as anomalous data within the received data stream.
13. A computer network including: the data monitoring component of claim 12; a healthcare data source, connected to the data monitoring component and which provides the healthcare data stream to the data monitoring component; and a database; wherein the data monitoring component is configured to transmit the stream of data to the database when no potential inclusions of personal data are identified.
14. The computer network of claim 13 further comprising: a synthetic data generator, which receives at least a part of the data stream from the data monitoring component, and is configured to generate synthetic technical data based on the received at least a part of the data stream.
15. The computer network of claim 13 or 14, further comprising a secondary validation module, connected to the data monitoring component, and configured to perform a secondary validation to identify personal data.
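
By way of illustration only, the flow of claims 1 and 3 to 6 (screen, store when clean, otherwise validate and mitigate) might be sketched in Python as follows. This is a minimal sketch and not the claimed implementation: the function names, the two regular expressions, the `[REDACTED]` placeholder, the choice to withhold flagged but unconfirmed messages, and the keyword-based `toy_screen` (a stand-in for the pre-trained machine learning model) are all assumptions made for the example.

```python
import re
from typing import Callable, List, Optional

# Hypothetical patterns for the regular-expression check; a real deployment
# would need a much broader, locale-aware rule set.
PERSONAL_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "patient_name": re.compile(r"\bpatient[_ ]?name\s*[=:]\s*\S+", re.IGNORECASE),
}


def secondary_validation(text: str) -> List[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pat in PERSONAL_PATTERNS.items() if pat.search(text)]


def anonymize(text: str) -> str:
    """Mitigation: replace every pattern match with a fixed placeholder."""
    for pat in PERSONAL_PATTERNS.values():
        text = pat.sub("[REDACTED]", text)
    return text


def toy_screen(message: str) -> bool:
    """Stand-in for the pre-trained model: True means 'possibly personal'."""
    return "@" in message or "patient" in message.lower()


def process_message(
    message: str,
    screen: Callable[[str], bool],
    database: List[str],
    alerts: List[List[str]],
) -> Optional[str]:
    """Screen one message; store it if clean, otherwise validate and mitigate."""
    if not screen(message):
        database.append(message)  # stored only when nothing is flagged
        return message
    hits = secondary_validation(message)  # performed before the mitigation
    alerts.append(hits)  # the alert carries pattern names, not the data itself
    if not hits:
        return None  # flagged but unconfirmed: withheld (assumption of this sketch)
    return anonymize(message)
```

Passing the screening function as a parameter keeps the sketch independent of any particular model; a message such as `device_id=A17 reagent_level=0.82` would be stored unchanged, whereas a message containing an e-mail address would be redacted and reported.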

Description

TECHNICAL FIELD

The present invention relates to a method, a system, and a computer network.

BACKGROUND

In diagnostic healthcare, the extensive processing of technical and personal data used to enhance diagnoses often leads to inadvertent breaches of privacy due to the inclusion of personal information in datasets intended for technical use only. The distinction between personal and technical data is important, yet challenging. This is compounded in fast-paced development environments focused on innovation and minimum viable product deployment, where control mechanisms might be bypassed.

The present invention was arrived at in light of the above considerations.

SUMMARY

Accordingly, in a first aspect, embodiments of the invention provide a computer-implemented method of processing healthcare data, comprising steps of: receiving a healthcare data stream, the data stream including technical data; screening the received data stream to identify potential inclusions of personal data, using a pre-trained machine learning model, within the received data stream; and upon identifying the potential inclusion of personal data within the received data stream, performing a mitigation action with respect to the potential personal data identified in the received data stream; wherein the pre-trained machine learning model has been trained only on technical data to identify potential personal data as anomalous data within the received data stream.

Such a method can advantageously improve the quality of the provided technical data by identifying and mitigating against the inclusion of personal data, which also improves privacy compliance.

A healthcare data stream is data collected from one or more healthcare data sources, for example a laboratory information system, hospital management system, in-vitro diagnostic instrument, point-of-care device, or laboratory middleware.
The healthcare data stream includes, as technical data, technical values pertaining to the source of data (for example reagent levels in an instrument, etc.). It may also include personal data, such as name, date of birth, location etc. This personal data may have been removed by the source or by the healthcare data system connected to the source, but in some instances may remain (for example due to failure to identify it). The healthcare data stream may be received from a healthcare data system, which may be external to the entity performing the screening.

By stream of data is meant a continuously received data stream comprising individual elements of data. The elements of data may be contained within a message sent from the healthcare data source, and so the data stream may be a series of received messages. The screening of the data stream can therefore be performed per element of data (or per message), in that the pre-trained machine learning model may interrogate each element of data individually. In other examples, the messages may be grouped, or each message may contain plural elements of data, for example a data record. A data record may comprise a group of data elements, for example a clinical entry with the results of a panel of blood tests, or a parameterised radiographic image.

The pre-trained machine learning model may have an architecture size parameter which is below a predetermined threshold, wherein the predetermined threshold is determined based on an amount of computing resource available. The pre-trained machine learning model may have undergone a portability process after training but before deployment, to reduce the size of the model (e.g., to reduce the number of parameters, depth of the neural network, number of nodes, cache size etc.).
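
By way of illustration only, a size threshold derived from the available computing resource might be computed as in the following Python sketch. It is a rough, weights-only estimate under stated assumptions: the names `ResourceBudget`, `max_parameters` and `fits` are hypothetical, each parameter is assumed to occupy 4 bytes (32-bit floating point), and memory used by activations, tokenisers or the runtime is ignored. The figures in the usage note (16 GB of RAM, a 5% share, 66 million parameters) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ResourceBudget:
    """Memory available on the device and the share the model may occupy."""

    total_ram_bytes: int
    model_percent: int  # whole-number percentage of RAM reserved for the model

    @property
    def model_bytes(self) -> int:
        # Integer arithmetic avoids floating-point rounding in the budget.
        return self.total_ram_bytes * self.model_percent // 100


def max_parameters(budget: ResourceBudget, bytes_per_param: int = 4) -> int:
    """Largest parameter count whose weights alone fit in the model budget."""
    return budget.model_bytes // bytes_per_param


def fits(n_params: int, budget: ResourceBudget, bytes_per_param: int = 4) -> bool:
    """True if a model with n_params parameters is within the size threshold."""
    return n_params <= max_parameters(budget, bytes_per_param)
```

For example, `ResourceBudget(total_ram_bytes=16 * 10**9, model_percent=5)` gives a budget of 800 MB, which under the 4-byte assumption corresponds to a threshold of 200 million parameters, so a 66-million-parameter model would pass `fits`. Halving `bytes_per_param` (for example through 16-bit weights) doubles the threshold, which is one way a size-reduction step after training could bring a model within budget.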
The pre-trained machine learning model may have no more than 210 million parameters, no more than 100 million parameters, no more than 90 million parameters, no more than 80 million parameters, no more than 70 million parameters, or no more than 66 million parameters. The amount of computing resource may be an amount of memory, for example RAM, and may be no more than 16 GB, no more than 32 GB, or no more than 64 GB. The pre-trained machine learning model should be configured so as to utilise no more than around 5% of the available RAM (e.g., around 800 MB where 16 GB of RAM is available).

The pre-trained machine learning model may have been trained on a dataset of at least 300,000 and/or no more than 700,000 tagged data points. The training data is technical data which, as mentioned above, encompasses only technical values pertaining to the source of data, for example a device ID or a timestamp value for the message. An example of entries in the dataset is shown in Annex A.

The step of screening the received data may be performed on an edge device, for example on an in-vitro diagnostic analyser, a point-of-care device, a laboratory middleware, a laboratory information management system, or a desktop computer. The method may be performed by a data monitoring component. The data monitoring component may be on the edge device, may be located in a cloud computing environment, or