US-12626017-B2 - Systems and methods for de-identifying healthcare data in a health analysis platform
Abstract
Systems and methods for improving data security in a computerized health analysis platform are disclosed. For instance, a method includes (i) reading clinical study data sets regarding one or more clinical studies from a hardware storage device, (ii) executing a first computer-executable de-identification function with respect to direct identifiers included in each of the clinical study data sets to generate modified clinical study data sets, (iii) generating a standardized data set based on the modified clinical study data sets, (iv) executing a second computer-executable de-identification function with respect to a clinical study information portion included in the standardized data set to generate a modified standardized data set, (v) generating a first data structure representing the modified standardized data set, and (vi) storing the first data structure in the hardware storage device.
Inventors
- Krishnamoorthy Chinnathambu
- Bhargav Koduru
- Luke Morgan
- Trey Moore
Assignees
- MEDIDATA SOLUTIONS, INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20241018
Claims (20)
- 1. A system for improving data security in a computerized health analysis platform, the system comprising: a hardware storage device; at least one processor; and a memory subsystem communicatively coupled to the at least one processor, the memory subsystem storing instructions which, when executed by the at least one processor, cause the at least one processor to perform operations comprising: reading, from the hardware storage device, a plurality of clinical study data sets regarding one or more clinical studies, wherein each of the clinical study data sets comprises: one or more first data fields storing one or more first values representing identification information associated with the one or more clinical studies and a plurality of entities participating in the one or more clinical studies, and one or more second data fields storing one or more second values representing clinical study information gathered during the one or more clinical studies; executing a first computer executable de-identification function with respect to each of the plurality of clinical study data sets to generate a plurality of modified clinical study data sets, wherein executing the first computer executable de-identification function masks the first one or more first values in the plurality of clinical study data sets, and wherein executing the first computer executable de-identification function comprises modifying the one or more first values in the plurality of clinical study data sets according to a first set of rules to generate a de-identified representation of the identification information; generating a standardized data set based on the plurality of modified clinical study data sets, wherein generating the standardized data set comprises: concatenating the modified clinical study data sets into the standardized data set, and at least one of: remapping at least one of the first data fields in the standardized data set to a respective standardized first data field, remapping at least one of the second fields in the standardized data set to a respective standardized second data field, remapping at least one of the first values in the standardized data set to a respective standardized first value, or remapping at least one of the second values in the standardized data set to a respective standardized second value; executing a second computer-executable de-identification function with respect to the standardized data set to generate a modified standardized data set, wherein executing the second computer executable de-identification function masks at least one of a study design, a data collection, or treatment information of the one or more clinical studies, and wherein executing the second computer-executable de-identification function comprises modifying at least a portion of the standardized data set according to a second set of rules to generate a de-identified representation of the clinical study information; generating a first data structure representing the modified standardized data set; and storing the first data structure using the hardware storage device.
- 2. The system of claim 1, wherein the operations comprise providing the first data structure to a computerized health analysis platform.
- 3. The system of claim 1, wherein the identification information comprises at least one of: demographic information regarding the entities, identifiers associated with at least one of the entities, the one or more clinical studies, or one or more parties associated with the one or more clinical studies, location information regarding the one or more clinical studies, treatment information regarding at least one of a drug, a product, or a therapy used in the one or more clinical studies, or date information associated with the one or more clinical studies.
- 4. The system of claim 3, wherein the first set of rules comprises: replacing, in the plurality of clinical study data sets, at least one of the demographic information, the identifiers, the location information, or the treatment information with one or more alphanumeric characters or symbols.
- 5. The system of claim 3, wherein the first set of rules comprises: determining that one or more first identifiers are associated with one or more of the entities, determining that one or more second identifiers are associated with the one or more parties associated with the one or more clinical studies, wherein the one or more parties provide at least one of the drug, the product, or the therapy used in the one or more clinical studies, removing the one or more first identifiers from the plurality of clinical study data sets, and replacing the one or more second identifiers in the plurality of clinical study data sets with one or more alphanumeric characters or symbols.
- 6. The system of claim 3, wherein the first set of rules comprises: determining that the date information comprises calendar dates, and replacing the calendar dates in the plurality of clinical study data sets with relative dates, wherein each of the relative dates represents a respective time offset relative to a pre-determined event.
- 7. The system of claim 6, wherein the pre-determined event corresponds to a start date of clinical study participation of a respective one of the entities according to the one or more clinical studies.
- 8. The system of claim 1, wherein at least some of the clinical study information is represented by naming conventions for at least one of (i) visit information, (ii) dosing phase, (iii) tests or examinations, (iv) treatment names, (v) treatment groups that are associated with the plurality of entities, (vi) specific time point references regarding patients being subject to product or the test or examinations, or (vii) categories of the tests or examinations.
- 9. The system of claim 8, wherein the second set of rules comprises: modifying, in the standardized data set, the naming conventions for at least one of (i) the visit information, (ii) the dosing phase, (iii) the tests or examinations, (iv) the treatment names, (v) the treatment groups, (vi) the specific time point references regarding patients being subject to product or the test or examinations, or (vii) the categories of the tests or examinations.
- 10. The system of claim 1, wherein the standardized data set comprises a plurality of groups associated with (i) the plurality of entities that share one or more common criteria or traits and (ii) the one or more second values in the standardized data set, and wherein executing the computer-executable second de-identification function comprises: performing data aggregation by (i) combining the one or more second values from two or more of the plurality of groups and (ii) generating one or more larger, less specific groups based on one or more additional common criteria or traits that are shared by the one or more larger, less specific groups.
- 11. The system of claim 1, wherein the standardized data set comprises a plurality of groups associated with (i) the plurality of entities that share one or more common criteria or traits and (ii) the one or more second values in the standardized data set, and wherein executing the computer-executable second de-identification function comprises: performing data cohorting by (i) generating a plurality of additional groups that exceed a number of the plurality of groups and (ii) regrouping the plurality of entities and the one or more second values in the standardized data set into the plurality of additional groups.
- 12. A method comprising: reading, by an electronic device, a plurality of clinical study data sets regarding one or more clinical studies from a hardware storage device, wherein each of the clinical study data sets comprises: one or more first data fields storing one or more first values representing identification information associated with the one or more clinical studies and a plurality of entities participating in the one or more clinical studies, and one or more second data fields storing one or more second values representing clinical study information gathered during the one or more clinical studies; executing, by the electronic device, a first computer executable de-identification function with respect to each of the plurality of clinical study data sets to generate a plurality of modified clinical study data sets, wherein executing the first computer executable de-identification function masks the first one or more first values in the plurality of clinical study data sets, and wherein executing the first computer executable de-identification function comprises modifying the one or more first values in the plurality of clinical study data sets according to a first set of rules to generate a de-identified representation of the identification information; generating, by the electronic device, a standardized data set based on the plurality of modified clinical study data sets, wherein generating the standardized data set comprises: concatenating the modified clinical study data sets into the standardized data set, and at least one of: remapping at least one of the first data fields in the standardized data set to a respective standardized first data field, remapping at least one of the second fields in the standardized data set to a respective standardized second data field, remapping at least one of the first values in the standardized data set to a respective standardized first value, or remapping at least one of the second values in the standardized data set to a respective standardized second value; executing, by the electronic device, a second computer-executable de-identification function with respect to the standardized data set to generate a modified standardized data set, wherein executing the second computer executable de-identification function masks at least one of a study design, a data collection, or treatment information of the one or more clinical studies, and wherein executing the second computer-executable de-identification function comprises modifying at least a portion of the standardized data set according to a second set of rules to generate a de-identified representation of the clinical study information; generating, by the electronic device, a first data structure representing the modified standardized data set; and storing, by the electronic device, the first data structure in the hardware storage device.
- 13. The method of claim 12, comprising: providing, by the electronic device, the first data structure to a computerized health analysis platform.
- 14. The method of claim 12, wherein the identification information comprises at least one of: demographic information regarding the entities, identifiers associated with at least one of the entities, the one or more clinical studies, or one or more parties associated with the one or more clinical studies, location information regarding the one or more clinical studies, treatment information regarding at least one of a drug, a product, or a therapy used in the one or more clinical studies, or date information associated with the one or more clinical studies.
- 15. The method of claim 14, wherein the first set of rules comprises: replacing, in the plurality of clinical study data sets, at least one of the demographic information, the identifiers, the location information, or the treatment information with one or more alphanumeric characters or symbols.
- 16. The method of claim 12, wherein at least some of the clinical study information is represented by naming conventions for at least one of (i) visit information, (ii) dosing phase, (iii) tests or examinations, (iv) treatment names, (v) treatment groups that are associated with the plurality of entities, (vi) specific time point references regarding patients being subject to product or the test or examinations, or (vii) categories of the tests or examinations.
- 17. The method of claim 16, wherein the second set of rules comprises: modifying, in the standardized data set, the naming conventions for at least one of (i) the visit information, (ii) the dosing phase, (iii) the tests or examinations, (iv) the treatment names, (v) the treatment groups, (vi) the specific time point references regarding patients being subject to product or the test or examinations, or (vii) the categories of the tests or examinations.
- 18. The method of claim 12, wherein the standardized data set comprises a plurality of groups associated with (i) the plurality of entities that share one or more common criteria or traits and (ii) the one or more second values in the standardized data set, and wherein executing the computer-executable second de-identification function comprises: performing data aggregation by (i) combining the one or more second values from two or more of the plurality of groups and (ii) generating one or more larger, less specific groups based on one or more additional common criteria or traits that are shared by the one or more larger, less specific groups.
- 19. The method of claim 12, wherein the standardized data set comprises a plurality of groups associated with (i) the plurality of entities that share one or more common criteria or traits and (ii) the one or more second values in the standardized data set, and wherein executing the computer-executable second de-identification function comprises: performing data cohorting by (i) generating a plurality of additional groups that exceed a number of the plurality of groups and (ii) regrouping the plurality of entities and the one or more second values in the standardized data set into the plurality of additional groups.
- 20. One or more non-transitory computer-readable media storing instructions which, when executed by at least one processor, cause the at least one processor to perform; reading a plurality of clinical study data sets regarding one or more clinical studies from a hardware storage device, wherein each of the clinical study data sets comprises: one or more first data fields storing one or more first values representing identification information associated with the one or more clinical studies and a plurality of entities participating in the one or more clinical studies, and one or more second data fields storing one or more second values representing clinical study information gathered during the one or more clinical studies; executing a first computer executable de-identification function with respect to each of the plurality of clinical study data sets to generate a plurality of modified clinical study data sets, wherein executing the first computer executable de-identification function masks the first one or more first values in the plurality of clinical study data sets, and wherein executing the first computer executable de-identification function comprises modifying the one or more first values in the plurality of clinical study data sets according to a first set of rules to generate a de-identified representation of the identification information; generating a standardized data set based on the plurality of modified clinical study data sets, wherein generating the standardized data set comprises: concatenating the modified clinical study data sets into the standardized data set, and at least one of: remapping at least one of the first data fields in the standardized data set to a respective standardized first data field, remapping at least one of the second fields in the standardized data set to a respective standardized second data field, remapping at least one of the first values in the standardized data set to a respective standardized first value, or remapping at least one of the second values in the standardized data set to a respective standardized second value; executing a second computer-executable de-identification function with respect to the standardized data set to generate a modified standardized data set, wherein executing the second computer executable de-identification function masks at least one of a study design, a data collection, or treatment information of the one or more clinical studies, and wherein executing the second computer-executable de-identification function comprises modifying at least a portion of the standardized data set according to a second set of rules to generate a de-identified representation of the clinical study information; generating a first data structure representing the modified standardized data set; and storing the first data structure in the hardware storage device.
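The relative-date rule recited in claims 6 and 7 (calendar dates replaced by time offsets from each entity's start of study participation) can be illustrated with a minimal Python sketch. The record layout, the field names `subject`, `visit_date`, and `visit_day`, and the choice of whole days as the offset unit are assumptions made for illustration; the claims do not specify them.

```python
from datetime import date

def to_relative_days(records, start_dates):
    """Replace each calendar date with a day offset from the subject's start date.

    `records` and `start_dates` use hypothetical field names; the offset unit
    (whole days) is likewise an assumption.
    """
    relative = []
    for rec in records:
        anchor = start_dates[rec["subject"]]          # pre-determined event
        offset = (rec["visit_date"] - anchor).days    # timedelta -> whole days
        # Copy every field except the calendar date, then add the offset.
        new_rec = {k: v for k, v in rec.items() if k != "visit_date"}
        new_rec["visit_day"] = offset
        relative.append(new_rec)
    return relative

records = [
    {"subject": "S1", "visit_date": date(2024, 3, 1), "result": 5.1},
    {"subject": "S1", "visit_date": date(2024, 3, 15), "result": 4.8},
    {"subject": "S2", "visit_date": date(2024, 6, 10), "result": 6.0},
]
start_dates = {"S1": date(2024, 3, 1), "S2": date(2024, 6, 3)}

relative = to_relative_days(records, start_dates)
print([r["visit_day"] for r in relative])  # [0, 14, 7]
```

Intervals between a subject's visits are preserved, while the calendar dates that could help place a subject in time no longer appear in the output.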
Description
TECHNICAL FIELD
This description generally relates to systems and methods for de-identifying healthcare data in a health analysis platform by assessing and modifying physiological measurements for filtered healthcare data.
BACKGROUND
In general, maintaining privacy in healthcare data, including clinical trial data, is important for safeguarding patient information and preventing unauthorized access to sensitive data. Privacy breaches can lead to harmful consequences, including unauthorized use and distribution of personal health information. To mitigate these risks, de-identification or data masking techniques may be needed to protect and maintain privacy in the healthcare data.
SUMMARY
Implementations according to this disclosure include a system for improving data security in a computerized health analysis platform. The system includes a hardware storage device, at least one processor, and a memory subsystem communicatively coupled to the at least one processor. The memory subsystem stores instructions which, when executed by the at least one processor, cause the at least one processor to perform operations comprising: (i) reading, from the hardware storage device, clinical study data sets regarding one or more clinical studies; (ii) executing a first computer-executable de-identification function with respect to each of the clinical study data sets to generate modified clinical study data sets; (iii) generating a standardized data set based on the modified clinical study data sets; (iv) executing a second computer-executable de-identification function with respect to the standardized data set to generate a modified standardized data set; (v) generating a first data structure representing the modified standardized data set; and (vi) storing the first data structure using the hardware storage device.
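The ordering of operations (i) through (vi) can be sketched as a small Python driver. This is only an illustration under stated assumptions: JSON files stand in for the hardware storage device, the function name `run_pipeline` and the stage functions `drop_names`, `concatenate`, and `generic_arms` are hypothetical, and the stages are deliberately trivial placeholders rather than the rule sets the disclosure describes.

```python
import json
import tempfile
from pathlib import Path

def run_pipeline(in_paths, out_path, first_deid, standardize, second_deid):
    """Run steps (i)-(vi) in order; the three stage functions are caller-supplied."""
    raw = [json.loads(Path(p).read_text()) for p in in_paths]     # (i) read
    modified = [first_deid(ds) for ds in raw]                      # (ii) first de-identification
    standardized = standardize(modified)                           # (iii) standardize
    masked = second_deid(standardized)                             # (iv) second de-identification
    structure = {"record_count": len(masked), "records": masked}  # (v) build data structure
    Path(out_path).write_text(json.dumps(structure))               # (vi) store
    return structure

# Placeholder stages (assumptions for the demo, not the disclosed rules).
def drop_names(data_set):
    return [{k: v for k, v in row.items() if k != "name"} for row in data_set]

def concatenate(data_sets):
    return [row for ds in data_sets for row in ds]

def generic_arms(rows):
    # Replace treatment-arm names with generic labels in sorted order.
    arms = sorted({row["arm"] for row in rows})
    return [{**row, "arm": f"ARM_{arms.index(row['arm']) + 1}"} for row in rows]

with tempfile.TemporaryDirectory() as tmp:
    a, b, out = Path(tmp, "a.json"), Path(tmp, "b.json"), Path(tmp, "out.json")
    a.write_text(json.dumps([{"name": "Ann", "arm": "DrugX 10mg", "hr": 71}]))
    b.write_text(json.dumps([{"name": "Bo", "arm": "Placebo", "hr": 64},
                             {"name": "Cy", "arm": "DrugX 10mg", "hr": 80}]))
    result = run_pipeline([a, b], out, drop_names, concatenate, generic_arms)
    stored = json.loads(out.read_text())

print(result["records"])
```

Passing the stages in as functions keeps the two de-identification passes separate: the first runs on each study's data set individually, and the second runs once on the combined, standardized data.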
Each of the clinical study data sets includes (i) one or more first data fields storing one or more first values representing identification information associated with the one or more clinical studies and entities participating in the one or more clinical studies, and (ii) one or more second data fields storing one or more second values representing clinical study information gathered during the one or more clinical studies. Executing the first computer-executable de-identification function masks the one or more first values in the clinical study data sets. Executing the first computer-executable de-identification function includes modifying the one or more first values in the clinical study data sets according to a first set of rules to generate a de-identified representation of the identification information. Generating the standardized data set includes: concatenating the modified clinical study data sets into the standardized data set; and at least one of: (i) remapping at least one of the first data fields in the standardized data set to a respective standardized first data field, (ii) remapping at least one of the second data fields in the standardized data set to a respective standardized second data field, (iii) remapping at least one of the first values in the standardized data set to a respective standardized first value, or (iv) remapping at least one of the second values in the standardized data set to a respective standardized second value. Executing the second computer-executable de-identification function masks at least one of a study design, a data collection, or treatment information of the one or more clinical studies. Executing the second computer-executable de-identification function includes modifying at least a portion of the standardized data set according to a second set of rules to generate a de-identified representation of the clinical study information.
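The standardization step described above (concatenation followed by field and value remapping) might look like the following Python sketch. The mapping tables `FIELD_MAP` and `VALUE_MAP`, the standardized names `SEX` and `HEART_RATE`, and the row-as-dictionary layout are all assumptions chosen for illustration; the disclosure does not name a particular target vocabulary.

```python
# Hypothetical mapping tables: source field -> standardized field, and
# standardized field -> {source value -> standardized value}.
FIELD_MAP = {"sex": "SEX", "gender": "SEX", "hr": "HEART_RATE", "pulse": "HEART_RATE"}
VALUE_MAP = {"SEX": {"m": "M", "male": "M", "f": "F", "female": "F"}}

def standardize(modified_data_sets):
    """Concatenate per-study data sets, then remap fields and values."""
    # Step 1: concatenate the modified data sets into one list of rows.
    combined = [row for data_set in modified_data_sets for row in data_set]

    # Step 2: remap field names and, where a table exists, field values.
    standardized = []
    for row in combined:
        new_row = {}
        for field, value in row.items():
            std_field = FIELD_MAP.get(field, field)       # unmapped fields pass through
            table = VALUE_MAP.get(std_field, {})
            key = value.lower() if isinstance(value, str) else value
            new_row[std_field] = table.get(key, value)    # unmapped values pass through
        standardized.append(new_row)
    return standardized

study_a = [{"sex": "male", "hr": 71}]
study_b = [{"gender": "F", "pulse": 64}, {"gender": "m", "pulse": 80}]

result = standardize([study_a, study_b])
print(result)
# [{'SEX': 'M', 'HEART_RATE': 71}, {'SEX': 'F', 'HEART_RATE': 64}, {'SEX': 'M', 'HEART_RATE': 80}]
```

Once every study uses the same field names and value codes, a single second-stage rule set can be applied to the whole standardized data set instead of one rule set per study.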
Implementations according to this disclosure include a method for improving data security in a computerized health analysis platform. The method includes: (i) reading clinical study data sets regarding one or more clinical studies from a hardware storage device; (ii) executing a first computer-executable de-identification function with respect to direct identifiers included in each of the clinical study data sets to generate modified clinical study data sets; (iii) generating a standardized data set based on the modified clinical study data sets; (iv) executing a second computer-executable de-identification function with respect to a clinical study information portion included in the standardized data set to generate a modified standardized data set; (v) generating a first data structure representing the modified standardized data set; and (vi) storing the first data structure in the hardware storage device. Each of the clinical study data sets includes (i) one or more first data fields storing one or more first values representing identification information associated with the one or more clinical studies and entities participating in the one or more clinical studies, and (ii) one or more second data fields storing one or more second v