US-20260128128-A1 - SYSTEMS FOR ASSESSING AND IMPROVING THE QUALITY OF MULTIPLEX MOLECULAR ASSAYS

Abstract

A method of identifying extant proteins, including (a) inputting to a computer processor: (i) a plurality of empirical binding profiles, individual empirical binding profiles including empirical binding outcomes for binding of an extant protein to a plurality of different affinity reagents, (ii) a plurality of candidate outcome profiles, individual candidate outcome profiles including binding outcomes for binding of a candidate protein to the plurality of different affinity reagents, and (iii) a plurality of pseudo outcome profiles, individual pseudo outcome profiles including a rearrangement of a candidate outcome profile; (b) performing a process in the computer processor to identify extant proteins based on the empirical binding profiles of the extant proteins and the plurality of candidate outcome profiles; and (c) performing a process in the computer processor to determine a false discovery statistic for the extant proteins based on the plurality of pseudo outcome profiles.
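
As a rough illustration of items (iii) and (c) above, the sketch below builds a pseudo outcome profile by rearranging a candidate outcome profile and then computes one possible false discovery statistic. The function names, the use of a random shuffle as the rearrangement, the score threshold, and the ratio-based estimator are all assumptions made for illustration; the abstract does not specify how the statistic is calculated.

```python
import random


def make_pseudo_profile(candidate_profile, rng):
    """Return a rearranged (shuffled) copy of a candidate outcome profile.

    The input profile is left unmodified. Shuffling is only one possible
    rearrangement and is assumed here for illustration.
    """
    pseudo = list(candidate_profile)
    rng.shuffle(pseudo)
    return pseudo


def false_discovery_estimate(candidate_scores, pseudo_scores, threshold):
    """Hypothetical decoy-style estimate of a false discovery statistic.

    candidate_scores: best match score of each extant protein against the
        candidate outcome profiles.
    pseudo_scores: best match score of each extant protein against the
        pseudo outcome profiles.
    Returns (pseudo matches at or above threshold) divided by
    (candidate matches at or above threshold), capped at 1.0.
    """
    n_candidate = sum(score >= threshold for score in candidate_scores)
    n_pseudo = sum(score >= threshold for score in pseudo_scores)
    if n_candidate == 0:
        return 0.0
    return min(1.0, n_pseudo / n_candidate)
```

A fixed seed, for example `random.Random(0)`, keeps the rearrangement reproducible between runs.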

Inventors

Jarrett D. EGERTSON
James Sherman
Vadim Lobanov
Parag Mallick
James Henry Joly

Assignees

NAUTILUS SUBSIDIARY, INC.

Dates

Publication Date: 20260507
Application Date: 20251107

Claims (20)

1. A protein characterization system, comprising: (a) a solid support having attached proteins from a sample, wherein each of the attached proteins from the sample are at individual addresses that are spatially separated from each other; (b) a fluidic system configured to apply a set of different affinity reagents to the attached proteins of the solid support; (c) a detector configured to detect binding of each of the different affinity reagents to the attached proteins, thereby generating an empirical binding profile representing binding outcomes for each of the attached proteins at the individual addresses with each of the different affinity reagents; (d) one or more databases including: (i) candidate protein binding profiles indicating, for each candidate protein known, or suspected to be, in the sample, probabilities of each of the different affinity reagents binding to the candidate protein, and (ii) decoy protein binding profiles indicating, for each decoy protein known not to be present in the sample, probabilities of each of the different affinity reagents binding to the decoy protein; and (e) a computing system configured to: (i) determine, using the empirical binding profile and the candidate protein binding profiles, candidate protein probabilities for each of the attached proteins being one of the candidate proteins; (ii) determine, using the empirical binding profiles and the plurality of decoy outcome profiles, decoy protein probabilities for each of the attached proteins being one of the decoy proteins that are not present in the sample, and (iii) quantify the attached proteins of the sample based on the candidate protein probabilities and the decoy protein probabilities.
2. The protein characterization system of claim 1, wherein the computing system is further configured to determine a false identification rate of the attached proteins based on the candidate protein probabilities and the decoy protein probabilities, and wherein quantifying the attached proteins is further based on the false identification rate.
3. The protein characterization system of claim 1, wherein the computing system is further configured to generate the decoy protein binding profiles from the candidate protein binding profiles.
4. The protein characterization system of claim 3, wherein the decoy protein binding profiles are generated from the candidate protein binding profiles by rearranging the candidate protein binding profiles.
5. The protein characterization system of claim 4, wherein rearranging the candidate protein binding profiles includes shuffling an order of binding probabilities of the affinity reagents binding to the candidate proteins.
6. The protein characterization system of claim 1, wherein each candidate protein of the candidate protein binding profiles is associated with a different decoy protein of the decoy protein binding profiles.
7. The protein characterization system of claim 1, wherein quantifying the attached proteins of the sample based on the candidate protein probabilities and the decoy protein probabilities includes excluding attached proteins with a higher decoy protein probability than a candidate protein probability from quantification.
8. The protein characterization system of claim 1, wherein the decoy proteins are proteins known to be absent from the sample.
9. The protein characterization system of claim 1, wherein the computing system is further configured to store the quantification of the attached proteins to a non-transitory computer-readable medium.
10. The protein characterization system of claim 1, wherein the attached proteins are full-length proteins.
11. The protein characterization system of claim 1, wherein the fluidic system is configured to iteratively apply the set of different affinity reagents to the attached proteins of the solid support.
12. The protein characterization system of claim 1, wherein the binding outcomes represent positive and negative binding outcomes of the different affinity reagents binding to the attached proteins.
13. The protein characterization system of claim 1, wherein the candidate protein binding profiles are different than the decoy protein binding profiles.
14. The protein characterization system of claim 1, wherein the computing system is further configured to determine a false discovery statistic based on the candidate protein probabilities and the decoy protein probabilities.
15. The protein characterization system of claim 14, wherein the computing system is configured to quantify the attached proteins further based on the false discovery statistic.
16. The protein characterization system of claim 1, wherein the attached proteins from the sample at individual addresses includes at least 1000 attached proteins bound to at least 1000 addresses.
17. The protein characterization system of claim 1, wherein the attached proteins from the sample at individual addresses includes at least 1 million attached proteins bound to at least 1 million addresses.
18. The protein characterization system of claim 1, wherein the attached proteins from the sample at individual addresses includes at least 1 billion attached proteins bound to at least 1 billion addresses.
19. The protein characterization system of claim 1, wherein the attached proteins correspond with the candidate protein binding profiles, and wherein the attached proteins do not correspond with the decoy protein binding profiles.
20. The protein characterization system of claim 19, wherein the attached proteins are one or more candidate proteins of the candidate protein binding profiles, wherein the attached proteins are not included in the decoy protein binding profiles.
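
For illustration only, the computing steps of claim 1(e) and the exclusion rule of claim 7 might be sketched as follows. The sketch assumes binary (bound or not bound) outcomes, independence between affinity reagents, and normalization of likelihoods over all candidate and decoy profiles; none of these modeling choices is stated in the claims, and all function and variable names are hypothetical.

```python
from collections import Counter
from math import prod


def likelihood(outcomes, binding_probs):
    """Likelihood of binary binding outcomes for one address, assuming each
    affinity reagent binds independently with the given probability."""
    return prod(p if bound else 1.0 - p
                for bound, p in zip(outcomes, binding_probs))


def classify_address(outcomes, candidates, decoys):
    """Return (best candidate name, candidate probability, decoy probability).

    Probabilities are likelihoods normalized over every candidate and decoy
    profile (an assumed scoring scheme). The decoy probability is that of
    the best-matching decoy.
    """
    cand_like = {name: likelihood(outcomes, p) for name, p in candidates.items()}
    decoy_like = {name: likelihood(outcomes, p) for name, p in decoys.items()}
    total = sum(cand_like.values()) + sum(decoy_like.values())
    if total == 0.0:
        return None, 0.0, 0.0
    best = max(cand_like, key=cand_like.get)
    return best, cand_like[best] / total, max(decoy_like.values()) / total


def quantify(empirical_profiles, candidates, decoys):
    """Count addresses per candidate protein, excluding any address whose
    decoy probability is higher than its candidate probability."""
    counts = Counter()
    for outcomes in empirical_profiles:
        name, cand_p, decoy_p = classify_address(outcomes, candidates, decoys)
        if name is not None and decoy_p <= cand_p:
            counts[name] += 1
    return counts


# Toy data: each decoy profile is a reordering of one candidate profile.
candidates = {"A": [0.9, 0.1, 0.9], "B": [0.1, 0.9, 0.1]}
decoys = {"A_decoy": [0.1, 0.9, 0.9], "B_decoy": [0.9, 0.1, 0.1]}
empirical = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
counts = quantify(empirical, candidates, decoys)
```

In this toy data the fourth address matches a decoy profile better than any candidate profile, so it is left out of the counts.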

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 18/301,891, filed on Apr. 17, 2023, which claims the benefit of U.S. Provisional Application No. 63/334,586 filed on Apr. 25, 2022, and U.S. Provisional Application No. 63/385,722 filed on Dec. 1, 2022, each of which applications is incorporated by reference in its entirety. Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 CFR 1.57.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on Jan. 15, 2026, is named NBIOT012D1ReplacementSeqListing.xml and is 2,002 bytes in size.

BACKGROUND

The proteome is among the most dynamic and valuable sources of biological insight. Current proteomics techniques are limited in their sensitivity and throughput, covering at best 35% of the human proteome in a single experiment (see Blume et al., Nat Commun 11, 3662 (2020) and Clark et al., Cell 180, 207 (2020), each of which is incorporated herein by reference). Despite the wealth of insights gained from now routine genomics and transcriptomics studies in biomedical research, a large gap remains between genome/transcriptome and phenotype. Proteomics is crucial to bridging this gap as proteins constitute the main structural and functional components of cells. However, protein sequencing technologies lag behind DNA sequencing technologies, in part due to the complex nature of proteins and proteomes as well as the high dynamic range (~10^9) in the quantities of different proteins present at any given time in any given cell (see Aebersold et al., Nat Chem Biol 14, 206-214 (2018), which is incorporated herein by reference).
Moreover, about 10% of the proteins predicted to comprise the human proteome have not been confidently observed at all (see Omenn et al., J Proteome Res 19, 4735-4746 (2020) and Adhikari et al., Nat Commun 11, 5301 (2020), each of which is incorporated herein by reference). Recently, single-molecule identification has been postulated as a method to analyze small samples (including single cells) and rare proteins (see Alfaro et al., Nat Methods 18, 604-617 (2021) and Restrepo-Perez et al., Nat Nanotechnol 13, 786-796 (2018), each of which is incorporated herein by reference). Traditional bulk identification techniques like mass spectrometry and immunoassays have been adapted towards detection of single proteins (see Keifer & Jarrold, Mass Spectrom Rev 36, 715-733 (2017) and Rissin et al., Nat Biotechnol 28, 595-599 (2010), each of which is incorporated herein by reference). Several concepts have been proposed to achieve single-molecule protein sequencing. These all use sequential processes to determine the positional information of amino acids within proteins, e.g., Edman-type degradation (Swaminathan, et al. Nat Biotechnol (2018) and Swaminathan, et al., PLoS Comput Biol 11, e1004080 (2015), each of which is incorporated herein by reference) or directional protein translocation through a nanopore channel (Kolmogorov, et al., PLoS Comput Biol 13, e1005356 (2017), which is incorporated herein by reference).

SUMMARY

The present disclosure provides a method of identifying extant proteins.
The method can include steps of: (a) providing inputs to a computer processor, the inputs including (i) a plurality of empirical outcome profiles, individual empirical outcome profiles of the plurality of empirical outcome profiles each including a plurality of empirical measurement outcomes for an extant protein, individual empirical measurement outcomes of the plurality of empirical measurement outcomes each including a measured outcome for reaction of the extant protein with a different assay reagent, (ii) a plurality of candidate outcome profiles, individual candidate outcome profiles of the plurality of candidate outcome profiles each including a plurality of statistical measures for a candidate protein, wherein the candidate proteins are known or suspected of being present in the sample, and (iii) a plurality of pseudo outcome profiles, individual pseudo outcome profiles of the plurality of pseudo outcome profiles each including a plurality of statistical measures that is known to not occur for any of the candidate proteins; (b) performing a process in the computer processor to identify extant proteins of the plurality of different extant proteins based on the empirical outcome profiles of the extant proteins and the plurality of candidate outcome profiles; and (c) performing a process in the computer processor to determine a false discovery statistic for the extant proteins based on the plurality of empirical outcome profiles and the plurality of pseudo outcome profiles. Optionally, the empirical outcome profiles are empiri