CN-121789769-B - B cell immune repertoire antibody sequence characteristic analysis and specificity screening method and related equipment


Abstract

The application discloses a method for analyzing antibody sequence characteristics of a B cell immune repertoire and screening for specificity, together with related equipment, applicable to the technical field of data processing. Antibody sequencing sequences are aligned and annotated to obtain the corresponding first sequencing Fv sequences and germline Fv sequences. The first sequencing Fv sequences then undergo gene integrity filtration and structural integrity filtration to give second sequencing Fv sequences, followed by clonotype division and root node sequence reconstruction to obtain second clonotype sets, from which a first lineage forest is constructed. Sequence characteristics are analyzed in multiple dimensions, such as sequence quality, sequence abundance, mutation degree, mutation preference and aggregation degree, based on the second sequencing Fv sequences, the second clonotype sets or the first lineage forest, so that the BCR sequences retained by subsequent filtration have high affinity. A visual map is generated from the multi-dimensional sequence characteristic analysis results or the first lineage forest, making it easier for the relevant personnel to inspect the sequence characteristics.

Inventors

ZHOU SHUXIAN
LIU HUAQING
WEI BIN
CHEN PEIYI

Assignees

广州赛业百沐生物科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260303

Claims (9)

1. A method for sequence characteristic analysis and specificity screening of B cell immune repertoire antibodies, comprising the steps of: obtaining antibody sequencing sequences from the B cell immune repertoire; performing germline alignment on the antibody sequencing sequences to identify a first sequencing Fv sequence corresponding to each antibody sequencing sequence, and generating a germline Fv sequence corresponding to each antibody sequencing sequence, wherein the first sequencing Fv sequence comprises a V gene fragment, a D gene fragment, a J gene fragment or a C gene fragment; performing gene integrity filtration and structural integrity filtration on the first sequencing Fv sequence to obtain a second sequencing Fv sequence; performing clonotype division on all the second sequencing Fv sequences according to the germline Fv sequences to obtain a plurality of first clonotype sets, and performing root node sequence reconstruction on each first clonotype set to obtain a second clonotype set; constructing a first lineage forest based on isotype class transition probabilities and the second clonotype sets, the first lineage forest including a plurality of first evolutionary trees, each first evolutionary tree corresponding to one of the second clonotype sets; performing multi-dimensional sequence characteristic analysis based on the second sequencing Fv sequence, the second clonotype set, or the first lineage forest, the multi-dimensional sequence characteristic analysis including characteristic analysis of sequence quality, of sequence abundance, of mutation degree, of mutation preference, and of aggregation degree; generating a visual map according to the sequence characteristic analysis results of the multiple dimensions or the first lineage forest; and filtering the second clonotype set or the second sequencing Fv sequence according to the sequence characteristic analysis results of the multiple dimensions and preset filtering conditions to obtain a qualified antibody sequence; wherein the characteristic analysis of the sequence quality comprises: analyzing the number and position of cysteines in the second sequencing Fv sequence; analyzing the sequence integrity of the second sequencing Fv sequence; and performing position-by-position masked prediction on the second sequencing Fv sequence through an antibody language model to obtain a first pseudo-likelihood value for each site, and calculating an AI likelihood value of the second sequencing Fv sequence from the first pseudo-likelihood values of all sites in the second sequencing Fv sequence.
2. The method of claim 1, wherein the characteristic analysis of the sequence abundance comprises: analyzing the occurrence frequency of the DNA sequence or the protein sequence of the second sequencing Fv sequence to obtain a sequence occurrence frequency; and analyzing the number of occurrences of the complementarity determining region sequence in the second sequencing Fv sequence to obtain a complementarity determining region sequence enrichment.
3. The method of claim 2, wherein the characteristic analysis of the mutation degree comprises: analyzing the homology between the second sequencing Fv sequence and the corresponding germline sequence; comparing the second sequencing Fv sequence with the corresponding germline sequence to analyze the number of somatic hypermutations in the second sequencing Fv sequence; and analyzing a ratio index between the total branch length of the subtree corresponding to a target node in the first lineage forest and the total branch length on the path from the target node to the root node.
4. The method of claim 3, wherein the characteristic analysis of the mutation preference comprises: analyzing the frequency of high-frequency somatic hypermutations in the second clonotype set; analyzing the distance between each second sequencing Fv sequence and the consensus sequence within the second clonotype set; and analyzing the local branch density corresponding to the target node in the first lineage forest.
5. The method of claim 4, wherein the characteristic analysis of the aggregation degree comprises: analyzing the connectivity of the complementarity determining region sequences in the second sequencing Fv sequence; and analyzing the connectivity of the framework region sequences in the second sequencing Fv sequence.
6. The method of claim 1, wherein performing gene integrity filtration and structural integrity filtration on the first sequencing Fv sequence to obtain a second sequencing Fv sequence comprises: if the V gene fragment or the J gene fragment in a first sequencing Fv sequence cannot be aligned to the gene database, removing that first sequencing Fv sequence and taking the remaining first sequencing Fv sequences as third sequencing Fv sequences; and if a third sequencing Fv sequence lacks a gene fragment in a framework region or a complementarity determining region, or has incomplete ends, removing that third sequencing Fv sequence and taking the remaining third sequencing Fv sequences as the second sequencing Fv sequences.
7. The method of claim 1, wherein performing clonotype division on all the second sequencing Fv sequences according to the germline Fv sequences to obtain a plurality of first clonotype sets, and performing root node sequence reconstruction on each first clonotype set to obtain a second clonotype set, comprises: classifying into the same first clonotype set those second sequencing Fv sequences whose CDR3 sequences have the same length and whose corresponding germline Fv sequences have identical V and J gene segments for both the heavy and the light chain; aligning all of the second sequencing Fv sequences and the germline Fv sequences in each first clonotype set; if the bases at any site differ among the aligned second sequencing Fv sequences and germline Fv sequences, setting the base at the corresponding site in the germline Fv sequence to N; and taking the germline Fv sequence after base modification as the root node sequence of the first clonotype set to obtain the second clonotype set.
8. An electronic device, comprising: at least one processor; and at least one memory for storing at least one program; wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1 to 7.
9. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
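The filtration of claim 6 and the clonotype division and root node reconstruction of claim 7 can be illustrated with a minimal Python sketch. Everything named here (`FvRecord`, its fields, and the three functions) is hypothetical and not taken from the patent. The sketch assumes a single chain per record, whereas claim 7 matches V and J segments for both heavy and light chains, and it assumes the sequences are already aligned to equal length. The root rule follows one literal reading of claim 7: any aligned column in which the observed and germline bases are not all identical is set to N.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FvRecord:  # hypothetical record layout, not defined in the patent
    seq_id: str
    v_gene: Optional[str]   # None if the V segment has no hit in the gene database
    j_gene: Optional[str]   # None if the J segment has no hit in the gene database
    cdr3: str
    sequence: str           # aligned sequencing Fv sequence
    germline: str           # aligned germline Fv sequence, same length as `sequence`
    regions_complete: bool  # True if no FR/CDR is missing and both ends are complete


def integrity_filter(records: List[FvRecord]) -> List[FvRecord]:
    # Gene integrity: drop records whose V or J segment could not be aligned.
    third = [r for r in records if r.v_gene and r.j_gene]
    # Structural integrity: drop records with missing regions or incomplete ends.
    return [r for r in third if r.regions_complete]


def divide_clonotypes(
    records: List[FvRecord],
) -> Dict[Tuple[Optional[str], Optional[str], int], List[FvRecord]]:
    # Group by identical germline V gene, J gene and CDR3 length.
    groups = defaultdict(list)
    for r in records:
        groups[(r.v_gene, r.j_gene, len(r.cdr3))].append(r)
    return dict(groups)


def reconstruct_root(members: List[FvRecord]) -> str:
    # Mask every column where the aligned sequences disagree (assumed reading).
    if not members:
        raise ValueError("clonotype set is empty")
    aligned = [s for r in members for s in (r.sequence, r.germline)]
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    template = members[0].germline
    return "".join(
        base if len(set(column)) == 1 else "N"
        for base, column in zip(template, zip(*aligned))
    )
```

In this sketch each key of the dictionary returned by `divide_clonotypes` stands for one first clonotype set, and pairing it with the string from `reconstruct_root` would give the corresponding second clonotype set.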
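The abundance measures of claim 2 and the germline comparison of claim 3 reduce to counting, which the following Python sketch illustrates. The function names are hypothetical. Treating `N`, `-` and `.` as positions to skip is an assumption about how masked or gapped sites would be handled, and the identity ratio is only a simple stand-in for the homology analysis, which the claims do not define further.

```python
from collections import Counter
from typing import Dict, Iterable

SKIP = set("N-.")  # assumed: masked or gap positions are not compared


def sequence_frequencies(sequences: Iterable[str]) -> Dict[str, float]:
    # Occurrence frequency of each distinct DNA or protein sequence.
    counts = Counter(sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def cdr_enrichment(cdr_sequences: Iterable[str]) -> Counter:
    # Number of occurrences of each complementarity determining region sequence.
    return Counter(cdr_sequences)


def shm_count(sequence: str, germline: str) -> int:
    # Mismatches between an aligned sequence and its germline, skipping masked sites.
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline must be aligned to the same length")
    return sum(
        1
        for a, b in zip(sequence, germline)
        if a != b and a not in SKIP and b not in SKIP
    )


def germline_identity(sequence: str, germline: str) -> float:
    # Fraction of compared positions that match the germline.
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline must be aligned to the same length")
    compared = sum(
        1 for a, b in zip(sequence, germline) if a not in SKIP and b not in SKIP
    )
    if compared == 0:
        raise ValueError("no comparable positions")
    return (compared - shm_count(sequence, germline)) / compared
```

Skipping `N` positions lets a root node sequence with masked sites serve as the germline argument without inflating the mutation count.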
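The last step of claim 1, masking each site in turn and combining the per-site pseudo-likelihood values into one AI likelihood value, can be sketched as follows. The claims do not state how the per-site values are combined, so the mean log-probability used here is an assumption. The antibody language model is represented by a hypothetical callable `predict(sequence, position)` that returns a probability for each residue at the masked position, and `uniform_predictor` is a stand-in for demonstration, not a real model.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: given a sequence and the index of the masked site,
# return the model's probability for each candidate residue at that site.
MaskedPredictor = Callable[[str, int], Dict[str, float]]


def per_site_pseudo_likelihoods(sequence: str, predict: MaskedPredictor) -> List[float]:
    # Mask one site at a time and read off the probability of the true residue.
    return [
        predict(sequence, i).get(residue, 0.0)
        for i, residue in enumerate(sequence)
    ]


def ai_likelihood(sequence: str, predict: MaskedPredictor, eps: float = 1e-12) -> float:
    # Assumed aggregation: mean log-probability over all sites.
    if not sequence:
        raise ValueError("sequence is empty")
    site_probs = per_site_pseudo_likelihoods(sequence, predict)
    return sum(math.log(max(p, eps)) for p in site_probs) / len(site_probs)


def uniform_predictor(sequence: str, position: int) -> Dict[str, float]:
    # Placeholder model: every amino acid is equally likely at every site.
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return {aa: 1.0 / len(alphabet) for aa in alphabet}
```

With the uniform placeholder every site scores 1/20, so any sequence receives the same value, log(1/20). A trained antibody language model substituted for `predict` would rank sequences differently.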

Description

B cell immune repertoire antibody sequence characteristic analysis and specificity screening method and related equipment

Technical Field

The application relates to the technical field of data processing, and in particular to a B cell immune repertoire antibody sequence characteristic analysis and specificity screening method and related equipment.

Background

In the related art, the B cell immune repertoire (B cell repertoire) refers to the collection of B cell receptor (BCR) sequences or antibody variable region sequences of all B cells in an individual, and reflects the organism's capacity to recognize various antigens and the diversity of its immune responses. With the rapid development of high-throughput sequencing technology (such as single-cell paired sequencing), the B cell immune repertoire can now be sequenced at extremely high resolution and depth, yielding massive amounts of antibody sequence data. These data contain rich information on key immunological events such as B cell clonal expansion, maturation, and responses to antigens.

However, efficiently mining and analyzing the valuable BCR sequences in such massive data remains a challenge. In current antibody discovery projects, it is still difficult to judge antigen specificity directly from sequence, that is, to recognize which BCR sequences have higher affinity for a particular antigen. Because assaying the whole library experimentally (SPR, BLI, ELISA) takes a long time and is costly, the common practice is to select only high-abundance sequences for verification. Better BCR sequences are therefore frequently missed, and even assaying hundreds of sequences at higher cost may fail to find a strong binder. In addition, the amount of experimental data is too small to train a deep learning or machine learning model.
Current bioinformatic analysis of B cell immune repertoires is limited to clonal diversity, V/J gene usage frequency, somatic hypermutation (SHM) and evolutionary analysis. It cannot mine multi-dimensional, comprehensive characteristics from a set of BCR sequences, nor can it identify BCR sequences with high affinity.

In summary, the technical problems in the related art remain to be solved.

Disclosure of Invention

The main purpose of the embodiments of the application is to provide a B cell immune repertoire antibody sequence characteristic analysis and specificity screening method and related equipment that can effectively identify BCR sequences with high affinity.

To achieve the above objective, according to one aspect of the embodiments of the present application, a method for sequence characteristic analysis and specificity screening of B cell immune repertoire antibodies is provided, comprising the following steps: obtaining antibody sequencing sequences from the B cell immune repertoire; performing germline alignment on the antibody sequencing sequences to identify a first sequencing Fv sequence corresponding to each antibody sequencing sequence, and generating a germline Fv sequence corresponding to each antibody sequencing sequence, wherein the first sequencing Fv sequence comprises a V gene fragment, a D gene fragment, a J gene fragment or a C gene fragment; performing gene integrity filtration and structural integrity filtration on the first sequencing Fv sequence to obtain a second sequencing Fv sequence; performing clonotype division on all the second sequencing Fv sequences according to the germline Fv sequences to obtain a plurality of first clonotype sets, and performing root node sequence reconstruction on each first clonotype set to obtain a second clonotype set; constructing a first lineage forest based on isotype class transition probabilities and the second clonotype sets, the first lineage forest including a plurality of first evolutionary trees, each first evolutionary tree corresponding to one of the second clonotype sets; performing multi-dimensional sequence characteristic analysis based on the second sequencing Fv sequence, the second clonotype set, or the first lineage forest, the multi-dimensional sequence characteristic analysis including characteristic analysis of sequence quality, of sequence abundance, of mutation degree, of mutation preference, and of aggregation degree; generating a visual map according to the sequence characteristic analysis results of the multiple dimensions or the first lineage forest; and filtering the second clonotype set or the second sequencing Fv sequence according to the sequence characteristic analysis results of the multiple dimensions and preset filtering conditions to obtain a qualified antibody sequence.

In some embodiments, the characteristic analysis of the sequence quality comprises: analyzing the number and position of cysteines in the second sequenced Fv se