CN-121979885-A - Columnar storage and indexing method, device, medium and product for mass spectrometry data analysis

Abstract

The application discloses a columnar storage and indexing method, device, medium and product for mass spectrometry data analysis, relating to the field of computer data processing. The method comprises: constructing a partitioned columnar storage system; obtaining the data compression and encoding strategy corresponding to each partition in the partitioned columnar storage system; performing data writing and index construction; and, according to the type of the user's retrieval request, locating the target data block through the index partition and performing full-link tracing.

Inventors

DING MINJIE
YANG SIRUI
ZHU YAMEI
Erepati Abdi Saimaiti
Kuang Zhiguang
LI YAOTONG
WU JIANHUA
TANG MINLU
LIU ZHENYU
ZHANG MENG

Assignees

Shanghai Development Center of Computer Software Technology (上海计算机软件技术开发中心)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-20
Priority Date: 2026-01-14

Claims (10)

1. A columnar storage and indexing method for mass spectrometry data analysis, characterized by comprising the following steps: constructing a partitioned columnar storage system, wherein the partitioned columnar storage system comprises an original partition for storing original spectrogram data acquired in experiments, a feature partition for storing peak/ion feature data extracted from the original spectrogram data, a result partition for storing qualitative and/or quantitative analysis results, and an index partition for storing index data that accelerates retrieval, and wherein the original partition, the feature partition, the result partition and the index partition establish cross-partition logical mappings through preset association fields and each use columnar storage; obtaining the data compression and encoding strategy corresponding to each partition in the partitioned columnar storage system; obtaining row-oriented original spectrogram data, allocating a columnar buffer in memory, splitting the row-oriented original spectrogram data and filling the buffer according to the partitioned columnar storage system to obtain a plurality of data blocks, applying to each column, when the data in the buffer reaches a preset threshold, the data compression and encoding strategy of the corresponding partition, and writing the compressed column data blocks to disk in batches in partition order; and, according to the type of the user's retrieval request, locating the target data block through the index partition and performing full-link tracing, wherein the retrieval request types comprise range retrieval and spectrogram similarity retrieval.
2. The columnar storage and indexing method for mass spectrometry data analysis according to claim 1, wherein the preset association fields comprise an original spectrogram reference field of the feature partition and an associated feature identifier field of the result partition, the original spectrogram reference field pointing to a spectrogram unique identifier of the original partition, and the associated feature identifier field pointing to a feature unique identifier of the feature partition.
3. The columnar storage and indexing method for mass spectrometry data analysis according to claim 1, wherein the index structure of the index partition comprises a global RT index, a bucketed m/z index and a vector index; the global RT index uses a B+ tree structure to map retention time ranges to the physical addresses of data blocks; the bucketed m/z index divides the mass-to-charge ratio range into buckets of a preset fixed width and establishes an inverted index from each bucket to row group IDs; and the vector index establishes a mapping between spectrogram vectors and row group IDs based on a graph-structured approximate nearest neighbor search algorithm.
4. The columnar storage and indexing method for mass spectrometry data analysis according to claim 3, wherein the data compression and encoding strategy corresponding to each partition specifically comprises: the data compression and encoding strategy of the original partition: decomposing the original spectrogram data into a fixed-length scan metadata column group and variable-length numerical data columns, wherein the scan metadata column group stores the spectrogram unique identifier, retention time, mass spectrum level, parent ion mass-to-charge ratio and ion injection time; adopting a row group division strategy in which a run of consecutive spectrograms reaching a preset threshold forms one row group, and storing the retention time column of the scan metadata column group using differential (delta) encoding and bit-level compression; the data compression and encoding strategy of the feature partition: defining, according to the peak/ion feature data extracted from the original spectrogram data, a column set containing a unique identifier, a start retention time, an end retention time, a mass-to-charge ratio center value, a charge state and an intensity sum; compressing the charge state column using run-length encoding, establishing a Min-Max index for the start retention time column and the end retention time column, and screening the peak/ion feature data within a set time window by recording the minimum value and the maximum value of the corresponding column in each row group; the data compression and encoding strategy of the result partition: defining, according to the qualitative and/or quantitative analysis results, a column set comprising a matched spectrogram identifier, a qualitative result sequence, protein reference information, a matching score and an associated feature ID; establishing a global qualitative result dictionary table, applying dictionary encoding to the qualitative result column, and mapping repeated sequences to unique integer IDs for storage; and the data compression and encoding strategy of the index partition: constructing a tree index for retention time range retrieval, a bucketed m/z index for mass-to-charge ratio bucket retrieval, and a vector index structure for spectrogram similarity retrieval.
5. The columnar storage and indexing method for mass spectrometry data analysis according to claim 3, wherein locating the target data block through the index partition and performing full-link tracing according to the type of the user's retrieval request specifically comprises: when the retrieval request type is range retrieval, receiving a retrieval request containing a target mass-to-charge ratio, a tolerance and a retention time range, and reading index partition data; loading the mass-to-charge ratio array column, the intensity array column and the retention time column of the row groups in the candidate row group list while skipping irrelevant columns, decompressing them, screening the target data, performing aggregation and calculation, and outputting a retrieval result; when the retrieval request type is spectrogram similarity retrieval, receiving a target vector, obtaining through the vector index a list of the candidate row groups with the highest similarity, loading the corresponding original spectrogram data, and outputting a retrieval result; and, within the partitioned columnar storage system, performing cross-partition association according to the retrieval result and the preset association fields, and performing bidirectional full-link tracing from the qualitative and/or quantitative analysis results to the corresponding original spectrogram data and peak/ion feature data.
6. The columnar storage and indexing method for mass spectrometry data analysis according to claim 1, wherein the method supports dynamic evolution of the data schema.
7. The columnar storage and indexing method for mass spectrometry data analysis according to claim 1, wherein the original spectrogram data comprises gas chromatography-mass spectrometry data, liquid chromatography-mass spectrometry data and secondary mass spectrometry data.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the columnar storage and indexing method for mass spectrometry data analysis according to any one of claims 1 to 7.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the columnar storage and indexing method for mass spectrometry data analysis according to any one of claims 1 to 7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the columnar storage and indexing method for mass spectrometry data analysis according to any one of claims 1 to 7.
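
For illustration only, the bucketed m/z index of claim 3 and the range retrieval of claim 5 can be sketched in Python as follows. This is a minimal sketch, not the implementation disclosed in the application. The names RowGroup, BucketedMzIndex and extract_xic, the bucket width of 1.0 and the in-memory dictionaries are assumptions introduced here. Compression and disk I/O are omitted, and a per-row-group retention time minimum/maximum stands in for the B+ tree RT index.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical row group: parallel columns for a run of consecutive spectra."""
    rt: list[float]                # retention time column, one value per spectrum
    mz: list[list[float]]          # m/z array column, one array per spectrum
    intensity: list[list[float]]   # intensity array column, one array per spectrum


@dataclass
class BucketedMzIndex:
    """Inverted index from fixed-width m/z bucket to row group IDs."""
    width: float = 1.0             # assumed bucket width in m/z units
    buckets: dict[int, set[int]] = field(default_factory=lambda: defaultdict(set))
    rt_range: dict[int, tuple[float, float]] = field(default_factory=dict)

    def add(self, group_id: int, group: RowGroup) -> None:
        # Record the RT min/max of the row group (stand-in for the RT index).
        self.rt_range[group_id] = (min(group.rt), max(group.rt))
        # Register the row group under every bucket that one of its peaks falls in.
        for peaks in group.mz:
            for mz in peaks:
                self.buckets[int(mz // self.width)].add(group_id)

    def candidates(self, mz_lo: float, mz_hi: float,
                   rt_lo: float, rt_hi: float) -> list[int]:
        # Union the posting lists of all buckets overlapping [mz_lo, mz_hi].
        hits: set[int] = set()
        for b in range(int(mz_lo // self.width), int(mz_hi // self.width) + 1):
            hits |= self.buckets.get(b, set())
        # Keep only row groups whose RT range overlaps [rt_lo, rt_hi].
        return sorted(g for g in hits
                      if self.rt_range[g][0] <= rt_hi and self.rt_range[g][1] >= rt_lo)


def extract_xic(groups: dict[int, RowGroup], index: BucketedMzIndex,
                target_mz: float, tol: float, rt_lo: float, rt_hi: float):
    """Range retrieval: summed intensity near target_mz for each spectrum in the RT window."""
    mz_lo, mz_hi = target_mz - tol, target_mz + tol
    xic = []
    for gid in index.candidates(mz_lo, mz_hi, rt_lo, rt_hi):
        g = groups[gid]  # only candidate row groups are ever touched
        for rt, mzs, ints in zip(g.rt, g.mz, g.intensity):
            if rt_lo <= rt <= rt_hi:
                total = sum(i for m, i in zip(mzs, ints) if mz_lo <= m <= mz_hi)
                if total > 0:
                    xic.append((rt, total))
    return xic


# Toy data: three row groups; only group 0 matches both the m/z and RT window.
groups = {
    0: RowGroup(rt=[1.0, 2.0], mz=[[100.0, 500.25], [500.26, 700.0]],
                intensity=[[10.0, 20.0], [30.0, 5.0]]),
    1: RowGroup(rt=[3.0, 4.0], mz=[[200.0], [300.0]], intensity=[[1.0], [2.0]]),
    2: RowGroup(rt=[10.0, 11.0], mz=[[500.25], [500.24]], intensity=[[7.0], [8.0]]),
}
index = BucketedMzIndex(width=1.0)
for gid, grp in groups.items():
    index.add(gid, grp)

xic = extract_xic(groups, index, target_mz=500.25, tol=0.02, rt_lo=0.0, rt_hi=5.0)
print(xic)
```

In this toy query, row group 1 is skipped by the m/z buckets and row group 2 by the retention time range, so only row group 0 is scanned. The bucket width trades index size against the number of false-positive row groups that must still be loaded and filtered.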

Description

Columnar storage and indexing method, device, medium and product for mass spectrometry data analysis

Technical Field

The application relates to the field of computer data processing, and in particular to a columnar storage and indexing method, device, medium and product for mass spectrometry data analysis.

Background

Mass spectrometry (MS) is a key analytical tool in life sciences, clinical diagnostics, drug development, environmental monitoring and other fields. With the spread of high-resolution, high-throughput mass spectrometers, a single experiment can produce millions of spectra, and data volume is growing explosively, typically reaching tens or even hundreds of gigabytes (GB). The massive, high-dimensional nature of this data poses a significant challenge to conventional storage, management and analysis methods.

At present, the storage and analysis of mass spectrometry data mainly rely on the following technical approaches, each of which has clear drawbacks:

1. Storage based on standardized file formats. Formats such as mzML and mzXML are data exchange formats in common use in the industry. They are typically based on XML (Extensible Markup Language) and store the metadata (e.g., retention time, parent ion information) and numerical data (mass-to-charge ratio and intensity arrays) of each scan together as one record. This approach has the following inherent drawbacks:

(1) Low retrieval efficiency. For common analysis tasks such as extracting an ion chromatogram (Extracted Ion Chromatogram, XIC), because the data are stored sequentially in units of spectrograms (i.e., rows), the analysis software must parse the whole file from beginning to end and check, one by one, whether each spectrogram contains signals within the target mass-to-charge ratio (m/z) range. For GB-level files, this involves a large amount of redundant data reading and computation, takes a very long time, and severely limits the efficiency of interactive data exploration.

(2) High storage and I/O overhead. The text format of XML leads to high data redundancy and very large files. In addition, with a row-oriented storage structure, a query that needs only a few fields (such as retention time and mass-to-charge ratio) must still load the complete records into memory, which wastes a great deal of disk I/O and memory.

2. To overcome the retrieval bottleneck of file-based storage, some schemes import mass spectrometry data into a traditional relational database (such as MySQL) or a NoSQL database. However, these general-purpose databases are not optimized for the characteristics of scientific data such as mass spectra, and have the following drawbacks:

(1) Analysis bottleneck of row-oriented databases. Traditional relational databases usually adopt row-oriented storage, suffer from an I/O amplification problem similar to that of the file formats, and are not suited to analysis-intensive queries.

(2) Poor index suitability. Standard indexes of general-purpose databases, such as B-trees, perform poorly on high-precision, continuous and sparse floating-point range queries such as those on mass-to-charge ratio. Vector index support for high-dimensional spectrogram data is also lacking, so nearest neighbor search based on spectrogram similarity cannot be processed efficiently.

(3) High deployment and maintenance cost. A database system capable of handling massive mass spectrometry data, especially a distributed database cluster (such as the Hadoop/Spark ecosystem), requires specialized technical knowledge and expensive hardware; the threshold is too high for small and medium-sized laboratories with limited budgets and staff.

3. Performance bottleneck of similar-spectrogram retrieval. In spectral library search or cluster analysis of unknown compounds, the core task is to calculate the similarity between a query spectrogram and a massive number of historical spectrograms. The prior art generally lacks an efficient vector indexing mechanism (such as approximate nearest neighbor retrieval) and can rely only on brute-force one-by-one comparison or simple prefiltering, so similarity retrieval on a large-scale data set takes a very long time and can hardly meet the requirement of real-time analysis.

4. Loss of data association and version traceability. A complete mass spectrometry analysis workflow comprises multiple steps, such as original data acquisition, feature extraction (peak identification, isotope removal and the like), database retrieval, and qualitative and quantitative analysis. In the prior art, the original data, the intermediate feature data and the final result data generated by these steps are often scattered across different storage locations in the form of indepe