CN-121996620-A - Data writing method, data query method, electronic device and storage medium


Abstract

The application discloses a data writing method, a data query method, an electronic device and a storage medium. The data writing method comprises: generating a target data file based on data to be written, and storing the target data file in a target storage system to obtain a first storage path; generating at least one target index file corresponding to the target data file, and storing each target index file in the target storage system to obtain a second storage path, wherein the target data file and the corresponding target index file are stored separately in the target storage system; generating a target metadata file using the first storage path and the second storage path; and performing an atomic commit on the target metadata file to switch it to be the current metadata file of an Iceberg table. In this way, data query efficiency can be improved.

Inventors

LIU YANG
QIAN HAODONG
ZHANG KAN
CHEN WENCAN
Yang Xuze
LI KANG

Assignees

Zhejiang Dahua Technology Co., Ltd. (浙江大华技术股份有限公司)

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-05

Claims (10)

1. A method of writing data, the method comprising: generating a target data file based on data to be written, and storing the target data file in a target storage system to obtain a first storage path; generating at least one target index file corresponding to the target data file, and storing each target index file in the target storage system to obtain a second storage path, wherein the target data file and the corresponding target index file are stored separately in the target storage system; generating a target metadata file using the first storage path and the second storage path; and performing an atomic commit on the target metadata file to switch the target metadata file to be the current metadata file of an Iceberg table.
2. The method of claim 1, wherein the at least one target index file comprises at least one of a full-text index file and a vector index file.
3. The method of claim 2, wherein the at least one target index file comprises the full-text index file, and wherein generating the at least one target index file corresponding to the target data file comprises: performing word segmentation on text data in the target data file to obtain a plurality of entries; acquiring the entry position of each entry in the target data file, and establishing a mapping between each entry and its entry position to obtain an inverted index corresponding to the target data file; and generating the full-text index file from the inverted index corresponding to the target data file.
4. The method of claim 2, wherein the at least one target index file comprises the vector index file, and wherein generating the at least one target index file corresponding to the target data file comprises: performing vector conversion on the data in the target data file to obtain a high-dimensional vector; generating a vector index corresponding to the target data file based on the high-dimensional vector; and generating the vector index file from the vector index corresponding to the target data file.
5. The method of claim 4, wherein the data in the target data file comprises image data and text data, and wherein performing vector conversion on the data in the target data file to obtain a high-dimensional vector comprises: performing vector conversion on the image data and the text data in the target data file to obtain high-dimensional vectors corresponding to the image data and the text data, respectively.
6. The method of claim 1, wherein each target index file corresponding to the target data file is generated based on a corresponding index configuration attribute, and wherein generating the target metadata file using the first storage path and the second storage path comprises: generating an initial metadata file using the first storage path and the second storage path; and writing each index configuration attribute into the initial metadata file to obtain the target metadata file.
7. The method of claim 6, wherein the at least one target index file comprises a full-text index file, the full-text index file is generated based on a plurality of entries, the plurality of entries are obtained by segmenting text data in the target data file with a target word segmenter, and the index configuration attribute comprises a word segmenter identifier corresponding to the target word segmenter and an index type corresponding to the full-text index file; and/or the at least one target index file comprises a vector index file, the vector index file is generated based on a high-dimensional vector, the high-dimensional vector is obtained by performing vector conversion on data in the target data file with a vector conversion model, and the index configuration attribute comprises the vector conversion model, a distance metric and an index type corresponding to the vector index file.
8. A method of querying data, the method comprising: acquiring a data query task and a current metadata file of an Iceberg table, wherein the data query task comprises first data to be queried and a query condition, the query condition indicates the target index file to be used for the data query, and the current metadata file is obtained by the data writing method according to any one of claims 1-7; determining, from the current metadata file, a first target storage path of the target index file and a second target storage path of the target data file corresponding to the target index file; reading the target index file from the first target storage path, and searching the target index file for index data matching the first data to be queried, as an index result; and reading the target data file from the second target storage path, and searching the target data file for target data matching the index result, as a query result corresponding to the first data to be queried.
9. An electronic device comprising a memory for storing program instructions and a processor for executing the program instructions to implement the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium is for storing program instructions, the program instructions being executable to implement the method of any one of claims 1-8.
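
Read together, claims 1 and 8 describe a write path and a read path that can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration, not the applicant's implementation: a local directory stands in for the target storage system, JSON stands in for the data, index and metadata file formats, the names `write_rows`, `query` and `current_metadata.json` are hypothetical, and an `os.replace` rename of a pointer file stands in for the atomic commit (in Iceberg itself the commit is an atomic swap of the table's current-metadata pointer in the catalog).

```python
import json
import os
import tempfile
import uuid


def _store(root, subdir, payload):
    """Write payload as JSON under root/subdir and return its storage path."""
    folder = os.path.join(root, subdir)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{uuid.uuid4().hex}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return path


def write_rows(root, rows):
    """Claim 1 sketch: data file, separate index file, metadata file, atomic switch."""
    data_path = _store(root, "data", rows)  # first storage path
    index = {}
    for row_id, text in enumerate(rows):
        for term in text.lower().split():
            postings = index.setdefault(term, [])
            if row_id not in postings:
                postings.append(row_id)
    index_path = _store(root, "index", index)  # second storage path, kept apart
    metadata_path = _store(root, "metadata", {
        "data_file": data_path,
        "index_files": {"fulltext": index_path},
    })
    # Stand-in for the atomic commit: write a new pointer, then rename it into place.
    pointer = os.path.join(root, "current_metadata.json")
    tmp = pointer + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"current": metadata_path}, fh)
    os.replace(tmp, pointer)
    return metadata_path


def query(root, term):
    """Claim 8 sketch: current metadata -> index file -> data file."""
    with open(os.path.join(root, "current_metadata.json"), encoding="utf-8") as fh:
        metadata_path = json.load(fh)["current"]
    with open(metadata_path, encoding="utf-8") as fh:
        metadata = json.load(fh)
    with open(metadata["index_files"]["fulltext"], encoding="utf-8") as fh:
        row_ids = json.load(fh).get(term.lower(), [])  # index result
    with open(metadata["data_file"], encoding="utf-8") as fh:
        rows = json.load(fh)
    return [rows[i] for i in row_ids]  # query result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        write_rows(root, ["red shoes", "blue shoes", "red hat"])
        print(query(root, "red"))
```

Because readers only ever follow the pointer, a query sees either the previous metadata file or the new one, never a partially written table state; that is the property the atomic commit in claim 1 is meant to provide.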

Description

Data writing method, data query method, electronic device and storage medium

Technical Field

The present application relates to the field of data storage and processing technologies, and in particular to a data writing method, a data query method, an electronic device, and a storage medium.

Background

With the rapid development of big data and artificial intelligence technology, enterprise data has become multimodal, large in scale and high-dimensional. Demand for multimodal data retrieval has grown sharply: to deliver real-time recommendation and search, e-commerce and content platforms need to support keyword matching (full-text search) and semantic similarity matching (vector search) at the same time. However, such retrieval currently depends on external systems: data must be exported to a search engine such as Elasticsearch or Solr, which causes data redundancy, synchronization delay, difficulty in maintaining consistency, and low retrieval efficiency.

Disclosure of Invention

The technical problem mainly addressed by the application is to provide a data writing method, a data query method, an electronic device and a storage medium that can improve data query efficiency.
To solve the above technical problem, a first aspect of the application provides a data writing method, comprising: generating a target data file based on data to be written, and storing the target data file in a target storage system to obtain a first storage path; generating at least one target index file corresponding to the target data file, and storing each target index file in the target storage system to obtain a second storage path, wherein the target data file and the corresponding target index file are stored separately in the target storage system; generating a target metadata file using the first storage path and the second storage path; and performing an atomic commit on the target metadata file to switch the target metadata file to be the current metadata file of an Iceberg table.

The at least one target index file comprises at least one of a full-text index file and a vector index file.

Where the at least one target index file comprises the full-text index file, generating the at least one target index file corresponding to the target data file comprises: performing word segmentation on text data in the target data file to obtain a plurality of entries; acquiring the entry position of each entry in the target data file, and establishing a mapping between each entry and its entry position to obtain an inverted index corresponding to the target data file; and generating the full-text index file from the inverted index corresponding to the target data file.

Where the at least one target index file comprises the vector index file, generating the at least one target index file corresponding to the target data file comprises: performing vector conversion on data in the target data file to obtain a high-dimensional vector; generating a vector index corresponding to the target data file based on the high-dimensional vector; and generating the vector index file from the vector index corresponding to the target data file.
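
The full-text index generation described above (segment text into entries, record each entry's position, map entries to positions) can be sketched as follows. This is an illustration under stated assumptions: the regex-based `segment` function is a hypothetical stand-in for the word segmenter, and an "entry position" is taken to mean a (row number, token offset) pair, which the source does not specify.

```python
import re
from collections import defaultdict


def segment(text):
    """Stand-in word segmenter: lowercase runs of word characters, in order."""
    return [m.group(0).lower() for m in re.finditer(r"\w+", text)]


def build_inverted_index(rows):
    """Map each entry to the (row number, token offset) positions where it occurs."""
    index = defaultdict(list)
    for row_no, text in enumerate(rows):
        for offset, entry in enumerate(segment(text)):
            index[entry].append((row_no, offset))
    return dict(index)


if __name__ == "__main__":
    rows = ["Red shoes on sale", "Blue shoes", "red hat"]
    for entry, positions in sorted(build_inverted_index(rows).items()):
        print(entry, positions)
```

Serializing the returned mapping to its own file, separate from the data file, would give the full-text index file whose path is later recorded in the metadata file.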
The data in the target data file comprises image data and text data, and performing vector conversion on the data in the target data file to obtain a high-dimensional vector comprises: performing vector conversion on the image data and the text data in the target data file to obtain high-dimensional vectors corresponding to the image data and the text data, respectively.

Each target index file corresponding to the target data file is generated based on a corresponding index configuration attribute, and generating the target metadata file using the first storage path and the second storage path comprises: generating an initial metadata file using the first storage path and the second storage path; and writing each index configuration attribute into the initial metadata file to obtain the target metadata file.

The at least one target index file comprises a full-text index file, the full-text index file is generated based on a plurality of entries, the plurality of entries are obtained by segmenting text data in the target data file with a target word segmenter, and the index configuration attribute comprises a word segmenter identifier corresponding to the target word segmenter and an index type corresponding to the full-text index file; and/or the at least one target index file comprises a vector index file, the vector index file is generated based on a high-dimensional vector, the high-dimensional vector is obtained by performing vector conversion on data in the target data file with a vector conversion model, and the index configuration attribute comprises the vector conversion model, a distance metric and an index type corresponding to the vector index file.
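
To make the role of the index configuration attributes concrete, the sketch below builds a toy vector index and records the model, distance metric and index type in the metadata, so that a reader can embed a query the same way the data was embedded. All specifics are assumptions for illustration: `embed` is a hashed bag-of-words stand-in for a real vector conversion model, the exhaustive cosine scan stands in for whatever vector index structure is actually used, and the metadata keys and the model name `hashed-bow-16` are hypothetical.

```python
import hashlib
import math

DIM = 16  # toy dimensionality; real embedding models use far more dimensions


def embed(text):
    """Hypothetical vector conversion model: L2-normalized hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_metadata(data_path, index_path):
    """Initial metadata from the two paths, then index configuration attributes."""
    metadata = {"data_file": data_path, "index_files": [index_path]}
    metadata["index_config"] = [{
        "index_type": "vector",
        "vector_model": "hashed-bow-16",  # hypothetical model identifier
        "distance_metric": "cosine",
    }]
    return metadata


def search(vectors, query_text, top_k=1):
    """Exhaustive cosine search; ties are broken by the lower row number."""
    q = embed(query_text)
    scored = [(sum(a * b for a, b in zip(q, v)), row_no)
              for row_no, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [row_no for _, row_no in scored[:top_k]]


if __name__ == "__main__":
    rows = ["red running shoes", "blue winter coat"]
    vectors = [embed(r) for r in rows]
    print(build_metadata("data/0001.json", "index/0001.vec"))
    print(search(vectors, "red running shoes"))
```

Storing the configuration next to the file paths means the query side does not have to guess which model or metric produced an index file; it can read both from the current metadata file.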
In order to solve the technical problems, a second aspect of the application provides a data query method, which comprises the steps of obtaining a data query task and obtaining a current metadata file of Icebe