CN-121996738-A - Semi-structured text multi-level information retrieval method
Abstract
The invention discloses a multi-level information retrieval method for semi-structured text, belonging to the field of information retrieval and data mining. The method first parses the semi-structured text data, builds a filter index for the structured metadata, and in parallel builds a semantic vector index and a keyword index for the unstructured text content. At search time, structured conditions are first used for fast coarse screening to narrow the search range; semantic retrieval and keyword retrieval are then run in parallel over the candidate set to obtain a semantic relevance score and a keyword matching score; the two scores are then combined by weighted fusion in a dynamic fusion layer, whose weight parameters can be retrieved dynamically according to the user identifier or the query scene. The invention balances retrieval efficiency and precision, overcomes the rigidity of the static weights used in traditional hybrid retrieval, adapts the retrieval strategy to implicit user feedback, and improves the personalization and intelligence of the retrieval system.
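The retrieval flow summarized in the abstract (coarse metadata filtering, then weighted fusion of two scores with weights looked up per user or scene) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `Doc` class, `WEIGHT_LIBRARY`, the lookup order (user first, then scene, then a default), and the toy scores. The semantic and keyword scores are assumed to be precomputed and normalized, and the two retrieval paths are not actually run in parallel.

```python
from dataclasses import dataclass

# Hypothetical weight configuration library: (alpha, beta) with alpha + beta = 1.
WEIGHT_LIBRARY = {
    "user:42": (0.3, 0.7),
    "scene:exact_model": (0.2, 0.8),
}
DEFAULT_WEIGHTS = (0.5, 0.5)


@dataclass
class Doc:
    doc_id: str
    metadata: dict
    semantic_score: float  # assumed precomputed and normalized to [0, 1]
    keyword_score: float   # assumed precomputed and normalized to [0, 1]


def coarse_filter(docs, constraints):
    """Keep documents whose metadata equals every structured constraint."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in constraints.items())
    ]


def lookup_weights(user_id=None, scene=None):
    """Prefer user-specific weights, then scene weights, then the default."""
    for key in (user_id, scene):
        if key in WEIGHT_LIBRARY:
            return WEIGHT_LIBRARY[key]
    return DEFAULT_WEIGHTS


def retrieve(docs, constraints, user_id=None, scene=None):
    """Filter by metadata, then rank the candidates by the fused score."""
    alpha, beta = lookup_weights(user_id, scene)
    candidates = coarse_filter(docs, constraints)
    ranked = sorted(
        candidates,
        key=lambda d: alpha * d.semantic_score + beta * d.keyword_score,
        reverse=True,
    )
    return [d.doc_id for d in ranked]


DOCS = [
    Doc("a", {"brand": "X"}, semantic_score=0.9, keyword_score=0.4),
    Doc("b", {"brand": "X"}, semantic_score=0.3, keyword_score=0.9),
    Doc("c", {"brand": "Y"}, semantic_score=1.0, keyword_score=1.0),
]
```

With the default weights, `retrieve(DOCS, {"brand": "X"})` ranks document "a" first; with the keyword-leaning weights stored for the hypothetical `"user:42"`, the same query ranks "b" first, which is the kind of per-user difference the dynamic fusion layer is meant to capture.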
Inventors
- Diao Zhengang
- CHEN JINYONG
- ZHAI LIZHI
- LUO MENG
- LIANG YUXUAN
- HAN JIANGLONG
- Sun Kaidi
- JIA QINGCHAO
- Yao sai
- WANG WENBO
Assignees
- The 54th Research Institute of China Electronics Technology Group Corporation (中国电子科技集团公司第五十四研究所)
Dates
- Publication Date
- 20260508
- Application Date
- 20260409
Claims (10)
- 1. A multi-level information retrieval method for semi-structured text, characterized by comprising the following steps: step 1, parsing semi-structured text data, separating structured metadata from unstructured text content, constructing an aggregation filter index for the structured metadata, and constructing a semantic vector index and a keyword index for the unstructured text content; step 2, receiving a user query, first performing coarse screening based on the index of the structured metadata to obtain an initial candidate set, and then executing a semantic retrieval path and a keyword retrieval path in parallel over the documents in the initial candidate set to obtain a semantic relevance score and a keyword matching score for each document; step 3, collecting implicit interaction behavior data of the user on the search results, constructing positive and negative sample pairs based on the implicit interaction behavior data, calculating the ranking loss under the current weight parameters, performing real-time incremental updates of the weight parameters of the dynamic fusion layer according to the ranking loss, and storing the updated weight parameters in association with the identifier of the corresponding user for use in that user's subsequent searches.
- 2. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 1 the semantic vector index is constructed by: fine-tuning a pre-trained language model on a domain-related corpus to obtain a domain-enhanced semantic encoding model; encoding the unstructured text content into semantic vectors using the domain-enhanced semantic encoding model; and storing the semantic vectors in a vector database that supports approximate nearest neighbor search.
- 3. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 1 the keyword index is constructed by: performing word segmentation on the unstructured text content based on a domain dictionary and a stop word list; and constructing an inverted index from the word segmentation results and calculating term statistics to support a BM25 relevance scoring model.
- 4. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 2 the coarse screening based on the index of the structured metadata is performed by: parsing the user query and identifying the structured constraint conditions in it; and combining the structured constraint conditions into a Boolean query and filtering the documents in the aggregation filter index to form the initial candidate set.
- 5. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 2 the semantic retrieval path comprises: mapping the user query and the candidate document content into semantic vectors using a domain-enhanced semantic encoding model, calculating the cosine similarity between the query vector and each document vector, and normalizing the cosine similarity to obtain the semantic relevance score.
- 6. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 2 the keyword retrieval path comprises: segmenting the user query and the candidate document content into words, calculating the degree of match between the query and each document using the BM25 statistical relevance model, and normalizing the result to obtain the keyword matching score.
- 7. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 2 the weighted fusion is performed as: Final_Score = α · Score_semantic + β · Score_keyword, where Final_Score is the final ranking score, Score_semantic is the semantic relevance score, Score_keyword is the keyword matching score, and α and β are weight parameters with α + β = 1; the weight parameters are obtained dynamically from a preset weight configuration library according to the user identifier or the query scene type.
- 8. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 3 the positive and negative sample pairs are constructed from the implicit interaction behavior data by: marking as a positive sample a document that the user clicked and on which the dwell time exceeded a first threshold or a conversion behavior occurred; and marking as negative samples the documents in the returned search result list that are ranked above or near the positive sample but were not clicked, or whose dwell time after clicking was below a second threshold.
- 9. The method for multi-level information retrieval of semi-structured text according to claim 1 or 8, wherein in step 3 the ranking loss is calculated using a hinge loss function based on pairwise learning, and the loss $L(P, N)$ for a pair of positive and negative samples P and N is calculated as: $L(P, N) = \max\bigl(0,\ m - (S_P - S_N)\bigr)$; where $m$ is the preset margin value, and $S_P$ and $S_N$ are the final ranking scores of the positive sample P and the negative sample N, respectively.
- 10. The method for multi-level information retrieval of semi-structured text according to claim 1, wherein in step 3 an online gradient descent method is used to update the weight parameters of the dynamic fusion layer in real time, specifically: calculating the gradient $\partial L / \partial \alpha$ of the ranking loss $L$ with respect to the weight parameter $\alpha$; updating the weight parameter $\alpha$ from its original value $\alpha_{\text{old}}$ to $\alpha_{\text{new}}$: $\alpha_{\text{new}} = \alpha_{\text{old}} - \eta \, \partial L / \partial \alpha$; where $\eta$ is the learning rate; and, according to $\alpha + \beta = 1$, updating the value of the weight parameter $\beta$ to $\beta_{\text{new}}$: $\beta_{\text{new}} = 1 - \alpha_{\text{new}}$.
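The feedback loop of claims 7, 9, and 10 (fused score, pairwise hinge loss, online gradient step on the fusion weight) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the patented implementation: the function names, the default margin and learning rate, and the clipping of α to [0, 1] are hypothetical, and the gradient is taken after substituting β = 1 − α into the fused score, which is one possible reading of the claims. Each sample is given as a `(semantic_score, keyword_score)` tuple.

```python
def fused_score(alpha, semantic, keyword):
    """Fused score alpha * semantic + beta * keyword, with beta = 1 - alpha."""
    return alpha * semantic + (1.0 - alpha) * keyword


def hinge_loss(alpha, pos, neg, margin=0.1):
    """Pairwise hinge loss max(0, margin - (S_P - S_N)) for one sample pair."""
    s_pos = fused_score(alpha, *pos)
    s_neg = fused_score(alpha, *neg)
    return max(0.0, margin - (s_pos - s_neg))


def update_alpha(alpha, pos, neg, margin=0.1, lr=0.05):
    """One online gradient descent step on alpha; beta follows as 1 - alpha."""
    if hinge_loss(alpha, pos, neg, margin) == 0.0:
        return alpha  # pair already ordered correctly with enough margin
    # Assumption: with beta = 1 - alpha, S = keyword + alpha * (semantic - keyword),
    # so dL/dalpha = -[(sem_P - kw_P) - (sem_N - kw_N)] while the loss is active.
    grad = -((pos[0] - pos[1]) - (neg[0] - neg[1]))
    new_alpha = alpha - lr * grad
    # Clipping to [0, 1] is an added safeguard, not stated in the claims.
    return min(1.0, max(0.0, new_alpha))


# Toy pair: the positive sample is semantically strong, the negative one is
# keyword-strong, so a keyword-leaning alpha initially misranks them.
POS = (0.9, 0.2)
NEG = (0.3, 0.8)
```

Starting from α = 0.4, one call to `update_alpha(0.4, POS, NEG)` moves α toward the semantic score and lowers the loss on this pair; the updated value would then be stored under the user's identifier for later queries.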
Description
Semi-structured text multi-level information retrieval method

Technical Field

The invention relates to the field of information retrieval and data mining, and in particular to a multi-level information retrieval method for semi-structured text that enables intelligent, adaptive multi-level retrieval of semi-structured text data.

Background

In today's big data environment, vast amounts of information exist in semi-structured form and continue to grow. Such data contains both structured attributes and unstructured text content. For example, in e-commerce, product information includes structured fields such as price, brand, and category, together with unstructured text such as product titles, descriptions, and user reviews; documents include attributes such as author, department, and creation date, together with core text such as report bodies and technical proposals; and in academic databases, papers include attributes such as publication year, journal, and author, together with the abstract and full text. How to retrieve the information users need from such semi-structured text data efficiently and accurately is a core challenge for many application systems.

Existing retrieval technologies fall mainly into the following categories, each with its own limitations:

The first category is retrieval based on structured attributes. Such methods use database query languages (e.g., SQL) or filters to locate data by exact matching or range filtering. They are fast and return deterministic results, but their obvious disadvantage is that they cannot understand or process the rich semantic information contained in unstructured text. When the user's intent has to be expressed through text content, as in "a mobile phone with good camera performance", purely structured retrieval is ineffective.

The second category is keyword-based full-text retrieval.
Such techniques (e.g., Lucene and Elasticsearch, both based on inverted indexes) segment text content into words, index it, and measure the relevance of documents to queries by computing statistics such as term frequency and inverse document frequency (e.g., the TF-IDF and BM25 models). They are technically mature and efficient, and they match exact terms and phrases well. However, they are in essence "word matching" and suffer from a "semantic gap": they cannot recognize synonyms (such as "notebook" and "laptop") or semantically related words (such as the company "Apple" and the product "iPhone"), and they are weak at understanding the intent of natural-language queries (for a query such as "science fiction movies suitable for children", keyword search has difficulty relating "children", "teenagers", and "science fiction films").

The third category is semantic vector retrieval. With the development of deep learning, semantic retrieval techniques based on pre-trained language models (e.g., BERT and Sentence-BERT) have come into use. These methods convert text into high-dimensional vectors (embeddings) and measure semantic relevance by computing the cosine similarity between vectors. Their core advantage is deep semantic understanding: they bridge the lexical gap and return results that are semantically related even when the wording differs. They also have weaknesses: first, for "hard" information that requires exact matching, such as model numbers, codes, and proper nouns, their accuracy may be lower than that of keyword retrieval; second, retrieval quality depends heavily on how well the training corpus matches the domain.

To remedy the shortcomings of any single technique, hybrid retrieval strategies have been developed that attempt to combine the advantages of multiple retrieval techniques. However, existing hybrid retrieval schemes still have the following prominent problems:

1. The fusion mode is rigid. Most schemes use simple linear weighted summation or fixed-priority cascading (e.g., keyword retrieval first and then semantic retrieval, or vice versa). The weight parameters (such as the ratio between the semantic score and the keyword score) are usually static global values preset through offline experiments, and they cannot adapt to the dynamic requirements of different query scenes. For example, for the query "iPhone 13 Pro Max 256GB blue", exact matching of the product model (keyword matching) should dominate, whereas for "elegant gifts for women", semantic understanding (the association among gifts, elegance, and women) is more critical. Static weights cannot provide such scene-adaptive adjustment.

2. Adaptation and personalization are lacking. Once an existing system is deployed, its retrieval model (including the hybrid weights) is fixed and cannot optimize itself according to the actual interaction behavior of real users. Search preferences may differ across user groups, and even for the same user across scenes, but the system cannot perceive and adapt