CN-116028811-B - Data backtracking method, medium, device and computing equipment
Abstract
The embodiment of the disclosure provides a data backtracking method, a medium, a device and a computing device, and relates to the technical field of data processing, wherein the data backtracking method comprises the steps of obtaining configuration information corresponding to an offline model, wherein the configuration information comprises data backtracking starting time and whether backtracking marks are formed; if the backtracking mark is the backtracking mark, the data pulling operator stops the existing consumption thread, creates a new consumption thread, acquires target data based on the data backtracking starting time, and generates sample data for updating the offline model according to the target data. The method and the device can greatly improve the efficiency of data backtracking, and further improve the real-time performance and maintainability of online learning of the offline model.
Inventors
- ZHENG LEI
Assignees
- 杭州网易云音乐科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20221229
Claims (17)
- 1. The utility model provides a data backtracking method, is applied to distributed data processing system, the operation has a plurality of flight operators in the distributed data processing system, a plurality of flight operators include data pull operator, characteristic processing operator, theme screening operator and output operator, a plurality of flight operator stream processing realizes the flight task, the theme screening operator includes first passageway and second passageway, the data backtracking method includes: Acquiring configuration information corresponding to an offline model, wherein the configuration information comprises data backtracking start time, a backtracking mark, an offline word list address of the offline model and a designated channel; if the backtracking mark is backtracking, stopping the existing consumption thread by the data pulling operator, and creating a new consumption thread, wherein the new consumption thread acquires target data based on the data backtracking starting time; The feature processing operator extracts the features of the target data, and determines whether the features are in a local vocabulary of the feature processing operator, wherein the local vocabulary is an offline vocabulary acquired according to the offline vocabulary address; if yes, the feature processing operator determines that the features are sample data, the sample data and the configuration information are sent to a downstream flank operator of the feature processing operator, and the sample data are output through the downstream flank operator; If not, the feature processing operator adds 1 to the frequency corresponding to the feature to determine whether the frequency is larger than a threshold, if so, the feature is determined to be the sample data, the sample data and the configuration information are sent to a downstream link operator of the feature processing operator, the subject screening operator determines a target channel according to the designated channel, the target channel is the first channel or the second channel, the designated channel is a channel switched when the offline model training is completed, the subject screening operator sends the sample data to an output operator corresponding to the target channel through the target channel, and the output operator outputs the sample data for updating the offline model.
- 2. The data backtracking method of claim 1, the plurality of flank operators further comprising an input operator, the obtaining configuration information corresponding to an offline model comprising: and responding to the start of the Flink task or the completion of the offline model training, the input operator acquires the configuration information corresponding to the offline model, or the input operator periodically acquires the configuration information corresponding to the offline model.
- 3. The data backtracking method according to claim 2, wherein if the backtracking flag is backtracking, before the data pulling operator stops the existing consumption thread, further comprising: The input operator determines whether the backtracking mark is backtracking; If yes, the input operator broadcasts the configuration information, and the data pulling operator receives the configuration information.
- 4. A data backtracking method as in any one of claims 1-3, the configuration information further comprising a configuration version, the data pull operator stopping an existing consuming thread and before creating a new consuming thread, further comprising: The data pulling operator determines whether to update the local configuration of the data pulling operator based on the configuration version and the local configuration version of the data pulling operator; If yes, updating the local configuration of the data pulling operator; if not, the local configuration of the data pulling operator is not updated.
- 5. The data backtracking method of claim 1, the configuration information further comprising a configuration version, the feature processing operator further comprising, prior to extracting the feature of the target data: The feature processing operator determines whether to update the local configuration of the feature processing operator based on the configuration version and the local configuration version of the feature processing operator; If yes, updating the local configuration of the feature processing operator, and replacing the local word list of the feature processing operator with the offline word list; if not, the local configuration of the feature processing operator is not updated.
- 6. The data backtracking method of claim 3, wherein the configuration information further includes a configuration version, and the topic screening operator further includes, before determining the target channel according to the specified channel: the topic screening operator determines whether to update the local configuration of the topic screening operator based on the configuration version and the local configuration version of the topic screening operator; if yes, updating the local configuration of the topic screening operator, and recording the sequence number corresponding to the instance of the updated local configuration in the topic screening operator based on the configuration version; if not, the local configuration of the topic screening operator is not updated.
- 7. The data backtracking method according to claim 6, wherein after the sequence number corresponding to the locally configured instance is updated in the topic filtering operator, the recording further includes: and when the number of the sequence numbers is determined to be the same as the parallelism of the topic screening operator by the input operator, updating the whether backtracking mark is not backtracking.
- 8. The utility model provides a data backtracking device, is applied to distributed data processing system, it has a plurality of flight operators to operate in the distributed data processing system, a plurality of flight operators include data pull operator, feature processing operator, theme screening operator and output operator, a plurality of flight operator stream processing realizes the flight task, the theme screening operator includes first passageway and second passageway, data backtracking device includes: the acquisition module is used for acquiring configuration information corresponding to the offline model, wherein the configuration information comprises data backtracking start time, whether backtracking marks, an offline word list address of the offline model and a designated channel; the processing module is used for stopping the existing consumption thread and creating a new consumption thread if the backtracking mark is backtracking, and the new consumption thread acquires target data based on the data backtracking starting time; the generation module is used for extracting the characteristics of the target data by the characteristic processing operator and determining whether the characteristics are in a local vocabulary of the characteristic processing operator, wherein the local vocabulary is an offline vocabulary acquired according to the offline vocabulary address; if yes, the feature processing operator determines that the features are sample data, the sample data and the configuration information are sent to a downstream flank operator of the feature processing operator, and the sample data are output through the downstream flank operator; If not, the feature processing operator adds 1 to the frequency corresponding to the feature to determine whether the frequency is larger than a threshold, if so, the feature is determined to be the sample data, the sample data and the configuration information are sent to a downstream link operator of the feature processing operator, the subject screening operator determines a target channel according to the designated channel, the target channel is the first channel or the second channel, the designated channel is a channel switched when the offline model training is completed, the subject screening operator sends the sample data to an output operator corresponding to the target channel through the target channel, and the output operator outputs the sample data for updating the offline model.
- 9. The data backtracking apparatus of claim 8, the plurality of flank operators further comprising an input operator, the obtaining module being specifically configured to: and responding to the start of the Flink task or the completion of the offline model training, the input operator acquires the configuration information corresponding to the offline model, or the input operator periodically acquires the configuration information corresponding to the offline model.
- 10. The data backtracking apparatus of claim 9, the processing module further to: And before the data pulling operator stops the existing consumption thread if the backtracking mark is backtracking, determining whether the backtracking mark is backtracking or not by the input operator, if so, broadcasting the configuration information by the input operator, and receiving the configuration information by the data pulling operator.
- 11. The data backtracking apparatus of any one of claims 8-10, the configuration information further comprising a configuration version, the processing module further to: before the data pulling operator stops the existing consumption thread and creates a new consumption thread, the data pulling operator determines whether to update the local configuration of the data pulling operator based on the configuration version and the local configuration version of the data pulling operator; If yes, updating the local configuration of the data pulling operator; if not, the local configuration of the data pulling operator is not updated.
- 12. The data backtracking apparatus of claim 8, the configuration information further comprising a configuration version, the generation module further to: Before the feature processing operator extracts the feature of the target data, the feature processing operator determines whether to update the local configuration of the feature processing operator based on the configuration version and the local configuration version of the feature processing operator; If yes, updating the local configuration of the feature processing operator, and replacing the local word list of the feature processing operator with the offline word list; if not, the local configuration of the feature processing operator is not updated.
- 13. The data backtracking apparatus of claim 10, the configuration information further comprising a configuration version, the generation module further to: Before the topic screening operator determines a target channel according to the specified channel, the topic screening operator determines whether to update the local configuration of the topic screening operator based on the configuration version and the local configuration version of the topic screening operator; if yes, updating the local configuration of the topic screening operator, and recording the sequence number corresponding to the instance of the updated local configuration in the topic screening operator based on the configuration version; if not, the local configuration of the topic screening operator is not updated.
- 14. The data backtracking apparatus of claim 13, the generation module further to: After the sequence numbers corresponding to the locally configured examples are updated in the topic screening operator are recorded, the input operator updates the whether backtracking mark is not backtracked when the number of the sequence numbers is determined to be the same as the parallelism of the topic screening operator.
- 15. A computing device includes a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; The processor executes computer-executable instructions stored in the memory to implement the data backtracking method of any one of claims 1 to 7.
- 16. A storage medium having stored therein computer program instructions which, when executed, implement the data backtracking method of any one of claims 1 to 7.
- 17. A computer program product comprising a computer program which, when executed by a processor, implements the data backtracking method of any one of claims 1 to 7.
Description
Data backtracking method, medium, device and computing equipment Technical Field Embodiments of the present disclosure relate to the field of data processing technologies, and more particularly, to a data backtracking method, medium, apparatus, and computing device. Background This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section. The offline model is a model obtained by training using historical full-scale data. And the data backtracking is to replace a real-time dictionary with a dictionary (used for recording admission characteristics of the offline model) trained by the offline model after the offline model is trained, and consume data again from the end point of the offline model data to generate a real-time sample, and update the offline model according to the real-time sample, so that the effect of updating the model on the basis of the offline model is achieved. Currently, data backtracking is typically performed through a Flink task. Specifically, when data backtracking is performed on a real-time sample, a Flink task is restarted manually, a starting mode is designated as a timestamp (timestamp) mode when the Flink task is restarted, and a timestamp to be backtracked is designated. After the Flink task is restarted, the data can be re-consumed from the designated timestamp needing to be traced back, a real-time sample is generated, and the offline model is updated according to the real-time sample. However, the efficiency of data backtracking by manually restarting the link task is low. Disclosure of Invention The disclosure provides a data backtracking method, medium, device and computing equipment, which are used for solving the problem that the efficiency of data backtracking is low in a mode of manually restarting a Flink task. In a first aspect of the embodiments of the present disclosure, a data backtracking method is provided, which is applied to a distributed data processing system, where a plurality of flank operators are operated in the distributed data processing system, where the plurality of flank operators include a data pulling operator, and the plurality of flank operators implement a flank task through streaming processing, where the data backtracking method includes: Acquiring configuration information corresponding to the offline model, wherein the configuration information comprises data backtracking start time and whether backtracking marks are needed; If the backtracking mark is backtracking, stopping the existing consumption thread by the data pulling operator, and creating a new consumption thread, wherein the new consumption thread acquires target data based on the data backtracking starting time; Sample data for updating the offline model is generated from the target data. In a possible implementation manner, the plurality of flank operators further comprise a feature processing operator, the configuration information further comprises an offline word list address of the offline model, sample data for updating the offline model are generated according to target data, the feature processing operator extracts features of the target data and determines whether the features are in a local word list of the feature processing operator, the local word list is an offline word list obtained according to the offline word list address, if yes, the feature processing operator determines the features as sample data, the sample data and the configuration information are sent to a downstream flank operator of the feature processing operator, the sample data are output through the downstream flank operator, if not, the feature processing operator adds 1 to the frequency corresponding to the features to determine whether the frequency is greater than a threshold, if yes, the features are determined as sample data, the sample data and the configuration information are sent to the downstream flank operator of the feature processing operator, and the sample data are output through the downstream flank operator. In a possible implementation manner, the plurality of flank operators further comprise a theme screening operator and an output operator, the theme screening operator comprises a first channel and a second channel, the configuration information further comprises a designated channel, and the downstream flank operator outputs sample data, the theme screening operator determines a target channel according to the designated channel, the target channel is the first channel or the second channel, the designated channel is a channel switched when the offline model training is completed, the theme screening operator sends the sample data to the output operator corresponding to the target channel through the target channel, and the output operator outputs the sample data. In one possible implementation, the plurality of flank operators further c