
US-20260127668-A1 - DECISION TREE DATA STRUCTURE BASED PROCESSING SYSTEM


Abstract

A decision tree-based processing system implements a decision tree data structure to process publicly available files, such as websites, suspected to contain data generated by a data transaction processing system, where that data may be constantly fluctuating and varying. The data transaction processing system generates the data based on processing electronic data transaction request messages received over a network.

Inventors

  • Arjun Parmar
  • David John Geddes

Assignees

  • CHICAGO MERCANTILE EXCHANGE INC.

Dates

Publication Date: 2026-05-07
Application Date: 2026-01-02

Claims (20)

  1. A computer implemented method comprising: identifying and locating, by a processor, one or more web pages suspected to contain data generated by a data transaction processing system that generates continuously updated data as a result of processing electronic data transaction request messages; and for each web page: accessing, by the processor, the web page; processing, by the processor, content of the web page using a machine learning model stored in a memory coupled with the processor, the machine learning model having been trained using a training data set comprising a plurality of files each labeled to indicate whether the labeled file includes data previously generated by the data transaction processing system, wherein training the machine learning model comprises evaluating relationships between attributes identified in the plurality of labeled files and corresponding outcome labels indicating presence or absence of data generated by the data transaction processing system, at least some of the plurality of files having been received from the data transaction processing system and labeled in response to processing electronic data transaction request messages; determining, by the processor based on an output of the machine learning model, whether the web page contains data generated by the data transaction processing system; and upon determining that the web page contains data generated by the data transaction processing system, determining, by the processor, whether a presence of the data generated by the data transaction processing system is authorized; and generating and transmitting, based on determining that the presence of the data generated by the data transaction processing system is unauthorized, a message that causes access to the web page to be impeded.
  2. The computer implemented method of claim 1, wherein the plurality of labeled files include exchange-generated market data.
  3. The computer implemented method of claim 2, wherein the exchange-generated market data includes data indicative of at least one of financial instrument product codes, exchange identifiers, price-related keywords, or predefined text patterns.
  4. The computer implemented method of claim 1, wherein training the machine learning model comprises: evaluating the plurality of labeled files to determine how well each attribute classifies files that include data generated by the data transaction processing system from files that do not include data generated by the data transaction processing system; and selecting attributes for use in the machine learning model based on an information gain associated with each attribute.
  5. The computer implemented method of claim 1, wherein the machine learning model comprises a supervised learning model generated using a classification algorithm selected from ID3, C4.5, or a software-implemented machine learning library.
  6. The computer implemented method of claim 1, wherein determining whether the web page contains data generated by the data transaction processing system comprises generating a binary classification output indicating presence or absence of the data.
  7. The computer implemented method of claim 1, wherein determining whether the web page contains data generated by the data transaction processing system comprises generating a score indicative of a likelihood that the web page contains the data.
  8. The computer implemented method of claim 1, wherein determining whether the presence of the data generated by the data transaction processing system is authorized comprises querying a database storing identifiers of entities licensed to display, transmit, or forward the data.
  9. The computer implemented method of claim 1, further comprising: verifying, by the processor, correctness of the determination that the web page contains data generated by the data transaction processing system; labeling, by the processor, the web page based on the verification; and retraining, by the processor, the machine learning model using the labeled web page to update classification accuracy.
  10. The computer implemented method of claim 1, wherein training the machine learning model comprises recursively partitioning the plurality of labeled files into subsets based on absence or presence of selected attributes until a subset exhibits a uniform outcome label or further partitioning fails to improve classification accuracy.
  11. The computer implemented method of claim 1, further comprising: generating, for each labeled file of the plurality of labeled files, an array of a plurality of generated arrays that includes a plurality of attributes associated with exchange-generated market data and, for each attribute, data indicating absence or presence of the attribute in the labeled file; and training the machine learning model using the plurality of generated arrays and corresponding outcome labels.
  12. A computer system comprising: a processor and a non-transitory memory coupled therewith, wherein the memory stores computer executable instructions that, when executed by the processor, cause the processor to: identify and locate one or more web pages suspected to contain data generated by a data transaction processing system that generates continuously updated data as a result of processing electronic data transaction request messages; and for each web page: access the web page; process content of the web page using a machine learning model stored in a memory coupled with the processor, the machine learning model having been trained using a training data set comprising a plurality of files each labeled to indicate whether the labeled file includes data previously generated by the data transaction processing system, wherein training the machine learning model comprises evaluating relationships between attributes identified in the plurality of labeled files and corresponding outcome labels indicating presence or absence of data generated by the data transaction processing system, at least some of the plurality of files having been received from the data transaction processing system and labeled in response to processing electronic data transaction request messages; determine, based on an output of the machine learning model, whether the web page contains data generated by the data transaction processing system; and upon determination that the web page contains data generated by the data transaction processing system, determine whether a presence of the data generated by the data transaction processing system is authorized; and generate and transmit, based on the determination that the presence of the data generated by the data transaction processing system is unauthorized, a message that causes access to the web page to be impeded.
  13. The computer system of claim 12, wherein the plurality of labeled files include exchange-generated market data.
  14. The computer system of claim 13, wherein the exchange-generated market data includes data indicative of at least one of financial instrument product codes, exchange identifiers, price-related keywords, or predefined text patterns.
  15. The computer system of claim 12, wherein training the machine learning model comprises: evaluation of the plurality of labeled files to determine how well each attribute separates files that include data generated by the data transaction processing system from files that do not include data generated by the data transaction processing system; and selection of attributes for use in the machine learning model based on an information gain associated with each attribute.
  16. The computer system of claim 12, wherein the machine learning model comprises a supervised learning model generated using a classification algorithm selected from ID3, C4.5, or a software-implemented machine learning library.
  17. The computer system of claim 12, wherein the computer executable instructions further cause the processor to: generate a binary classification output indicating presence or absence of the data to determine whether the web page contains data generated by the data transaction processing system.
  18. The computer system of claim 12, wherein the computer executable instructions further cause the processor to generate a score indicative of a likelihood that the web page contains data generated by the data transaction processing system.
  19. The computer system of claim 12, wherein the computer executable instructions further cause the processor to query a database storing identifiers of entities licensed to display, transmit, or forward the data to determine whether the presence of the data generated by the data transaction processing system is authorized.
  20. The computer system of claim 12, wherein the computer executable instructions further cause the processor to: verify correctness of the determination that the web page contains data generated by the data transaction processing system; label the web page based on the verification; and retrain the machine learning model using the labeled web page to update classification accuracy.
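
Illustrative sketch (not part of the claims). The training steps recited in claims 4, 10 and 11 (presence/absence arrays per labeled file, attribute selection by information gain, and recursive partitioning until a subset is uniform or no split helps) might be sketched as below. This is a minimal ID3-style example under stated assumptions, not the applicant's implementation: the attribute names, the six sample rows, and the helper names `entropy`, `information_gain`, `build_tree` and `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical attribute names, loosely following the categories in claim 3.
ATTRIBUTES = ["product_code", "exchange_id", "price_keyword"]

def entropy(labels):
    """Shannon entropy, in bits, of a list of outcome labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on absence (0) / presence (1) of attr."""
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Recursively partition until labels are uniform or no split improves."""
    if len(set(labels)) == 1:
        return labels[0]                      # uniform outcome label: leaf
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:
        return majority                       # no attributes left to test
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] <= 0:
        return majority                       # further partitioning does not help
    node = {"attr": best, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in (0, 1):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if idx:
            node["children"][value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx], remaining)
        else:
            node["children"][value] = majority
    return node

def classify(tree, row):
    """Walk the tree for one presence/absence array; returns True or False."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

# Made-up training set: one presence/absence array per labeled file.
rows = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
labels = [True, True, True, False, False, False]  # True = contains the data

tree = build_tree(rows, labels, list(range(len(ATTRIBUTES))))
print(ATTRIBUTES[tree["attr"]], classify(tree, [1, 0, 0]))
```

In this toy data set the first attribute separates the two classes perfectly, so the sketch selects it as the root and stops after one split. A production classifier would more plausibly rely on an existing library, which claim 5 also contemplates.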

Description

REFERENCE TO RELATED APPLICATIONS

This application claims priority to and the benefit as a continuation of U.S. patent application Ser. No. 18/635,566, filed Apr. 15, 2024, now U.S. Pat. No. ______, which is a continuation of U.S. patent application Ser. No. 17/308,603, filed May 5, 2021, now U.S. Pat. No. 11,983,771, which is a continuation of U.S. patent application Ser. No. 15/921,301, filed Mar. 14, 2018, now U.S. Pat. No. 11,030,691, the entirety of which are incorporated by reference herein and relied upon.

BACKGROUND

Data providers typically generate and provide data to paying customers or subscribers, such as on a regular, periodic, on-going or continuous basis. The data may be provided to the data subscribers as a data feed or data stream. Data providers may enter into complex agreements with a large number of data recipients, where some, but not all, data recipients may provide, e.g., copy or forward, the data to other data recipients. The data may be subscribed to, provided and consumed in a non-linear, multilateral fashion. For example, a data set may be manipulated or modified more than once, by more than one entity, and consumed more than once and in different forms.

Moreover, the data may be generated by a data transaction processing system, such as one which implements a market for trading financial instruments, which processes incoming transactions and generates results based thereon. The data may reflect the current state of the data transaction processing system, e.g., the state of the market implemented thereby, so that the data is updated and fluctuates in real time, i.e., based on processing incoming transactions. It should be appreciated that real time may mean that the data generated by the data transaction processing system fluctuates as the system state is updated based on the processing of incoming transactions as they are received by the data transaction processing system.
In other words, the data generated by the data transaction processing system changes as quickly as the incoming data transaction requests are received by the data transaction processing system.

Data providers may attempt to exercise control over the data being generated, which in many industries is extremely valuable, with subscriptions that can cost millions of dollars per year. However, given the ease with which electronic data may be replicated and transmitted, data subscription licenses can be difficult to control, audit, and bill when there are many recipients.

An example of data feeds provided by a data generator is the market data feeds provided by a financial instrument trading system, such as a futures exchange, e.g., the Chicago Mercantile Exchange Inc. (CME). CME generates a voluminous amount of market data which reflects the current state of the market for thousands of financial instruments, and provides a variety of market data feeds to data subscribers. Traders and market participants rely on market data feeds to understand the state of the market for one or more financial instruments at a given moment in time, e.g., the present, at a past time period, or as compared between multiple time periods. The state of the market may vary extremely rapidly, e.g., within microseconds. Market participants may formulate their trading strategies based on the market data feeds. As noted above, the market data feeds may fluctuate as the corresponding markets for financial instruments vary, e.g., as the data transaction processing system processes transactions which affect the state of the system. Accordingly, market data feeds contain extremely valuable information for market participants.

A futures exchange, such as the CME, may accordingly provide market data to consumers under a license. Only users agreeing to the license can legally receive, use, and/or distribute market data.
The futures exchange may provide a variety of different licenses for different users, e.g., one license may allow a single user to consume market data for a specified financial instrument, whereas another license may allow all the employees of a trading firm to consume and distribute market data for an entire class of financial instruments.

However, once market data is generated by the futures exchange and disseminated to some licensed users, control of the market data becomes difficult. Market data is electronic data that can be shared digitally over the Internet or other wide area networks, so it can easily be reproduced, copied, and forwarded to unlicensed users and countless other parties.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a computer network system, according to some embodiments.

FIG. 2 depicts a general computer system, according to some embodiments.

FIG. 3A depicts a storage data structure, according to some embodiments.

FIG. 3B depicts another storage data structure, according to some embodiments.

FIG. 3