US-12619651-B2 - Hierarchical dictionary with statistical filtering based on word frequency

US12619651B2US 12619651 B2US12619651 B2US 12619651B2US-12619651-B2

Abstract

A hierarchical dictionary having methods of storing words based on frequency thereof in one or more documents that includes the steps of identifying a hash value corresponding to an inputted word; storing the word in a first hash map and in a second hash map having a substantially larger word storage capacity than the first hash map based on the identified hash value; clearing the first hash map at every predetermined period or triggering event; determining whether a frequency of the word as stored in the second hash map exceeds a predetermined value; and if so, promoting the word from the second hash map to a third hash map having a substantially larger word storage capacity than the second hash map for long-term storage and later retrieval.

Inventors

Ralph MEIER
Johannes HAUSMANN
Harry Urbschat
Thorsten WANSCHURA

Assignees

HYLAND SWITZERLAND SÀRL

Dates

Publication Date: 20260505
Application Date: 20250106

Claims (20)

1 . A method performed by a processor of a computing device for organizing a plurality of words associated with a document, the method comprising: obtaining, by the processor, the plurality of words; storing, by the processor, a word in the plurality of words in a first data structure stored in memory accessible to the processor, wherein the first data structure has a first word capacity; updating, by the processor, a frequency of the word in the first data structure; storing, by the processor, the word in a second data structure stored in the memory accessible to the processor, wherein the second data structure has a second word capacity that is greater than the first word capacity; and ranking, by the processor and based upon the frequency, the word relative to each other of the plurality of words; promoting, by the processor and based upon the ranking, the word to a third data structure at a longer period of time than when stored in second data structure after a predetermined period of time that initiates in response to the word reaching a predetermined frequency limit in the second data structure, wherein the third data structure has a third word capacity that is greater than the second word capacity of the second data structure; gathering into the second data structure, by the processor based upon the ranking and instructions in the second data structure, statistics on each of the plurality of words in the document, wherein the second data structure comprises the instructions that operate to statistically filter the plurality of words being inputted for gathering the statistics on each of the plurality of words in the document and to filter the plurality of words stored in the first data structure and the third data structure; and based on the instructions in the second data structure, filtering, by the processor, the plurality of words stored in the first data structure and the third data structure by transferring and removing between the first data structure and the third data structure each of the plurality of words based on the statistics that the second data structure gathered.
2 . The method of claim 1 , wherein the plurality of words are stored in a first hash map stored in the first data structure, wherein the second plurality of words are stored in a second hash map stored in the second data structure.
3 . The method of claim 1 , wherein the first data structure, the second data structure, and the third data structure form layers of a data structure, each of the layers comprising a fixed size hash map that is associated with a different capacity limit, a different frequency limit, and a different period of time from one another among the plurality of words.
4 . The method of claim 1 , wherein the document is a dictionary.
5 . The method of claim 1 , wherein the word includes phrases, images, and sounds.
6 . The method of claim 1 , wherein the plurality of word are input to a first hash map and a second hash map, wherein the first hash map has a first word capacity and the second hash map has a second word capacity, wherein the second word capacity is greater than the first word capacity.
7 . The method of claim 6 , wherein for each word of the plurality of words: identifying a hash value corresponding to a word in the first hash map and the second hash map, determining whether a bucket associated with the hash value is empty; upon a determination that the bucket is empty, setting a frequency of the word to a first value and storing the word and the frequency of the word in the bucket; and upon a determination that the bucket is not empty, updating the frequency of the word in the bucket to a second value.
8 . The method of claim 7 , further comprising: determining, in the second hash map and after a predetermined period of time has elapsed, whether the frequency of at least one of the plurality of words is equal to or greater than a predetermined limit; and upon a positive determination that the frequency of the at least one of the plurality of words is equal to or greater than the predetermined limit, promoting each of the at least one of the plurality of words to a third hash map, wherein the third hash map has a third word capacity, wherein the third word capacity is greater than the second word capacity, wherein a first word in the plurality of words has a frequency that is less than the predetermined limit stored in the second hash map and a second word in the plurality of words has a frequency that is equal to or greater than the predetermined limit stored in the third hash map, the second hash map keeping respective frequencies of the plurality of words and including program instructions for performing the promoting from the first hash map to the third hash map.
9 . A computing device, comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising: obtaining, by the processor, a plurality of words associated with a document; storing, by the processor, a word in the plurality of words in a first data structure stored in memory accessible to the processor, wherein the first data structure has a first word capacity; updating, by the processor, a frequency of the word in the first data structure; storing, by the processor, the word in a second data structure stored in the memory accessible to the processor, wherein the second data structure has a second word capacity that is greater than the first word capacity; and ranking, by the processor and based upon the frequency, the word relative to each other of the plurality of words; promoting, by the processor and based upon the ranking, the word to a third data structure at a longer period of time than when stored in second data structure after a predetermined period of time that initiates in response to the word reaching a predetermined frequency limit in the second data structure, wherein the third data structure has a third word capacity that is greater than the second word capacity and the second data structure; gathering into the second data structure, by the processor based upon the ranking and instructions in the second data structure, statistics on each of the plurality of words in the document, wherein the second data structure comprises the instructions that operate to statistically filter the plurality of words being inputted for gathering the statistics on each of the plurality of words in the document and to filter the plurality of words stored in the first data structure and the third data structure; and based on the instructions in the second data structure, filtering, by the processor, the plurality of words stored in the first data structure and the third data structure by transferring and removing between the first data structure and the third data structure each of the plurality of words based on the statistics that the second data structure gathered.
10 . The computing device of claim 9 , wherein the plurality of words are stored in a first hash map stored in the first data structure, wherein the second plurality of words are stored in a second hash map stored in the second data structure.
11 . The computing device of claim 9 , wherein the word is cleared from the third data structure periodically.
12 . The computing device of claim 9 , wherein the document is a dictionary.
13 . The computing device of claim 9 , wherein the word includes phrases, images, and sounds.
14 . The computing device of claim 9 , wherein the plurality of word are input to a first hash map and a second hash map, wherein the first hash map has a first word capacity and the second hash map has a second word capacity, wherein the second word capacity is greater than the first word capacity.
15 . A non-transitory computer-readable storage medium having stored therein a first data structure for storing a plurality of words associated with a document for a predetermined period of time, the first data structure having a first word capacity, a second data structure, wherein the second data structure has a second word capacity that is greater than the first word capacity; the medium further including instructions for performing acts comprising: obtaining, by a processor, the plurality of words; storing, by the processor, a word in the plurality of words in a first data structure stored in memory accessible to the processor; updating, by the processor, a frequency of the word in the first data structure; storing, by the processor, the word in a second data structure stored in the memory accessible to the processor; and comparing by the processor and based upon the frequency, the word relative to each other of the plurality of words; promoting, by the processor and based upon the comparing, the word to a third data structure at a longer period of time than when stored in second data structure after a predetermined period of time that initiates in response to the word reaching a predetermined frequency limit in the second data structure, wherein the third data structure has a third word capacity that is greater than the second word capacity and the second data structure; gathering into the second data structure, by the processor based upon the comparing and instructions in the second data structure, statistics on each of the plurality of words in the document, wherein the second data structure comprises the instructions that operate to statistically filter the plurality of words being inputted for gathering the statistics on each of the plurality of words in the document and to filter the plurality of words stored in the first data structure and the third data structure; and based on the instructions in the second data structure, filtering, by the processor, the plurality of words stored in the first data structure and the third data structure by transferring and removing between the first data structure and the third data structure each of the plurality of words based on the statistics that the second data structure gathered.
16 . The non-transitory computer-readable storage medium of claim 15 , wherein the plurality of words are stored in a first hash map stored in the first data structure, wherein the second plurality of words are stored in a second hash map stored in the second data structure.
17 . The non-transitory computer-readable storage medium of claim 15 , wherein the word is cleared from the third data structure periodically.
18 . The non-transitory computer-readable storage medium of claim 15 , wherein the document is a dictionary.
19 . The non-transitory computer-readable storage medium of claim 15 , wherein the word includes phrases, images, and sounds.
20 . The non-transitory computer-readable storage medium of claim 15 , wherein the plurality of word are input to a first hash map and a second hash map, wherein the first hash map has a first word capacity and the second hash map has a second word capacity, wherein the second word capacity is greater than the first word capacity.

Description

CROSS REFERENCE TO RELATED APPLICATIONS This application is a continuation of U.S. patent application Ser. No. 17/175,288 (“the '288 application”), filed on Feb. 12, 2021, and entitled “HIERARCHICAL DICTIONARY WITH STATISTICAL FILTERING BASED ON WORD FREQUENCY,” which will issue on Jan. 7, 2025, which, in turn, was a continuation of U.S. patent application Ser. No. 15/395,778 (“the '778 application”), filed on Dec. 30, 2016, and entitled “HIERARCHICAL DICTIONARY WITH STATISTICAL FILTERING BASED ON WORD FREQUENCY” and which has an issue date of Feb. 16, 2021 as U.S. Pat. No. 10,922,347. The '778 application claims priority to U.S. Provisional Patent Application No. 62/288,032 (“the '032 application”), filed on Jan. 28, 2016, and entitled “Hierarchical Dictionary with Statistical Filtering Used for Automatic Online Extraction Value Validation”. Each of these prior applications are incorporated in their entireties herein by reference. STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT None. REFERENCE TO SEQUENTIAL LISTING, ETC. None. BACKGROUND 1. Technical Field The present disclosure pertains to a dictionary having methods for storing words, and more particularly to, a hierarchical dictionary generally having short, medium, and long-term storage layers as filtered based on frequency. 2. Description of the Related Art Humans have an implicit ability to spot errors i.e., misspellings, within text despite the fact that they do not explicitly know all words possible within specific documents or might read a word or a phrase for the first time. For example, within the phrase “PHYSICS EDU POLE VLT” a human reader can spot the mixture of two words: “Physics Education” and “Pole Vault”. A well-grounded understanding of words is typically formed by learning and exposure. In creating dictionaries, words are often assigned to a particular unique identifier. These types of dictionaries, however, not only take up a substantial amount of memory as more words are added overtime but also lack meaning, as they are incapable of giving users a view of how words are used in processed documents. Accordingly, there is a need for a system and methods of storing words into a dictionary which mimics a human brain's capability of storing words at a short or long term basis depending on a number of times a word has been used. SUMMARY A system and methods for organizing a set of words associated with one or more documents based on frequency are disclosed. A hierarchical dictionary stored in a memory and communicatively coupled to one or more applications in a computing device may include a first layer of data structure for storing a first set words associated with a portion of a document, a second layer of data structure for storing a second set of words including the first set of words and corresponding frequencies thereof in the document, and a third layer of data structure for storing a third set of words from the second set of words exceeding a predetermined frequency limit. All of the first, second, and third layer of data structures may be implemented as hash maps and may be treated as independent dictionaries. The first set of words stored in the first data structure may be swiped clean following a predetermined period or a triggering event. The second data structure acts as a filter for promoting a set of words from the first data structure exceeding a predetermined frequency limit to the third data structure or for retaining the set of words therein. The third data structure, when receiving words from the second data structure, may store words at a substantially longer period of time in the memory coupled to or integral with the computing device relative to being stored in the first and second data structures. In one example embodiment, a method for storing words associated with a document includes: identifying a hash value associated with each word; storing in the first and second hash maps the word to a bucket position associated with the identified hash value; following a predetermined period of time, determining whether a frequency of the word exceeded a predetermined frequency limit; and promoting the word to a next layer of data structure upon a positive determination that the predetermined frequency limit for the word has been exceeded. Other embodiments, objects, features and advantages of the disclosure will become apparent to those skilled in the art from the detailed description, the accompanying drawings and the appended claims. BRIEF DESCRIPTION OF THE DRAWINGS The above-mentioned and other features and advantages of the present disclosure, and the manner of attaining them, will become more apparent and will be better understood by reference to the following description of example embodiments taken in conjunction with the accompanying drawings. Like reference numerals are used to indicate the same element throughout the specification. FIG. 1 is a system including a hierarchical dictionary for storing