Specification
BACKGROUND
[001] In information theory, data compression or source coding is a process of encoding information using fewer bits than an unencoded representation would use through use of specific encoding schemes. For example, a text could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp." One conventional instance of compression with which many computer users are familiar is the "ZIP" file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.
[002] As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, a text makes sense only if a receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.
[003] Compression is useful because it helps reduce the consumption of expensive resources, such as memory or transmission bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard). This extra decompression processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). Data compression schemes therefore involve trade-offs among various factors, including memory, the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.
SUMMARY
[004] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter. Nor is this Summary intended to be used to limit the claimed subject matter's scope.
[005] Data compression and key word recognition may be provided. A first pass may walk a text string, generate terms, and calculate a hash value for each generated term. For each hash value, a hash bucket may be created where an associated occurrence count may be maintained. The hash buckets may be sorted by occurrence count and a few top buckets may be
kept. Once those top buckets are known, a second pass may walk the text string, generate terms, and calculate a hash value for each term. If the hash values of terms match hash values of one of the kept buckets, then the term may be considered a frequent term. Consequently, the term may be added to a dictionary along with a corresponding frequency count. Then, the dictionary may be examined to remove terms that may not be frequent, but appeared due to hash collisions.
[006] Both the foregoing general description and the following detailed description provide examples and are explanatory only. Accordingly, the foregoing general description and the following detailed description should not be considered to be restrictive. Further, features or variations may be provided in addition to those set forth herein. For example, embodiments may be directed to various feature combinations and sub-combinations described in the detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[007] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various embodiments of the present invention. In the drawings:
[008] FIG. 1 is a block diagram of an operating environment;
[009] FIG. 2 is a flow chart of a method for recognizing text;
[010] FIG. 3 is a block diagram of a first pass;
[011] FIG. 4 is a block diagram of a second pass;
[012] FIG. 5 illustrates an algorithm; and
[013] FIG. 6 is a block diagram of a system including a computing device.
DETAILED DESCRIPTION
[014] The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar elements. While embodiments of the invention may be described, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the elements illustrated in the drawings, and the methods described herein may be modified by substituting, reordering, or adding stages to the disclosed methods. Accordingly, the following detailed description does not limit the invention. Instead, the proper scope of the invention is defined by the appended claims.
[015] Consistent with embodiments of the invention, a two-pass hash based algorithm may be used for discovering frequent terms in a document (e.g. a text string). In addition, quality vs. memory vs. runtime trade-offs may be controlled. Embodiments of the invention may find, for example, the most frequent 100 keywords along with their counts in a 1.68 MB document with a 1% miss rate, taking 1.5 seconds while keeping memory under 3 KB.
[016] An algorithm described below, for example, may discover frequent keywords in a text string. Consider a text string containing the following text: "I went to the market but the market was closed." Embodiments of the invention may discover that "market" and "the" may be frequent terms that may qualify as keywords. However, embodiments of the invention may also be used for compression purposes. Consider a text string containing the following text: "abracadabra, abrasive brats!". Embodiments of the invention may discover that "bra" is a frequent term that may be worth compressing. Regardless, a goal of embodiments of the invention may be to discover frequent repeat terms. Another goal may be to produce a document summary, to highlight terms of importance, to compress document data, to find excessive grammatical repetitions, etc. In other words, embodiments of the invention may find frequent terms in a text string, be they keywords or even substrings within strings. In addition, embodiments of the invention may determine how many times each frequent term appears in the text string and may control memory usage, runtime execution, and result quality.
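For illustration, the keyword discovery described above can be sketched as a direct frequency count (a hypothetical Python sketch, not part of the claimed embodiments; the two-pass hash based variant, which bounds memory, is described later):

```python
from collections import Counter
import string

def frequent_terms(text, top_n=2):
    # Normalize case and strip punctuation so "market." counts as "market".
    terms = [t.strip(string.punctuation).lower() for t in text.split()]
    return Counter(terms).most_common(top_n)
```

Applied to "I went to the market but the market was closed.", this finds "the" and "market" with a count of 2 each, matching the example above.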
[017] Conventional compression algorithms are designed for compress-and-store purposes. Thus conventional systems can make lookups expensive, or sometimes impossible without full decompression. Consistent with embodiments of the invention, the aforementioned two-pass hash based algorithm may implement lightweight compression, making quick lookups possible. Consider the following scenario for example. We may have a long list of accounts/contacts:
- John Doe
- John Wayne
- Wayne Smith
When the user types in some text (call this the input), we may quickly identify the accounts/contacts above, and highlight them.
[018] ZIP may be used to compress the list. While the list may be compressed very efficiently, there may be no way to perform quick lookups. Because ZIP compression may be adaptive, we may not hope to compress the input and search for matching bytes in the compressed list. Decompressing the list to compare with the input may defeat the whole purpose as well.
[019] If the aforementioned two-pass hash based algorithm is run to find the most frequent two symbols (e.g. also weighing the length of the string and the cost of encoding to find the "best" two symbols to compress), we may end up with a compact list after lightweight compression:
- $1=John
- $2=Wayne
- $1 Doe
- $1 $2
- $2 Smith
Because we may control how many symbols are encoded, we may replace those symbols in the input and compare.
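The substitution scheme above can be sketched as follows (an illustrative Python sketch; the function names and the whitespace-delimited word model are assumptions made for the example):

```python
def compress(lines, frequent):
    # Map each frequent term to a short symbol: "John" -> "$1", "Wayne" -> "$2".
    table = {term: "$%d" % (i + 1) for i, term in enumerate(frequent)}
    return [" ".join(table.get(word, word) for word in line.split())
            for line in lines]

def lookup(compressed, query, frequent):
    # Encode the query with the same table, then match without decompressing.
    encoded = compress([query], frequent)[0]
    return [line for line in compressed if encoded in line]
```

For the contact list above, `compress(["John Doe", "John Wayne", "Wayne Smith"], ["John", "Wayne"])` yields `['$1 Doe', '$1 $2', '$2 Smith']`, and a lookup for "John" matches the first two entries without any decompression.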
[020] The following is another example with XML. For example, we may have a large XML document containing:
<Greetings>
<Word id="Welcome">English</Word>
<Word id="Bonjour">French</Word>
<Word id="Danke">German</Word>
</Greetings>
We may want to be able to use less memory to hold the XML document but still be able to use XPath queries for lookups: XPath = Greetings/Word[@id="Welcome"].
[021] We may use ZIP to compress the XML document. XPath may just not work (because we have compressed bytes, not an XML document any more). Now if we run the aforementioned two-pass hash based algorithm to find the most frequent two symbols (e.g. avoiding encoding certain XML nodes or attributes may be used to find the "best" two symbols to compress), we may end up with a compact XML document after lightweight compression:
$1=Greetings $2=Word
<$1>
<$2 id="Welcome">English</$2>
<$2 id="Bonjour">French</$2>
<$2 id="Danke">German</$2>
</$1>
Now we may continue to run XPath queries: XPath = $1/$2[@id="Welcome"]
[022] FIG. 1 shows a recognition system 100 consistent with embodiments of the invention. As shown in FIG. 1, a text string 105 may be operated upon by a first pass 110 and a second pass 115 to produce a dictionary 120. For example, first pass 110 may walk text string 105, generate terms, and calculate a hash value for each generated term. For each hash value, a hash bucket may be created where an associated occurrence count may be maintained. The hash buckets may be sorted by occurrence count and a few top buckets may be kept. Once those top buckets are known, second pass 115 may walk text string 105 once again, generate terms, and again calculate a hash value for each term. If the hash values of terms match hash values of one of the kept buckets, then there may be a good probability that the term is a frequent term. Consequently, the term may be added to dictionary 120 along with a corresponding frequency count. Then, dictionary 120 may be examined to remove terms that may not be frequent, but appeared due to hash collisions. Next, terms in dictionary 120 may be ranked, pruned, and filtered. An example of an algorithm 500 for implementing the aforementioned two-pass process is shown in FIG. 5.
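The two-pass process may be sketched as follows (an illustrative Python sketch; the bucket-table size of 1024, the punctuation handling, and the singleton-pruning heuristic are assumptions, and the shift-add hash mirrors the spec's GetHash example):

```python
from collections import Counter

NUM_BUCKETS = 1024  # assumed bucket-table size; the spec leaves this tunable

def get_hash(term):
    # Deterministic shift-add hash (hash * 33 + character), kept to 32 bits.
    h = 0
    for ch in term:
        h = ((h << 5) + h + ord(ch)) & 0xFFFFFFFF
    return h

def two_pass_frequent_terms(text, top_buckets=2):
    terms = [t.strip(".,!").lower() for t in text.split()]

    # First pass: count occurrences per hash bucket, keep only the top buckets.
    counts = Counter(get_hash(t) % NUM_BUCKETS for t in terms)
    kept = {bucket for bucket, _ in counts.most_common(top_buckets)}

    # Second pass: a term hashing into a kept bucket is probably frequent;
    # add it to the dictionary with its frequency count.
    dictionary = Counter(t for t in terms if get_hash(t) % NUM_BUCKETS in kept)

    # Examine the dictionary: drop terms that reached a kept bucket only
    # through a hash collision (here, a simple singleton-pruning heuristic).
    return {t: c for t, c in dictionary.items() if c > 1}
```

Only the per-bucket counters are held between passes, so memory is bounded by the bucket-table size rather than by the number of distinct terms.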
[023] FIG. 2 is a flow chart setting forth the general stages involved in a method 200 consistent with an embodiment of the invention for recognizing text. Method 200 may be implemented using a computing device 600 as described in more detail below with respect to FIG. 6. Ways to implement the stages of method 200 will be described in greater detail below. Method 200 may begin at starting block 205 and proceed to stage 210 where computing device 600 may generate a plurality of generated terms used in text string 105. For example, first pass 110 may analyze text string 105 comprising "I went to the market but the market was closed." This analysis may generate a plurality of generated terms 305 comprising "I", "went", "to", "the", "market", "but", "the", "market", "was", and "closed" as shown in FIG. 3.
[024] Consistent with embodiments of the invention, prefix / suffix terms may be generated, for example, for data compression purposes. For example, the term "abra" could generate prefix / suffix terms such as: "abra", "bra", "ra", "abr", etc. However, if finding frequent keywords in text string 105 is the purpose, there may be no need to generate prefixes / suffixes; rather, individual strings from text string 105 may be generated.
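Prefix / suffix generation can be sketched as follows (an illustrative Python sketch; the minimum length of 2 is an assumption made for the example):

```python
def prefix_suffix_terms(term, min_len=2):
    # All prefixes and suffixes of the term, down to min_len characters.
    prefixes = {term[:i] for i in range(min_len, len(term) + 1)}
    suffixes = {term[-i:] for i in range(min_len, len(term) + 1)}
    return prefixes | suffixes
```

For "abra", this yields "abra", "abr", "ab", "bra", and "ra", matching the example above.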
[025] From stage 210, where computing device 600 generates plurality of generated terms 305 used in text string 105, method 200 may advance to stage 220 where computing
device 600 may calculate a plurality of hash values from plurality of generated terms 305. A hash string algorithm may be used to generate hashes as shown below.
private static int GetHash(string input)
{
    int hash = 0;
    for (int i = 0; i < input.Length; i++)
    {
        // Shift-add hash: equivalent to hash * 33 + input[i].
        hash = (hash << 5) + hash + input[i];
    }
    return hash;
}
[026] A perfect hash function may be used to avoid "collisions," where two different terms may result in the same hash value. A perfect hash function of a set S is a hash function that may map different keys (elements) in S to different numbers. A perfect hash function with values in a range whose size is some constant times the number of elements in S can be used for efficient lookup operations by placing the keys in a hash table according to the values of the perfect hash function.
[027] A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The minimal size of the description of a perfect hash function may depend on the range of its function values: the smaller the range, the more space is required. Any perfect hash function suitable for use with a hash table may require at least a number of bits that is proportional to the size of S. Many implementations may require a number of bits that is proportional to n log(n), where n is the size of S. This may mean that the space for storing the perfect hash function can be comparable to the space for storing the set itself.
[028] Using a perfect hash function may be best in situations where there is a large set that is not updated frequently, and many lookups are performed into it. Efficient solutions to performing updates are known as dynamic perfect hashing, but these methods are relatively complicated to implement. A simple alternative to perfect hashing, which also allows dynamic updates, may be "cuckoo hashing."
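Cuckoo hashing can be sketched as follows (an illustrative Python sketch, not the spec's implementation; deriving the two slot indices from one built-in hash value is an assumption made to keep the example short, where real implementations use two independent hash functions):

```python
class CuckooHash:
    # Two tables, two hash functions; a lookup probes at most two slots,
    # so membership tests run in worst-case constant time.

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slots(self, key):
        # Derive both slot indices from one hash value (illustrative only).
        h = hash(key)
        return [h % self.size, (h // self.size) % self.size]

    def lookup(self, key):
        return any(self.tables[i][s] == key
                   for i, s in enumerate(self._slots(key)))

    def insert(self, key, table=0, depth=0):
        if self.lookup(key):
            return
        if depth > self.size:
            self._rehash()            # eviction cycle detected: grow and rebuild
            self.insert(key)
            return
        slot = self._slots(key)[table]
        if self.tables[table][slot] is None:
            self.tables[table][slot] = key
        else:
            # Evict the occupant and reinsert it into the other table.
            evicted, self.tables[table][slot] = self.tables[table][slot], key
            self.insert(evicted, 1 - table, depth + 1)

    def _rehash(self):
        keys = [k for t in self.tables for k in t if k is not None]
        self.__init__(self.size * 2)
        for k in keys:
            self.insert(k)
```

Unlike static perfect hashing, inserts remain possible at any time; an insertion that runs into an eviction cycle triggers a rebuild with larger tables.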
[029] Moreover, a minimal perfect hash may be used. A minimal perfect hash function may be a perfect hash function that maps n keys to n consecutive integers, usually [0..n-1] or [1..n]. A more formal way of expressing this is: Let j and k be elements of some set K. F is a minimal perfect hash function iff F(j) = F(k) implies j = k and there exists an integer a such that the range of F is a..a+|K|-1. A minimal perfect hash function F may be order-preserving if, for any keys j and k, j < k implies F(j) < F(k).