"System And Method For Incorporating Anchor Text Into Ranking Search Results"

Abstract: Search results of a search query on a network are ranked according to a scoring function that incorporates anchor text as a term. The scoring function is adjusted so that the ranking of a target document reflects the use of terms in the anchor text that references it. Initially, the properties associated with the anchor text are collected during a crawl of the network. A separate index is generated that includes an inverted list of the documents and the terms in the anchor text. The index is then consulted in response to a query to calculate a document's score. The score is then used to rank the documents and produce the query results.

Patent Information

Application #:
Filing Date: 08 June 2005
Publication Number: 35/2007
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email:
Parent Application:

Applicants

1. MICROSOFT CORPORATION, ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

Inventors

1. DMITRIY MEYERZON, ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA
2. HUGO ZARAGOZA, ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA
3. MICHAEL J. TAYLOR, ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA
4. STEPHEN EDWARD ROBERTSON, ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

Claims

1. A computer-implemented method for presenting a ranking of search results, comprising: providing an index to a plurality of documents including: a main index associating with each of the documents a frequency of one or more terms being included in each of the documents; an anchor text index associating with each of the documents an anchor text frequency of the one or more terms being included in anchor text in a source document referencing each of the documents; receiving a query including at least one query term; applying the query to the index to yield results of the query identifying one or more of the documents that include the at least one query term; applying a scoring function to generate a score for each of the one or more documents included in the results of the query, wherein the scoring function (score) includes one of:

\[ \mathit{score} = \sum \frac{(\mathit{wtf} + \mathit{wtfAnchor})(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) + (\mathit{wtf} + \mathit{wtfAnchor})} \times \log\left(\frac{N}{n}\right) \]

and

\[ \mathit{score} = \sum \frac{\left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)} \times \log\left(\frac{N}{n}\right) \]

where: wtf is a weighted term frequency applying a weight to a frequency with which a given query term is included in the document; wtfAnchor is a weighted term frequency applying a weight to a frequency with which the given query term is included in anchor text referencing the document; k1 is a constant; b is a constant; wdl is a weighted document length applying a weight to a length of the document being scored; avwdl is an average weighted document length of all documents being scored; N is the number of documents on the network; and n is the number of documents including at least one appearance of a given query term; and generating an output of the ranked results of the query to be displayed to a user.

2. The computer-implemented method of claim 1, further comprising building the index by processing each of the plurality of documents to determine the frequency of the one or more terms included in the document.

3. The computer-implemented method of claim 2, further comprising building the index by processing each of the plurality of documents to identify one or more anchor text entries each referencing another document.

4. The computer-implemented method of claim 3, further comprising generating an anchor text table, wherein an entry is made for each of the documents including an anchor text entry, including one or more of: a source identifier indicating the document including the anchor text entry; a target identifier indicating a target document the anchor text entry references; and one or more terms included in content of the anchor text entry.

5. The computer-implemented method of claim 4, further comprising generating the index by collecting for each of the documents: the frequency of the one or more terms included in the document; and for each of the anchor table entries in which the document is listed as the target document of the target identifier, a frequency of the terms listed in the content of the anchor text entry.

6. The computer-implemented method of claim 1, further comprising ranking the documents according to a scoring function (score) that is determined based on terms including: a weighted anchor term frequency (wtfAnchor); and an anchor text length normalization component (BAnchor) derived from: a weighted document length (wdl); and an average weighted document length (avwdl).

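7. The computer-implemented method of claim 1, wherein when the document is not associated with anchor text data, the scoring function (score) includes:

\[ \mathit{score} = \sum \frac{\mathit{wtf}\,(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) + \mathit{wtf}} \times \log\left(\frac{N}{n}\right) \]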

8. The computer-implemented method of claim 6, wherein a strength of the length normalization provided by BAnchor is adjusted by choosing a different constant value associated with BAnchor.

9. A computer-readable storage medium storing instructions executable on a computing system, comprising instructions to: evaluate contents of each of a plurality of documents on a network, including: recording a frequency of terms included within the document; making an entry in an anchor text table for each anchor text entry referencing another document; compile an index, including: generating a main index that associates with each of the documents a frequency with which the at least one term is included in the document; generating an anchor text index that associates with each of the documents a frequency of terms listed in anchor text entries in the anchor text table referencing the document; receive a query including at least one query term; apply the query to the index to yield results of the query identifying one or more of the documents that include the at least one query term; apply a scoring function to generate a score for each of the one or more documents included in the results of the query, wherein the scoring function (score) includes one of:

\[ \mathit{score} = \sum \frac{(\mathit{wtf} + \mathit{wtfAnchor})(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) + (\mathit{wtf} + \mathit{wtfAnchor})} \times \log\left(\frac{N}{n}\right) \]

and

\[ \mathit{score} = \sum \frac{\left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)} \times \log\left(\frac{N}{n}\right) \]

where: wtf is a weighted term frequency applying a weight to a frequency with which a given query term is included in the document; wtfAnchor is a weighted term frequency applying a weight to a frequency with which the given query term is included in anchor text referencing the document; k1 is a constant; b is a constant; wdl is a weighted document length applying a weight to a length of the document being scored; avwdl is an average weighted document length of all documents being scored; N is the number of documents on the network; and n is the number of documents including at least one appearance of a given query term; and generate an output of the ranked results of the query to be displayed to a user.

10. The computer-readable storage medium of claim 9, wherein making the entry in the anchor text table for each of the documents includes: storing a source identifier indicating the document including the anchor text entry; storing a target identifier indicating a target document the anchor text entry references; and storing one or more terms included in content of the anchor text entry.

11. The computer-readable storage medium of claim 9, wherein when the document is not associated with anchor text data, the scoring function (score) includes:

\[ \mathit{score} = \sum \frac{\mathit{wtf}\,(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) + \mathit{wtf}} \times \log\left(\frac{N}{n}\right) \]

12. The computer-readable storage medium of claim 9, wherein a strength of the length normalization provided by BAnchor is adjusted by choosing a different constant value associated with BAnchor.

13. The computer-readable storage medium of claim 9, further comprising causing an output of the ranked results of the query to be presented to a user.

14. A search engine system, comprising: an index for a plurality of documents, including: a main index associating with each of the documents a frequency of one or more terms being included in each of the documents; an anchor text index associating with each of the documents an anchor text frequency of the one or more terms being included in anchor text in a source document referencing each of the documents; a ranking system, including: a query interface configured to receive a query including at least one query term and apply the query to the index to identify one or more of the documents that include the at least one query term; a scoring function to generate a score for each of the one or more documents included in the results of the query, wherein the scoring function (score) includes:

\[ \mathit{score} = \sum \frac{\left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{\mathit{wtf}}{B} + \frac{\mathit{wtfAnchor}}{\mathit{BAnchor}}\right)} \times \log\left(\frac{N}{n}\right) \]

where: wtf is a weighted term frequency applying a weight to a frequency with which a given query term is included in the document; wtfAnchor is a weighted term frequency applying a weight to a frequency with which the given query term is included in anchor text referencing the document; k1 is a constant; wdl is a weighted document length applying a weight to a length of the document being scored; avwdl is an average weighted document length of all documents being scored; B is a document length normalization component defined as \( B = \left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) \) where b is a constant; BAnchor is an anchor text normalization component defined as \( \mathit{BAnchor} = \left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{avwdl}}\right) \) where b is a constant; N is the number of documents on the network; and n is the number of documents including at least one appearance of a given query term; and a ranking system configured to rank the results of the query based on the score generated for each of the documents included in the results of the query.

15. The system of claim 14, further comprising a crawler configured to build the main index by processing each of the plurality of documents to determine the frequency of the one or more terms included in the document.

16. The system of claim 15, wherein the crawler is further configured to build the anchor text index by processing each of the plurality of documents to identify one or more anchor text entries each referencing another document.

17. The system of claim 16, wherein the crawler is further configured to generate an anchor text table, wherein an entry is made for each of the documents including an anchor text entry, including one or more of: a source identifier indicating the document including the anchor text entry; a target identifier indicating a target document the anchor text entry references; and one or more terms included in content of the anchor text entry.

18. The system of claim 17, wherein the crawler is further configured to generate the index by collecting for each of the documents: the frequency of the one or more terms included in the document; and for each of the anchor table entries in which the document is listed as the target document of the target identifier, a frequency of the terms listed in the content of the anchor text entry.

Specification

M&GNo. 50037.293USOI
SYSTEM AND METHOD FOR INCORPORATING ANCHOR TEXT INTO
RANKING SEARCH RESULTS
Cross-Reference to Related Applications
The present invention is related to a patent application having serial
number 10/804,326, entitled "Field Weighting in Text Document Searching", filed on
March 18, 2004. The related application is assigned to the assignee of the present
patent application and is hereby incorporated by reference.
Background of the Invention
In a text document search, a user typically enters a query into a search
engine. The search engine evaluates the query against a database of indexed documents
and returns a ranked list of documents that best satisfy the query. A score, representing
a measure of how well the document satisfies the query, is algorithmically generated by
the search engine. Commonly-used scoring algorithms rely on splitting the query up
into search terms and using statistical information about the occurrence of individual
terms in the body of text documents to be searched. The documents are listed in rank
order according to their corresponding scores so the user can see the best matching
search results at the top of the search results list.
Many such scoring algorithms assume that each document is a single,
undifferentiated string of text. The query of search terms is applied to the text string (or
more accurately, to the statistics generated from the undifferentiated text string that
represents each document). However, documents often have some internal structure
(e.g., fields containing titles, section headings, meta data fields, etc.), and reducing such
documents to an undifferentiated text string loses any searching benefit provided by
such structural information.
Some existing approaches attempt to incorporate the internal structure of
documents into a search by generating statistics for individual document fields and
generating scores for individual fields. The score for an individual document is then
computed as a weighted sum of scores for its fields. Some existing approaches attempt
to incorporate the internal structure of the document, but do not attempt to take into
consideration text about that document contained in other documents.
Summary of the Invention
Embodiments of the present invention are related to a system and
method for ranking search results using a scoring function that incorporates an anchor
text component. Anchor text consists of a URL (Uniform Resource Locator) pointing
to another document and an accompanying textual description. This text is directly
relevant to the target document, and is used in the present invention to provide a
measure of the relevance of the target document. For example, document A has some
anchor text pointing to document B. If the anchor text contains a word that is not in
document B, queries containing this word will not return the linked document without
the additional functionality provided by the present invention. Only document A
would be returned, but not document B. Since the description in document A is used to
describe the linked document B, this text is highly likely to be a precise
summary/description of the linked document. The present invention corrects for this
deficiency by incorporating the anchor text into the ranking of the target document.
In one aspect of the present invention, the network is first "crawled" to
generate a table of properties associated with the links and pages of the network.
"Crawling" refers to automatically collecting several documents (or any analogous
discrete unit of information) into a database referred to as an index. Crawling traverses
multiple documents on the network by following document reference links within
certain documents, and then processing each document as found. The documents are
processed by identifying key words or general text in the documents to create the index.
The index of the present invention includes a separate anchor text index partition.
The text that the present invention indexes is not limited to just the anchor text that
accompanies URLs. Anchor text can also include text with references to any other
objects. For example, people, categories, directories, etc. may also be indexed.
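The crawl described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory "network" (the `NETWORK` dictionary and the `crawl` function are names invented for this illustration, not part of the specification) and simply follows links breadth-first while recording term frequencies for each document it reaches.

```python
from collections import Counter, deque

# Hypothetical in-memory "network": URL -> (body text, outgoing link URLs).
NETWORK = {
    "a.html": ("search engines rank documents", ["b.html", "c.html"]),
    "b.html": ("ranking functions score documents", ["c.html"]),
    "c.html": ("anchor text describes a target", []),
    "d.html": ("unreachable page", []),
}


def crawl(start_url, network):
    """Follow links breadth-first from start_url and index each document once."""
    index = {}                      # URL -> Counter of term frequencies
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        body, links = network[url]
        # "Processing" a document here just means counting its terms.
        index[url] = Counter(body.lower().split())
        for target in links:
            if target in network and target not in seen:
                seen.add(target)
                queue.append(target)
    return index


index = crawl("a.html", NETWORK)
```

Only documents reachable through links from the starting document end up in the index, so `d.html` is never processed in this example.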
In another aspect of the present invention, once the anchor text is
indexed and associated with the appropriate target document, the anchor text is also used
for boosting document ranking. The term frequencies of terms that exist in both the
content and the anchor text are combined, so that the total occurrence of a term in the
document is boosted. The length of the target document is also lengthened by the
anchor text from the source documents that point to that particular target document.
Both of these factors are used in a scoring function that determines the document's
relevance measure.
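As a rough illustration of how the combined term frequency enters the score, the sketch below follows the first scoring function recited in claim 1. The function name `score_with_anchor`, the unit field weights, and the constants k1 = 1.2 and b = 0.75 are assumptions chosen for the example; the specification does not fix these values.

```python
import math


def score_with_anchor(query_terms, body_tf, anchor_tf, wdl, avwdl, N, df,
                      k1=1.2, b=0.75, w_body=1.0, w_anchor=1.0):
    """Sum, over query terms, the claim 1 score that adds wtfAnchor to wtf.

    body_tf / anchor_tf: term -> raw frequency in the body / in anchor text.
    wdl, avwdl: weighted document length and its average over all documents.
    N: number of documents; df: term -> number of documents containing it.
    k1, b and the field weights are illustrative assumptions.
    """
    total = 0.0
    for term in query_terms:
        n = df.get(term, 0)
        if n == 0:
            continue  # term appears nowhere, contributes nothing
        wtf = w_body * body_tf.get(term, 0)
        wtf_anchor = w_anchor * anchor_tf.get(term, 0)
        tf = wtf + wtf_anchor
        denominator = k1 * ((1 - b) + b * wdl / avwdl) + tf
        total += tf * (k1 + 1) / denominator * math.log(N / n)
    return total


# Document B never contains "tutorial", but anchor text pointing to it does.
with_anchor = score_with_anchor(["tutorial"], {}, {"tutorial": 1},
                                wdl=100, avwdl=100, N=10, df={"tutorial": 2})
without_anchor = score_with_anchor(["tutorial"], {}, {},
                                   wdl=100, avwdl=100, N=10, df={"tutorial": 2})
```

In this example the document receives a positive score only when the anchor term frequency is supplied, which is the behavior the summary describes for document B.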
Brief Description of the Drawings
FIGURE 1 illustrates an exemplary computing device that may be used
in one exemplary embodiment of the present invention.
FIGURE 2 illustrates a functional block diagram of an exemplary system
for scoping searches using index keys in accordance with the present invention.
FIGURE 3 illustrates a functional block diagram for an exemplary
structure of an index in accordance with the present invention.
FIGURE 4 illustrates an exemplary network graph in accordance with
the present invention.
FIGURE 5 illustrates a logical flow diagram of an exemplary process for
handling anchor text to include the anchor text in document ranking in accordance with
the present invention.
FIGURE 6 illustrates a logical flow diagram of an exemplary process for
incorporating anchor text in ranking search results in accordance with the present
invention.
Detailed Description
The present invention now will be described more fully hereinafter with
reference to the accompanying drawings, which form a part hereof, and which show, by
way of illustration, specific exemplary embodiments for practicing the invention. This
invention may, however, be embodied in many different forms and should not be
construed as limited to the embodiments set forth herein; rather, these embodiments are
provided so that this disclosure will be thorough and complete, and will fully convey
the scope of the invention to those skilled in the art. Among other things, the present
invention may be embodied as methods or devices. Accordingly, the present invention
may take the form of an entirely hardware embodiment, an entirely software
embodiment or an embodiment combining software and hardware aspects. The
following detailed description is, therefore, not to be taken in a limiting sense.
Illustrative Operating Environment
With reference to FIGURE 1, one exemplary system for implementing
the invention includes a computing device, such as computing device 100. Computing
device 100 may be configured as a client, a server, mobile device, or any other
computing device. In a very basic configuration, computing device 100 typically
includes at least one processing unit 102 and system memory 104. Depending on the
exact configuration and type of computing device, system memory 104 may be volatile
(such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination
of the two. System memory 104 typically includes an operating system 105, one or
more applications 106, and may include program data 107. In one embodiment,
application 106 includes a search ranking application 120 for implementing the
functionality of the present invention. This basic configuration is illustrated in
FIGURE 1 by those components within dashed line 108.
Computing device 100 may have additional features or functionality.
For example, computing device 100 may also include additional data storage devices
(removable and/or non-removable) such as, for example, magnetic disks, optical disks,
or tape. Such additional storage is illustrated in FIGURE 1 by removable storage 109
and non-removable storage 110. Computer storage media may include volatile and
nonvolatile, removable and non-removable media implemented in any method or
technology for storage of information, such as computer readable instructions, data
structures, program modules, or other data. System memory 104, removable
storage 109 and non-removable storage 110 are all examples of computer storage
media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM,
flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or
other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other
magnetic storage devices, or any other medium which can be used to store the desired
information and which can be accessed by computing device 100. Any such computer
storage media may be part of device 100. Computing device 100 may also have input
device(s) 112 such as keyboard, mouse, pen, voice input device, touch input device, etc.
Output device(s) 114 such as a display, speakers, printer, etc. may also be included.
Computing device 100 also contains communication connections 116
that allow the device to communicate with other computing devices 118, such as over a
network. Communication connections 116 are one example of communication media.
Communication media may typically be embodied by computer readable instructions,
data structures, program modules, or other data in a modulated data signal, such as a
carrier wave or other transport mechanism, and includes any information delivery
media. The term "modulated data signal" means a signal that has one or more of its
characteristics set or changed in such a manner as to encode information in the signal.
By way of example, and not limitation, communication media includes wired media
such as a wired network or direct-wired connection, and wireless media such as
acoustic, RF, infrared and other wireless media. The term computer readable media as
used herein includes both storage media and communication media.
Illustrative Embodiments for Incorporating Anchor Text into Search Ranking
Embodiments of the present invention are related to a ranking function
for a search engine. The quality of a search engine is typically determined by the
relevance of the documents according to the ranks assigned by the ranking function.
Anchor text is defined as text within the anchor tag of HTML (Example Web). Often, anchor text contains a short, high-
quality description of the target URL (Uniform Resource Locator), and it is beneficial
for the ranking function to incorporate the content of the anchor tags that point to the given
document into the ranking function for that document.
FIGURE 2 illustrates a functional block diagram of an exemplary system
for scoping searches using index keys in accordance with the present invention. System
200 includes index 210, pipeline 220, document interface 230, client interface 240,
anchor text plugin 250, indexing plugin 260, and anchor text table 270.
Index 210 is structured to include separate index partitions that include
a main partition and another partition for the anchor text. A more detailed description
of the structure of index 210 is provided below in the discussion of FIGURE 3. The
records of these indexes are used in providing results to client queries. In one
embodiment, index 210 corresponds to multiple databases that collectively provide the
storage for the index records.
Pipeline 220 is an illustrative representation of the gathering mechanism
for obtaining the documents or records of the documents for indexing. Pipeline 220
allows for filtering of data by various plugins (e.g., anchor text plugin 250) before the
records corresponding to the data are entered into index 210.
Document interface 230 provides the protocols, network access points,
and database access points for retrieving documents across multiple databases and
network locations. For example, document interface 230 may provide access to the
Internet while also providing access to a database of a local server and access to a
database on the current computing device. Other embodiments may access other
document locations using a variety of protocols without departing from the spirit or
scope of the invention.
Client Interface 240 provides access by a client to define and initiate a
search. The search may be defined according to keywords and/or scope keys. An
exemplary method for processing search queries is described in greater detail in the
discussion of FIGURE 7 below.
Anchor text plugin 250 is one of several gatherer pipeline plugins.
Anchor text plugin 250 identifies the anchor text and its related properties that are
included in a document. The anchor properties are gathered by anchor text plugin 250
as the documents provided through document interface 230 are crawled. In one
embodiment, the functionality of anchor text plugin 250 is actually included in a
properties plugin rather than being provided as a separate plugin. The properties plugin
identifies all the fields of a document and their associated properties including the
anchor properties. In one embodiment, since anchor text is associated with a target
document, associating the target document with the anchor text is deferred until the
crawl is complete. For example, when document A is indexed, and document A has
anchor text that points to document B, the anchor text is applied to document B. But
since document A is being indexed at the moment, this process is deferred. Also, there
may be multiple anchors to be applied to document B, all of which must be
discovered before document B is indexed correctly. Deferring the indexing of the target
documents until after the crawl is complete better ensures the correctness of the
indexed results.
Indexing plugin 260 is another plugin connected to pipeline 220.
Indexing plugin 260 provides the mechanism for generating, partitioning, and updating
index 210. In one embodiment, indexing plugin 260 provides the word lists that
temporarily cache the keywords and anchor text keys generated from crawled
documents before flushing these results to index 210. The records of index 210 are
populated from the crawl results included in these word lists.
Anchor text table 270 includes the anchor properties that have been
gathered by anchor text plugin 250. For each instance of anchor text in a document, anchor
text table 270 includes a record of the anchor text that includes the properties associated
with the anchor text. For example, a record in anchor text table 270 may include a
target ID that identifies the target document of the link, a source ID that identifies the
current document, the anchor text itself, and the link in separate fields. In other
embodiments, other fields may be included in anchor text table 270 that are related to
linking between two documents. In one embodiment, the anchor and link properties
gathered from the crawl are used to generate a representation of the network with nodes
corresponding to the documents and branches corresponding to the links (see FIGURE
4). This network graph may then be loaded into memory and used to resolve the target
IDs for the target documents referred to by the anchor text.
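A record of the kind described for anchor text table 270 might be gathered as in the following sketch. The `AnchorRecord` and `AnchorCollector` names, the field names, and the sample URL are assumptions made for illustration; the parser is Python's standard `html.parser.HTMLParser`, and the target ID is left unresolved because, as described above, it is only filled in later.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class AnchorRecord:
    source_id: int             # document that contains the anchor tag
    target_id: Optional[int]   # resolved later, once the crawl is complete
    anchor_text: str           # text between the opening and closing anchor tag
    link: str                  # URL taken from the href attribute


class AnchorCollector(HTMLParser):
    """Collect one AnchorRecord per <a href=...> element in a document."""

    def __init__(self, source_id):
        super().__init__()
        self.source_id = source_id
        self.records = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.records.append(
                AnchorRecord(self.source_id, None, text, self._href))
            self._href = None


collector = AnchorCollector(source_id=1)
collector.feed('<p>See <a href="http://example.com/b">Example  Web</a> for more.</p>')
anchor_table = collector.records
```

Keeping the link and the target ID in separate fields mirrors the record layout described above and lets the target ID stay empty until the whole network is known.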
Despite the illustration in system 200 of one-way and two-way
communications between functional blocks, any of these communication types may be
changed to another type without departing from the spirit or scope of the invention (e.g.,
all communications may have an acknowledgment message requiring two-way rather
than one-way communication).
FIGURE 3 illustrates a functional block diagram for an exemplary
structure of an index in accordance with the present invention. Index 300 includes main
index 310 and anchor text index 320.
Main index 310 includes records that correspond to the keywords and
other index keys that are returned corresponding to the crawl of documents. Main index
310 also includes the other index partitions related to other properties of the documents.
The records that correspond to anchor text are diverted and entered into anchor text
index 320.
Anchor text index 320 includes records that correspond to the target
documents of the anchor text included in documents on the network. These target
documents are organized as an inverted index with the target document IDs listed in
association with words included in the anchor text or URL associated with the target
document. Anchor text index 320 is generated from the anchor text table after the crawl
is complete. The anchor text corresponding to each target document is concatenated
together in order to evaluate each target document for terms and enter the target
document in anchor text index 320. Including a separate index partition for the anchor
text allows relevance calculations to be made based on the anchor text before incorporating
the anchor text as a factor in the scoring function of a document. Incorporating the
anchor text into the scoring function for ranking documents is described more fully in
the discussion of FIGURE 6 below.
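The construction of the anchor text index from the anchor text table can be sketched as follows. The tuple layout of the table rows and the `build_anchor_index` name are assumptions for this example: anchor text is concatenated per target document, then each term is entered in an inverted list keyed by term.

```python
from collections import Counter, defaultdict

# Hypothetical anchor text table rows: (source ID, target ID, anchor text).
anchor_table = [
    (1, 2, "search ranking paper"),
    (3, 2, "ranking functions"),
    (1, 3, "crawler notes"),
]


def build_anchor_index(table):
    """Return an inverted index: term -> {target document ID: term frequency}."""
    # Step 1: concatenate all anchor text that points at the same target.
    combined = defaultdict(list)
    for _source_id, target_id, text in table:
        combined[target_id].append(text)

    # Step 2: count terms per target and enter the target under each term.
    index = defaultdict(dict)
    for target_id, texts in combined.items():
        term_counts = Counter(" ".join(texts).lower().split())
        for term, freq in term_counts.items():
            index[term][target_id] = freq
    return dict(index)


anchor_index = build_anchor_index(anchor_table)
```

Because the index is built only after every row of the table exists, a target such as document 2 sees the anchor text from all of its source documents at once.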
FIGURE 4 illustrates an exemplary network graph in accordance with
the present invention. The network graph is comprised of nodes (e.g., 410) and edges
or links (e.g., 420). The nodes (e.g., 410) represent the pages and other resources that
are on the network that may be returned as results to a search query. The links (e.g.,
420) connect each one of these pages together through the use of navigation links listed
on the pages. A set of link information may be gathered for each page that can be used
in determining properties related to the anchor text for a particular page.
In one embodiment, node 430 is the current document that includes an
anchor tag for the target document that corresponds to node 440. For example, the
anchor tag may correspond to an anchor tag of HTML (Sample
Web). The ID of the current document is also known, usually being included in
the HTML of the document. In order to populate the anchor text table (see FIGURE 2),
the target document ID associated with the anchor text still needs to be resolved.
Network graph 400 assists in resolving the target document ID by providing a
representation of the network that may be walked to resolve unknown properties.
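One simple way to picture this resolution step is sketched below. It assumes, for illustration only, that the graph's nodes are available as a mapping from document ID to URL and that anchor records are plain dictionaries; the `resolve_targets` name and the sample URLs are hypothetical, and a real graph walk may be considerably more involved.

```python
# Hypothetical graph nodes: document ID -> URL of that document.
nodes = {
    1: "http://example.com/a",
    2: "http://example.com/b",
}

# Anchor records gathered during the crawl; target IDs are not yet known.
records = [
    {"source_id": 1, "target_id": None,
     "anchor_text": "Sample Web", "link": "http://example.com/b"},
    {"source_id": 1, "target_id": None,
     "anchor_text": "Elsewhere", "link": "http://example.org/unknown"},
]


def resolve_targets(records, nodes):
    """Fill in target_id for each record whose link matches a known node."""
    url_to_id = {url: doc_id for doc_id, url in nodes.items()}
    resolved = []
    for record in records:
        # Links to documents outside the graph stay unresolved (None).
        resolved.append({**record, "target_id": url_to_id.get(record["link"])})
    return resolved


resolved = resolve_targets(records, nodes)
```

Records whose link does not correspond to any node keep a target ID of `None`, so they can be skipped when the anchor text index is built.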
FIGURE 5 illustrates a logical flow diagram of an exemplary process for
handling anchor text to include the anchor text in document ranking in accordance with
the present invention. Process 500 starts at block 502 where access is provided to a
corpus of documents. Processing continues at block 504.
At block 504, the corpus of documents is crawled to determine the
documents that exist as well as properties (e.g., file type) that are associated with those
documents. An identifier or ID for each of the documents and their associated
properties are then forwarded as results of the crawl. Processing continues at block 506.
At block 506, the properties associated with the documents that relate to
anchor text are obtained by an anchor text plugin. The anchor text properties may include
an identifier of the source document, an identifier of the target document, the anchor
text itself, and the URL of the link. Once these anchor properties are gathered,
processing moves to block 508.
At block 508, the anchor text table is generated. The anchor text table
includes the anchor text properties associated with each instance of anchor text. The
properties of each instance of anchor text are stored as records in the table. Once the
table is created, processing continues at block 510.
At block 510, an index is generated that includes a main index and an
anchor text index. In one embodiment, the index is generated after the anchor text table is
built. The anchor text table includes an inverted list of documents associated with
anchor text keys. The anchor text keys correspond to the anchor text, in that they are
keywords contained in the anchor text or URL of the target document of an anchor tag.
Accordingly, the documents in the inverted list are the target documents of the anchor text
keys. Once the index is instantiated, processing continues at block 512.
At block 512, the main index and anchor text index are consulted along
with the anchor text table to incorporate relevance values based on the anchor text into a
scoring function. The scoring function determines a relative score for a document. The
documents can then be ranked according to their scores. A more detailed description of
incorporating anchor text into ranking the documents is provided in the discussion of
FIGURE 6 below. Once anchor text is incorporated into the ranking, processing
advances to block 514 where process 500 ends.
After process 500 is complete, the ranked documents may be returned to
the user by the various operations associated with the transmission and display of
results by a search engine. The documents corresponding to the higher precision results
may then be selected and viewed at the user's discretion.
FIGURE 6 illustrates a logical flow diagram of an exemplary process for
incorporating anchor text in ranking search results in accordance with the present
invention. Process 600 starts at block 602 when process 500 of FIGURE 5 enters block
512 and a query has been made by a client. Processing continues at decision block 604.
At decision block 604, a determination is made whether the document
for which the current score is being calculated is included in the anchor text index for
the word being queried. If the document is not listed in the anchor text index,
processing moves to block 608. However, if the document is listed in the anchor text
index, processing continues at block 606.
At block 606, a scoring function for determining a relevance score of a
document is adjusted to incorporate consideration and weighting of the anchor text. In
one embodiment, the scoring function corresponds to the field weighted scoring
function described in patent application serial number 10/804,326, entitled "Field
Weighting in Text Document Searching", filed on March 18, 2004 and hereby
incorporated by reference. As provided by the 10/804,326 patent application, the
following is a representation of the field weighted scoring function:
$$\mathrm{score} = \sum \frac{wtf\,(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + wtf} \times \log\left(\frac{N}{n}\right) \tag{1}$$
Wherein the terms are defined as follows: wtf is the weighted term
frequency or sum of term frequencies of a given term multiplied by weights across all
properties; wdl is the weighted document length; avwdl is the average weighted
document length; N is the number of documents on the network; n is the number of
documents containing the given query term, summed across all query terms; and k1 and
b are constants. These terms and the equation above are described in detail in the
10/804,326 patent application.
As a basic explanation, the weighted term frequency (wtf) corresponds to
the term frequency in the document weighted over the different fields in the document.
The weighted document length over the average weighted document length provides a
measure of how close the current document's length is to the average document length
and is a normalization term in the scoring function. The log of the number of
documents in the network (N) over the number of documents containing the given
query term (n) provides a measure of the document frequency. These quantities are
discoverable and retrieved from the content index.
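As a rough illustration, scoring function (1) can be evaluated as shown below. This is a sketch, not the implementation of the referenced application: the function name, the argument layout, and the assumption that the sum runs over the query terms are illustrative, and the constants in the example call are arbitrary.

```python
import math


def field_weighted_score(query_terms, k1, b, wdl, avwdl, N):
    """Evaluate scoring function (1) for one document.

    query_terms is assumed to be a list of (wtf, n) pairs, one per query
    term: wtf is the weighted term frequency of the term in the document,
    and n is the number of documents containing the term.
    """
    # Length normalization: equals 1 when the document has average length.
    norm = (1 - b) + b * wdl / avwdl
    score = 0.0
    for wtf, n in query_terms:
        score += wtf * (k1 + 1) / (k1 * norm + wtf) * math.log(N / n)
    return score


# A document of exactly average length with a single matching query term.
example = field_weighted_score([(1.0, 10)], k1=1.0, b=0.75,
                               wdl=100.0, avwdl=100.0, N=100)
```

Because wtf appears in both numerator and denominator, the contribution of each term saturates as wtf grows, with k1 controlling how quickly.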
In one embodiment, the scoring function is adjusted to incorporate
anchor text by including an additional weighted term frequency value (wtfAnchor) that
corresponds to the frequency of the term in the anchor text, such that the new scoring
function becomes:
$$\mathrm{score} = \sum \frac{(wtf + wtf_{Anchor})(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + (wtf + wtf_{Anchor})} \times \log\left(\frac{N}{n}\right) \tag{2}$$
Accordingly, the term frequency component of the scoring function is
updated with the frequency of the term in the anchor text. However, the other terms of
the scoring function remain unaffected. The query can obtain the term frequencies for
scoring function (2) by simply consulting the main index and the anchor text index
separately.
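The separate lookups described for scoring function (2) might look like the following sketch. The dictionary-based index layout (term -> {doc_id: weighted frequency}), the `doc_freq` table, and all names are assumptions made for illustration only.

```python
import math


def score_with_anchor_tf(doc_id, terms, main_index, anchor_index, doc_freq,
                         k1, b, wdl, avwdl, N):
    """Evaluate scoring function (2) for one document.

    main_index and anchor_index are assumed to map
    term -> {doc_id: weighted term frequency}; doc_freq maps term -> n.
    """
    norm = (1 - b) + b * wdl / avwdl
    score = 0.0
    for term in terms:
        # The two indexes are consulted independently of each other.
        wtf = main_index.get(term, {}).get(doc_id, 0.0)
        wtf_anchor = anchor_index.get(term, {}).get(doc_id, 0.0)
        tf = wtf + wtf_anchor
        if tf == 0.0:
            continue  # the term contributes nothing for this document
        score += tf * (k1 + 1) / (k1 * norm + tf) * math.log(N / doc_freq[term])
    return score


main_index = {"ranking": {2: 1.0}}
anchor_index = {"ranking": {2: 2.0}}
doc_freq = {"ranking": 10}
with_anchor = score_with_anchor_tf(2, ["ranking"], main_index, anchor_index,
                                   doc_freq, k1=1.0, b=0.5,
                                   wdl=100.0, avwdl=100.0, N=100)
without_anchor = score_with_anchor_tf(2, ["ranking"], main_index, {},
                                      doc_freq, k1=1.0, b=0.5,
                                      wdl=100.0, avwdl=100.0, N=100)
```

With an empty anchor text index the function reduces to scoring function (1), which mirrors the statement that the other terms of the scoring function remain unaffected.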
In another embodiment, the document length normalization is adjusted to
account for the anchor text by adjusting the scoring function to apply the length
normalization to the weighted term frequency of each field of the document before
adding the weighted term frequencies together. To incorporate the anchor text into the
document length normalization, a new term (B) is defined as:
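$$B = \left((1 - b) + b\,\frac{wdl}{avwdl}\right) \tag{3}$$
Equation (1) may then be rearranged according to the new term to
produce the following:
$$\mathrm{score} = \sum \frac{\frac{wtf}{B}(k_1 + 1)}{k_1 + \frac{wtf}{B}} \times \log\left(\frac{N}{n}\right) \tag{4}$$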
The weighted term frequency associated with the anchor text (wtfAnchor)
may then be added into the equation along with a new BAnchor term that corresponds to
the length normalization associated with the anchor text such that equation (4) becomes:
$$\mathrm{score} = \sum \frac{\left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)} \times \log\left(\frac{N}{n}\right) \tag{5}$$
Accordingly, in one embodiment, BAnchor differs from B by taking the
wdlAnchor and avwdlAnchor components of BAnchor from the anchor text field. In another
embodiment, the strength of the length normalization is adjusted by also choosing a
different bAnchor for the anchor text field. Once the scoring function is adjusted to
account for the anchor text, processing moves to block 608.
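Scoring function (5), with a separate length normalization per field, can be sketched as follows. The helper names, the tuple layout of `query_terms`, and the numeric values (including the choice of a distinct b for the anchor field) are hypothetical and serve only to show the arithmetic.

```python
import math


def length_norm(b, wdl, avwdl):
    """Length normalization term, as in equation (3)."""
    return (1 - b) + b * wdl / avwdl


def score_per_field_norm(query_terms, k1, B, B_anchor, N):
    """Evaluate scoring function (5) for one document.

    query_terms is assumed to be a list of (wtf, wtf_anchor, n) triples,
    one per query term.
    """
    score = 0.0
    for wtf, wtf_anchor, n in query_terms:
        # Normalize each field's term frequency before adding them together.
        tf = wtf / B + wtf_anchor / B_anchor
        score += tf * (k1 + 1) / (k1 + tf) * math.log(N / n)
    return score


# Body field of average length; anchor text field twice its average length.
B = length_norm(b=0.75, wdl=100.0, avwdl=100.0)
B_anchor = length_norm(b=0.5, wdl=20.0, avwdl=10.0)
result = score_per_field_norm([(1.0, 3.0, 10)], k1=1.0,
                              B=B, B_anchor=B_anchor, N=100)
```

Setting `wtf_anchor` to zero recovers equation (4), which is equation (1) with numerator and denominator divided by B.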
At block 608, the scoring function is populated with the variables for
calculating the score of the current document. As previously stated, the query can
obtain the term frequencies for populating the scoring function by simply consulting the
main index and the anchor text index separately.
At block 610, the scoring function is executed and the relevance score
for the document is calculated. Once the relevance score is calculated, it is stored in
memory and associated with that particular document. Processing then moves to
decision block 612.
At decision block 612, a determination is made whether relevance scores
for all the documents have been calculated according to scoring function (2). The
scores may be calculated serially as shown or in parallel. If all the scores have not been
calculated, processing returns to block 604, where calculating the score for the next document
is initiated. However, if all the scores have been calculated, processing continues to
block 614.
At block 614, the search results of the query are ranked according to
their associated scores. The scores now take into account the anchor text of each of the
documents. Accordingly, the ranking of the documents has been refined so that
documents referred to in anchor text reflect that reference. Once the search results are
ranked, processing proceeds to block 614, where process 600 returns to block 514 of
process 500 in FIGURE 5.
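The overall loop of process 600 (blocks 604 through 614) can be summarized in a short sketch. The function name, the callable-based design, and the index layout are assumptions for illustration; the scores are computed serially here, although the description notes they may also be computed in parallel.

```python
def rank_documents(doc_ids, term, anchor_index, score_plain, score_with_anchor):
    """Score each candidate document for one query term, then rank.

    anchor_index is assumed to map term -> {doc_id: anchor term frequency}.
    score_plain and score_with_anchor are callables (doc_id, term) -> float
    standing in for the unadjusted and anchor-adjusted scoring functions.
    """
    scores = {}
    for doc_id in doc_ids:
        # Decision block 604: is the document in the anchor text index
        # for the queried word?
        if doc_id in anchor_index.get(term, {}):
            # Block 606: use the scoring function adjusted for anchor text.
            scores[doc_id] = score_with_anchor(doc_id, term)
        else:
            scores[doc_id] = score_plain(doc_id, term)
    # Block 614: rank the results by descending score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Placeholder scoring callables with fixed values, for illustration only.
anchor_index = {"ranking": {2: 2.0}}
ranked, scores = rank_documents(
    [1, 2, 3], "ranking", anchor_index,
    score_plain=lambda doc_id, term: 1.0,
    score_with_anchor=lambda doc_id, term: 1.5,
)
```

Passing the scoring functions in as callables keeps the control flow of FIGURE 6 separate from whichever embodiment of the scoring function is in use.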
The above specification, examples and data provide a complete
description of the manufacture and use of the composition of the invention. Since many
embodiments of the invention can be made without departing from the spirit and scope
of the invention, the invention resides in the claims hereinafter appended.

We Claim:
1. A computer-implemented method for presenting a ranking of search results,
comprising:
providing an index to a plurality of documents including:
a main index associating with each of the documents a frequency of one or more
terms being included in each of the documents;
an anchor text index associating with each of the documents an anchor text
frequency of the one or more terms being included in anchor text in a source document
referencing each of the documents;
receiving a query including at least one query term;
applying the query to the index to yield results of the query identifying one or more of the
documents that include the at least one query term;
applying a scoring function to generate a score for each of the one or more documents
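included in the results of the query, wherein the scoring function (score) includes one of:
$$\mathrm{score} = \sum \frac{(wtf + wtf_{Anchor})(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + (wtf + wtf_{Anchor})} \times \log\left(\frac{N}{n}\right),$$ and
$$\mathrm{score} = \sum \frac{\left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)} \times \log\left(\frac{N}{n}\right),$$
where: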
wtf is a weighted term frequency applying a weight to a frequency with
which a given query term is included in the document;
wtfAnchor is a weighted term frequency applying a weight to a frequency
with which the given query term is included in anchor text referencing the
document;
k1 is a constant;
b is a constant;
wdl is a weighted document length applying a weight to a length of the
document being scored;
avwdl is an average weighted document length of all documents being
scored;
N is the number of documents on the network; and
n is the number of documents including at least one appearance of
a given query term; and
generating an output of the ranked results of the query to be displayed to a user.
2. The computer-implemented method of claim 1, further comprising
building the index by processing each of the plurality of documents to determine the
frequency of the one or more terms included in the document.
3. The computer-implemented method of claim 2, further comprising
building the index by processing each of the plurality of documents to identify one or
more anchor text entries each referencing another document.
4. The computer-implemented method of claim 3, further comprising
generating an anchor text table, wherein an entry is made for each of the documents
including an anchor text entry, including one or more of:
a source identifier indicating the document including the anchor text entry;
a target identifier indicating a target document the anchor text entry references; and
one or more terms included in content of the anchor text entry.
5. The computer-implemented method of claim 4, further comprising
generating the index by collecting for each of the documents:
the frequency of the one or more terms included in the document; and
for each of the anchor table entries in which the document is listed as the target document
of the target identifier, a frequency of the terms listed in the content of the anchor text
entry.
6. The computer-implemented method of claim 1, further comprising ranking
the documents according to a scoring function (score) that is determined based on terms
including:
a weighted anchor term frequency (wtfAnchor); and
an anchor text length normalization component (BAnchor) derived from:
a weighted document length (wdl); and
an average weighted document length (avwdl).
7. The computer-implemented method of claim 1, wherein when the
document is not associated with anchor text data, the scoring function (score) includes:
$$\mathrm{score} = \sum \frac{wtf\,(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + wtf} \times \log\left(\frac{N}{n}\right).$$
8. The computer-implemented method of claim 6, wherein a strength of the
length normalization provided by BAnchor is adjusted by choosing a different constant
value associated with BAnchor.
9. A computer-readable storage medium storing instructions executable on a
computing system, comprising instructions to:
evaluate contents of each of a plurality of documents on a network, including:
recording a frequency of terms included within the document;
making an entry in an anchor text table for each anchor text entry
referencing another document;
compile an index, including:
generating a main index that associates with each of the documents a
frequency with which the at least one term is included in the
document;
generating an anchor text index that associates with each of the documents
a frequency of terms listed in anchor text entries in the anchor text
table referencing the document;
receive a query including at least one query term;
apply the query to the index to yield results of the query identifying one or more of the
documents that include the at least one query term;
apply a scoring function to generate a score for each of the one or more documents
included in the results of the query, wherein the scoring function (score) includes one of:
$$\mathrm{score} = \sum \frac{(wtf + wtf_{Anchor})(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + (wtf + wtf_{Anchor})} \times \log\left(\frac{N}{n}\right),$$ and
$$\mathrm{score} = \sum \frac{\left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)} \times \log\left(\frac{N}{n}\right),$$
where:
wtf is a weighted term frequency applying a weight to a frequency with
which a given query term is included in the document;
wtfAnchor is a weighted term frequency applying a weight to a frequency
with which the given query term is included in anchor text referencing the
document;
k1 is a constant;
b is a constant;
wdl is a weighted document length applying a weight to a length of the
document being scored;
avwdl is an average weighted document length of all documents being
scored;
N is the number of documents on the network; and
n is the number of documents including at least one appearance of
a given query term; and
generate an output of the ranked results of the query to be displayed to a user.
10. The computer-readable storage medium of claim 9, wherein making the
entry in the anchor text table for each of the documents includes:
storing a source identifier indicating the document including the anchor text entry;
storing a target identifier indicating a target document the anchor text entry references;
and
storing one or more terms included in content of the anchor text entry.
11. The computer-readable storage medium of claim 9, wherein when the
document is not associated with anchor text data, the scoring function (score) includes:
$$\mathrm{score} = \sum \frac{wtf\,(k_1 + 1)}{k_1\left((1 - b) + b\,\frac{wdl}{avwdl}\right) + wtf} \times \log\left(\frac{N}{n}\right).$$
12. The computer-readable storage medium of claim 9, wherein a strength of
the length normalization provided by BAnchor is adjusted by choosing a different constant
value associated with BAnchor.
13. The computer-readable storage medium of claim 9, further comprising
causing an output of the ranked results of the query to be presented to a user.
14. A search engine system, comprising:
an index for a plurality of documents, including:
a main index associating with each of the documents a frequency of one or more
terms being included in each of the documents;
an anchor text index associating with each of the documents an anchor text
frequency of the one or more terms being included in anchor text in a source document
referencing each of the documents;
a ranking system, including:
a query interface configured to receive a query including at least one query
term and apply the query to the index to identify one or more of the
documents that include the at least one query term;
a scoring function to generate a score for each of the one or more
documents included in the results of the query, wherein the scoring
function (score) includes:
$$\mathrm{score} = \sum \frac{\left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)(k_1 + 1)}{k_1 + \left(\frac{wtf}{B} + \frac{wtf_{Anchor}}{B_{Anchor}}\right)} \times \log\left(\frac{N}{n}\right),$$
where:
wtf is a weighted term frequency applying a weight to a
frequency with which a given query term is included in the
document;
wtfAnchor is a weighted term frequency applying a weight to
a frequency with which the given query term is included in
anchor text referencing the document;
k1 is a constant;
wdl is a weighted document length applying a weight to a
length of the document being scored;
avwdl is an average weighted document length of all
documents being scored;
B is a document length normalization component defined
as $B = \left((1 - b) + b\,\frac{wdl}{avwdl}\right)$ where b is a constant;
BAnchor is an anchor text normalization component defined
as $B_{Anchor} = \left((1 - b) + b\,\frac{wdl}{avwdl}\right)$ where b is a constant;
N is the number of documents on the network; and
n is the number of documents including at least one
appearance of a given query term; and
a ranking system configured to rank the results of the query based on the
score generated for each of the documents included in the results
of the query.
15. The system of claim 14, further comprising a crawler configured to
build the main index by processing each of the plurality of documents to determine the
frequency of the one or more terms included in the document.
16. The system of claim 15, wherein the crawler is further configured to build
the anchor text index by processing each of the plurality of documents to identify one or
more anchor text entries each referencing another document.
17. The system of claim 16, wherein the crawler is further configured to
generate an anchor text table, wherein an entry is made for each of the documents
including an anchor text entry, including one or more of:
a source identifier indicating the document including the anchor text entry;
a target identifier indicating a target document the anchor text entry references; and
one or more terms included in content of the anchor text entry.
18. The system of claim 17, wherein the crawler is further configured to
generate the index by collecting for each of the documents:
the frequency of the one or more terms included in the document; and
for each of the anchor table entries in which the document is listed as the target document
of the target identifier, a frequency of the terms listed in the content of the anchor text
entry.

Documents

Application Documents

#	Name	Date
1	1484-del-2005-Other-Documents-(08-06-2005).pdf	2005-06-08
2	1484-del-2005-GPA-(08-06-2005).pdf	2005-06-08
3	1484-del-2005-Form-5-(08-06-2005).pdf	2005-06-08
4	1484-del-2005-Form-3-(08-06-2005).pdf	2005-06-08
5	1484-del-2005-Form-2-(08-06-2005).pdf	2005-06-08
6	1484-del-2005-Form-1-(08-06-2005).pdf	2005-06-08
7	1484-del-2005-Drawings-(08-06-2005).pdf	2005-06-08
8	1484-del-2005-Description-(Complete)-(08-06-2005).pdf	2005-06-08
9	1484-del-2005-Correspondence-others-(08-06-2005).pdf	2005-06-08
10	1484-del-2005-Claims-(08-06-2005).pdf	2005-06-08
11	1484-del-2005-Abstract-(08-06-2005).pdf	2005-06-08
12	1484-del-2005-Other-Documents-(30-08-2005).pdf	2005-08-30
13	1484-del-2005-Correspondence-others-(30-08-2005).pdf	2005-08-30
14	1484-del-2005-Form-3-(16-10-2007).pdf	2007-10-16
15	1484-del-2005-Correspondence-others-(16-10-2007).pdf	2007-10-16
16	1484-del-2005-Form-18-(29-09-2008).pdf	2008-09-29
17	1484-del-2005-Correspondence-others-(29-09-2008).pdf	2008-09-29
18	1484-del-2005-Form-13-(14-10-2008).pdf	2008-10-14
19	1484-del-2005-Correspondence-others-(14-10-2008).pdf	2008-10-14
20	1484-del-2005-Claims-(14-10-2008).pdf	2008-10-14
21	1484-DEL-2005-GPA-(11-06-2010).pdf	2010-06-11
22	1484-del-2005-Form-13-(11-06-2010).pdf	2010-06-11
23	1484-DEL-2005-Correspondence-Others-(11-06-2010).pdf	2010-06-11
24	1484-DEL-2005-Form-1-(06-01-2011).pdf	2011-01-06
25	1484-DEL-2005-Correspondence-Others-(06-01-2011).pdf	2011-01-06
26	1484-DEL-2005_EXAMREPORT.pdf	2016-06-30