Method For Gathering Content And Resources On The World Wide Web

Abstract: A method for gathering content and resources on the World Wide Web by simulating user interaction, comprising the steps of parsing and collecting tags from the HTML page, fetching out form tags and child nodes from the page, identifying the value of the nodes, identifying the correct form tag for form submission from a database, parsing the child nodes of the form tag and collecting the values, permuting the values collected and submitting the combinations through the webpage to generate at least one page, and filtering, gathering and indexing the desired content.

Patent Information

Application #

Filing Date

29 October 2007

Publication Number

37/2009

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

Parent Application

Applicants

HCL TECHNOLOGIES LIMITED

184 NSK SALAI (ARCOT ROAD), VADAPALANI, CHENNAI-600 026, INDIA.

Inventors

1. SANDEEP DHAWAN

A8-9 SECTOR 60, NOIDA 201301, INDIA.

2. KINNAR KUMAR SEN

C/O HCL TECHNOLOGIES LTD, A8-9 SECTOR 60, NOIDA 201301, INDIA

3. SUMIT DATTA

C/O HCL TECHNOLOGIES LTD, A8-9 SECTOR 60, NOIDA 201301, INDIA

Specification

Field of Invention
The present invention relates to a system and method that can be incorporated in any search platform for gathering content and resources on the World Wide Web by simulating user interaction. More particularly, the instant invention relates to a method that simulates human submission of HTML forms during a search engine crawl and a method of extracting domain-specific attribute values from any HTML page.
Background of the Invention
With the explosion of data available on the World Wide Web, search engines have become a part of everyone's digital life. However, scanning all web pages, extracting relevant information and then indexing it is no mean task for a search engine. This is primarily due to the huge size of the Internet and the diverse techniques employed by websites in delivering information through web pages. In earlier days, most websites presented static HTML pages that were updated by administrators through backend systems. These have now been replaced by dynamic websites that are driven by data stored in databases. In many websites, this technique is further refined by presenting only relevant data based on a user-entered search query. For example, weather.com will not provide you the weather forecast for all cities in the world in one go - it provides data for the specific city you are interested in - and hence asks you to enter the city name or zip code. This makes perfect sense, since a user is only interested in the weather conditions of a specific city. Content that is reachable only through such queries is also called the Deep Web.
The above change in data presentation by websites has had a great impact on the working of popular search engines. This is because most search engines support only the surface web - websites that present their data through static links (and do not use a user query).
Another problem with the available search engine platforms is their inability to show the most appropriate result for a search query, as they do not understand the context of the user's query. For example, if a user enters the search query "Old cars Make Honda Mileage 50000", a typical search engine will show all web pages that have one or more of the search keywords (such as Old, Honda, Mileage 50000 etc.), whereas the user may actually be looking to purchase an old Honda car that has a mileage of 50000. This is due to the inability of search engines to extract domain-specific information from a web page during the crawling process. Continuing with the above example, a search engine should be able to identify attributes such as Car Make, Model and Mileage etc. (which are related to the automotive industry) while crawling web pages in this domain.
US Patent 6,665,658 titled "System and method for automatically gathering dynamic content and resources" attempts to solve the problem of automatically creating and submitting user queries, but this patent is applicable only to those dynamic websites that use cookies for session management. Further, this invention uses a pre-configured DTD (Document Type Definition) for each website that specifies how the website has to be navigated, which elements are present in the web page etc. This invention creates a Query Template storing all FORM elements and their possible values. This process is done manually and the template is finally stored in a database. During the actual crawling process, the system retrieves this query template and then constructs search queries. This invention does not support JavaScript-based FORMs wherein selection of one value in a FORM element causes values in other FORM elements to change.
Therefore there is a need for a system in search engines that is applicable to all dynamic websites that use HTML FORM based search queries to provide data. The system should not require any previous knowledge about the structure of a website. It should automatically identify which HTML FORM is to be filled in and submitted. During the actual crawling process, the system should automatically identify all FORM elements and their corresponding values and then use them to construct the query. The need for a database as used in US Patent 6,665,658 should be obviated. Further, since most FORMs are JavaScript-based, there is a need for such forms to be supported. Moreover, once a web page is retrieved, the system should be able to identify domain-specific attribute values - this will improve the quality of search results.

Summary
It is the object of the instant invention to provide a System and Method that can be
incorporated in any search platform for gathering content and resources on the World
Wide Web by simulating user interaction.
It is yet another object of the instant invention to provide a system and method that
enables domain-specific attribute based indexing capability that can be applied to any
Hyper Text Markup Language (HTML) page.
It is yet another object of the instant invention to provide a method for filtering,
gathering and indexing content and storing it in a data store.
Brief description of accompanying drawings
The features of this invention, together with its objects and advantages, may be best understood by reference to the description taken in conjunction with the accompanying drawings.
Figure 1 illustrates a typical computer system that may be used to practice the present invention - it comprises crawling, filtering and indexing mechanisms.
Figure 2 illustrates a flow diagram of HTML form submission.
Figure 3 illustrates the flow diagram used in the content filter mechanism.
Description of the Preferred Embodiments
The preferred embodiments will now be described with reference to the accompanying drawings.

As shown in Figure 2, once the crawler detects an HTML form page (201), it parses the HTML page and collects all HTML tags (202). An HTML page contains many kinds of tags. All the FORM tags present in the HTML page are identified and their child nodes picked up (203). A child node of a FORM tag could be, for example, a text field. For each FORM tag, the system finds out its attributes (such as "action" and "method") and its child elements and their values. It then accesses a metadata dictionary present within the system (205) and uses this metadata to identify which FORM tag is relevant for the system (206). Typically, an HTML page may contain many FORM tags; the challenge is to find out which one is relevant. This is done by different techniques, such as matching the names found in the FORM with those stored in the metadata dictionary. For the identified FORM tag, the system gets its child nodes and their values and stores these values within the system (207). It generates a permutation based on the above values and then submits the FORM (208). It gets the HTML response page and processes it further (209). If the response page shows no results, the above combination of values is stored within the system for self-learning.
The system can be trained to auto-populate fields in an HTML FORM. This is done by defining metadata that stores the text field name and its values. For example, for websites related to the automotive industry, the system administrator can define "Car Manufacturer" as a text field and specify "Nissan" and "Honda" as its possible values. Whenever the system encounters an HTML FORM that has "Car Manufacturer" (or its synonyms) as a text field, the system auto-populates and submits the FORM for each value. The content so generated is analyzed (211) and forwarded for indexing (212).
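The form-handling steps described for Figure 2 can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation disclosed in the specification: the class `FormCollector`, the functions `select_form` and `build_queries`, the sample page and the metadata dictionary are all hypothetical. The sketch reads "permutation" as the Cartesian product of the values available for each field, which is an assumption, and it stops short of actually submitting the FORM over the network.

```python
from html.parser import HTMLParser
from itertools import product


class FormCollector(HTMLParser):
    """Collects every FORM tag with its attributes and child field values."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per FORM tag found on the page
        self._current = None  # FORM currently being parsed, if any
        self._select = None   # name of the SELECT currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "name": attrs.get("name", ""),
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif self._current is None:
            return  # ignore fields that are not inside a FORM
        elif tag == "select":
            self._select = attrs.get("name", "")
            self._current["fields"].setdefault(self._select, [])
        elif tag == "option" and self._select is not None:
            if attrs.get("value"):
                self._current["fields"][self._select].append(attrs["value"])
        elif tag == "input" and attrs.get("type", "text") == "text":
            # Text fields carry no candidate values on the page itself.
            self._current["fields"].setdefault(attrs.get("name", ""), [])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            self._current = None


def select_form(forms, metadata):
    """Pick the FORM whose field names best match the metadata dictionary."""
    def score(form):
        return sum(1 for name in form["fields"] if name.lower() in metadata)

    best = max(forms, key=score, default=None)
    return best if best is not None and score(best) > 0 else None


def build_queries(form, metadata):
    """Yield one {field: value} dict per combination to be submitted."""
    names, choices = [], []
    for name, values in form["fields"].items():
        # Fall back to the metadata dictionary for fields without page values.
        candidates = values or metadata.get(name.lower(), [])
        if candidates:
            names.append(name)
            choices.append(candidates)
    if not names:
        return
    for combo in product(*choices):
        yield dict(zip(names, combo))


# Hypothetical sample page and metadata dictionary (assumptions for illustration).
SAMPLE_PAGE = """
<form name="newsletter" action="/subscribe" method="post">
  <input type="text" name="email">
</form>
<form name="search" action="/cars" method="get">
  <select name="make"><option value="honda">Honda</option>
    <option value="nissan">Nissan</option></select>
  <select name="year"><option value="2005">2005</option>
    <option value="2006">2006</option></select>
</form>
"""
SAMPLE_METADATA = {"make": ["honda", "nissan"], "year": []}
```

Feeding `SAMPLE_PAGE` to a `FormCollector`, passing its `forms` to `select_form` and then calling `build_queries` on the result produces one query per combination of make and year. A combination that returns no results could be recorded, in the spirit of the self-learning step, so that it is not resubmitted.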
Figure 3 illustrates the content filter mechanism through which the preferred embodiment analyzes contents retrieved during HTML form submission before forwarding the content for indexing. The mechanism works as follows:
Create metadata in the system defining attributes that are relevant for the specific domain (301). For example, while developing a search engine specific to the financial sector, "Interest Rate" is an attribute the crawler should index. Similarly, for real estate, the price, number of bedrooms etc. are of interest. Also define rules for system behavior in the case of multi-value attributes. For example, in the case of real estate, you can define that "Number of bedrooms" is always a numerical value. Hence, during the crawling process, if the system encounters two values for "Number of bedrooms", the numerical value will be picked up. These attributes are used to build patterns, which are registered with the system (302). One then crawls the Web using the method defined earlier to get the entire content of the HTML page (303), and removes all non-textual content (such as images, banners and audio-video content) from the HTML (Hyper Text Markup Language) page (304).
Until the page is exhausted, each pattern (from the pattern database registered within the system) is applied (305) to the residual text content to identify and extract the values (305) corresponding to the various attributes. The system then tags the value, along with the keyword, to the page (307). The system determines whether multi-value attribute rules are required and, if so, applies them. It then checks whether the page has been exhausted (308); if it has, the system sends the tagged page for indexing (309).
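The content filter mechanism of Figure 3 can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the specification does not say how patterns are represented, so regular expressions are used here, and the names `TextExtractor`, `PATTERNS`, `NUMERIC_ATTRIBUTES` and `extract_attributes`, together with the real-estate patterns themselves, are hypothetical. For brevity the sketch keeps a single value per attribute rather than tagging every result on the page.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops markup plus script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Images and banners produce no text data, so they vanish here.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


# Hypothetical pattern registry for a real-estate domain (steps 301 and 302).
PATTERNS = {
    "price": re.compile(r"Price:\s*\$?([\d,]+)", re.I),
    "bedrooms": re.compile(r"Bedrooms:\s*(\w+)", re.I),
}

# Multi-value rule: for these attributes, prefer a numerical value.
NUMERIC_ATTRIBUTES = {"bedrooms"}


def extract_attributes(html):
    """Return {attribute: value} tags extracted from the page text."""
    extractor = TextExtractor()
    extractor.feed(html)                  # steps 303 and 304
    text = " ".join(extractor.parts)

    tagged = {}
    for attribute, pattern in PATTERNS.items():   # apply patterns one by one
        values = pattern.findall(text)
        if not values:
            continue
        if attribute in NUMERIC_ATTRIBUTES:
            numeric = [v for v in values if v.isdigit()]
            values = numeric or values    # fall back if nothing is numeric
        tagged[attribute] = values[0]
    return tagged
```

For a page that lists "Bedrooms: three" and "Bedrooms: 3", the numeric rule selects "3", mirroring the "Number of bedrooms" example above.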
The invention with its preferred embodiments can be better understood using the following example:
Using the preferred embodiment while crawling a job site, the crawler encounters a web page wherein certain attributes need to be filled. The system parses the web page and collects all HTML tags. It then detects all FORM tags out of the HTML tags. There may be more than one position on the HTML page where one needs to fill certain attributes before posting the page to get the desired information. The system therefore detects all these values. It then identifies the values of the "action", "method" etc. nodes from the FORM tags. The system then accesses its metadata. This is an important configuration relating to the definition of a domain-specific data dictionary within the system. This metadata helps the system in extracting specific parametric values from each web page crawled. For example, while crawling a job site, a web page may require values like field of work, experience, cost to company etc. The system takes all values associated with these fields from a database, permutes these values to get as many combinations as possible and feeds the FORM tags of the web page with these values. Once the values have been fed, the website generates a page reflecting at least one result. One may have 10 job offers listed in a page. There is also a chance that the website may show certain data, such as an advertisement related to the post, that is not relevant information. The system therefore filters this information. To achieve this, the system first filters out all tags to get only the visible textual content. It then applies patterns (stored in a database) one by one to the textual content to extract the exact keywords and their values. These extracted values are then tagged, along with the keyword, to the page. Since there may be multiple results, the system repeats this process till it reaches the end of the page.
Suppose the system posts the keywords "patent agent", "patent lawyer" and "one year". It may receive a page with some results, some graphic advertisements and links pointing to other websites. Using one of the preferred embodiments, the crawler is able to filter out all non-textual content, find the values received on the page that correspond to the keywords and then tag these values and keywords to the page. It identifies and tags all values available. Once the page has been exhausted, the tagged values, along with the keywords and the page, are sent for indexing. So when an end user enters the keywords "patent agent", "patent lawyer" and "one year" on a search engine, the indexer immediately directs the search engine to the pages crawled.
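One way to picture the final indexing step is an inverted index keyed on (keyword, value) pairs, so that a query on tagged attributes leads directly to the crawled pages. The specification does not describe the index structure, so the following Python sketch is an assumption throughout: the class `AttributeIndex`, its methods, the attribute names and the example URLs are hypothetical.

```python
from collections import defaultdict


class AttributeIndex:
    """Maps (keyword, value) pairs to the set of pages tagged with them."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _key(keyword, value):
        # Normalise case so "Patent Agent" and "patent agent" match.
        return (keyword.lower(), str(value).lower())

    def add_page(self, url, tagged):
        """Index one crawled page given its {keyword: value} tags."""
        for keyword, value in tagged.items():
            self._postings[self._key(keyword, value)].add(url)

    def search(self, **criteria):
        """Return the pages that carry every requested keyword/value pair."""
        sets = [
            self._postings.get(self._key(keyword, value), set())
            for keyword, value in criteria.items()
        ]
        return set.intersection(*sets) if sets else set()


# Hypothetical tagged pages from a job-site crawl.
index = AttributeIndex()
index.add_page("http://jobs.example.com/1",
               {"role": "patent agent", "experience": "one year"})
index.add_page("http://jobs.example.com/2",
               {"role": "patent lawyer", "experience": "one year"})
index.add_page("http://jobs.example.com/3",
               {"role": "patent agent", "experience": "five years"})
```

With this data, `index.search(role="patent agent", experience="one year")` returns only the first page, whereas a plain keyword match on "patent" or "one year" would have returned all three.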
The terms FORM, tag, tags, HTML, HTML tags, action, get, post, database, metadata and data dictionary used in this specification bear the same meaning as understood by a computer system developer, especially a person developing search engine platforms.
Other modifications and variations to the invention will be apparent to those skilled in the art from the foregoing disclosure. Thus, while only certain embodiments of the invention have been specifically described herein, it will be apparent that numerous modifications may be made thereto without departing from the spirit and scope of the invention. For example, one or more programming modules can be clubbed to form a single program, or one can have a dedicated gateway server. Other modifications are apparent.

We Claim
1. A method for gathering content and resources on the World Wide Web by simulating user interaction, comprising the steps of:
- parsing and collecting tags from the HTML page,
- fetching out form tags and child nodes from the page,
- identifying the value of the nodes,
- identifying correct form tag for form submission from a database,
- parsing the child nodes of the form tag and collecting the values, and
- permuting the values collected and submitting the combinations through the webpage to generate at least one page.
2. Method as claimed in claim 1 further comprising the steps of
- filtering generated page
- gathering the desired content and
- indexing the desired content.
3. Method as claimed in claim 1 wherein the step of filtering, gathering and indexing comprises the steps of:
- defining the attribute of a particular domain,
- building patterns with the attributes,
- getting the content of the HTML page,
- using patterns one by one on textual content and extracting the exact keyword and its corresponding value,
- extracting the value from the keyword,
- tagging the value along with the keyword with the page,
- repeating the above steps till the page has been exhausted,
- sending the tagged page for indexing and storage in a database.
4. A system for gathering content and resources on the World Wide Web by simulating user interaction, comprising:
- means for parsing and collecting tags from the HTML page,
- means for fetching out form tags and child nodes from the page,
- means for identifying the value of the nodes,
- means for identifying correct form tag for form submission from a database,
- means for parsing the child nodes of the form tag and collecting the values, and
- means for permuting the values collected and submitting the combinations through the webpage to generate at least one page.
5. System as claimed in claim 4 further comprising
- means for filtering generated page
- means for gathering the desired content and
- means for indexing the desired content.
6. System as claimed in claim 4 wherein the means for filtering, gathering and indexing comprises:
- means for defining the attribute of a particular domain,
- means for building patterns with the attributes,
- means for getting the content of the HTML page,
- means for using patterns one by one on textual content and extracting the exact keyword and its corresponding value,
- means for extracting the value from the keyword,
- means for tagging the value along with the keyword with the page,
- means for repeating the above steps till the page has been exhausted,
- means for sending the tagged page for indexing and storage in a database.
7. A computer program product for gathering content and resources on the World Wide Web by simulating user interaction, said computer program product comprising a computer readable means configured for:
- parsing and collecting tags from the HTML page,
- fetching out form tags and child nodes from the page,
- identifying the value of the nodes,
- identifying correct form tag for form submission from a database,
- parsing the child nodes of the form tag and collecting the values, and
- permuting the values collected and submitting the combinations through the webpage to generate at least one page.
8. A method for gathering content and resources on the World Wide Web by simulating user interaction, substantially as herein described.
9. A system for gathering content and resources on the World Wide Web by simulating user interaction, substantially as herein described.

Documents

Application Documents

#	Name	Date
1	2451-CHE-2007 FORM-18 23-04-2010.pdf	2010-04-23
2	2451-CHE-2007 POWER OF ATTORNEY 16-08-2010.pdf	2010-08-16
3	2451-che-2007-form 3.pdf	2011-09-04
4	2451-che-2007-form 1.pdf	2011-09-04
5	2451-che-2007-correspondnece-others.pdf	2011-09-04
6	2451-che-2007-description(complete).pdf	2011-09-04
7	2451-che-2007-drawings.pdf	2011-09-04
8	2451-che-2007-claims.pdf	2011-09-04
9	2451-che-2007-abstract.pdf	2011-09-04
10	2451-CHE-2007_EXAMREPORT.pdf	2016-07-02
11	2451-CHE-2007-AbandonedLetter.pdf	2017-07-06