Abstract: Described herein are methods and systems implementing a web page classification system for web page classification based on multi level features. In one embodiment, the web page classification system (102) for classifying a web page into pre-defined categories includes a processor (202); a memory (204) coupled to the processor (202), wherein the memory (204) includes a classifier (110) to classify a web page based on uniform resource locator features, context features and content features.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
1. Title of the invention
WEB PAGE CLASSIFICATION USING MULTI LEVEL FEATURES
2. Applicant(s)
NAME: TATA CONSULTANCY SERVICES LIMITED
NATIONALITY: Indian
ADDRESS: Nirmal Building, 9th Floor, Nariman Point, Mumbai-400021, Maharashtra, India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it
is to be performed.
TECHNICAL FIELD
The present subject matter relates, in general, to web page classification and, in
particular, to web page classification using multi level features.
BACKGROUND
In recent years, there has been an exponential rise in the usage of the internet to
access information regarding any conceivable topic. Also, there has been an enormous increase
in the number of websites, electronic portals, etc., resulting in a very large number of web pages.
This has made retrieving relevant information from web pages very difficult.
Conventionally, search engines are used to assist users to locate the relevant web
pages regarding a subject. The search engines perform a search in data repositories based on the key words entered by the user in a query string. This generates a search result comprising a large number of web pages containing the key words. More often than not, a key word relates to more than one subject. For example, a key word "court" may relate to the field of law as in a court of justice or may relate to sports as in a tennis court. Accordingly, when a key word relevant to more than one subject is entered, the search result comprises web pages from all these fields. As a result, the users are required to scan manually through a large number of web pages to identify information relevant for their purposes. The scanning consumes a lot of time and effort of the users.
To enable the users to locate web pages pertaining to a subject of their interest in
a time efficient manner, web pages are classified into various categories based on the subject. Though various attempts at the classification of web pages have been made, the large number of web pages makes this a difficult task.
Further, the web pages are heterogeneous in nature thus making the classification
complex. For example, the web pages may be unstructured documents like text documents, semi-structured documents like Hypertext Markup Language (HTML) files, or fully structured documents like Extensible Markup Language (XML) files. The web pages may also contain files of various formats, such as image files, audio files, video files, multimedia files, etc. Thus, the distinct varieties of the web pages pose a challenge to web page classification.
SUMMARY
The subject matter described herein relates to a system and method for
automatically classifying web pages, which are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
In accordance with one embodiment of the subject matter, the web page
classification system for classifying a web page into pre-defined categories includes a processor; a memory coupled to the processor, wherein the memory includes a classifier to classify a web page based on URL features, context features and content features.
BRIEF DESCRIPTION OF DRAWINGS
The above and other features, aspects, and advantages of the subject matter will
be better understood with regard to the following description, appended claims, and accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference number in different figures indicates similar or identical items.
Fig. 1 illustrates a network environment for implementing a web page
classification system, in accordance with an embodiment of the present subject matter.
Fig. 2 illustrates a web page classification system, in accordance with one
embodiment of the present subject matter.
Fig. 3 illustrates an exemplary method for web page classification based on
multilevel features, according to an embodiment of the present subject matter.
Fig. 4 illustrates an exemplary method for dynamic expansion of categories of
web page classification, according to an embodiment of the present subject matter.
DETAILED DESCRIPTION
The present subject matter relates to systems and methods for web page
classification.
Web page classification is a technique used to classify documents, such as the
web pages available on World Wide Web (the web), under appropriate categories. Web page
classification is used by search tools, such as search engines, to search and provide data relevant
to a keyword search made by a user looking for some particular information on Internet.
Classification of web pages by human experts is possible but the large number of
web pages makes manual classification of web pages a daunting and difficult task. Conventional systems developed for web page classification either retrieve very little data from a web page, like the title of the web page or the uniform resource locator (URL) of the web page, leading to inaccurate classification, or read the full content of the web page to classify the web page, resulting in high resource consumption. For example, classification of web pages based on URL features is fast but has low accuracy, whereas classification of web pages based on reading the full content gives accurate results but results in increased resource and time consumption as compared to classification of web pages based on URL features. Therefore, it is desirable to have a system for web page classification that can optimize the classification process by performing classification accurately without consuming a large amount of resources.
The present subject matter discloses systems and methods for automatically
classifying the web pages based on multi level features. A web page classification system, according to the present subject matter, is configured to classify a web page into pre-defined categories based on the features of the web page. In one embodiment, the web page classification system progressively increases the volume of data retrieved from a web page until it is classified into pre-defined categories, thereby optimizing accuracy of web page classification and resource consumption.
Most of the web pages today are created using markup languages including, but
not limited to, Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML). For improving web page representation, the HTML structure of a web page is exploited. This helps in identifying more representative terms of the web page. A web page can be represented using several elements of the web page including a body of the web page (BODY), the title of the web page (TITLE), section headings (H1-H6), emphasized content (EM) and/or meta description of the web page (meta tags).
According to an embodiment of the present subject matter, a web page is divided
into three distinct parts, the URL of the web page, the context features of the web page and the content of the web page. The web page classification system is configured to classify a web page into pre-defined categories based on the URL of the web page. For example, in one
implementation, the URL features are generated by checking delimiters, such as punctuation symbols in the URL. Any string between the delimiters is considered as a feature. Terms like http, www, .htm or .html, commonly occur in all the URLs, and hence are not useful for classification and therefore may be discarded from the feature list. The feature list is stemmed using any conventional stemming algorithm. The stemming of the features generates root words as URL features.
In case the web page could not be classified based on the URL, the web page
classification system is configured to classify the web page into categories pre-defined by the system administrator based on the context features of the web page. In one implementation, the web page classification system reads the information contained within the HEAD tag of the web page, removes any non-alphanumeric characters, removes punctuation symbols, removes common words and stop words as retrieved from a set of pre-defined classification rules, removes HTML tags and stores the resultant words in lower case as a context feature list. The words in the context feature list may be stemmed to their root words by applying any of the conventional stemming algorithms. The web page classification system then classifies the web page into pre-defined categories based on the context features.
In case the web page could not be classified based on the context features, the
web page classification system uses content of the web page to classify the web page into categories pre-defined by the system administrator. In one implementation, the web page classification system reads the content of the web page, which is usually enclosed by the BODY tags in HTML. The BODY tag wraps around all of the content of the web page, such as headings, paragraphs, images, tables and so on. The web page classification system reads the content of the web page, removes all HTML tags, non-alphanumeric characters, and punctuation symbols. Additionally all common words and stop words retrieved from the set of classification rules are also removed and the resultant list is converted to lower case and is saved as content feature list. The words in the content feature list may be stemmed to their root words by applying any of the conventional stemming algorithms.
For example, consider a case where the web page classification system is to
classify 1000 web pages into predefined categories. Statistically, in one example, around 40% of the web pages may be classified based on URL features, which requires processing a low volume of data and hence consumes very low resources and time. Thus, only the remaining 600 pages
are further processed using their context features, which requires processing a relatively larger volume of data and hence requires more resources and time as compared to processing of URL features. Statistically, for illustration, it may be considered that around 70% of these remaining web pages can be classified using context features. Hence, only 180 pages are further processed for content features, which involves processing the full contents of the web page and requires the maximum amount of resources and time as compared to processing of URL features and context features. Statistically, in one example, around 92% of the web pages can be classified according to content features. Thus, the web page classification system saves on processing time and resource consumption for classifying web pages without compromising on accuracy of classification of web pages.
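The arithmetic of the first two stages in this illustration can be checked with a short sketch. The percentages are the illustrative figures quoted above, applied to the pages that reach each stage; the function name and the use of whole-number percentages are assumptions made only for this example.

```python
def remaining_after_stages(total_pages, stage_percentages):
    """Return the number of pages still unclassified after each stage.

    stage_percentages holds, for each stage, the percentage of the pages
    reaching that stage which the stage is assumed to classify.
    """
    remaining = total_pages
    history = []
    for percent in stage_percentages:
        classified = remaining * percent // 100  # integer arithmetic, no rounding drift
        remaining -= classified
        history.append(remaining)
    return history


# URL stage (40%) followed by context stage (70% of what is left).
print(remaining_after_stages(1000, [40, 70]))
```

With these figures the sketch reproduces the 600 pages passed to context processing and the 180 pages passed to content processing.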
In one embodiment, if a web page could not be classified in any of the pre-defined
categories, the web page is classified into a miscellaneous category. Further, when the number of
web pages in the miscellaneous category exceeds a pre-defined threshold, the web page
classification system is configured to cluster the web pages classified as miscellaneous so as to
create clusters of similar and related web pages. In one example, the pre-defined threshold may
be defined by a system administrator. When the number of web pages in a group exceeds a
predefined number set by the system administrator, the web page classification system alerts the
system administrator or a group of subject matter experts of the possibility of creation of a new
category and prompts the system administrator or subject matter experts for the name of the
category. Alternatively, the web page classification system may suggest a name for the new
category based on the most frequently occurring word or phrase in a cluster of similar web pages
and classify the web pages in the cluster under the newly defined category.
Thus, the web page classification system optimizes the resource consumption in
web page classification and accurately classifies web pages into pre-defined categories. Additionally, the web page classification system supports dynamic creation of new categories, thus further increasing the accuracy of web page classification.
The following disclosure describes systems and methods for web page
classification based on multi level features. While aspects of the described systems and methods can be implemented in any number of different computing systems, environments, and/or configurations, embodiments for the web page classification system are described in the context of the following exemplary system(s) and method(s).
It will be appreciated by those skilled in the art that the words during, while, and
when as used herein are not exact terms that mean an action takes place instantly upon an initiating action but that there may be some small but reasonable delay, such as a propagation delay, between the initial action and the reaction that is initiated by the initial action. Additionally, the word "connected" is used throughout for clarity of the description and can include either a direct connection or an indirect connection.
Fig. 1 illustrates a network environment 100 implementing a web page
classification system 102 for classification of web pages based on multi level features, according
to an embodiment of the present subject matter. In the network environment 100, the web page
classification system 102 is connected to a network 104 to access data 106.
The web page classification system 102 may be any computing device connected
to the network 104. For instance, the web page classification system 102 may be implemented as mainframe computers, workstations, personal computers, desktop computers, hand-held devices, multiprocessor systems, personal digital assistants (PDAs), laptops, network computers, minicomputers, servers and the like. In addition, the web page classification system 102 may include multiple servers to perform mirrored tasks for users, thereby relieving congestion or minimizing traffic.
The data 106, for example, may be stored on the World Wide Web. Although the
data 106 is usually stored on the World Wide Web, it may also be stored elsewhere, such as on a server internal to an organization, or may be available from one or more data repositories from different sources. The data 106 accessed by the web page classification system 102 is stored as one or more web pages.
In one embodiment, the data 106 and the web page classification system 102 both
include computer readable data storage media, such as hard disk drives and RAM memory, which store program instructions and data.
The web page classification system 102 accesses data 106 through the network
104. Communication links between the data 106 and the web page classification system 102 are enabled through a desired form of connections, for example, via dial-up modem connections, cable links, digital subscriber lines (DSL), wireless or satellite links, or any other suitable form of communication.
The network 104 may be a wireless network, a wired network, or a combination
thereof. The network 104 can also be an individual network or a collection of many such individual networks interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The network 104 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet and such. The network 104 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the network 104 may include network devices, such as network switches, hubs, routers, HBAs, for providing a link between the web page classification system 102 and the data 106. The network devices within the network 104 may interact with the web page classification system 102 and the data 106 through communication links.
The web page classification system 102 accesses the data 106 that may be stored
as web pages or in form of structured documents like XHTML pages, XML pages. The data 106
so accessed is classified by the web page classification system 102. Classification of data
includes categorizing the web pages into one or more pre-defined categories. For the
classification described above, the web page classification system 102 uses three types of data,
namely URL, context features and content features of a web page to be classified.
The web page classification system 102 in accordance with the present subject
matter, employs a classifier 110 that may use machine learning methods to classify web pages based on multi level features. In one example, the web page classification system 102 uses machine learning techniques to build, train and test document classifiers. Document classifiers may be understood as applications that perform statistical text classification using one or more conventional document classification algorithms such as Naive Bayes and Space Vector representation. Rather than actually using the document classifiers to categorize individual documents, the web page classification system 102 extracts classification rules based on the document classification algorithms from the document classifiers. The classification rules are then transformed into a set of queries that can be sent to a search interface of the data 106. The search interface may then search the data 106 and provide the number of web pages having URL, context features or content features corresponding to the queries provided by the web page
classification system 102. The web page classification system 102 uses the number of matches reported for each query to make classification decisions without having to retrieve and analyze any of the actual documents stored in the data 106.
One or more client devices 108-1, 108-2, 108-N, collectively referred to as
client devices 108, may retrieve information from the web page classification system 102. The client devices 108 may be in the form of mainframe computers, workstations, personal computers, desktop computers, hand-held devices, multiprocessor systems, personal digital assistants (PDAs), laptops, network computers, minicomputers, servers and the like. The client devices 108 are connected to the web page classification system 102 through the network 104. The client devices 108 may be located at physically and geographically different places. In one embodiment, the client devices 108 may generate a request for the web page classification system 102 to classify a web page or a set of web pages into pre-defined categories.
Fig. 2 illustrates the web page classification system 102 in accordance with one
embodiment of the present subject matter. The web page classification system 102 includes
processor(s) 202, a memory 204 coupled to the processor(s) 202 and I/O interface(s), referred to
as interface(s) 206 to facilitate communication with other devices and systems.
The processor(s) 202 can be a single processing unit or a combination of multiple
processing units. The processor(s) 202 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 202 are configured to fetch and execute computer-readable instructions and data stored in the memory 204.
The interface(s) 206 may include a variety of software and hardware interfaces,
for example, interface for peripheral device(s) such as a keyboard, a mouse, an external memory, a printer, etc. Further, the interfaces 206 may enable the web page classification system 102 to communicate with other computing devices, such as web servers and external databases. The interfaces 206 may facilitate multiple communications within a wide variety of protocols and networks, such as the network 104, including wired networks, e.g., LAN, cable, etc., and wireless networks, e.g., WLAN, cellular, satellite, etc. The interfaces 206 may include one or more ports for connecting the web page classification system 102 to the data 106.
The memory 204 can include any computer-readable medium known in the art
including, for example, volatile memory such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 204 includes modules 208 and data 210.
The modules 208 include the classifier(s) 110, a category generation module 222,
a clustering module 224 and other module(s) 226. The other module(s) 226, in general, include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types and may include programs that supplement applications implemented by the web page classification system 102.
The classifier(s) 110 include a URL based classifier 228 that classifies a web page
based on URL features, a context based classifier 230 that classifies a web page based on context
features and a content based classifier 232 that classifies a web page based on content features.
The data 210 includes a pre-defined category list 212, classification rules 214,
stemming data 216 and other data 218. The other data 218 may include any of the instructions or inference rules which might be required for the functioning of the web page classification system 102. For example, the other data 218 might include web crawlers and robots.
In operation, the web page classification system 102 retrieves a web page from
the data 106. In accordance with one embodiment of the subject matter, at the first instance, the URL based classifier 228 identifies the URL features to classify the web page based on URL features. In one implementation, the URL features are generated by checking delimiters, such as punctuation symbols, in the URL. Any string between the delimiters is considered as a feature. All the features so identified are populated in a feature list. Terms like http, www, .htm or .html commonly occur in all the URLs, and hence are not useful for classification and therefore may be discarded from the feature list by the URL based classifier 228. The URL based classifier 228 further removes words that appear more than once in the feature list to generate a set of tokens. The generated set of tokens is stemmed using any conventional stemming algorithm, for example, Porter's stemming algorithm, which is retrieved from stemming data 216. The stemming of the features generates root words as URL features. For example, consider a URL 'http://www.tcs.com/careers/Pages/default.aspx'. Here the delimiters are ':', '/', and '.'. The words http, www, tcs, com, careers, pages, default and aspx are features. The words http, www,
com, aspx are common words that occur in many URLs and are not useful for the classification of the web page. Accordingly, the common words are removed from the feature list. Further, the words 'careers' and 'pages' may be stemmed to 'career' and 'page' respectively. Therefore the URL based classifier 228 may take tcs, career, and page as URL features and may attempt to classify the web page into one or more pre-defined categories retrieved from the category list 212. For example, the URL based classifier 228 may use the features tcs, career, and page to locate a relevant category from amongst a plurality of categories. Thus the URL based classifier may relate the word "tcs" to a name of an organization and classify the web page as belonging to the organization category. Further, the word "career" may be interpreted as being related to careers or jobs or recruitment and the web page might be classified accordingly. However, a web page may not always be classified based on the URL features. This may be due to lack of clarity or availability of URL features or non-existence of suitable pre-defined categories in which the web page may be classified. In such a case, the web page classification system is configured to classify the web page based on context features.
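The URL feature generation described above may be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the common-token list is assumed (the word "default" is included only so that the result matches the example in the text), and naive_stem is a toy stand-in for a conventional stemmer such as Porter's algorithm.

```python
import re

# Assumed list of tokens too common in URLs to help classification.
# "default" is added here only so the output matches the example in the text.
COMMON_URL_TOKENS = {"http", "https", "www", "com", "htm", "html", "aspx", "default"}


def naive_stem(word):
    """Toy stand-in for a real stemmer: strips a single trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def url_features(url):
    """Split a URL on delimiters, drop common tokens and repeats, then stem."""
    # Any run of non-alphanumeric characters is treated as a delimiter.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    features = []
    for token in tokens:
        if token in COMMON_URL_TOKENS:
            continue
        stem = naive_stem(token)
        if stem not in features:  # keep each root word once, in order
            features.append(stem)
    return features


print(url_features("http://www.tcs.com/careers/Pages/default.aspx"))
```

For the example URL, the sketch yields the features tcs, career and page. A production system would replace naive_stem with a full stemming algorithm and load the common-token list from its own rules.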
The context based classifier 230 extracts context features from the web page. For
generating the context features of a web page, the information within a HEAD tag of a web page is read. The HEAD tag contains information that describes the document itself, or associates it with related resources, such as scripts and style sheets. The HEAD tag includes a TITLE tag, which represents the document's title or name. Further, the HEAD tag also includes a BASE tag, which defines base URLs for links or resources on the page, and target windows in which the linked content may be opened. The HEAD tag further includes a LINK tag, which refers to resources of one or more kinds, such as JavaScript framework files that support the web page applications or cascading style sheets that provide instructions about how to style the various elements on the web page. The HEAD tag further includes a META tag, which provides additional information about the page; for example, various character encodings the page uses, a summary of the page's content, instructions to search engines about whether or not to index content, and so on. Further, included in the HEAD tag is an OBJECT tag, which represents a generic, multipurpose container for a media object. Additionally, other tags such as a SCRIPT tag, which is used either to embed or refer to an external script, and a STYLE tag, which provides an area for defining embedded (page-specific) cascading style sheet (CSS) styles, may also be
defined within the HEAD tag. The context based classifier 230 may read one or more tags included within the HEAD tag to generate context features.
For generating the context features, the context based classifier 230 reads the
information contained within the HEAD tag of the web page, removes any non-alphanumeric characters, removes punctuation symbols, removes common words and stop words as retrieved from classification rules 214, removes HTML tags and stores the resultant words in lower case as context feature list. The words in the context feature list may be stemmed to their root words by applying any of the conventional stemming algorithms stored in the stemming data 216. The context based classifier 230 is configured to classify the web page into pre-defined categories retrieved from the category list 212 based on the context features. In one embodiment, the context based classifier 230 may use a compound of URL features and context features for classifying the web page into pre-defined categories.
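A minimal sketch of context feature generation is given below, using Python's standard html.parser module. It reads only the TITLE text and the description and keywords META tags inside HEAD, which is one possible choice among the tags listed above. The stop-word list, the class name and the sample page are assumptions for illustration, and stemming is omitted for brevity.

```python
import re
from html.parser import HTMLParser

# Assumed stop-word list; the text says such words come from the classification rules.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to", "with"}


class HeadTextCollector(HTMLParser):
    """Collects TITLE text and selected META content found inside HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content")
            if name in ("description", "keywords") and content:
                self.chunks.append(content)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)


def context_features(html):
    """Return lower-case alphanumeric words from HEAD with stop words removed."""
    collector = HeadTextCollector()
    collector.feed(html)
    words = re.findall(r"[a-z0-9]+", " ".join(collector.chunks).lower())
    return [w for w in words if w not in STOP_WORDS]


SAMPLE_PAGE = (
    "<html><head><title>Careers at TCS</title>"
    '<meta name="description" content="Jobs and recruitment for graduates">'
    '<meta charset="utf-8"></head>'
    "<body><p>Ignored body text</p></body></html>"
)
print(context_features(SAMPLE_PAGE))
```

Because the parser never emits tag markup as data, HTML tags are removed implicitly, and the regular expression discards punctuation and other non-alphanumeric characters.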
If the web page could not be classified into the pre-defined categories by the
context based classifier 230, the content based classifier 232 reads the content of the web page, which is usually enclosed by the BODY tag in HTML. The BODY tag wraps around all content of the web page, such as headings, paragraphs, images, tables, and so on. The content based classifier 232 reads the content of the web page and removes all HTML tags, non-alphanumeric characters and punctuation symbols. Additionally, all common words and stop words retrieved from the classification rules 214 are also removed, and the resultant list is converted to lower case and saved as a content feature list. The words in the content feature list may be stemmed to their root words by applying any of the conventional stemming algorithms stored in stemming data 216. The content based classifier 232 is configured to classify the web page into pre-defined categories retrieved from the category list 212 based on the content features. In one embodiment, the content based classifier 232 may use a compound of URL features, context features and content features to classify a web page.
If the web page is not classified based on URL features, context features and
content features in any of the pre-defined categories from the category list 212, the web page is stored in a miscellaneous category. When the number of web pages in the miscellaneous category exceeds a pre-defined threshold, say 100, the clustering module 224 applies any of the conventional clustering algorithms, like the CURE data clustering algorithm and the K-means clustering algorithm, retrieved from the classification rules 214 and clusters the web pages in the
miscellaneous category into one or more groups. If the number of web pages in any of the groups exceeds a user defined number, say 40, the category generation module 222 prompts for a manual input to create a new category in the category list 212 and requests a name for the new category. Alternatively, the category generation module 222 may be configured to automatically create the new category and assign the most frequently occurring meaningful word in the group as the category name.
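The automatic naming alternative may be sketched as below. The sketch assumes that clustering has already been carried out by some conventional algorithm and that each group is available as a list of per-page feature lists; the function name, the data layout and the treatment of "meaningful" as "not a stop word" are assumptions made for illustration.

```python
from collections import Counter


def suggest_categories(groups, group_threshold, stop_words=frozenset()):
    """Suggest a category name for every group larger than group_threshold.

    groups maps a group id to a list of feature lists, one list per web page.
    The suggested name is the most frequent non-stop word in the group.
    """
    suggestions = {}
    for group_id, pages in groups.items():
        if len(pages) <= group_threshold:
            continue  # group too small to justify a new category
        counts = Counter(
            word
            for features in pages
            for word in features
            if word not in stop_words
        )
        if counts:
            suggestions[group_id] = counts.most_common(1)[0][0]
    return suggestions


# Tiny illustrative groups; a threshold of 2 is used instead of 40 for brevity.
groups = {
    0: [["cricket", "score"], ["cricket", "match"], ["cricket", "team"]],
    1: [["recipe", "soup"]],
}
print(suggest_categories(groups, group_threshold=2))
```

In the manual alternative, the same size check would instead trigger a prompt to the system administrator for a category name.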
Thus, the web page classification system 102 optimizes resource and time
consumption in web page classification. Further, the web page classification system 102 suggests a new
category when the number of web pages that could not be classified into the pre-defined
categories exceeds a pre-defined threshold, thus dynamically expanding categories in which web
pages may be classified. The web page classification system 102 may also classify a web page into more
than one category based on at least one of URL features, context features and content features.
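The overall escalation from URL features to context features to content features, with the miscellaneous category as a fallback, may be sketched as a simple cascade. The three keyword rules below are hypothetical stand-ins for the classifiers 228, 230 and 232, and the page dictionary layout is an assumption for illustration.

```python
def classify_in_stages(page, stages, fallback="miscellaneous"):
    """Try each (name, classifier) stage in order and stop at the first hit.

    Each classifier takes the page and returns a list of categories, or an
    empty list when it cannot classify the page.
    """
    for name, classifier in stages:
        categories = classifier(page)
        if categories:
            return name, categories
    return None, [fallback]


# Hypothetical keyword rules standing in for the three classifiers.
def by_url(page):
    return ["careers"] if "career" in page["url"].lower() else []


def by_context(page):
    return ["sports"] if "tennis" in page["head"].lower() else []


def by_content(page):
    return ["law"] if "justice" in page["body"].lower() else []


STAGES = [("url", by_url), ("context", by_context), ("content", by_content)]

page = {"url": "http://example.com/a1", "head": "Welcome", "body": "A court of justice"}
print(classify_in_stages(page, STAGES))
```

Because the loop returns at the first stage that produces a category, the more expensive stages run only for pages that the cheaper stages could not classify.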
Fig. 3 illustrates an exemplary method 300 for web page classification based on
multilevel features, according to an embodiment of the present subject matter. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 300 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
The order in which the method 300 is described is not intended to be construed as
a limitation, and any number of the described method blocks can be combined in any order to implement the method 300, or an alternative method. Additionally, individual blocks may be deleted from the method 300 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method 300 can be implemented in any suitable hardware, software, firmware, or combination thereof. The method 300 is presently provided for web page classification based on multilevel features. Although the method 300 has been described in the context of web page classification based on multilevel features, the same should not be construed
as a limitation. It will be apparent that the method 300 may be implemented for classification of
other forms of structured documents with modifications as known by those skilled in the art.
At block 302, the web page classification system 102 receives a web page which has to be classified into the pre-defined categories retrieved from the category list 212. The web page may either be stored as data 106 or may be received from any of the client devices 108. The URL features of the web page are generated by the URL based classifier 228, and the web page classification system 102 classifies the web page into any of the pre-defined categories retrieved from the category list 212, as illustrated at block 304. The web page classification system 102 is configured to first classify the web page based on the URL, as the URL has the least volume of information associated with it and takes the minimum time and resources to process. Usually the structure of the URL of a web page is scheme://host/path elements/document name.extension. When processing the URL, the scheme, which is usually hypertext transfer protocol (HTTP) or hypertext transfer protocol secure (HTTPS), and the extension of the document are common to many web pages and so may not be useful for classification of web pages. The URL based classifier 228 also removes delimiters like "/", ".", etc. and stems the host, path elements and the document name to their root words so as to generate the URL based features. In one implementation, the web page classification system 102 may use lemmatization, which groups together the different inflected forms of a word so that they can be analyzed as a single item, and which takes the context of usage into account when determining the root word. Context plays an important role in determining the base word. For example, the word "meeting" may be used as a noun as in "in the cabinet meeting" or may be used as a verb as in "ministers are meeting tomorrow". Thus, the context in which the word is used may change the root word.
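The URL feature generation described above may be illustrated with a minimal sketch. The sketch assumes a toy suffix-stripping stemmer (naive_stem and its suffix list are hypothetical stand-ins for whatever stemming or lemmatization technique an implementation actually uses), and the function name url_features is likewise chosen only for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical suffix list for a toy stemmer; a real system would likely use
# an established stemming or lemmatization algorithm instead.
_SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def url_features(url: str) -> List[str]:
    """Generate URL features from the host, path elements and document name."""
    parsed = urlparse(url)  # the scheme (http/https) is simply not used
    host = parsed.hostname or ""
    # Drop the document extension, e.g. ".html", from the end of the path.
    path = re.sub(r"\.[A-Za-z0-9]+$", "", parsed.path)
    # Split on delimiters such as "/", "." and "-", then stem each token.
    tokens = re.split(r"[^A-Za-z0-9]+", host + "/" + path)
    return [naive_stem(t.lower()) for t in tokens if t]


print(url_features("https://www.example.com/sports/courts/booking.html"))
# ['www', 'example', 'com', 'sport', 'court', 'book']
```

The scheme and the extension are discarded because, as noted above, they occur in many web pages and carry little category information; the remaining tokens are the candidate URL features.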
The web page classification system 102 checks if the web page could be classified on the basis of URL features, as shown at block 306. In case the web page could be classified, the web page classification system 102 returns the category of the web page, as illustrated at block 308. If the web page is not classified based on the URL features, the context based classifier 230, at block 310, may classify the web page on the basis of context features. Context based features are more elaborate than URL features and hence are more likely to classify the web page, but require more resources and time for processing as compared to URL features.
The web page classification system 102 checks if the web page could be classified
on the basis of context features, as shown at block 312. In case the web page could be classified,
the web page classification system 102 returns the category of the web page, as illustrated at
block 308. If the web page is not classified, the content based classifier 232 may classify the web
page on the basis of content features at block 314. Classification of the web page based on
content features involves reading the whole content of the web page and thus involves the
maximum volume of data, which increases the probability of classifying the web page, but
requires the maximum time and resources for processing the contents of the web page.
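The difference between context features and content features may be sketched as follows. The sketch assumes, in line with the claims, that context features are drawn from tags such as TITLE and META and that content features are drawn from text enclosed within the BODY tag; the class name FeatureExtractor, the helper extract_features and the simple word tokenization are hypothetical choices made only for illustration.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple


def _words(text: str) -> List[str]:
    """Lower-case the text and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FeatureExtractor(HTMLParser):
    """Collect TITLE/META words as context features and BODY words as content features."""

    def __init__(self) -> None:
        super().__init__()
        self.context: List[str] = []
        self.content: List[str] = []
        self._open_tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # META is a void element; its text lives in the "content" attribute.
            value = dict(attrs).get("content")
            if value:
                self.context.extend(_words(value))
            return
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if "title" in self._open_tags:
            self.context.extend(_words(data))
        elif "body" in self._open_tags and not {"script", "style"} & set(self._open_tags):
            self.content.extend(_words(data))


def extract_features(html: str) -> Tuple[List[str], List[str]]:
    """Return (context features, content features) for one web page."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.context, parser.content
```

Because the context features come from a handful of tags while the content features require walking the whole body, the sketch also reflects why content based classification involves the largest volume of data.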
The web page classification system 102 checks if the web page could be classified on the basis of content features, as shown at block 316. In case the web page could be classified, the web page classification system 102 returns the category of the web page, as illustrated at block 308. If the web page is not classified, the web page classification system 102 classifies the web page as miscellaneous, as shown at block 318. The miscellaneous category holds the web pages that do not fall into any of the pre-defined categories retrieved from the category list 212.
Thus, the method 300 of web page classification based on multi level features optimizes the accuracy of web page classification as well as the resources consumed and the time taken for the web page classification.
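The cascade of method 300 may be sketched as below, under stated assumptions. The keyword sets in CATEGORIES are a hypothetical stand-in for the category list 212, and match_category is a deliberately simple keyword matcher rather than the classification rules of any particular embodiment; only the ordering of the levels and the fall-through to the miscellaneous category are taken from the description.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical categories and keywords; stands in for category list 212.
CATEGORIES: Dict[str, Set[str]] = {
    "law": {"justice", "judge", "verdict"},
    "sports": {"tennis", "league", "tournament"},
}


def match_category(features: List[str]) -> Optional[str]:
    """Return the category with the most keyword hits, or None on no hits or a tie."""
    scores = {
        name: sum(1 for f in features if f in words)
        for name, words in CATEGORIES.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


def classify_page(page: Dict[str, List[str]]) -> Tuple[str, Optional[str]]:
    """Try the cheapest feature level first and stop at the first level that classifies."""
    for level in ("url", "context", "content"):  # blocks 304, 310, 314
        category = match_category(page.get(level, []))
        if category is not None:
            return category, level  # block 308
    return "miscellaneous", None  # block 318


page = {
    "url": ["example", "court"],
    "context": ["court", "news"],
    "content": ["the", "judge", "read", "the", "verdict"],
}
print(classify_page(page))  # ('law', 'content')
```

In this example the ambiguous word "court" does not decide the category at the URL or context level, so the more expensive content level is consulted only when the cheaper levels fail.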
Fig. 4 illustrates an exemplary method 400 for dynamic expansion of categories
of web page classification, according to an embodiment of the present subject matter. The method 400 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 400 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
The order in which the method 400 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 400, or an alternative method. Additionally, individual blocks may be deleted from the method 400 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method 400 can be implemented in any suitable hardware, software, firmware, or combination thereof. The method 400 is presently provided for dynamic expansion of categories of web page classification. Although the method 400 has been described in the context of web page classification based on multilevel features, the same should not be construed as a limitation. It will be apparent that the method 400 may be implemented for classification of other forms of structured documents with modifications as known by those skilled in the art.
The method 400 for dynamic expansion of categories of web page classification may be activated based on a manual input, after a pre-defined time interval, or whenever a new web page is classified in the miscellaneous category. At block 402, the web page classification system 102 retrieves the web pages classified as miscellaneous. At block 404, the web page classification system 102 checks if the number of web pages classified as miscellaneous exceeds a pre-defined threshold. For example, in one implementation the pre-defined threshold may be 100 web pages. If the number of web pages in the miscellaneous category does not exceed the pre-defined threshold, the web page classification system 102 waits for either the pre-defined time interval or a manual input, as shown at block 416. In case the number of web pages exceeds the pre-defined threshold, the clustering module 224 implements any of the clustering algorithms retrieved from classification rules 214 and creates clusters of similar or related web pages, as illustrated at block 406.
At block 408, the category generation module 222 reads each web page in each cluster and, at block 410, checks if the number of web pages in any cluster is greater than a pre-defined number. For example, in one implementation, the pre-defined number of web pages in a cluster may be 40. If no cluster contains more than the pre-defined number of web pages, the web page classification system 102 waits for either the pre-defined time interval or a user input, as shown at block 416. In case the number of web pages in a cluster exceeds the pre-defined number, the category generation module 222 generates an alert for the system administrator to create a new category, as shown at block 412, and generates a category name for the new category based on the most frequently occurring meaningful word in the cluster, as illustrated at block 414. Alternatively, the category generation module 222 may be configured to automatically create a new category with the category name based on the most frequently occurring meaningful word without user intervention.
The web page classification system 102 then waits for a pre-defined interval or a user input, as shown at block 416. As the web pages that need to be classified grow in number and the content of the web pages becomes more diverse, the pre-defined categories originally defined by the system administrator may not always cover the subject matter of the content of the web pages. It is, therefore, necessary to expand the category list 212 to accommodate the diverse content of the new web pages that are created so as to provide effective classification. The method 400 implemented by the web page classification system 102 dynamically creates new categories so as to expand the category list 212 to cover the new subject matter of the newly added web pages, resulting in accurate classification of web pages.
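The checks of method 400 may be sketched as follows. The threshold of 100 web pages and the cluster size of 40 are the example values given above; everything else is an assumption made for illustration, in particular the toy clustering that groups web pages by their single most frequent meaningful word (the description permits any clustering algorithm) and the small STOP_WORDS list used to decide which words are meaningful.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical stop-word list used to decide which words are "meaningful".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def meaningful(words: List[str]) -> List[str]:
    return [w for w in words if w not in STOP_WORDS]


def cluster_pages(pages: List[List[str]]) -> List[List[List[str]]]:
    """Toy clustering: group pages by their single most frequent meaningful word."""
    clusters: Dict[str, List[List[str]]] = defaultdict(list)
    for words in pages:
        kept = meaningful(words)
        if kept:
            key = Counter(kept).most_common(1)[0][0]
            clusters[key].append(kept)
    return list(clusters.values())


def propose_categories(
    misc_pages: List[List[str]], threshold: int = 100, min_cluster_size: int = 40
) -> List[str]:
    """Return proposed category names, or an empty list if no expansion is warranted."""
    if len(misc_pages) <= threshold:  # block 404
        return []
    names = []
    for cluster in cluster_pages(misc_pages):  # block 406
        if len(cluster) > min_cluster_size:  # block 410
            counts = Counter(w for page in cluster for w in page)
            names.append(counts.most_common(1)[0][0])  # block 414
    return names
```

For instance, with the thresholds lowered to 4 and 2, three miscellaneous pages dominated by the word "recipe" and two dominated by "guitar" would yield a single proposed category named "recipe", since only that cluster exceeds the cluster size. In an embodiment with administrator review, each proposed name would accompany the alert of block 412 rather than creating the category directly.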
The web page classification system 102 thus classifies the web pages in a time efficient and accurate manner to create web page classification data for search engines to operate upon. The above description has been given in reference to a web page classification system 102 used for classifying web pages, but it would be appreciated by a person skilled in the art that the web page classification system 102 can be conveniently used for classifying structured documents other than web pages. Further, the web page classification system 102 may be configured to use a combination of URL features, context features and content features to classify a web page.
Although embodiments for the web page classification system 102 for classifying web pages have been described in language specific to structural features and/or methods, it is to be understood that the invention is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary embodiments for the web page classification system 102 and methods.
I/We claim:
1. A method for web page classification, the method comprising:
determining at least one category of a web page based on universal resource locator features; and
if the at least one category of the web page is not determined based on the universal resource locator features, then determining the at least one category of the web page based on content features.
2. A method for web page classification, the method comprising:
determining at least one category of a web page based on universal resource locator features; and
if the at least one category of the webpage is not determined based on the universal resource locator features, then determining the at least one category of the web page based on context features; and
if the at least one category of the webpage is not determined based on the context features, then determining the at least one category of the web page based on content features.
3. The method for web page classification as claimed in claim 1 or claim 2, the method
further comprising classifying the web page in a miscellaneous category if the at least one
category of the webpage is not determined based on the content features.
4. The method for web page classification as claimed in any of the preceding claims,
wherein the content feature comprises at least one of tags enclosed within a BODY tag of the web page.
5. The method for web page classification as claimed in claim 2, wherein the context feature comprises at least one of a TITLE tag, a BASE tag, a LINK tag, a META tag, an OBJECT tag, a SCRIPT tag and a STYLE tag.
6. The method for web page classification as claimed in claim 1 or claim 2, further comprising:
grouping of at least one web page categorized in a miscellaneous category into at least one cluster; and
generating an alert for creating an additional category based on the number of web pages in any of the at least one cluster exceeding a pre-defined number.
7. The method for web page classification as claimed in claim 6, the method further comprising generating a category name for the additional category based on a most frequently occurring meaningful word in the cluster and a manual input.
8. The method for web page classification as claimed in claim 7, wherein the grouping of the at least one web page categorized in the miscellaneous category is based on a number of web pages categorized in the miscellaneous category exceeding a pre-defined threshold.
9. The method for web page classification as claimed in any of the preceding claims, wherein the classifying the web page includes categorizing the web page in a hierarchical structure.
10. A web page classification system (102) for classifying a web page, the system comprising:
a processor (202); and
a memory (204) coupled to the processor (202), wherein the memory (204) comprises:
a classifier (110) to classify the web page based in part on universal resource locator features and context features and content features.
11. The web page classification system (102) as claimed in claim 10, wherein the classifier
(110) further comprises:
a universal resource locator based classifier (228) to classify the web page based on the universal resource locator features;
a context based classifier (230) to classify the web page based on the context features; and
a content based classifier (232) to classify the web page based on the content features.
12. The web page classification system (102) as claimed in claim 10, wherein the web page classification system (102) further comprises a clustering module (224) configured to create clusters of web pages categorized in a miscellaneous category.
13. The web page classification system (102) as claimed in claim 12, wherein the clustering of the web pages is based on at least one of a similarity and a relation between the web pages.
14. The web page classification system (102) as claimed in claim 10, wherein the web page classification system (102) further comprises a category generation module (222) configured to generate an alert for creation of an additional category out of a cluster when the number of web pages in the cluster exceeds a pre-defined number.
15. The web page classification system (102) as claimed in claim 14, wherein the category generation module (222) is configured to generate a category name for the additional category based on at least one of the most frequently occurring meaningful word in the cluster and a user input.
16. A computer-readable medium having embodied thereon a computer program for executing a method comprising:
determining at least one category of a web page based on universal resource locator features; and
if the at least one category of the webpage is not determined based on the universal resource locator features, then determining the at least one category of the web page based on context features; and
if the at least one category of the webpage is not determined based on the context features, then determining the at least one category of the web page based on content features.
17. The computer-readable medium as claimed in claim 16, wherein the method
further comprises:
grouping of at least one web page categorized in a miscellaneous category into at least one cluster; and
generating an alert for creating an additional category based on the number of web pages in any of the at least one cluster exceeding a pre-defined number.
| # | Name | Date |
|---|---|---|
| 1 | 263-MUM-2011-OTHERS [27-07-2018(online)].pdf | 2018-07-27 |
| 2 | 263-MUM-2011-FER_SER_REPLY [27-07-2018(online)].pdf | 2018-07-27 |
| 3 | 263-MUM-2011-COMPLETE SPECIFICATION [27-07-2018(online)].pdf | 2018-07-27 |
| 4 | 263-MUM-2011-CLAIMS [27-07-2018(online)].pdf | 2018-07-27 |
| 5 | abstract1.jpg | 2018-08-10 |
| 6 | 263-MUM-2011-POWER OF ATTORNEY(23-9-2011).pdf | 2018-08-10 |
| 7 | 263-mum-2011-form 5.pdf | 2018-08-10 |
| 8 | 263-mum-2011-form 3.pdf | 2018-08-10 |
| 9 | 263-mum-2011-form 2.pdf | 2018-08-10 |
| 10 | 263-mum-2011-form 2(title page).pdf | 2018-08-10 |
| 11 | 263-MUM-2011-FORM 18(19-8-2011).pdf | 2018-08-10 |
| 12 | 263-mum-2011-form 1.pdf | 2018-08-10 |
| 13 | 263-MUM-2011-FORM 1(7-3-2011).pdf | 2018-08-10 |
| 14 | 263-MUM-2011-FER.pdf | 2018-08-10 |
| 15 | 263-mum-2011-drawing.pdf | 2018-08-10 |
| 16 | 263-mum-2011-description(complete).pdf | 2018-08-10 |
| 17 | 263-mum-2011-correspondence.pdf | 2018-08-10 |
| 18 | 263-MUM-2011-CORRESPONDENCE(7-3-2011).pdf | 2018-08-10 |
| 19 | 263-MUM-2011-CORRESPONDENCE(23-9-2011).pdf | 2018-08-10 |
| 20 | 263-MUM-2011-CORRESPONDENCE(19-8-2011).pdf | 2018-08-10 |
| 21 | 263-mum-2011-claims.pdf | 2018-08-10 |
| 22 | 263-mum-2011-abstract.pdf | 2018-08-10 |
| 23 | 263-MUM-2011-Correspondence to notify the Controller [24-02-2021(online)].pdf | 2021-02-24 |
| 24 | 263-MUM-2011-Written submissions and relevant documents [12-03-2021(online)].pdf | 2021-03-12 |
| 25 | 263-MUM-2011-US(14)-HearingNotice-(HearingDate-01-03-2021).pdf | 2021-10-03 |
| 1 | 263_mum_2011_27-11-2017.pdf |