System And Method For Creating Labels For Clusters

Abstract: Disclosed are a method and a system for creating labels for clusters in a computing environment. The system comprises a receiving module, a candidate items selector, a combination array generator, a coverage value analyzer, a candidate pair selector, a unique word filter and a cluster label selector. The receiving module receives input data, and the candidate items selector selects candidate items occurring repetitively in the input data using an n-gram technique to generate a list of candidate items with their frequency of occurrence. The combination array generator selects candidate items to populate a two dimensional array, wherein each array element represents a pair of n-grams. The coverage value analyzer determines a coverage value for each pair of n-grams in the array. The candidate pair selector selects pairs of n-grams from the two dimensional array to process and generate a list of candidate pairs. The unique word filter determines the number of unique words in each candidate pair. The cluster label selector sorts the list of candidate pairs using the coverage value and the number of unique words to select a cluster label.

Patent Information

Application #

Filing Date

01 July 2013

Publication Number

24/2015

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

Parent Application

Patent Number

Legal Status

Grant Date

2023-09-07

Renewal Date

Applicants

TATA CONSULTANCY SERVICES LIMITED

NIRMAL BUILDING, 9TH FLOOR, NARIMAN POINT, MUMBAI 400021, MAHARASHTRA, INDIA

Inventors

1. DESHPANDE, SHAILESH SHANKAR

TATA CONSULTANCY SERVICES LIMITED, PROCESS ENG. INNOVATION LAB, TRDDC, 54-B, HADAPSAR INDUSTRIAL ESTATE, PUNE 411013, MAHARASHTRA, INDIA

2. PALSHIKAR, GIRISH KESHAV

TATA CONSULTANCY SERVICES LIMITED, PROCESS ENG. INNOVATION LAB, TRDDC, 54-B, HADAPSAR INDUSTRIAL ESTATE, PUNE 411013, MAHARASHTRA, INDIA

3. G, ATHIAPPAN

TATA CONSULTANCY SERVICES LIMITED, PROCESS ENG. INNOVATION LAB, TRDDC, 54-B, HADAPSAR INDUSTRIAL ESTATE, PUNE 411013, MAHARASHTRA, INDIA

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention: SYSTEM AND METHOD FOR CREATING LABELS FOR CLUSTERS
Applicant
Tata Consultancy Services Limited. A Company Incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The present subject matter described herein, in general, relates to text mining
and text clustering, more particularly to creating one or more labels for one or more clusters.
BACKGROUND
[002] In the current business scenario, organizing and analyzing huge volumes of electronic records is a challenging task. In order to achieve the business objectives of an organization, categorizing the electronic records into different groups based on record similarity is a commonly deployed step. When the user does not know the number of groups to be formed or the nature of the groups, an unsupervised approach such as clustering is usually applied. In clustering, the system forms groups by automatically comparing each document with the other documents and by using a threshold for forming a group. A few documents from the collection are selected as the cluster centers around which the groups are formed. Clustering textual answers to a survey questionnaire is one of the significant mechanisms for generating meaningful insights from textual responses.
[003] Most clustering techniques do not provide descriptive labels for the clusters. In order to identify a good descriptive label for a set of documents, a user has to go through the set of documents manually, read and understand them, and only then create a descriptive label.
[004] Automatic cluster labeling disclosed in the prior art faces many challenges. A single word or a set of words used as a label is not a sufficient descriptor and fails to provide a descriptive label. A complete sentence as a label is too lengthy for many situations. A complete sentence, or the words and/or phrases of a centroid vector, is also not very useful, as it is too lengthy and might not provide good coverage. The most frequent single word and/or phrase also fails to provide good coverage. Complex semantic analysis does not help, as it is more time consuming than the clustering itself.
[005] Many solutions for cluster labeling are provided in the prior art. One of them discloses extracting verb phrases and noun phrases from a given cluster using a natural language parser. Further, the method calculates the KL divergence for each extracted keyword or combination of keywords. The most discriminative keywords for a given cluster are selected as the cluster labels. However, these labels are not good enough as cluster labels, and the method is computationally intensive. In addition, because of inherent limitations in the clustering process, a cluster might not contain a single theme or phrase that can cover all the records in the cluster. Further, prior art techniques that label a cluster using the single most frequent phrase or keyword do not exemplify all the records in a given cluster. Thus, prior art techniques fail to provide an automatic way of producing a descriptive label that reflects most of the content in the given cluster.
SUMMARY
[006] This summary is provided to introduce aspects related to systems and
methods for creating one or more labels for one or more cluster and the aspects are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[007] In one implementation, a system for creating one or more labels for one or more clusters in a computing environment is disclosed. The system comprises a processor and a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprises a receiving module, a candidate items selector, a combination array generator, a coverage value analyzer, a candidate pair selector, a unique word filter and a cluster label selector. The receiving module is configured to receive input data. The candidate items selector is configured to select a plurality of candidate items occurring repetitively in the input data using an n-gram selection technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items in the input data. The combination array generator is configured to select a predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array, wherein each element of the two dimensional array represents a pair of n-grams. The coverage value analyzer is configured to determine a coverage value for each pair of n-grams present in the two dimensional array to further populate a sorted two dimensional array. The candidate pair selector is configured to select a predefined number of pairs of n-grams from the sorted two dimensional array to further process and generate a list of candidate pairs. The unique word filter is configured to accept the list of candidate pairs to determine a number of unique words in each of the candidate pairs. The cluster label selector is configured to sort the list of candidate pairs using the coverage value and the number of unique words to create a sorted list of candidate pairs for selecting a cluster label from the sorted list of candidate pairs.
[008] The present invention also discloses a method for creating one or more labels for one or more clusters in a computing environment. The method comprises receiving input data and selecting a plurality of candidate items occurring repetitively in the input data using an n-gram technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items. The method further comprises selecting the foremost predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array, wherein each element of the two dimensional array represents a pair of n-grams. The method further comprises determining a coverage value for each pair of n-grams from the two dimensional array to further populate a sorted two dimensional array, and selecting a predefined number of the foremost pairs of n-grams from the sorted two dimensional array to further process and generate a list of candidate pairs. The method further comprises accepting the list of candidate pairs to determine a number of unique words in each of the candidate pairs, and sorting the list of candidate pairs using the coverage value and the number of unique words to create a sorted list of candidate pairs for selecting a cluster label from the sorted list of candidate pairs, wherein the receiving, the selecting of the plurality of candidate items, the selecting of the foremost predefined number of candidate items, the determining, the selecting of the predefined number of pairs, the accepting and the sorting are performed by the processor.
[009] The present invention also discloses a computer program product having embodied thereon a computer program for creating one or more labels for one or more clusters. The computer program product comprises a program code for receiving input data and a program code for selecting a plurality of candidate items occurring repetitively in the input data using an n-gram technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items. The computer program product further comprises a program code for selecting the foremost predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array, wherein each element of the two dimensional array represents a pair of n-grams. The computer program product further comprises a program code for determining a coverage value for each pair of n-grams from the two dimensional array and for sorting the two dimensional array in descending order of the coverage value of each n-gram pair to populate a sorted two dimensional array, and a program code for selecting a predefined number of the foremost pairs of n-grams from the sorted two dimensional array to further process and generate a list of candidate pairs. The computer program product further comprises a program code for accepting the list of candidate pairs to determine a number of unique words in each of the candidate pairs, and a program code for sorting the list of candidate pairs using the coverage value and the number of unique words to create a sorted list of candidate pairs for selecting a cluster label from the sorted list of candidate pairs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
[0011] Figure 1 illustrates a network implementation of a system for creating one or
more labels for one or more clusters in a computing environment, in accordance with an embodiment of the present subject matter.
[0012] Figure 2 illustrates the system for creating one or more labels for one or more
cluster, in accordance with an embodiment of the present subject matter.
[0013] Figure 3 illustrates a method for creating one or more labels for one or more
cluster, in accordance with an embodiment of the present subject matter.

DETAILED DESCRIPTION
[0014] A system and method for creating labels for clusters are described. The system generates one or more descriptive labels that cover the important themes discussed in a given set of documents of similar nature, which is called a cluster. The label generated by the system for a cluster of documents may be formed using a single word, a single phrase and/or a combination of them. The system and method may use an n-gram technique to select the candidate items occurring repetitively in the input set of documents. Further, the candidate items are selected based on their frequency of occurrence. A two dimensional array is generated by using the selected candidate items. Each element of the two dimensional array represents a pair of n-grams. The coverage value for each pair of n-grams in the two dimensional array is used to select the candidate pairs from the two dimensional array. Further, the unique words occurring in each candidate pair are determined. Further, cluster labels are selected based on the coverage value and the number of unique words in each of the candidate pairs.
[0015] The system and method identify a predefined number of labels, for example three, and the user then selects one of the labels as the appropriate descriptor of the set of documents. The system and method disclosed herein may also find application in labeling a collection of documents that is to be clustered, in order to give cluster centers.
[0016] While aspects of described system and method for creating one or more
labels for one or more cluster, may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system.
[0017] Referring now to Figure 1, a network implementation 100 of system 102 for
creating one or more labels for one or more cluster is illustrated, in accordance with an embodiment of the present subject matter.
[0018] Although the present subject matter is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the system 102 may be accessed by multiple users through one or more user devices 104-1, 104-2... 104-N, collectively referred to as user devices 104 hereinafter, or through applications residing on the user devices 104. Examples of the user devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, and a workstation. The user devices 104 are communicatively coupled to the system 102 through a network 106.
[0019] In one implementation, the network 106 may be a wireless network, a wired
network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0020] Referring now to Figure 2, the system 102 is illustrated in accordance with an
embodiment of the present subject matter. In one embodiment, the system 102 may include at least one processor 202, an input/output (I/O) interface 204, and a memory 206. The at least one processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 206.
[0021] The I/O interface 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 204 may allow the system 102 to interact with a user directly or through the client devices 104. Further, the I/O interface 204 may enable the system 102 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 204 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or more ports for connecting a number of devices to one another or to another server.
[0022] The memory 206 may include any computer-readable medium known in the
art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 may include modules 208 and data 210.
[0023] The modules 208 include routines, programs, objects, components, data
structures, etc., which perform particular tasks or implement particular abstract data types. In one implementation, the modules 208 may include a receiving module 212, a candidate items selector 214, a combination array generator 216, a coverage value analyzer 218, a candidate pair selector 220, a unique word filter 222, a cluster label selector 224 and other modules 226. The other modules 226 may include programs or coded instructions that supplement applications and functions of the system 102.
[0024] The data 210, amongst other things, serves as a repository for storing data
processed, received, and generated by one or more of the modules 208. The data 210 may also include a system database 228, and other data 230. The other data 230 may include data generated as a result of the execution of one or more modules in the other module 226.
[0025] In one implementation, at first, a user may use the client device 104 to access the system 102 via the I/O interface 204. The user may register using the I/O interface 204 in order to use the system 102. The working of the system 102 is explained in detail below with reference to Figures 2 and 3. The system 102 may be used for creating one or more labels for one or more clusters.
[0026] In accordance with an embodiment of the present subject matter, referring to Figure 2, a detailed working of the system 102 is explained. The system 102 comprises the receiving module 212 configured to receive input data. The input data comprises a set of text documents or a set of text records associated with one or more clusters. The set of text documents may comprise survey responses, responses on blogs or user forums, or any other collection of text data required by a person skilled in the art.
[0027] In one embodiment, the collection of text documents or text records may be called a cluster. By way of an example, text responses to a survey question are clustered into five clusters indicating the major concerns respondents have. The responses are stored in an electronic format. Each cluster holds the records belonging to that group or the indexes of those documents. Further, the objective is to create a label for each cluster.
[0028] Table 1 shows, as an example, a sample of the cluster content, namely the content of the Environment-Culture cluster captured as survey responses.

1 North Sydney is a well recognized business location
2 Client
3 Technology
4 Friendly environment
5 The work environment
6 Business area Near in the MRT
7 Comfortable environment
8 Encouraging environment, the opportunity to communicate with others on some problems we are studying, etc.
9 Working Environment
10 Friendly environment.
11 Friendly work environment, approachability of people.
12 Good environment, good team
13 Technology
14 Environment provided is bright and clean.
15 I am at a client side, hence most of the project Management and work environment policies are set by the client.

Table 1
[0029] The system 102 further comprises a candidate items selector 214 configured to select a plurality of candidate items occurring repetitively in the input data. The selection of candidate items may be performed by using an n-gram selection technique for a predefined value of n. After the candidate items are selected, the frequency of occurrence of each candidate item in the input data is calculated, and the list of candidate items is sorted by this frequency to generate a sorted list of candidate items. The candidate items comprise words, phrases or a combination thereof. In one embodiment, the predefined value of n used in the n-gram technique may range from 1 to 5. The list of candidate items is sorted in descending order of the frequency of occurrence of the candidate items.
[0030] In accordance with an exemplary embodiment, the documents or the records in the first cluster are accessed to create a candidate items list of the most frequent words or phrases. The system uses an n-gram technique for selecting candidate items. The system can take any value of n as configured by the user and perform candidate items selection. In one embodiment, the system uses values of n from 1 to 5. It is observed empirically that going beyond 5-grams provides only marginal improvement in labeling survey responses. Further, the frequency of occurrence of each n-gram within all the records in a given cluster is calculated, and a list of n-grams along with their frequencies of occurrence is created. Further, the list for each value of n is sorted in descending order of frequency.
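The candidate item selection described above can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function names `tokenize` and `candidate_items`, the lowercase alphabetic tokenization, the absence of stop-word handling and the alphabetical tie-break are all assumptions, and the sample records are a few entries borrowed from Table 1.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep alphabetic runs only;
    # no stop-word removal or stemming is applied.
    return re.findall(r"[a-z]+", text.lower())


def candidate_items(records, max_n=5):
    """Return {n: [(ngram, frequency), ...]}, each list sorted by descending frequency."""
    result = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for record in records:
            words = tokenize(record)
            # Slide a window of n words over the record.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
        # Descending frequency; ties broken alphabetically (an assumption).
        result[n] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return result


records = [
    "Friendly environment",
    "The work environment",
    "Comfortable environment",
    "Friendly environment.",
    "Friendly work environment, approachability of people.",
]
items = candidate_items(records)
print(items[1][:3])  # three most frequent unigrams
print(items[2][:2])  # two most frequent bigrams
```

One sorted list is kept per value of n, which matches the later step that takes the top items "for each n".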
[0031] According to an exemplary embodiment, Table 2 shows the sample list of n-grams / candidate items with their frequency of occurrence in the Environment-Culture cluster.

Candidate Item                              Frequency Of Occurrence
Environment                                 67
Friendly                                    9
Clean                                       8
Learn                                       6
Client                                      5
friendly environment                        5
office environment                          2
comfortable environment                     2
environment etc                             2
nice environment                            2
easy access to                              1
neat and clean                              2
comfortable work environment                2
office building and                         1
office environment etc                      1
own decisions friendly and                  1
project environment is totally              1
opportunity learn my client                 1
other services relatively clean             1
organized office environment etc            1
provided is bright and clean                1
project management and work environment     1
project environment is totally different    1
quality food in canteen good                1
public transportation easy access to        1
Table 2
[0032] The system 102 further comprises the combination array generator 216 configured to select the foremost predefined number of the candidate items from the sorted list of candidate items and to populate a two dimensional array. Each element of the two dimensional array represents a pair of n-grams. In accordance with an exemplary embodiment, the candidate items list created by the candidate items selector is accessed by the combination array generator, which selects the top 5 n-grams for each n as candidates for further processing. After completion of the candidate n-gram selection, the system has 25 n-grams along with their frequencies. The combination array generator generates a two dimensional array, which can be a matrix of 25 x 25 cells, wherein each cell represents a value for a pair of n-grams.
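A minimal sketch of this step, assuming the candidate items are already available as one frequency-sorted list per n. The function name `build_combination_array`, the tuple layout and the synthetic `toy_items` data are hypothetical and only serve to show how 5 items for each of 5 values of n yield a 25 x 25 array.

```python
def build_combination_array(sorted_items, top_k=5):
    """sorted_items: {n: [(ngram, frequency), ...]}, each list sorted by descending frequency."""
    selected = []
    for n in sorted(sorted_items):
        # Keep the foremost top_k candidates for this n-gram length.
        for ngram, freq in sorted_items[n][:top_k]:
            selected.append((n, ngram, freq))
    size = len(selected)
    # Cell (i, j) stands for the pair (selected[i], selected[j]);
    # its value (the coverage) is filled in by a later step.
    array = [[None] * size for _ in range(size)]
    return selected, array


# Synthetic placeholder data: 7 items for each n from 1 to 5.
toy_items = {
    n: [(f"item{n}_{k}", 10 - k) for k in range(7)]
    for n in range(1, 6)
}
selected, array = build_combination_array(toy_items)
print(len(selected), len(array), len(array[0]))
```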
[0033] The system 102 further comprises the coverage value analyzer 218 configured to determine a coverage value for each pair of n-grams present in the two dimensional array. The coverage value analyzer is further configured to populate a sorted two dimensional array. The coverage value for each pair of n-grams is determined so as to ensure maximum coverage with minimum overlap. The two dimensional array is sorted in descending order of the coverage value of each pair of n-grams. In accordance with an exemplary embodiment, the coverage value analyzer calculates the coverage value for each cell, that is, for each pair of n-grams in the matrix. The coverage value for a pair of n-gram A and n-gram B is given as Coverage value = P(A ∪ B) = P(A) + P(B) - P(A ∩ B). The coverage value in the cell indicates the maximum coverage with minimum overlap between the two n-grams of the pair. Based on the coverage value for each n-gram pair, the two dimensional array (matrix) is sorted in descending order (largest value first).
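The coverage computation for a single pair, P(A) + P(B) - P(A ∩ B), can be sketched as below. The specification does not define P explicitly; this sketch assumes P(X) is the fraction of records in the cluster that contain the n-gram X as a contiguous word sequence. The helper names `contains` and `coverage` are illustrative, and the sample records are taken from Table 1.

```python
import re


def contains(record, ngram):
    # True if the n-gram occurs as a contiguous word sequence in the record.
    words = re.findall(r"[a-z]+", record.lower())
    target = ngram.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def coverage(records, a, b):
    """Fraction of records containing a or b: P(a) + P(b) - P(a and b)."""
    total = len(records)
    if total == 0:
        return 0.0
    has_a = [contains(r, a) for r in records]
    has_b = [contains(r, b) for r in records]
    count_a = sum(has_a)
    count_b = sum(has_b)
    count_both = sum(1 for x, y in zip(has_a, has_b) if x and y)
    # Inclusion-exclusion on counts, then a single division.
    return (count_a + count_b - count_both) / total


records = [
    "Client",
    "Technology",
    "Friendly environment",
    "The work environment",
    "Environment provided is bright and clean.",
]
print(coverage(records, "environment", "clean"))
print(coverage(records, "environment", "client"))
```

In this small sample "clean" only appears in a record that already mentions "environment", so it adds no coverage, whereas "client" covers a record that "environment" misses. This is the "maximum coverage with minimum overlap" behaviour the formula rewards.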
[0034] According to an exemplary embodiment, Table 3 shows a sample of the two
dimensional array with the n-gram pairs and the coverage value for the n-gram pair. By way of an example, Table 3 shows the unigram pairs with respective coverage value for the Environment-Culture cluster.

              environment    friendly    clean       learn          client
environment   0.761363636    0.761364    0.829545    0.818181818    0.784090909
friendly      0.761363636    0.102273    0.193182    0.170454545    0.159090909
clean         0.829545455    0.193182    0.090909    0.159090909    0.147727273
learn         0.818181818    0.170455    0.159091    0.068181818    0.113636364
client        0.784090909    0.159091    0.147727    0.113636364    0.056818182
Table 3
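As a consistency check on the sample array, the overlap between two items can be recovered by rearranging the coverage formula. Reading the diagonal cells as the single-item values P(A) is an interpretation (for A = B the formula reduces to P(A)), not something the specification states. For the pair A = "environment", B = "clean", with the cell values rounded to six decimals:

```latex
\begin{aligned}
P(A \cap B) &= P(A) + P(B) - P(A \cup B) \\
            &= 0.761364 + 0.090909 - 0.829545 \\
            &\approx 0.0227
\end{aligned}
```

Under this reading, roughly 2% of the records mention both items, and adding "clean" to "environment" raises the coverage by about 0.068 over "environment" alone (0.829545 versus 0.761364).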
[0034] According to an exemplary embodiment, Table 4 shows a sample of the
sorted two dimensional array including n-gram pairs with respective coverage value in the cluster. By way of an example, Table 4 shows the sample of the sorted two dimensional array content having n-gram pairs with respective coverage value for the Environment-Culture cluster. The content of two dimensional array is sorted based on the coverage value of the n-gram pair.

Gram1  Gram2  Label                                           P(A∪B)
1      1      environment, clean                              0.829545
1      1      environment, learn                              0.818182
1      1      environment, client                             0.784091
1      3      environment, neat and clean                     0.784091
1      4      environment, opportunity learn my client        0.772727
1      2      environment, comfortable environment            0.761364
1      2      environment, office environment                 0.761364
1      4      environment, own decisions friendly and         0.761364
1      4      environment, other services relatively clean    0.761364
1      3      environment, office environment etc             0.761364
1      2      environment, friendly environment               0.761364
1      1      environment, friendly                           0.761364
1      3      environment, office building and                0.761364
1      2      environment, nice environment                   0.761364
1      3      environment, easy access to                     0.761364
1      3      environment, comfortable work environment       0.761364
Table 4
[0035] The system 102 further comprises the candidate pair selector 220 configured to select a predefined number of the foremost pairs of n-grams from the sorted two dimensional array to further process and generate a list of candidate pairs. The candidate pair selector selects at least the top 2 n-gram pairs from the sorted two dimensional array. In accordance with an exemplary embodiment, once the matrix is filled with the coverage value for each n-gram pair, the candidate pair selector selects the top two pairs from the matrix. This step is executed to further reduce the number of probable pairs for the labels. The system can select any number of top values.
[0036] In the present embodiment, the reported invention selects the top 2 values from each of the n-gram pair combinations (1,5), (1,4), (1,3), (2,5), (2,4), (2,3). With the execution of this step, the candidate pair selector creates 12 pairs of n-grams as candidate pairs for labels. The system stores this list of n-gram pairs and their coverage values in an electronic format for further processing. According to an exemplary embodiment, pairs consisting only of unigrams, only of bigrams, or of a combination of a unigram and a bigram, for example (1,1), (1,2), (2,1), (2,2), are not selected as candidate pairs, as they are not found to be suitable labels.
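The selection of the top 2 pairs per allowed gram-length combination can be sketched as follows. The tuple layout `(gram1, gram2, item1, item2, coverage)`, the name `select_candidate_pairs` and the assumption that each pair is stored with the shorter n-gram first are illustrative choices; the sample rows are taken from Table 4.

```python
# Gram-length combinations named in the specification; (1,1), (1,2),
# (2,1) and (2,2) are deliberately absent.
ALLOWED = [(1, 5), (1, 4), (1, 3), (2, 5), (2, 4), (2, 3)]


def select_candidate_pairs(scored_pairs, top_per_combo=2):
    """scored_pairs: list of (gram1, gram2, item1, item2, coverage)."""
    selected = []
    for combo in ALLOWED:
        group = [p for p in scored_pairs if (p[0], p[1]) == combo]
        # Highest coverage first; Python's sort is stable, so ties keep input order.
        group.sort(key=lambda p: p[4], reverse=True)
        selected.extend(group[:top_per_combo])
    return selected


scored_pairs = [
    (1, 1, "environment", "clean", 0.829545),
    (1, 3, "environment", "office building and", 0.761364),
    (1, 3, "environment", "neat and clean", 0.784091),
    (1, 3, "environment", "office environment etc", 0.761364),
    (1, 4, "environment", "opportunity learn my client", 0.772727),
    (1, 2, "environment", "comfortable environment", 0.761364),
]
for pair in select_candidate_pairs(scored_pairs):
    print(pair)
```

With six combinations and two pairs each, a full 25 x 25 array yields the 12 candidate pairs mentioned in the text; the reduced sample here yields fewer because some combinations have no rows.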

[0037] The system 102 comprises the unique word filter 222 configured to accept the list of candidate pairs to determine the number of unique words in each of the candidate pairs. According to the exemplary embodiment, the unique word filter accepts the list created by the candidate pair selector and calculates the number of unique words in each n-gram pair. A unique word herein means a word that is unique within the collection of documents in a given cluster: a word is unique if it appears in just one or two documents in the cluster. The unique word filter then updates the list of candidate pairs with the number of unique words in each n-gram pair.
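The unique word count can be sketched as below, using the stated rule that a word is unique if it appears in just one or two documents of the cluster. The names `document_frequency` and `count_unique_words`, the tokenization, and the small set of records are assumptions made for illustration only.

```python
import re
from collections import Counter


def document_frequency(records):
    # Number of records in which each word appears at least once.
    df = Counter()
    for record in records:
        df.update(set(re.findall(r"[a-z]+", record.lower())))
    return df


def count_unique_words(pair, df, max_docs=2):
    """Count distinct words of the pair that occur in at most max_docs records."""
    words = set()
    for item in pair:
        words.update(item.lower().split())
    return sum(1 for w in words if 1 <= df[w] <= max_docs)


# Illustrative records, not the actual survey data.
records = [
    "Friendly environment",
    "Friendly work environment",
    "Friendly and clean environment",
    "Neat and clean",
    "Clean environment provided is bright and clean",
]
df = document_frequency(records)
print(count_unique_words(("environment", "neat and clean"), df))
print(count_unique_words(("friendly environment", "provided is bright and clean"), df))
```

A pair built mostly from rare words scores high here, which the later sorting step treats as a reason to rank it lower.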
[0038] The system 102 further comprises the cluster label selector 224 configured to sort the list of candidate pairs using the coverage value of the n-gram pair and the number of unique words in the n-gram pair to create a sorted list of candidate pairs for selecting a cluster label from the sorted list. The cluster label selector sorts the list of candidate pairs by the coverage value in descending order and the number of unique words in ascending order, or vice versa, to create the sorted list of candidate pairs. The cluster label selector selects at least 3 candidate pairs from the sorted list of candidate pairs to further select the cluster labels. According to an exemplary embodiment, the cluster label selector sorts the candidate pair list first by the coverage value in descending order and then by the number of unique words in ascending order, and then stores it in electronic form. In another embodiment, the cluster label selector may sort the candidate pair list first by the number of unique words in ascending order and then by the coverage value in descending order, and then store it in electronic form. The cluster label selector accesses the sorted list and selects the top candidate pairs as candidate labels for the given cluster. Further, the system displays, for example, the top 3 values from the candidate pair list on the user interface as the candidate labels, and the user may select one of them as the cluster label. In yet another embodiment, all the n-gram pairs from the candidate pair list are shown to the user with their coverage values for selection of the cluster label.
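Both sort orders described above can be sketched with a two-part sort key. The function name `select_labels`, the `coverage_first` flag and the tuple layout `(label, coverage, unique_words)` are hypothetical; the sample rows are taken from Table 5 and deliberately supplied out of order.

```python
def select_labels(candidates, top_k=3, coverage_first=True):
    """candidates: list of (label, coverage, unique_words)."""
    if coverage_first:
        # Coverage descending, then unique-word count ascending.
        key = lambda c: (-c[1], c[2])
    else:
        # Unique-word count ascending, then coverage descending.
        key = lambda c: (c[2], -c[1])
    return [c[0] for c in sorted(candidates, key=key)[:top_k]]


candidates = [
    ("environment, own decisions friendly and", 0.761364, 2),
    ("friendly environment, neat and clean", 0.079545, 0),
    ("environment, office environment etc", 0.761364, 0),
    ("environment, provided is bright and clean", 0.761364, 2),
    ("environment, opportunity learn my client", 0.772727, 0),
    ("environment, neat and clean", 0.784091, 0),
]
print(select_labels(candidates))
print(select_labels(candidates, coverage_first=False))
```

For these rows both orderings return the same top three labels, which are the ones listed in Table 6. The row order printed in Table 5 is consistent with the unique-word count acting as the primary key.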
[0039] In accordance with one exemplary embodiment, Table 5 shows the sorted list of candidate pairs created by the cluster label selector 224. The first and second columns show the n-gram length of each item in the pair. For example, the first label in the table is created from a pair consisting of the single word "environment" and the trigram "neat and clean", and the corresponding coverage value and number of unique words in the n-gram pair are shown in columns four and five. As shown in Table 5, the list of candidate pairs is sorted by using the coverage value in descending order and the number of unique words in ascending order. Further, the user can choose the most appropriate label for the cluster. The labels or candidate pairs occurring foremost in the list below are the more appropriate labels.

Gram1  Gram2  Label / Candidate Pair                                    Coverage Value P(A∪B)  Unique words
1      3      environment, neat and clean                               0.784091               0
1      4      environment, opportunity learn my client                  0.772727               0
1      3      environment, office environment etc                       0.761364               0
2      3      friendly environment, neat and clean                      0.079545               0
2      3      friendly environment, comfortable work environment        0.079545               0
2      4      friendly environment, organized office environment etc    0.068182               0
2      4      friendly environment, opportunity to learn my client      0.068182               0
1      4      environment, own decisions friendly and                   0.761364               2
1      5      environment, public transportation easy access to         0.761364               2
1      5      environment, provided is bright and clean                 0.761364               2
2      5      friendly environment, quality food in canteen good        0.068182               2
2      5      friendly environment, provided is bright and clean        0.068182               2
Table 5
[0040] In yet another embodiment, as shown in Table 6, the cluster label selector 224 selects the top 3 labels as labels for the given cluster. Further, the system 102 may display the top 3 labels to the user, and the user may select one of them as the cluster label. For example, the labels below are selected as final labels:

environment, neat and clean
environment, opportunity learn my client
environment, office environment etc
Table 6
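As an illustration only, the two sort orders described for the cluster label selector can be sketched in Python on a subset of the rows of Table 5. The function names (sort_coverage_first, sort_unique_first, top_labels) and the tuple layout are assumptions made for this sketch and are not part of the specification.

```python
# Rows copied from Table 5: (label, coverage value, number of unique words).
ROWS = [
    ("environment, neat and clean", 0.784091, 0),
    ("environment, opportunity learn my client", 0.772727, 0),
    ("environment, office environment etc", 0.761364, 0),
    ("friendly environment, neat and clean", 0.079545, 0),
    ("environment, own decisions friendly and", 0.761364, 2),
    ("friendly environment, quality food in canteen good", 0.068182, 2),
]


def sort_coverage_first(pairs):
    """Coverage value descending, then number of unique words ascending."""
    return sorted(pairs, key=lambda p: (-p[1], p[2]))


def sort_unique_first(pairs):
    """Number of unique words ascending, then coverage value descending."""
    return sorted(pairs, key=lambda p: (p[2], -p[1]))


def top_labels(sorted_pairs, k=3):
    """Return the first k labels of a sorted list as candidate cluster labels."""
    return [label for label, _, _ in sorted_pairs[:k]]


for label in top_labels(sort_coverage_first(ROWS)):
    print(label)
```

For the rows used in this sketch, both orderings place the same three labels foremost, namely the labels listed in Table 6.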
[0041] Further, referring to Table 4, it is observed that although the uni-gram pairs occurring foremost have the highest coverage values, they are comparatively less readable and so are not suitable labels for the cluster. Rather, n-gram pairs comprising bi-grams, tri-grams and onwards, selected as the top pairs from the two dimensional array, are found to be suitable labels.
[0042] In accordance with another embodiment, the cluster label selector 224 is further configured to find cluster centers. For a given set of documents to be clustered as input data, the foremost candidate labels from the candidate pair list may be selected as cluster centers for further processing of clustering. By way of an example, the top 5 candidate labels or candidate pairs from the sorted list of candidate pairs may be selected as cluster centers for further processing of clustering.
[0043] Referring to figure 3, a method (300) for creating one or more labels for one or more clusters is shown in accordance with an embodiment of the present subject matter. The method (300) may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 300 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0044] The order in which the method 300 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 300 or alternate methods. Additionally, individual blocks may be deleted from the method 300 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method 300 may be considered to be implemented in the above described system 102.
[0045] Referring to figure 3, a method (300) for creating one or more labels for one or more clusters is described. In step 302, input data is received. In one implementation, the input data is received by the receiving module 212. The input data further comprises a set of text documents, a set of text records associated with the cluster. In step 304, a plurality of candidate items occurring repetitively in the input data is selected using an n-gram technique for a predefined value of n. The candidate items further comprise words, phrases or a combination thereof. The predefined value of n used in the n-gram technique ranges from 1 to 5. Further, the frequency of occurrence of each candidate item in the input data is calculated. In one implementation, the plurality of candidate items occurring repetitively in the input data is selected, and the frequency of occurrence of each candidate item is calculated, by the candidate items selector 214. In step 306, a sorted list of candidate items is generated with the frequency of occurrence of the candidate items. The list of candidate items is sorted in descending order of the frequency of occurrence of said candidate items. In one implementation, the sorted list of candidate items with the frequency of occurrence of the candidate items is generated by the candidate items selector 214.
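Steps 304 and 306 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the candidate items selector 214: whitespace tokenization, lower-casing, the function name candidate_items, and the min_count parameter (used here to model "occurring repetitively") are all assumptions introduced for illustration.

```python
from collections import Counter


def candidate_items(records, max_n=5, min_count=2):
    """Count word n-grams for n = 1..max_n and keep those that repeat.

    Returns (n-gram, frequency) tuples sorted by descending frequency.
    """
    counts = Counter()
    for record in records:
        words = record.lower().split()  # assumed tokenization
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    repeated = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Descending frequency; ties are broken alphabetically so the
    # result is deterministic.
    return sorted(repeated, key=lambda item: (-item[1], item[0]))


records = [
    "friendly environment neat and clean",
    "neat and clean office",
    "friendly environment",
]
for gram, freq in candidate_items(records):
    print(freq, gram)
```

The tie-breaking rule is a design choice of the sketch; the specification only requires descending order of frequency.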

[0046] Referring to figure 3, in step 308, a foremost predefined number of the candidate items from the sorted list of candidate items is selected. In one implementation, the foremost predefined number of the candidate items is selected by the combination array generator 216. In step 310, a two dimensional array is populated, wherein each element of the two dimensional array represents a pair of n-grams. In one implementation, the two dimensional array is populated by the combination array generator 216. In step 312, a coverage value for each pair of n-grams from the two dimensional array is determined. The coverage value for each pair of n-grams is determined to further ensure a maximum coverage with a minimum overlap. In one implementation, the coverage value for each pair of n-grams is determined by the coverage value analyzer 218. In step 314, a sorted two dimensional array is populated. The two dimensional array is sorted in descending order of the coverage value for each pair of n-grams. In one implementation, the sorted two dimensional array is populated by the coverage value analyzer 218.
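Steps 310 to 314 can be illustrated with a short Python sketch. It assumes, following the P(A∪B) column heading of Table 5, that the coverage value of a pair is the fraction of records containing at least one of the two n-grams. Since P(A∪B) = P(A) + P(B) − P(A∩B), a pair whose n-grams overlap less scores higher for the same individual coverage. The function names are hypothetical, and flattening the array into a ranked list is a simplification of the "sorted two dimensional array".

```python
def contains(record_words, gram_words):
    """True if gram_words occurs as a contiguous run inside record_words."""
    n = len(gram_words)
    return any(record_words[i:i + n] == gram_words
               for i in range(len(record_words) - n + 1))


def coverage_array(records, grams):
    """Square array whose cell [i][j] is the assumed P(A U B) for grams i, j."""
    tokenized = [r.lower().split() for r in records]
    # For each n-gram, the set of record indexes in which it occurs.
    hits = [{k for k, words in enumerate(tokenized)
             if contains(words, g.lower().split())}
            for g in grams]
    total = max(len(tokenized), 1)  # guard against empty input
    return [[len(hits[i] | hits[j]) / total for j in range(len(grams))]
            for i in range(len(grams))]


def ranked_pairs(grams, array):
    """Flatten the upper triangle and sort pairs by descending coverage."""
    pairs = [(grams[i], grams[j], array[i][j])
             for i in range(len(grams))
             for j in range(i + 1, len(grams))]
    return sorted(pairs, key=lambda p: -p[2])


records = [
    "friendly environment neat and clean",
    "neat and clean office",
    "friendly environment",
    "good canteen",
]
grams = ["environment", "neat and clean", "friendly environment"]
for a, b, cov in ranked_pairs(grams, coverage_array(records, grams)):
    print(f"{a}, {b}: {cov:.2f}")
```

In this toy input, "environment" and "friendly environment" occur in exactly the same records, so pairing them adds no coverage and that pair ranks last.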
[0047] Still referring to figure 3, in step 316, a predefined number of n-gram pairs occurring foremost are selected from the sorted two dimensional array. In one implementation, the predefined number of n-gram pairs occurring foremost in the sorted two dimensional array are selected by the candidate pair selector 220. The candidate pair selector further selects at least the top 2 n-gram pairs. In step 318, the selected n-gram pairs are further processed and a list of candidate pairs is generated. In one implementation, the selected n-gram pairs are further processed and the list of candidate pairs is generated by the candidate pair selector 220.
[0048] Referring to figure 3, in step 320, the list of the candidate pairs is accepted and a number of unique words in each of the candidate pairs is determined. In one implementation, the list of the candidate pairs is accepted, and the number of unique words in each of the candidate pairs is determined, by the unique word filter 222. In step 322, the list of the candidate pairs is sorted using the coverage value and the number of unique words to create a sorted list of the candidate pairs. Sorting of the list of the candidate pairs is performed by using the coverage value in descending order and the number of unique words in ascending order, or vice versa, to create the sorted list of the candidate pairs. In one implementation, the list of the candidate pairs is sorted using the coverage value of the candidate pair and the number of unique words in the candidate pair by the cluster label selector 224. In step 324, a cluster label is selected from the sorted list of the candidate pairs. At least the top 3 candidate pairs are selected from the sorted list of candidate pairs to further select the cluster labels. In one implementation, the cluster label is selected from the sorted list of the candidate pairs by the cluster label selector 224.
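The unique word count of step 320 can be sketched as below. This passage does not define a "unique word", so the sketch makes a loudly stated assumption: a unique word is a word of the candidate pair whose frequency in the input data is at or below a threshold (1 by default), in line with the stated aim of keeping low frequency words out of the label. The function names and the threshold parameter are hypothetical.

```python
from collections import Counter


def word_frequencies(records):
    """Frequency of every word in the input data (whitespace tokenization)."""
    counts = Counter()
    for record in records:
        counts.update(record.lower().split())
    return counts


def unique_word_count(pair, frequencies, threshold=1):
    """Number of distinct words in the pair with frequency <= threshold.

    ASSUMPTION: this low-frequency definition of a "unique word" is
    illustrative only and may differ from the specification's definition.
    """
    words = set()
    for gram in pair:
        words.update(gram.lower().split())
    return sum(1 for w in words if frequencies[w] <= threshold)


records = [
    "friendly environment neat and clean",
    "neat and clean office environment",
    "friendly environment own decisions",
]
freq = word_frequencies(records)
print(unique_word_count(("environment", "neat and clean"), freq))
print(unique_word_count(("environment", "own decisions friendly"), freq))
```

Under this assumption, a pair built only from frequently occurring words gets a count of zero and therefore sorts ahead of pairs that contain rare words when the number of unique words is sorted in ascending order.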
[0049] Still referring to figure 3, in method 300, the receiving, the selecting of the plurality of candidate items, the selecting of the foremost predefined number of candidate items, the determining, the selecting of a predefined number of pairs, the accepting and the sorting steps explained above are performed by the processor 202.
ADVANTAGES
[0050] The system and method of the present invention use two statistical parameters to assure good coverage without any overlap between the two individual n-grams in a given n-gram pair.
[0051] The system and method of the present invention overcome the readability problem by choosing n-gram pairs rather than a single word, a phrase or a single n-gram, and n-gram pairs used together provide better coverage than a single word, a phrase or a single n-gram.
[0052] The system and method of the present invention use a unique word filtration mechanism which assures that low frequency words are not a part of the label.
[0053] The system and method of the present invention do not make use of any natural language processing techniques and hence are simple to maintain, robust, computationally efficient and less time consuming.
[0054] The system and method of the present invention can create labels for documents in any language.
[0055] The system and method of the present invention are generic and can create labels for any collection of any logical units of words.

WE CLAIM:
1. A system for creating one or more labels for one or more cluster, in a computing environment, the system comprising:
a processor;
a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprising:
a receiving module configured to receive an input data;
a candidate items selector configured to select plurality of candidate items occurring repetitively in the input data using a n-gram selection technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items in the input data;
a combination array generator configured to select foremost predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array, wherein each element of the two dimensional array represents a pair of the n-gram;
a coverage value analyzer configured to determine a coverage value for each pair of the n-gram present in the two dimensional array to further populate a sorted two dimensional array;
a candidate pair selector configured to select a predefined number of pairs of the n-gram from the sorted two dimensional array to further process and generate a list of the candidate pairs;
a unique word filter configured to accept the list of the candidate pairs to determine a number of unique words in each of the candidate pairs;

a cluster label selector configured to sort the list of the candidate pairs using the coverage value and the number of unique words to create a sorted list of the candidate pairs for selecting a cluster label from the sorted list of the candidate pairs.
2. The system of claim 1, wherein the input data comprises a set of text documents, a set of text records associated with one or more cluster.
3. The system of claim 1, wherein the candidate items further comprises of words, phrases or a combination thereof.
4. The system of claim 1, wherein the predefined value of n using n-gram technique may range from 1 to 5.
5. The system of claim 1, wherein the list of candidate items is sorted in accordance with a descending order of the frequency of occurrence of the candidate items.
6. The system of claim 1, wherein the coverage value for each pair of the n-gram is determined to further ensure a maximum coverage with a minimum overlap.
7. The system of claim 1, wherein the two dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the n-gram.
8. The system of claim 1, wherein the candidate pair selector further selects at least top 2 values for selected n-gram pairs.
9. The system of claim 1, wherein the cluster label selector sorts the list of the candidate pairs by using the coverage value in a descending order and the number of unique words in ascending order or vice versa to create a sorted list of the candidate pair.

10. The system of claim 1, wherein the cluster label selector selects at least 3 candidate pairs from the sorted list of candidate pairs to further select the cluster labels.
11. A method for creating one or more labels for one or more cluster, in a computing environment, the method comprising:
receiving an input data;

selecting plurality of candidate items occurring repetitively in the input data using a n-gram technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items;
selecting foremost predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array wherein each element of the two dimensional array represents a pair of the n-gram;
determining a coverage value for each pair of the n-gram from the two dimensional array to further populate a sorted two dimensional array;
selecting a predefined number of pairs of the n-gram from the sorted two dimensional array occurring foremost to further process and generate a list of a candidate pairs;
accepting the list of the candidate pairs to determine a number of unique words in each of the candidate pairs; and
sorting the list of the candidate pairs using the coverage value and the number of unique words to create a sorted list of the candidate pair for selecting a cluster label from the sorted list of the candidate pairs;
wherein the receiving, the selecting plurality of candidates, the selecting foremost predefined number of candidate items, the determining, the selecting a predefined number of pairs, the accepting and the sorting are performed by the processor.
12. The method of claim 11, wherein the input data further comprises a set of text documents, a set of text records associated with the cluster.
13. The method of claim 11, wherein the candidate items further comprises of words, phrases or a combination thereof.
14. The method of claim 11, wherein the predefined value of n using n-gram technique ranges from 1 to 5.

15. The method of claim 11, wherein the list of candidate items is sorted in accordance with a descending order of the frequency of occurrence of said candidate items.
16. The method of claim 11, wherein the coverage value for each pair of the n-gram is determined to further ensure a maximum coverage with a minimum overlap.
17. The method of claim 11, wherein the two dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the n-gram.
18. The method of claim 11, wherein the candidate pair selector further selects at least top 2 values for selected n-gram pairs.
19. The method of claim 11, wherein sorting the list of the candidate pairs is performed by using the coverage value in a descending order and the number of unique words in an ascending order or vice versa to create a sorted list of the candidate pair.
20. The method of claim 11, wherein at least 3 candidate pairs are selected from the sorted list of candidate pairs to further select the cluster labels.
21. The method of claim 11, wherein at least 5 candidate pairs are selected from the sorted list of candidate pairs as cluster centers.
22. A computer program product having embodied thereon a computer program for creating one or more labels for one or more cluster, the computer program product comprising:
a program code for receiving an input data;
a program code for selecting plurality of candidate items occurring repetitively in the input data using a n-gram technique for a predefined value of n to generate a sorted list of candidate items with a frequency of occurrence of the candidate items;
a program code for selecting foremost predefined number of the candidate items from the sorted list of candidate items to populate a two dimensional array wherein each element of the two dimensional array represents a pair of the n-gram;

a program code for determining a coverage value for each pair of the n-gram from the two dimensional array to further sort the two dimensional array in descending order of the coverage value for each n-gram pair to populate a sorted two dimensional array;
a program code for selecting a predefined number of pairs of the n-gram from the sorted two dimensional array occurring foremost to further process and generate a list of a candidate pairs;
a program code for accepting the list of the candidate pairs to determine a number of unique words in each of the candidate pairs; and
a program code for sorting the list of the candidate pairs using the coverage value and the number of unique words to create a sorted list of the candidate pair for selecting a cluster label from the sorted list of the candidate pair.

Documents

Application Documents

#	Name	Date
1	2217-MUM-2013-IntimationOfGrant07-09-2023.pdf	2023-09-07
1	2217-MUM-2013-Request For Certified Copy-Online(22-05-2014).pdf	2014-05-22
2	2217-MUM-2013-PatentCertificate07-09-2023.pdf	2023-09-07
2	Form 3 [01-12-2016(online)].pdf	2016-12-01
3	Certified Copy_2217-MUM-2013.pdf	2018-08-11
3	2217-MUM-2013-PETITION UNDER RULE 137 [23-08-2023(online)].pdf	2023-08-23
4	ABSTRACT.jpg	2018-08-11
4	2217-MUM-2013-RELEVANT DOCUMENTS [23-08-2023(online)].pdf	2023-08-23
5	2217-MUM-2013-Written submissions and relevant documents [23-08-2023(online)].pdf	2023-08-23
5	2217-MUM-2013-FORM 3.pdf	2018-08-11
6	2217-MUM-2013-FORM 26(6-9-2013).pdf	2018-08-11
6	2217-MUM-2013-Correspondence to notify the Controller [10-08-2023(online)].pdf	2023-08-10
7	2217-MUM-2013-FORM-26 [10-08-2023(online)]-1.pdf	2023-08-10
7	2217-MUM-2013-FORM 2.pdf	2018-08-11
8	2217-MUM-2013-FORM-26 [10-08-2023(online)].pdf	2023-08-10
8	2217-MUM-2013-FORM 2(TITLE PAGE).pdf	2018-08-11
9	2217-MUM-2013-FORM 18.pdf	2018-08-11
9	2217-MUM-2013-US(14)-HearingNotice-(HearingDate-16-08-2023).pdf	2023-07-21
10	2217-MUM-2013-CLAIMS [12-09-2019(online)].pdf	2019-09-12
10	2217-MUM-2013-FORM 1.pdf	2018-08-11
11	2217-MUM-2013-COMPLETE SPECIFICATION [12-09-2019(online)].pdf	2019-09-12
11	2217-MUM-2013-FORM 1(25-7-2013).pdf	2018-08-11
12	2217-MUM-2013-DRAWING.pdf	2018-08-11
12	2217-MUM-2013-FER_SER_REPLY [12-09-2019(online)].pdf	2019-09-12
13	2217-MUM-2013-DESCRIPTION(COMPLETE).pdf	2018-08-11
13	2217-MUM-2013-OTHERS [12-09-2019(online)].pdf	2019-09-12
14	2217-MUM-2013-CORRESPONDENCE.pdf	2018-08-11
14	2217-MUM-2013-FER.pdf	2019-03-12
15	2217-MUM-2013-ABSTRACT.pdf	2018-08-11
15	2217-MUM-2013-CORRESPONDENCE(6-9-2013).pdf	2018-08-11
16	2217-MUM-2013-CLAIMS.pdf	2018-08-11
16	2217-MUM-2013-CORRESPONDENCE(25-7-2013).pdf	2018-08-11

Search Strategy

1	search_2217mum2013_11-03-2019.pdf