Abstract: A method for comparing a similarity between two words in any of a plurality of natural languages, comprising the steps of: a. transforming the phonetic transcription of the two words into corresponding Sanskrit words; b. separating each syllable of each said transcribed word into Sanskrit consonants and vowels, and sequentially arranging the resultant letter elements to form phonetic units, wherein each phonetic unit corresponds to a vowel, a consonant, or a consonant with a trailing vowel if any, to form a set of phonetic units of the word; and c. comparing the sets of phonetic units of the two words with one another in a predetermined manner with a predetermined set of matching criteria for concluding the similarity.
FIELD OF INVENTION:
The invention generally relates to a method and apparatus for matching words inexactly, enabling a user to seek relevant information from a plurality of words.
PRIOR ART:
It has always been assumed that a similarity check on English words, or on words of any natural language, can only be done on the basis of an algorithm that compares English spelling and English pronunciation. The known practice is that the words are always transcribed to English for the purpose of comparison. As is known, there is a wide gap between the spelling, that is, the phonetic syllables, and the pronunciation of English words, due to which it is difficult to retrieve similarity information from English text, and it is worse when the documents are non-English.
One conventional technique for determining similarity is based on the concept of sets. A word, for example, may be represented as a set of phonetic units; in English, the letters form the corpus of phonetic units. The similarity or resemblance of two words to one another is then defined as the size of the intersection of the two sets divided by the size of the union of the two sets. One problem with this set-based similarity measure is that there is limited flexibility in weighting the importance of the elements within a set. A letter is either in a set or it is not in a set. In practice, however, it may be desirable to weight certain letters or syllables, such that letters or syllables that occur relatively infrequently in the corpus count more heavily when determining the similarity of words.
This technique is also extended to the matching of words to a document, which comprises a set of words, a set of sentences, etc. This is the most commonly used conventional practice.
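The set-based measure described above, and the weighting it lacks, can be illustrated with a minimal sketch. The function names and the `weight` table are assumptions made for illustration only; they do not appear in the prior art being described.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Plain set-based resemblance: size of intersection over size of union."""
    sa, sb = set(a.lower()), set(b.lower())
    if not sa and not sb:
        return 1.0  # two empty words are treated as identical
    return len(sa & sb) / len(sa | sb)


def weighted_jaccard(a: str, b: str, weight: Dict[str, float]) -> float:
    """Same ratio, but each letter counts by a (hypothetical) weight.

    Letters missing from the weight table default to 1.0, so an empty
    table reproduces the plain measure.
    """
    sa, sb = set(a.lower()), set(b.lower())
    union = sa | sb
    if not union:
        return 1.0
    shared = sum(weight.get(ch, 1.0) for ch in sa & sb)
    total = sum(weight.get(ch, 1.0) for ch in union)
    return shared / total


print(jaccard("abc", "abd"))                       # 0.5
print(weighted_jaccard("abc", "abd", {"d": 3.0}))  # rare letter 'd' counts more
```

In the plain measure every letter contributes equally; the weighted variant shows the kind of flexibility that the set-based approach does not offer by itself.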
Accordingly, because the prior art is inflexible in assigning weightage to the split contents of a word, there is a need in the art for an improved technique for determining similarity between words and between documents.
OBJECT OF INVENTION:
The systems and methods consistent with the present invention address the problems suffered in the prior art, and also address other new needs, by providing a method for comparing similarity between English words on the basis of a Sanskrit transcription similarity comparison algorithm.
It is known that Sanskrit words are spelled using the same phonetic syllables as their pronunciation, so the spelling is in clear connection with the pronunciation. For this reason, it is very convenient for the user to compare and retrieve data on the basis of phonetic transcription and phonetic symbols if done in Sanskrit rather than in English or any other natural language.
Therefore the present invention has been made with a view to overcoming the above problem in comparing two English words with an English algorithm, and it is an object of the invention to provide a method for comparing similarity between two English words or any foreign words on the basis of the Sanskrit pronunciation of said words and on the basis of a Sanskrit phonetic transcription similarity comparison algorithm. The algorithm is based on the swaras (vowels) and vyanjanas (consonants) of the Sanskrit language.
With the growing need for computerization of documents and on-line documents, there is a lot of work in sorting, studying and picking from large files and documents on a scale hitherto unknown. With the unlimited amount of information and documents available, there is a frequent need to select information under certain conditions, and so a need for a user to seek information selectively. There is therefore a compelling need to shift the burden of selecting information from the human to a machine-aided inter-personal communication system.
The system should also not demand any specialized knowledge or skill of its user.
In the prior art, the thesaurus always uses English as the base language, with English vowels and English consonants as the links encoded and associated in the thesaurus. But due to the variations between pronunciation and spelling in English, this prior-art approach yielded erroneous values when calculating similarity weightage. Furthermore, the result was worse when this thesaurus was used for matching foreign words.
Therefore, what is urgently required in this field is a thesaurus that is independent of the language of the words being matched and is capable of being used with words of any language, making the word matching process generic, more reliable, accurate and efficient. What is further required is a method using such a universal thesaurus for evaluating similarity of words in a plurality of languages.
DESCRIPTION OF INVENTION:
In accordance with the present invention, the above and other objects can be accomplished by the provision of a method for comparing a similarity between two specific words, which may be English, by actually comparing a similarity between the Sanskrit phonetic transcriptions of the two words. The method comprises the steps of transforming each foreign word, comprising consonants and vowels, into a set of Sanskrit syllables; sequentially arranging the set of syllables to form a plurality of phonetic units of the corresponding word; comparing the phonetic units of the two words in question based on the algorithm to compile the weightage of each phonetic unit; and summing the weightage as a whole based on a predetermined rule for summation. The summed-up total determines the similarity value of the comparison.
The above and other objects with the features and advantages of the present invention will now be more clearly understood from the following detailed description taken in conjunction with accompanying drawings.
Figure 1 is a flow chart illustrating a method for comparing the similarity between the Sanskrit phonetic transcriptions of English words in accordance with the present invention.
An algorithm based on Sanskrit syllables is herein adapted to compare the phonetic similarity between English words, so as to effectively perform an approximate search for similarity on the basis of spelling and pronunciation, where the spelling may not be accurately known.
The English language has only 5 vowels and 19 consonants, whereas the Sanskrit language has 15 swaras and 35 vyanjanas. Sanskrit thereby provides greater flexibility for allotting weightage between syllables, enabling a better and more accurate algorithm to be developed.
A method for similarity comparison between phonetic transcriptions of words in accordance with the present invention can borrow a basic methodology from an English algorithm and improve the method of matching. The consonants H, W and Y create a lot of confusion in an English algorithm, as there is much ambiguity surrounding these letters in relation to pronunciation and spelling, and so they need specific attention for comparison purposes when an English algorithm is used. This has been overcome by using an algorithm created from the Sanskrit syllables.
The invention takes advantage of the fact that there is a difference between the English phonological structure and rules and the Sanskrit phonological structure and rules. In accordance with the present invention, the existing known English algorithms are modified and applied with a novel Sanskrit algorithm, in consideration of the phonological characteristics of the Sanskrit language, so as to be adequate for the actual circumstances existing and persisting in the English language.
The similarity comparison method of the present invention is mainly adapted to compare a similarity between two Sanskrit transcriptions derived from two English words. In this regard, the present phonetic transcription similarity comparison method compares the pronunciation similarity between consonants along with their associated vowels, by using the vyanjanas and associated swaras of the Sanskrit language.
In brief, the present invention uses phonetic transcription similarity comparison identities, specifically the association of the correct vowels with the correct consonants, which are generally confused in an English algorithm due to the difference between English phonological structure and English spelling. It therefore determines more efficiently whether the two phonetic transcriptions in Sanskrit are similar or not.
A method for comparing a similarity between two words, comprising the steps of:-
(1) rewriting phonetically, in Sanskrit, the first word and the plurality of second words with which it is to be compared;
(2) splitting the Sanskrit words into a plurality of phonetic units to form a first set and a plurality of second sets of phonetic units corresponding to the Sanskrit words;
(3) comparing the first set of phonetic units of the first Sanskrit word with the phonetic units of a selected second Sanskrit word using a stored matrix to compile a plurality of weight values associated with the sets of phonetic units of the Sanskrit words;
(4) summing the plurality of weight values in a predetermined manner to determine the total weight value of the comparison;
(5) averaging the weight value in a predetermined manner over the number of phonetic units to calculate the similarity factor value;
(6) determining whether the second word is similar to the first word based on whether the similarity factor value is above or below a predetermined similarity factor value;
(7) transferring similar words to a list of similar words for generating a report;
(8) repeating steps 1 to 7 for the plurality of second words being compared with the first word;
(9) displaying the list of similar words.
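The nine steps above can be sketched as a small search loop. This is only an illustration under stated assumptions: the transliteration of steps 1 and 2 is replaced by a hand-written table (`TOY_UNITS`, whose entries are hypothetical), the stored matrix is replaced by an exact-match weight, and the averaging rule (division by the longer unit count) and the threshold value are choices made for this sketch rather than values given in the specification.

```python
from typing import Callable, List, Sequence


def exact_weight(a: str, b: str) -> float:
    """Stand-in for the stored matrix: 1.0 for identical units, else 0.0."""
    return 1.0 if a == b else 0.0


def similarity_factor(first: Sequence[str], second: Sequence[str],
                      weight: Callable[[str, str], float] = exact_weight) -> float:
    """Steps 3 to 5: compare units position by position, sum, then average."""
    if not first and not second:
        return 1.0
    total = sum(weight(a, b) for a, b in zip(first, second))  # steps 3 and 4
    return total / max(len(first), len(second))               # step 5 (assumed rule)


def find_similar(first_word: str, second_words: Sequence[str],
                 to_units: Callable[[str], List[str]],
                 threshold: float = 0.6) -> List[str]:
    """Steps 1 to 9, with transliteration delegated to `to_units`."""
    first_units = to_units(first_word)            # steps 1 and 2, first word
    similar: List[str] = []
    for word in second_words:                     # step 8
        factor = similarity_factor(first_units, to_units(word))
        if factor >= threshold:                   # step 6
            similar.append(word)                  # step 7
    return similar                                # step 9: the caller displays it


# Hypothetical, hand-written phonetic units standing in for steps 1 and 2.
TOY_UNITS = {
    "night": ["naa", "i", "t"],
    "knight": ["naa", "i", "t"],
    "neat": ["nii", "t"],
    "boat": ["bo", "t"],
}

print(find_similar("night", ["knight", "neat", "boat"], TOY_UNITS.__getitem__))
# -> ['knight']
```

Passing the transliteration in as a function keeps the loop independent of how the Sanskrit phonetic units are actually produced.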
The method is capable of correctly and rapidly retrieving various similar words without confusion, in spite of English words having spelling that differs from their phonetic structure, merely by transforming the words to be compared phonetically into Sanskrit letters, without assigning or considering the meaning of the words.
Although the embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the specification and claims herein.
Even though the description herein may relate more specifically to the matching of two English words, as illustrations for understanding, the invention is not limited by this description. It is to be noted that the language of the query or search word and the language of the database containing the target words can be the same or may be different from one another. The original languages of the search word and the target word are irrelevant, as the very core of the invention transforms the search word and the target words from their original language into their Sanskrit equivalents by a method of transformation involving creating phonetic units as described in detail above. As such, the invention is capable of adapting to a query in any language and a stored database containing target words in any language for the process of matching using the method described in the invention. The main characteristic of this invention is the use of a novel algorithm that uses Sanskrit phonetic units in mapping the weightage factor for the purpose of matching words and documents in any language.
A typical similarity check system in a particular natural language functions by using natural language processing, which is generally concerned with an attempt to decompose or split the given content into smaller sub-contents according to linguistic rules. However, in this invention, there is no need to understand the meaning or the grammar of the query words and the database words which are being compared. They are merely transformed into Sanskrit words using the syllables from the phonetics of the original word. The knowledge required is limited specifically to phonetic and phonological knowledge, to relate how the word originally sounds and to restructure the original word in Sanskrit characters using the swaras and vyanjanas of the Sanskrit language. There is no need for morphological knowledge, syntactical knowledge or semantic knowledge of the foreign language. The transformation is merely a rewriting of the foreign word with Sanskrit characters without any concern for what the word means, and the transformation is thereby context independent. This means the transformation is regardless of the meaning and the context in which the word may be used.
A system and a method of evaluating similarity among data words are described herein. In one embodiment, the thesaurus structure is created for use in processing. The system shall include inputting thesaurus concepts and relationships within its structure. The processing block shall be capable of splitting the unitary words into phonetic units, each of which is a vyanjana, a swara or a combination thereof, arranged sequentially.
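The splitting performed by the processing block can be sketched as follows, assuming the word has already been rewritten in a simplified ASCII romanization of Sanskrit. The symbol inventories `VOWELS` and `CONSONANTS` are deliberately small and hypothetical; they are not the full set of swaras and vyanjanas, and the romanization is an assumption of this sketch, not a scheme named in the specification.

```python
from typing import List, Optional, Sequence

# Hypothetical, reduced inventories; longer symbols are listed first so
# that "aa" is preferred over "a" and "kh" over "k".
VOWELS = ("aa", "ai", "au", "ii", "uu", "a", "i", "u", "e", "o")
CONSONANTS = ("kh", "gh", "ch", "jh", "th", "dh", "ph", "bh", "sh",
              "k", "g", "j", "t", "d", "n", "p", "b", "m",
              "y", "r", "l", "v", "s", "h")


def _match(text: str, pos: int, symbols: Sequence[str]) -> Optional[str]:
    """Return the first symbol that occurs in `text` at `pos`, if any."""
    for sym in symbols:
        if text.startswith(sym, pos):
            return sym
    return None


def split_phonetic_units(word: str) -> List[str]:
    """Split a romanized word into units: vowel, consonant, or consonant+vowel."""
    text = word.lower()
    units: List[str] = []
    pos = 0
    while pos < len(text):
        vowel = _match(text, pos, VOWELS)
        if vowel is not None:                 # a free-standing vowel
            units.append(vowel)
            pos += len(vowel)
            continue
        cons = _match(text, pos, CONSONANTS)
        if cons is None:
            raise ValueError(f"unsupported symbol at position {pos}: {text[pos]!r}")
        pos += len(cons)
        trailing = _match(text, pos, VOWELS)  # attach a trailing vowel, if any
        if trailing is not None:
            pos += len(trailing)
            units.append(cons + trailing)
        else:
            units.append(cons)
    return units


print(split_phonetic_units("sanskrit"))  # ['sa', 'n', 's', 'k', 'ri', 't']
```

The units come out in their original order, which matters because the later comparison is positional.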
Each document in the database containing the target words is subjected to a set of processing steps to generate a Sanskrit language representation of the subject content of the document. This is normally done before the query is entered, so that the database document is available at the point of making the query. The query is also subjected to processing, possibly a different set of processing steps, to generate a Sanskrit language representation of the subject content of the query.
This document retrieval system is such that the user can enter a query in a desired one of a plurality of supported natural languages, including English, and retrieve documents from a database that includes documents in the supported languages.
The word or document of the query and the database containing the target words or documents are matched based on the transformed Sanskrit contents of the query and of the stored database, to generate a measure of relevance or similarity of the document to the query using an algorithm which is based on Sanskrit syllables.
Further, the phonetic units are used for matching between two words using the code of weightage based on similarity between two phonetic units so compared. Therefore, summation of weightage is done at the end of the process of comparison of mutual phonetic units. Many alternative embodiments may be used to calculate the matching weightage. The drawings illustrate one such embodiment of weightage allotment.
The calculation of a relative weightage corresponding to each entry in the matrix may be varied for the purpose of achieving speed, accuracy, reliability or combination thereof. The predetermined criteria may include weightage allotted to number of phonetic units forming the words, prefix of the word, suffix of the word, body of the word, word length etc., depending on whether to execute the search for exact match, wild card match or fuzzy logic match.
However, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
The present invention provides an efficient word segmentation apparatus, the word being an English word or any foreign word, the said apparatus being capable of overcoming many drawbacks suffered in the prior art. The apparatus employs a technique of transliterating or transforming the word into a Sanskrit word using phonetic symbol information of the original word, so as to replace troublesome probability calculation. It further uses a few linguistic semantics and syntax rules thereafter for the purpose of comparison.
The Sanskrit word segmentation apparatus and method are characterized by a matrix which has the swaras, vyanjanas and combinations thereof along both the x axis and the y axis, with weightage values as the cell values within the said matrix.
The syntax information portion stores the cell values of the matrix, which is a two-dimensional array formed by swaras, vyanjanas and combinations thereof. Each cell value may vary from 0 to 1 to indicate whether the two phonetic units are different, similar or identical. A value of 0 indicates that the two phonetic units are different, while a value of 1 indicates that the two phonetic units are identical. A value between 0 and 1 grades the similarity between the two phonetic units as per the predetermined criteria. The strings of phonetic units forming the two words are compared mutually and correspondingly to look up the cell values that determine the weightage of each matching step. The weightage of all the steps is thereafter summed to obtain the total weight factor. The results are all stored in the buffer region. The report of the matched words is then generated depending upon the determined rules, which at least determine the minimum weight factor value for assuming similarity. Each of the phonetic units may be matched to obtain a plurality of weightages associated with each query word and the document containing the target words.
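A minimal sketch of such a matrix and of the summation over it is given below. The unit list, the graded cell values (0.8 and 0.5) and the minimum weight factor are hypothetical figures chosen for illustration; the specification leaves the actual values to the predetermined criteria.

```python
from typing import Dict, List, Sequence, Tuple


def build_weight_matrix(units: Sequence[str],
                        graded_pairs: Dict[Tuple[str, str], float]
                        ) -> Tuple[Dict[str, int], List[List[float]]]:
    """Build a symmetric two-dimensional array of cell values in [0, 1].

    The same phonetic units label both axes. The diagonal is 1.0
    (identical units), unlisted pairs stay 0.0 (different units), and
    `graded_pairs` supplies the in-between values.
    """
    index = {unit: i for i, unit in enumerate(units)}
    size = len(units)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        matrix[i][i] = 1.0
    for (a, b), value in graded_pairs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError("cell values must lie between 0 and 1")
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value
    return index, matrix


def total_weight_factor(first: Sequence[str], second: Sequence[str],
                        index: Dict[str, int],
                        matrix: List[List[float]]) -> float:
    """Look up one cell per matching step and sum the weightages."""
    return sum(matrix[index[a]][index[b]] for a, b in zip(first, second))


def is_similar(first: Sequence[str], second: Sequence[str],
               index: Dict[str, int], matrix: List[List[float]],
               minimum: float) -> bool:
    """Apply the minimum weight factor rule to the summed weightage."""
    return total_weight_factor(first, second, index, matrix) >= minimum


# Hypothetical units and grades, for illustration only.
index, matrix = build_weight_matrix(
    ["ka", "kha", "ga", "ta"],
    {("ka", "kha"): 0.8, ("ka", "ga"): 0.5},
)
print(is_similar(["ka", "ta"], ["kha", "ta"], index, matrix, minimum=1.5))  # True
```

Because the matrix is built once and only indexed during comparison, each matching step is a constant-time lookup.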
In one embodiment, the total weightage may be a mere summation of the individual weightages of the phonetic units. Alternatively, many other techniques may be used to calculate or determine the total weightage. One such technique is character transposition, in which the sequence or order of two consecutive characters may be considered for the purpose of summation, to compute the overall similarity between the two words. In other embodiments, the count of phonetic units, prefix phonetic units, suffix phonetic units and word length may also be given consideration. The criteria are based on whether an attempt is made to execute searches for an exact match or a similar match. Furthermore, the criteria may also depend on how much use is made of linguistic semantics.
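One possible reading of the character transposition technique is sketched below: when two consecutive phonetic units appear in swapped order, the pair earns partial credit instead of none. The `swap_credit` parameter and its default of 0.5 per unit are assumptions of this sketch, as is the exact-match weight used in the example; the specification does not fix either.

```python
from typing import Callable, Sequence


def exact_weight(a: str, b: str) -> float:
    """Stand-in weight: 1.0 for identical units, else 0.0."""
    return 1.0 if a == b else 0.0


def total_weight_with_transposition(first: Sequence[str], second: Sequence[str],
                                    weight: Callable[[str, str], float] = exact_weight,
                                    swap_credit: float = 0.5) -> float:
    """Sum positional weights, crediting swapped neighbouring units.

    `swap_credit` is the (hypothetical) weight granted to each unit of a
    transposed pair.
    """
    total = 0.0
    limit = min(len(first), len(second))
    i = 0
    while i < limit:
        swapped = (i + 1 < limit
                   and first[i] != first[i + 1]
                   and first[i] == second[i + 1]
                   and first[i + 1] == second[i])
        if swapped:
            total += 2 * swap_credit   # both units of the pair get credit
            i += 2
        else:
            total += weight(first[i], second[i])
            i += 1
    return total


print(total_weight_with_transposition(["ka", "ta", "ra"], ["ta", "ka", "ra"]))  # 2.0
```

With plain positional summation the same two words would score 1.0, since only the final unit lines up.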
The transliteration-based approach has demonstrated good results over the types of variations seen in many words in many languages, although such approaches are computationally intensive. However, in spite of the computational complexity, the simple algorithm as envisaged ensures speed of computation, thereby rendering the approach a useful one at the present time, when there is a need for accurate and speedy matching.
All of the techniques applicable in the prior art for matching may also be adapted in this technique, for the purpose of retaining and preserving the good qualities of matches obtained in the prior art and thereafter improving upon them for more accuracy and better reliability with inherent flexibility, making the invention a very novel and useful apparatus.
In one aspect of the invention, there is provided a method for searching a database including a plurality of records, the method comprising receiving a search criterion for matching, wherein each record comprises a field.
In another aspect of the invention, there is provided a method for searching a database including a plurality of records, the method comprising receiving search criteria for matching, wherein the records may have a plurality of fields and wherein the search criteria may have a plurality of search elements.
The method may further comprise employing a plurality of matrix tables for generating the weight value of the match. The tables are all stored in the memory and may be adjusted and modified flexibly depending upon the need of the user. The creation of the tables, which link the processing and the determination of the weightage, is performed as part of a set-up process carried out prior to any search, and the tables are stored in the memory of the processor. The system will thereby also include a processor, which may be a PC, a mainframe, etc.
The method and system so disclosed herein are relatively reliable and fast. They are characterized further by a search method in Sanskrit words that is cognizant of vowels and consonants, that is, of swaras and vyanjanas, while the query subject content and the document subject contents may be in any of a plurality of supported languages.
It is believed that this invention provides a significant improvement in assessing and matching a database with a search query which employs a search strategy enabling each step in search to be more reliable and as fast as possible.
WE CLAIM:
1. A method for comparing a similarity between two words in any of a plurality
of natural languages comprising the steps of: -
a. transforming phonetic transcription of the two words into corresponding
Sanskrit words;
b. separating each syllable of each of the said transcribed word into
Sanskrit consonants and vowels, sequentially arranging the resultant
letter elements to form phonetic units, wherein each phonetic unit
corresponds to a vowel, a consonant or a consonant with trailing
vowel if any, to form a set of phonetic units of the word; and
c. comparing the set of phonetic units of the two words with one another
in a predetermined manner with predetermined set of matching criteria
for concluding the similarity.
2. A method of matching a search word according to a predetermined set of
matching criteria to a set of target words contained in a collection of words,
comprising: -
(a) creating and storing prior to query entry a target-word-lookup-table (database) containing the collection of words associating each of the stored words with a unique identification code, creating associated plurality of Sanskrit phonetic units of the stored words, wherein the target-word-lookup-table (database) is multi-dimension, the first dimension corresponding to consecutive identification code and other dimensions corresponding to plurality of phonetic units of the stored words;
(b) creating and storing prior to a query entry a syllable (character)-lookup-table-matrix, associating the plurality of Sanskrit phonetic units mutually along the x-y axes (vectors);
(c) query entry and transforming the search word into plurality of Sanskrit phonetic units;
(d) selecting from the target-word-lookup-table (database) a target word and assessing the target word to identify the common set of Sanskrit phonetic units also correspondingly contained in the search word;
(e) determining a plurality of weightage of similarity of target word with respect to search word in the syllable (character)-lookup-table-matrix with search word and target words forming the axis (vectors) of the matrix according to the predetermined set of weightage criteria for allotting the weight in the cells of the matrix;
(f) determining in response to whether the target word has a predetermined set of characteristics similar to search word based on predetermined set of matching criteria for including the target word as a matching word to a report table;
(g) repeating steps (c) to (f) until exhausting all target words in the target-word-lookup-table (database); and
(h) displaying the result table comprising plurality of words from the target-word-lookup-table (database) matching the search word as per the predetermined matching criteria.
3. A method of matching a search word according to a predetermined set of
matching criteria to a set of target words contained in a collection of words, as
claimed in Claims 1 and 2, wherein the method of representing the target word in
a database includes words in a plurality of supported languages, the method
carried out for each target word comprising:
(a) splitting into a set of phonetic units of the subject content of the target word, and
(b) generating Sanskrit representation of the subject content of the target word based on its phonetic units.
4. A method of matching a search word according to a predetermined set of
matching criteria to a set of target words contained in a collection of words, as
claimed in claims 1 and 2, wherein the method of representing a search word in
query comprises a word in any plurality of supported languages, the method
carried out for the search word comprising:
(a) splitting into a set of phonetic units of the subject content of the search word, and
(b) generating Sanskrit representation of the subject content of the search word based on its phonetic units.
5. A method of matching a search word according to a predetermined set of matching criteria to a set of target words contained in a collection of words, as claimed in claims 1 to 4, wherein the language of the search word is the same as the language of the target word.
6. A method of matching a search word according to a predetermined set of matching criteria to a set of target words contained in a collection of words, as claimed in claims 1 to 4, wherein the language of the search word is different from the language of the target word.
7. A system for comparing a similarity between two words in any of plurality of natural languages comprising: -
a. means for transforming phonetic transcription of the two words into
corresponding Sanskrit words;
b. means for separating each syllable of each of the said transcribed word
into Sanskrit consonants and vowels, sequentially arranging the resultant
letter elements to form phonetic units, wherein each phonetic unit
corresponds to a vowel, a consonants or a consonant with trailing vowel if
any, to form a set of phonetic units of the word; and
c. means for comparing the set of phonetic units of the two words with one another in a predetermined manner with predetermined set of matching criteria for concluding the similarity.
8. A system for matching a search word according to a predetermined set of matching criteria to a set of target words contained in a collection of words, comprising: -
(a) means for creating and storing prior to query entry a target-word-lookup-table (database) containing the collection of words associating each of the stored words with a unique identification code, creating associated plurality of Sanskrit phonetic units of the stored words, wherein the target-word-lookup-table (database) is multi-dimension, the first dimension corresponding to consecutive identification code and other dimensions corresponding to plurality of phonetic units of the stored words;
(b) means for creating and storing prior to a query entry a syllable (character)-lookup-table-matrix, associating the plurality of Sanskrit phonetic units mutually along the x-y axes (vectors);
(c) means for query entry and transforming the search word into plurality of Sanskrit phonetic units;
(d) means for selecting from the target-word-lookup-table (database) a target word and assessing the target word to identify the common set of Sanskrit phonetic units also correspondingly contained in the search word;
(e) means for determining a plurality of weightage of similarity of target word with respect to search word in the syllable (character)-lookup-table-matrix with search word and target words forming the axis (vectors) of the matrix according to the predetermined set of weightage criteria for allotting the weight in the cells of the matrix;
(f) means for determining in response to whether the target word has a predetermined set of characteristics similar to search word based on predetermined set of matching criteria for including the target word as a matching word to a report table;
(g) means for repeating steps (c) to (f) until exhausting all target words in the target-word-lookup-table (database); and
(h) means for displaying the result table comprising plurality of words from the target-word-lookup-table (database) matching the search word as per the predetermined matching criteria.
9. A system for matching a search word according to a predetermined set of matching criteria to a set of target words contained in a collection of words, as claimed in Claims 7 and 8, wherein the method of representing the target word in a database includes words in a plurality of supported languages, the system for target word comprising:-
(a) means for splitting into a set of phonetic units of the subject content of the target word, and
(b) means for generating Sanskrit representation of the subject content of the target word based on its phonetic units.
10. A system for matching a search word according to a predetermined set of
matching criteria to a set of target words contained in a collection of words, as
claimed in claims 7 and 8, wherein the method of representing a search word in
query comprises a word in any plurality of supported languages, the system for
search word comprising:-
(a) means for splitting into a set of phonetic units of the subject content of the search word, and
(b) means for generating Sanskrit representation of the subject content of the search word based on its phonetic units.
11. A system for matching a search word according to a predetermined set of
matching criteria to a set of target words contained in a collection of words, as
claimed in claims 7 to 10, wherein the language of the search word is the same
as the language of the target word.
12. A system for matching a search word according to a predetermined set of matching criteria to a set of target words contained in a collection of words, as claimed in claims 7 to 10, wherein the language of the search word is different from the language of the target word.
| # | Name | Date |
|---|---|---|
| 1 | 1588-che-2007 abstract.pdf | 2011-09-03 |
| 1 | 1588-che-2007-form 5.pdf | 2011-09-03 |
| 2 | 1588-che-2007 claims.pdf | 2011-09-03 |
| 2 | 1588-che-2007-form 3.pdf | 2011-09-03 |
| 3 | 1588-che-2007-form 26.pdf | 2011-09-03 |
| 3 | 1588-che-2007 correspondence others.pdf | 2011-09-03 |
| 4 | 1588-che-2007-form 1.pdf | 2011-09-03 |
| 4 | 1588-che-2007 description(complete).pdf | 2011-09-03 |
| 5 | 1588-che-2007-drawings.pdf | 2011-09-03 |
| 5 | 1588-che-2007 drawings.pdf | 2011-09-03 |
| 6 | 1588-che-2007-description(complete).pdf | 2011-09-03 |
| 6 | 1588-che-2007 form-1.pdf | 2011-09-03 |
| 7 | 1588-che-2007-correspondnece-others.pdf | 2011-09-03 |
| 7 | 1588-che-2007 form-26.pdf | 2011-09-03 |
| 8 | 1588-che-2007 form-3.pdf | 2011-09-03 |
| 8 | 1588-che-2007-claims.pdf | 2011-09-03 |
| 9 | 1588-che-2007 form-5.pdf | 2011-09-03 |