Abstract: This invention relates to a predictive tool for creating text in Indic languages based on analyzing phonetic syllables of words and their frequency. The tool also has a frequency profiler which adapts to the frequency of a user's use of words in the Indic script to create texts.
FORM-2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE
Specification
(See section 10 and rule 13)
A PREDICTIVE TOOL FOR CREATING TEXT IN INDIC LANGUAGES
PENFOSYS PRIVATE LIMITED
An Indian Company of 291, Somawar Peth, Pune 411 011, Maharashtra, India
4 APR 2005
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED:-
This invention relates to a predictive tool for creating text in Indic languages.
Designing a predictive tool for Indic scripts is a problem in Human, Linguistic and Mathematical engineering and needs to be approached from these vantage points. A typical Indic script is Devanagari with special reference to Hindi, but the same principles can be applied to any other Indic script.
A certain number of constraints are encountered when Devanagari has to be ported onto the mobile phone:
Data inputting using a computer has traditionally been carried out using a 102-key keyboard. These keyboards, because of their large number of keys, permit a disambiguated method of displaying characters on screen, either by associating one letter with one key or by generating characters by means of dead keys (Latin 1 languages). The case is quite different for scripts termed "complex", as in the case of Devanagari, where ligatures have to be accommodated.
However, with micro-devices such as mobile phones and WAP devices entering the market, a new demand for data entry methods has arisen. Mobile telephones are a good example of how engineering requirements colour software and demand new and innovative 'fits'.
Due to the tiny size of the mobile phone, the keyboard is reduced to a minimal 12-button keypad. This reduced size makes it hard for the user to type text quickly and efficiently, because multiple letters are associated with a single key, forcing the user to type using multiple taps. This procedure, however monotonous, was used until recently to key in text. To reduce user frustration, a new class of text entry methods has appeared. It uses dictionaries in an attempt to resolve word ambiguity and requires, in most cases, only one keystroke per character.
The mobile user is not necessarily an urban one conversant with the English alphabet. In urban areas an English-literate and semi-literate class can make do with English, but an increasingly large number of mobile phone users have little or no knowledge of English. Saturation in urban areas has already been reached, and it is precisely the rural market, composed of users literate in their own mother tongues, which will be the next wave. Of these, Devanagari users constitute a large percentage, followed by users of scripts as varied as Punjabi, Gujarati, Tamil, Bangla, Telugu and Kannada, practically in this order (survey of 2001 on relations between economic links and scripts, based on Census of India reports). English phonetic keyboards, with torturous acrobatics to key in the script, can make do, but the final frontier is to present the user with a keyboard in his own script.
A large number of users would like to use mobile phones in their own scripts, but given the relatively high number of characters in the script, keying in a message is a long and complicated process.
Prediction apparatus and methods for creating text are used in order to reduce the number of key presses and the resulting effort for the user. Mobile phones and similar small devices have a numeric keypad on which characters are mapped to type text. Hence, multiple characters are mapped on to each key. Existing predictive tools such as T9, which are used for Anglo-Latin script-based languages such as English, are usually based on the following methods:
1. Predict a letter for every key press such that the letter is mapped on to the respective key and combinations of the letters lead to a valid word.
2. Predict ahead a whole word based on the key sequence typed by the user.
Neither of the above methods is very suitable for phonetic languages such as the Indian languages. The Indian languages have about 55 to 65 characters in the alphabet set. This number is much larger than the English alphabet set. Thus, when the method mentioned in (1) is followed, the confusion it creates is compounded. Also, since the number of letters is large, the number of word forms is also quite large, which makes whole-word prediction (2) an irritation.
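The crowding behind this objection can be put in rough numbers. The short Python sketch below assumes, purely for illustration, that eight of the twelve keys carry letters (as keys 2 to 9 do on the common English phone keypad); the function name and the key count are assumptions and are not part of this specification.

```python
def letters_per_key(alphabet_size, letter_keys=8):
    """Average number of letters that must share one key.

    letter_keys=8 is an assumed layout (keys 2-9 carry letters).
    """
    return alphabet_size / letter_keys


english = letters_per_key(26)      # 3.25 letters per key
indic_low = letters_per_key(55)    # 6.875 letters per key
indic_high = letters_per_key(65)   # 8.125 letters per key
```

Under this assumption each key press is roughly two to two and a half times more ambiguous for an Indic alphabet than for English.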
Also, Indic script languages follow a different syntax in that not every letter may legally follow every other letter. In all phonetic languages the basic unit for making a word is a syllable. The formation of a syllable is governed by a set of rules by which letters can be joined to form a syllable. Thus words in Indic scripts are formed not from letters but from syllables.
The predictive tool in accordance with this invention is based on the following principle: Since the Indian languages are phonetic in nature, the words are made of syllables, which may be combinations of one or more letters that form a basic pronounceable unit. Since a syllable must be a combination of letters that can be pronounced, only a limited set of character sequences can be valid. The tool in accordance with this invention exploits this property of Indian languages and tries to predict the nearest valid syllable that leads to a valid word. Thus, the predictive tool in accordance with this invention looks ahead and completes a desired word much faster without creating any user irritation.
Another feature of this invention is that the tool continuously creates a predictive profile based on the frequency of occurrence of words and patterns of phrases used by a user. It also learns the pet words used by a user which do not normally appear in a preset dictionary as valid words.
According to this invention there is provided a predictive tool for Indic scripts consisting of
(1) a first stored database consisting of a dictionary of the concerned language containing all the commonly used and appendably stored words, each of the stored words being broken up into each of their respective syllables;
(2) a second stored data base consisting of all the alphabets and alphabet variations of the concerned language;
(3) a display means which can receive and display text having a main text area and a window in which the predictive tool can provide predictions;
(4) tool inputting means adapted to receive within the tool, from a general text inputting means such as a keyboard, the user key input;
(5) recognizer means adapted to refer to the letter data base and recognize the key input as at least one letter in the Indic script of a partial word with reference to the word database;
(6) a temporary partial word storage means in which the recognizer appends into memory a partial word inputted by a user;
(7) a temporary word and syllable storage means in which the recognizer is adapted to, after comparison of the partial word with words and syllables stored in the word database and creating a first sub set of whole words and a second sub set of syllables defined words from the word database formed commencing with the partial word, temporarily store the sub set of words and/or syllables fetched from the word database;
(8) a syllable analyzer adapted to read the sub sets of words and syllables in the temporary word database, analyze the sub set of words and syllables so read and sort and arrange the sub set of words into sorted predictive strings having elements of words and syllables in accordance with the rules of the language stored in the analyzer;
(9) a sorted strings storage means in which the sorted predictive strings of words and syllables can be temporarily stored;
(10) a frequency profiler adapted to register words and syllables inputted by a user using the tool, and further adapted to append any new or original words or syllable frequencies of the user in the word data base and still further adapted to send responses to the syllable analyzer to alter, if necessary, the sorted predictive strings in the sorted strings storage means;
(11) a word display handler in which elements of the amended sorted predictive strings are stored for predictive display;
(12) a settable selector means for enabling the user to select at least some of the elements of the amended sorted predictive strings in the word display handler in accordance with a set frequency threshold, typically from one to three, for predictive display in the window and an operation means for transferring one element satisfying the prediction displayed in the window to the main text for further creation of text in the Indic language; and
(13) a reset means for emptying all the temporary storage means for receiving the next iteration of input for prediction.
The invention will now be described with reference to the accompanying
drawings, in which
Figure 1 is a block diagram of the elements of the predictive tool in
accordance with this invention; and
Figure 2 is a block diagram of a syllable analyzer for the tool of figure 1.
Referring to figure 1 of the accompanying drawings,
The user key input passes to the RECOGNIZER of the predictive tool in
accordance with this invention via the tool input INPUT.
The RECOGNIZER refers to the letter database LETTER D B and identifies
the input character.
This character is appended into a memory of a temporary partial word
storage means T P W S.
The partial word is passed into the Word Database W D B. A set of
matching words and syllables, which start with the letters of the partial
word, is identified and sub sets of such words and syllables are passed to a
temporary memory T D.
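The lookup step described above can be sketched in Python as follows. This is a minimal illustration only: it assumes the word database is held as a list sorted by code point and it matches whole words only (the matching of stored syllables is omitted). The function name words_with_prefix and the sample words are hypothetical and do not come from this specification.

```python
from bisect import bisect_left


def words_with_prefix(sorted_words, partial):
    """Return the words in a sorted list that start with `partial`.

    In a sorted list all words sharing a prefix are contiguous, so a
    binary search finds the first candidate and a short scan collects
    the rest.
    """
    start = bisect_left(sorted_words, partial)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(partial):
            break
        matches.append(word)
    return matches


# Hypothetical stand-in for the word database W D B.
word_db = sorted(["कमल", "नगर", "नमक", "नमस्ते"])
temporary_matches = words_with_prefix(word_db, "नम")
```

Here temporary_matches plays the role of the temporary memory T D for the partial word "नम".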
The Syllable analyzer S A analyses the set of words and syllables and
creates a list of sorted predictive strings of words and syllables and stores
these strings in a string storage SS.
A word display handler W D H handles these sorted strings in a separate
window [WINDOW] on the screen [DISPLAY] for the user's choice.
The word display handler W D H can accept the user's choice through a
selector [SELECTOR]. The selector also passes this information to the
frequency profiler F P, which updates the frequency ratings used by the syllable
analyzer S A so that the sorted predictive strings in the strings storage
are altered according to the habits of the user.
New words and uncommon frequencies and original construction and word formations are stored by the frequency profiler FP in the word data base W D B. The word data base therefore gets continuously appended with use.
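A minimal Python sketch of such a profiler is given below. It assumes the word database can be represented as a set of strings and that usage frequency is a simple count per word; the class name FrequencyProfiler and its register method are illustrative assumptions, not the implementation of this specification.

```python
from collections import Counter


class FrequencyProfiler:
    """Counts the words a user selects and appends unseen words to the database."""

    def __init__(self, word_db):
        self.word_db = word_db    # set of known words (stand-in for W D B)
        self.counts = Counter()   # per-user usage frequency

    def register(self, word):
        """Record one use of `word`; return True if the word was new."""
        self.counts[word] += 1
        is_new = word not in self.word_db
        if is_new:
            self.word_db.add(word)  # the database grows with use
        return is_new


# Hypothetical usage: one known word and one pet word of the user.
db = {"नमक"}
profiler = FrequencyProfiler(db)
known_was_new = profiler.register("नमक")
pet_was_new = profiler.register("नमकीन")
```

After these two calls the set db holds both words, and profiler.counts can be consulted when the predictive strings are re-sorted.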
A user may set the threshold of predictive choice with the help of the choice selector so that a certain number of predictive words or syllables are displayed in the window for predictive selection. Typically such a threshold can be three elements of the predictive string. The user also has a choice for
opting for a display in the window of full predictive words or only syllables. Accordingly, one or the other predictive string will be selected for displaying elements therefrom.
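The two user settings described above (the threshold and the choice between full words and syllables) might be applied as in the following Python sketch. The function name and parameter names are hypothetical; the default of three follows the typical threshold mentioned in the text.

```python
def elements_for_window(word_strings, syllable_strings,
                        threshold=3, show_full_words=True):
    """Pick the elements of the chosen predictive string to show in the window.

    Both input lists are assumed to be already sorted, best prediction first.
    """
    source = word_strings if show_full_words else syllable_strings
    return source[:threshold]


# Hypothetical sorted predictive strings for the partial word "नम".
words = ["नमक", "नमन", "नमस्ते", "नमकीन"]
syllables = ["नम", "नमस्"]
shown_words = elements_for_window(words, syllables)
shown_syllables = elements_for_window(words, syllables, show_full_words=False)
```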
Figure 2 of the accompanying drawings illustrates the syllable analyzer S A, which
processes inputs from the partial word and the temporary set of dictionary
words that start with the letters of the partial word.
The processor [PROCESSOR] takes input from the letter database to
identify the letters in the partial word. Matching words from the temporary
dictionary are parsed by the parser [PARSER] to determine the number of
syllables in the matching words.
The parser [PARSER] analyses the syllables in the partial word and the
matching words. A syllable is identified as explained below.
The Indian language alphabet set usually has the following characters:
1. Independent Vowels
2. Consonants
3. Matras (vowel endings that can be added to consonants)
4. Modifiers (Signs that can modify the vowel pronunciation)
5. Consonant Joiners
A syllable, which is a basic unit, can be of two kinds
1. Vowel Syllable
2. Consonant syllable
A Vowel syllable starts with an independent vowel. Only a modifier can follow an independent vowel.
A consonant syllable starts with a consonant. A consonant joiner or a matra or a modifier can follow a consonant.
Another consonant can follow a consonant joiner.
A modifier can follow a matra.
If a following character is not as per the rules mentioned above, it
forms a new syllable.
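The rules above can be sketched as a small transition table in Python. The sketch assumes Devanagari text encoded in Unicode and uses a deliberately simplified mapping of code point ranges to the five character classes (for example, only U+0901 to U+0903 are treated as modifiers and only the virama U+094D as the consonant joiner); the function and constant names are illustrative and do not come from this specification.

```python
INDEPENDENT_VOWEL, CONSONANT, MATRA, MODIFIER, JOINER, OTHER = range(6)


def char_class(ch):
    """Classify one Devanagari character (simplified, assumed ranges)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return INDEPENDENT_VOWEL
    if 0x0915 <= cp <= 0x0939:
        return CONSONANT
    if 0x093E <= cp <= 0x094C:
        return MATRA
    if 0x0901 <= cp <= 0x0903:
        return MODIFIER
    if cp == 0x094D:
        return JOINER
    return OTHER


# Which classes may follow each class inside the same syllable.
ALLOWED_NEXT = {
    INDEPENDENT_VOWEL: {MODIFIER},
    CONSONANT: {JOINER, MATRA, MODIFIER},
    JOINER: {CONSONANT},
    MATRA: {MODIFIER},
    MODIFIER: set(),
    OTHER: set(),
}


def split_syllables(word):
    """Split a word into syllables; a disallowed follower starts a new one."""
    syllables = []
    current = ""
    prev = None
    for ch in word:
        cls = char_class(ch)
        if current and cls in ALLOWED_NEXT[prev]:
            current += ch
        else:
            if current:
                syllables.append(current)
            current = ch
        prev = cls
    if current:
        syllables.append(current)
    return syllables


namaste = split_syllables("नमस्ते")   # three syllables: न, म, स्ते
hindi = split_syllables("हिंदी")      # two syllables: हिं, दी
```

Keeping the rules in a table rather than in nested conditions makes it straightforward to substitute the ranges of another Indic script.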
Based on input from the frequency profiler F P, the words for predictive suggestion are selected according to the following criteria:
1. Words with the minimum number of syllables are selected first.
2. If more than one word with the same minimum number of syllables is found, they are sorted by frequency, with higher-frequency strings arranged first.
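The two criteria can be expressed as a single sort key, as in the Python sketch below. It assumes each candidate is supplied as a (word, syllable count, frequency) tuple, and it orders the whole list by ascending syllable count and then by descending frequency, which is a slight generalisation of the criteria as stated. The function name and the sample counts are hypothetical.

```python
def rank_candidates(candidates):
    """Sort (word, syllable_count, frequency) tuples for predictive display.

    Fewer syllables come first; ties are broken by higher frequency.
    """
    return sorted(candidates, key=lambda c: (c[1], -c[2]))


# Hypothetical candidates with made-up frequencies.
candidates = [
    ("नमस्ते", 3, 40),
    ("नमक", 3, 75),
    ("नम", 2, 5),
]
ranked = rank_candidates(candidates)
ranked_words = [word for word, _, _ in ranked]
```

In this example the two-syllable word is placed first even though its frequency is lowest, and the two three-syllable words follow in order of frequency.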
A user, on seeing predictive elements in the WINDOW, may either select one of the predictive elements or carry on with the inputting of the text. Once a selection takes place, the resetting means [RESET] of the tool empties all the temporary storage registers in readiness for the next input.
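The reset step might look like the following Python sketch, which groups the temporary stores named in the description into one object. The class name, field names and sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TemporaryStores:
    """Stand-ins for T P W S, T D and S S, all emptied together on reset."""

    partial_word: str = ""
    matches: list = field(default_factory=list)
    sorted_strings: list = field(default_factory=list)

    def reset(self):
        """Empty every temporary store, ready for the next input."""
        self.partial_word = ""
        self.matches.clear()
        self.sorted_strings.clear()


stores = TemporaryStores(partial_word="नम",
                         matches=["नमक", "नमस्ते"],
                         sorted_strings=["नमक", "नमस्ते"])
stores.reset()  # called once the user has made a selection
```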
Claim:
[1] A predictive tool for Indic scripts consisting of
(1) a first stored database consisting of a dictionary of the concerned language containing all the commonly used and appendably stored words, each of the stored words being broken up into each of their respective syllables;
(2) a second stored data base consisting of all the alphabets and alphabet variations of the concerned language;
(3) a display means which can receive and display text having a main text area and a window in which the predictive tool can provide predictions;
(4) tool inputting means adapted to receive within the tool, from a general text inputting means such as a keyboard, the user key input;
(5) recognizer means adapted to refer to the letter data base and recognize the key input as at least one letter in the Indic script of a partial word with reference to the word database;
(6) a temporary partial word storage means in which the recognizer appends into memory a partial word inputted by a user;
(7) a temporary word and syllable storage means in which the recognizer is adapted to, after comparison of the partial word with words and syllables stored in the word database and creating a first sub set of whole words and a second sub set of syllables defined words from the word database formed commencing with the partial word, temporarily store the sub set of words and/or syllables fetched from the word database;
(8) a syllable analyzer adapted to read the sub sets of words and syllables in the temporary word database, analyze the sub set of words and syllables so read and sort and arrange the sub set of words into sorted predictive strings having elements of words and syllables in accordance with the rules of the language stored in the analyzer;
(9) a sorted strings storage means in which the sorted predictive strings of words and syllables can be temporarily stored;
(10) a frequency profiler adapted to register words and syllables inputted by a user using the tool, and further adapted to append any new or original words or syllable frequencies of the user in the word data base and still further adapted to send responses to the syllable analyzer to alter, if necessary, the sorted predictive strings in the sorted strings storage means;
(11) a word display handler in which elements of the amended sorted predictive strings are stored for predictive display;
(12) a settable selector means for enabling the user to select at least some of the elements of the amended sorted predictive strings in the word display handler in accordance with a set frequency threshold, typically from one to three, for predictive display in the window and an operation means for transferring one element satisfying the prediction displayed in the window to the main text for further creation of text in the Indic language; and
(13) a reset means for emptying all the temporary storage means for receiving the next iteration of input for prediction.
[2] A predictive tool for Indic scripts as claimed in claim 1, in which the syllable analyzer comprises
a. a processor for processing inputs from the temporary
word storage means, the letter database and the
frequency profiler;
b. a receiver means for receiving words and syllables
from the temporary word storage means;
c. a parser for parsing subsets of words and syllables
from the temporary word storage means for
analyzing the words and syllables and sorting the
words and syllables into predictive strings in
accordance with the rules of language and the
frequency of usage of the words and syllables in the
language and forming sorted predictive strings for
display, said parser further adapted to receive inputs
from a frequency profiler and further adapted to
reform the formed sorted strings in accordance with
inputs received from the frequency profiler; and
d. a choice selector for receiving inputs from the user
for selecting a threshold for display of the predictive
strings of words or syllables for creating text and
transferring the selected elements of the string to the
display.
[3] A predictive tool for indic scripts as claimed in claim 1, in which the frequency profiler is adapted to append the word data base with new and original words and word formations of a user.
Dated this 4th day of April, 2005.
MOHAN DEWAN OF R.K.DEWAN & COMPANY APPLICANTS' PATENT ATTORNEY