Mechanism For Abbreviating Or Compressing Messages In The Sms


Abstract: An apparatus for compression of "sms" messages in Indian languages is disclosed. The apparatus comprises a collection of 3000 to 10,000 commonly used vocabulary words of a language stored in a word data base; a collection of about 3000 to 5000 commonly used syllables of the said language stored in a syllable data base; a collection of commonly used phrases of the language stored in a phrase data base; a word analyzer; a syllable analyzer; and a phrase analyzer. Inputs are received from a first temporary register and a second temporary register, compared by the analyzers and replaced by tokens from the data bases. The message is sent in the form of tokens occupying fewer bytes than the original message. At the receiver end, the message is received in token form and decoded back to the original character message.


Patent Information

Application #

Filing Date

07 May 2004

Publication Number

44/2006

Publication Type

INA

Invention Field

ELECTRONICS

Status

Parent Application

Applicants

PENFOSYS PRIVATE LIMITED

291, SOMAWAR PETH, PUNE - 411 011

Inventors

1. KALA SRIRAM

OF PENFOSYS PVT. LTD., 291 SOMAWAR PETH PUNE 411 011

Specification

FORM-2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE
Specification
(See section 10 and rule 13)
AN APPARATUS FOR COMPRESSION OF 'SMS' MESSAGES
IN INDIAN LANGUAGES
PENFOSYS PRIVATE LIMITED, an Indian Company of 291, Somawar Peth, Pune 411 011, Maharashtra, India,

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED:-

Field of invention
This invention relates to sending and receiving of messages via hand held devices such as mobile phones.
Particularly, the invention relates to an apparatus for sending longer messages via the mobile phone.
Still particularly, this invention relates to the mechanism for compression of messages used in SMS to be sent in any of the Indian language scripts.
Description of prior art
Mobile phones are nowadays frequently used to send SMS (short message service) messages. The short message service, which was first introduced by European wireless network operators in 1991, enables mobile subscribers to easily send and receive text messages. Although specifications and industry standards related to SMS are constantly evolving and being modified, SMS messages have traditionally been used to convey readable text information, where the text can include any combination of alphanumeric characters. The SMS delivery service provides a mechanism for transmitting "short" messages to and from SMS-capable terminals.
A limitation of SMS is that the maximum length of a single message is around 160 characters. If the message exceeds 160 characters, the instrument breaks it into two or more messages and the sender has to pay for each of them. This limitation of the short message service arises because, by convention and to avoid congestion, a message is restricted to 160 bytes of digital information (for the user, one character is one byte). To send a message longer than 160 bytes, the message has to be broken up into two or more messages. Hence the name short message service.
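The cost of exceeding the limit can be illustrated with a little arithmetic. The sketch below assumes the flat 160-byte limit described above and ignores any per-segment overhead a real network may add; `messages_needed` is a hypothetical helper name used only for illustration.

```python
import math

SMS_LIMIT_BYTES = 160  # per-message limit as stated in the text above


def messages_needed(text_length_bytes: int, limit: int = SMS_LIMIT_BYTES) -> int:
    """Return how many separate messages a text of the given size is split into."""
    if text_length_bytes <= 0:
        return 0
    return math.ceil(text_length_bytes / limit)


print(messages_needed(160))  # 1
print(messages_needed(161))  # 2
print(messages_needed(400))  # 3
```

A single byte over the limit therefore doubles what the sender pays, which is the motivation for compressing the text before it is sent.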
To meet this 160-byte criterion, several mechanisms have been devised by users. For instance, users have developed their own short code of language. There are several short cuts, acronyms and abbreviations commonly used on mobile phones for writing in the English language. For example, "Please" is often understood when written as "plz". Again, the phrase 'before you' is typed in sms as 'b4u' and is well understood. Similarly, the words 'I love you' are typed as 'Ilu' and are well understood. Such short representations of English words are possible due to the nature of the English language. While a word in English is pronounced differently from the individual letters that make it up, Indian language words are pronounced as they are written. For example, the word "kamala" will be written in any Indian language as "ka + ma + la" and also pronounced as ka ma la, whereas a word in English, for example "say", is never pronounced as "s (es) + a (a) + y (wai)". The Indian scripts are phonetic in nature, so there is very little scope for creating any short forms. Also, it is not possible to compress Indian scripts by dropping letters, because even if one letter is missed out of a word it becomes difficult to understand the whole word. Again, in Indian languages each matra constitutes one character, and this further restricts the size of messages in Indian languages.
The different types of algorithm used for compressing messages are the Huffman compression algorithm, the Lempel-Ziv-Welch (LZW) data compression technique, the Winzip algorithm, the JPEG compression algorithm and the like. The LZW algorithm is a lossless algorithm but has the major disadvantage that it has a low rate of convergence. The Huffman compression algorithm and the JPEG algorithm are internal compression algorithms which have some major disadvantages:
• these are lossy algorithms which means the decompressed text isn't quite the same as the original one. The lossy compression is mainly used for speech, audio, image and video signals; and
• this type of compression consumes more battery power.
U.S. Pat. No. 5991713 discloses a method for compressing text which includes steps of parsing words from text in an input file and comparing the parsed words to a predetermined dictionary. The dictionary has a plurality of vocabulary words, numbers or tokens corresponding to each vocabulary word. The parsed words are replaced with numbers or tokens corresponding to the numbers assigned in the predetermined and supplemental dictionary.
U.S. Pat. No. 5204756 discloses a method for the compression and decompression of binary text images. The method distinguishes between large low-frequency areas and small high-frequency areas in the original frame. For the low-frequency areas, a scheme for lossy compression is used, whereas for the high-frequency areas, a scheme permitting lossless compression is applied.

U.S. Pat. No. 5373290 discloses a method for managing multiple dictionaries in content addressable memory based data compression. A class of lossless data compression algorithms uses a memory-based dictionary of finite size to facilitate the compression and decompression of data. To reduce the loss in data compression caused by dictionary resets, a standby dictionary is used to store a subset of encoded data entries previously stored in a current dictionary. The data is compressed/decompressed according to the address location of data entries contained within a dictionary built in a content addressable memory.
U.S. Pat. No. 5389922 discloses a method for compression using small dictionaries with applications to network packets. The invention is a dictionary initialization scheme adaptive to changes in the type and structure of input data. The compression ratio is increased by minimizing the number of data entries used to represent single characters in the input data.
U.S. Pat. No. 5870036 discloses a method for compression and decompression of data using a plurality of data compression mechanisms. The representative samples of each block of data are tested to select appropriate data compression mechanisms to be applied to the block. The block is then compressed using the selected mechanisms. The compressed block is provided with an identifier of the selected mechanism. For decompression, the identifier is examined to select appropriate data decompression mechanisms to be applied to the block.
U.S. Pat. No. 6850948 discloses a method for compressing textual documents, encoded using a tag-based markup language, such as XML or SGML documents, in a manner that allows a compressed document to be processed without decompression. A document is compressed using a standard compression algorithm that is applied only to the data elements of the document.
U.S. Pat. No. 6879271 discloses a method for performing adaptive data compression. An alphabet and vocabulary in the encoder and decoder is built adaptively and stored in a dictionary as symbols are to be encoded and decoded. Each time an unknown symbol is to be encoded by the encoder, the encoder adds the symbol to the dictionary and transmits it in plain in the encoded string.
However, the aforesaid inventions relevant to the data compression methods cannot be used to send messages on mobile phones and at any rate cannot be used to compress texts in Indian languages.
An object of this invention is to overcome the above disadvantages disclosed in the prior art.
Another object of this invention is to reduce the battery power consumption while sending SMS.
Yet another object of the invention is to allow the user to send longer SMS messages. An SMS can then be as long as a short e-mail.

Summary of Invention
The invention envisages a mechanism for abbreviating or compressing messages in the SMS used by mobile phones.
Particularly, the invention envisages the mechanism for compression of messages used in SMS to be sent in any of the Indian language scripts.
In order to realize an effective method, this invention provides a mechanism for automatically compressing or abbreviating SMS messages. The text compression technique according to this invention primarily overcomes the disadvantages of the prior art. The vocabulary commonly used by most people in any language is within five to ten thousand words. A principal feature of the invention is to create a dictionary of five thousand to ten thousand commonly used words in a particular language. The dictionary-based compression reduces the coding problem significantly. In accordance with this invention, a completely different method of compressing the message is used. The dictionary provides tokens for commonly used vocabulary words. One token corresponds to a unique vocabulary word. Tokens are also provided which correspond to some local geographical names like Jhumaritalaiya, Ooty and the like in the Indian language script and to commonly used personal names like Aditya, Zulfikar and the like. Each word is identified uniquely by a token code, so that each word can be uniquely recognized by the code associated with it in the decoding process. Such a database can be termed an identity database.
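The identity database described above can be pictured as a pair of look-up tables, one for encoding and one for decoding. The following is a minimal sketch, not the applicant's implementation: the function name `build_identity_database`, the starting code `0x0100` and the Latin transliterations standing in for Indian-script words are all assumptions made for illustration.

```python
def build_identity_database(words, first_token=0x0100):
    """Assign one unique two-byte token code to each distinct word.

    Returns (word -> token, token -> word). The starting code is an
    arbitrary assumption; the source only requires the codes to be unique.
    """
    word_to_token = {}
    for word in words:
        if word not in word_to_token:
            token = first_token + len(word_to_token)
            if token > 0xFFFF:
                raise ValueError("token no longer fits in two bytes")
            word_to_token[word] = token
    token_to_word = {token: word for word, token in word_to_token.items()}
    return word_to_token, token_to_word


# Duplicate entries are ignored, so every word keeps exactly one code.
vocabulary = ["kamala", "Ooty", "Aditya", "Zulfikar", "kamala"]
encode_table, decode_table = build_identity_database(vocabulary)
print(len(encode_table))                     # 4
print(decode_table[encode_table["Ooty"]])    # Ooty
```

Because the mapping is one-to-one, the same table read in the opposite direction is enough to decode a message at the receiving end.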

Several benefits arise by the use of the text compression technique according to the present invention.
According to this invention therefore there is provided an apparatus for compression of 'sms' messages in Indian languages, said apparatus comprising
(a) a collection of 3000 to 10,000 commonly used vocabulary words of a language stored in a word data base along with a unique corresponding word token consisting of two bytes;
(b) a collection of about 3000 to 5000 commonly used syllables of the said language stored in a syllable data base along with a unique corresponding syllable token consisting of two bytes;
(c) a collection of commonly used phrases of the language stored in a phrase data base along with a unique corresponding phrase token consisting of two bytes;
(d) an input box for receiving characters for messaging;
(e) a first temporary register in which input characters received in the input box can be temporarily stored;
(f) a word analyzer means which is adapted to read the characters in the first temporary register, analyze the characters so read and compare the characters with words stored in the word data base and on finding a match adapted to replace the word in the first temporary register with a token corresponding to the read word;
(g) a fuzzy analyzer means adapted to receive signals from the word analyzer in the absence of a read word and further adapted to analyze the characters to ascertain if the word is a misspelt or miswritten word and if so found further adapted to auto correct the so misspelt or miswritten word and transfer the same to the word analyzer for token coding;
(h) a syllable analyzer adapted to receive signals from the word analyzer in the absence of a read word, adapted to read the characters as syllables and compare the read syllables with syllables stored in the syllable data base and on finding a match adapted to replace the syllable in the first temporary register with a token corresponding to the read syllable;
(i) a second temporary register which is adapted to receive characters or tokens from the first temporary register;
(j) a phrase analyzer adapted to parse the tokens in the second temporary register and further adapted to compare the group of parsed tokens with tokens stored in the phrase data base for a group of tokens stored as a single phrase token and further adapted to replace a group of tokens in the second temporary register with a phrase token; and
(k) an output box adapted to receive the contents of the second temporary register and further adapted to signal the second and the first temporary registers for their resetting and still further adapted to transfer the compressed message to the send box of a message transmission device.
Particularly, the apparatus may include a plurality of language data bases and a selector means is provided to select a particular language.
Brief Description of the Drawings
Figure 1 illustrates a block diagram of the apparatus according to the present invention;

Figure 2 illustrates an example of the degree of compression achieved according to the present invention; and
Figure 3 illustrates another example of the degree of compression.
Detailed Description of the Preferred Embodiment
Referring to the drawings,
Figure 1 depicts the block diagram of the apparatus of this invention. This figure explains the basic blocks needed for message compression. The first block is the IN block which receives message/text (collection of words) from the create box of the mobile phone from where the message is sent to the temporary register 1 (TR1).
Typically temporary register 1 (TR1), as the name indicates, is a temporary register where the message is stored temporarily. The TR1 receives the character and adds this received character to the previously received character (if any) to form a word which is sent to a word analyzer (WA).
Typically the word analyzer (WA) compresses a word by replacing it with the corresponding token code from a word database (WDB), in which words and their corresponding tokens are stored. The database holds a plurality of commonly used vocabulary words and numbers, together with the token corresponding to each of them. Each word is identified uniquely by a code, so that each word can be uniquely recognized by the code associated with it. Such a database can be termed the word database (WDB).
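The interplay of TR1 and the word analyzer can be sketched as follows. This is an illustrative assumption rather than the disclosed circuit: the class name `WordAnalyzer`, the use of a space as the word boundary, the sample entries in `WORD_DB` and the Latin transliterations are all hypothetical.

```python
WORD_DB = {"namaste": 0x0101, "kamala": 0x0102}  # hypothetical word -> token entries


class WordAnalyzer:
    """Accumulates characters in TR1 and swaps complete words for tokens."""

    def __init__(self, word_db):
        self.word_db = word_db
        self.tr1 = ""      # first temporary register: characters of the current word
        self.output = []   # tokens (int) for known words, plain text (str) otherwise

    def feed(self, char):
        if char == " ":    # assumed word boundary
            self._flush()
        else:
            self.tr1 += char

    def _flush(self):
        if self.tr1:
            # Replace the word with its token if the word database knows it.
            self.output.append(self.word_db.get(self.tr1, self.tr1))
            self.tr1 = ""

    def finish(self):
        self._flush()
        return self.output


analyzer = WordAnalyzer(WORD_DB)
for ch in "namaste kamala ji":
    analyzer.feed(ch)
result = analyzer.finish()
print(result)  # [257, 258, 'ji']
```

The word "ji" is absent from the sample database, so it is left as text for the later stages to handle.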

The word can also be referred to a fuzzy analyzer (FA). Typically the FA is based on an auto-correct algorithm which matches the common confusions of vowels, accent markers and confusing consonants or characters, tries to find the closest match and auto-corrects a possibly misspelled word. However, if the user does not want auto-correct, it can be switched off.
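The specification does not give the auto-correct algorithm itself. As a stand-in, the sketch below uses generic string similarity from Python's standard `difflib` module instead of the vowel and consonant confusion rules mentioned above; the function name `fuzzy_correct`, the `enabled` switch and the 0.8 cutoff are assumptions.

```python
import difflib


def fuzzy_correct(word, known_words, enabled=True, cutoff=0.8):
    """Return the closest known word, or None when nothing is close enough.

    `enabled=False` models the user switching auto-correct off.
    """
    if not enabled:
        return None
    matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["namaste", "kamala"]
print(fuzzy_correct("namste", known))                 # namaste
print(fuzzy_correct("xyz", known))                    # None
print(fuzzy_correct("namste", known, enabled=False))  # None
```

A corrected word would then be handed back to the word analyzer for token coding, as described above.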
Any Indian language makes use of about three thousand to five thousand syllables as units from which words can be formed. Syllables are valid pronounceable units which consist of one or more letters. Typically a syllable analyzer (SA) replaces each syllable used in the message by a corresponding code from a syllable database (SDB). Typically the SDB consists of a collection of all such valid syllables, where a unique token code is provided to each syllable which is greater than one character in length.
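One plausible way to apply the syllable database is a greedy longest-match scan over the word. The matching strategy is an assumption, since the text does not specify one; `SYLLABLE_DB`, its token values and the Latin transliterations are hypothetical. Single characters are never tokenised, in line with the statement that only syllables longer than one character receive a code.

```python
SYLLABLE_DB = {"ka": 0x2001, "ma": 0x2002, "la": 0x2003, "shri": 0x2004}  # hypothetical


def encode_syllables(word, syllable_db):
    """Greedily replace the longest known syllable at each position with its token."""
    max_len = max(map(len, syllable_db))
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate first; stop at length 2 (no single characters).
        for size in range(min(max_len, len(word) - i), 1, -1):
            piece = word[i:i + size]
            if piece in syllable_db:
                out.append(syllable_db[piece])
                i += size
                break
        else:
            out.append(word[i])  # no multi-character syllable here: keep the character
            i += 1
    return out


print(encode_syllables("kamala", SYLLABLE_DB))  # [8193, 8194, 8195]
print(encode_syllables("kamalx", SYLLABLE_DB))  # [8193, 8194, 'l', 'x']
```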
Temporary register 2 (TR2) receives the compressed message from TR1 and stores the message temporarily.
Certain sequences of words may form phrases. When the message is forwarded to a phrase analyzer, it selects the phrases from the message and refers them to a phrase database (PDB). If certain sequences of words are found as phrases, they are replaced by the corresponding codes from the phrase database (PDB). The punctuation marks are not compressed. Typically the phrase database relies on conventions such as the fact that words are frequently followed by spaces, so the spaces are not included in the compressed message.
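Phrase replacement operates on the output of the earlier stages rather than on raw characters. The sketch below assumes that the phrase database is keyed by the sequence of word tokens making up the phrase; that keying, the name `encode_phrases` and the sample token values are illustrative assumptions.

```python
# Hypothetical entry: the word tokens 0x0101 followed by 0x0102 form one phrase.
PHRASE_DB = {(0x0101, 0x0102): 0x3001}


def encode_phrases(items, phrase_db):
    """Replace known runs of tokens with a single phrase token."""
    max_len = max(map(len, phrase_db))
    out, i = [], 0
    while i < len(items):
        for size in range(min(max_len, len(items) - i), 1, -1):
            key = tuple(items[i:i + size])
            if key in phrase_db:
                out.append(phrase_db[key])
                i += size
                break
        else:
            out.append(items[i])  # not part of a phrase (including punctuation): keep as is
            i += 1
    return out


stream = [0x0101, 0x0102, "!", 0x0101]
print(encode_phrases(stream, PHRASE_DB))  # [12289, '!', 257]
```

Punctuation passes through untouched, matching the statement that punctuation marks are not compressed.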

Typically the IN block receives the input characters for the creation of the message/text. Each character is sent to temporary register 1 (TR1). The message is compressed through four levels. At the first level, the word is forwarded to the word analyzer (WA), from where it is referred to the word database (WDB). If the word is found in the WDB, the word identity is fetched and forwarded to TR1 through the WA. If the word is not found, a fresh look-up is done at the second level by forwarding the word to the fuzzy analyzer (FA). If the FA finds a match, the corrected word is forwarded to TR1 and sent once again to the word analyzer, which then converts the word to its corresponding token. If the word is not found by the FA, it is passed back to the WA, from where it is forwarded to the third level of compression, i.e. the syllable analyzer (SA). The SA refers all the words which are not compressed through the WDB to a syllable database (SDB). The syllables are replaced with the codes of syllables from the SDB. All the characters, syllables, words and phrases which are not found are retained as such. The abbreviated or compressed word is then passed to temporary register 2 (TR2), from where it is forwarded to the fourth level of compression, i.e. the phrase analyzer (PA). If certain sequences of words are found as phrases, they are replaced by corresponding codes from the phrase database (PDB). All other punctuation is retained. During decompression, wherever separate punctuation is not found, a default space is appended to each word. The compressed phrases are combined to form the compressed message, which is held in temporary register 2 (TR2), from where the message is finally passed to the send box of the mobile phone.
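The order in which the four levels are tried can be summarised in a compact sketch. Everything here is a simplification under stated assumptions: the databases are tiny hypothetical samples in Latin transliteration, fuzzy matching is approximated with `difflib`, syllables and phrases are limited to two units for brevity, and word boundaries between syllable-coded words are not tracked.

```python
import difflib

WORD_DB = {"good": 0x0101, "morning": 0x0102}             # hypothetical samples
SYLLABLE_DB = {"ka": 0x2001, "ma": 0x2002, "la": 0x2003}
PHRASE_DB = {(0x0101, 0x0102): 0x3001}


def encode_word(word, fuzzy=True):
    # Level 1: exact look-up in the word database.
    if word in WORD_DB:
        return [WORD_DB[word]]
    # Level 2: fuzzy analyzer, then back to the word database.
    if fuzzy:
        close = difflib.get_close_matches(word, list(WORD_DB), n=1, cutoff=0.8)
        if close:
            return [WORD_DB[close[0]]]
    # Level 3: syllable analyzer (two-character syllables only in this sketch).
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in SYLLABLE_DB:
            out.append(SYLLABLE_DB[pair])
            i += 2
        else:
            out.append(word[i])  # unmatched characters are retained as such
            i += 1
    return out


def compress(message, fuzzy=True):
    tr2 = []  # second temporary register
    for word in message.split():
        tr2.extend(encode_word(word, fuzzy))
    # Level 4: phrase analyzer (two-token phrases only in this sketch).
    out, i = [], 0
    while i < len(tr2):
        pair = tuple(tr2[i:i + 2])
        if pair in PHRASE_DB:
            out.append(PHRASE_DB[pair])
            i += 2
        else:
            out.append(tr2[i])
            i += 1
    return out


print(compress("good mornin kamala"))  # [12289, 8193, 8194, 8195]
```

In this run "mornin" is corrected to "morning" at level 2, "good morning" collapses into one phrase token at level 4, and "kamala" falls through to the syllable level.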

The apparatus may include a plurality of language data bases and a selector means [not shown] is provided to select a particular language.
The same mechanism for abbreviating or compressing messages in SMS, according to the present invention, is installed at the receiver's end, and by using the same set of databases the compressed message is decompressed for reading.
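Decoding at the receiver is the reverse look-up against the shared databases. The sketch below is an assumption about how that could work: the compressed message is modelled as a list mixing integer tokens with uncompressed text, `TOKEN_TO_TEXT` is a hypothetical shared table, and a default space is appended after each item unless punctuation follows.

```python
TOKEN_TO_TEXT = {0x0101: "good", 0x0102: "morning", 0x3001: "good morning"}  # hypothetical
PUNCTUATION = set(".,!?;:")


def decompress(items, token_to_text):
    """Expand tokens back to text, appending a default space after each item."""
    parts = []
    for item in items:
        text = token_to_text[item] if isinstance(item, int) else item
        if text in PUNCTUATION and parts:
            parts[-1] = parts[-1].rstrip()  # punctuation attaches to the previous word
        parts.append(text + " ")
    return "".join(parts).strip()


print(decompress([0x3001, "!", 0x0101], TOKEN_TO_TEXT))  # good morning! good
```

Decoding only succeeds if both handsets hold identical databases, since an unknown token has no text to expand to.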
Figures 2 and 3 illustrate, for generic text in the Hindi language, how many bytes of memory the text requires before compression and after compression according to the present invention. On average, a single Hindi word that is provided with a token in accordance with this invention consumes two bytes of memory. According to the present invention, the words are replaced by token codes and one code takes two bytes; in this fashion there are a total of 2^16 (i.e. 65536) possible unique combinations/codes. For instance, the two bytes 11110011 and 10001101 together form a token for a word. Independent tokens are given for words, numbers, names, syllables, phrases and punctuation. By the aforesaid method there are 65536 possible tokens, which can be used for the effective compression of the message to be sent as SMS.
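The two-byte token arithmetic can be checked directly. The sketch assumes big-endian byte order (first byte most significant), which the text does not specify; `pack_token` and `unpack_token` are hypothetical helper names.

```python
import struct

TOKEN_SPACE = 2 ** 16  # number of distinct two-byte codes


def pack_token(token):
    """Serialise a token as two bytes, most significant byte first (assumed order)."""
    return struct.pack(">H", token)


def unpack_token(data):
    """Read a two-byte token back into an integer."""
    return struct.unpack(">H", data)[0]


example = bytes([0b11110011, 0b10001101])  # the two bytes quoted in the text
print(TOKEN_SPACE)             # 65536
print(unpack_token(example))   # 62349
print(pack_token(62349) == example)  # True
```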
The dictionary therefore consists of commonly used vocabulary words, numbers and provides tokens corresponding to each vocabulary word. In figure 2 and 3, each word in the generic text has a token assigned to it. Figure 2 depicts six words and a punctuation mark. After the message compression all the words and the punctuation mark are assigned tokens T1, T2, T3, T4, T5, T6 and T7. Similarly in figure 3, there are five words and a punctuation mark, after the message compression all the words and the punctuation mark are assigned tokens from T8 to T13. As can be seen the message in figure 2 is compressed from 2 bytes to 13 bytes and the message in figure 3 is compressed from 29 bytes to 11 bytes.
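The compressed sizes quoted for the two figures can be reproduced with simple arithmetic. The sketch assumes two bytes per word token and one byte for the punctuation mark; the one-byte figure is an inference from the stated totals of 13 and 11 bytes, not something the text states outright. The helper names are hypothetical.

```python
WORD_TOKEN_BYTES = 2
PUNCTUATION_BYTES = 1  # assumption inferred from the stated totals


def compressed_size(words, punctuation_marks):
    """Bytes needed when every word becomes one token and punctuation stays as is."""
    return words * WORD_TOKEN_BYTES + punctuation_marks * PUNCTUATION_BYTES


def saving_percent(original_bytes, compressed_bytes):
    """Percentage of the original size removed by compression."""
    return 100 * (original_bytes - compressed_bytes) / original_bytes


print(compressed_size(6, 1))              # 13 (figure 2: six words, one punctuation mark)
print(compressed_size(5, 1))              # 11 (figure 3: five words, one punctuation mark)
print(round(saving_percent(29, 11), 1))   # 62.1 (figure 3: 29 bytes down to 11)
```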

We Claim:
[1] An apparatus for compression of 'sms' messages in Indian languages, said apparatus comprising
(a) a collection of 3000 to 10,000 commonly used vocabulary words of a language stored in a word data base along with a unique corresponding word token consisting of two bytes;
(b) a collection of about 3000 to 5000 commonly used syllables of the said language stored in a syllable data base along with a unique corresponding syllable token consisting of two bytes;
(c) a collection of commonly used phrases of the language stored in a phrase data base along with a unique corresponding phrase token consisting of two bytes;
(d) an input box for receiving characters for messaging;
(e) a first temporary register in which input characters received in the input box can be temporarily stored;
(f) a word analyzer means which is adapted to read the characters in the first temporary register, analyze the characters so read and compare the characters with words stored in the word data base and on finding a match adapted to replace the word in the first temporary register with a token corresponding to the read word;
(g) a fuzzy analyzer means adapted to receive signals from the word analyzer in the absence of a read word and further adapted to analyze the characters to ascertain if the word is a misspelt or miswritten word and if so found further adapted to auto correct the so misspelt or miswritten word and transfer the same to the word analyzer for token coding;
(h) a syllable analyzer adapted to receive signals from the word analyzer in the absence of a read word, adapted to read the characters as syllables and compare the read syllables with syllables stored in the syllable data base and on finding a match adapted to replace the syllable in the first temporary register with a token corresponding to the read syllable;
(i) a second temporary register which is adapted to receive characters or tokens from the first temporary register;
(j) a phrase analyzer adapted to parse the tokens in the second temporary register and further adapted to compare the group of parsed tokens with tokens stored in the phrase data base for a group of tokens stored as a single phrase token and further adapted to replace a group of tokens in the second temporary register with a phrase token; and
(k) an output box adapted to receive the contents of the second temporary register and further adapted to signal the second and the first temporary registers for their resetting and still further adapted to transfer the compressed message to the send box of a message transmission device.
[2] An apparatus for compression of 'sms' messages as claimed in claim 1, in which each token is of two bytes length.
[3] An apparatus for compression of 'sms' messages as claimed in claim 1, in which the apparatus includes a plurality of language data bases and a selector means is provided to select a particular language.
[4] An apparatus for compression of 'sms' messages as claimed in claim 1, in which the fuzzy analyzer can be switched off.

[5] An apparatus for compression of 'sms' messages as described herein with reference to the accompanying drawings.

Dated this 28th day of April, 2005.

MOHAN DEWAN OF R.K. DEWAN & COMPANY, APPLICANTS' PATENT ATTORNEY