Abstract: The present disclosure relates to a system for matching at least one proper noun, said system comprising one or more processors to receive said at least one proper noun in a first language; process matching of said at least one proper noun with a plurality of words of the first language that are generated in real-time from a database that stores language words in a second language, said plurality of words being generated based on phonetic conversion of each syllable of language words of said second language, and combination of the syllables of said second language words to form components having one or more words in the first language, wherein, for each rule-based match between said at least one proper noun and said plurality of words, a value is assigned to corresponding word in a manner such that the word having highest value is chosen as the matching word.
[0001] The present disclosure relates to a system and method for matching proper
nouns/names across different languages.
BACKGROUND
[0002] The background description includes information that may be useful in
understanding the present invention. It is not an admission that any of the information provided
herein is prior art or relevant to the presently claimed invention, or that any publication
specifically or implicitly referenced is prior art.
[0003] In today's day and age, the vast majority of businesses retain extensive amounts
of data regarding various aspects of their operations, such as inventories, customers, products,
etc. Data about entities, such as people, products, parts or anything else may be stored in digital
format in a data store such as a computer database. These computer databases permit the data
about an entity to be accessed rapidly and permit the data to be cross-referenced to other
relevant pieces of data about the same entity. The databases also permit a person to query the
database to find data records pertaining to a particular entity, such that data records from
various data stores pertaining to the same entity may be associated with one another.
[0004] A data store, however, has several limitations which may limit the ability to find
the correct data about an entity within the data store. The actual data within the data store is
only as accurate as the person who entered the data, or an original data source. Thus, a mistake
in the entry of the data into the data store may cause a search for data about an entity in the
database to miss relevant data about the entity because, for example, a last name of a person
was misspelled or a social security number was entered incorrectly, etc. A whole host of these
types of problems may be imagined: two separate records for an entity that already has a record
within the database may be created such that several data records may contain information
about the same entity, but, for example, the names or identification numbers contained in the
two data records may be different so that it may be difficult to associate the data records
referring to the same entity with one another.
[0005] For a business that operates one or more data stores containing a large number
of data records, the ability to locate relevant information about a particular entity within and
among the respective databases is very important, but not easily obtained. Once again, any
mistake in the entry of data (including without limitation the creation of more than one data
record for the same entity) at any information source may cause relevant data to be missed
when the data for a particular entity is searched for in the database. In addition, in cases
involving multiple information sources, each of the information sources may have slightly
different data syntax or formats which may further complicate the process of finding data
among the databases. An example of the need to properly identify an entity referred to in a data
record and to locate all data records relating to an entity in the health care field is one in which
a number of different hospitals associated with a particular health care organization may have
one or more information sources containing information about their patients, and a health care
organization collects the information from each of the hospitals into a master database. It is
necessary to link data records from all of the information sources pertaining to the same patient
to enable searching for information for a particular patient in all of the hospital records.
[0006] There are several problems which limit the ability to find all of the relevant data
about an entity in such a database. Multiple data records may exist for a particular entity as a
result of separate data records received from one or more information sources, which leads to
a problem that can be called data fragmentation. In the case of data fragmentation, a query of
the master database may not retrieve all of the relevant information about a particular entity. In
addition, as described above, the query may miss some relevant information about an entity
due to a typographical error made during data entry, which leads to the problem of data
inaccessibility. In addition, a large database may contain data records which appear to be
identical, such as a plurality of records for people with the last name of Smith and the first
name of Jim. A query of the database will retrieve all of these data records and a person who
made the query to the database may often choose, at random, one of the data records retrieved
which may be the wrong data record. The person typically does not attempt to determine
which of the records is appropriate. This can lead to the data records for the wrong entity being
retrieved even when the correct data records are available. These problems limit the ability to
locate the information for a particular entity within the database.
[0007] To reduce the amount of data that must be reviewed, and prevent the user from
picking the wrong data record, it is also desirable to identify and associate data records from
the various information sources that may contain information about the same entity. There are
conventional systems that locate duplicate data records within a database and delete those
duplicate data records, but these systems may only locate data records which are substantially
identical to each other. Thus, these conventional systems cannot determine if two data records,
with, for example, slightly different last names, nevertheless contain information about the
same entity. In addition, these conventional systems do not attempt to index data records from
a plurality of different information sources, locate data records within the one or more
information sources containing information about the same entity, and link those data records
together. Consequently, it would be desirable to be able to associate data records from a
plurality of information sources which pertain to the same entity, despite discrepancies between
attributes of these data records and be able to assemble and present information from these
various data records in a cohesive manner. In practice, however, it can be extremely difficult
to provide an accurate, consolidated view of information from a plurality of information
sources. The challenge can be even greater in some cases where data records from the plurality
of information sources contain more than one language.
[0008] While matching names across two different languages, the biggest problem
observed is that the names are slightly modified. The order of the first, middle or last name
may be changed, or, based on regional naming conventions, some parts may be added or deleted
from a name, yet the two name variations need to be recognized as belonging to the same person.
It is important for the algorithm to develop the capability to recognize variations as belonging
to the same person otherwise the purpose of matching the names (to join the two databases)
will not be achieved.
[0009] There is therefore a need in the art to provide a written text conversion system
and method that seeks to overcome or at least ameliorate one or more of the above-mentioned
problems and other limitations of the existing solutions and utilize techniques, which are
robust, accurate, fast, time-efficient, cost-effective and simple.
OBJECTS OF THE PRESENT DISCLOSURE
[0010] Some of the objects of the present disclosure, which at least one embodiment
herein satisfies are as listed herein below.
[0011] It is an object of the present disclosure to provide a system and method for
matching proper nouns/names across different languages.
[0012] It is an object of the present disclosure to provide a system and method for the
conversion of text from one language to another.
[0013] It is an object of the present disclosure to provide a system and method for the
recognition of written text that is cost effective and easy to implement.
[0014] It is an object of the present disclosure to provide a system and method for the
recognition of written text that enhances adaptability to new texts.
SUMMARY
[0015] The present disclosure relates to a system for matching at least one proper noun
across different languages, wherein the system can include one or more processors that are
associated with a text search engine that runs on a computing device. In an aspect, the one or
more processors can be operatively coupled with a memory that stores one or more executable
instructions in a manner such that when the one or more processors execute a part of the one or more
executable instructions, the text search engine can receive the at least one proper noun in a first
language, and then process matching of the at least one proper noun with a plurality of words
of the first language that are generated in real-time from a database that stores language words
in a second language. The plurality of words can be generated in real-time from the
corresponding language words based on phonetic conversion of each syllable of language
words of the second language, and combination of the syllables of the second language words
to form components having one or more words in the first language. In an aspect, during the
matching, for each rule-based match between the at least one proper noun and the plurality of
words, a numerical value is assigned to the corresponding word of the plurality of words in a
manner such that the word of the plurality of words that has the highest numerical value can be
chosen as the matching word, and its corresponding language word in the database can be
determined as the matching language word, and wherein the rule-based match can be
configured to match words based on a combination of first letter match, sequential letter match,
and order of names.
[0016] In an aspect, post matching, each respective word of the plurality of words,
based on a benchmark set for the extent of matching, can be classified in any class selected
from “matching”, “partially-matching”, and “not matching”. In another aspect, the benchmark
is revisable, i.e., it can be adjusted to vary the extent of matching required for classification.
[0017] In yet another aspect, the at least one proper noun in the first language can be
received by the text search engine from a second database that stores words in the first
language. In another aspect, the at least one proper noun in the first language can be retrieved
from said second database in an iterative manner.
[0018] In an aspect, the first language can be English, and the second language can be
any of Telugu, Marathi, Bengali, Assamese, Tamil, and Odiya, among other languages.
[0019] The present disclosure further relates to a method for matching at least one
proper noun across different languages, said method comprising the steps of: receiving, through
one or more processors of a text search engine that is configured in a computing device, said
at least one proper noun in a first language; matching said at least one proper noun with a
plurality of words of the first language that are generated in real-time from a database that
stores language words in a second language, said plurality of words being generated in real-time
from said corresponding language words based on phonetic conversion of each syllable
of language words of said second language, and combination of the syllables of said second
language words to form components having one or more words in the first language, wherein,
during the matching, for each rule-based match between said at least one proper noun and said
plurality of words, a numerical value is assigned to corresponding word of said plurality of
words, and wherein the rule-based match is configured to match words based on a combination
of first letter match, sequential letter match, and order of names; and choosing a word from
said plurality of words that has the highest numerical value as the matching word, and marking
its corresponding language word in said database as the matching language word for said at
least one proper noun.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] In the figures, similar components and/or features may have the same reference
label. Further, various components of the same type may be distinguished by following the
reference label with a second label that distinguishes among the similar components. If only
the first reference label is used in the specification, the description is applicable to any one of
the similar components having the same first reference label irrespective of the second
reference label.
[0021] FIG. 1 illustrates an exemplary network architecture in which or with which
the proposed system can be implemented in accordance with an embodiment of the present
disclosure.
[0022] FIG. 2 illustrates an exemplary module diagram for the recognition of written
text in accordance with an embodiment of the present disclosure.
[0023] FIG. 3 is a flow diagram illustrating a process for the recognition of written text
in accordance with an embodiment of the present disclosure.
[0024] FIG. 4 illustrates an exemplary computer system in which or with which
embodiments of the present invention can be utilized in accordance with embodiments of the
present disclosure.
DETAILED DESCRIPTION
[0025] The following is a detailed description of embodiments of the disclosure
depicted in the accompanying drawings. The embodiments are in such detail as to clearly
communicate the disclosure. However, the amount of detail offered is not intended to limit the
anticipated variations of embodiments; on the contrary, the intention is to cover all
modifications, equivalents, and alternatives falling within the spirit and scope of the present
disclosure as defined by the appended claims.
[0026] In the following description, numerous specific details are set forth in order to
provide a thorough understanding of embodiments of the present invention. It will be apparent
to one skilled in the art that embodiments of the present invention may be practiced without
some of these specific details.
[0027] Embodiments of the present invention include various steps, which will be
described below. The steps may be performed by hardware components or may be embodied
in machine-executable instructions, which may be used to cause a general-purpose or
special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps
may be performed by a combination of hardware, software, and firmware and/or by human
operators.
[0028] Various methods described herein may be practiced by combining one or more
machine-readable storage media containing the code according to the present invention with
appropriate standard computer hardware to execute the code contained therein. An apparatus
for practicing various embodiments of the present invention may involve one or more
computers (or one or more processors within a single computer) and storage systems containing
or having network access to computer program(s) coded in accordance with various methods
described herein, and the method steps of the invention could be accomplished by modules,
routines, subroutines, or subparts of a computer program product.
[0029] If the specification states a component or feature “may”, “can”, “could”, or
“might” be included or have a characteristic, that particular component or feature is not
required to be included or have the characteristic.
[0030] As used in the description herein and throughout the claims that follow, the
meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates
otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on”
unless the context clearly dictates otherwise.
[0031] Exemplary embodiments will now be described more fully hereinafter with
reference to the accompanying drawings, in which exemplary embodiments are shown. These
exemplary embodiments are provided only for illustrative purposes and so that this disclosure
will be thorough and complete and will fully convey the scope of the invention to those of
ordinary skill in the art. The invention disclosed may, however, be embodied in many different
forms and should not be construed as limited to the embodiments set forth herein. Various
modifications will be readily apparent to persons skilled in the art. The general principles
defined herein may be applied to other embodiments and applications without departing from
the spirit and scope of the invention. Moreover, all statements herein reciting embodiments of
the invention, as well as specific examples thereof, are intended to encompass both structural
and functional equivalents thereof. Additionally, it is intended that such equivalents include
both currently known equivalents as well as equivalents developed in the future (i.e., any
elements developed that perform the same function, regardless of structure). Also, the
terminology and phraseology used is for the purpose of describing exemplary embodiments
and should not be considered limiting. Thus, the present invention is to be accorded the widest
scope encompassing numerous alternatives, modifications, and equivalents consistent with the
principles and features disclosed. For the purpose of clarity, details relating to technical
material that is known in the technical fields related to the invention have not been described
in detail so as not to unnecessarily obscure the present invention.
[0032] Thus, for example, it will be appreciated by those of ordinary skill in the art that
the diagrams, schematics, illustrations, and the like represent conceptual views or processes
illustrating systems and methods embodying this invention. The functions of the various
elements shown in the figures may be provided through the use of dedicated hardware as well
as hardware capable of executing associated software. Similarly, any switches shown in the
figures are conceptual only. Their function may be carried out through the operation of program
logic, through dedicated logic, through the interaction of program control and dedicated logic,
or even manually, the particular technique being selectable by the entity implementing this
invention. Those of ordinary skill in the art further understand that the exemplary hardware,
software, processes, methods, and/or operating systems described herein are for illustrative
purposes and, thus, are not intended to be limited to any particular named element.
[0033] Embodiments of the present invention may be provided as a computer program
product, which may include a machine-readable storage medium tangibly embodying thereon
instructions, which may be used to program the computer (or other electronic devices) to
perform a process. The term “machine-readable storage medium” or “computer-readable
storage medium” includes, but is not limited to, fixed (hard) drives, magnetic tape, floppy
diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical
disks, semiconductor memories, such as ROMs, random access memories (RAMs),
programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically
erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of
media/machine-readable medium suitable for storing electronic instructions (e.g., computer
programming code, such as software or firmware). A machine-readable medium may include a
non-transitory medium in which data may be stored and that does not include carrier waves
and/or transitory electronic signals propagating wirelessly or over wired connections.
Examples of a non-transitory medium may include but are not limited to, a magnetic disk or
tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash
memory, memory or memory devices. A computer program product may include code and/or
machine-executable instructions that may represent a procedure, a function, a subprogram, a
program, a routine, a subroutine, a module, a software package, a class, or any combination of
instructions, data structures, or program statements. A code segment may be coupled to another
code segment or a hardware circuit by passing and/or receiving information, data, arguments,
parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed,
forwarded, or transmitted via any suitable means including memory sharing, message passing,
token passing, network transmission, etc.
[0034] Furthermore, embodiments may be implemented by hardware, software,
firmware, middleware, microcode, hardware description languages, or any combination
thereof. When implemented in software, firmware, middleware or microcode, the program
code or code segments to perform the necessary tasks (e.g., a computer-program product) may
be stored in a machine-readable medium. A processor(s) may perform the necessary tasks.
[0035] Systems depicted in some of the figures may be provided in various
configurations. In some embodiments, the systems may be configured as a distributed system
where one or more components of the system are distributed across one or more networks in a
cloud computing system.
[0036] Each of the appended claims defines a separate invention, which for
infringement purposes is recognized as including equivalents to the various elements or
limitations specified in the claims. Depending on the context, all references below to the
"invention" may in some cases refer to certain specific embodiments only. In other cases, it
will be recognized that references to the "invention" will refer to subject matter recited in one
or more, but not necessarily all, of the claims.
[0037] All methods described herein may be performed in any suitable order unless
otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all
examples, or exemplary language (e.g., “such as”) provided with respect to certain
embodiments herein is intended merely to better illuminate the invention and does not pose a
limitation on the scope of the invention otherwise claimed. No language in the specification
should be construed as indicating any non-claimed element essential to the practice of the
invention.
[0038] Various terms as used herein are shown below. To the extent a term used in a
claim is not defined below, it should be given the broadest definition persons in the pertinent
art have given that term as reflected in printed publications and issued patents at the time of
filing.
[0039] The present disclosure relates to a text input and prediction system. More
particularly, the present disclosure relates to systems and methods for recognition of written
text.
[0040] The present disclosure relates to a system for matching at least one proper noun
across different languages, wherein the system can include one or more processors that are
associated with a text search engine that runs on a computing device. In an aspect, the one or
more processors can be operatively coupled with a memory that stores one or more executable
instructions in a manner such that when the one or more processors execute a part of the one or more
executable instructions, the text search engine can receive the at least one proper noun in a first
language, and then process matching of the at least one proper noun with a plurality of words
of the first language that are generated in real-time from a database that stores language words
in a second language. The plurality of words can be generated in real-time from the
corresponding language words based on phonetic conversion of each syllable of language
words of the second language, and combination of the syllables of the second language words
to form components having one or more words in the first language. In an aspect, during the
matching, for each rule-based match between the at least one proper noun and the plurality of
words, a numerical value is assigned to the corresponding word of the plurality of words in a
manner such that the word of the plurality of words that has the highest numerical value can be
chosen as the matching word, and its corresponding language word in the database can be
determined as the matching language word, and wherein the rule-based match can be
configured to match words based on a combination of first letter match, sequential letter match,
and order of names.
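By way of a non-limiting illustration, the generation of first-language words from second-language syllables described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the table `SYLLABLE_MAP`, its entries, and the function `generate_candidates` are hypothetical names introduced for illustration, and the second-language word is assumed to have already been split into syllables.

```python
from itertools import product

# Hypothetical table (an assumption, not part of the disclosure): each
# second-language syllable maps to one or more first-language (English)
# phonetic spellings. The entries are illustrative only.
SYLLABLE_MAP = {
    "रा": ["ra", "raa"],
    "म": ["m", "ma"],
    "सी": ["si", "see"],
    "ता": ["ta", "taa"],
}


def generate_candidates(syllables):
    """Combine the phonetic variants of each syllable into candidate words."""
    # One list of possible English spellings per syllable.
    variants = [SYLLABLE_MAP[s] for s in syllables]
    # Every combination of one variant per syllable yields one candidate.
    return ["".join(parts) for parts in product(*variants)]


if __name__ == "__main__":
    print(generate_candidates(["रा", "म"]))
```

Generating several spellings per syllable is one possible way to accommodate the spelling variations that arise when a name is written in the first language; an actual embodiment may use a different conversion table or rule set.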
[0041] In an aspect, post matching, each respective word of the plurality of words,
based on a benchmark set for the extent of matching, can be classified in any class selected
from “matching”, “partially-matching”, and “not matching”. In another aspect, the benchmark
is revisable, i.e., it can be adjusted to vary the extent of matching required for classification.
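The classification against a revisable benchmark can be illustrated with the short sketch below. It assumes, purely for illustration, that the extent of matching is expressed as a score normalized to the range 0 to 1 and that two benchmark values separate the three classes; the function name `classify` and the default values 0.8 and 0.5 are hypothetical.

```python
def classify(score, match_benchmark=0.8, partial_benchmark=0.5):
    """Map a normalized score in [0, 1] to one of the three classes.

    Both benchmark values are illustrative assumptions; they are parameters
    so that the benchmark can be revised without changing the logic.
    """
    if score >= match_benchmark:
        return "matching"
    if score >= partial_benchmark:
        return "partially-matching"
    return "not matching"


if __name__ == "__main__":
    print(classify(0.9))                       # default benchmark
    print(classify(0.6))                       # default benchmark
    print(classify(0.6, match_benchmark=0.6))  # revised (relaxed) benchmark
```

Lowering `match_benchmark` relaxes the extent of matching required for a word to be classified as "matching", while raising it makes the classification stricter.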
[0042] In yet another aspect, the at least one proper noun in the first language can be
received by the text search engine from a second database that stores words in the first
language. In another aspect, the at least one proper noun in the first language can be retrieved
from said second database in an iterative manner.
[0043] In an aspect, the first language can be English, and the second language can be
any of Telugu, Marathi, Bengali, Assamese, Tamil, and Odiya, among other languages.
[0044] The present disclosure further relates to a method for matching at least one
proper noun across different languages, said method comprising the steps of: receiving, through
one or more processors of a text search engine that is configured in a computing device, said
at least one proper noun in a first language; matching said at least one proper noun with a
plurality of words of the first language that are generated in real-time from a database that
stores language words in a second language, said plurality of words being generated in real-time
from said corresponding language words based on phonetic conversion of each syllable
of language words of said second language, and combination of the syllables of said second
language words to form components having one or more words in the first language, wherein,
during the matching, for each rule-based match between said at least one proper noun and said
plurality of words, a numerical value is assigned to corresponding word of said plurality of
words, and wherein the rule-based match is configured to match words based on a combination
of first letter match, sequential letter match, and order of names; and choosing a word from
said plurality of words that has the highest numerical value as the matching word, and marking
its corresponding language word in said database as the matching language word for said at
least one proper noun.
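The rule-based scoring and the selection of the highest-valued word recited in the method can be sketched as follows. The disclosure names the three rules (first letter match, sequential letter match, and order of names) but does not fix how they are weighted, so the weights 1.0 and 0.5, the greedy in-order letter count, and the function names below are all assumptions made for illustration.

```python
def sequential_letters(a, b):
    """Greedily count letters of `a` that occur in `b` in left-to-right order."""
    pos = 0
    count = 0
    for ch in a:
        found = b.find(ch, pos)
        if found != -1:
            count += 1
            pos = found + 1
    return count


def score_name(query, candidate):
    """Assign a numerical value to `candidate` for the given `query` name."""
    q_parts = query.lower().split()
    c_parts = candidate.lower().split()
    score = 0.0
    matched_positions = []
    for q in q_parts:
        best, best_idx = 0.0, None
        for idx, c in enumerate(c_parts):
            # Sequential letter rule, normalized by the longer name part.
            s = sequential_letters(q, c) / max(len(q), len(c))
            # First letter rule (weight 1.0 is an assumed value).
            if q[0] == c[0]:
                s += 1.0
            if s > best:
                best, best_idx = s, idx
        score += best
        if best_idx is not None:
            matched_positions.append(best_idx)
    # Order-of-names rule: assumed bonus when the matched parts of the
    # candidate appear in the same order as the parts of the query.
    if matched_positions and matched_positions == sorted(matched_positions):
        score += 0.5
    return score


def best_match(query, candidates):
    """Choose the candidate with the highest numerical value."""
    return max(candidates, key=lambda c: score_name(query, c))


if __name__ == "__main__":
    print(best_match("Ram Kumar", ["Raam Kumaar", "Shyam Kumar", "Kumar Ram"]))
```

In this sketch a candidate with reordered name parts still receives a high value from the letter rules and loses only the order bonus, which reflects the need to recognize name variations as belonging to the same person.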
[0045] FIG. 1 illustrates an exemplary network architecture in which or with which
the proposed system can be implemented in accordance with an embodiment of the present
disclosure.
[0046] As illustrated, in a network implementation 100, the system 102 can be
communicatively coupled with a plurality of computing devices 106-1, 106-2…106-N
(collectively referred to as computing devices 106 and individually referred to as computing
device 106 hereinafter) through network 104. The system 102 can be implemented using any
or a combination of hardware components and software components such as a server 112, a
computing system, a computing device, a security device and the like.
[0047] Further, the system 102 can interact with input devices 108-1, 108-2…108-N
(collectively referred to as input devices 108, and individually referred to as input device 108
hereinafter), through the computing devices 106 or through applications residing on the
computing devices 106. In an implementation, the system 102 can be accessed by applications
residing on any operating system, including but not limited to, Android™, iOS™, and the like.
Examples of the computing devices 106 can include but are not limited to, a portable computer,
a personal digital assistant, a handheld device, and a workstation. In a preferred embodiment,
the computing devices 106 are mobile phones associated with respective input devices 108.
[0048] In an embodiment, the input device 108 can include a touch pad, touch enabled
screen of a computing device, an optical sensor, an image scanner and the like that can be used
to receive a handwriting input that forms part of an input to the system 102.
[0049] In an embodiment, the input device 108 can be implemented such that it forms
part of the computing device 106. For example, the input device 108 can be a touch screen
implemented with mobile device 106 to receive a written text from the user.
[0050] In an embodiment, the system 102 can receive the input from the input device
108. Further, the written text can be produced by using a device, apparatus or hand to
form one or more letters of a language. In another embodiment, the input device 108 can be an
optical sensor that can be used for capturing the image of the written text that can be received
by the system for further processing, the text having a first set of words.
[0051] In an embodiment, the system 102 can be configured to extract letters from the
first set of words from the received text or sample text. In an embodiment, the first set of words
can include one or more words. The first set of words can include a name, a salutation, etc.
[0052] In one embodiment, the system 102 can extract each letter of the received text.
The letters can form part of a language. In another embodiment, the first set of words can
include symbols such as ‘, (comma)’, ‘; (semicolon)’, ‘& (ampersand)’, and the like. In another
embodiment, the received text can be of one language such as English, Marathi, etc.
[0053] In an embodiment, the system 102 can be configured to determine one or more
letters of the language. The system 102 can be configured to determine one or more letters by
comparing at least a part of the determined one or more characters with a second dataset
comprising one or more characters pertaining to the one or more letters of the language. In an
embodiment, the second dataset can be stored in a second database. It would be appreciated
that the second database can be present on a cloud/server.
[0054] It would be appreciated by the person skilled in the art that, although the
embodiments have been described in terms of English, Marathi, Gurmukhi, Punjabi, Hindi and
Devanagari, the systems and methods can be used for other languages as well.
[0055] In an embodiment, the system 102 can be configured to iteratively select one or
more letters from the extracted letters corresponding to at least one word out of the first set of
words to determine a second set of words. The system 102 can repeat the iterations for a
plurality of instances. The iterative selection of the letters can depend on a prestored set of rules
or a set of instructions that can be already stored. The prestored set of rules can be configured
as per requirement.
[0056] In an embodiment, the system 102 can be configured to compare each of the
second set of words with a plurality of prestored words stored in a second database, and assign
a score to a corresponding closest matching word for each of the second set of words based on
comparison, wherein the closest matching of words is based on highest number of common
letters of the second set of words and the words of the plurality of words.
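By way of a non-limiting illustration, the comparison of paragraph [0056] can be sketched as below. The sketch assumes that "common letters" means the multiset overlap of letters between two words; the function names `common_letter_count` and `closest_match` are hypothetical and do not appear in the disclosure.

```python
from collections import Counter


def common_letter_count(a: str, b: str) -> int:
    """Count letters shared by a and b, respecting repeats (multiset overlap)."""
    return sum((Counter(a.lower()) & Counter(b.lower())).values())


def closest_match(word: str, prestored: list[str]) -> tuple[str, int]:
    """Return the prestored word sharing the most letters with `word`, and its score."""
    best = max(prestored, key=lambda candidate: common_letter_count(word, candidate))
    return best, common_letter_count(word, best)


if __name__ == "__main__":
    print(closest_match("thakur", ["thakor", "priya", "kumar"]))
```

Other definitions of letter overlap (for example, position-sensitive ones) would fit the same structure; only `common_letter_count` would change.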
[0057] Further, the system 102 can be configured to determine a cumulative score based
on the closest matching words for the second set of words; and compare the determined
cumulative score with a predefined threshold score, wherein when the cumulative score
breaches the predefined threshold based on the comparison, a third set of words of the plurality
of words corresponding to the closest matching word is a converted set of words for the
received first set of words.
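A minimal sketch of the thresholding of paragraph [0057] is given below. It assumes that the cumulative score is the plain sum of the per-word scores and that "breaching" the threshold means strictly exceeding it; both are assumptions, and the name `converted_words` is hypothetical.

```python
from typing import Optional


def converted_words(matches: list[tuple[str, int]], threshold: int) -> Optional[list[str]]:
    """Accept the closest-matching words only if their summed score exceeds the threshold.

    `matches` holds one (closest_word, score) pair per word of the second set of words.
    """
    cumulative = sum(score for _, score in matches)
    if cumulative > threshold:
        return [word for word, _ in matches]
    return None  # threshold not breached: no converted set of words is reported


if __name__ == "__main__":
    print(converted_words([("priya", 5), ("thakur", 6)], threshold=10))
```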
[0058] In an embodiment, the system 102 can comprise a presentation unit (not shown).
The presentation unit can be configured to visually display the converted set of words.
[0059] FIG. 2 illustrates an exemplary module diagram for recognition of written text
in accordance with an embodiment of the present disclosure.
[0060] In an aspect, module diagram 200 of the system 102 may comprise one or more
processor(s) 202. The one or more processor(s) 202 may be implemented as one or more
microprocessors, microcomputers, microcontrollers, digital signal processors, central
processing units, logic circuitries, and/or any devices that manipulate data based on operational
instructions. Among other capabilities, the one or more processor(s) 202 are configured to fetch
and execute computer-readable instructions stored in a memory 206 of the system 102. The
memory 206 may store one or more computer-readable instructions or routines, which may be
fetched and executed to create or share the data units over a network service. The memory 206
may comprise any non-transitory storage device including, for example, volatile memory such
as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0061] The system 102 may also comprise an interface(s) 204. The interface(s) 204
may comprise a variety of interfaces, for example, interfaces for data input and output devices,
referred to as I/O devices, storage devices, and the like. The interface(s) 204 may facilitate
communication of system 102. The interface(s) 204 may also provide a communication
pathway for one or more components of the system 102. Examples of such components include,
but are not limited to, processing engine(s) 208 and data 210.
[0062] The processing engine(s) 208 may be implemented as a combination of
hardware and programming (for example, programmable instructions) to implement one or
more functionalities of the processing engine(s) 208. In examples described herein, such
combinations of hardware and programming may be implemented in several different ways.
For example, the programming for the processing engine(s) 208 may be processor executable
instructions stored on a non-transitory machine-readable storage medium and the hardware for
the processing engine(s) 208 may comprise a processing resource (for example, one or more
processors), to execute such instructions. In the present examples, the machine-readable
storage medium may store instructions that, when executed by the processing resource,
implement the processing engine(s) 208. In such examples, the system 102 may comprise the
machine-readable storage medium storing the instructions and the processing resource to
execute the instructions, or the machine-readable storage medium may be separate but
accessible to system 102 and the processing resource. In other examples, the processing
engine(s) 208 may be implemented by electronic circuitry.
[0063] The data 210 may comprise data that is either stored or generated as a result of
functionalities implemented by any of the components of the processing engine(s) 208 or the
system 102. With reference to FIGs. 1 and 2, the present disclosure relates to finding an
appropriate match for a proper noun in a first language with a word/noun stored in a second
language in a database. For instance, if we need to check/verify whether a legal document such
as a property paper that mentions the owner’s name as “Priya Thakur” in the English language is
actually in the name of “Priya Thakur”, wherein the actual names are stored in a database in a
different language such as Tamil, the proposed system and aspects thereof can be used.
[0064] In an exemplary embodiment, aspects of the proposed system, in order to
undertake the above-mentioned verification, can be configured to generate a plurality of words
that can potentially match with the English language name “Priya Thakur”, wherein the
plurality of words, as part of potential words generation engine 212, are generated in English
language (first language) from corresponding Tamil Language (Second language) words that
are stored in the database (property register, for instance). Such plurality of words can be
generated in real-time from the corresponding language words (Tamil words) based on
phonetic conversion of each syllable of language words of the second language, and
combination of the syllables of the second language words to form components having one or
more words in the first language. For instance, all Tamil words that are phonetically (first name
or second name or both) similar to “Priya Thakur” are extracted and their corresponding
English words are generated (referred to as plurality of words).
[0065] Once the plurality of words are generated in the English language, matching
engine 214 of the proposed system, can be configured to match the proper noun that needs to
be searched for with each of the plurality of words, based on which match-based value
association engine 216, for each rule-based match between the at least one proper noun and the
plurality of words, computes and assigns a numerical value to corresponding word of the
plurality of words. Matching language word identification engine 218 can then be used to
identify the word of the plurality of words that has the highest numerical value as the matching
word, and its corresponding language word in the database (say 210) can be determined as the
matching language word, and wherein the rule-based match can be configured to match words
based on a combination of first letter match, sequential letter match, and order of names.
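The value assignment and selection of paragraph [0065] may be sketched as follows. The disclosure names the three criteria (first letter match, sequential letter match, order of names) but not their weights, so the bonus constants below are hypothetical. Sequential letter match is approximated here with the standard-library `difflib.SequenceMatcher`, which is a swapped-in technique rather than the disclosed one, and the record identifiers stand in for second-language database entries.

```python
from difflib import SequenceMatcher

FIRST_LETTER_BONUS = 0.5  # hypothetical weight, not taken from the disclosure
NAME_ORDER_BONUS = 0.5    # hypothetical weight, not taken from the disclosure


def match_value(proper_noun: str, candidate: str) -> float:
    """Assign a numerical value to `candidate` for the searched proper noun."""
    query_parts = proper_noun.lower().split()
    cand_parts = candidate.lower().split()
    if not query_parts or not cand_parts:
        return 0.0
    value = 0.0
    positions = []
    for part in query_parts:
        # Sequential letter match against every candidate part, in any position.
        ratios = [SequenceMatcher(None, part, c).ratio() for c in cand_parts]
        best = max(range(len(cand_parts)), key=ratios.__getitem__)
        value += ratios[best]
        if part[0] == cand_parts[best][0]:
            value += FIRST_LETTER_BONUS
        positions.append(best)
    # Order of names: reward candidates whose parts appear in the same order.
    if positions == sorted(positions):
        value += NAME_ORDER_BONUS
    return value


def best_match(proper_noun: str, candidates: dict[str, str]) -> tuple[str, str]:
    """Pick the generated word with the highest value and its source database record."""
    word = max(candidates, key=lambda w: match_value(proper_noun, w))
    return word, candidates[word]


if __name__ == "__main__":
    generated = {"priya thakur": "record-1", "thakur priya": "record-2", "pradeep sahu": "record-3"}
    print(best_match("Priya Thakur", generated))
```

Because each part of the searched name is compared with every part of the candidate, a reordered name still scores well and only forgoes the order bonus.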
[0066] In an aspect, post matching, each respective word of the plurality of words,
based on a benchmark set for the extent of matching, can be classified in any class selected
from “matching”, “partially-matching”, and “not matching”. In another aspect, the benchmark
is revisable i.e. it can be adjusted to vary the extent of matching required for classification.
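As a hedged illustration of the classification of paragraph [0066], the revisable benchmark can be modelled as two adjustable cut-offs. The use of two cut-offs, their names, and the example numbers are assumptions made for the sketch only.

```python
def classify(value: float, full_benchmark: float, partial_benchmark: float) -> str:
    """Map a match value to one of the three classes named in the disclosure."""
    if value >= full_benchmark:
        return "matching"
    if value >= partial_benchmark:
        return "partially-matching"
    return "not matching"


if __name__ == "__main__":
    # Tightening the benchmark moves the same value into a weaker class.
    print(classify(3.0, full_benchmark=3.0, partial_benchmark=1.5))
    print(classify(3.0, full_benchmark=3.5, partial_benchmark=1.5))
```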
[0067] In yet another aspect, the at least one proper noun in the first language can be
received by the text search engine from a second database that stores words in the first
language. In another aspect, the at least one proper noun in the first language can be retrieved
from said second database in an iterative manner.
[0068] In an aspect, the first language can be English, and the second language can be
any of Telugu, Marathi, Bengali, Assamese, Tamil, and Odiya, among other languages.
[0069] In an exemplary aspect, an implementation of the proposed system can be in the
form of a parser from Oriya to English language for comparing millions of names (proper
nouns) from Oriya database with corresponding database of English names. The proposed
invention therefore can be configured as a Parser that compares millions of Oriya names with
the corresponding English names through an automated process across two large databases. If
there is a large database in English and it needs to be compared with another large database in
Oriya, the proposed invention can help do so efficiently. An exemplary object of the present
invention is to compare names of people and places, where language conversion does not
require meaning-based translation, only re-rendering in the other language so that the names
may be compared and matched.
[0070] In an exemplary implementation, the plurality of words in English language (in
the above example) can be generated from corresponding words in the second language (for
instance Oriya language) using Phonetic conversion of each syllable of Oriya to English,
wherein, for instance, every consonant can be combined with a vowel, an ‘r’ sound or an ‘n’
or ‘m’ sound, or more than one of these three options, to form a character, which represents a
complete syllable. A comprehensive library of thousands of such characters can be created to
represent every possible sound in the Oriya language. For each syllable, a corresponding set of
English letters can be assigned that most closely resembles that sound. Thereafter, the syllables can
be combined to form components of the name (First Name, Middle Name, Last Name) in
English, after which the process of matching the two English databases can be conducted using
an automated, iterative script, to find names that best match with each other. A numerical value
can be assigned to each match on the basis of several criteria such as first letter match,
sequential letter match and order of names among others. A qualifying benchmark is
established for classification as “matching”, “partially matching” or “not matching” and this
benchmark may be relaxed or tightened iteratively to generate a good match. The proposed
system can help zero in on which names need to be matched (in the large databases), and then use
an iterative algorithm to find the match. The proposed system can help: match names despite
differences in spelling; match names even if the order of the name components (for example,
surname and middle name) is different in the two databases; match names despite a variation in
the form of names, for example, if one database says Babu Pattnaik and the other database says
Baburam Pattnaik, the program recognizes them as the same person; and indicate where the name
matches only partially, for example, where only the surname matches. This last aspect is useful as
one database may have the father’s name whereas the other database may have the son’s name,
and we may need to establish the connection.
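The syllable-wise phonetic conversion of paragraph [0070] can be illustrated with a deliberately tiny, hypothetical character library; the disclosure contemplates thousands of entries, whereas the sketch below maps only seven Oriya consonants and three vowel signs. The handling of the inherent vowel and of the virama (conjunct marker) is an assumption about how such a library might be applied, and the names `syllables_to_english` and `name_to_english` are hypothetical. With this mapping, the first Oriya entry of the table in paragraph [0077] converts to the form shown there.

```python
# Hypothetical miniature library: Oriya code points mapped to English letters.
CONSONANTS = {
    "\u0B15": "k",  # KA
    "\u0B26": "d",  # DA
    "\u0B2A": "p",  # PA
    "\u0B2E": "m",  # MA
    "\u0B30": "r",  # RA
    "\u0B38": "s",  # SA
    "\u0B39": "h",  # HA
}
VOWEL_SIGNS = {"\u0B3E": "a", "\u0B3F": "i", "\u0B41": "u"}  # AA, I, U signs
VIRAMA = "\u0B4D"  # suppresses the inherent vowel, joining consonants


def syllables_to_english(word: str) -> str:
    """Convert one Oriya word to English letters, one syllable at a time."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"no mapping for character U+{ord(ch):04X}")
        out.append(CONSONANTS[ch])
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt == VIRAMA:
            i += 2  # consonant cluster: no vowel after this consonant
        elif nxt in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[nxt])
            i += 2
        else:
            out.append("a")  # inherent vowel
            i += 1
    return "".join(out)


def name_to_english(name: str) -> str:
    """Convert each space-separated component of a name and recombine them."""
    return " ".join(syllables_to_english(word) for word in name.split())


if __name__ == "__main__":
    oriya_name = "\u0B2A\u0B4D\u0B30\u0B26\u0B3F\u0B2A \u0B15\u0B41\u0B2E\u0B3E\u0B30 \u0B38\u0B3E\u0B39\u0B41"
    print(name_to_english(oriya_name))  # pradipa kumara sahu
```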
[0071] In an exemplary aspect, multiple rule sets can be generated during conversion
from the second language to the first language for the subsequent matching exercise, wherein the
conversion can be done as part of the potential words generation engine 212. A few exemplary
rules can include “Without remove vowels_e_m_ratio (string remove vowels in English, string
remove vowels in Marathi)”, where both names are taken after removing vowels, split
on spaces into arrays, and passed through a for loop in which a count variable is incremented
if the following conditions are satisfied:
If the first name of the Names/Proper Nouns written in Language 1 contains the first name of
the Names/Proper Nouns written in Language 2, then the count is incremented.
If the whole of the Names/Proper Nouns written in Language 2 is present in the Names/Proper
Nouns written in Language 1, then the count is incremented.
If the length of the Names/Proper Nouns written in Language 2 is greater than 2, then
combinations of two letters are matched.
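The three conditions above can be sketched as follows. Several points are assumptions: both inputs are taken to be already rendered in English letters, "length" is read as the number of letters remaining after vowel removal, and the third condition is read as adding one to the count for every two-letter combination common to both names. The function names are hypothetical.

```python
VOWELS = set("aeiou")


def strip_vowels(name: str) -> str:
    """Lower-case the name and drop its vowels, keeping the spaces between parts."""
    return "".join(ch for ch in name.lower() if ch not in VOWELS)


def two_letter_combinations(parts: list[str]) -> set[str]:
    """All adjacent two-letter runs found inside each name part."""
    return {part[i:i + 2] for part in parts for i in range(len(part) - 1)}


def vowel_stripped_count(name_lang1: str, name_lang2: str) -> int:
    """Count rule hits between two names after vowel removal."""
    stripped1, stripped2 = strip_vowels(name_lang1), strip_vowels(name_lang2)
    parts1, parts2 = stripped1.split(), stripped2.split()
    if not parts1 or not parts2:
        return 0
    count = 0
    if parts2[0] in parts1[0]:   # first name of Language 1 contains first name of Language 2
        count += 1
    if stripped2 in stripped1:   # whole Language 2 name is present in the Language 1 name
        count += 1
    if len(stripped2.replace(" ", "")) > 2:  # long enough to compare two-letter combinations
        count += len(two_letter_combinations(parts1) & two_letter_combinations(parts2))
    return count


if __name__ == "__main__":
    print(vowel_stripped_count("Baburam Pattnaik", "Babu Pattnaik"))
```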
[0072] Another exemplary rule can include “Combination match_ratio (string remove
vowels in English, string remove vowels in Marathi)”, where both names are split on spaces
into arrays. If the length of a name written in Language 2 is equal to 2, a loop is run
length - 1 times to make combinations of two letters. If the name length is greater than two,
the loop is run length - 2 times to make combinations of three letters. The same is done for
the Names/Proper Nouns written in Language 1. The combinations of letters are collected in
arrays and compared with each other, the count is increased for each matching combination,
and the count is returned. Yet another
conversion/generation rule can include “only first” (string remove vowels in English, string
remove vowels in Marathi), where the names are passed after removing vowels. The name
written in Language 2 is split into words and a loop is run over its length, and the same is
performed for the name written in Language 1. The first letter of the first name written in
Language 2 is then compared with the first letter of each of the first name, second name and
last name written in Language 1; on each match the count is increased, and the pass
percentage of the match is returned.
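A non-authoritative sketch of these two rules follows. It assumes that the inputs have already had their vowels removed where the rule requires it, that combinations are formed within each name part, and that the "pass percentage" is the share of Language 1 name parts whose first letter matches; the function names are hypothetical.

```python
def letter_combinations(word: str) -> list[str]:
    """Two-letter run for a two-letter word, three-letter runs for longer words."""
    size = 2 if len(word) == 2 else 3
    # Runs length - 1 times for size 2 and length - 2 times for size 3.
    return [word[i:i + size] for i in range(len(word) - size + 1)]


def combination_match_count(name_lang1: str, name_lang2: str) -> int:
    """Count letter combinations of the Language 2 name that occur in the Language 1 name."""
    combos1 = {c for part in name_lang1.lower().split() for c in letter_combinations(part)}
    combos2 = [c for part in name_lang2.lower().split() for c in letter_combinations(part)]
    return sum(1 for combo in combos2 if combo in combos1)


def first_letter_match_percent(name_lang1: str, name_lang2: str) -> float:
    """Percentage of Language 1 name parts starting with the first letter of the Language 2 first name."""
    parts1 = name_lang1.lower().split()
    parts2 = name_lang2.lower().split()
    if not parts1 or not parts2:
        return 0.0
    initial = parts2[0][0]
    hits = sum(1 for part in parts1 if part[0] == initial)
    return 100.0 * hits / len(parts1)


if __name__ == "__main__":
    print(combination_match_count("chgn rjn bngr", "rjn chgn bngr"))
    print(first_letter_match_percent("Singh Gulab", "Gulabsingh"))
```

Because the combinations are pooled across name parts before comparison, the count is unaffected by the order in which the parts appear.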
[0073] FIG. 3 is a flow diagram illustrating a process for recognition of written text in
accordance with an embodiment of the present disclosure.
[0074] At step 302, the proposed method includes the step of receiving, through one or
more processors of a text search engine that is configured in a computing device, said at least
one proper noun in a first language. At step 304, the method includes the step of matching said
at least one proper noun with a plurality of words of the first language that are generated in
real-time from a database that stores language words in a second language, said plurality of
words being generated in real-time from said corresponding language words based on phonetic
conversion of each syllable of language words of said second language, and combination of
the syllables of said second language words to form components having one or more words in
the first language, wherein, during the matching, for each rule-based match between said at
least one proper noun and said plurality of words, a numerical value is assigned to
corresponding word of said plurality of words, and wherein the rule-based match is configured
to match words based on a combination of first letter match, sequential letter match, and order
of names; and at step 306, the method can include the step of choosing a word from said
plurality of words that has the highest numerical value as the matching word, and marking its
corresponding language word in said database as the matching language word for said at least
one proper noun.
[0075] In an aspect, the proposed method may be described in the general context of
computer-executable instructions. Generally, computer executable instructions can include
routines, programs, objects, components, data structures, procedures, modules, functions, etc.,
that perform particular functions or implement particular abstract data types. The method can
also be practiced in a distributed computing environment where functions are performed by
remote processing devices that are linked through a communications network. In a distributed
computing environment, computer-executable instructions may be located in both local and
remote computer storage media, including memory storage devices.
[0076] The order in which the method is described is not intended to be construed as a
limitation and any number of the described method blocks may be combined in any order to
implement the method or alternate methods. Additionally, individual blocks may be deleted
from the method without departing from the spirit and scope of the subject matter described
herein. Furthermore, the method may be implemented in any suitable hardware, software,
firmware, or combination thereof. However, for ease of explanation, in the embodiments
described below, the method may be considered to be implemented in the above-described
system.
[0077] For example, the following table displays the names in different languages and
their corresponding English conversion and corresponding converted texts.
| Language | Name | English Conversion | Name in English Database | Matched by the proposed system? |
|---|---|---|---|---|
| Marathi | बळीराम स ुंबे | Baliram SUMBE | Sarubai Baliram Sumbe | Yes |
| Marathi | स ुंदरराव | Sundarrao | Bhaskar Sundarrao Chavhan | Yes |
| Marathi | छगन अर् ुन बाुंगर | CHAGAN ARJUN BANGAR | ARJUN CHAGAN BANGAR | Yes |
| Marathi | ज्ञानोबा कोुंडीबा लव्हारे | Dnyanoba Kondiba Lavhare | Kondiba Lavhare | Yes |
| Marathi | चुंद्रकाुंत सर्ेराव चौधार अपाक आई स शीलाबाई सर्ेराव चौधार | chndarkant srajerav chaudhar apak aaee susheelabaee srajerav chaudhar | chandrakant sarjerao chaudhar | Yes |
| Telugu | కోర గంగాధర్ రెడ్డ(ిపుల్లారె డ్డి) | koora gamgaadhara redadi(pulalaaredadi) | Kora Gangadhar Reddy | Yes |
| Telugu | బి. నడ్డపి ఓబులు | bi. nadipi obulu | Budda Nadipi Obulu | Yes |
| Telugu | పేరా లచ్చ నన గా రి రామిరెడ్డ(ి బాల గంగనన) | peerla laccannagaari raamireddi(baala gamganna) | PERLA RAMIREDDY | Yes |
| Telugu | మధసుధ న్ రెడ్డి ఎద్దుల(ప్ర భాకర రెడ్డి) | madhasudhana redadi edhadula(parabhaakra redadi) | YEDDULA MADHUSUDANA REDDY | Yes |
| Telugu | గజ్లజ రాజేశ్వ రి | gajjala raajeeshvari | GAJJALA RAMAKRISHNA REDDY | Yes |
| Telugu | శిరపు చినన ముని రెడ్డ ి | shirapu cinna muni reddi | CHINNA MUNI REDDY | Yes |
| Oriya | ପ୍ରଦିପ କୁମାର ସାହୁ | pradipa kumara sahu | Pradeep Sahu | Yes |
| Oriya | ଆନନ୍ଦ ଚନ୍ଦ୍ର ସାହୁ | ananda chandra sahu | Ananda Sahu Chandra | Yes |
| Oriya | ବବ୍ଳୁପଧାନ, ଅଜିତ ପଧାନ | bablu padhana ajita padhana | BABLU PADHAN | Yes |
| Oriya | ବରୁ ଣ କୁମାର ରଣା | baruna kumara rana | RANA BARUNA KUMAR | Yes |
| Oriya | ଗ ୌରଚନ୍ଦ୍ର ସାହୁ | gaurachandra sahu | GOUR SAHU | Yes |
| Hindi | ववर्ेन्द्रवसह वप वहम्मतवसह | vijenadarsingh pi himamtsingh | Vijendra Himmat Singh | Yes |
| Hindi | सर्नवसह | Sajjan Singh | Sajjan Singh Rajput | Yes |
| Hindi | नन्दा र्ावत अहीर | nnada jati aheer | Nandji Yadav | Yes |
| Hindi | चतरभ र् | chtrbhuj | Mr. CHATURBHUJ | Yes |
| Hindi | ग लाबवसुंह | Gulabsingh | Singh Gulab | Yes |
[0078] FIG. 4 illustrates an exemplary computer system in which or with which
embodiments of the present invention can be utilized in accordance with embodiments of the
present disclosure.
[0079] As shown in FIG. 4, computer system 400 can include an external storage device
410, a bus 420, a main memory 430, a read only memory 440, a mass storage device 450,
communication port 460, and a processor 470. A person skilled in the art will appreciate that
the computer system may include more than one processor and communication ports. Examples
of processor 470 include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s),
or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™
system on chip processors or other future processors. Processor 470 may include various
modules associated with embodiments of the present invention. Communication port 460 can
be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet
port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other
existing or future ports. Communication port 460 may be chosen depending on a network, such
as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer
system connects.
[0080] Memory 430 can be Random Access Memory (RAM), or any other dynamic
storage device commonly known in the art. Read-only memory 440 can be any static storage
device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for
storing static information e.g., start-up or BIOS instructions for processor 470. Mass storage
450 may be any current or future mass storage solution, which can be used to store information
and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel
Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment
(SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial
Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate
Barracuda 7102 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical
discs, Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g.,
SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie,
Nexsan Technologies, Inc. and Enhance Technology, Inc.
[0081] Bus 420 communicatively couples processor(s) 470 with the other memory,
storage and communication blocks. Bus 420 can be, e.g. a Peripheral Component Interconnect
(PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like,
for connecting expansion cards, drives and other subsystems as well as other buses, such as a front
side bus (FSB), which connects processor 470 to software system.
[0082] Optionally, operator and administrative interfaces, e.g. a display, keyboard, and
a cursor control device, may also be coupled to bus 420 to support direct operator interaction
with a computer system. Other operator and administrative interfaces can be provided through
network connections connected through communication port 460. The external storage device
410 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact
Disc - Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video
Disk-Read Only Memory (DVD-ROM). Components described above are meant only to
exemplify various possibilities. In no way should the aforementioned exemplary computer
system limit the scope of the present disclosure.
[0083] Embodiments of the present disclosure may be implemented entirely in hardware,
entirely in software (including firmware, resident software, micro-code, etc.) or in an
implementation combining software and hardware, all of which may generally be referred to herein as a
“circuit,” “module,” “component,” or “system.” Furthermore, aspects of the present disclosure
may take the form of a computer program product comprising one or more computer readable
media having computer readable program code embodied thereon.
[0084] Thus, it will be appreciated by those of ordinary skill in the art that the diagrams,
schematics, illustrations, and the like represent conceptual views or processes illustrating
systems and methods embodying this invention. The functions of the various elements shown
in the figures may be provided through the use of dedicated hardware as well as hardware
capable of executing associated software. Similarly, any switches shown in the figures are
conceptual only. Their function may be carried out through the operation of program logic,
through dedicated logic, through the interaction of program control and dedicated logic, or even
manually, the particular technique being selectable by the entity implementing this invention.
Those of ordinary skill in the art further understand that the exemplary hardware, software,
processes, methods, and/or operating systems described herein are for illustrative purposes and,
thus, are not intended to be limited to any particular name.
[0085] As used herein, and unless the context dictates otherwise, the term "coupled to"
is intended to include both direct coupling (in which two elements that are coupled to each
other contact each other) and indirect coupling (in which at least one additional element is
located between the two elements). Therefore, the terms "coupled to" and "coupled with" are
used synonymously. Within the context of this document terms "coupled to" and "coupled
with" are also used euphemistically to mean “communicatively coupled with” over a network,
where two or more devices are able to exchange data with each other over the network, possibly
via one or more intermediary devices.
[0086] It should be apparent to those skilled in the art that many more modifications
besides those already described are possible without departing from the inventive concepts
herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the
appended claims. Moreover, in interpreting both the specification and the claims, all terms
should be interpreted in the broadest possible manner consistent with the context. In particular,
the terms “comprises” and “comprising” should be interpreted as referring to elements,
components, or steps in a non-exclusive manner, indicating that the referenced elements,
components, or steps may be present, or utilized, or combined with other elements,
components, or steps that are not expressly referenced. Where the specification or claims refer to
at least one of something selected from the group consisting of A, B, C …. and N, the text
should be interpreted as requiring only one element from the group, not A plus N, or B plus N,
etc.
[0087] While the foregoing describes various embodiments of the invention, other and
further embodiments of the invention may be devised without departing from the basic scope
thereof. The scope of the invention is determined by the claims that follow. The invention is
not limited to the described embodiments, versions or examples, which are included to enable
a person having ordinary skill in the art to make and use the invention when combined with
information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0088] The present disclosure provides for a system and method for matching proper
nouns/names across different languages.
[0089] The present disclosure provides a system and method for the conversion of text
from one language to another.
[0090] The present disclosure provides a system and method for the recognition of
written text that is cost effective and easy to implement.
[0091] The present disclosure provides a system and method for the recognition of
written text that enhances adaptability to new texts.
We Claim:
1. A system for matching at least one proper noun across different languages, said system
comprising:
one or more processors that are associated with a text search engine that runs on a
computing device, said one or more processors being operatively coupled with a memory
that stores one or more executable instructions in a manner such that when said one or more
processors execute a part of said one or more executable instructions, the text search
engine:
receives said at least one proper noun in a first language;
processes matching of said at least one proper noun with a plurality of words of
the first language that are generated in real-time from a database that stores language
words in a second language, said plurality of words being generated in real-time from
said corresponding language words based on phonetic conversion of each syllable of
language words of said second language, and combination of the syllables of said second
language words to form components having one or more words in the first language,
wherein, during the matching, for each rule-based match between said at least
one proper noun and said plurality of words, a numerical value is assigned to
corresponding word of said plurality of words in a manner such that the word of said
plurality of words that has the highest numerical value is chosen as the matching word,
and its corresponding language word in said database is determined as the matching
language word,
and wherein the rule-based match is configured to match words based on a
combination of first letter match, sequential letter match, and order of names.
2. The system as claimed in claim 1, wherein post matching, each respective word of the
plurality of words, based on the benchmark set for the extent of matching, is classified in
any class selected from “matching”, “partially-matching”, and “not matching”.
3. The system as claimed in claim 1, wherein said benchmark is revisable.
4. The system as claimed in claim 1, wherein said at least one proper noun in said first
language is received by the text search engine from a second database storing words in
said first language.
5. The system as claimed in claim 4, wherein said at least one proper noun in said first
language is retrieved from said second database in an iterative manner.
6. The system as claimed in claim 1, wherein the first language is English, and said second
language is any of Telugu, Marathi, Bengali, Assamese, Tamil, and Odiya.
7. A method for matching at least one proper noun across different languages, said method
comprising the steps of:
receiving, through one or more processors of a text search engine that is
configured in a computing device, said at least one proper noun in a first language;
matching said at least one proper noun with a plurality of words of the first
language that are generated in real-time from a database that stores language words in a
second language, said plurality of words being generated in real-time from said
corresponding language words based on phonetic conversion of each syllable of language
words of said second language, and combination of the syllables of said second language
words to form components having one or more words in the first language, wherein,
during the matching, for each rule-based match between said at least one proper noun
and said plurality of words, a numerical value is assigned to corresponding word of said
plurality of words, and wherein the rule-based match is configured to match words based
on a combination of first letter match, sequential letter match, and order of names;
choosing a word from said plurality of words that has the highest numerical value
as the matching word, and marking its corresponding language word in said database as
the matching language word for said at least one proper noun.
8. The method as claimed in claim 7, wherein post matching, each respective word of the
plurality of words, based on the benchmark set for the extent of matching, is classified in
any class selected from “matching”, “partially-matching”, and “not matching”.
9. The method as claimed in claim 7, wherein said at least one proper noun in said first
language is received by the text search engine from a second database storing words in
said first language.
10. The method as claimed in claim 7, wherein the first language is English, and said second
language is any of Telugu, Marathi, Bengali, an Indian Language, Assamese, Tamil, and
Odiya.
| # | Name | Date |
|---|---|---|
| 1 | 202011015298-STATEMENT OF UNDERTAKING (FORM 3) [07-04-2020(online)].pdf | 2020-04-07 |
| 2 | 202011015298-FORM 1 [07-04-2020(online)].pdf | 2020-04-07 |
| 3 | 202011015298-DRAWINGS [07-04-2020(online)].pdf | 2020-04-07 |
| 4 | 202011015298-DECLARATION OF INVENTORSHIP (FORM 5) [07-04-2020(online)].pdf | 2020-04-07 |
| 5 | 202011015298-COMPLETE SPECIFICATION [07-04-2020(online)].pdf | 2020-04-07 |
| 6 | 202011015298-FORM-26 [22-06-2020(online)].pdf | 2020-06-22 |
| 7 | 202011015298-Proof of Right [28-09-2020(online)].pdf | 2020-09-28 |
| 8 | abstract.jpg | 2021-10-18 |
| 9 | 202011015298-PA [03-10-2023(online)].pdf | 2023-10-03 |
| 10 | 202011015298-ASSIGNMENT DOCUMENTS [03-10-2023(online)].pdf | 2023-10-03 |
| 11 | 202011015298-8(i)-Substitution-Change Of Applicant - Form 6 [03-10-2023(online)].pdf | 2023-10-03 |
| 12 | 202011015298-FORM 18 [09-02-2024(online)].pdf | 2024-02-09 |
| 13 | 202011015298-FER.pdf | 2025-05-28 |
| 1 | NPL1E_02-12-2024.pdf |
| 2 | searchE_02-12-2024.pdf |