
An Apparatus And Method Of Consolidating Multiple Discrete Records Stored In A Plurality Of Raw Data Files

Abstract: AN APPARATUS AND METHOD OF CONSOLIDATING MULTIPLE DISCRETE RECORDS STORED IN A PLURALITY OF RAW DATA FILES


Patent Information

Application #: 647/MUM/2000
Filing Date: 11 July 2000
Publication Number: 16
Publication Type: INA
Grant Date: 2006-10-31

Applicants

TATA CONSULTANCY SERVICES LIMITED
BOMBAY HOUSE, 24 SIR HOMI MODY STREET, MUMBAI - 400 001, MAHARASHTRA, INDIA, AN INDIAN COMPANY

Inventors

1. JOSHI HEMANT, GUPTA KAMAL
BOMBAY HOUSE, 24 SIR HOMI MODY STREET, MUMBAI - 400 001, MAHARASHTRA, INDIA, AN INDIAN COMPANY

Specification

FORM - 2
THE PATENTS ACT, 1970
(39 of 1970)
COMPLETE SPECIFICATION
(Section 10; rule 13)
AN APPARATUS AND METHOD OF CONSOLIDATING MULTIPLE DISCRETE RECORDS STORED IN A PLURALITY OF RAW DATA FILES

TATA CONSULTANCY SERVICES LTD.
of Bombay House, 24, Sir Homi Mody Street, Mumbai 400 023,

Maharashtra, India, an Indian Company


THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED:-

This invention relates to an apparatus and a method for data management.
Background of the Invention
There are many companies, specifically in the finance and marketing sectors, who collect information about individuals. In many cases, the information collected from one or more sources (possibly at different points in time) needs to be compiled into a list of unique customers for better utilisation. The list(s) might be obtained from the subscribers of different policies of, for example, an insurance company, or from different account holders, or from the subscriber list of a company that has been taken over or merged, or might even be bought from the market. Also, many financial institutions are now getting into the retail market in a big way, by means of banks, shares, credit cards, personal loans, bonds and mutual funds, for which they could utilise a list of unique customers for target mailings and cross-selling of products.
This invention envisages an apparatus and a method for improving data management and typically an apparatus and a method that can identify the same customer appearing in two or more independent lists, by knowing his name and address only. At present there is no known apparatus that can do this job.
Although this invention is described hereinafter keeping in mind the peculiar conditions prevalent in India, the apparatus and method can be used for processing of data containing name and address information of countries other than India, with appropriate modifications.

An object of this invention is to provide a method and apparatus that converts raw data of multiple lists into a single table of valid unique entities such as customers.
The following abbreviations are used in the specification for ease of understanding:
Cust - Customer
DBA - Database Administrator
DDL - Data Definition Language
Id - Identifier
PL/SQL - Procedural Language/Structured Query Language
RPC - Remote Procedure Call
BCP - Bulk Copy (database utility from MSSQL Server)
VB - Visual Basic
MSDEV - Microsoft Visual Developer
RDBMS - Relational Database Management Systems
ODBC - Open Database Connectivity
According to this invention there is provided a method of consolidating multiple discrete records stored in a plurality of raw data files to produce a consolidated processed data file, each record in the said raw data file containing a plurality of discrete information, said method comprising the steps of:
providing a database for containing discrete processed records in table format, in the data storage means of a central processing unit;
loading the raw data file into the data storage means of the said central processing unit;
providing consolidating means and loading the said consolidating means in the central processing unit;
using the consolidation means to define the attributes for each of the said records in the said raw data file;
using the consolidation means to create master reference files;
using the consolidation means to define the information contained in a raw data file and the rules for processing the said information, in a predetermined format, to create a discrete metadata file for the raw data file;
using the created attributes to define matching criteria for records in a raw data file with corresponding processed records;
using the consolidation means to process the records in at least one of the raw data files using the information defined in its metadata file;
using the metadata file of a raw data file and the master reference files to cleanse and standardise the records in a particular raw data file being processed, to create cleansed and standardised records stored in a processed data file;
loading the processed data file containing cleansed and standardised records into the said database;
using the matching criteria and the cleansed and standardised records loaded into the said database to produce a list of unique entities and unique families, by comparing against existing unique entities or unique families, if any, in the database, and producing an output in table format in the said database;
using the consolidation means to verify whether a particular unique entity created should remain unique or be merged with an existing entity;
using the consolidation means to merge two entities to form a single entity, if needed;
using the consolidation means to split an entity formed by combining two or more records into two separate unique entities, if needed; and
using the consolidation means to obtain reports about the processing of the said raw data files and the consolidation of records.
Typically the step of defining attributes of the records includes:
Using the consolidation means to display the existing attributes on a
display screen in table format, in response to an instruction for defining
the attributes;
Using the consolidation means to simultaneously display an editable
blank table that is capable of accepting the name, data-type and size of
a particular attribute.

Using the consolidation means to add a defined attribute to the database.
Typically, the step of creating master reference files includes:
Using the consolidation means to select the type of reference file to be
created from amongst validation files, substitution files, mapping files
and pin reference files.
Using the consolidation means to display the contents of the selected
reference file in a table format and editing the values therein.
Using the consolidation means to save the said edited master reference
file.
Typically, the step of creating a metadata file includes:
Using the consolidation means to display the attributes, their data-types
and size in table format;
Using the consolidation means to specify the rules for cleansing and
standardisation for each of the defined attributes and, optionally, the
default value for any attribute.
Using the consolidation means to specify the characters that separate
the values of different attributes in the raw data file.
Typically, the step of creating a metadata file includes the step of specifying predominant name attribute format of the raw data file.

Typically, the step of defining the criteria includes: Using the consolidation means to display the existing criteria, if any, in table format;
Using the consolidation means to define one or more criteria to be used for matching said entity records to form a unique list of entities, in terms of the said attributes.
Typically, the step of defining the criteria includes segregating the criteria into definite or ambiguous criteria.
Typically, the step of defining the criteria includes the step of ranking the criteria.
Typically, the step of cleansing and standardising raw data file includes using the master reference files to determine whether or not the name represents an individual.
Typically, the said step of determining whether or not a name represents an individual includes, determining the parts of the name (title, first name, first name suffix, last name and middle name) by determining the name format of the name using the master reference files; standardising the said first name and said last name using the respective master reference files;
cleansing the non-individual name using the master reference files and getting the name parts for the same; generating the metaphones of the first and last names.

Typically, the step of cleansing and standardising the raw data file includes using the master reference files to clean the address using the rules defined in the metadata file.
Typically, the step of cleaning the address includes deriving the parts of the address from the address string (address, city, state, PIN) and cross validating the values using pre-determined master reference files; standardising the said city and state names using pre-determined master reference files; completing the PIN from the said city and state name and partial PIN available from the address, using pre-determined master reference file; generating the address components for use in the matching process.
Typically, the step of cleaning and standardising includes cleaning and standardising the operator-defined attributes using the rules defined in the said metadata file and the master reference files.
Typically, the step of matching includes:
determining if the said entity information set matches with any of the entity information sets (records) already processed by the apparatus, using the said definite matching criteria in the said sequence;

determining if the said entity information set matches with any of the entity information sets (records) already processed by the apparatus, using the said ambiguous matching criteria in the said sequence; determining if the said entity information set matches with any of the family information sets (records) already processed by the apparatus, using the said definite family matching criteria in the said sequence; creating a new entity record if the entity does not match with any other entity in the master table, using the definite match criteria for entities, and updating the entity information if a record matches an existing entity record; creating a new family record if the entity information does not match with any of the existing families using the family match criteria.
According to this invention there is also provided a novel apparatus for carrying out the method as described herein above for consolidating multiple discrete records stored in a plurality of raw data files to produce a consolidated processed data file, each record in the said raw data file containing a plurality of discrete information, said apparatus comprising
a central processing unit; a memory unit co-operating with the central processing unit; a data storage unit co-operating with the memory unit; a display unit associated with the central processing unit; input means

linked to the central processing unit; a database for containing discrete processed records in table format, and capable of storage in the data storage means;
Loading means to load raw data files into the data storage means of the said central processing unit;
Loading means to load a processed data file containing cleansed and standardised records into the data storage means of the said central processing unit; and
Consolidating means adapted to be loaded in the central processing unit for consolidating the raw data files.
Typically, the consolidation means includes means to define the attributes for each of the said records in the said raw data file.
Typically, the apparatus includes means to create master reference files, and means to define the information contained in a raw data file and the rules for processing the said information, in a predetermined format, to create a discrete metadata file for the raw data file.

Typically the consolidation means include means to define matching criteria for records in a raw data file with corresponding processed records; means to process the records in at least one of the raw data files using the information defined in its metadata file, means to use the matching criteria and the cleansed and standardised records loaded into the said database to produce a list of unique entities and unique families, comparator means to compare an entity against entities existing in the list of unique entities or unique families if any in the database to determine whether the entity can be added to the list of unique entities and means to produce an output in table format in the said database, verification means to verify whether a particular unique entity created should remain unique or be merged with an existing entity, merging means to merge two entities to form a single entity, if needed; splitting means to split an entity formed by combining two or more records, into two separate unique entities, if needed, and report generating means to obtain reports about the processing of the said raw data files and consolidation of records.
The "user" is that part of the user-server architecture usually representing the front end. In accordance with a preferred embodiment of the invention, the apparatus of the invention includes a server component means and a user component means. The server component means of

the apparatus is adapted to run on a computer with the following operating systems: UNIX release V on SUN Solaris 2.5.1 (with Y2K patch) and Windows NT Server 4.0. The user component means of this apparatus typically runs on Windows NT/95/98. The apparatus of the invention includes operating means which comprises PERL scripts, C/C++ programs and VB programs. Further, the apparatus includes means for converting PERL scripts to executables, such means including the Perl2Exe utility. The software tool is preferably compatible with Oracle 7.x and 8.x and SQL Server 6.5 and 7.0.
The operating means in accordance with this invention includes a front-end, which will typically run on Windows NT/Windows95/ Windows98 and a back-end, which typically runs on either UNIX or Windows NT Server 4.0. The apparatus of this invention includes an interface linking the front end to the back end. The apparatus makes use of ODBC functionality to work with Oracle / SQL Server. Interaction between Front-end and the Back-end is done through RPCs.
The method of the invention typically involves data loading from ASCII files into the database, which is accomplished by using the data

loading utility means of the specific databases (SQL*Loader for Oracle and BCP for MSSQL Server).
In accordance with a desired aspect of this invention, the apparatus is a single user, single instance apparatus. Multiple instances of use of the apparatus should preferably not be simultaneously executed.
The method of the invention involves the initial step of setting up of the User Component means, which involves creating a flexible directory structure on the user end computer.
Typically, in accordance with this invention, one unique directory is created that contains all the requisite files (mainly executables and DLLs). A further temporary directory contains the temporary files created when the apparatus is being used.
The method of the invention also includes the step of setting up the server component of the apparatus of this invention, which resides on either Windows NT Server 4.0 or Unix (Sun Solaris). The server program for the Apparatus of this invention is installed in a fixed directory structure. A user has the option of installing the server component of the Apparatus of this invention in any directory, but within this directory

the set-up creates a fixed directory structure and the Apparatus of this
invention looks for various components within this structure. Typically,
in accordance with the method of this invention, the set-up step creates
a directory CustConsolSrv. CustConsolSrv has other directories as
given below:
CustConsolSrv/data - for data files
CustConsolSrv/refs - for reference, substitution and validation files
CustConsolSrv/preproc - for pre-processing executables
CustConsolSrv/program - for executables and control files
CustConsolSrv/meta - for metadata files
CustConsolSrv/log - for log files
CustConsolSrv/status - for method status files
CustConsolSrv/tmp - for temporary files
A unique feature of the invention is the building up of data. The data is built up through various methods as the list of unique customers gets built up.
The raw data file is an ASCII file with the data fields separated by a field separator means. The user defines the field separator means. Each metadata file has a separator associated with it, which can be changed by the user.

The invention will now be described with reference to the
accompanying drawings in which,
Figure 1 shows a schematic diagram of the apparatus in accordance
with this invention;
Figure 2 is a flow diagram of the broad process flow of the method of this
invention;
Figure 3 shows the details of the method flow of figure 2 in accordance
with this invention;
Figure 4 shows the details of the cleansing and standardisation flow used
in the pre-processing method, from which the address components used in
the matching method are derived, in accordance with this invention;
Figure 5 is a table showing the examples of standard attributes
included in the mandatory and optional fields;
Figure 6 is a table showing the sample error codes;
Figure 7 is a table showing the sample attributes available to define
matching criteria used by the method of the invention in the
matching/consolidation process;
Figure 8 is a table showing examples of correspondence between
master reference files and their extensions;
Figure 9 is a table showing the examples of a list of noise characters
removed from/substituted in the address field;
Figure 10 is a table showing examples of the mappings of last names in
the process of standardisation of last names;
Figure 11 is a table showing examples of the mappings of first names
in the process of standardisation of first names;
Figure 12 shows a sample list of common terms removed from
company names for the purposes of consolidation;
Figure 13 shows a sample list of noise characters in the city name
which are removed;

Figure 14 is a sample list of variations in city names, which are
mapped;
Figure 15 shows the process flow diagram for matching in the process
of consolidation;
Figure 16 shows the flow for family matching in the process of
consolidation;
Figure 17a is a table showing the CUSTOMER_LOAD and CUSTOMER_DTL details;
Figure 17b is a table showing the CUSTOMER_MASTER table details;
Figure 17c is a table showing the FAMILY_MASTER table details;
Figure 17d is a table showing the GENERAL_PARAMETERS table details;
Figure 17e is a table showing the MATCH_CRITERIA table details;
Figure 17f is a table showing the AMBIGUOUS_MATCH_LIST table details;
Figure 17g is a table showing the PROCESS_STATUS table details;
Figure 17h is a table showing the FIELD_DEFS table details;
Figure 17i is a table showing the ERROR_MSG table details.
Referring to the drawings, Figure 1 shows a schematic diagram of the apparatus in accordance with this invention for data management. The novel apparatus is adapted for consolidating multiple discrete records stored in a plurality of raw data files to produce a consolidated processed data file, each record in the said raw data file containing a plurality of discrete information.

The said apparatus comprises a central processing unit; a memory unit co-operating with the central processing unit; a data storage unit co-operating with the memory means; a display means associated with the central processing unit; input means linked to the central processing unit; and a database for containing discrete processed records in table format, and capable of storage in the data storage means. The input means includes loading means to load raw data files into the data storage means of the said central processing unit and loading means to load a processed data file containing cleansed and standardised records into the data storage means of the said central processing unit.
The apparatus of the invention also includes consolidation means
adapted to be loaded in the central processing unit for consolidating the
raw data files. The consolidation means includes the following
CM1 - Means to define the attributes
CM2 - Means to create the master reference files
CM3 - Means to create metadata file(s)
CM4 - Means to define matching criteria
CM5 - Means to process records
CM6 - Means to use the matching criteria to produce a list of Unique
entities and list of unique families.
CM7 - Comparator means
CM8 - Verification means
CM9 - Merging Means
CM10 - Splitting means

CM11 - Report generating means
The system has User-server architecture. The User component means is housed typically in a computer having a Windows based system (NT / 95 /98) while the server is a Win NT or a UNIX server. The Hardware should be Intel X86 / Sun compatible. A flow diagram of the architecture of the apparatus of this invention is as shown in Figure 1 of the accompanying drawings.
The front end is a means typically in Visual Basic that provides the user interface to the apparatus. The user co-operates with the server component means through a middle layer means, which is a collection of RPC programmed means whose structure is compatible with that of SUN's RPC. The back-end means are typically in PERL/C/C++ and process the data and produce the results.
The front end connects to the database using ODBC.
Data Fields
There are two types of fields in the data: a) System Defined fields and b) User Defined fields. Among the System Defined fields, the mandatory and optional fields in the data are shown in the table in Figure 5 of the accompanying drawings.
User Defined Fields
Besides the fields given in the table, user has the option to add more fields. Processing of a user-defined field is done on the basis of the type of field, substitution list, and validation list.

Other Fields
Other fields (fields not mapped to system defined user fields), if
present in the feed data, are ignored.
Process Flow
The broad process flow in accordance with this invention is shown in figure 2 of the accompanying drawings and figure 3 shows the details of the process flow.
Data Extraction
The data extraction and pre-processing modules have been merged. The Pre-processing module extracts the data from a raw data file as per the details given by field specification file (metadata file), one record at a time.
Pre-processing
Purpose
Pre-processing module of the Apparatus of this invention is developed
in Perl 5.00503. It mainly does string processing and text
manipulation. An executable is made out of the Perl code using
Perl2Exe utility. Although, Perl code is independent of operating
system, the executable obtained by Perl2Exe is operating system
dependent.
Scope
Pre-processing is important in the Apparatus of this invention because it is expected to receive data from different feed systems. The consistency and integrity of the data cannot be guaranteed because of the volume of data. Data entry operators enter this data at a fast speed

and hence the cleanliness of data cannot be ascertained. The pre-processing of raw data files will ensure that most of the errors that creep in during data entry do not affect the recognition of an individual. The two components of address that are used in the matching method are derived in the pre-processing method. The details of the flow for pre-processing for cleansing and standardisation are seen in Figure 4 of the accompanying drawings.
Components of Pre-processing
Cleaning the raw data files
This module includes the functionality for the following:
• Removing multiple spaces and replacing them with a single space
• Removing spaces before and after the field separator in the file
• Converting all lower-case characters to upper case
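By way of illustration only, a minimal Perl sketch of this cleaning pass is given below; the '|' field separator is an assumption, since the apparatus reads the actual separator from the metadata file.

```perl
#!/usr/bin/perl
# Minimal sketch of the record-cleaning pass; the '|' separator is assumed
# here, whereas the apparatus reads the separator from the metadata file.
use strict;
use warnings;

my $sep = '|';
while (my $line = <STDIN>) {
    chomp $line;
    $line =~ s/\s+/ /g;                   # collapse multiple spaces into one
    $line =~ s/\s*\Q$sep\E\s*/$sep/g;     # drop spaces around the field separator
    $line = uc $line;                     # convert lower case to upper case
    print "$line\n";
}
```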
Separation of individual customers and company customers
The recognition of company and individual records is necessary, as the two types of names are treated in a different manner during name processing. The separation is based on the source data (if available) or by identifying the companies from their names.
If the source data has a category field that separates an individual from
company, then proper substitution and validation files should be
provided so that the source category code can be mapped to 0 (zero)
for individual or 1 (one) for company.
If the source does not have a category field, the individual and company record is separated based on some common words / terms used in the company names.

A sample list of these words:
ACCESSORIES
AGENCIES
AGENCY
AGRO
APPARELS
APPLIANCES
AND
ASSN
ASSOC
ASSOCIATES
ASSOCIATION
AUTO
AUDIO
AUTOMOBILES
Any name having any of these words is considered to be a company; otherwise it is considered to represent an individual.
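A sketch of this test, using only the sample words above, could be as follows; the full word list is held in a reference file, and names are assumed to have been cleaned already.

```perl
# Sketch of the individual/company split when no category field is available.
# Only the sample words listed above are used; the apparatus reads the full
# list from a reference file, and the name is assumed to be cleaned already.
my %is_company_word = map { $_ => 1 } qw(
    ACCESSORIES AGENCIES AGENCY AGRO APPARELS APPLIANCES AND
    ASSN ASSOC ASSOCIATES ASSOCIATION AUTO AUDIO AUTOMOBILES
);

sub is_company {
    my ($name) = @_;
    return 0 + grep { $is_company_word{$_} } split /\s+/, uc $name;
}
# is_company('SHARMA AUTO AGENCIES') is true; is_company('HEMANT JOSHI') is false.
```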
Name processing of individual and company customers
Individuals
Noise removal:
The noise elements from the name are removed. Anything other than [A-Z] and zeros is considered as noise in the name. Zero is considered as a typographical error (for O) as it is very close to "O" on the keyboard and prone to error because of the similarity in looks. If a zero happens to be in a sequence of numbers then it is not considered to be a typographical error. Also, if more than two zeros are found in

succession, then it is not considered to be a typographical error. Apart from such cases, zero is replaced with 'O' (the alphabet 'oh').
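A short sketch of this noise-removal rule follows; the treatment of exactly two adjacent zeros is an assumption, since the text only states that zeros inside numbers and runs of more than two zeros are left alone.

```perl
# Sketch of the noise-removal rule for individual names.  A zero is turned
# into the letter 'O' only when it is not adjacent to another digit and not
# part of a run of more than two zeros; treating a standalone "00" as "OO"
# is an assumption.
sub clean_individual_name {
    my ($name) = @_;
    $name = uc $name;
    $name =~ s/(?<![0-9])(00?)(?![0-9])/'O' x length($1)/ge;  # typographical zeros
    $name =~ s/[^A-Z ]+/ /g;    # everything other than A-Z is noise
    $name =~ s/\s+/ /g;         # tidy the spaces left behind
    $name =~ s/^ | $//g;
    return $name;
}
# clean_individual_name('J0SHI HEMANT') returns 'JOSHI HEMANT'.
```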
Title removal
Titles from the name are also removed. The sample list of titles for individuals is given below:
AIR ADMIRAL BRIG CAPT
CAPTAIN CDR CMDE COL
COLONEL COMMANDER DR FLT
GEN GENERAL GROUP GRP
HUF KUM LT MAJ MAJOR
MINOR MISS MR
MRS MS RETD RTD
SHRI SHREE SMT SOU
VICE
An ambiguous title (one which can be a valid last name) is removed
only if a valid last name is found in the name. Examples of ambiguous
titles are as follows:
PANDIT
CHOWDHURY
KHAN
DEWAN
Name Components
The Customer Name is broken up into its sub components. The components associated with Name are Last Name, First Name and Rest of the Name (initials).

For individuals, the Last Name is identified using the Last Name lookup files. The Customer Name is assumed to be in one of the formats given below:
A.
B.
C.
D.
Here, A, B, C and D are the codes for four name formats.
The user is required to rank the name formats.
The name subpart identification depends on the position of the last
name in the name format. Depending on which format between (A)
and (B) has the higher rank, first or the last word from the name field is
checked for last name in the look up files.
If user ranks format (A) higher than format (B), the first word from the name field is checked for last name in the look up files, to start with. If there exists a last name matching the word, then the name is said to be in the format (A). Else, the last word of the name is looked up as a possible last name. If found, the name is said to be in the format (B). If the last name is still not found then the name is checked for the other two formats.
The name is said to be in format (C) if all the words except the last
word are either single letters or made of only consonants.
The name is said to be in format (D) if all the words except the first
word are either single letters or made of only consonants.
When the name is found to be in (C) or (D) format the last name is
taken as 'NOT AVAILABLE'. This is for the purpose of facilitating
the matching method.

If the name is detected to be in any of the defined formats, then the sub-components of the name are picked up based on the format. If the name cannot be fitted into any of the formats, it is assumed to be of the highest ranked format and the sub-components are populated accordingly. This might result in the error "New Last Name".
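The format (C)/(D) test lends itself to a short sketch, given below; formats (A) and (B) are omitted because they need the last-name lookup files.

```perl
# Sketch of the format (C)/(D) test: every word except the last (format C),
# or except the first (format D), must be a single letter or consist only of
# consonants.  Formats (A) and (B), which need the last-name lookup files,
# are not covered by this sketch.
sub is_initial_like {
    my ($w) = @_;
    return length($w) == 1 || $w !~ /[AEIOU]/;
}

sub name_format_cd {
    my @words = split /\s+/, uc shift;
    return undef if @words < 2;
    return 'C' unless grep { !is_initial_like($_) } @words[0 .. $#words - 1];
    return 'D' unless grep { !is_initial_like($_) } @words[1 .. $#words];
    return undef;    # fall back to the highest-ranked format
}
# name_format_cd('R K SHARMA') returns 'C'; name_format_cd('SHARMA R K') returns 'D'.
```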
Suffix detection
Some names are suffixed with some common suffixes (like Kumar).
These suffixes might be concatenated with the first name, written as a
separate word or omitted altogether. Therefore the name field is
searched for the words which are commonly used as the suffix to the
first name as well as the middle name. A list of some name suffixes is:
KUMAR
CHANDRA
PRASAD
PRAKASH
BHAI
BEN
RAO
If the first name contains any suffix, it is removed from the first name and it is treated as part of the middle name.
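A sketch of this suffix handling, using the sample suffix list above, is given below; the real list comes from a reference file.

```perl
# Sketch of first-name suffix handling: when the first name ends in one of
# the common suffixes, the suffix is stripped and prefixed to the middle name.
my @name_suffixes = qw(KUMAR CHANDRA PRASAD PRAKASH BHAI BEN RAO);

sub split_suffix {
    my ($first, $middle) = @_;
    for my $sfx (@name_suffixes) {
        if ($first =~ /^(.+)\Q$sfx\E$/) {          # suffix concatenated to the name
            $first  = $1;
            $middle = length($middle // '') ? "$sfx $middle" : $sfx;
            last;
        }
    }
    return ($first, $middle);
}
# split_suffix('RAJKUMAR', '') returns ('RAJ', 'KUMAR').
```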


Standardisation of last names
Last name is standardised against a list of variations available for a given last name, e.g. Aggarwal and Agarawal are mapped to Agarwal. The sample list of such mappings is shown in the table in Figure 10 of the drawings.
Standardisation of first names
First name is standardised against a list of variations available for a given first name, e.g. Praveen is mapped to Pravin. The sample list of such mappings is in Figure 11 of the accompanying drawings.
Companies
For company records the common terms (like pvt, ltd, sons, etc) are removed from the name. This is done only for the purposes of customer consolidation, and the name of the customer in the final output contains the said common terms. The sample list of common terms is given in Figure 12 of the accompanying drawings.
For the companies, the name is logically split into first name, middle name and last name, considering that the name is in a predetermined format. The name considered here is sans the common terms.
For both Individuals and Companies, the First Name and Middle Name components are optional, whereas the last name is mandatory. The name and last name are populated as "NOTAVAILABLE" in case of a missing Customer Name.

Pin processing
Spaces are removed from the feed record's PIN code.
The PIN code is checked to see whether it is good. A good PIN has the following
characteristics:
a) The PIN is numeric
b) The PIN has 6 digits, and
c) The first digit of the PIN lies between 1 and 8.
If the pin is in the nnn-nnn format (where n is any number), that too is cleaned and treated as a good PIN if conditions a, b and c are satisfied. If the PIN information received from the feed system is found not to meet the above criteria, the PIN is assumed to be 0 (zero). In such cases, the address is searched for a PIN and the same is populated, if found. For details, see "PIN search in address".
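A short sketch of the good-PIN test described above:

```perl
# Sketch of the "good PIN" test: after removing spaces (and the hyphen of the
# nnn-nnn form), the PIN must be a six-digit number whose first digit lies
# between 1 and 8.  A bad PIN is returned as 0 (zero).
sub good_pin {
    my ($pin) = @_;
    $pin = defined $pin ? $pin : '';
    $pin =~ s/\s+//g;                       # spaces are removed first
    $pin =~ s/^(\d{3})-(\d{3})$/$1$2/;      # nnn-nnn is cleaned to nnnnnn
    return $pin =~ /^[1-8]\d{5}$/ ? $pin : 0;
}
# good_pin('400 057') and good_pin('400-057') both return '400057'.
```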
Address processing
The address processing constitutes the following:
a) Removal of noise: Any character other than alphabets and numbers is considered as noise in the address. A sample list of noise characters in the address is shown in the table of Figure 9.
b) Converting roman digits to decimals: The roman letters from I to X are converted into their corresponding decimals. I, V and X as single characters are not converted to their respective decimals as they might carry a different meaning (I and V can be part of the initials of some name after whom, say, the street is named, whereas X can also represent "cross road"). But these are converted into decimals when they signal a numeric meaning, e.g. "I ST", "V TH" etc.
c) Numbers in the format "First", "Second" etc. are converted to decimals i.e. 1,2 etc.

City processing
d) Removing noise from the city name: The sample list of noise characters in the city name is given in Figure 13 of the drawings.
e) Standardising the city name: A table of variations in the city name is consulted to standardise the city names. The table is given in Figure 14 of the accompanying drawings.
Pin search in address for bad pins
The search for a good PIN in the address is conducted for those records, whose PIN information received from the feed system is bad (The PIN could not be categorised as a good PIN). The PIN is assumed to occur in either of the following formats in the address string.
a) As a 6 digit number
b) As a set of two three digit numbers separated by space
c) As a three, two or one digit number preceded by a city name. The city should be one of the cities in the city-pin map. This map is used to derive the first three/four digits of the PIN, and complete the PIN in the method. If the number found in the address is of two digits (one digit), then one zero (two zeros) is padded to complete the three digits.
For example, the address contains ".... BOMBAY 19".
Here, BOMBAY is mapped to MUMBAI (standardised) and then the
first three digits of the PIN for Mumbai are searched. That would
return 400. As the number in the address is 19 a zero is left padded to it
and the PIN for the address is 400019.
The search for the PIN is conducted in the last five "words" of the
address. A "word" is defined as a set of characters terminated by space.
If the pin is found to be in format "city - pin" then the city field too is
populated with the city from the address, else the city is as received
from the feed system (after cleaning and standardisation).
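The PIN search can be sketched as below; the %city_pin_prefix map is a hypothetical stand-in for the city-pin master file, and only three-digit prefixes are assumed (the master file may also carry four-digit prefixes).

```perl
# Hedged sketch of the PIN search for records with a bad PIN.  %city_pin_prefix
# stands in for the city-pin master file and holds only illustrative entries;
# three-digit prefixes are assumed here.
my %city_pin_prefix = (MUMBAI => '400', PUNE => '411');

sub pin_from_address {
    my ($address) = @_;
    my @words = split /\s+/, uc $address;
    splice @words, 0, @words - 5 if @words > 5;            # keep the last five "words"
    for my $i (reverse 0 .. $#words) {
        my $w    = $words[$i];
        my $prev = $i > 0 ? $words[$i - 1] : '';
        return $w        if $w =~ /^[1-8]\d{5}$/;                        # six-digit PIN
        return "$prev$w" if $w =~ /^\d{3}$/ && $prev =~ /^[1-8]\d{2}$/;  # "nnn nnn" pair
        if ($w =~ /^\d{1,3}$/ && exists $city_pin_prefix{$prev}) {       # "city nn" form
            return $city_pin_prefix{$prev} . sprintf('%03d', $w);        # left-pad zeros
        }
    }
    return 0;
}
# After BOMBAY has been standardised to MUMBAI, pin_from_address(
# '1 VIVEK BLDG MAMLATDAR WADI ROAD MUMBAI 19') returns 400019.
```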

Removal of pin and city from address
The city and PIN information is removed from the address to negate
their effect on the derived components of the address.
Removal of phone from address
Any number of 7 digits or more in the address is considered as a phone
number and is removed before deriving the address components.
Any number of 5 digits or more preceded by TEL or PH or PHONE is
considered as a phone number and is removed from the address.
Derivation of city's and state's name from PIN
If the city/state field is blank and the PIN is a good PIN, the city/state field is derived from the PIN. The mapping between PIN and city/state name is available from a master list.
Derivation of state's name from city's name
If the state field is blank and the PIN is a good PIN, the state name is populated using the city name, if possible. The mapping from city to state is available from a master list.
Derive components from address
The following components are derived from the address:
a) Representative 1 (REP1): All numeric occurrences in the Address. The numbers are sorted in the descending order (to avoid losing zeros).
b) Representative 2 (REP2): Numeric representation of sorted consonants in the Address. Only one occurrence of each consonant is considered. The consonants are part of a 26-bit string (from A to Z), with each bit getting the value "1" when the respective consonant is present in the

address and 0 (zero) otherwise. The vowels always have a 0 value. This forms a binary representation of some number, which is used for further processing. This is done primarily to ease comparison between two such strings.
For example, consider the address 140/20 Parijat, Hanuman Cr Rd No 2, Vile Parle (East) Bombay 400057
The PIN and city are removed from the address considered for extracting REP1 and REP2. The remaining address is:
140/20 Parijat, Hanuman Cr Rd No 2, Vile Parle (East)
which gives
REP1 = 422100
REP2 = CDHJLMNPRSTV = 00110001010111010111010000 = 12940752
The default value is "0" for REP1 and null for REP2.
Processing of User Defined fields
Each of the user-defined fields is processed as follows:
a) If substitution is needed, the pre-processing program will do the substitution on this field using the substitution rule given in the substitution file. The substitution file contains the substitution rules in a specific format described later.
b) If validation is needed, the pre-processing program will validate the field's value against a list of values/patterns given in the validation file. If the field's value finds a match in the validation list, it is passed to the matching program; otherwise the default value specified for the field in the metadata is passed to the matching program. The validation file and its format are described later.

Exception Conditions
During various steps in Pre Processing, an exception code can be
associated with a record if the record fails certain non-severe
conditions of pre processing at identified steps.
The record is written to an Exception File
(..err), but would still be considered for the
Consolidation/Matching method. (ProcNo is the method number
allocated by the operating system.) Currently the exception codes or
sample error codes that have been identified are provided in the table
shown in Figure 6, which are by no means exhaustive.
CUSTOMER CONSOLIDATION
Matching/Consolidation module of Apparatus of this invention is developed in C/C++. It includes ODBC programming and PL/SQL code for database operations.
System Architecture
The Matching/Consolidation method involves the following processes. Based on the matching criteria specified by the user, the Matching/Consolidation method will use one or more attributes/fields from the table shown in Figure 7.
Of the fields shown in figure 7, all except Address and Initials are matched exactly.
The user can define the tolerance for the address, which is internally taken as the notion of allowable difference in the consonant strings of the two records being matched. If the tolerance is set as 0 (zero), then it is equivalent to exact matching of address.

The user can define whether the initials should be matched exactly, or whether the initials of one record should be a sub-sequence of the initials of the other record. E.g. the sub-sequences of XYZ are XYZ, XY, YZ, XZ, X, Y, Z and the empty string.
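The sub-sequence mode of the initials comparison can be sketched as follows:

```perl
# Sketch of the "sub-sequence" mode of the initials comparison: the shorter
# set of initials must appear, in order, within the longer one.
sub initials_subsequence_match {
    my ($p, $q) = @_;
    ($p, $q) = ($q, $p) if length($p) > length($q);      # $p is the shorter string
    my $pattern = join '.*', map { quotemeta } split //, $p;
    return $q =~ /^.*$pattern.*$/ ? 1 : 0;               # an empty $p matches anything
}
# initials_subsequence_match('XZ', 'XYZ') is true;
# initials_subsequence_match('ZX', 'XYZ') is false.
```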
DATA LOADING
Data Loading method will load the data fields into the CUSTOMER_LOAD table seen in Figure 17a. The data is loaded using SQL*Loader / BCP and is then processed and put in the final tables. At the end of the matching (consolidation) method the feed table CUSTOMER_LOAD is ideally empty.
CUSTOMER CONSOLIDATION
The Matching (Consolidation) processes will work on the data in the CUSTOMER_LOAD table and create the consolidated records. All data about all the unique records is stored in the CUSTOMER_MASTER table seen in Figure 17b. The reference of all the feed system records connected with the unique master record is stored in the CUSTOMER_DTL table seen in Figure 17c. The records that had an ambiguous match with the master records are kept in the AMBIGUOUS_MATCH_LIST table seen in Figure 17g for further processing. Ambiguous matches are processed with user intervention. The data in the feed table is deleted as a part of the Matching/Consolidation method. The data in the consolidation tables is retained for subsequent modifications and additions. A unique identifier is generated for each customer as a part of the Matching/Consolidation method. The method flow diagram is given in Figure 3 of the accompanying drawings.

Method Description
After the pre-processing is complete, the output file contains all the fields required for matching and the fields derived during the pre-processing. The pre-processed data is loaded in the CUSTOMER_LOAD table shown in Figure 17a and the consolidation method populates the CUSTOMER_MASTER shown in the table in Figure 17b, FAMILY_MASTER shown in Figure 17d, and CUSTOMER_DTL shown in Figure 17c along with history and journal entry tables for the customer and the master tables for city, state and country.
Matching is done on the basis of matching criteria provided by the user. User can provide a maximum of ten matching criteria (five for Definite Match and five for Ambiguous Match) as described herein below.
In the matching criterion, when the user includes alphabetic component
of address, he must supply the tolerance for the matching. The
tolerance works in the following manner:
A tolerance is accepted when comparing the alphabetic component of
address ADDR_COMP2. Here, the strings are not matched character
by character, but a specified level of mismatches of characters is
permitted. This is done to overcome the mismatch when one of the
addresses has some additional information. For e.g., consider the
following two addresses
1 VIVEK BLDG OPP MUN OFFICE MAMLATDAR WADI ROAD
MALADW
1 VIVEK BLDG MAMLATDAR WADI ROAD MALAD W

The consonants C, F, N and P are not present in the second address, which causes the addresses to be treated as different if a 100% match is done on alphabetic component of the address. But if a tolerance of 4 characters were allowed, then the two addresses are considered as matching.
Note: The matching of records (customers) is not accomplished by matching the ADDR_COMP2 only. The last name, first name, initials, PIN and ADDR_COMP1 are also usually involved. Hence allowing for a tolerance of 4 characters' mismatch while matching ADDR_COMP2 does not cause much harm; instead it does a lot of good in detecting similar addresses as shown above. The actual number should be decided based on the desired accuracy and the actual ground-level data.
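One way to realise this tolerance, assuming the alphabetic component is carried as the REP2 consonant bitmap derived during pre-processing, is to count the differing bits:

```perl
# Sketch of the address tolerance, assuming ADDR_COMP2 is carried as the REP2
# consonant bitmap derived during pre-processing: the number of consonants
# present in one address but not in the other is the number of differing bits.
sub address_within_tolerance {
    my ($rep2_a, $rep2_b, $tolerance) = @_;
    my $bits       = sprintf '%b', (0 + $rep2_a) ^ (0 + $rep2_b);  # differing consonants
    my $mismatches = $bits =~ tr/1//;                              # count the set bits
    return $mismatches <= $tolerance;
}
# With a tolerance of 4, the two MAMLATDAR WADI addresses above (which differ
# in C, F, N and P) are treated as matching; with a tolerance of 0 they are not.
```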
The matching method has the following broad steps
1. Pick up a record from CUSTOMER_LOAD
2. Find a match using Definite Match Criteria (in descending order)
3. A) If a match is found, then update the customer information and link the record to the existing customer.
B) If a match is not found, go to step 4.
4. Find a match using Ambiguous Match Criteria (in descending order)
5. Create a New customer record. If the customer matches with one or more customers with the ambiguous match criterion, then update the AMBIGUOUS_MATCH_LIST table shown in Figure 17g.
6. Record the history of updates, if any.
7. Delete the record from CUSTOMER_LOAD
The steps 1-7 are repeated until there are no more records left in the CUSTOMER_LOAD table.
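The control flow of steps 1 to 7 can be sketched as below; the database operations are reduced to stub subroutines, since the actual table access (CUSTOMER_LOAD, CUSTOMER_MASTER, CUSTOMER_DTL, AMBIGUOUS_MATCH_LIST) depends on the RDBMS in use.

```perl
# Structural sketch of steps 1-7; the subroutines below are stubs standing in
# for the actual database operations described in the specification.
sub fetch_load_record  { return undef }    # step 1: next CUSTOMER_LOAD record
sub definite_match     { return undef }    # step 2: id of a definite match, if any
sub link_to_customer   { }                 # step 3A: update and link the record
sub ambiguous_matches  { return () }       # step 4: ids of ambiguous matches
sub create_customer    { return 1 }        # step 5: new CUSTOMER_MASTER template
sub record_ambiguity   { }                 # step 5: AMBIGUOUS_MATCH_LIST entry
sub record_history     { }                 # step 6: history of updates
sub delete_load_record { }                 # step 7: remove from CUSTOMER_LOAD

while (defined(my $rec = fetch_load_record())) {
    if (my $cust_id = definite_match($rec)) {
        link_to_customer($rec, $cust_id);
    }
    else {
        my $new_id = create_customer($rec);
        record_ambiguity($new_id, $_) for ambiguous_matches($rec);
    }
    record_history($rec);
    delete_load_record($rec);
}
```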
For each record a template is created in CUSTOMER_MASTER table and the corresponding record is created in CUSTOMER_DTL table.

For matching records, only a single template is created in CUSTOMER_MASTER table and all the record links are created in CUSTOMER_DTL table.
As the records are inserted in the CUSTOMER_MASTER table, a unique id is associated with each template; this is the customer id associated with each unique customer in the system.
FAMILY CONSOLIDATION
The user is able to define the criterion for family consolidation. The family consolidation criterion is only a definite criterion (ambiguous matching is not supported). The family link of the customer changes with any change in the consolidation field values. The flow for the family consolidation is as illustrated in figure 16 of the accompanying drawings.
Front End
The front end of the apparatus is made up of various modules. Front-end for Apparatus of this invention is developed in VB 6.0 on Windows NT 4.0. It also has a few DLLs generated from C programs using MSDEV 6.0.
LIST OF SCREENS
The front-end for Apparatus of this invention has the following screens
1. A main screen that serves as a base for all the links to different functions
2. "Metadata Information" screen for metadata management
3. "Consolidation Criterion" screen to manage the consolidation criterion
4. "Reference Files" screen to manage the reference, substitution and validation files

5. "Manage Database" screen to add field(s) to the tables
6. "General Parameter Settings" screen to view the general parameters shown in the table of Figure 17e.
7. "Manage Method" screen to start / restart the method and check its
status
8. "Method Ambiguous List" screen to manually process the ambiguously matched records
9. "Manual Merge / Split" screen to manually merge two customers or split a customer into two based on the feed system identities.
10. "Errors and Information" screen to view the pre-processing errors, list
of new last names and list of mergers.
11. "About" screen that gives information about Apparatus of this
invention.
FUNCTIONS SERVED BY FRONT-END
Addition of User defined fields
The user is able to add new fields to the database with the help of "Manage Database" screen. Default database table is created with the set-up of Apparatus of this invention. This module is responsible for adding the additional columns (fields) in the CUSTOMER_MASTER and CUSTOMER_LOAD tables. User will get to view the list of fields that already exist in the database table and the properties of each field.
Defining Substitution Rules
User has the option of specifying a list of substitutions, which can be used while pre-processing. This is done through the screen "Managing Reference Files". Each such list of substitutions is stored as a substitution file on server in CustConsolSrv/Refs. Each line in the

substitution file contains one pattern-replacement pair. The format for the
pattern-replacement pair is:
pattern||replacement
The pattern can be formed using the generic rules of regular expression
as defined in the PERL v5.00503 documentation. Here "||" (two pipe
characters without any other character in between them) is the
separator.
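A sketch of reading and applying such a substitution file is given below.

```perl
# Sketch of reading a substitution file (one "pattern||replacement" rule per
# line, Perl regular expression syntax) and applying it to a field value.
sub load_substitutions {
    my ($file) = @_;
    my @rules;
    open my $fh, '<', $file or die "cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($pattern, $replacement) = split /\|\|/, $line, 2;
        push @rules, [$pattern, $replacement // ''] if defined $pattern && length $pattern;
    }
    close $fh;
    return \@rules;
}

sub apply_substitutions {
    my ($value, $rules) = @_;
    $value =~ s/$_->[0]/$_->[1]/g for @$rules;   # apply every rule in file order
    return $value;
}
```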
Defining Validation Rules
User has the option of creating a list of values/patterns against which a data field can be validated. Each such list is stored as a validation file on Server in CustConsolSrv/Refs. Each line of the validation file contains a valid pattern against which the value of a data field needs to be validated. The pattern can be formed using the generic rules of regular expression as defined in the PERL v5.00503 documentation.
Adding Values in the Reference Files
User is allowed to modify any of the reference files, which are provided by the development team. Through the front end provided, the user can also create new substitution or validation files, if needed. The user should exercise sufficient care in modifying the reference files.
Defining Metadata
User has to set the following parameters to define the metadata:
Raw Data File
User may choose to see a snapshot view of a raw data file, before creating a metadata file. The user gets a list of data files, which is obtained from the CustConsolSrv/data directory on the server. Once the user

chooses a data file from the list, he/she will get first 10 records of that file on display.
Metadata File
User can select an existing metadata file or he/she can create a new
metadata file. If the user selects a metadata file, the relevant
information is populated on the screen. The user can modify the
information and save it.
Metadata Specification
The user can define / modify the following parameters:
1. Field Separator in the data file.
2. The order of the name formats.
3. For System Defined fields -

• Field Numbers of Feed System ID and Customer ID (Field number is assumed to start with one (1))
• a) If the name is a single field, the field number of that field
b) If the name is split, the field numbers of Last name, first name and initials fields.
If the name is not split as last name, first name and initials then the field numbers of the fields that combine to give the name should be entered against "name".
• Field number of the title field, if any.
• The field numbers of the address line(s)
• The field numbers of PIN code, city and state fields. Multiple fields should be separated by a '+' sign.
4. User defined fields - for each user-defined field the user has to provide
the following information:

• Field Name. User will get a list of fields that exists in the CUSTOMER_LOAD table, from which he/she can choose the additional fields.
• Field number (in the raw data file)
• Substitution file
In case substitution is needed, the user needs to provide a substitution file. A list of substitution files, available at the server, is provided to the user. If the user wants a new substitution list, he/she has to create a new substitution file beforehand.
• Validation file
In case validation is needed, a validation file must be provided. A list of validation files, available at the server, is provided to the user. If the user wants a new validation list, he/she has to create a new validation file beforehand.
• For a date field, the user needs to provide the relative position of date,
month and year in the data. It is assumed that the date exists without
the time component, and does not have a weekday name. The year is
expected to be in 4 digits. The user is given the following list for the
relative positions, to choose from
D-M-Y
D-Y-M
M-D-Y
M-Y-D
Y-D-M
Y-M-D
When more than one field from the source combines to form one field
in the metadata, all the source field numbers must be entered separated
by a '+' sign. The fields are then concatenated with a space between
the two fields.

5. Fields that do not exist in the raw data set - The user has an option of inserting fields which do not exist in the raw data. For such fields the user must provide a default value. The pre-processing method populates this field in the Load File with the value specified by the user.
6. The user can also define a default value for a field that exists in the raw data file, but is sparsely populated. The default value is considered if the field value happens to be 'NULL' after applying the substitution rules.
NOTE: One field can be mapped to more than one field of raw data set. To do so, user can specify a list of column numbers separated by a plus (+) symbol, e.g. if address appears in four columns (numbered 5,6,7,8) in raw data set, then user has to specify "5+6+7+8" in the "raw data field no(s)" field against the address field.
Matching Criteria Specification
The User has to provide the basis of matching through the matching criteria as seen in the table shown in Figure 17f. There are two types of criteria: a) Definite Match Criterion, b) Ambiguous Match Criterion.
A maximum of five criteria can be given in each of these two categories. The user has to specify at least one definite match criterion.
The criterion is defined based on the combination of the columns. The user can also specify a specific value to a field as part of the criterion. Matching criteria specification is stored in the database tables on the server. Value cannot be specified for address.
Process Control
This module includes the functionality for the following

• Starting a consolidation process
• Re-starting a consolidation process, which didn't complete on last run
• Method status
Reports
This module allows the user to view various reports. The user is able to view the following reports:
• Error Logs (missing name/ missing address/ missing or invalid PIN, etc.)
Manual Intervention / Updates
In case of ambiguous matches, the master record and the new record are displayed simultaneously. The user has an option to specify whether the two records represent the same customer or different customers. If the user indicates that the two records represent one customer then the new record is merged with the master record; else, if the user indicates that the records represent different customers, they are retained as is.
INTERFACE
Interface between the front-end and the back-end is done through RPC programs.
DESCRIPTION
The interface between the front-end and the back-end of Apparatus of this invention is accomplished using RPC programs compliant with

SUN Solaris. The SUN Solaris compliant RPC programs for Windows are available with the development team.
The RPC programs support the following functions:
Getting a Unique Method-id
Whenever the user starts a new process, the front-end requests a unique method identifier from the server. The server generates a unique method-id that has the format p.m.d, where:
p is a program counter generated by the server,
m is the current month (1 for January, 2 for February, etc.), and
d is the current day of the month (1-31).
Transfer of Metadata Information
Metadata file is transferred from user to server as a stream of characters. The RPC function on the server rebuilds the metadata file from this string. The metadata file is created in the metadata directory (CustConsolSrv/meta).
Transfer of Substitution/Validation/Secondary reference files
The user is able to create/modify substitution/validation/reference files in a predetermined format. Any newly created file is immediately transferred from the front-end to the back-end through RPC. The contents of the file are transferred as a single stream of characters. A new-line character separates each record. At the back-end, an RPC function creates this file and saves it in the CustConsolSrv/refs directory with the appropriate extension. If a file with the same name exists, it is overwritten.

Getting list of files
Data Files
All the data files must reside in CustConsolSrv/data directory on the
server. A data file should have an extension ".dat".
This function looks for all the data files available on the server in the
CustConsolSrv/data directory.
Reference files / Substitution files / Validation files
All the reference files, substitution files and validation files must reside
in CustConsolSrv/refs directory on the server. These files are identified
by the filename extensions. The correspondence between the three types of
files and their extensions is seen in Figure 8.
Getting records from Data file
This functionality provides a snapshot of the raw data to the user. An RPC function returns the first ten records of a data-file to the front-end, where the data-file's name comes as input from the front-end. This function looks for the data-file on server in the Data directory, with .dat as the extension.
Starting Pre-processing, Data Loading and Matching Methods
A wrapper program is called by the RPC function to start or restart a method. For a new process, the wrapper invokes the Pre-processing, data loading and matching programs in this sequence. Data loading is started only when pre-processing is completed successfully. Similarly, Matching process is started only when the data loading is completed successfully.
Getting Pre-processing Status
Pre-processing module of Apparatus of this invention will generate a status file. Status file contains the number of records processed till that

point in time. The number of records processed is updated after processing of every 100 records. Status file has a standard name format: .sts and it resides in a standard directory CustConsolSrv/status.
Restarting Pre-processing Method
This option allows the user to restart a pre-processing method, which terminated before completion. When a pre-processing method is restarted, pre-processing will skip the records that have already been processed. The number of records processed till the previous run is found from the Load file generated by the pre-processing module (the load file is unique for a data file).
Getting Data loading status
To get the status of Data loading, it is checked if the data loading programs are executing. If yes, then data load is in progress. If no, then the log file is checked for existence. If the log file does not exist, then the data load did not start. If the log file exists, then it is read to decide if the data load is complete or incomplete.
Restarting Data Load
To restart the data load, corrections must be made to the pre-processed data file. Also the user should make sure that another data load or matching method is not currently running. Calling the wrapper program with the appropriate flag restarts the data load.
Getting Matching Method Status
The matching method maintains a status file that defines its current status. The status file is updated if the matching method fails due to a trappable error. The status file contains the error text and optionally the

ID of the record being processed currently. The status file is also updated on successful completion of the process.
When the status file says that the process is running, it is checked against the system whether the process is actually running. If it is not found to be running, then it is assumed to be killed.
Restarting Matching Process
Calling the wrapper program with appropriate flag will restart the
matching process.
Figure 15 illustrates a typical overall matching logic.
Following are the known limitations of the apparatus. These can be improvements or additions within the nature and the scope of this invention.
1. The apparatus expects only one name in the name column. In some cases the names of all the joint holders are entered in the same field. In those cases, the derived name is vastly different from the actual name.
2. The apparatus expects input data from ASCII file in a
predefined format. It has not been integrated with the database to take
the input data.
3. If the PIN column has a valid PIN and the address has a short
PIN (say the PIN column has 411013 and the address has Pune - 13)
then the short PIN is not removed from the address. This affects the
derived address components and hence might affect the consolidation.

4. Typographical and phonetic errors in the first and last name of the customers are not addressed fully.
5. The apparatus expects the date field to contain only the date. Also, year should be in four digits. Apparatus is not able to handle the date field if the year is in two digits. If the date field contains time and/or weekday name, apparatus is not able to handle it.
6. If the city name found in the city name field, city name derived from the PIN code, and the city name derived from the last five words of address are all different, apparatus will not be able to decide on the correct city name.
REPORTS
Following reports are generated.
• Error Logs (missing name/ missing address/ missing or invalid PIN, etc.)
It is envisaged that various modifications and improvements can be incorporated into the apparatus and method of this invention without departing from its nature and scope.

We Claim:
1. Apparatus for consolidating multiple discrete records stored in a plurality of raw data files to produce a consolidated processed data file, each record in the said raw data file containing a plurality of discrete information, said apparatus comprising
a central processing unit; a memory unit co-operating with the central
processing unit; a data storage unit co-operating with the memory unit; a
display unit associated with the central processing unit; input means linked to
the central processing unit; a database for containing discrete processed records
in table format, and capable of storage in the data storage means;
loading means to load raw data files into the data storage means of the said
central processing unit;
loading means to load processed data file containing cleansed and standardised
records into the data storage means of the said central processing unit;
consolidating means adapted to be loaded in the central processing unit for
consolidating the raw data files;
said consolidating means consisting of:
(i) means to define the attributes for each of the said records in the said raw
data file;
(ii) means to create master reference files;

(iii) means to define the information contained in a raw data file and the rules
for processing the said information, in a predetermined format to create a
discrete metadata file for the raw data file;
(iv) means to define matching criteria for records in a raw data file with
corresponding processed records;
(v) means to process the records in at least one of the raw data files using the
information defined in its metadata file;
(vi) means to use the matching criteria and the cleansed and standardised
records loaded into the said database to produce a list of unique entities and
unique families;
(vii) means to compare an entity against entities existing in the list of unique
entities or unique families if any in the database to determine whether the entity
can be added to the list of unique entities and means to produce an output in
table format in the said database;
(viii) means to verify whether a particular unique entity created should remain
unique or be merged with an existing entity.
2. Apparatus as claimed in claim 1, wherein the consolidation means include merging means to merge two entities to form a single entity.
3. Apparatus as claimed in claim 1, wherein the consolidation means includes splitting means to split an entity formed by combining two or more records, into two separate unique entities.

4. Apparatus as claimed in claim 1, wherein the consolidation means includes
report generating means to obtain reports about the processing of the said raw
data files and consolidation of records.
5. Apparatus for data management as described herein with reference to the
accompanying drawings.


Documents

Application Documents

# Name Date
1 647-mum-2000-form 2(granted)-(13-07-2004).doc 2004-07-13
2 647-mum-2000-claims(granted)-(13-07-2004).doc 2004-07-13
3 647-mum-2000-correspondence(ipo)-(31-10-2006).pdf 2006-10-31
4 647-MUM-2000-CORRESPONDENCE(RENEWAL PAYMENT LETTER)-(09-07-2009).pdf 2009-07-09
5 Form 27 [31-03-2017(online)].pdf 2017-03-31
6 647-MUM-2000-RELEVANT DOCUMENTS [28-03-2018(online)].pdf 2018-03-28
7 abstract1.jpg 2018-08-08
8 647-mum-2000-power of attorney(11-7-2000).pdf 2018-08-08
9 647-mum-2000-form-pct-isa-210(11-7-2000).pdf 2018-08-08
10 647-mum-2000-form 6(1-9-2004).pdf 2018-08-08
11 647-mum-2000-form 5(8-3-2004).pdf 2018-08-08
12 647-mum-2000-form 4(18-9-2001).pdf 2018-08-08
13 647-mum-2000-form 3(18-9-2001).pdf 2018-08-08
14 647-mum-2000-form 26(1-9-2004).pdf 2018-08-08
15 647-mum-2000-form 2(granted)-(13-7-2004).pdf 2018-08-08
16 647-mum-2000-form 19(8-3-2004).pdf 2018-08-08
17 647-mum-2000-form 1(11-7-2000).pdf 2018-08-08
18 647-mum-2000-drawing(18-9-2001).pdf 2018-08-08
19 647-MUM-2000-CORRESPONDENCE(RENEWAL PAYMENT LETTER)-(10-7-2008).pdf 2018-08-08
20 647-mum-2000-correspondence(4-10-2004).pdf 2018-08-08
21 647-mum-2000-claims(granted)-(13-7-2004).pdf 2018-08-08
22 647-mum-2000-cancelled pages(13-7-2004).pdf 2018-08-08
23 647-MUM-2000-RELEVANT DOCUMENTS [23-03-2019(online)].pdf 2019-03-23
24 647-MUM-2000-RELEVANT DOCUMENTS [29-03-2020(online)].pdf 2020-03-29
25 647-MUM-2000-RELEVANT DOCUMENTS [29-09-2021(online)].pdf 2021-09-29
26 647-MUM-2000-RELEVANT DOCUMENTS [26-09-2022(online)].pdf 2022-09-26

ERegister / Renewals

3rd: 12 Jan 2007 (11/07/2002 to 11/07/2003)
4th: 12 Jan 2007 (11/07/2003 to 11/07/2004)
5th: 12 Jan 2007 (11/07/2004 to 11/07/2005)
6th: 12 Jan 2007 (11/07/2005 to 11/07/2006)
7th: 12 Jan 2007 (11/07/2006 to 11/07/2007)
8th: 11 Jul 2007 (11/07/2007 to 11/07/2008)
9th: 10 Jul 2008 (11/07/2008 to 11/07/2009)
10th: 09 Jul 2009 (11/07/2009 to 11/07/2010)
11th: 04 Jun 2010 (11/07/2010 to 11/07/2011)
12th: 04 Jul 2011 (11/07/2011 to 11/07/2012)
13th: 11 Jul 2012 (11/07/2012 to 11/07/2013)
14th: 15 May 2013 (11/07/2013 to 11/07/2014)
15th: 13 Jun 2014 (11/07/2014 to 11/07/2015)
16th: 15 May 2015 (11/07/2015 to 11/07/2016)
17th: 21 Jun 2016 (11/07/2016 to 11/07/2017)
18th: 08 Jun 2017 (11/07/2017 to 11/07/2018)
19th: 21 Jun 2018 (11/07/2018 to 11/07/2019)
20th: 11 Jun 2019 (11/07/2019 to 11/07/2020)