Method For Storing Data Records And A Data Storage System For Storing

< Back

Method For Storing Data Records And A Data Storage System For Storing Data Records

Abstract: A method for storing data records having a plurality of data fields storing one or more data items comprising, reducing the data records to a plurality of records of a single dimension; storing and indexing said plurality of records across a plurality of data stores. The present invention also includes a data storage system for storing data record in a data structure.

Get Free WhatsApp Updates!
Notices, Deadlines & Correspondence

Patent Information

Application #

Filing Date

21 August 2008

Publication Number

9/2010

Publication Type

INA

Invention Field

ELECTRONICS

Status

Parent Application

Applicants

3I INFOTECH LIMITED

TOWER NO 5, 3RD TO 6TH FLOOR, INTERNATIONAL INFOTECH PARK, VASHI, NAVI MUMBAI,

Inventors

1. MISHRA SUKUMAR

D-403PURVA BELMONT, J.P. NAGAR 6TH PHASE ON KANAKAPURA, MAIN ROAD, BANGALORE-560078,

Specification

FORM 2
THE PATENT ACT 1970 (39 OF 1970) & THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See section 10 and rule 13)

TITLE OF THE INVENTION:
METHOD FOR STORING DATA RECORDS AND A DATA STORAGE SYSTEM FOR STORING DATA RECORDS
APPLICANTS
3i INFOTECH LIMITED
TOWER NO. 5, 3RD TO 6TH FLOORS
INTERNATIONAL INFOTECH PARK
VASHI, NAVI MUMBAI 400703
AN INDIAN COMPANY

The following specification particularly describes the invention and the manner in
which it is to be performed

METHOD FOR STORING DATA RECORDS AND A DATA STORAGE SYSTEM FOR STORING DATA RECORDS FIELD OF THE INVENTION
The present invention, in general relates to a method for storing data records and to a data storage system for storing data records and in particular to storing of data records in a data structure. BACKGROUND OF THE INVENTION
Data in computers is stored mainly in databases. Of all the different database models that are used, the relational model (RM) is the most popular and prevalent. The RM is designed to store information as a table, each row storing a record and each column of a row, storing a value for an attribute of that record.
One of the primary drawbacks of the RM is its inherent inability to store attributes of type set. The most basic of the normal forms defines that each and every value for every attribute in a table must at least be atomic, that is they can not be multi-valued. Though a normalization process allows overcoming some of these problems, it is always at the expense of efficiency. Index files, typically B+ Tree or hashed index, are created on certain attribute(s) and are used to improve access performance from this unsorted table. The attribute(s) are chosen based on a pre-determined set of queries that are expected on this table. Arbitrary queries not made on these attributes can not take advantage of these indices and indeed sometimes suffer in performance due to the presence of these indices.
For B+ Tree Index, certain index schemes can not be used in the processing of queries with the NOT predicate. Any queries having a NOT condition would require a TABLE-SCAN with on-the-fly selection operation. Although, multi¬dimensional index structures address some of these problems, however, in the event of certain queries, specifically based on queries with undefined intermediate column values, the result may be scanning of the whole table, which is not desirable.
In brief, none of the existing indexing structures, bitmap indexes or multidimensional indexes can address all of the following requirements on
1A

their own :
1. Provide performance stability under high dimensions.
2. Ability to store non-atomic set data.
3. Ability to index attributes of varying data types.
4. Provide good and stable performance for any arbitrary query.
5. Update operations are proportional with the dimensions being inserted or modified.
It is true, that some of the aforesaid issues are not critical in the traditional domain of Online Transaction Processing (OLTP). However, these issues are of significant importance while dealing with the domain of document management, content management, and digital asset management (DAM), DNA databases, XML databases, On-Line Application Processing (OLAP) application, data ware housing, and object oriented databases. These are a few of the non-limiting areas, where the said issues are required to be dealt with utmost importance/significance.
For example, DNA databases may require 1024 dimensions, or DAM databases, which require about 256 attributes for every image! Query on these domains can not be predetermined and queries on any of these attributes are equally likely.
Accordingly, there was a long felt need to develop a methodology and system for storing data, whereby, the issues as discussed aforesaid, are taken care of in a comprehensive manner. The present invention aims at meeting the aforesaid long felt need.
OBJECTS OF THE INVENTION
It is an important object of the present invention, to provide a system for storing data, adapted to provide performance stability under high dimensions. It is a further object of the present invention to provide a system for storing data, adapted to store non-atomic set data.
It is a further object of the present invention to provide a system for storing data, adapted to index attributes of varying data types.
2

It is a further object of the present invention to provide a system for storing
data adapted to provide good and stable performance, for any arbitrary query.
It is yet another object of the present invention to provide a system for storing
data adapted to ensure, that update operations are proportional with the
dimensions being inserted or modified.
It is a principal object of the present invention to provide a comprehensive
system for storing data, adapted to. provide all the aforesaid facilities.
It is another prime object of the present invention, to provide a methodology
for storing data records by providing a data structure, whereby all of the
following requirements are addressed on their own:
a) Providing performance stability under high dimensions.
b) Ability to store non-atomic set data.
c) Ability to index attributes of varying data types.
d) Providing good and stable performance for any arbitrary query.
e) Update operations are proportional with the dimensions being inserted or modified.
It is another object of the present invention to provide support for arbitrary data sets (at least arbitrary in dimension and type) and also for handling arbitrary (as in not predetermined) query sets. It is yet another object of the present invention to provide operational efficiency within acceptable variations over the complete set of arbitrary queries.
How the aforesaid objects are achieved and the other aspects of the invention will be clear from the following description, which is purely for the sake of understanding.
SUMMARY OF THE INVENTION
Accordingly, the present invention provides a method for storing data records having a plurality of data fields storing one or more data items comprising,: reducing the data records to a plurality of records of a single dimension, storing and indexing said plurality of records across a plurality of data stores.
3

In accordance with preferred embodiments of the method of the present
invention:
-the plurality of data stores comprises 4 B + trees,
-comprising the steps of:
a) Creating a unique record identifier for every data record;
b) Generating a unique attribute identifier for each data field;
c) Assigning said unique attribute identifier to each unique data item of said data field;
d) Assigning a value ID to each data field, the value ID defining a data type of the respective data item(s);
e) Generating a value ID for each unique data item, the value ID comprising
a unique identifier for said data item with respect to said respective data field;
f) Maintaining a frequency count for each data field or data item, the frequency count for a data field comprising the number of data items stored for said field, the frequency count for a data item comprising the number of instances of said data item occurring in said stored data items for the respective data field; and,
g) Storing at least selected parts of the data fields and/or data items and associated unique record identifier, attribute identifier, value ID and frequency count in each of the four B + Trees.
-in the event of a new data record to be inserted, said method comprises:
a) Creating a unique record identifier for the data record;
b) For each data item of each data field of the data record: identifying the unique attribute identifier for said data field;
determining whether the data item is stored with respect to said unique
attribute identifier;
in the event of said data item being stored, incrementing the frequency
count for the data item;
in the event of said data item being not stored, generating a value ID for
the data item, the value ID comprising a unique identifier for said data item with respect to said respective data field, assigning a frequency count to said data item and incrementing the frequency count of the data field; and,
4

storing at least selected parts of the data item and associated unique
identifier, attribute identifier, value ID and frequency count in each of the four B+ Trees.
- in the event of a data record to be deleted, said method comprises:
for each data item of each data field of the data record:
identifying the unique attribute identifier for said data field;
in the event of the frequency count of the data item stored with respect
to said unique attribute identifier is greater than one, decrementing the frequency count;
in the event of the frequency count is one, deleting at least selected
parts of the data item and associated unique identifier, attribute identifier,
value ID and frequency count from the four B+ Trees.
The present invention also provides a data storage system for storing data
record in a data structure, each data record including a plurality of data fields
storing one or more data item, the data storage system comprising:
an interface operatively connected to a database; said database comprising
four data stores encoded in a computer readable memory, the data stores
being arranged, in combination, to store and index said data records in the
event of being reduced to a plurality of records of a single dimension;
said interface being adapted to accept at least selected ones of:
input of a new data record to be stored, input of changes to a stored data
record,
input of a deletion request for a stored data record, input of a query on a data
record.
In accordance with preferred embodiments of the data storage system of the
present invention:
- said data structure comprises a plurality of data stores arranged, in
combination, to store and index said data records in the event of being
reduced to a single dimension.
-each data store comprises a B+ Tree structure.
-the data stores include a first data store having a generic record of a first
5

type for each unique data item of each of said data fields.
-said first data store includes a generic record of first type for each of said
data fields.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The nature and scope of the present invention will be better understood from the accompanying drawings, which are by way of illustration of some preferred embodiments and not by way of any sort of limitation. In the accompanying drawings:
i
Figure 1 illustrates a sample data set for an employee database; Figure 2 illustrates a normalized database structure for an employee database, as is used in the prior art;
Figure 3 illustrates the normalized database structure of Figure 2 populated with data from the sample data set of Figure 1; Figure 5 illustrates the fields and data of an unpopulated UNI file; Figure 6 illustrates the UNI file populated with the data from the first record of the sample data set of Figure 1;
Figure 7 illustrates the INU file, populated with the data from the first record of the sample data set of Figure 1;
Figure 8 illustrates the CRF and FRC files, populated with the data from the first record of the sample data set of Figure 1;
Figure 9 illustrates the UNI file of Figure 6 populated with the data from the second record of the sample data set of Figure 1;
Figure 10 illustrates the INU file of Figure 7 populated with the data from the 15 second record of the sample data set of Figure 1; Figure 11 illustrates the CRF and FRC files of figure 8 populated with the data from the second record of the sample data set of Figure 1; Figure 12 is a flow diagram of a present invention search algorithm searching a database according to an embodiment of the present invention; Figure 13 is a table showing an analysis of an update algorithm for use with a database according to an embodiment of the present invention; Figure 14 is a table showing Fan-out and height of an embodiment of the present invention B+ Trees (N = 1,000,000 records). The height in brackets is the height of a B+ Tree index in the RM;
Figure 15 is a table showing Cache requirements of an embodiment of the
6

present invention in KB (N=1,000,000 records);
Figure 16 is a table showing the Cache requirement comparison of an
embodiment of the present invention Vs a Fully indexed Relational Table
(FIRT) in %( N = 1.000,000 records);
Figure 17 is a table showing a Disk storage requirement comparison of an
embodiment of the present invention Vs FIRT in %(N = 1,000,000 records);
and,
Figure 18 illustrates aspects of the UNI file with 10000 records.
Figure 19 is a table showing an analysis of a search algorithm for use with a
database according to an embodiment of the present invention.
Fig 20 is a schematic diagram of a computer system according to an
embodiment of the present invention.
BRIEF DESCRIPTION OF THE INVENTION
The following describes some preferred embodiments of the present invention, which are purely for the sake of understanding the performance of the invention,*and not by way of any sort of limitation.
The present invention supports data type sets, such as those for use in most non-OLTP environments. Since there is no difference between data (attributes without any index) and index attributes, the physical separation of index and tables are eliminated. There is only one structure, which stores all data. All the attributes stored in the data structure of the present invention are treated with the same importance. However, unlike in the RM the number of files that needs to be created remains 4, irrespective of the number of attributes.
The structure is designed such that while no file mirrors the other, any file of this group can be generated from the other 3 or 2. This leads to a high degree of fault-tolerance build into the database structure.
In the RM, a single table is created to map between a multi-value attribute and a key attribute of the database. Take for example, TABLE 2 of Figure 2, in which "Employee Code" is mapped with "Language". This is an attempt to break the horizontal row straightjacket of a relational table, and introduce "vertical rows". This increases space requirements, as well as the number of expensive joins during access. In contrast, in the present invention, the whole structure is based on "vertical rows", allowing easy support to multi-
7

value fields.
The relational model may use "coding" to reduce the size of the composite
keys. In thepresent invention, a similar technique may be used. However,
unlike the relational model, not only can the values be coded, but also the
attributes themselves. This technique helps in reducing the key size of 3 of
the 4 index tables to simply 8 bytes, irrespective of the size of the individual
attributes. Using 8 bytes, a domain may be addressed having (231-1)
attributes each having (231-1) unique values. In case a lesser number of
attributes are required, the key size may be reduced to 6 (i.e. 32,767
attributes) or even 5(le. 127 attributes). This improves the efficiency of the
index operations as the fan-out increases manifold. And there is no join to
be performed later with the "mapping table". For the sake of understanding
and not by way of any sort of limitation, a key size of 6 (maximum 32,767
attributes) will be assumed for the rest of this document, although other key
sizes are also possible and fall within the scope of the present invention.
The present invention facilitates a high degree of parallelism in the
processing of queries involving more than one attribute, and extremely fast
in- memory Boolean operations ensure negligible performance degradation
when compared with single attribute searches.
An index search involving an inequality (i.e. SELECT * FROM TABLE1 where DEPARTMENT NOT IN -Accounts') can be tackled with ease, unlike in most relational index schemes. In a relational model, this always requires a full table scan. In the present invention, they are tackled like any other query. Some attributes like "EMPLCODE" are considered highly selective attribute, as search on this attribute results in a small result set, while a search on either of the two values of a low selective attribute like "SEX" will potentially generate a result set that is half the number of records in the database. Searches on low selectivity (LS) attributes will be slower than searches on high"selectivity (HS) attributes, very similar to indexing in RM. However, the structure of the index files is such that in case of low selectivity attributes, the key compression in the nodes of the FRC file can be very high. This leads to higher fan-outs on these nodes, balancing out some of the inherent inefficiencies of the low selectivity attributes.
The present invention allows fast retrieval of a unique list of values for any attribute (or a range thereof). In the relational model, that would have required a 5 traditional SELECT statement, with the UNIQUE post
8

processing. If the attribute does not have an index, then a full table-scan is executed. It is important to allow applications to provide end-users with a list of already entered values. This in turn allows end-users to query the system more knowledgeably, resulting in a more focused search. For example, instead of searching for "Tridib*" the user can now make a high-selectivity search on Tridib Roy Chowdhury". The present invention allows end-users to avoid entering different values-for an attribute, for the same entity. For example, entering "SHOP-FLOOR-II" instead of "SHOP FLOOR 2" leads to update anomalies. However, if a read-only list was provided the user can simply choose 'SHOP FLOOR 2".
Performance on access is far superior in the present invention than in the traditional RDBMS model, especially when listing on data-attribute values.
The present invention also performs well in space-time complexities in static domain of data warehousing (DW), when compared with the performance of the various bitmap indexes.
A preferred embodiment of the present invention is based upon B+ Trees, which have been designed to handle the peculiar needs of data-index environment. The information being stored in four B+ Tree files and the uniqueness of information representation provides many of the advantages of the present invention, such advantages not being present in normal B+ Tree 25 structures. However, it will be appreciated that other single or multi-dimensional indexing structures could be used in place of some or all of the B+ trees and such implementations are intended to fall within the scope of the present invention.
Databases according to the present invention include four data structures, each corresponding to a B+ tree. The four B+ Trees are referred to throughout the remaining description as the UNI file, INU file, CRF file and FRC files, although no significance is given by the naming of these files. In addition, although each B+ tree is discussed with reference to a file, the data structures do not need to each be stored within a single file and could be divided across numerous files or stored in a memory. The implementation discussed is merely an example and the skilled addressee will appreciate that many other implementations that fall within the scope of the present
9

invention are possible.
Data to be stored is in the form of records. Each record includes a number of data fields that store one or more data items. An example of a number of records is shown in Figure 1. With the exception of the header row, each row is considered to be a record. The header row defines the data fields (Emp. Code, Name, Sex, D.O.B., Department, Designation, Language, Qualification, Skill-set). Entries in the records directly beneath a header define the data for that data field. For example, the entry for data field Emp. Code for the first record is "1".
The data type of data for data fields may differ. For example, the data type for Emp. Code may be numeric, the data type for Name may be alphanumeric while the data type for D.O.B may be alphanumeric or a predefined composite date data type. In addition, as can be seen from the Language, Qualification and Skill-Set data fields, a record may have more than one entry for a data field resulting in a set data type.
Each record stored is allocated a unique record ID.
UNI FILE
A sample snapshot of the unpopulated UNI file for the employee record database is shown in Figure 5. The entries shown include to the attribute numbers (attribute #), attribute values, value IDs and a frequency count.
The UNI file stores data on attributes, that is, attribute names (data fields) and data itself (data items). In essence, this file stores both the data and metadata identifying the data and its type.
Data in the UNI file is stored in records of a generic format. The same generic record format is used irrespective of whether the data stored is on data, or data fields.
A database to be stored is broken down into the attribute names defining the database and the data. Data is reduced to a single dimension so that data of a set data type is split into a number of separate data items, each of which is stored separately.
Each attribute has a unique attnbute number.
The Attribute number corresponding'to a certain attribute say "Name", is 2.
10

Attribute number 0 is reserved for storing the number of records in the database (a count of the Record Ids stored). The attribute value field stores the data itself (i.e. the data field name or data itself). The Value ID field stores information relating to the data type and the maximum length of the attribute value. In Fig 5, the size of the "D.O.B." field is 10(0xa) and the data type is date(1), while that of "Name" is 30 and data type is alphanumeric(O) and that of "Language" is 20 (0x14) and data type is that of a set(2).
Figure 6 illustrates the file of Figure 5 after the first record from the sample database of Fig 1 is entered. We can see that there are 2 entries each for The entries as illustrated in Figure 5 are sorted according to the composite primary key.of this file (Attribute*, Attribute Value).
The Frequency Count Field for the data values records the number of records of the database where this value occurs for the corresponding attribute. Since only one record has been inserted, all the frequency counts are showing one.
The Frequency Count Field for the attribute values (Attribute# > 0x8000) stores the numbers of unique values that are present for this attribute. Hence, while at this stage only one value is recorded against "Name", there are two unique values recorded against "Language" (English, Hebrew), "Qualifications" (B.Sc, PhD.) and "Skill Sets" (Databases, OS).
The new or modified entries of this file compared to the state in Figure 5 are shown in bold.
11

The snapshot of the file entries in the UNI after the second record as inserted is shown in Figure 9.
New entries have been made for Attribute # 1 (Empl. Code). As the
Attribute Value 2 was not present for attribute # 1, the Frequency Count for
the 5 corresponding attribute entries (0x8001) is incremented from 1 to 2.
This new value of 2 is used as the Value ID for the new attribute value entry
of (1,2, 2, and 1).
Similarly new value entries (2, Gopalaswamy, T.M., 2,1), (4, 2/7/1954, 2, 1) are made with the corresponding increase in the Frequency Count for attributes "Name" (Attribute # 0x8002), and "D.OB" (Attribute # 0x8004) However, as the value for "Sex" (M) is already in the table, only the Frequency Count for that attribute value entry is changed from 1 to 2 (3, M, 1,2).
The Frequency Count of the "Record ID" attribute is also changed to 2 to reflect the number of records entered in the database. The table structure information can be stored in a separate table. In fact, if
the size of attribute names is substantially bigger than that of its value, it is recommended that we store this information separately. However, irrespective of whether the table structure information is within the UNI file or in a separate file, operation is substantially the same.
In most application environments, the programming convenience of storing them in the same file can be availed of.
Intelligent caching ensuring that the leaf storing this information (i.e. the last
leaf in the leaf chain), is always in the cache ensures that we do not use disk accesses to access this information. For a disk block-size of around 8 KB, "and if we limit the attribute name to 10 bytes, each of these records occupy 19 bytes. Therefore, one leaf can store around details of about 400 attributes.
In a preferred embodiment, attribute identifier and Record ID fields have their highest bit set to 1 where they are used for a generic record corresponding to a data field and 0 where they are used for a generic record corresponding to a data item. This allows the data stored in the generic record to be identified more quickly.
12

INU FILE
The.INU file may be considered a cross-referenced image of the UNI file. The snapshots of the INU file after the insertion of the first and second records are shown in Figure 7 and Figure 10 respectively.
There are no entries corresponding to the entries corresponding to the 5 attribute names (i.e. Attribute # 0x8001, 0x8002 etc.) and the "Record ID" (Attribute # 0x8000). The frequency counts of the UN! file are not recorded in the INU file.
The key for this file is a composite key, comprising of {Attribute #; Value ID}. The size of this composite key is 6 byteS- The record size is variable, as the 10 "Attribute Value" size is variable.
The maximum number of attributes in a 'table' is 32,767, while the maximum number of values for each attribute is (231--1) or 2,147,483,647. However, the limit on the number of fields is an implementation detail and can be easily increased to (231-1) or 2,147,483,647! 15
CRF FILE
This file maps the "Record ID" to the "Attribute W and the corresponding 'Value ID". The snapshots of the CRF file after the insertion of the first and second record is shown Figure 8 and Figure 11 respectively. The CRF table essentially represents a table vertically, rather than the horizontal representation of the RM table. This allows the structure to store multiple value attributes like "Language", "Qualification", "Skill Sets". These are represented by the "Attribute #" 7, 8, 9. As can be seen in Figure-11, "Document#"= 1 has couple of entries for "Attribute#"=7, and a couple for "Attribute #"=8.
The structure is similar to the structure of TABLE2, TAPLES, and TABLE4 of Figure 2. However, according to the present invention, all these tables can be reduced to one index file. This eliminates the need to perform an expensive join operation.
The other advantage of the CRF file over the RM tables is that the size of the composite index is always 6 bytes only, irrespective of the exact sizes of the respective attributes. Therefore, there is no limit in the present invention on the size of the attributes, which is contained in the composite key.
13

It should be noted that the CRF file the entries are not unique on their composite keys, as the same document can have multiple values for certain set-type fields, e.g. Languages, Qualifications etc.
FRC FILE
The FRC file maps the "Attribute #" and the "Value ID" to a "Record ID". The snap shots of the FRC file after the insertion of the first and the second record is shown Figure 8 and Figure 11 respectively.
The composite key for this file consists (Attribute#; Value ID), and the size of this key .is always 6 bytes.
There is one entry in this file for every record in the CRF: fj|e. However, the composite key is different. This is the primary table used for "table" searching. The small size of the key, which is 6 and independent of the attribute size, allows very efficient searches across all attributes.
The record size of only 10 bytes, allows for a very efficient TABLE-SCAN (actually a leaf-scan of the FRC B+ Tree), when addressing queries relating to attributes of low selectivity.
Note that the FRC file the file entries are not unique on their composite keys, as the same attribute value can be found in multiple records, e.g. Sex= M.
14

Figure 12 is a flow diagram of a present invention search algorithm searching a database according to an.embodiment of the present invention. The result of search is a Record ID, which uniquely identifies a collection of content object.
As discussed above, attributes can be any different type, like alphanumeric, numeric, decimal, date, time etc, or a set of these atomic types. Types like numeric, decimal, date time can have their values stored as a unique number, which will also maintain their natural sort order. Data type fields can store data internally as "yyyymmdd". These attributes are called Numerically Representable (NR) attributes. In such cases, it is important to note that a corresponding entry is not required in the INU table. The "Value ID" of the various values can be mapped to the actual attribute values. The entry in the UNI table is required to keep the "Frequency Count", and also to provide the "Unique Value" list. There is strong possibility that in certain domains, the entry in the UNI for these attributes is also not warranted.
For entries where the Frequency Count is 1, the unique Record Id can be stored against that entry. This obviates the requirement to search in the FRC 5 file during query processing. The highest bit indicates whether the frequency count or the Record ID is being stored. These entries are known as Totally Selective (TS) entries. Assuming both the unique record ID and frequency count are numeric, there would be a need to find out some way of differentiating between them when stored. The highest bit is used to indicate whether the stored value is a frequency count or a unique record ID. If the highest bit is 1, then we logically AND the value with Ox7FFF to retrieve the unique record ID. If the highest bit is 0, then the value is the frequency count. A skilled reader will appreciate that the interpretation of the highest bit could be reversed without departing from the scope of the present invention.
15

DATABASE OPERATIONS
Figure 12 is a flow chart of the basic database search/duery algorithm. The examples discussed below relate to the structured duery language (SQL), although it will be apparent to the skilled addressee that queries in other formats could easily be handled and fall within the scope of the presently claimed invention.
The search algorithm will now be discussed with reference to an example
query, an Exact Match Query of the form:
SELECT EMP.CODE: NAME
FROM EMPLOYEES
WHERE DESIGNATION ='PROJECT LEADER AND
QUALIFICATION ='BSc.' OR QUALIFICATION ='PhD.'
The search attributes are "DESIGNATION" and "QUALIFICATION".
The select attributes (or the reply attributes) are "EMPLCODE" and "NAME".
There are two search criteria - records must include a designation of
"Project Leader" and a Qualification of "PhD" or "BSc",
Each search criterion is initially processed separately.
16

In step 10, the Value ID for DESIGNATlON"='LEADER' is obtained. This is done by searching the UNI index file with the composite attribute {"Attribute # =6"; "Attribute Value = PROJECT LEADER"} in step 10. This gives the Value ID, which is stored in an a-list in step 20.
In step 30, it is determined if more search terms need to be resolved. In the present case there are more search terms to be resolved and steps 10 to 20 are repeated for the other two search terms. These processes can be run serially, as discussed above, or in parallel.
At this point we have a list of tuples {"Attribute #", "Value ID"}, one for each search term stored in the a-list.
In step 40, the FRC file is searched for each entry in the a-list with the tuple as the composite key. For example, for the designation search term, the tuple {"Attribute # = 6"; "Value ID = 1"} is the composite key. The B+ Tree search will retrieve the first record in the FRC file, which matches this composite key. From the search results, the Record ID for the record is obtained. A scan is then made through the leaf link-list of the FRC file to obtain all other records that match this composite record.
In step 50, for each Record ID obtained, a corresponding bit in a search bit map is set corresponding to this sub-query.
Steps 40 and 50 are repeated for all the entries in the a-list
Again, although steps 40 and 50 are discussed as being applied serially to each entry, they could be applied in parallel.
Note for queries on HS entries, the FRC search is not required as the UNI file will automatically generate the Record ID for the sub-query.
17

There are now 3 bit maps one for each of the sub-queries. These bitmaps are then combined in step 70 using the Boolean operators set in the query (above, in this case bitmapl AND (bitmap2 OR bitmap3)) to generate a results bitmap containing the Record ID(s) that match the query. 5
In step 80, for each of the Record IDs found in step 70, the CRF file is searched with the composite key ("Record ID"; "Attribute #=1"} to obtain the Value ID. Note that the select attribute with the smaller Attribute # is EMP.CODE, whose Attribute # is 1
A leaf scan (almost always the same leaf as before) is used to get the Value ID for the remaining select attributes for this Record ID in step 80.
Now there is a table in memory, where the columns are attributes and the rows are the Record IDs, each cell containing a Value ID.
In step 90, for each column (attribute), the following is done to translate the Value IDs into actual values to be output as the result of the query:
1. Generate a set of unique Value ID for this column, and sort by Value ID.
2. For each Value ID, search INU for {Attribute #; Value ID}.
3. Get the actual value,
4. Replace all entries in the column with the same Value ID with this value.
It should be noted that the search in 2 may not always require a full index search or even a leaf scan. In case the next "Value ID" is in the last accessed leaf, we do not have to do any further operation. This is very relevant in case of low selectivity attributes. For NR attributes, access to the INU is not required.
Range Queries:
In the case of the SQL Query:
18

SELECT EMP CODE,
NAME FROM
EMPLOYEES
WHERE DESIGNATION =PROJECT LEADER'
AND QUALIFICATION ='BSc.' OR D.O.B. >=
'01/03/1970'
Such a query could be performed on a database according to an embodiment of the present invention as follows:
In step 10 the Value ID for "DESIGNATIONS-PROJECT LEADER' is determined. 10
in step 10 the UNI index file is searched with the composite attribute {"Attribute # = 6"; "Attribute Value = PROJECT LEADER"}- This gives a "Value ID" which is stored, in step 20 in an a-list.
The same steps are repeated to determine "Value ID" for "QUALIFICATION"=B.E.' and for "D.O.B" >='01/03/1970
As before, these processes can be run in parallel. Steps 40 to 90 are then the same as above.
Searched with wild cards and NOT condition:
In the case of the SQL Query:
SELECT EMP.CODE, NAME
FROM EMPLOYEES
WHERE SKILL SETS ='Data*' AND
QUALIFICATION ='BSc.' OR D.O.B. >= '01/03/1970' AND
DEPARTMENT <> 'R&D'
Such a query could be performed on a database according to an
embodiment of the present invention as follows:
In step 10, the Value IDs for "Skill Sets"= 'Data*1 is determined.
19

In step 10, the UNI index file is searched with the composite attribute {"Attribute # = 6'; "Attribute Value = Data*'}. This will give a list of "Value ID"s. A leaf scan on UNI would be required to generate all the "Value ID'S matching the criterion these are then stored in the a-list in step 20. 5 The same steps are repeated to determine "Value ID" for "QUALIFICATIONB=*BSc.', for "D.O.B">='01/03/1970' and for "DepartmentA'R&D1.
As before, these processes can be run in parallel. Steps 40 to 60 are as previously described.
There are now 4 bit maps, one for each of the sub-queries. In step 70, these bitmaps are then combined using the Boolean operators according to the query 15 above, viz. {Bitmapl AND (bitmap2 OR bitmap3) AND! bitmap4} to generate the final bitmap containing the Record ID which matches the query.
Steps 80 to 90 are then the same as described above.
Searches with Set Operators:
In the case of the Superset predicate query:
SELECT EMP.CODE,
NAME FROM
EMPLOYEES WHERE QUALIFICATIONS 3 {'MBA'CPA'} AND QUALIFICATION ='CS' OR D.O.B. >='01/03/1960'AND DEPARTMENT <> 'FINANCE'
This can be reduced to the following query:
SELECT EMP.CODE, NAME
FROM EMPLOYEES
WHERE QUALIFICATIONS ='CPA' AND QUALIFICATION ='MBA' AND
QUALIFICATION ='CS' OR D.O.B. >= '01/03/1970' AND DEPARTMENT <>
'FINANCE'
20

Subset Predicate:
SELECT EMP.CODE, NAME FROM EMPLOYEES
WHERE QUALIFICATIONS c {'CPA 'MBA'} AND QUALIFICATION ='CS' OR D.O.B. >= '01/03/1970' AND DEPARTMENT <> 'FINANCE'
This can be reduced to the following query:
SELECT EMP.CODE, NAME
FROM EMPLOYEES *
WHERE QUALIFICATIONS ='CPA' OR QUALIFICATION ='MBA' AND
QUALIFICATION ='CS' OR D.O.B. >= '01/03/1970' AND DEPARTMENT
<>'FINANCE'
Overlap Predicate:
SELECT EMP.CODE, NAME FROM
EMPLOYEES
WHERE QUALIFICATIONS r, {'CPAYMBA', 'BSc', PhD'} > 2 AND
QUALIFICATION ='CS' OR D.O.B. >= '01/03/1970' AND
DEPARTMENT <> 'FINANCE'
This query will need to be processed in two ways. The standard way would be process it as we did for the subset predicate query, except we make a change 25 to the OR operation of the bitmaps generated from the sub-queries QUALIFICATION='CPA', QUALIFICATION='MBA... The OR operation does not do a Boolean .OR. It checks for each bit-position the number of 1s. If the number is > 2, the resultant bitmap bit is set to 1. Else it is set to zero.
The second way would be to ignore the overlap predicate in step 10, but 30 apply on-the-fly selection in step 80, on the overlap attribute. QUALIFICATIONS.
21

QUERY OPTIMIZATION
On studying the above algorithms, it should be apparent that the algorithms for search are essentially the same irrespective of the nature of the query. Therefore, there is no need for query optimization, query path selection.
However, that is not to say that we cannot use query optimization techniques. In step 10, we can get the "Frequency Count" for various "Attribute #; Attribute Value" pair at no extra cost. We could then use this to see whether all the sub-queries of step 40, are required. For example, if the query is as follows:
SELECT NAME FROM EMPLOYEES
WHERE D.O.B = '1/3/1970' AND DEPARTMENT='FINANCE' In this case, if the D.O.B='1/3/1970' has a frequency count of 4, while the other sub-query has 100. Then it might be better to do step 40-50 for D.O.B, and go straight to step 80, and apply 'select* on the fly on the 4 {Record ID, Attribute*, Value ID} tuples found with {Attribute #=5; Value ID=2}.
INSERTING
Input:
A set of tuples in the following format: {"Emp Code", "12345"}.{"NameB, "James Butcher"}. {"Sex", "Male"}, {"Department", "Accounts"}, {"Designation", "Manager"), {"Languages Spoken", 20 "English"}, {"Languages Spoken", "Hindi"}, {"Skill Sets", "As400"}, {"Skill Sets", "WebSphere"}, {"Qualification", "MBA"}, {"Date Of Birth". "23/01/1963"}.
Output:
A Record ID - positive in case of successful entry, -1 for unsuccessful attempt.
Pre-processing:
The' pre-processing stage converts the set of tuples to the following tuple, using the structure stored in the UNI file. This stage checks for data validity (type and size) and does not require any disk
22

accesses as the concerned leaf storing the structure is always in the cache.
The output of the pre-process is:
KVN - the total number of tuples in the input.
KVARRAY- An array of tuples of the type {1, "12345'}, {2, "James Butcher"}, {3, "Male'}, {5, "Accounts"} ,{6, "Manager"}, {9, "English"},{9, "Hindi"}, {8, "As400"}, {8, "Websphere"}. {7, "MBA"}, (4 ."23/01/1963"}.
Sorting on "Attribute #" is not mandatory.
23

The insert algorithm for inserting a record {having a record ID of rid) is explained below in a pseudo-C code. (Special consideration for HS entries has not been discussed), typedef struct
{ unsigned short
attributeid;
char vaiue[];} KVARRAY; long
insert(short kvn, KVARRAY *kva)
{ inti;
long
vid;
long rid;
if (!(rid =
getnextridO)
return-1;
for (i = 0; i < kvn; i++)
{if (! (vid = uniinujnsert(kva[i).attributeid, kva[i]->value, rid)))
goto err; if (!(crffrcjnsert(rid, kva[i].attributeid,
vid)))
goto err;
} return
rid; err:
deletelastrid(rid);
return-1;
- }
longgetnextrid()
24

//search in the UNI file for the entry {0x8000}.
//The "Frequency Count" field will contain the last Record
ID (rid).
// frequenct_cnt++
// return frequency_cnt
}
long uniinu_insert(unsigned short ano, char Value)
{
long frq;
long vid;
// if {ano, value} already exists in the UNI
{
// let frq = "Frequency Count" stored in UNI against {ano,
value).
//frq = frq + 1
// update the "Frequency Count" in UNI against the {ano,
value).
// return vid = "Value ID" stored against {ano.value}.
// else
//frq = "Frequency Count" stored in UNI against {ano | 0x8000}.
// frq = f rq + 1
// update the "Frequency Count" in UNI against {ano [
0x8000}. // insert a new record in UNI {ano, value, frq, 1)-
// insert a new record in INU {ano, frq, value}.
// return frq
25

long crffrcjnsert(long rid, unsigned short ano, long vid)
{
// insert record in CRF {rid, ano, vid} // insert record in FRC {ano, vid, rid)}
26

DELETION
The input to this process would be a Record ID (rid).
The insert algorithm is explained below in a pseudo-C code.
long delete(long rid)
{ char valueQ;
long frq;
// search CRF to generate a list of (Attribute #, Value ID), from the records which match {rid}. Call this list the FRC-list // delete all the records of CRF which matches {rid}. // for each entry in the FRC-list (ano, vid) // delete from FRC the entry (ano, vid, rid) // end for
//for each entry in the FRC-list (ano, vid) // find value = "Attribute Value" from INU for record {ano, vid} //find UNI record entry {ano, value} //frq = "Frequency Count" of UNI record {ano, value}, //frq = frq-1 //if frq = 0 then //delete from UNI {ano.value}. //delete from INU {ano, vid}. //else // update the UNI record {ano, value} with a new value for "Frequency Count" = frq. end for
Figure 13 is a table showing an analysis of an update algorithm for use with a
database according to an embodiment of the present invention.
Figure 20 is a schematic diagram of.a computer system according to an
embodiment of the present invention.
A data storage system 100 includes a storage medium 110 and a
management system 120. The storage medium 110 may be memory, a hard
disk, a number of hard disks or some other computer readable medium. The
B+ tree structures as discussed above are encoded and stored on the storage
27

medium 110 in a predetermined manner.
A user wishing to access data in the data storage system 100 accesses a user terminal 130 having a user interface 140 that is arranged to communicate with the management system 120 via a network 150 such as a LAN, WAN, internet, intranet or a simple computer data bus. Data can be accessed, updated and manipulated by the user interface 140 and management system 120 in accordance with the algorithms and procedures set out above.
It will be apparent that many implementations of the present invention are possible, both in a networked computer environment and where the user interact and database reside on the same computer system. The above described computer system is just one example of a system that is intended to fall within the scope of the claims, although the skilled reader will appreciate that many modifications and variations are possible without departing from the scope of the present invention.
APPLICATION AREAS
Document content management systems (DCMS) require a highly indexed environment, as documents need to be accesses through different access path. A DCMS environment like most OODBMS domains seldom requires a projection over the associated attributes, but needs to access the document content or objects, which are, accessed though the OBJECT-ID stored in the Record ID.
Knowledge management systems, digital asset management systems, medical databases, data warehousing, OLAP databases which requires 10 multidimensional access through arbitrary queries are ideal environments for implementation of the present invention. Object Oriented Databases, XML databases which require storing unnormalized data and setting data types can effectively use the structures and techniques described herein. These applications also deal with arbitrary queries and have indexing needs in high dimensions (> 3).
This does not preclude the usage of the present invention along with relational databases. While relational systems typically use B-Trees, hashed
28

index or bit-mapped index for their index requirement, in a highly indexed environment the present invention can be used. The result of the searches would be a list of ROW-IDs very similar to the traditional index structures. In the three-tier architecture of system design, separation of the business middle-layer from the lower data-storage layer can be finally achieved. This is because all access paths are optimized (as all paths are "indexed"). Hence, no fresh paths need to be added because of any change or addition to the business layer. This reduces deployment time in case of new applications, as well as changes to existing applications.
The present invention has been described with reference to some drawings and preferred embodiments, purely for the sake of understanding and not by way of any limitation and the present invention includes all legitimate developments within the scope of what has been described herein before and what has been claimed hereinafter.
29

WE CLAIM
1. A method for storing data records having a plurality of data fields storing one or more data items comprising, reducing the data records to a plurality of records of a single dimension; storing and indexing said plurality of records across a plurality of data stores.
2. The method as claimed in claim 1 wherein the plurality of data stores comprises 4 B + trees.
*
3. The method as claimed in claim 2, comprising the steps of:
a) Creating a unique record identifier for every data record;
b) Generating a unique attribute identifier for each data field;
c) Assigning said unique attribute identifier to each unique data item of said data field;
d) Assigning a value ID to each data field, the value ID defining a data type of the respective data item(s);
e) Generating a value ID for each unique data item, the value ID comprising a
unique identifier for said data item with respect to said respective data field;
f) Maintaining a frequency count for each data field or data item, the frequency count for a data field comprising the number of data items stored for said field, the frequency count for a data item comprising the number of instances of said data item occurring in said stored data items for the respective data field; and,
g) Storing at least selected parts of the data fields and/or data items and associated unique record identifier, attribute identifier, value ID and frequency count in each of the four B + Trees.
4. The method as claimed in claim 3 wherein, in the event of a new data record
30

to be inserted , said method comprises :
a) Creating a unique record identifier for the data record;
b) For each data item of each data field of the data record: -identifying the unique attribute identifier for said data field;
-determining whether the data item is stored with respect to said unique attribute
identifier;
-in the event of said data item being stored, incrementing the frequency count for
the data item;
-in the event of said data item being not stored, generating a value ID for the
data item, the value ID comprising a unique identifier for said data item with
respect to said respective data field, assigning a frequency count to said data
item and incrementing the frequency count of the data field; and,
-storing at least selected parts of the data item and associated unique identifier,
attribute identifier, value ID and frequency count in each of the four B+ Trees.
5. The method as claimed in claim 3 or 4 wherein in the event of a data record to
be deleted, said method comprises:
- for each data item of each data field of the data record:
- identifying the unique attribute identifier for said data field;
- in the event of the frequency count of the data item stored with respect to said unique attribute identifier is greater than one, decrementing the frequency count;
- in the event of the frequency count is one, deleting at feast selected parts of the data item and associated unique identifier, attribute identifier, value ID and frequency count from the four B+ Trees.
6. A data storage system for storing data record in a data structure, each data
record including.a plurality of data fields storing one or more data item, the data
storage system comprising:
an interface operatively connected to a database; said database comprising four
31

data stores encoded in a computer readable memory, the data stores being arranged, in combination, to store and index said data records in the event of being reduced to a plurality of records of a single dimension; said interface being adapted to accept at least selected ones of: input of a new data record to be stored, input of changes to a stored data record,
input of a deletion request for a stored data record, input of a query on a data record.
*
7. The data storage system for storing data records as claimed in claim 6
wherein said data structure comprises a plurality of data stores arranged, in
combination, to store and index said data records in the event of being reduced
to a single dimension.
8. The data storage system as claimed in ciaim 7 wherein each data store
comprises a B+ Tree structure.
9. The data storage system as claimed in claim 7 or 8 wherein the data stores
include a first data store having a generic record of a first type for each unique
data item of each of said data fields.
10. The data storage system as claimed in claim 9 wherein said first data store
includes a generic record of first type for each of said data fields.
Dated this 31st July 2008
REGNO. 411
FOR KOCHHAR & CO.
APPLICANTS' AGENT
32 A, V. S. RAMASARMA
Advocate
Patent Agent No. IN/PA 411
KOCHHAR & CO.,
Advocates & Legal Consultants
Suite# 301. Sigma Wing, 3rd Fteor,
•Reheja Towers' # 177, Anna Seta*,
Chennal - 600 002. Tamil Nadu, India.

Documents

Orders

Section	Controller	Decision Date

Application Documents

#	Name	Date
1	1765-MUM-2008-RELEVANT DOCUMENTS [16-03-2020(online)].pdf	2020-03-16
1	1765-MUM-2008-REQUEST FOR ADJOURNMENT OF HEARING UNDER RULE 129A [03-07-2018(online)].pdf	2018-07-03
2	abstract1.jpg	2018-08-09
2	1765-MUM-2008-FURTHER HEARING NOTICE-10-02-2020.pdf	2020-02-10
3	1765-MUM-2008_EXAMREPORT.pdf	2018-08-09
3	1765-MUM-2008-FORM-26 [11-03-2019(online)].pdf	2019-03-11
4	1765-mum-2008-we claim.doc	2018-08-09
4	1765-MUM-2008-Annexure (Optional) [06-03-2019(online)]-1.pdf	2019-03-06
5	1765-MUM-2008-SPECIFICATION(AMENDED)-240815.pdf	2018-08-09
5	1765-MUM-2008-Annexure (Optional) [06-03-2019(online)].pdf	2019-03-06
6	1765-MUM-2008-Written submissions and relevant documents (MANDATORY) [06-03-2019(online)]-1.pdf	2019-03-06
6	1765-MUM-2008-Power of Attorney-240815.pdf	2018-08-09
7	1765-MUM-2008-Written submissions and relevant documents (MANDATORY) [06-03-2019(online)].pdf	2019-03-06
7	1765-MUM-2008-PETITION UNDER RULE 137 -240815.pdf	2018-08-09
8	1765-MUM-2008-OTHERS-240815.pdf	2018-08-09
8	1765-MUM-2008-FORM-26 [18-02-2019(online)]-1.pdf	2019-02-18
9	1765-MUM-2008-HearingNoticeLetter.pdf	2018-08-09
9	1765-MUM-2008-FORM-26 [18-02-2019(online)].pdf	2019-02-18
10	1765-mum-2008-form 5.pdf	2018-08-09
10	1765-MUM-2008-Proof of Right (MANDATORY) [18-02-2019(online)].pdf	2019-02-18
11	1765-MUM-2008-Form 5-240815.pdf	2018-08-09
11	1765-mum-2008-ExtendedHearingNoticeLetter_19Feb2019.pdf	2019-01-18
12	1765-MUM-2008-Abstract-240815.pdf	2018-08-09
12	1765-mum-2008-form 3.pdf	2018-08-09
13	1765-MUM-2008-Form 3-240815.pdf	2018-08-09
14	1765-mum-2008-abstract.pdf	2018-08-09
14	1765-MUM-2008-FORM 26(4-1-2010).pdf	2018-08-09
15	1765-MUM-2008-Claims-240815.pdf	2018-08-09
15	1765-mum-2008-form 2.pdf	2018-08-09
16	1765-mum-2008-claims.pdf	2018-08-09
17	1765-MUM-2008-CORRESPONDENCE(4-1-2010).pdf	2018-08-09
17	1765-mum-2008-form 2(title page).pdf	2018-08-09
18	1765-MUM-2008-CORRESPONDENCE(IPO)-(FER)-(25-8-2014).pdf	2018-08-09
18	1765-MUM-2008-FORM 18(4-1-2010).pdf	2018-08-09
19	1765-mum-2008-correspondence.pdf	2018-08-09
19	1765-mum-2008-form 1.pdf	2018-08-09
20	1765-MUM-2008-Examination Report Reply Recieved-240815.pdf	2018-08-09
21	1765-mum-2008-description(complete).pdf	2018-08-09
21	1765-mum-2008-drawing.pdf	2018-08-09
22	1765-MUM-2008-Drawing-240815.pdf	2018-08-09
23	1765-mum-2008-description(complete).pdf	2018-08-09
23	1765-mum-2008-drawing.pdf	2018-08-09
24	1765-MUM-2008-Examination Report Reply Recieved-240815.pdf	2018-08-09
25	1765-mum-2008-correspondence.pdf	2018-08-09
25	1765-mum-2008-form 1.pdf	2018-08-09
26	1765-MUM-2008-CORRESPONDENCE(IPO)-(FER)-(25-8-2014).pdf	2018-08-09
26	1765-MUM-2008-FORM 18(4-1-2010).pdf	2018-08-09
27	1765-mum-2008-form 2(title page).pdf	2018-08-09
27	1765-MUM-2008-CORRESPONDENCE(4-1-2010).pdf	2018-08-09
28	1765-mum-2008-claims.pdf	2018-08-09
29	1765-MUM-2008-Claims-240815.pdf	2018-08-09
29	1765-mum-2008-form 2.pdf	2018-08-09
30	1765-mum-2008-abstract.pdf	2018-08-09
30	1765-MUM-2008-FORM 26(4-1-2010).pdf	2018-08-09
31	1765-MUM-2008-Form 3-240815.pdf	2018-08-09
32	1765-MUM-2008-Abstract-240815.pdf	2018-08-09
32	1765-mum-2008-form 3.pdf	2018-08-09
33	1765-mum-2008-ExtendedHearingNoticeLetter_19Feb2019.pdf	2019-01-18
33	1765-MUM-2008-Form 5-240815.pdf	2018-08-09
34	1765-mum-2008-form 5.pdf	2018-08-09
34	1765-MUM-2008-Proof of Right (MANDATORY) [18-02-2019(online)].pdf	2019-02-18
35	1765-MUM-2008-HearingNoticeLetter.pdf	2018-08-09
35	1765-MUM-2008-FORM-26 [18-02-2019(online)].pdf	2019-02-18
36	1765-MUM-2008-FORM-26 [18-02-2019(online)]-1.pdf	2019-02-18
36	1765-MUM-2008-OTHERS-240815.pdf	2018-08-09
37	1765-MUM-2008-PETITION UNDER RULE 137 -240815.pdf	2018-08-09
37	1765-MUM-2008-Written submissions and relevant documents (MANDATORY) [06-03-2019(online)].pdf	2019-03-06
38	1765-MUM-2008-Power of Attorney-240815.pdf	2018-08-09
38	1765-MUM-2008-Written submissions and relevant documents (MANDATORY) [06-03-2019(online)]-1.pdf	2019-03-06
39	1765-MUM-2008-Annexure (Optional) [06-03-2019(online)].pdf	2019-03-06
39	1765-MUM-2008-SPECIFICATION(AMENDED)-240815.pdf	2018-08-09
40	1765-MUM-2008-Annexure (Optional) [06-03-2019(online)]-1.pdf	2019-03-06
41	1765-MUM-2008_EXAMREPORT.pdf	2018-08-09
41	1765-MUM-2008-FORM-26 [11-03-2019(online)].pdf	2019-03-11
42	abstract1.jpg	2018-08-09
42	1765-MUM-2008-FURTHER HEARING NOTICE-10-02-2020.pdf	2020-02-10
43	1765-MUM-2008-REQUEST FOR ADJOURNMENT OF HEARING UNDER RULE 129A [03-07-2018(online)].pdf	2018-07-03
43	1765-MUM-2008-RELEVANT DOCUMENTS [16-03-2020(online)].pdf	2020-03-16