Abstract: A system and method for Enterprise Data Warehousing (EDW) has been disclosed. The Enterprise Data Warehousing system envisaged by the present invention is designed, developed, monitored and controlled by metadata. The metadata includes self validation means which validates, cleanses and tallies the data up to the grain level to attain 100% data tallying, thereby increasing user confidence in reported figures. The Enterprise Data Warehouse has the ability to store the data of a multinational organization which operates in different time zones, countries, languages and currencies. Additionally, the system includes a Corporate Data Warehouse (CDW) to store summary data for headquarter reporting and a History Data Warehouse (HDW) to store full data from multiple countries without any cutoff dates. The system provides the top, middle and executive level management with a 360-degree view of the data / performance parameters stored in the warehouse with grain level drill down capability based on their individual privilege levels.
FORM-2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
SELF VALIDATING AND METADATA DRIVEN ENTERPRISE DATA WAREHOUSING SYSTEM
TATA CONSULTANCY SERVICES LIMITED,
an Indian Company of Nirmal Building, 9th Floor,
Nariman Point, Mumbai - 400 021,
Maharashtra, India.
The following specification particularly describes the invention and the manner in which
it is to be performed
FIELD OF THE INVENTION
The present invention relates to the field of data processing.
Particularly, the present invention relates to the field of enterprise data warehousing.
DEFINITIONS OF TERMS USED IN THE SPECIFICATION
The term 'data warehouse' in this specification relates to a central repository for storing an enterprise's global data in a uniform format for further processing, statistical analysis and reporting.
The term 'metadata' in this specification relates to a repository which stores data including information, rules, formats, mappings, structures and the like to enable automatic control and management of data in the data warehouse.
The term 'grain level' in this specification relates to the lowest level of detail available for a particular data item.
The term 'star schema' in this specification relates to a dimensional approach for designing a data warehouse. In this approach, transaction data is partitioned into fact tables and dimension tables for storage, and the records in the fact tables are linked to corresponding records in the dimension tables using surrogate keys.
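By way of illustration only, a minimal star schema of the kind referred to above may be sketched in SQL as follows; the table and column names (DIM_CUSTOMER, FACT_SALES and the like) are hypothetical and do not form part of the specification:

    -- Illustrative star schema sketch: all names are hypothetical.
    -- DIM_CUSTOMER holds master (dimension) data keyed by a numeric surrogate key.
    CREATE TABLE DIM_CUSTOMER (
        customer_sk     INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key
        customer_code   VARCHAR(20)  NOT NULL,              -- original source key
        customer_name   VARCHAR(100),
        region          VARCHAR(50)
    );

    -- FACT_SALES holds grain-level transaction data and references the dimension
    -- through the surrogate key rather than the source system's natural key.
    CREATE TABLE FACT_SALES (
        sale_date       DATE         NOT NULL,
        customer_sk     INTEGER      NOT NULL,
        quantity        INTEGER,
        amount          DECIMAL(18,2),
        FOREIGN KEY (customer_sk) REFERENCES DIM_CUSTOMER (customer_sk)
    );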
The term 'transformed data' in this specification relates to data which is extracted from source systems, cleansed, validated and mapped according to the design of the data warehouse.
BACKGROUND OF THE INVENTION AND PRIOR ART
A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. A data warehouse focuses on data storage. However, the means to retrieve and analyze data, to Extract, Transform and Load (ETL) the data and to manage the data dictionary are also considered essential components of a data warehousing system. Thus, a data warehouse includes business
intelligence tools, tools to ETL data into the repository, and tools to manage and retrieve metadata (data on data tables).
A data warehouse is subject oriented, that is, data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together. A data warehouse is time variant as the changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time. The data in the data warehouse is non-volatile. That is, the data in the data warehouse is never over-written or deleted once committed. The data is static, read-only, and retained for future reporting. Also, the data warehouse is integrated. The data warehouse contains data from most or all of an organization's operational systems. This varied data is made consistent using conformed (common) dimensions.
There are two leading approaches to storing data in a data warehouse, namely, the dimensional approach and the normalized approach. In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are given below:
• In order to maintain the integrity of facts and dimensions, loading the data warehouse with cleansed and validated data from different operational systems is a very complicated process.
• It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does its business.
In the normalized approach, the data in the data warehouse are stored following the database normalization rules. Tables are grouped together by subject areas that reflect
general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the large number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and to access the information without a precise understanding of the source of the data and of the data structure of the data warehouse. Therefore, it is desirable to keep information about the data warehouse in metadata (data about data).
There have been attempts in the prior art to provide enterprise data warehousing systems which enable efficient storage, retrieval and reporting of data. Particularly, United States Patent 6668253 discloses an enterprise information management system to enable an enterprise to identify the key drivers of business success and leverage the knowledge throughout the enterprise. The disclosed information management system performs validation of data each time data is transformed and loaded into the data warehouse. However, this validation is limited to checking referential integrity, checking free form text fields, multi-purpose fields, dummy values, contradicting data, violation of business rules and data anomalies. The data is not 100% tallied to ensure its accuracy.
Also, United States Patent 6718335 discloses a system for providing an improved metadata catalogue. There is provided a single metadata catalogue which is coupled to the means for modeling, the means for extracting and storing, and the reporting tool. Thus, if any changes are required, only the metadata catalogue is updated and the changes are reflected across the extraction, storage and reporting modules. Even though the ETL process is metadata driven, the precision of the data is not verified (validated and tallied); hence, the data displayed in the reports may not be 100% tallied. Also, the disclosure does not provide a comprehensive and consolidated view of the information in the warehouse to the users.
Further, United States Patent Application 20090299987 discloses a system for metadata driven temporal data warehouse load design for change data capture. The system includes a run-time structured query language (SQL) code generator that can automatically process and load data into the warehouse from any type of source system. However, the patent
application does not include self validation processes in the metadata to ensure 100% tallying of data.
Thus, in order to enable the top and middle level management team and the executives to review the performance parameters of all the individual business units of the organization, there is felt a need for a self validating and metadata driven enterprise data warehousing system by the use of which:
• the parameters of all the individual units of organizations can be efficiently stored in a single medium;
• the performance parameters of all the individual units of organizations can be retrieved and reviewed effectively;
• the time and effort for performance report collation can be reduced considerably; and
• a high level of user confidence in reported figures can be obtained.
OBJECT OF THE INVENTION
It is an object of the present invention to provide an Enterprise Data Warehousing (EDW) system which enables capturing of the performance parameters of all the individual units of organizations in a single medium.
It is another object of the present invention to provide an Enterprise Data Warehousing (EDW) system by the use of which the performance parameters of all the individual units of organizations can be retrieved and reviewed effectively.
It is yet another object of the present invention to provide an Enterprise Data Warehousing (EDW) system by the use of which the time and effort for performance report collation can be reduced considerably.
One more object of the present invention is to provide a self validating Enterprise Data Warehousing (EDW) system which increases the accuracy of the data in the data warehouse whereby a high level of user confidence in reported figures can be obtained.
Further, another object of the present invention is to provide a metadata driven Enterprise Data Warehousing (EDW) system by the use of which all steps and best practices of developing and maintaining an enterprise data warehousing system can be enforced.
Yet another object of the present invention is to provide an Enterprise Data Warehousing (EDW) system by the use of which a data warehouse can be designed and developed to handle data for a multi-national organization with operations in many countries, different time zones, languages and currencies.
Still another object of the present invention is to provide an Enterprise Data Warehousing system by the use of which instant reports giving a 360° view of each Key Performance Indicator (KPI) for all the individual departments of the organization can be generated.
SUMMARY OF THE INVENTION
In accordance with the present invention there is provided a scalable Enterprise Data Warehousing system comprising:
• a database server for storing an enterprise's transformed data, the database server comprising at least one logical data mart for storing summarized and aggregated data views to facilitate generation of various reports;
• a metadata unit having:
o an object dictionary comprising:
■ conceptual requirement capturing means adapted to provide a graphical user interface to capture the vision, mission, goal, strategy and targets of an enterprise and further adapted to capture the functions of every unit of an enterprise in the form of modules along with business rules, inclusions, exclusions and exceptions to the rules for each module and still further adapted to capture the KPIs (Key Performance Indicators) and report generation requirements based on the KPI;
■ business requirement capturing means adapted to provide a tabular form to capture entity definitions and rules for each of the modules along with corresponding attributes and attribute
properties for the entities and the relationships and dependency between entities and further adapted to capture KPI templates for each of the entities including information selected from the group of information consisting of KPI associated with the module, the KPI formula, calculations, measures, rules, inclusions, exclusions, exceptions and its related KRI (Key Risk Indicator), KQI (Key Quality Indicator), KOI (Key Opportunity Indicator) and artistic impressions of the dashboard, summary, detailed and proof reports and still further adapted to capture source units connection information;
■ logical schema creation means adapted to create a customized database design having database table column definitions, normalization scheme for the database tables, column creations, summaries and aggregation table creation, column attribute definition creation, identification of primary, unique and foreign keys, creating constraints and rules for performing validation of data;
■ physical schema creation means adapted to create database tables for the data warehouse based on a predetermined modeling technique and further adapted to create a self validation scheme to validate data in the data warehouse using a triangulation technique and still further adapted to create views for generation of reports;
■ security metadata creation means adapted to receive a list of users belonging to each of the source units and their access level details;
o a first tabular database means co-operating with the conceptual requirement capturing means, business requirement capturing means, logical schema creation means, physical schema creation means and security metadata creation means adapted to store the captured and created information into the object dictionary's conceptual model
database table, business model database table, logical model database table, physical model database table and security model database table respectively, wherein the database tables are inter-linked using foreign keys, thereby facilitating updates to the entire system if any of the database tables are updated;
• an Extract Transform and Load (ETL) unit co-operating with the metadata unit including:
o an ETL engine having:
o fetching means adapted to fetch source unit connection information from the business model database table and further adapted to communicate with the source units;
o schema generation means adapted to read the objects dictionary and further adapted to generate a set of target database tables for each of the source units including a staging database table (STG), a cleansing database table (CLN), a dimensional database table (DIM), a Facts database table (FACT), a Summary database table (Fact2) and a Balanced scorecard database table (BSC) in the database server;
o ETL SQL code generation means with the following capabilities:
■ First copy generation means co-operating with the fetching means adapted to copy source data to the STG database table;
■ First validation means adapted to read and validate the data in the STG database tables using the validation information in the physical model database table and further adapted to segregate erroneous data;
■ Second copy generation means co-operating with the STG database table adapted to copy 'incremental data' present in the STG database table to the cleansing database table (CLN);
■ second validation means adapted to read and validate the data in the CLN database table to check for referential integrity as per the foreign key design stored in the physical model database table and further adapted to segregate erroneous data;
■ segregation means adapted to segregate the data present in the CLN database tables into master data or transaction data;
■ Third copy generation means co-operating with the CLN database tables adapted to copy master data present in the CLN database tables to the DIM database tables and transaction data present in the CLN database tables to the Fact Database tables respectively in the data warehouse;
■ Fourth copy generation means adapted to copy incremental data in DIM database table and Facts Database table to the Fact2 Summary database table and further adapted to copy the KPI, KRI, KQI and KOI data from the Fact2 Summary database table to the BSC Balanced scorecard database table and still further adapted to copy the incremental data from the FACT, DIM, FACT2 and BSC database tables to the logical data mart based on the views defined on these tables in the physical model database table;
■ Error handling means adapted to receive the erroneous data and further adapted to notify a system administrator to correct the records and further adapted to receive and use the corrected records in the next copy generation means; and
■ updation means adapted to update the STG database table with newly received source data using change data and trickle feed techniques;
• tracking means adapted to keep track of data in each of the database tables using a triangulating tallying technique at the end of each ETL cycle; and
• a reporting unit cooperating with the logical data mart and the physical model database table to generate tracked reports from views stored in the physical model database table in the form of dashboards and balanced scorecards based on the user access level details specified in the security model database table.
Typically, the database server includes a second tabular database means adapted to store the transformed information in the form of database tables, wherein the database is based on a star schema and the database tables are divided into fact tables and dimension tables. Alternatively, the second tabular database means is adapted to store the transformed information in the form of database tables, wherein the database holds normalized information.
Preferably, the system includes a communication interface adapted to send the erroneous data to the system administrator / supervisor for correction in the XML format.
Further, the third copy generation means includes a surrogate key generation means adapted to assign a surrogate key to a record in the dimension database table if a fact record is updated to the FACT database table without a corresponding dimension record.
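A minimal sketch of such surrogate key assignment for a late arriving dimension record is given below; the table and column names, and the example code 'C-1017', are assumed purely for illustration and, in practice, the key would typically be generated from a sequence or identity column:

    -- Hypothetical sketch: a fact row references customer code 'C-1017' which is
    -- not yet present in DIM_CUSTOMER, so a placeholder dimension record is
    -- inserted and the fact row can still be loaded with a valid surrogate key.
    INSERT INTO DIM_CUSTOMER (customer_sk, customer_code, customer_name)
    SELECT COALESCE(MAX(customer_sk), 0) + 1,   -- next surrogate key value
           'C-1017',
           'UNKNOWN - PENDING MASTER DATA'      -- flagged for later correction
    FROM   DIM_CUSTOMER;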
Furthermore, the system includes a data modeling engine adapted to automate the extract, map, transform and load operations including ETL SQL code generation means for automatic generation of data warehouse tables, extraction, cleansing, validation, mapping and loading of data into the tables based on the information in the object dictionary's database tables.
Still further, the data modeling engine includes record insertion means adapted to insert extra columns into the data warehouse tables to handle data cleansing, simulation, audit trail and data mining.
Additionally, the data modeling engine includes shadow column insertion means adapted to insert additional shadow column to the data warehouse tables to handle simulations, flags, status and audit trails.
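By way of a hedged illustration only, such shadow columns may resemble the following; the table and column names are hypothetical and the actual column set is driven by the objects dictionary:

    -- Hypothetical shadow / audit columns added to a warehouse fact table.
    ALTER TABLE FACT_SALES ADD COLUMN sim_flag       CHAR(1)     DEFAULT 'N';      -- marks simulated (what-if) rows
    ALTER TABLE FACT_SALES ADD COLUMN record_status  VARCHAR(10) DEFAULT 'ACTIVE'; -- status flag
    ALTER TABLE FACT_SALES ADD COLUMN load_batch_id  INTEGER;                      -- audit trail: which ETL run loaded the row
    ALTER TABLE FACT_SALES ADD COLUMN load_timestamp TIMESTAMP;                    -- audit trail: when the row was loaded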
The system includes a forward engineering engine having code creation means adapted to automatically create objects from the object dictionary and code updation means adapted to automatically update objects from the object dictionary.
Typically, the system includes a data governance unit adapted to track changes made to the metadata and perform audits and data certifications.
In accordance with the present invention, the system is hosted at a data center and includes a disaster recovery site. Also, the system includes columnar compression means adapted to compress the data stored in the database server and load balancers at the data center and the disaster recovery site.
Further, the system includes a Corporate Data Warehouse to store data available in tables including DIM database tables, FACT2 summary database table and BSC database table and a History Data Warehouse to store full data present in tables including DIM database table and FACT database table.
Furthermore, the database server includes means for purging of data present in the tables including DIM database table, FACT database table, FACT2 summary database tables, BSC database table and Corporate Data warehouse tables as per the data retention policy and means for indexing of tables including STG database tables, CLN database tables, DIM database table, FACT database table, FACT2 summary database tables, BSC database table.
Further, the objects dictionary includes an object dictionary updation means adapted to
update objects in the object dictionary and further adapted to log the time and date of the
update.
The database server also includes a Master Data Management unit adapted to write back
cleansed data in the DIM database tables to the source units.
In accordance with the present invention there is provided a method for providing a scalable enterprise data warehousing system, the method comprising the following steps:
• capturing the conceptual requirements of an enterprise including the vision, mission, goal, strategy and targets of an enterprise, the functions of every unit of an enterprise in the form of modules along with business rules, inclusions, exclusions and exceptions to the rules for each module, the KPIs and report generation requirements based on the KPI along with a list of users belonging to the unit and their user access level details in a conceptual data model;
• capturing the business requirements of an enterprise including entity definitions, rules along with corresponding attributes and attribute properties for the entities and the relationships and dependency between entities, KPI templates for each of the entities including information consisting of KPI associated with the module, the KPI formula, calculations, measures, rules, inclusions, exclusions, exceptions and its related KRI, KQI, KOI and artistic impressions of the dashboard, summary, detailed and proof reports and source units connection information in the form of a business data model;
• creating the logical schema based on the conceptual and business requirements and any industry specific generic logical data model (LDM), generating column definitions, identifying primary and unique keys and capturing them in a logical data model;
• creating the physical schema based on the business requirements and the logical schema, a self validation scheme to validate data in a data
warehouse using a triangulation technique and views for generation of reports and capturing them in a physical data model;
• converting the conceptual data model, business data model, logical data model and physical data model into database tables to form an objects dictionary;
• fetching source unit connection information from the business data model into the physical data model and communicating with the source units;
• generating a set of target database tables including staging database table (STG), cleansing database table (CLN), dimensional database table (DIM), Facts database table (FACT), Summary database tables (Fact2) and Balanced scorecard database table (BSC) for each of the source units based on the physical data model;
• copying fetched source data to the STG database table;
• validating the data in the STG database tables using the validation information in the physical data model and segregating erroneous data;
• copying 'incremental data' present in the STG database table to the cleansing database table (CLN);
• validating the data in the CLN database table to check for referential integrity as per the physical data model stored in the objects dictionary and further segregating the erroneous data;
• partitioning the data present in the CLN database table into master data or transaction data;
• copying master data present in the CLN database table to the DIM database table and transaction data present in the CLN database table to the Facts Database table respectively and further copying FACT and DIM data to the FACT2 summary database table and the KPI, KRI, KQI and KOI data from the Fact2 Summary database table to the BSC Balanced scorecard database table and still further copying the incremental data from the FACT, DIM, FACT2 and BSC database
tables to a logical data mart based on the views defined on these tables in the physical data model;
• notifying a system administrator to correct the erroneous records;
• receiving and using the corrected records in the next copy cycle;
• incrementally updating the STG database table with newly received source data using change data and trickle feed techniques;
• tracking the data in each of the databases using a triangulating tallying technique at the end of each extract transform load cycle; and
• generating reports from views stored in the physical data model in the form of dashboards and balanced scorecards based on the user access level details specified in the business data model.
Typically, the step of incrementally updating staging database table includes the steps of checking the timestamp of the source table and applying filters to ensure that very old data is not copied into the staging database table.
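A minimal sketch of such an incremental staging load is given below, assuming hypothetical SRC_SALES, STG_SALES and ETL_CONTROL tables; the cut-off date shown is purely illustrative:

    -- Hypothetical incremental (change data capture style) load into staging:
    -- only source rows changed since the last successful extract are copied,
    -- and a cut-off filter keeps very old records out of the staging table.
    INSERT INTO STG_SALES (sale_id, sale_date, customer_code, quantity, amount, last_modified)
    SELECT s.sale_id, s.sale_date, s.customer_code, s.quantity, s.amount, s.last_modified
    FROM   SRC_SALES s
    WHERE  s.last_modified > (SELECT MAX(last_extract_ts)
                              FROM   ETL_CONTROL
                              WHERE  table_name = 'SRC_SALES')
    AND    s.sale_date >= DATE '2010-01-01';   -- illustrative "not too old" filter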
Preferably, the step of validating the data in the STG database tables using the validation information in the physical data model includes the steps of checking whether all records were copied from the source database tables, tallying figures including amount, quantity and KPI numbers for each batch of the records from the source and target tables, and checking whether the tallying results in a zero difference, raising an alert to notify a discrepancy in the records otherwise.
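The batch-wise tallying described above may be sketched as follows; the tables and the batch_id column are assumed for illustration, and any row returned represents a discrepancy for which an alert would be raised:

    -- Hypothetical per-batch tally between source and staging: a non-zero
    -- difference in row count or amount indicates a discrepancy.
    SELECT src.batch_id,
           src.row_count  - stg.row_count  AS count_difference,
           src.amount_sum - stg.amount_sum AS amount_difference
    FROM  (SELECT batch_id, COUNT(*) AS row_count, SUM(amount) AS amount_sum
           FROM   SRC_SALES GROUP BY batch_id) src
    JOIN  (SELECT batch_id, COUNT(*) AS row_count, SUM(amount) AS amount_sum
           FROM   STG_SALES GROUP BY batch_id) stg
      ON  src.batch_id = stg.batch_id
    WHERE src.row_count  <> stg.row_count
       OR src.amount_sum <> stg.amount_sum;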
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The self validating and metadata driven Enterprise Data Warehousing (EDW) system will now be explained with reference to the accompanying drawings in which:
FIGURE 1 illustrates the architecture for the Enterprise Data Warehousing System in accordance with the present invention;
FIGURE 2 is a schematic of the self validating and metadata driven Enterprise Data Warehousing (EDW) system in accordance with the present invention; and
FIGURE 3 is a flowchart showing the steps to be followed for development of a self validating and metadata driven Enterprise Data Warehousing (EDW) system in accordance with the present invention.
DETAILED DESCRIPTION
The accompanying drawings and the description thereto are merely illustrative of a self validating and metadata driven enterprise data warehousing system in accordance with this invention and only exemplify the system of the invention and in no way limit the scope thereof.
The prior art data warehousing systems provide enterprise information management tools for extraction, transformation, loading and reporting of data present in the warehouse. However, the prior art systems require manual intervention for the design, development and operation of the warehouse, the validation performed is not sufficient to guarantee accuracy of the data in the warehouse and the organization's data is not presented in a manner effective for the top management to get a 360 degree view of each of the data sets stored in the warehouse. Hence, to overcome these drawbacks, the present invention envisages a self validating and metadata driven Enterprise Data Warehousing (EDW) system.
In accordance with the present invention, the proposed data warehouse uses classical subject wise star schema architecture and can store enterprise wide operational and performance parameter based data received from various sources in a single medium. Alternatively, the proposed data warehouse can be based on a normalized schema to be compatible with proprietary database appliances like Teradata and Netezza.
The data in the warehouse can be effectively retrieved by the top management to get a 360 degree view of each performance parameter / data. In addition, the proposed EDW system can hold an enterprise's global data which may be in different currencies, time zones and languages and store it in a uniform format in the data warehouse. Additionally,
the proposed system is designed and developed with data level security built into the EDW for more strict control and faster reporting.
In accordance with one aspect of the present invention, the system includes a self validation means which ensures that data loaded in the data warehouse is validated and tallied up to the grain level to attain 100% data accuracy. The importance of grain level tallying of loaded data can be seen typically in the telecom domain, where Call Detail Records (CDR) amount to millions of records per day. Two to three percent of these records fail to be billed because they arrive incomplete from the telephony system, which in turn leads to considerable revenue being given away. These incomplete records are ignored by the prior art billing and finance/costing systems. With the self validation and tallying proposed in this invention, however, these records are considered important and are utilized for costing and network usage calculations, thereby providing the enterprise with 100% accuracy of data and minimizing operational losses.
Also, the proposed EDW system is designed, developed, monitored and controlled by end-to-end metadata. The metadata is used to design, develop, build, test, migrate old data, support and maintain the EDW system. Thus, the metadata is end-to-end and forms the main pillar of the proposed EDW system. The system also includes a data modeler engine to enable architects, business analysts and system analysts to have control over the business entities, their attributes, EDW objects and their attributes.
The design of the EDW system in accordance with the present invention includes a Corporate Data Warehouse (CDW) to store summary data from multiple countries for headquarter reporting, a History Data Warehouse (HDW) to store full data of the enterprise from multiple countries and time zones without any cutoff dates, logical data marts and logical Operational Data Stores (ODS) to save expensive disk space on the EDW server.
In accordance with another aspect of the present invention, the EDW system is easily scalable and more modules and systems can be added seamlessly in future. In addition, the data warehouse is designed and developed with extra columns in Fact and summary
tables to enable affinity, patterns, seasonality and trend analysis, simulation, forecasting and data mining in the data warehouse with simple commercial reporting tools.
Further, the EDW system provides web services on top of cleansed data by the use of which future systems and software enhancements can get details on querying this web service.
The present invention also includes a Master data management (MDM) unit as its essential component for merging of master data from various source systems flowing into the EDW. For instance, merging of customer and product master tables from different source systems. This merging is necessary for generation of joint reports of data from different modules.
Also, data governance is introduced at all levels of the organization. Balanced Score Card Reports are available to all users with drill down to dashboards, summary and detailed reports.
This invention envisages a system that is technology agnostic, particularly in the nature of a non-rigid framework. This framework can be easily adapted or modified to suit different environments and project requirements. For instance, if this invention is required to be used for a particular technology like Teradata, modifications are required to be done in the engines in accordance with this invention. Since the engines in this invention are also available in ANSI 1992 SQL for editing, the changes for each database vendor or operating system or specific project can be easily incorporated by the user.
The data modeling engine in accordance with the presently disclosed invention is not 100% automatic; it automates the routine work of the System / Data architect, so that many iterations can be done in a single day and a large amount of work can be completed in a short time. It is further envisaged that this invention will eventually, automatically distribute storage of database FACT and summary tables and their index files in multiple physical disks to give a ten times higher speed. The new distribution engine would automatically calibrate the usage of tables during ETL and reporting and
automatically re-distribute the tables in multiple disks thus increasing the speed continuously. If a database appliance like Oracle Exadata Sun, Teradata, Netezza or other equipment is used, still higher speeds and scalability can be reached by this engine as these appliances have about a hundred disks to distribute data tables.
The Enterprise Data Warehousing system (EDW) will now be explained with reference to the accompanying drawings, wherein FIGURE 1 illustrates a high level overview of the self validating and metadata driven enterprise data warehousing system. The scalable architecture includes two sets of servers for the EDW. The main servers are at the Data Center 100 and the alternate servers are at the Disaster Recovery Site (DRS) 100a which may be 100km away in another earthquake zone. In accordance with the present invention, the data in the warehouse at the Data Center 100 is compressed to save expensive disk space on the EDW server and save transmission cost from the Data Center 100 to the DRS 100a using a columnar compression means (not shown in the figures).
Both, the Data Center 100 and the DRS 100a include a web server 102, a load balancer 104, a Report & ETL server 106, a data warehouse 108 and a Storage Area Network (SAN) 110.
Typically, for every EDW the most important aspect is the instant generation of reports for business users, 24 x 7. To enable this, both sets of servers 100 and 100a are always in sync and in active-active mode. The load balancers 104 in both the centers divert reports to the less busy report server 106 for instant report generation. Even when the data center is down for maintenance, users can get reports from the DRS 100a.
The ETL server represented generally by reference numeral 106 is 100% incremental. The ETL server 106 utilizes Change Data Capture techniques to update only fresh data in the EDW every morning. And throughout the day a 24 x 7 trickle feed is used to handle large volumes of other available transactional data. In accordance with this invention, logical data marts and logical operational data stores are used by the EDW to avoid two extra ETL cycles. This also saves expensive disk space in the SAN 110.
For large EDW projects, the present invention uses the classical Bill Inmon model which is adapted to be an extensible model, where new modules can be added one by one. As organizations split and merge, complete new systems can be seamlessly removed or added to the EDW in accordance with this invention.
The EDW architecture under this invention can easily absorb operational data from multiple countries with complex currency, time zone, language and other regional differences.
FIGURE 2 shows the block diagram for the Enterprise Data Warehousing system, represented generally by reference numeral 200 in accordance with the present invention. The system 200 includes a metadata unit 202, a Data Governance unit 218, an Extract Transform Load (ETL) unit 222, a database server 236, a tracking unit 245 and a reporting unit 250.
The metadata unit 202 incorporates an objects dictionary which holds information including the business requirements, the logical and physical schema details and the security details for fetching of data from source units 220 and creation of pre-determined database tables in the database server 236. The fetching of data from the source units, transformation, cleansing and validation of data and further movement of data into the respective database tables is carried out by the ETL unit 222. The database server 236 holds the various database tables whose definitions are stored in metadata unit 202 and are loaded with data by the ETL unit 222. The database server 236 in addition to hosting these tables also performs the functions of purging the data as per the data retention policy, indexing the tables for faster access and maintaining logical data marts to facilitate report creation. The tracking unit 245 of the system 200 performs validation of the data in the database tables using a triangulation technique to ensure 100% tallying of data after each ETL cycle. The reporting unit 250 of the system co-operates with the logical data mart of the database server 236 for generation of various reports. The data in these reports is displayed to the users based on their access level settings in the metadata unit 202.
The system components and their detailed working are explained hereinafter.
Metadata Unit 202: Metadata unit 202 is maintained end to end by the present invention for data level security, generation of data models and for control of the system operations.
Metadata is loosely defined as data about data. Traditionally, a metadata unit is a collection of descriptions of the items in a data model for the benefit of programmers and others who need to refer to them. The present invention goes a step further and uses the metadata's objects dictionary to automate many jobs including table creation, index creation, self validation, self correction, data entry screens, views and quick report generation.
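As a hedged illustration of such metadata driven automation, a fragment of a physical model dictionary table and the kind of definition it would drive may look as follows; the names PDM_COLUMNS and DIM_EMPLOYEE are assumptions made only for this sketch:

    -- Hypothetical fragment of the physical model objects dictionary: each row
    -- describes one column of a warehouse table, so that table creation, index
    -- creation and validation rules can be generated from metadata.
    CREATE TABLE PDM_COLUMNS (
        table_name    VARCHAR(64)  NOT NULL,
        column_name   VARCHAR(64)  NOT NULL,
        data_type     VARCHAR(32)  NOT NULL,
        is_nullable   CHAR(1)      DEFAULT 'Y',
        min_value     DECIMAL(18,2),
        max_value     DECIMAL(18,2),
        default_value VARCHAR(64),
        PRIMARY KEY (table_name, column_name)
    );

    INSERT INTO PDM_COLUMNS VALUES ('DIM_EMPLOYEE', 'employee_age', 'INTEGER', 'N', 18, 58, NULL);
    -- A generator reading these rows would emit, for example:
    --   CREATE TABLE DIM_EMPLOYEE ( employee_age INTEGER NOT NULL
    --                               CHECK (employee_age BETWEEN 18 AND 58), ... );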
The metadata unit 202 in accordance with the present invention includes:
■ an objects dictionary 204 for capturing the requirements of the system from a
conceptual, business, logical, physical structure and design perspective and
formulating rules / mappings / definitions for effective operation and control of
the system. The objects dictionary includes:
o conceptual requirement capturing means 206;
o business requirement capturing means 208;
o logical schema creation means 210;
o physical schema creation means 212; and
o means for storing user roles and levels and data level security details 214.
■ Data modeling engine 216 incorporating an Extract Transform Load (ETL) SQL
code generation means (not shown in the figures) for generation of commands for
automatic processing of data captured from various source units 220 (220-1 ...
220-n) of an enterprise. The data modeling engine 216 and ETL SQL (Structured
Query Language) code generating means (not shown in the figures) are included
in the metadata unit 202 to reduce ETL development time which in turn translates
to less total cost of ownership for the EDW.
In accordance with the present invention, the following are steps followed in the EDW for development of metadata unit 202:
• entering the vision of the client into the business metadata objects dictionary;
• entering the mission into the objects dictionary 204 which leads to the vision becoming a reality;
• setting the goal of the mission, usually with a year number (e.g. 2020);
• setting the targets to reach the goal within a time frame;
• framing strategies to reach the goal(s);
• entering the business measures which are to be tracked to ensure reaching the goal. These measures are called Key Performance Indicators (KPIs). Sometimes there may be one or more KPIs to show performance parameters for each business measure. Business rules may be different for each KPI. There will be some assumptions, exceptions, inclusions and exclusions for each business measure and its KPIs. These can form Key Risk Areas (KRA) and Key Risk Indicators (KRI) which are also to be tracked. There can also be some Key Quality Indicators (KQI) and Key Opportunity Indicators (KOI). The KOI show where new sales offers or marketing efforts or financial control will bring Return on Investment (ROI) to the enterprise;
• identifying the source systems 220 from which the data is going to be collected. The database metadata dictionary is updated with the password, IP address and the like for these source connections;
• defining Tables and columns which will contain the source data. The technical team creates tables, check constraints, foreign key constraints and indices based only on what is written in the Physical Data Model (PDM) objects dictionary. This ensures high discipline. This also makes it easy to identify a performance tuning problem;
• defining mappings to show transformations and loading into FACT tables;
• defining primary and unique keys for all tables in the EDW;
• defining check constraints (for example, employee age is between 18 and 58) in the objects dictionary. This ensures that improper data cannot get inserted into the tables even in future;
• entering foreign key constraints in the dictionary to ensure 100% referential integrity;
• defining composite indexes for high speed access;
• entering data that might be needed for small quantity of manual measures and updating the dictionary;
• keeping the system details along with database details in the dictionary;
• keeping user list, security levels, roles and access in the objects dictionary;
• creating views for the Balanced Score Card, dashboards and reports. This way half the logic of reports is handled in ETL. This makes report generation faster. Since views are at the database level, access will be the fastest this way;
• creating drill down reports and dashboard views using FACT2 tables;
• creating views using HIST and DIM tables for data mining or reports on old data. Such reports are rarely used and may run for up to an hour;
• creating ETL for each email or SMS (short messaging service) alert. This ensures that 100% of the logic is in the ETL mappings. There must be no calculation or complex logic in any report;
• creating "Create Event's in SQL will trigger off the email or alert when it occurs. Events also allow scheduling routine activities like ETL.
In accordance with one aspect of the present invention, the metadata is tightly coupled with the project and data. The dictionary is object oriented. All elements are objects. Every user and every column is also an object. All objects have properties and events associated with them. When a data object is shown in a screen it will, by default, appear as per the properties entered in the dictionary. Thus, a period selection might be a combo box, but the year will be a text box on screen. The same column is just a report column in a report or its view (polymorphism). For every data column its location on screen, shape, color, upper and lower limits are entered. Thus, all the uses of objects in the present invention are defined inside the dictionary 204. This also brings uniformity in screens developed using standard reporting tools.
Similarly, a mapping, an alert email, an SMS query and a report are all objects. The users, developers, PL (project leader), BA (business analyst), CA (chartered accountant) and all those who are involved in the project are also objects. When the supervisor changes
anything in the dictionary 204, the user and a date-time stamp are recorded against each record updated. The objects dictionary 204 includes an object dictionary updation means (not shown in the figures) to update objects in the object dictionary and further adapted to log the time and date of the update for use by the data governance unit 218. Thus, the objects dictionary tables are very important tables in the present invention; the dictionary is the brain of the system. This invention's chief objective is to keep all data manipulations at the database level to achieve the highest speed.
This invention is metadata driven, and the metadata controls the ETL operations. Hence, all the data movements are properly defined in the objects dictionary 204. This synchronizes the ETL at both the main data center and the disaster recovery site. Even if a change request is lodged after 5 years of going live, there will not be a problem as all the required information is available in the objects dictionary 204. There is hence no person dependency in the present invention. The present invention is fully process driven. The business design documents are reports based on the metadata dictionary.
The object dictionary 204 of the metadata unit 202 and its components are described
hereinafter:
■ Conceptual requirement capturing means 206: The conceptual requirement capturing means 206 of the object dictionary 204 provides an interface to the business analyst / user for capturing the business requirements. It starts with capturing of the vision, mission, goal, strategy and targets of the organization. The conceptual requirement capturing means 206 captures the various modules / themes to be included in the Enterprise Data Warehouse (EDW) and associates an owner and an alternative owner with every theme. Each theme is defined clearly and information mentioning what is included, what is excluded and exceptions to the rules is captured. These modules may be interlinked and joined. The relationship between various themes is captured by the conceptual requirement capturing means 206. In addition, the business entities for each module and their links are also captured. However, the business entity attributes are not featured here.
Further, a list of reports for each module is collected from the business analyst along with a wish list of improvements from users. The reports are classified as Daily, Weekly, Monthly, Quarterly and Annual. Comparisons between various periods, locations, product groups, products, customer groups and customers from existing MIS reporting systems are collated by the Business Analyst. These form the basic KPIs for the modules / themes. For every KPI there can be a few key contributors. For instance, the top 5% of customers contribute considerably to the revenue growth projections, so the relationship with these customers and their solvency are Key Risk Indicators (KRI). Revenue for the previous month forms a report. A comparison of two periods becomes an M.I.S. report. Revenue growth as a % becomes a KPI. Revenue Growth Potential becomes a Key Opportunity Indicator (KOI). Most dashboards in accordance with this invention show KOI analysis as these give management a good return on investment; they point out growth potential in all KPIs.
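As an illustration only, the 'revenue growth as a %' KPI mentioned above could be computed from a monthly summary table along the following lines; FACT2_REVENUE and its columns are assumed for this sketch, with period_seq being a consecutive month number:

    -- Hypothetical computation of the revenue growth % KPI from a monthly
    -- FACT2 summary table FACT2_REVENUE(period_seq, period_label, revenue).
    SELECT cur.period_label,
           cur.revenue,
           prev.revenue                                        AS previous_revenue,
           (cur.revenue - prev.revenue) * 100.0 / prev.revenue AS revenue_growth_pct
    FROM   FACT2_REVENUE cur
    JOIN   FACT2_REVENUE prev
      ON   prev.period_seq = cur.period_seq - 1;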
The theme owner will list out audience for the module's reports. These business users have data access based on their functional roles, user level and geography or area of operation. The metadata's object dictionary 204 includes a first tabular database means (not shown in the figures) which is adapted to store the information captured by the conceptual requirement capturing means 206 into a Conceptual Model object dictionary.
■ Business requirement capturing means 208: Business requirement capturing means 208 provides a tabular interface to the business analysts to capture all the business entities in each theme. For every entity, an owner and an alternative owner are identified and captured. The entity definitions, relationships and dependency between entities are recorded by this means. For instance, customer, product, customer group, product group, sales invoice, purchase bill and the like are all examples of entities, and customer name, customer address and customer location are examples of entity attributes.
All business entities, definitions, rules, limits and relationships are recorded in a tabular form and contain the sequence of business events with exception branching, exclusions and inclusions. Input for these tabular forms is given by business users with clarifications and suggestions by an industry functional expert.
For each business entity there can be many attributes. Each attribute is clearly defined and captured by the business requirement capturing means 208. Also captured are the source for each entity / source units, calculation formula, validations, format, lower and upper limit, default value, null, zero, spaces, negative numbers, decimals and accuracy rules for each attribute.
In accordance with the present invention, once the business requirements are captured by the business requirement capturing means 208, the metadata's object dictionary's first tabular database means (not shown in the figures) stores the information into a Business Model object dictionary. Thereafter, changes can be done using ODBC connectivity to the database containing the business requirements objects dictionary.
Additionally, various KPIs which are defined in a template as seen in TABLE 1 are captured. This template has all the formulae, calculations, measures, rules, inclusions, exclusions, exceptions and an example for each KPI and its related KRI, KQI and KOI. This template is initially filled up by the Business Analyst after referring to an industry specific Logical Data Model which has generic KPIs and business analytics. These forms are then to be signed off by the heads of operations, marketing, sales and finance.
Parameter: Values
Business Theme / Module
Business Measure (e.g. revenue)
KPI Definition (e.g. revenue growth)
Key Opportunity Indicator (KOI) (e.g. revenue growth potential)
Objectives: □ Top 10  □ Top n%  □ Trend  □ Bottom n  □ Bottom n%  □ Patterns  □ Seasonality  □ Forecast  □ Affinity  □ Disaffinity  □ Data Mine  □ What-if  □ 80/20  □ ABC  □ Ageing  □ FSN  □ Rank  □ Min  □ Avg  □ %  □ Max  □ Total  □ Count  □ Percentile  □ Variance  □ Alert  □ Sort  □ Simulation  □ Drill down  □ Roll-up  □ Slice  □ Dice  □ Pivot
KPI Measures
KPI Dimensions / filters
KPI Analysis required: □ Top 10  □ Top n%  □ Trend  □ Bottom n ...
Key Business Decisions enabled
KPI Calculation Steps / Formula
KPI Type: o Negative KPI  o Positive KPI
KPI Perspective: o Finance  o Internal  o Learning  o External
Frequency of update in BSC: o Daily  o Weekly  o Monthly  o Quarterly  o Bi-annual  o Annual
Owner name
Reporting Responsibility
Users
Exclusions
Inclusions
Exceptions
Assumptions = key risk areas (KRA)
Key Risk Indicator (KRI)
Key Quality Indicator (KQI)
1. What panel
2. Which panel
3. Who panel
4. When panel
5. Where panel
6. Why panel
7. How panel
RAG Value range (%, # ...): Red / Amber / Green
Index score: 1 2 3 4 5 6 7 8 9 10
Related KPI
Drill Down reports
Challenges
Dashboard Terms, remarks & notes
Data Source (SME)
Data available from when: Nn months
Data Volume
Data Quality
Remarks & Notes on data
TABLE 1 showing the KPI template
After sign off, this set of forms becomes the Business Measures Handbook (BMH). This is recorded in the Metadata's Objects Dictionary 204 in a tabular form. Subsequent changes have to go through the data governance approval process
performed by the Data Governance unit 218. Only after all due approvals, a change in the business model objects dictionary is permitted. The system can be enhanced to make BMH available as a design document report.
The normal, upper and lower limits of the value of the results of a KPI are also captured by the business requirement capturing means 208. These are used by the ETL routine to finally do a three-way tallying of measures between the target value and the source table values. This ensures an end-to-end tally of figures in the EDW with the source data. This is useful for performing data certification and data audit by the data governance unit 218.
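A minimal sketch of such three-way tallying is given below; the staging, fact and summary tables and their period_month columns are assumed for illustration, and any row returned would be flagged for data certification / audit:

    -- Hypothetical three-way (triangulation) tally for a single measure: the
    -- monthly amount is totalled independently from the staging, fact and
    -- summary layers; all three figures must agree.
    SELECT stg.period_month,
           stg.total_amount AS staging_total,
           fct.total_amount AS fact_total,
           sm.total_amount  AS summary_total
    FROM  (SELECT period_month, SUM(amount) AS total_amount FROM STG_SALES   GROUP BY period_month) stg
    JOIN  (SELECT period_month, SUM(amount) AS total_amount FROM FACT_SALES  GROUP BY period_month) fct
      ON  fct.period_month = stg.period_month
    JOIN  (SELECT period_month, SUM(amount) AS total_amount FROM FACT2_SALES GROUP BY period_month) sm
      ON  sm.period_month = stg.period_month
    WHERE stg.total_amount <> fct.total_amount
       OR fct.total_amount <> sm.total_amount;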
The artist impression for each global and regional dashboard for each KPI is included in the Business Model. A consolidated list of KPI shown in the BSC report is also depicted in this artist impression. The artist impressions of the dashboard, summary, detailed and proof reports have to be signed off by the business users. The artist impressions are created by the Business Analyst. For advanced analytics he may refer to the generic logical data model.
Some columns from other existing reports can be added to this artist impression thus consolidating a number of reports into fewer dashboards. Each panel is titled to fully explain what it shows. The tables in 'which' and 'who' panels have group headings, headings and column headings. The column heading shows units of figures being shown in the column including US$, %, Avg, Count, Kg and the like. The KPI % column is usually sorted in descending order. This column heading is given in an orange background color. The best practice followed in this invention is to have one orange column head in each and every table appearing in reports and dashboards.
The drill down report layouts are also drawn in another worksheet. A checklist of the above best practices helps the Business Analysts (BA) to perfect the artist impressions. These are then copied into a presentation for users. Panels are copied to separate slides to give enlarged readability. These entity relationship diagrams,
the business model and the logical model are also copied into a Business Requirement Document (BRD) to be later signed off by users.
The BRD is the delivery at this stage of Data Warehousing having:
1. The Concept Model
2. The Business Model
3. Business Measures Handbook
4. Analysis artist impressions
The following are best practice steps to be followed by the Business Analyst (BA) in accordance with this invention for capturing of business requirements:
1. The BA will identify business transaction details (e.g. sales) to be retained in the EDW to get a full picture of the business for the next few years.
2. Next s/he will identify the master tables referred by the transaction (e.g. item code).
3. S/he will then identify the business rules applied from rules masters (e.g. monthly price list).
4. S/he will then identify the reason masters related to the transaction (e.g.
goods return reason codes).
5. S/he will then identify the transaction summary table (e.g. sales ledger).
6. S/he will then identify the transaction register table (e.g. sales, debit and credit note register).
7. With the above information s/he will approach the business users to gather sample formats of the various currently available reports, complaints, wish list of enhancements to reports and delivery mechanism of reports.
The above is updated in the Business Data object dictionary by the BA for the system analyst to design queries and summary Fact2 tables.
■ Logical schema creation means 210: The logical schema creation means 210 has pre-saved generic logical schema details which need to be updated to meet exact
requirements of the organization for each module/theme. Logical schema creation means 210 refers to the Business model object dictionary and the generic logical schema details to create a Logical Data Model (LDM) specifically for the organization. All logical grouping, entity relationships, processes and analytics are captured in tabular form (instead of running text).
The generic logical schema details may contain the various business processes and much functionality. Only those required for the themes enumerated in the Conceptual Model are retained. The others are ignored or discarded.
The LDM helps in the logical grouping of business entities and their attributes and helps sort the analytics and reports in a chronological order. Advanced analytics can be adapted from the generic LDM into the business model. Once the business model is completed, the LDM in the metadata is fine tuned to the specific functional requirements of the EDW business users for all themes.
The generic LDM is now changed in the metadata as per business requirements. The model is now converted to masters and transactions. Tables and columns are created to form at least a 3rd normal form database. Summaries and aggregation tables are added to meet analysis specified in the business model.
The column attributes are matched to business functional requirements. The composite or single column primary key for each master table is identified. Primary, unique and foreign keys are specified for all transaction tables.
■ Physical schema creation means 212: Physical schema creation means 212 helps in creation of the enterprise data warehouse. Physical tables from Star schemas are created. Full granular data is retained in Fact tables (for each subject matter). Aggregated values are kept in summary fact tables - Daily, Weekly, Monthly, Quarterly, and Annual. The different steps followed to update the Physical data Model (PDM) objects dictionary in accordance with the present invention are given below:
1. Identifying the Fact2 tables from which various data columns of the drill down reports can be populated;
2. tracing the data required for dashboards and ad hoc reports to the Fact2 tables;
3. defining the primary, unique and foreign keys for each table in the EDW;
4. cleansing the data to validate at least the check constraints including checking for:
i. minimum value
ii. maximum value
iii. default value
iv. full fill flag
v. valid comma separated values (CSV)
vi. calculated values formula
vii. auto increment
viii. key column to be validated in ETL1
5. adding surrogate keys to each DIM table structure containing master data (dimensions). Each surrogate key column physically contains record numbers of the DIM table. The surrogate key combines composite keys into one numeric surrogate key. The surrogate key becomes the primary key of the DIM table (see the sketch following this list).
6. making the begin_date, end_date and original primary key of the CLN dimension table together into a unique key.
7. preparing star schema joins by indexed surrogate keys.
8. summarizing and partially de-normalizing FACT tables into the following FACT2 aggregate tables:
a. Daily, Weekly, Monthly, Quarterly, Annual,
b. Balanced Score Cards (BSC)
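A sketch of the DIM table structure and star schema join resulting from steps 5 to 7 above is given below; the table and column names are hypothetical:

    -- Hypothetical DIM table: a numeric surrogate key as the primary key, and
    -- the original source key combined with begin_date / end_date as a unique
    -- key for tracking the history of master records.
    CREATE TABLE DIM_PRODUCT (
        product_sk    INTEGER      NOT NULL PRIMARY KEY,   -- surrogate key
        product_code  VARCHAR(20)  NOT NULL,               -- original primary key from source
        product_name  VARCHAR(100),
        begin_date    DATE         NOT NULL,
        end_date      DATE,
        CONSTRAINT uq_product_version UNIQUE (product_code, begin_date, end_date)
    );

    -- Star schema join through the indexed surrogate key (step 7):
    -- SELECT f.amount, d.product_name
    -- FROM   FACT_SALES f JOIN DIM_PRODUCT d ON d.product_sk = f.product_sk;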
Thus, the EDW design document in accordance with this invention will contain:
a. Conceptual data model;
b. Business data model including the Business Measures Handbook and artist impression of dashboards and reports;
c. Logical Data Model;
d. Entity Relationship Diagram (ERD) having structures of the tables listed in Table 2, covering:
i. Data Flow diagram;
ii. Star Schema (Fact and Dimension tables, subject wise);
iii. Summarization (FACT2 partially de-normalized tables) corresponding to the tables and graphs in the artist impression in the BRD document; and
iv. Views based on user levels and roles.
No.  Schema  Tables Contained
1    EDW     Source
2            Stage
3            Clean
4            Reject
5            Dimension
6            Fact
7            History
8            Fact2 (2nf)
10           Balanced Score Card Report
11           Views
Table 2 showing the tables contained in the EDW schema
Security metadata capturing means 214: Security metadata capturing means 214 receives a list of users along with their designation and access level details. The details are captured by the security metadata capturing means 214 to create a security model database table. The security model database data allows or denies users' access to certain reports or data present in the EDW, thus ensuring data level security. According to the present invention, the users list, security levels, roles and access are kept in the security model database in the objects dictionary 204. A user group may have zero or more users. Similarly, a report group may have zero or more reports. A region would have many locations, a product group may have many products and so on. The data level security can be set based on user group, report group, region and product group. The security is set with an AND condition on each layer.
The security model database is adapted to provide data level security in the EDW, for stricter control and faster reporting, and object level security in the reporting tools. The security data is linked into the views of the fact and dim tables of the EDW schema; hence it is not possible to bypass security and access tables using any report writer. Power users will only see rows and columns permitted by their role and user level.
The PDM has many views that allow for proper security and helps governance. The report writer does not access any physical tables. The metadata of reporting tools only access the secured views which are filtered based on the security details in the security model database table.
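As an illustration of how the security model database table can be joined into such views, a minimal sketch follows. The view, table and column names (VUS_Fact2_Monthly_Sales, Sec_Model, region_code, product_group) are assumed for the example and are not part of the specification.

    -- Illustrative sketch only; names are hypothetical.
    -- The report writer sees only this view; rows are filtered through the security model table
    -- with an AND condition on each security layer.
    CREATE OR REPLACE VIEW VUS_Fact2_Monthly_Sales AS
    SELECT f.month_id,
           f.region_code,
           f.product_group,
           f.sales_amount
    FROM   Fact2_Monthly_Sales f
    JOIN   Sec_Model s
           ON  s.region_code   = f.region_code
           AND s.product_group = f.product_group
    WHERE  s.user_name = SYS_CONTEXT('USERENV', 'SESSION_USER');

A power user querying such a view therefore sees only the rows permitted for his/her region and product group; the physical Fact2 table itself is never exposed.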
The conceptual model, business model, logical, physical data models and user role and the security model form the metadata objects dictionary 204. The models contain the relationships between entities and a framework of the building blocks of the EDW. During development, the Business Analysts (BA) heavily depend on the conceptual model, business model and business measures handbook to create the artist impressions of dashboards and reports. These become the Business Requirement Document (BRD).
The Technical Architects (TA) depend on the BRD and Logical data model (LDM) to create the Physical data model. The ERD, Star Schemas, Aggregated tables, BSC summary, ETL, DIAD, data level security and Logical Data Marts become the design document.
The Extract Transform and Load (ETL) unit 222:
In accordance with the present invention, the ETL unit 222 extracts the source data from the source database tables (SRC) hosted in the source units 220. The data then moves, by the ETL unit 222, from the Source 220 to the STG tables in the Staging unit 224, to the CLN tables in the Cleansing unit 226, to the DIM tables 240 or Fact tables 238, and finally to the Summary tables 242 (SRC -> STG -> CLN -> FACT/DIM -> FACT2). Data movement is always 100% incremental in the present invention.
The source data tables in the source units 220 are prefixed with the label SRC and stored in staging (STG) unit 224 of the EDW. The data is either extracted from source systems 220 by ETL unit 222 (pull) or they may be deposited regularly in the STG unit 224 of the EDW (push). The SRC, STG and CLN tables contain only incremental (one day's) data from source tables. This saves expensive SAN disk space.
The eight steps of the methodology for ETL in accordance with this invention are given in the following TABLE 3. Each of the steps in the ETL methodology is represented generally by reference numerals 0, 1, 2, 3, 4, 5, 6 and 7 for Step Names ETL0, ETL1, ETL2, ETL3, ETL4, ETL5, ETL6 and ETL7 respectively in FIGURE 2.
Step Name Description
1 ETL0 The data is pulled in from OLTP (online transaction processing) source systems 220 into a staged area. Data is kept in this region in tables named with prefix "STG". This unit is called STG 224.
Sometimes a view, a materialized view or synonyms of OLTP system tables might be kept here with prefix "SRC". The SRC tables might be created from database tables, notepad, MS Excel, XML (extensible markup language) or any other industry standard data files. Some key SRC tables might be updated continuously by a feed from the OLTP system. These tables are NEVER directly used for displaying dashboards and reports to users.
2 ETL1 The staged data is cleansed and placed in a unit named as CLN 226. Data is kept in this region in tables named with prefix "CLN". Rejected records are kept in tables named with prefix "RJ1" 228.
These tables are NEVER directly used for displaying dashboards and reports to users.
3 ETL2 The data in the CLN schema is copied to a region named as the EDW schema. In this region the data is kept in a dimension (DIM) table 240 (if the data is reference master data, for example the location master) or a fact (FACT) table 238 if the data is related to transactions (for example, orders).
The following 'Transformations' are common when data is copied from cleansed tables 226 to Dim 240 and Fact 238 tables. The records in the Dim tables 240 have a surrogate key assigned to each record, equivalent to its record number. These surrogate keys are used in the Fact tables 238 to refer to the dimension tables 240 (for example, the location surrogate key is used instead of the location code).
Rejected records at this stage are kept in tables named with prefix "RJ2" 230.
These tables are NEVER directly used for displaying dashboards and reports to users.
4 ETL3 Subject wise data is copied from FACT 238 and DIM tables 240 into partially de-normalized tables named with prefix "FACT2" 242. This is to avoid complex joins during reporting. This gives rise to high speed reporting in the present invention.
FACT tables 238 are summarized and partially de-normalized into following FACT2 aggregate tables 242:
c. Daily, Weekly, Monthly, Quarterly and Annual. These tables are NEVER directly used for displaying dashboards and reports to users.
5 ETL4 FACT tables 238 are summarized and partially de-normalized into BSC aggregate tables 244.
One line summary for each KPI, user role and user level is copied from Fact2 tables 242 into the BSC table 244.
These tables are NEVER directly used for displaying dashboards and reports to users.
6 ETL5 The DIM tables 240 contain cleansed golden copy records for master data management. These can be written back to source system master tables by ETL.
These tables are NEVER directly used for displaying dashboards and reports to users.
7 ETL6 All the data in DIM 240 and Fact tables 238 are incrementally copied to the HDW 270 by ETL unit 222. Whatever inserts and updates happen in the EDW also happen in the HDW 270. Fact2 242 and BSC tables 244 are not there in HDW 270.
HDW tables 270 are NEVER directly used for displaying dashboards and reports to users.
8 ETL7 All the data in DIM 240, Fact2 242 and BSC 244 are incrementally copied to the CDW 272 by ETL unit 222. Whatever inserts and updates happen in the EDW also happen in the CDW 272. Fact tables 238 are not there in CDW 272.
CDW tables are NEVER directly used for displaying dashboards and reports to users.
TABLE 3 showing the eight step ETL methodology.
The data model and ETL allow change data capture, giving near real time data warehousing. However, it may not be practical to refresh all data in this style as most source data for the data warehouses is itself updated in batch mode. Therefore, the ETL daily batch update is usually scheduled to run at 5 to 6 a.m. in the headquarters' time zone.
This also avoids two persons printing the same report with different figures an hour apart. However, huge daily transactional data, as in Telecom, may be trickle fed by a 24 x 7 ETL into the EDW. Yet there is a lock in the security metadata to allow reporting only as at close of business (COB) of the previous day.
Data governance is introduced at all levels of the organization. Data profiling is a continuous activity included in daily ETL. De-duplication is done in ETL0 as only fresh
new incremental data is pulled from source to staging tables. Corrections, cleansing and validation are done in ETL1.
In accordance with the present invention, the ETL unit 222 has an ETL engine 232. This engine generates the ETL SQL code to automate the following processes:
a. truncating data in STG unit 224 and CLN unit 226 to save disk space;
b. using fast insert, that is, not checking empty spaces in each database block;
c. reading only incremental data from SRC unit 220 (source tables);
d. picking up corrected records if any in RJ1 228;
e. using the update style Type II to handle slowly changing dimensions;
f. moving data from SRC 220 to STG 224
g. moving data from STG 224 to CLN 226;
   i. cleansing data as per the PDM table of metadata 202;
   ii. moving the invalid records to RJ1 228;
h. picking up corrected records if any in RJ1 228 and RJ2 230;
i. moving cleansed data from CLN 226 to DIM 240 or FACT 238;
   i. checking referential integrity as per the current foreign keys in the PDM metadata table;
   ii. records failing referential integrity go to RJ2 230;
j. moving data from Fact 238 to Fact2_Daily, Fact2_Weekly, Fact2_Monthly, Fact2_Quarterly, Fact2_Annual and Fact2_BSC (an illustrative sketch of such generated SQL follows this list).
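The following is a minimal sketch of the style of SQL the ETL engine 232 could generate for items (a) to (c) above. The table names (SRC_Sales, STG_Sales) and the ETL_Control bookkeeping table are assumed for this example and are not the actual generated code.

    -- Illustrative sketch only; names are hypothetical.
    -- (a) truncate the staging table to save disk space
    TRUNCATE TABLE STG_Sales;

    -- (b) and (c): direct-path ("fast") insert of only the incremental source rows
    INSERT /*+ APPEND */ INTO STG_Sales (order_id, location_code, order_date, amount, etl1_inserted_at)
    SELECT s.order_id, s.location_code, s.order_date, s.amount, SYSDATE
    FROM   SRC_Sales s
    WHERE  s.last_update_ts > (SELECT c.last_extract_ts
                               FROM   ETL_Control c
                               WHERE  c.table_name = 'SRC_SALES');
    COMMIT;

    -- record the new high-water mark so that the next ETL cycle remains incremental
    UPDATE ETL_Control c
    SET    c.last_extract_ts = (SELECT NVL(MAX(s.last_update_ts), c.last_extract_ts)
                                FROM   SRC_Sales s)
    WHERE  c.table_name = 'SRC_SALES';
    COMMIT;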
In accordance with the present invention, the ETL unit's ETL engine 232 includes the following components for automating the ETL operations. The ETL engine 232 includes:
■ fetching means (not shown in the figures) to read from the objects dictionary 204 the source of each column for each EDW table;
■ Schema generation means (not shown in the figures) adapted to read the objects dictionary 204 and further adapted to generate set of target database tables including staging database table (STG) 224, cleansing database table (CLN) 226, dimensional database table (DIM) 240, Facts database table 238 (FACT), Fact2 summary tables 242 (FACT2) and Balanced Score Card database tables (BSC) 244 for each of the source units 220;
■ First copy generation means (not shown in the figures) co-operating with the fetching means adapted to copy source data to the STG 224 database table;
■ First validation means (not shown in the figures) adapted to read and validate the data in the STG 224 database tables using the validation information in the physical model database table and further adapted to segregate erroneous data;
■ Second copy generation means (not shown in the figures) co-operating with the STG 224 database table adapted to copy 'incremental data' present in the STG 224 database table to the cleansing database table (CLN) 226;
■ second validation means (not shown in the figures) adapted to read and validate the data in the CLN 226 database table to check for referential integrity as per the foreign key specifications stored in the physical model database table and further adapted to segregate erroneous data;
■ third validation means (not shown in the figures) adapted to check business rules in the business model of the metadata dictionary 204 and further adapted to segregate erroneous data. For instance, if on a Tuesday a new business rule is added to the business model that local calls cannot exceed ZAR 10,000, this rule is automatically incorporated in the validation from Wednesday onwards, in the next ETL cycle;
■ fourth data quality validation means (not shown in the figures) adapted to check business validation in the business model of the metadata dictionary and further adapted to segregate erroneous data whereby operational data is tallied with finance data and costing data. For instance, the billing amount must tally with what is passed as journal vouchers in the General Ledger and free offers must also tally as a cost is involved in free product offers;
■ segregation means (not shown in the figures) adapted to segregate the data present in the CLN 226 database table into master data or transaction data;
■ Third copy generation means (not shown in the figures) co-operating with the CLN 226 database table adapted to copy master data present in the CLN 226 database table to the DIM database table 240 and transaction data present in the CLN 226 database table to the corresponding Facts database table 238 respectively in the database server. In case a fact record appears before a dimensional record, the third copy generation means includes a surrogate key generation means (not shown in the figures) adapted to insert a surrogate key or a record in the dimensional database corresponding to the Fact record. Once the dimensional record is received, the surrogate key is replaced with that record (a sketch of the Type II style in which DIM records are updated follows this list);
■ Fourth copy generation means (not shown in the figures) is provided to copy the data in the DIM and the FACT tables to the FACT2 summary table and the KPI, KOI, KQI, KRI data from the summary table to the BSC tables. Further, the data from all these tables is stored in a logical data mart 246 present in the database server 236;
■ Error handling means 234 adapted to receive the erroneous data and further adapted to notify a system administrator to correct the records and further adapted to receive and push back the corrected records to the STG database table 224. The error handling means 234 processes the rejected records using two methods. If it is possible to fill up the missing data, then the missing data is updated; for example, the postal area address can be filled in from the PIN code. Otherwise, the rejected records are sent as XML or XLS files to the source system owner / administrator for correction via a communication interface (not shown in the figures). The corrected data again forms an input for data cleansing. A reference number is used to ensure that re-rejection does not happen; and
■ Updation means (not shown in the figures) adapted to update the STG database tables 224 with newly received source data using change data and trickle feed techniques.
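As referenced in the third copy generation means above, a minimal sketch of a Type II style update of a dimension table from its cleansed table is given below. The names (CLN_Location, DIM_Location, dim_location_seq) are assumed, and a sequence is used here only to approximate the record-number surrogate key; this is not the generated code itself.

    -- Illustrative sketch only; names are hypothetical.
    -- Step 1: end-date the currently valid DIM record when the cleansed master record has changed.
    UPDATE DIM_Location d
    SET    d.end_date = TRUNC(SYSDATE) - 1
    WHERE  d.end_date = DATE '2222-12-31'
    AND    EXISTS (SELECT 1
                   FROM   CLN_Location c
                   WHERE  c.location_code = d.location_code
                   AND    c.location_name <> d.location_name);

    -- Step 2: insert a new currently valid version with a fresh surrogate key.
    INSERT INTO DIM_Location (location_sk, location_code, location_name, begin_date, end_date)
    SELECT dim_location_seq.NEXTVAL,
           c.location_code,
           c.location_name,
           TRUNC(SYSDATE),
           DATE '2222-12-31'
    FROM   CLN_Location c
    WHERE  NOT EXISTS (SELECT 1
                       FROM   DIM_Location d
                       WHERE  d.location_code = c.location_code
                       AND    d.end_date      = DATE '2222-12-31'
                       AND    d.location_name = c.location_name);
    COMMIT;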
Similarly, the ETL engine 232 generates code for updating the DIM 240, FACT 238, FACT2 242 and BSC tables 244 as per the current definition in the objects dictionary 204. Therefore, this invention is totally metadata driven.
Also, data from new organizations merging into the organization is processed similarly using ETL1. This data shares some dimensions and mappings with the rest of the EDW (conformed dimensions design). Hence it is possible to absorb data from unlimited sources in the EDW of this invention.
In accordance with the present invention, there is no need to use any other ETL tool, as the engine 232 generates high speed ETL code. The generated code ensures database level data movement which is faster than application level data movement by an ETL tool. Also, the ETL engine 232 enables creation of EVENTS SQL to schedule the sequence of data movement. This is useful as the 'Facts' tables 238 have to be updated in the EDW only after the 'Dimensions' 240; hence sequence scheduling helps ensure this.
Since the objects dictionary 204 is maintained in Type II mode, the whole history of changes is visible to the DBA (database administrator), the architect and the developer. A mismatch between the CR (change request) documentation and the production tables, indices and constraints is not possible because the objects dictionary 204 must hold the correct settings.
Where a timestamp is properly available in a source table, it can be used to incrementally update the STG table. Where this is missing, the "key column to be validated in ETL1" can be used for incremental update. Full truncate and update of a table for ETL is strongly discouraged in this invention. Filters are applied first when copying data from SRC 220 to STG 224 or from STG 224 to CLN 226. A date filter is applied first so that very old data is not brought into the EDW at all. Rejected records go to the respective RJ1 228 table of each CLN table 226.
In accordance with another aspect of the present invention, data correction can be done in the RJ1 228 and RJ2 230 tables. It is not mandatory that source systems 220 must be corrected first. Only corrected records are re-processed in the next ETL cycle. After the data is successfully moved to CLN 226, the corresponding RJ1 228 record is deleted. After the data is successfully moved to DIM 240 or FACT 238, the corresponding RJ2 230 record is deleted. Uncorrected records will continue to be in the RJ1 228 and RJ2 230 tables. Amounts, volumes and other measures in RJ1 228 and RJ2 230 will be shown as "others" in reports as the last line item. Only then will control totals tally between the source systems and the data warehouses. Only fresh data is processed from the Source units 220. Data processed once is not processed again. Generally only one day's data is retained in STG 224 and
CLN 226. This is to save disk space. STG and CLN data are not copied to disaster recovery site.
In accordance with yet another aspect of the present invention, the metadata's data modeling engine 216 automates the complete ETL cycle by reading the conceptual model and an industry standard logical data model and merges these with the business model to form the logical data model for the organization. This is then converted to a star schema physical data model. This is forward engineered to create the EDW objects and ETL code is generated. All these functions are automatically handled by the Data Modeling engine 216 of this invention. These functions help the System Architect and Data Architect to quickly fine tune the physical data model in a few iterations. This engine 216 is one of the accelerators that will benefit the EDW data modeling.
Particularly, the data modeling engine 216 (a) reads source system tables metadata, (b) converts Third Normal Form (3NF) tables of the logical design to EDW Star Schema, (c) incorporates validation rules from the Business Model, (d) automatically inserts records into the PDM metadata table to define the EDW objects, (e) generates code for forward engineering and (f) generates SQL code for ETL.
The following are the fifteen metadata tables 204 in accordance with this invention as seen in TABLE 4:
No. Table name Metadata Role
1 Tmx CM obj dict Conceptual model
2 Tmx BM obj dict Business model
3 Tmx LDM obj dict Logical data model
4 Tmx PDM obj dict Physical data model
5 Tmx ndx obj dict Indexes
6 Tmx scrn obj dict Screens
7 Tmx rpt obj dict Editable Reports
8 Tmx bsc obj dict Balanced Score Card
9 Tmx dash obj dict 360° seven panel dashboards
10 Tmx summ obj dict Summary reports
11 Tmx details obj dict Detailed reports
12 Tmx proof obj dict Proof reports
13 Tmx_sec_obj_dict Security layers - report/user groups, user roles/levels, data level access
14 Tmx ETL obj dict ETL
15 Tmx db obj dict Databases maintenance
TABLE 4 showing the metadata tables
All these 15 tables are inter-linked by foreign keys. If something is updated in one table, corresponding changes in the other 14 tables are done automatically by triggers. For example, if "Customer" is changed to "Party" in the Conceptual Model, then all 15 tables are automatically updated as required using the Type II style. Therefore, all these 15 tables behave as one integrated metadata 202.
Typically about five records are created in the metadata's object dictionary's physical model table by this engine 216 for each column in a given source master table automatically creating staging, rejection, cleansed and dimension table definitions in the metadata as seen in Table 5.
Source Table PDM tables in EDW example
Location Master 1. SRC Location
2. RJ1 Location
3. RJ2 Location
4. CLN Location
5. DIM Location
TABLE 5 showing the records created by the Data modeling Engine for a physical data
model.
Typically about 11 records are created in the PDM metadata table by this engine 216 for each column in a given source transaction table automatically creating staging, rejection, cleansed, fact, fact2 and BSC table definitions in the metadata as seen in Table 6.
Source Table PDM tables in EDW example
Sales Transaction 1. SRC Sales
2. RJ1 Sales
3. RJ2 Sales
4. CLN Sales
5. Fact Sales
6. Fact2_Daily_Sales
7. Fact2_Weekly_Sales
8. Fact2_Monthly_Sales
9. Fact2_Quarterly_Sales
10. Fact2_Annual_Sales
11. BSC
TABLE 6 showing the records created by the data modeling engine for a source
transaction table.
This engine 216 automatically inserts extra records required to handle data cleansing, simulation, audit trail and data mining columns as seen in Table 7 for RJ1, RJ2, CLN, DIM, FACT, FACT2 and BSC table definitions via the record insertion means (not shown in the figures).
Name Type Null Purpose
begin_date Date No valid from for Type II
end_date Date Yes valid till for Type II
Transformation char(50) Yes Substring, Num2Char, etc.
Source_table char(50) Yes Database, schema, table name
Source_column char(50) Yes Source column name
shadow1 char(50) Yes For calculated or slowly moving dimension or simulation values
shadow2 char(50) Yes
shadow3 char(50) Yes
scene1 numb(15,3) Yes
scene2 numb(15,3) Yes
scene3 numb(15,3) Yes
archive_FY numb(6,0) Yes Financial year partition
data_purity numb(3,0) Yes percentage at record level
RJ_ref_1 numb(8,0) Yes to avoid re-rejection
ETL1_Inserted_at date-time No record history
ETL1_Updated_at date-time Yes
Record_status No Validity of record
Record_sec_level No Record level Security
TABLE 7 showing the extra records entered in the table definitions
This engine 216 also inserts in each record the source for each column in each EDW table. This facilitates backward traceability, entity relationship diagram report generation from metadata and automatic ETL code generation from metadata.
The data model is fully documented in the Metadata Objects Dictionary 204 of this invention. Queries can be run on this metadata and reports created in a reporting tool showing relationships between tables, formulae, links and joins.
Industry standard Erwin tool is integrated with the 15 metadata tables in accordance with this invention. A change can be done in Erwin and it will reflect in the appropriate table and be seen in a metadata dashboard and reports. In accordance with this invention, all updates in the 15 metadata tables are in Type II style with end dates for edited or deleted records.
According to this invention fact-less fact, early arriving facts and such exceptions are marked for special ETL treatment. Early arriving facts allow for a temporary insertion of a code in the relevant dimension or use surrogate key 999998 as "Undefined". Later when the dimension record arrives, the correct surrogate key should be updated in the Fact, Fact2 and BSC tables.
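A minimal sketch of the 'Undefined' surrogate key technique for early arriving facts follows. The names (CLN_Orders, Fact_Orders, DIM_Customer), the assumption that the fact table retains the natural customer_code for the later correction, and a pre-seeded 'Undefined' row with surrogate key 999998 in the dimension are illustrative assumptions only.

    -- Illustrative sketch only; names are hypothetical.
    -- While loading facts, fall back to the 'Undefined' surrogate key 999998 when the
    -- corresponding dimension record has not yet arrived (early arriving fact).
    INSERT INTO Fact_Orders (order_id, customer_code, customer_sk, order_amount)
    SELECT c.order_id,
           c.customer_code,
           NVL((SELECT d.customer_sk
                FROM   DIM_Customer d
                WHERE  d.customer_code = c.customer_code
                AND    d.end_date      = DATE '2222-12-31'), 999998),
           c.order_amount
    FROM   CLN_Orders c;

    -- Later, when the dimension record arrives, the correct surrogate key is written back.
    UPDATE Fact_Orders f
    SET    f.customer_sk = (SELECT d.customer_sk
                            FROM   DIM_Customer d
                            WHERE  d.customer_code = f.customer_code
                            AND    d.end_date      = DATE '2222-12-31')
    WHERE  f.customer_sk = 999998
    AND    EXISTS (SELECT 1
                   FROM   DIM_Customer d
                   WHERE  d.customer_code = f.customer_code
                   AND    d.end_date      = DATE '2222-12-31');
    COMMIT;

The same correction would then be carried through to the corresponding Fact2 and BSC rows.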
For every source table, equivalent STG, RJ1, CLN, RJ2, DIM or FACT, Fact2 and BSC table definitions are automatically created by the engine 216 in the PDM. This ensures accuracy in metadata - column attributes are uniform in source and all EDW tables. The number of decimals does not change in any of the tables' generated DDL (data definition language).
Additional shadow columns are added by the engine 216 for handling simulations, flags, status, audit trail and Type II update, for instance the 'with effect from' and 'with effect till' date columns seen in Table 7. A future end date of 31-DEC-2222 may be used to indicate the currently valid record.
Some databases like MS SQL allow a filtered index which can be used for including only the currently valid records. In Oracle 11g this has to be handled by a composite primary key which will include the "with effect till date" column. The slowly changing dimension Type II update strategy is used for dimension and fact tables. This is especially useful to handle early arriving facts and fact records which are sometimes updated or even deleted in the source system.
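For illustration, and assuming the hypothetical DIM_Location table with a Type II end_date column used in the earlier sketches, the two approaches mentioned here could look as follows.

    -- MS SQL Server: a filtered index covering only the currently valid Type II records
    -- (31-DEC-2222 marks the currently valid record; names are hypothetical).
    CREATE NONCLUSTERED INDEX ix_dim_location_current
    ON DIM_Location (location_code)
    WHERE end_date = '2222-12-31';

    -- Oracle 11g alternative: a composite unique index that includes the
    -- "with effect till date" column, since filtered indexes are not available.
    CREATE UNIQUE INDEX ux_dim_location_current
    ON DIM_Location (location_code, end_date);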
The present invention includes a forward engineering engine having code creation means to generate 'Create code' and code updation means to generate 'Alter code' for EDW objects. This is done from the current objects dictionary 204 definitions of tables, primary
key, foreign key, unique key, check constraints, default values and comments. This feature saves much time in documentation. In development much effort is saved in table creations, maintenance and enforcing all best practices in database definition. Summary tables and views are automatically created by the forward engineering engine based on current definition in this invention's objects dictionary. This saves much effort in R&D while developing optimized data tables for reporting. Composite Indexes are created for high speed by this engine based on current definition in the metadata dictionary.
In accordance with the present invention, the System data Analyst can run a report to generate the DDL code for EDW objects and Entity-Relationship Diagram (ERD) in a metadata report. S/he and the EDW Solution Architect can validate the ERD including the primary, unique and foreign keys. The validations for each column in the PDM have to be matched with the business model. A cross verification of BM - LDM - PDM report is generated to ensure all columns to handle user specified functional requirements are present in the PDM and the ERD has been signed off by the EDW Architects.
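Purely as an illustration of the metadata driven 'Alter code' generation, the following sketch reads the PDM objects dictionary and emits ALTER statements for newly added column definitions. The column names of Tmx_PDM_obj_dict used here (table_name, column_name, data_type, data_length, begin_date, end_date) are assumed for the example and are not defined by this specification.

    -- Illustrative sketch only; the metadata column names are assumed.
    SELECT 'ALTER TABLE ' || p.table_name ||
           ' ADD ' || p.column_name || ' ' || p.data_type ||
           CASE WHEN p.data_length IS NOT NULL
                THEN '(' || p.data_length || ')'
                ELSE NULL END || ';' AS alter_ddl
    FROM   Tmx_PDM_obj_dict p
    WHERE  p.end_date   = DATE '2222-12-31'   -- only currently valid metadata rows (Type II)
    AND    p.begin_date >= TRUNC(SYSDATE);    -- column definitions added in the current cycle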
Database Server 236: The database server 236 in this invention uses classical subject wise star schemas with facts tables 238 and dimensions tables 240 joined by surrogate keys (not ordinary 3NF tables or data marts). The cleansed data from the ETL unit 222 is loaded onto the Fact 238 or DIM 240 tables accordingly. In the EDW, the Fact tables 238 contain granular data.
Since both the fact and dimension records are updated using Type II strategy it is possible to get reports grouped as on today or as on a previous date. Audit trail of changes in Fact records can also be seen in audit reports.
The conformed dimension tables are stored in 5th normal form without snowflakes. This gives following desirable features to the dimension tables 240 in the EDW:
a) Unlimited joins are possible to link disparate systems such as finance, CRM (customer relationship management) and HR (human resource).
b) Data from all subsidiaries and new companies merged into the organization can be linked seamlessly.
c) Unlimited groupings are possible in the EDW. Finance department may want expenses to be grouped separately for each country. The grouping of individual business units is also possible as required by the different groups of users.
d) Master data management (MDM) can be incorporated into all dimensions very easily. One of the groupings or definitions would be named as the "golden copy". This golden copy is written back to source system master tables by ETL.
Further, as seen in the eight step ETL methodology the subject wise data is copied from FACT 238 and DIM tables 240 into partially de-normalized FACT2 summary tables 242 named with prefix "FACT2" by the ETL unit 222. The Fact2 summary tables 242 hold the summary FACT and DIM data to avoid creation of complex joins to give high speed reporting. FACT tables 238 are summarized and partially de-normalized into aggregate tables such as Daily, Weekly, Monthly, Quarterly and Annual.
In accordance with the present invention, the aggregated summary tables are created across time dimension. The aggregation style of Facts and summarization across time dimension does not restrict any analysis. This reduces the number of aggregated tables and ETL batch window time. The monthly, quarterly and annual summary fact tables are used for Dashboard reports and the weekly and daily summary tables are used for detailed reporting. The aggregated tables are partially de-normalized to meet data mining requirements and to speed up drill down reports. Dimensions and hierarchies in all aggregated tables are same as star schema for Fact 238 and Dim 240 tables. In the metadata of the reporting tools, alternate hierarchies may be provided for local departmental reporting.
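A minimal sketch of such an aggregation across the time dimension is given below; the table and column names (Fact_Sales, Fact2_Daily_Sales, sale_datetime) are assumed for the example.

    -- Illustrative sketch only; names are hypothetical.
    -- Summarize yesterday's granular fact rows into the daily FACT2 aggregate,
    -- keeping the same dimension surrogate keys as the star schema.
    INSERT INTO Fact2_Daily_Sales (sale_date, location_sk, product_sk, total_amount, txn_count)
    SELECT TRUNC(f.sale_datetime) AS sale_date,
           f.location_sk,
           f.product_sk,
           SUM(f.sale_amount)     AS total_amount,
           COUNT(*)               AS txn_count
    FROM   Fact_Sales f
    WHERE  f.sale_datetime >= TRUNC(SYSDATE) - 1   -- incremental: only the latest day's data
    AND    f.sale_datetime <  TRUNC(SYSDATE)
    GROUP  BY TRUNC(f.sale_datetime), f.location_sk, f.product_sk;
    COMMIT;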
Further, the database server 236 includes the BSC aggregate tables 244 which hold the summarized and partially de-normalized FACT table data. One line summary for each KPI, user role and user level is copied from the FACT2 tables 242 into the BSC table 244. The BSC table 244 is a special aggregated table containing KPI, KOI and KRI wise data for instant display of balanced score card reports for each user.
Furthermore, all the data in the DIM 240 and Fact tables 238 is incrementally copied to a History Data Warehouse (HDW) 270 by the ETL unit 222. Whatever inserts and updates happen in the EDW also happen in the HDW 270. However, FACT2 242 and BSC tables 244 are not present in the HDW 270. Also, HDW tables are never directly used for displaying dashboards and reports to users. The HDW 270 holds the grain level data and summaries and forms a separate backup database for R&D reports. Ad hoc reports can be run on this database. The reports may take time to run as they run on the full data set, since there is no cut-off of old data from this database. 100% of the data is retained in the HDW 270. This backup database is hosted on a separate stand alone low power server inside the data center. The purpose of this server is R&D, tracing fraud origin or generation of special data mining reports, not fast reports. This ensures that no insight is lost due to non-availability of data online. The HDW 270 also has a rationalization rule to scale data based on each year's revenue. This is to offset and smooth skews due to increased volume of business across long time periods. Reports on very old records can also be generated from this history backup database for regulatory needs.
Further, all the data in DIM, Fact2 and BSC are incrementally copied to the Corporate Data Warehouse 272 by ETL unit 222 for global reports. When inserts and updates happen in the EDW they also take place in the CDW 272. Fact tables are not present in CDW 272. Like HDW 270, CDW 272 tables are never directly used for displaying dashboards and reports to users.
The CDW 272 keeps summary data from EDWs in different countries. The CDW 272 contains only summary data for headquarters (HQ) reporting. There is no atomic (grain) level data of the Fact tables 238 in the CDW 272. The daily summaries are available for at least twenty six months. The weekly and monthly summaries are kept in the CDW 272 for at least five years. The quarterly and annual summaries can be in the CDW 272 for a decade. Every month the oldest data is purged in the CDW 272 in line with the above recommendation. In some installations, the periods may vary slightly due to regional requirements. The CDW 272 Fact2 tables are joined by surrogate key to the HQ EDW Dim tables to form a star schema.
In contrast to the CDW 272, only Fact and Dim tables are retained in the HDW 270. The HDW has all data from different countries. There is no purging of old data from the HDW.
Cleansed data in DIM tables 240 form the master data management golden copy and can be re-written in source systems' 220 masters by the Master Data Management (MDM) unit 248. Using simple ETL steps the source systems 220 can be updated with information like latest phone numbers, address and other attributes of dimensions in this invention.
In accordance with this invention, none of the EDW tables are exposed to the reporting tool. The database server 236 incorporates a logical data mart 246 for holding data required by the reporting unit 250 for generation of reports. Only logical data mart views (VUS) of tables and security settings as per metadata are exposed to the reporting tool. Hence it is not possible for even a power user to see unauthorized rows and columns.
The database server 236 includes means adapted to perform table indexing on the STG, CLN, FACT, FACT2 and BSC tables for optimizing query performance based on cost based optimizer rules. Such a cost based optimizer calculates the number of disk reads that will be needed by using various indexes and decides to use or not to use each index. It uses an index only if it will result in ignoring 70% of the records. Hence, the index rule in this invention specifies that the best practice is indexing fact tables by the primary key, the foreign keys and a composite key including all the dimension and flag columns. Therefore, in the present invention, composite indices are very much used. These indices are specified by the SA (system analyst) in the NDX metadata using the order and view column and the "key column to be validated in ETL1". This results in very fast searches and updates in ETL even on very long tables. The summary tables are indexed for high speed reporting.
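By way of illustration of this indexing rule, and re-using the hypothetical fact table names from the earlier sketches:

    -- Illustrative sketch only; names are hypothetical.
    -- Index the fact table by its foreign keys and by a composite key covering
    -- the dimension and flag columns, as per the index rule above.
    CREATE INDEX ix_fact_sales_loc  ON Fact_Sales (location_sk);
    CREATE INDEX ix_fact_sales_prod ON Fact_Sales (product_sk);
    CREATE INDEX ix_fact_sales_comp ON Fact_Sales (location_sk, product_sk, time_sk, record_status);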
Along with indexing the database server 236 includes means adapted to perform data purging as per a predetermined data purging policy. The oldest data is deleted from Fact and Fact2 tables every month as per the data retention policy. Typically, at least fourteen
months' data is retained in FACT tables in the EDW. Daily summaries are retained for at least twenty five months. Weekly and monthly summaries are retained for at least five years. Older data is retained as quarterly and annual summaries. As the HDW 270 holds the complete data with no cut-off dates, a user may need to raise a ticket to access detailed information from the HDW 270.
Only a few power users may be granted access into the HDW 270 as this is on a low cost and low power server. The logic is that old data can be kept on a slower server as reports on detailed data a few years old may be very infrequent. This brings down the total cost of ownership (TCO) of the EDW.
The physical model data architecture has subject wise star schema. For a new complex business operations analytics, joins may be created using views in UAT. If the performance is not within the SLA, a table may be added as a physical Fact2 table in the physical model, in accordance with this invention.
In accordance with yet another aspect of the present invention, the STG 224 and CLN tables 226 are not transported to the DRS. Only the EDW compressed data is sent to the DRS 100a. The data compression offered by some database vendors saves more than 80% of transmission cost. Because of this, it is economically viable to have the data center 100 and the DRS 100a in synch and in active-active mode giving 24 x 7 reporting from the EDW.
If the data center is down for maintenance, the DRS 100a will continue the processing of ETL and serving reports. After the data center comes up, the (Oracle 11g grid) database will automatically update all required tables synchronizing the grid.
Ideally there is no need to copy the present invention database from the main data center to the disaster recovery site. A second copy of the present invention database can be installed at the disaster recovery site which will be continuously updating the EDW database from the disaster recovery copy of the core OLTP system data. Only manual data entry, data files and other data SRC tables used in the present invention need to be
copied to the disaster recovery site. This will translate to huge saving in not transmitting 99% of the EDW data from the main site to the disaster recovery site.
REPORTING UNIT 250: The reporting unit 250 generates BSC report, dashboards and reports from views named with a prefix of "VUS".
As reports are generated from views they avoid complex star schema joins and Type II filters in reports at run time and hence give faster outputs. Views are created separately in the Physical Data Model table of the Objects Dictionary 204 for each drill down report and panels in dashboards. This ensures higher performance tuning for each view. Only required columns and rows are exposed by each view giving higher security. This also gives higher scalability of number of users. The views are tuned to give instant reports.
The present invention design avoids complex joins right from the data analysis stage. So the data movement is in a straight line. The views also avoid complex joins. Hence, the views are fast to respond at database level for a query from a report. The report run time only adds a few filters specified by the user to pick appropriate records from the view. Materialized views are not used in this invention. The views are defined in the data dictionary.
Views are normally created only on FACT2 242, BSC 244 and FACT 238 tables joining the Security model database table. Generally, the FACT2 Annual, Monthly, and Weekly data is used for dashboard panels and the FACT2 Daily data for drill down reports. Views on DIM tables might be created for Master Data Management 248 or an SOA project to get a "golden copy" which is exposed via web services to other applications.
In accordance with this invention, joining multiple Fact tables 238 in a view is strongly discouraged to ensure high speed instant reporting. Joint views on HIST and DIM tables 240 might be created for data mining or rare reports on old data. These involve star schema joins and hence such reports might take up to an hour to run. These may not be optimized as such reports are run infrequently.
The reporting unit 250 supports multiple reporting tools which can be linked to the Logical Data Mart 246. The multiple reporting tools may be selected from the group of reporting tools consisting of OBIEE (Oracle Business Intelligence suite Enterprise Edition), Cognos for standard reports, SAS for advanced analytics and BO (Business Objects) represented generally by reference numerals 2501, 2502, 2503... 250n in FIGURE 2.
In this invention, only the views of FACT2 242 and BSC 244 tables are allowed in the Cognos Framework Manager, the Universe in BO or the rpd presentation layer in OBIEE. There can be no DIM 240 or FACT 238 tables in the reporting tools' metadata manager and no joins or filters; only then will the generation of instant reports be possible.
Additionally, the system generates alert reports to monitor each KPI. When a particular % or figure is reached, an alert SMS/ Email or a Report is sent to user(s) concerned.
The various reports generated by the reporting unit 250, as seen in block 252 of FIGURE 2, are as follows:
■ Balanced Scorecards 254: The Balanced Scorecard report 254 (BSC) is the opening screen for all users. This report tracks the KPI, KOI and KRI for the users. From the BSC, users can drill down to 360 degree view dashboards, summary and detailed reports. BSC is a special summary table used for the Balanced Score Card report in this invention. The BSC table 244 view is available in the Logical Data Mart 246 which is referenced by the reporting unit 250 for the generation of this report.
■ Seven panel dashboard 256: This report is generated from Fact2 Month aggregated view to display the seven aspects of a KPI for further inference by the users.
■ Summary Report 258: This report is generated from the Fact2 aggregated weekly table giving a 360 degree view of the weekly data of the enterprise.
■ Detailed Report 260: This report is generated from the Fact2 aggregated daily table to give a 360 degree view of the daily data of the enterprise
■ Proof Report 262: This is a detailed grain level report generated from the Fact table 238.
■ Design reports 264: This report is generated using the logical model database of the metadata unit 202 to generate various schema designs, the BMH and the ERD relationship diagrams of the EDW.
■ Security usage report 266 and User wise Data access report 268: These reports are generated using the security model database of the metadata unit 202 to give details on the usage of the system across the enterprise and also userwise.
The aggregated summary tables give instant reports in accordance with this invention. The star schema data is summarized across the time dimension; this allows all types of reporting to be met. There is no compromise or trade-off whereby some report views become impossible because summarization was done by "region and product group". Hence all analytics are possible by the summarization technique adopted in this invention.
The aforementioned description detailed the complete operation of the EDW right from loading of source data into the warehouse to the generation of reports. Also detailed were the operations of the ETL engine and the data modeling engine which help to speed up the process by automatically fetching, validating, cleansing and loading the data into the warehouse.
DATA CLEANSING and SELF VALIDATION
Data cleansing and self validation are the key aspects of the present invention and their detailed operation is described hereinafter. Data cleansing is carried out in two stages in this invention. Validation errors are caught in the Rejection 1 228 tables and referential integrity failures are caught in the Rejection 2 230 tables. Data correction can be done with these tables without necessitating corrections in source systems. The Triangulation log TL is tracked by the tracking unit 245 which logs the results of the three ways tallying of data for each ETL cycle. The Error log EL stores exceptions raised by the database system. Both these logs are stored in the Error handling means 234 of the ETL unit 222.
In accordance with the present invention, self validation is done to store 100% tallied data in summary tables to be used for reporting. After data is copied from one table to another, at each stage, three comparisons are made to ensure the figures in source and target tally.
1. Records tally ensures that all records were copied (including those rejected).
2. Amount, quantity, number, hours or other measures are also tallied separately.
3. Similarly the KPI formula components are calculated in source and target.
The source minus target difference is kept in Triangulation log table TL of the error handling means 234 and must be zero. This three way tallying ensures 100% tallied data. This three ways tallying in the present invention is called Data Validation by Triangulation. If tally for a particular ETL does not give zero, an ETL alert is raised. The reports related to untallied KPI are not shown to users as they could potentially be less than 100% correct. The alert may be an SMS or an email to the supervisor. S/he has to set right the ETL before the reports can be shown to users.
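A minimal sketch of this three way tally for one ETL2 step is given below. The table names (CLN_Sales, Fact_Sales, RJ2_Sales) and the use of the ETL1_Inserted_at column of Table 7 to isolate the current cycle's rows are assumptions made for the example. The resulting row would then be inserted into the Triangulation log TL in the format of Table 8.

    -- Illustrative sketch only; names are hypothetical.
    -- Source minus target minus rejected must be zero for both the record count and the measure.
    SELECT s.cnt                      AS source_records_count,
           t.cnt                      AS target_records_count,
           r.cnt                      AS rj2_records_count,
           s.cnt - t.cnt - r.cnt      AS difference_records_count,
           s.amt - t.amt - r.amt      AS difference_amount_or_qty,
           CASE WHEN s.cnt - t.cnt - r.cnt = 0
                 AND s.amt - t.amt - r.amt = 0
                THEN 'Y' ELSE 'N' END AS is_this_etl_a_success_flag
    FROM  (SELECT COUNT(*) cnt, NVL(SUM(sale_amount), 0) amt
           FROM   CLN_Sales) s,
          (SELECT COUNT(*) cnt, NVL(SUM(sale_amount), 0) amt
           FROM   Fact_Sales WHERE etl1_inserted_at >= TRUNC(SYSDATE)) t,
          (SELECT COUNT(*) cnt, NVL(SUM(sale_amount), 0) amt
           FROM   RJ2_Sales  WHERE etl1_inserted_at >= TRUNC(SYSDATE)) r;

If the differences are not zero, the ETL alert described below is raised and the related reports are withheld from users.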
Since this invention has ETL with trickle feed which gives near real time data warehousing, this can occur any time during day or night. While the team is fixing the problem, a roll back to previous ETL may be done by the supervisor. This will ensure that users can at least see yesterday's reports. A future development of this invention is to ensure that only if an ETL batch concludes properly, it is committed. Else automatically users will continue to see reports as on previous ETL cycle without necessitating a roll back.
When data is moved to summary and aggregated tables, the amount, quantity and number of transactions are tallied. This also gives three ways triangulation validation in the present invention. Thus this triangulation validation is done at every step in the present invention.
Data certification and EDW data audit can be facilitated by the Triangulation log file TL in this invention. This proves that all the data in the data warehouses are in sync and matching. This is done in each ETL cycle. So data accuracy is an ongoing process in this invention. Formal reports on data quality KQIs could be implemented across all data.
The Triangulation table TL keeps track of each ETL cycle. It contains information shown in Table 8 below:
Col Id Column Name Data type Data Length Decimals Null Default Key
1 ETL_cycle_number NUMBER 9 0 No 1
2 ETL_step NUMBER 1 0 No
3 Source_Schema_Table_name VARCHAR2 50 No
4 Target_Schema_Table_name VARCHAR2 50 No
5 Reject_Schema_Table_name VARCHAR2 50 No
6 Source_records_count NUMBER 9 0 No
7 Target_records_count NUMBER 9 0 No
8 Difference_Records_Count NUMBER 9 0 No
9 RJ1_records_count NUMBER 9 0 No
10 Process_tally_records NUMBER 9 0 No
11 Source_amount_or_quantity NUMBER 12 3 No
12 Target_Amount_or_Quantity NUMBER 12 3 No
13 Difference_Amount_or_Qty NUMBER 12 3 No
14 RJ_Amount_or_Quantity NUMBER 12 3 No
15 Process_tally_amt_qty NUMBER 12 3 No
16 Source_KPI_or_other_Formula NUMBER 12 3 No
17 Target_KPI_or_other_Formula NUMBER 12 3 No
18 Difference_in_KPI_or_Formula NUMBER 12 3 No
19 RJ1_kpi_formula NUMBER 12 3 No
20 Process_tally_formula NUMBER 12 3 No
21 Is_this_ETL_a_success_Flag VARCHAR2 1 No 'N'
22 Seconds_for_this_ETL_step NUMBER 9 0 No
23 Inserted_at DATEtime No sysdate
24 Tmx_Security_level NUMBER 1 0 No 1
Table 8 shows the format of the triangulation table
Three way triangulation - (1) Records count (2) Amount or Quantity (3) KPI formula or count of one dimension (e.g. customers). Columns 3, 4 and 5 contain schema name + '.' + table name; that is why these columns are more than 30 characters long. This is especially true as source data might come from various schemas. This triangulation ensures that the source data and the data in all fact and aggregated tables match 100%.
In accordance with the present invention, following is the ETL logic used in this invention for FACT data processing:
Rules:
1. ETL is 100% incremental.
2. Data is copied into a Fact table 238 only once.
3. Re-processing does not result in duplicate entry in Fact table 238.
4. Delete SQL command is not allowed in Fact table 238.
5. Re-process will copy only corrected data into a Fact table 238.
6. Triangulation will always show three ways tallying of data and process tally also.
Style:
1. Proc_Day1_ETL will run only once to load initial data.
2. Proc_Daily_ETL will run continuously 24 x 7 in an endless loop.
a. A wait is introduced at the beginning of this loop with three conditions.
i. Scheduling is done by introducing a sysdate time check at the beginning of the loop. E.g. 03:00:00 hours.
ii. A time interval of 5 minutes is introduced to check for new data arrival. As soon as new data is found, for example in client dimension, this can start processing.
iii. Manual re-process for only corrected records can be triggered by inserting a new record in STG 224. This manual triggering sets a reprocess flag (Reprocess_ETL_flag) in the reprocess table 235 equal to 1 as seen in Table 9.
b. At the end of the loop a new record is inserted in STG 224 and the
reprocess flag in reprocess table 235 is set to 0.
Statistics:
1. A count on the reprocess table 235 where the reprocess flag = 1 will show how many times corrections are happening for a given month; and
2. the triangulation log TL, which has a success flag, shows Y or N to represent success or failure for each ETL cycle.
Col Id Column Name Data type Data Length Data Precision Nullable Default Primary Key
1 Reprocess_count NUMBER 9 0 No 1
2 Reprocess_ETL_flag NUMBER 1 0 No 0
3 Inserted_at DATEtime No sysdate
4 Inserted_by Varchar2 10 No 'Admin'
5 Tmx_Security_level NUMBER 1 0 No 1
Table 9 shows the reprocess table
Reprocess count is auto incremented by 1.
Col Id Column Name Data type Data Length Decimal Null Default Primary Key
1 ETL cycle number NUMBER 9 0 No 1
2 ETL step NUMBER 1 0 No
3 Source Schema Table name VARCHAR2 50 No
4 Target Schema Table name VARCHAR2 50 No
5 Reject Schema Table name VARCHAR2 50 No
6 Source records number NUMBER 9 0 No
7 Oracle Error number VARCHAR2 20 No
8 Oracle Exception VARCHAR2 20 No
9 Oracle Exception Description VARCHAR2 200 No 'N'
10 Seconds for this ETL step NUMBER 9 0 No
11 Inserted_at DATEtime No sysdate
12 Tmx Security level NUMBER 1 0 No 1
Table 10 showing the staging error log table
Three ways tallied reports can show three extra columns/rows that will show the user the figures from different angles. As the user drills down, this technique will show that there are no compensatory errors - 100% proof that the figures shown against each sub-group are tallied.
This reporting technique gives EDW users confidence in the dashboards and reports. Based on such reporting, management can come to conclusions and arrive at business decisions. Top management can frame business policies based on global dashboards.
As per this invention, techniques of data mining can be implemented across the summary fact tables to get faster reports. Approximation is done using the summary tables and results are re-validated against fact tables to arrive at accurate data mining results.
The cleansed data in the EDW can be exposed using web services. This can then be consumed using SOAP by other systems. Thus, some SOA features are also built into this invention's architecture. Systems developed in future or enhanced by the organization can send a question to the web service and get current details.
An EDW takes several years to mature. The several features of this invention are not interdependent. Some of them could be implemented first and the other features could be taken up as and when required. Slightly different styles may be adopted in different sites where this invention is implemented based on budget, comfort level of users, existing databases, technology, hardware and networking constraints and priorities of management. This flexibility is in-built in this invention.
The various features of this invention can be implemented in parts. Based on urgency, management priorities, maturity of data and other parameters, it would typically take three years for most of the features of this invention to be implemented. A few features may not be implemented at all. There is no compulsion that all features must be implemented at one go. Hence these features can be incorporated in the organization on a need basis.
This invention can also be ported to mainframes and super computers where very huge amount of data (far beyond the capacity of Oracle and other database management systems) might need to be collated, organized, analyzed and reported.
The EDW can contain Operational, Financial, Costing, Budgeting and Planning data. All data from Operations, Financials, CRM and other modules will reside together in the EDW. The conformed dimensional physical model allows for joint reports. Translation
codes for a corporate common geography, calendar and organizational structure facilitate comparison across regions, modules and time.
In accordance with the present invention, there is provided a method for providing a scalable enterprise data warehousing system, the method comprising the following steps as seen in FIGURE 3:
• capturing the conceptual requirements of an enterprise including the vision, mission, goal, strategy and targets of the enterprise, the functions of every unit of the enterprise in the form of modules along with business rules, inclusions, exclusions and exceptions to the rules for each module, the KPIs and report generation requirements based on the KPIs along with a list of users belonging to the unit and their access level details in a conceptual data model, 1000;
• capturing the business requirements of an enterprise including entity definitions, business rules along with corresponding attributes and attribute properties for the entities and the relationships and dependencies between entities, KPI templates for each of the entities including information consisting of the KPI associated with the module, the KPI formula, calculations, measures, rules, inclusions, exclusions, exceptions and its related KRI, KQI, KOI and artistic impressions of the dashboard, summary, detailed and proof reports and source units connection information in the form of a business data model, 1002;
• creating the logical schema based on the conceptual and business requirements, generating column definitions, identifying primary and unique keys and capturing them in a logical data model, 1004;
• creating the physical schema based on the business requirements and the logical schema, a self validation scheme to validate data in a data warehouse using a triangulation technique and views for generation of reports and capturing them in a physical data model, 1006;
• converting the conceptual data model, business data model, logical data model and physical data model into database tables to form an objects dictionary, 1008;
• fetching source unit connection information from the business data model and communicating with the source units, 1010;
• generating a set of target database tables including staging database table (STG), cleansing database table (CLN), dimensional database table (DIM), Facts database table (FACT), Summary database table (Fact2) and Balanced scorecard database table (BSC) for each of the source units based on the physical data model, 1012;
• copying fetched source data to the STG database table, 1014;
• validating the data in the STG database tables using the validation information in the physical data model and segregating erroneous data, 1016;
• copying 'incremental data' present in the STG database table to the cleansing database table (CLN), 1018;
• validating the data in the CLN database table to check for referential integrity as per the physical data model stored in the objects dictionary and further segregating the erroneous data, 1020;
• partitioning the data present in the CLN database table into master data or transaction data, 1022;
• copying master data present in the CLN database table to the DIM database table and transaction data present in the CLN database table to the Facts Database table respectively and further copying FACT and DIM data to the FACT2 summary database table and the KPI, KRI, KQ and KOI data from the Fact2 Summary database table to the BSC Balanced scorecard database table and still further copying the incremental data from the FACT, DIM, FACT2 and BSC database tables to a logical data mart based on the views defined on these tables in the physical data model, 1024;
• notifying a system administrator to correct the erroneous records, 1026;
• receiving and pushing back the corrected records to the STG database table, 1028;
• updating the STG database table with newly received source data using change data and trickle feed techniques, 1030;
• tracking the data in each of the databases using a triangulating tallying technique at the end of each extract transform load cycle, 1032;
• generating reports from views stored in the physical data model in the form of dashboards and balanced scorecards based on the access level details specified in the business data model, 1034.
TECHNICAL ADVANTAGES
The technical advancements of the present invention include providing an Enterprise Data Warehousing (EDW) system which is designed, developed, monitored and controlled by metadata. The metadata's object dictionary automates jobs like table creation, index creation, self validation, self correction, data entry screens, views and quick reports, thus minimizing human intervention for operation of the warehouse.
In accordance with the present invention, the metadata accepts the vision of the client for whom the system is to be customized, the mission, goal of the mission, targets volume for goal with a time table, strategies to meet targets, measures (KPIs) along with business rules, key risk areas, key risk indicators and key quality indicators along with assumptions, exceptions, inclusions and exclusions. Using these inputs, the data modeling unit present in the metadata generates the design of the system thus minimizing the time and effort involved in manual designing of the entire warehouse for a client. The design of the system is stored within the data modeling unit, thus eliminating the need for separate documentation/storage of the design.
The metadata automates the data entry into the warehouse by capturing the IP address of the source units with passwords for connecting to these units. Thus, the system
automatically connects to these units and extracts the data. The metadata also defines the tables and columns that will contain the source data, mappings for transformations and mapping into fact tables and primary, foreign key and unique key constraints to ensure referential integrity, hence the extracted data is automatically transformed and loaded into the appropriate tables in the data warehouse. This automatic mapping of source data into the warehouse with little supervision by system and data architects enables the present invention to store multinational organizational data which operate in different time zones, countries, languages and currencies.
Also, the metadata stores the views for generation of balanced scorecards, dashboards, and reports. Thus, uniform reports are generated quickly by the present invention based on the data collated from various source units. Thus, performance parameters of individual units of an organization can be reviewed and retrieved effectively using these reports.
As the various tables in the metadata are interlinked by foreign keys, an update to any one table is reflected across the entire system. Moreover, these changes to the metadata are recorded along with a timestamp, making the tracking of changes easier.
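A minimal sketch of such timestamped change tracking, assuming foreign-key-linked metadata tables and an audit table whose names are invented for this example, might look as follows.

```python
# Sketch: metadata tables linked by foreign keys, with every change logged
# alongside a timestamp. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE module_metadata (module_id INTEGER PRIMARY KEY, module_name TEXT);
CREATE TABLE kpi_metadata (
    kpi_id    INTEGER PRIMARY KEY,
    module_id INTEGER NOT NULL REFERENCES module_metadata(module_id),
    formula   TEXT
);
CREATE TABLE metadata_audit (table_name TEXT, record_id INTEGER, changed_at TEXT);

CREATE TRIGGER kpi_change AFTER UPDATE ON kpi_metadata
BEGIN
    INSERT INTO metadata_audit VALUES ('kpi_metadata', NEW.kpi_id, datetime('now'));
END;
""")
conn.execute("INSERT INTO module_metadata VALUES (1, 'Sales')")
conn.execute("INSERT INTO kpi_metadata VALUES (10, 1, 'SUM(amount)')")
conn.execute("UPDATE kpi_metadata SET formula = 'SUM(amount)/COUNT(*)' WHERE kpi_id = 10")
print(conn.execute("SELECT * FROM metadata_audit").fetchall())
```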
The present invention provides a Corporate Data Warehouse (CDW) to store summary data for head quarter reporting, a History Data Warehouse to store full data from multiple countries without any cutoff dates and Logical data marts and operational data stores to save on disk space and save two ETL cycles.
The metadata design allots extra columns in the fact and summary table definitions to enable affinity, pattern, seasonality and trend analysis, forecasting and data mining with simple reporting tools, making the proposed system flexible and interoperable with any of the commercially available reporting tools.
The system also ensures the security of data by providing the access level details in the metadata. As metadata drives the proposed system, the access level details of every user are checked and, based on the same, the data is presented to that user.
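The sketch below illustrates, under assumed names and a simplified two-level scheme (GLOBAL vs. COUNTRY), how a user's entry in the security metadata could gate the data returned to that user.

```python
# Sketch: the security metadata decides how much data a user may see.
# User names, access levels and the filtering rule are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE security_model (user_name TEXT, access_level TEXT, country TEXT);
INSERT INTO security_model VALUES ('ceo', 'GLOBAL', NULL), ('in_manager', 'COUNTRY', 'IN');
CREATE TABLE fact_sales (country TEXT, amount REAL);
INSERT INTO fact_sales VALUES ('IN', 100.0), ('UK', 75.0);
""")

def sales_for(user):
    level, country = conn.execute(
        "SELECT access_level, country FROM security_model WHERE user_name = ?", (user,)
    ).fetchone()
    if level == "GLOBAL":
        return conn.execute("SELECT country, amount FROM fact_sales").fetchall()
    return conn.execute(
        "SELECT country, amount FROM fact_sales WHERE country = ?", (country,)
    ).fetchall()

print("ceo        ->", sales_for("ceo"))
print("in_manager ->", sales_for("in_manager"))
```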
Additionally, the system uses data compression techniques to compress the data to save disk space and transmission cost involved in sending the data to a disaster recovery site.
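As a minimal sketch of this idea, the snippet below compresses a small extract before it would be shipped to the disaster recovery site. Standard-library gzip is used here only as a stand-in for whichever compression (for example, columnar compression) a given deployment applies.

```python
# Sketch: compressing an extract before shipping it to the disaster
# recovery site. gzip is only a stand-in for the compression actually used.
import csv
import gzip
import io

rows = [("1", "IN", "100.00"), ("2", "UK", "75.00")]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
raw = buffer.getvalue().encode("utf-8")
compressed = gzip.compress(raw)

# For such a tiny sample the compressed payload may not shrink; the saving
# shows up on realistic extract volumes.
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```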
Further, the system includes load balancers at both the data center and the disaster recovery site, which send a report generation request to the less busy of the two servers. In case of a breakdown of the data center, the disaster recovery site takes over. This ensures that reports can be generated 24x7 from the warehouse.
The system ensures that the warehouse is up to date by performing near real time system refresh. The refresh is done multiple times a day, with change data capture techniques used to update fresh data in the morning and a trickle feed to handle transactional data during the day.
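The sketch below illustrates one way such a refresh could be arranged: a change-data-capture pass that copies only rows newer than a stored watermark, plus a trickle-feed append for records arriving during the day. The watermark table, timestamps and table names are assumptions made for the sketch.

```python
# Sketch: change data capture for the morning refresh plus a trickle feed
# during the day. The watermark table and all names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id INTEGER, amount REAL, updated_at TEXT);
CREATE TABLE stg_orders (order_id INTEGER, amount REAL, updated_at TEXT);
CREATE TABLE load_watermark (table_name TEXT PRIMARY KEY, last_loaded_at TEXT);
INSERT INTO load_watermark VALUES ('stg_orders', '2009-01-01T00:00:00');
INSERT INTO src_orders VALUES (1, 100.0, '2009-01-02T08:00:00');
""")

def cdc_refresh():
    """Morning pass: copy only rows changed since the last load."""
    (watermark,) = conn.execute(
        "SELECT last_loaded_at FROM load_watermark WHERE table_name = 'stg_orders'"
    ).fetchone()
    changed = conn.execute(
        "SELECT * FROM src_orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", changed)
    conn.execute("UPDATE load_watermark SET last_loaded_at = "
                 "(SELECT MAX(updated_at) FROM stg_orders) "
                 "WHERE table_name = 'stg_orders'")

def trickle_feed(row):
    """Daytime pass: push a single transactional record as it arrives."""
    conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?)", row)

cdc_refresh()
trickle_feed((2, 250.0, "2009-01-02T11:30:00"))
print(conn.execute("SELECT * FROM stg_orders").fetchall())
```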
The system uses ANSI standard coding, thus making the system easily portable to any of the open source platforms.
The system ensures the accuracy of the data being loaded into the warehouse by performing self validation and tally of data by using a triangulation technique. This gives users the confidence that the data reported on the balanced scorecards and the dashboard is accurate.
The system also provides web services on top of cleansed data to implement SOA (Service Oriented Architecture), enabling users to access the data/reports from the EDW anywhere and round the clock.
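One minimal way to expose cleansed data as a web service is sketched below using only the Python standard library; the endpoint, port and query are illustrative, and a production SOA deployment would of course add authentication tied to the security metadata.

```python
# Sketch: exposing cleansed warehouse data as a simple JSON web service.
# The table, endpoint and port are illustrative assumptions.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

warehouse = sqlite3.connect(":memory:", check_same_thread=False)
warehouse.execute("CREATE TABLE cln_sales (country TEXT, amount REAL)")
warehouse.execute("INSERT INTO cln_sales VALUES ('IN', 100.0)")

class CleanDataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rows = warehouse.execute("SELECT country, amount FROM cln_sales").fetchall()
        body = json.dumps([{"country": c, "amount": a} for c, a in rows]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Blocks and serves GET requests on http://127.0.0.1:8080/
    HTTPServer(("127.0.0.1", 8080), CleanDataHandler).serve_forever()
```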
Furthermore, the EDW system is flexible; it can be adapted to various technologies and modified to meet the Teradata design standards. The reports generated by the system are likewise flexible, and the views can be modified to generate new dashboards, balanced scorecards and other detail and summary reports.
Also, the system enables new business rules and validations to be inserted within the metadata unit and to be reflected immediately in the subsequent ETL cycles.
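As an illustrative sketch of this behaviour, the snippet below stores validation rules as rows in a metadata table (here, simply as SQL predicates, which is an assumption made for the sketch) and re-reads them on every run, so a rule inserted now is applied automatically in the next cycle.

```python
# Sketch: validation rules held in a metadata table and re-read on every
# cycle, so newly inserted rules take effect in the next run. Storing the
# rules as SQL predicates is an illustrative assumption.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE validation_rules (rule_name TEXT, target_table TEXT, predicate TEXT);
INSERT INTO validation_rules VALUES ('non_negative_amount', 'stg_sales', 'amount >= 0');
CREATE TABLE stg_sales (id INTEGER, amount REAL);
INSERT INTO stg_sales VALUES (1, 100.0), (2, -5.0);
""")

def run_validations():
    for rule, table, predicate in conn.execute(
            "SELECT rule_name, target_table, predicate FROM validation_rules"):
        bad = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE NOT ({predicate})").fetchone()[0]
        print(f"{rule}: {bad} violating record(s) in {table}")

run_validations()
# A rule inserted into the metadata now is applied in the next cycle.
conn.execute("INSERT INTO validation_rules VALUES "
             "('id_present', 'stg_sales', 'id IS NOT NULL')")
run_validations()
```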
Thus, the present invention provides a scalable enterprise data warehousing system which offers enterprises substantial cost savings along with faster access to comprehensive, consolidated and tallied data based on the user privileges. Also, the system minimizes design and development time by incorporating an efficient data modeling unit and thereafter efficient automated facilities for ensuring continuous and smooth operation of the warehouse with minimal human intervention.
While considerable emphasis has been placed herein on the components and component parts of the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other changes in the preferred embodiment as well as other embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.
WE CLAIM:
1. A scalable Enterprise Data Warehousing system comprising:
• a database server for storing an enterprise's transformed data, said database server comprising at least one logical data mart for storing summarized and aggregated data views to facilitate generation of various reports;
• a metadata unit having:
o an object dictionary comprising:
■ conceptual requirement capturing means adapted to provide a graphical user interface to capture the vision, mission, goal, strategy and targets of an enterprise and further adapted to capture the functions of every unit of an enterprise in the form of modules along with business rules, inclusions, exclusions and exceptions to the rules for each module and still further adapted to capture the KPIs and report generation requirements based on the KPI;
■ business requirement capturing means adapted to provide a tabular form to capture entity definitions and rules for each of said modules along with corresponding attributes and attribute properties for the entities and the relationships and dependency between entities and further adapted to capture KPI templates for each of the entities including information selected from the group of information consisting of KPI associated with said module, the KPI formula, calculations, measures, rules, inclusions, exclusions, exceptions and its related KRI (Key Risk Indicator), KQI (Key Quality Indicator), KOI (Key Opportunity Indicator) and artistic impressions of the dashboard, summary, detailed and proof reports and still further adapted to capture source units connection information;
■ logical schema creation means adapted to create a customized database design having database table column definitions, normalization scheme for the database tables, column creations, summaries and aggregation table creation, column attribute definition creation, identification of primary, unique and foreign keys, creating constraints and rules for performing validation of data;
■ physical schema creation means adapted to create database tables for the data warehouse based on a predetermined modeling technique and further adapted to create a self validation scheme to validate data in said data warehouse using a triangulation technique and still further adapted to create views for generation of reports;
■ security metadata creation means adapted to receive a list of users belonging to each of the source units and their access level details;
o a first tabular database means co-operating with said conceptual requirement capturing means, business requirement capturing means, logical schema creation means, physical schema creation means and security metadata creation means adapted to store the captured and created information into said object dictionary's conceptual model database table, business model database table, logical model database table, physical model database table and security model database table respectively, wherein the database tables are inter-linked using foreign keys, thereby facilitating update to the entire system if any of said database tables are updated;
• an Extract Transform and Load (ETL) unit co-operating with said metadata unit including:
o an ETL engine having:
■ fetching means adapted to fetch source unit connection information from said business model database table and further adapted to communicate with the source units;
■ Schema generation means adapted to read the objects dictionary and further adapted to generate a set of target database tables for each of the source units including staging database table (STG), cleansing database table (CLN), dimensional database table (DIM), Facts database table (FACT), Summary database table (Fact2) and Balanced scorecard database table (BSC) in said database server;
■ First copy generation means co-operating with said fetching means adapted to copy source data to said STG database table;
■ First validation means adapted to read and validate the data in the STG database tables using the validation information in the physical model database table and further adapted to segregate erroneous data;
■ Second copy generation means co-operating with said STG database table adapted to copy 'incremental data' present in the STG database table to said cleansing database table (CLN);
■ second validation means adapted to read and validate the data in the CLN database table to check for referential integrity as per the foreign key design stored in the physical model database table and further adapted to segregate erroneous data;
■ segregation means adapted to segregate the data present in the CLN database table into master data or transaction data;
■ Third copy generation means co-operating with said CLN database table adapted to copy master data present in the CLN database table to said DIM database table and transaction data present in said CLN database table to said Facts Database table respectively in said database server;
■ Fourth copy generation means adapted to copy incremental data in DIM database table and Facts Database table to said Fact2 Summary database table and further adapted to copy the KPI, KRI, KQ and KOI data from the Fact2 Summary database table to the BSC Balanced scorecard database table and still further adapted to copy the incremental data from the FACT, DIM, FACT2 and BSC database tables to said logical data mart based on the views defined on these tables in the physical model database table;
■ Error handling means adapted to receive said erroneous data and further adapted to notify a system administrator to correct the records and further adapted to receive and use the corrected records in the next copy generation means; and
■ updation means adapted to update the STG database table with newly received source data using change data and trickle feed techniques;
• tracking means adapted to keep track of the data in each of said database tables using a triangulating tallying technique at the end of each ETL cycle; and
• a reporting unit cooperating with said logical data mart and said physical model database table to generate tracked reports from views stored in the physical model database table in the form of dashboards and balanced scorecards based on the user access level details specified in said security model database table.
2. The system as claimed in claim 1, wherein said database server includes a second tabular database means adapted to store the transformed information in the form of database tables, wherein the database tables are divided into fact tables and dimension tables.
3. The system as claimed in claim 1, wherein said database server includes a second tabular database means adapted to store the transformed information in the form of database tables, wherein the database holds normalized information.
4. The system as claimed in claim 1, wherein said database server includes means for purging of data present in the tables including DIM database table, FACT database table, FACT2 summary database tables, BSC database table and Corporate Data Warehouse tables as per the data retention policy.
5. The system as claimed in claim 1, wherein said database server includes means for indexing of tables including STG database tables, CLN database tables, DIM database table, FACT database table, FACT2 summary database tables and BSC database table.
6. The system as claimed in claim 1, wherein said database server includes a Master Data Management unit adapted to write back cleansed data in said DIM database tables to the source units.
7. The system as claimed in claim 1, wherein said objects dictionary includes object dictionary updation means adapted to update objects in said object dictionary and further adapted to log the time and date of said update.
8. The system as claimed in claim 1, wherein the third copy generation means includes surrogate key generation means adapted to assign a surrogate key to a record in the dimension database table if a fact record is updated to the FACT database table without a corresponding dimension record.
9. The system as claimed in claim 1, wherein said system includes a data modeling engine adapted to automate the extract, map, transform and load operations, including ETL SQL code generation means for automatic generation of data warehouse tables and for extraction, cleansing, validation, mapping and loading of data into said tables based on the information in said object dictionary's database tables.
10. The system as claimed in claim 9, wherein said data modeling engine includes record insertion means adapted to insert extra columns into said metadata dictionary tables and further adapted to insert additional shadow columns to said data warehouse tables to handle simulations, what-if analysis, forecast, data mining, weights for old records in HDW, flags, status and audit trails.
11. The system as claimed in claim 1, wherein said system includes a forward engineering engine having code creation means adapted to automatically create objects from said object dictionary and code updation means adapted to automatically update objects from said object dictionary.
12. The system as claimed in claim 1, wherein said system includes a data governance unit adapted to track changes made to the metadata and perform audits and data certifications.
13. The system as claimed in claim 1, wherein said system is hosted at a data center.
14. The system as claimed in claim 1, wherein said system includes a disaster recovery site.
15. The system as claimed in claim 1, wherein said system includes columnar compression means adapted to compress the data stored in said database server.
16. The system as claimed in claim 1, wherein said system includes load balancers at the data center and the disaster recovery site.
17. The system as claimed in claim 1, wherein said system includes a Corporate Data Warehouse to store data available in tables including DIM database tables, FACT2 summary database table and BSC database table.
18. The system as claimed in claim 1, wherein said system includes a History Data Warehouse to store full data present in tables including DIM database table and FACT database table.
19. The system as claimed in claim 1, wherein said system includes a communication interface adapted to send the erroneous data to the system administrator / supervisor for correction in the XML format.
20. A method for providing a scalable enterprise data warehousing system, said method comprising the following steps:
• capturing the conceptual requirements of an enterprise including the vision, mission, goal, strategy and targets of an enterprise, the functions of every department of an enterprise in the form of modules along with business rules, inclusions, exclusions and exceptions to the rules for each module, the KPIs and report generation requirements based on the KPI along with a list of users belonging to the unit and their user access level details in a conceptual data model;
• capturing the business requirements of an enterprise including entity definitions, rules along with corresponding attributes and attribute properties for the entities and the relationships and dependency between entities, KPI templates for each of the entities including information consisting of KPI associated with said module, the KPI formula, calculations, measures, rules, inclusions, exclusions, exceptions and its related KRI, KQI, KOI and artistic impressions of the dashboard, summary, detailed and proof reports and source units connection information in the form of a business model;
• creating the logical schema based on the conceptual, business requirements and any industry specific generic logical data model, generating column definitions, identifying primary and unique keys and capturing them in a logical data model;
• creating the physical schema based on the business requirements and the logical schema, a self validation scheme to validate data in a data warehouse using a triangulation technique and views for generation of reports and capturing them in a physical data model;
• converting the conceptual data model, business data model, logical data model (LDM) and physical data model (PDM) into database tables to form an objects dictionary;
• fetching source unit connection information from said business data model and communicating with the source units;
• generating a set of target database tables including staging database table (STG), cleansing database table (CLN), dimensional database table (DIM), Facts database table (FACT), Summary database table (Fact2) and Balanced Score Card database table (BSC) for each of the source units based on the physical data model (PDM);
• copying fetched source data to said STG database table;
• validating the data in the STG database tables using the validation information in the physical data model and segregating erroneous data;
• copying 'incremental data' present in the STG database table to the cleansing database table (CLN);
• validating the data in the CLN database table to check for referential integrity as per the physical data model stored in the objects dictionary and further segregating the erroneous data;
• partitioning the data present in the CLN database table into master data or transaction data;
• copying master data present in the CLN database table to the DIM database table and transaction data present in the CLN database table to the Facts Database table respectively and further copying FACT and DIM data to the FACT2 summary database table and the KPI, KRI, KQ and KOI data from the Fact2 Summary database table to the BSC Balanced scorecard database table and still further copying the incremental data from the FACT, DIM, FACT2 and BSC database tables to a logical data mart based on the views defined on these tables in the physical data model;
• notifying a system administrator to correct the erroneous records;
• receiving and using the corrected records in the next copy cycle;
• incrementally updating the STG database table with newly received source data using change data and trickle feed techniques;
• tracking the data in each of the databases using a triangulating tallying technique at the end of each extract transform load (ETL) cycle; and
• generating reports from views stored in the physical data model in the form of dashboards and balanced scorecards based on the user access level details specified in said business data model.
21. The method as claimed in claim 20, wherein the step of incrementally updating staging database table includes the steps of checking the timestamp of the source table and applying filters to ensure that very old data is not copied into the staging database table.
22. The method as claimed in claim 20, wherein the step of validating the data in the STG database tables using the validation information in the physical data model includes the steps of checking whether all records were copied from the source database tables, tallying figures including amount, quantity and KPI numbers for each batch of the records from the source and target tables, and checking whether the tallying leads to a non-zero result and, if so, raising an alert to notify a discrepancy in the records.