FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR DATA HARMONIZATION USING INCREMENTAL ETL-MR
Applicant
TATA CONSULTANCY SERVICES LIMITED
A company Incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
FIELD OF THE INVENTION:
The present invention relates generally to data management, and more particularly to a method and system for achieving data harmonization by correlating and fusing incongruous data from a plurality of disparate data sources, the data harmonization being achieved by incremental real time data harmonization and batch mode data harmonization.
BACKGROUND OF THE INVENTION:
With increasing volumes of "big data" becoming available from external sources, the processing and correlation of disparate data sources in real time has emerged as a major challenge to the traditional business-intelligence (BI) technology stack. According to R. Kimball and J. Caserta in a publication titled "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data," Wiley, 1st edition, 2004, it is observed that BI is becoming dominated by ETL (Extraction Transformation Load) processes, which consume about 70% of the resources needed for implementation and maintenance of a typical data warehouse.
Further, one of the limitations of traditional BI technologies is the inability to process unstructured data, even if it has been converted into structured objects using available technologies. Unstructured data pertains to sources such as call center scripts, social media updates, news, or even manually entered surveys in an enterprise environment.
Traditional BI technologies are based on relational databases, and are incapable of processing the ever-growing volumes of data arriving faster than ever from new data sources such as social media, sensors or mobile devices. Further, businesses are evincing interest in technologies that address a commonly arising scenario called data harmonization between data obtained from disparate systems. The incoming data has dimensions with different schemes of representation, and there is a need in the art to correlate such data and process it in real time for business intelligence.
Another disadvantage associated with the traditional BI technology stack is its inability to store data in its raw form because of database size constraints, so that it aggregates the data on pre-designed dimensions. As a result, fine-grained data patterns are lost, obstructing the possibility of predictive analytics based on such fine-grained patterns. Traditional ETL tools can only process relational data and cannot process data that has been converted from unstructured or semi-structured data sources. One of the prime reasons is that different subsets of information may be available for the same product from different merchants or other sources. Different sources may describe attributes or schemas differently, thereby leading to several product listings for the same product. This poses a formidable challenge in providing meaningful and succinct information to the user of the product catalog.
Considerable work has been done by researchers in the area of processing and correlating disparate data sources in real time, but much still needs to be done. Various studies undertaken by researchers are as follows:
ETL on Map-Reduce
ETL on map-reduce has been reported by Liu et al. in a study titled "Highly Scalable Dimensional ETL Framework Based on Map-Reduce," published in the Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery. This work appears to be a preliminary one and does not cover all the transformations involved in Type-I ETL. Further, Liu et al. divide the ETL jobs into two phases, with dimensional transformations in the first phase and measure transformations in the second. However, in most real-world enterprise Type-I ETL applications, transformation of dimensions is directly related to the transformation of measures and should therefore be carried out together.
Another work, by Thomsen et al. in a study titled "Easy and Effective Parallel Programmable ETL," describes how to run Type-I ETL jobs in parallel using a framework that obviates the need to model each stage of the process. While such an approach may work well for simple jobs, for complex, enterprise-class ETL jobs it becomes necessary to model each ETL application explicitly using formally defined transformations.
Further, pivot and un-pivot operations were formally disclosed by Wyss et al. However, these have been disclosed only in the context of relational databases. Further, Wyss et al. implement pivot/unpivot as a four-step process which is computationally too expensive for big data.
Incremental ETL
In a paper by Jorg et al. titled "Towards Generating ETL Processes for Incremental Loading," incremental ETL has been attempted, but only for reference data, by keeping a log of all changes performed and then updating those in the warehouse. However, most real-time or near real-time ETL solutions insert the new data into the warehouse as it arrives (Vassiliadis, in a paper titled "Near Real Time ETL: New Trends in Data Warehousing and Delta Warehouses") and do not deal with a situation where new data needs to be re-joined with already transformed data.
Therefore, in light of the above problems, it would be desirable to have:
A method and system for achieving data harmonization to correlate and fuse incongruous data from a plurality of disparate data sources in real time.
A platform that can hold un-aggregated data so that finer patterns in data are not lost.
A method and system for processing unstructured data, e.g. call-center scripts, social media updates, news, or even manually entered surveys.
A provision for approximately correlating keys between different data sources.
OBJECTS OF THE INVENTION:
The primary object of the invention is to provide a method and system for achieving data harmonization between data obtained from disparate sources, data having dimensions with different schemes of representation.
An object of the invention is to provide real time data harmonization using Incremental ETL Map-Reduce technique.
Another object of the invention is to provide batch mode data harmonization using ETL Map-Reduce technique.
Yet another object of the invention is to perform the pivot/unpivot operations using map-reduce technique.
Another object of the invention is to provide a method and system for processing unstructured data, e.g., call-center scripts, social media updates, news, or even manually entered surveys.
A further object of the invention is to enable execution of atomic ETL (Type-I) operators using Map-Reduce technology.
SUMMARY OF THE INVENTION:
In one aspect, the invention disclosed herein is a computer implemented method for dynamically harmonizing a plurality of incongruent data sets with a harmonized data set in real-time, wherein the method comprises the following steps: receiving, at one or more independent processors, the plurality of incongruent data sets, wherein each of said independent processors is configured to process at least one received data set; next, annotating the schema of the received data set by obtaining schema details and renaming the annotated details; transforming one or more dimensions or measures or a combination thereof of the annotated dataset into a target scheme of representation by utilizing a soft key correlation technique, and including the entire set of dimensional and measure values of the annotated data set and the target scheme into the transformed data set; performing a sub grouping for n dimensions of a previously harmonized data set into k partially overlapping dimensions by the processor; scanning each of the k groups of the previously harmonized data set for a range of keys sharing key values with corresponding values of the transformed data set, thereby retrieving a matching data subset therefrom; and performing a full outer join operation upon the transformed data set and the matching subset of the previously harmonized data set, wherein existing values of said previously harmonized data are updated with the latest values of the recently transformed data set.
In a further aspect of the invention, the schema is renamed and altered based on pivot operations or unpivot operations using map-reduce paradigm.
In another aspect of the invention, a computer implemented method of harmonizing incongruent data obtained from a plurality of disparate data sources is provided, wherein the method comprises the steps of:
storing a plurality of incoming datasets in a staging area until entire datasets for a
specified time period have arrived;
annotating schema of the stored datasets by capturing variations in the schema and
renaming said schema in accordance with a model specified in an annotation file;
performing multiple transformation of the annotated datasets to obtain a common
scheme of representation by:
altering the renamed schema to obtain a congruent data structure having same measure values by performing multiple pivot or unpivot operations; mapping dimensional and subsequently measure values of the data structure to a target data structure using a mapping file, said mapping file being configured to match specific dimensional values of columns of tables integral to the annotated and the target data structures through computation of a similarity score based on a statistical model derived from a set of features to further classify the values as matching or non matching; and
joining the matching values of the transformed data to achieve the harmonized data
set.
In yet another aspect of the invention, every input data file is first parsed on arrival; if there are radical changes in the schema as against the annotation file, the parser fails.
Another significant aspect of the present invention provides a data harmonization system to dynamically harmonize a plurality of incongruent data sets with a harmonized data set in real-time, comprising:
one or more independent processors configured to receive the plurality of incongruent data sets, each of said independent processors configured to process at least one received data set; each processor further configured to cause: an annotation module to annotate schema of the received data set by obtaining schema details and renaming thereof;
a transformation grouping module to transform one or more dimensions or measures or a combination thereof of the annotated dataset obtained from the annotation module, into a target scheme of representation by utilizing a soft key correlation technique; and thereon perform a sub grouping for n dimensions of a previously harmonized data set into k partially overlapping dimensions; a scanning module to scan each of the k groups of the previously harmonized data set for a range of keys sharing key values with corresponding values of the transformed data set to retrieve a matching data subset therefrom; and a compiler module in communication with the transformation grouping module and the scanning module, to perform a full outer join operation upon the transformed data set and the matching subset of the previously harmonized data set, wherein existing values of said previously harmonized data are updated with the latest values of the recently transformed data set.
It is to be understood that both the foregoing summary and the following detailed description of the present embodiments of the invention are intended to provide an overview or framework for understanding the nature and character of the invention as it is claimed. The accompanying drawings are included to provide a further understanding of the invention and are incorporated into and constitute a part of this specification. The drawings illustrate various embodiments of the invention and together with the description serve to explain the principles and operation of the invention.
BRIEF DESCRIPTION OF DRAWINGS:
The above-mentioned and other features and advantages of the various embodiments of the invention, and the manner of attaining them, will become more apparent and will be better understood by reference to the accompanying drawings, wherein:
FIG. 1 illustrates Batch Mode Data Harmonization to correlate and fuse incongruous data coming from a plurality of disparate sources according to an embodiment of the invention;
FIG. 2 is a block diagram illustrating a data harmonization system according to an embodiment of the disclosure;
FIG. 3 is a high level diagram illustrating Incremental ETL-MR for data harmonization according to yet another embodiment of the invention;
FIG. 4 illustrates group Range Index in Incremental ETL-MR for data harmonization in accordance with an embodiment of the present invention;
FIG. 5 illustrates keyword matching overview in accordance with the principles of the invention;
FIG. 6 illustrates sample schema details of a working embodiment of the present invention;
FIG. 7 illustrates comparative performance of traditional ETL vs ATJ scheme of batch mode and ATSJI scheme of incremental data harmonization mode; and
FIG. 8 illustrates comparative performance of Batch (ATJ) vs incremental data harmonization technique (ATSJI) in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION:
Disclosed herein is a computer implemented method and system for harmonizing data obtained from a plurality of disparate data sources in real time. In an aspect of
the invention, a computer implemented method of harmonizing incongruent data obtained from a plurality of disparate data sources is disclosed. Data is stored in a staging area for a specified period before being processed.
The disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms.
Reference will now be made in detail to the exemplary embodiment(s) of the invention as illustrated in the accompanying drawings. Wherever required same reference numerals will be used to refer to same or like parts.
The words "comprising", "having", "containing", and "including", and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
It must also be noted that as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Although any methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, the preferred methods are now described. In the following description, for the purpose of explanation and understanding, reference has been made to numerous embodiments, the intent of which is not to limit the scope of the invention.
Definitions:
Incongruent data: refers to inconsistent data with respect to format and structure collected from various data sources.
Disparate data sources: refers to distributed data sources such as social media servers, sensors or mobile devices.
FIG. 1 illustrates the Batch mode data harmonization according to an embodiment of the invention. Batch mode data harmonization is a technique that harmonizes incongruent incoming data sets arriving at different times and obtained from a plurality of disparate data sources.
Firstly, a plurality of incoming data sets (101) is stored in a staging area (102) until entire datasets for a specified time period have arrived. Indeed, in most enterprise operations, inputs from different regions or divisions do not arrive at the same time even for the same time period. Therefore the data needs to be stored in a staging area, and harmonization is performed after all the data for a specified time period has arrived. According to an embodiment of the invention, such a staging area is a database stored on a computer readable medium.
Then, the schema of the stored datasets in the staging area (102) is annotated (103) by capturing variations in the schema of the incoming data sets, followed by renaming said schema in accordance with a model specified in an annotation file.
In an embodiment of the invention every incoming data set is first parsed on arrival (not shown in figure). The parser fails if there are radical changes in the schema as against the annotation file.
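For illustration only, the following is a minimal Python sketch of such an arrival-time parse against the annotation file. The annotation-file layout, the function name parse_incoming and the overlap threshold min_overlap are hypothetical assumptions and not part of the disclosed implementation.

import csv
import json

def parse_incoming(csv_path, annotation_path, min_overlap=0.8):
    """Parse an incoming data file on arrival; fail if its schema deviates
    radically from the model specified in the annotation file."""
    with open(annotation_path) as f:
        annotation = json.load(f)    # assumed layout: {"columns": {"old name": "new name", ...}}
    expected = set(annotation["columns"])

    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = [h.strip() for h in next(reader)]

        # The parser fails on radical schema changes as against the annotation file.
        overlap = len(set(header) & expected) / max(len(expected), 1)
        if overlap < min_overlap:
            raise ValueError("schema deviates radically from annotation file: %s"
                             % (set(header) ^ expected))

        # Rename the annotated schema in accordance with the annotation model.
        renamed = [annotation["columns"].get(h, h) for h in header]
        rows = [dict(zip(renamed, row)) for row in reader]
    return renamed, rows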
Multiple transformations of the annotated data are then performed (104) to obtain a common scheme of representation as shown in FIG. 1. The transformation comprises four stages of (1) Cleansing and Validation (2) Schema alteration (3) Dimension Transformation and (4) Feature Generation.
The transformation starts with Cleansing and Validation, which eliminates extra characters from the input data. Further, the schema is altered to obtain
a partially congruent data structure having same measure values by performing multiple pivot operations or unpivot operations.
In the next step of dimensional transformation (104), the dimensional and measure values of the data structure are mapped to a target data structure using a mapping file, said mapping file being configured to match specific measure values of columns of tables integral to the annotated and the target data structures and to compute a similarity score based on a statistical model derived from a set of features to further classify the values as matching or non-matching. The classification of the values as matching or non-matching is achieved by a Support Vector Machine classifier that uses a Gaussian kernel. The features are selected by employing overlap coefficient, Jaccard similarity, cosine similarity, soft TFIDF, Monge Elkan, Jaro Winkler, Soundex, and other such resolution techniques.
Further, the matching values of the transformed data are joined to achieve the harmonized data set. The mapping file is implemented by a soft key correlation technique to obtain a common scheme of representation. Also, an embodiment of the invention discloses different types of dimensional transformation, discussed herein below:
DH (used hereinafter for data harmonization): One-to-One: In this case, the number of tuples remains the same and no corresponding measures transformation is required.
DH: Many-to-One: In this case, the number of tuples gets reduced and therefore the corresponding measures need to be aggregated. For example, when transforming 'retail store ID' to 'city-name', aggregation of sales data from all the stores of a city is required.
DH: One-to-Many: In this case, the number of tuples increases and corresponding measures need to be apportioned across the new tuples. Instead of an aggregation function as in many-to-one, an apportioning function is required: for example, when transforming product-categories into product names, in the case of survey data indicating how well a product is placed in a store, an apportioning function 'copy' might be used, indicating that a 'placement' value given by a surveyor to a category is also assigned to every product in that category.
Measure Only: Here, a measure gets transformed into a common unit of measurement.
Derived Measures: In this case, newer columns are generated based on measure columns, which are further used in the pattern mining stage, e.g., the difference between revenue in the current week and last year's revenue for the same week, or the week-on-week increase/decrease in revenue, etc.
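By way of a non-limiting illustration, the Many-to-One (aggregation), One-to-Many (apportioning) and Derived Measure transformations described above may be sketched in Python as follows; the function and parameter names are illustrative assumptions only, not the disclosed implementation.

from collections import defaultdict

def many_to_one(rows, dim_map, dim, measure):
    """Many-to-One: map a fine dimension (e.g. store ID) to a coarser one
    (e.g. city name) and aggregate the corresponding measure (e.g. sum of sales)."""
    totals = defaultdict(float)
    for r in rows:
        totals[dim_map[r[dim]]] += r[measure]
    return [{dim: k, measure: v} for k, v in totals.items()]

def one_to_many(rows, expand_map, dim, measure, apportion="copy"):
    """One-to-Many: expand a coarse dimension (e.g. product category) into many values
    (e.g. product names), apportioning the measure by 'copy' or 'equal-division'."""
    out = []
    for r in rows:
        targets = expand_map[r[dim]]
        for t in targets:
            value = r[measure] if apportion == "copy" else r[measure] / len(targets)
            out.append({dim: t, measure: value})
    return out

def week_on_week(curr_revenue, prev_revenue):
    """Derived Measure: e.g. the week-on-week change in revenue."""
    return curr_revenue - prev_revenue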
FIG. 2 is a schematic block diagram of an exemplary embodiment of a data harmonization system 100, comprising:
one or more independent processors 102(i), 102(ii), 102(iii), ..., 102(n) to receive the plurality of incongruent data sets from multiple data sources, each of said independent processors 102(i, ii, iii, ..., n) further comprising:
an annotation module 104 to annotate schema of the received data; a transformation grouping module 106 to transform the annotated dataset obtained from the annotation module into a target scheme of representation; a scanning module 108 to scan previously harmonized data set; and a compiler module 110 in communication with the transformation grouping module and the scanning module to perform a full outer join operation upon the transformed data set.
Re-referring to FIG. 2, and also to FIG. 3, a high level diagram of incremental Real-Time data harmonization according to another embodiment of the invention is
provided. The incremental Real-Time data harmonization is a technique that harmonizes data obtained from a plurality of disparate data sources in real time. This technique enables incremental incorporation of data sets arriving at disparate times. The numbers alongside the arrows in the figure indicate data flow sequence.
Firstly, a plurality of data sets flowing through a data framework is defined, wherein only one data set at a time flows through each of the one or more independent processors 102(i, ii, iii, ..., n) constituting the data framework.
The schema of a first data set selected from the plurality of data sets is annotated within the annotation module 104 by obtaining the schema details and renaming thereof (Step 1 of FIG. 3). Next, the transformation of the annotated data takes place at a dimensional level within the transformation grouping module 106. The invention discloses different types of dimensional transformation, which are as follows:
DH (data harmonization): One-to-One: Wherein the number of tuples remains the same and no corresponding measures transformation is required.
DH: Many-to-One: Wherein the number of tuples gets reduced and therefore the corresponding measures need to be aggregated. For example, when transforming 'retail store ID' to 'city-name', aggregation of sales data from all the stores of a city is required.
DH: One-to-Many: Here, the number of tuples increases and corresponding measures need to be apportioned across the new tuples. Instead of an aggregation function as in many-to-one, an apportioning function is required: for example, when transforming product-categories into product names, in the case of survey data indicating how well a product is placed in a store, an apportioning function 'copy' might be used, indicating that a 'placement' value given by a surveyor to a category is also assigned to every product in that category.
One or more measures of a first dataset are transformed into a target scheme of representation by including the entire set of measure values of the target data scheme into the transformed data set and assigning a value of NULL to any measure missing in the first data set but existing in the target data scheme, using soft key correlation (Step 2 of FIG. 3).
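A minimal sketch of this alignment step is given below, assuming each record is represented as a Python dictionary; the helper name align_to_target is hypothetical.

def align_to_target(record, target_columns):
    """Include the entire set of target-scheme dimension and measure values in the
    transformed record; a value missing in the source is assigned NULL (None)."""
    return {col: record.get(col) for col in target_columns}

# e.g. align_to_target({"SKU": "A1", "UnitsSold": 7}, ["SKU", "Store#", "UnitsSold"])
# -> {"SKU": "A1", "Store#": None, "UnitsSold": 7}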
The transformed data is stored to flow through the scanning module 108, wherein chunks of data get processed as they arrive, rather than waiting until all the data is available.
In the scanning module 108, the previously harmonized data set is scanned for records that may need to be joined with the freshly transformed data set. In case the data warehouse contains no previously harmonized data set, the compiler module 110 of the system 100 performs the join operation directly (as shown in step (4)) and the newly transformed data set gets stored in the HBase data store.
On the contrary, in a scenario wherein the HBase data store contains previously harmonized data sets, scanning is performed to join the newly transformed data set with a corresponding relevant subset of the data stored in the data store (shown by steps (8) and (9)). However, prior to this, the previously harmonized data is grouped according to "k" out of its "n" dimensions, wherein the k dimensions are chosen such that it is highly likely that each incoming data set contains relatively few chunks if grouped in this manner, as shown in step (3).
Re-referring to FIG. 3 (step 6), and also to FIG. 4, a Group Range Index Table is defined that stores the maximum and minimum values of the primary key in the harmonized data set, for each combination of the k grouping keys. Further, as shown in T2 of FIG. 4, the data is grouped on the first two dimensions only. In the scan stage, map tasks look up the Group Range Index Table in parallel for each group actually present in the freshly transformed incoming data set, as shown in steps (5) and (6) of FIG. 3. This is particularly to find the range of keys in the previously harmonized data set that are guaranteed to contain records that share key-values with the incoming data set (shown in step (7)). It shall however be understood that these ranges may contain more data than is needed.
The multiple map tasks retrieve the required subsets of previously harmonized data, using at worst as many range scans as the number of groups in an incoming data-set, and at best no scans: if no incoming group has a matching record in the Group Range Index, this indicates that the fresh data set is completely "new".
Next, both the freshly transformed data set as well as the 'matching' subset of previously harmonized data are input to the join stage, shown in step (10). A full outer join is then performed between all the incoming data and the matching subset of harmonized data. It should however be noted that the values of any measure columns already present in the matching subset are replaced with new measures as computed by the join.
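The override semantics of this join stage may be sketched as follows; this is a simplified, single-node Python illustration only, whereas in the disclosed system the join runs as a map-reduce job, and the function name and key handling here are assumptions.

def join_with_override(transformed, matching_subset, key):
    """Full outer join of the freshly transformed data with the matching subset of
    previously harmonized data; measure values already present in the matching
    subset are replaced with the latest values from the transformed data."""
    joined = {r[key]: dict(r) for r in matching_subset}   # start from the old records
    for r in transformed:
        joined.setdefault(r[key], {}).update(r)           # new values win on conflict
    return list(joined.values())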
Most importantly, the output of the join stage in the compiler module 110 is always inserted into the HBase data store, shown in step (11) of FIG. 3. Thus, even if partially harmonized data was previously present, fresh data is inserted against the same key, which is easily possible since HBase merely maintains a new version of such records and no updates are performed.
Once the incoming data set is harmonized and stored in the warehouse, the "Group Range Index" is updated using an additional map-reduce phase (as shown in steps (12) and (13)). Again referring to FIG. 4, starting with input data in T1, the data is grouped according to k dimensions, as shown in T2 (where k < the total "n" dimensions of the data). Thereon, maximum and minimum values of the primary keys are computed for each such group to create T3. The data of Table "T3" is then inserted or updated in the "Group Range Index".
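A simplified sketch of constructing and consulting the Group Range Index is given below; it is a single-process Python illustration of the T1 to T2 to T3 steps described above (in the disclosed system this runs as a map-reduce phase), and all function names are assumptions.

from collections import defaultdict

def build_group_range_index(harmonized, k_dims, primary_key):
    """T1 -> T2 -> T3: group the harmonized records on k dimensions and keep the
    minimum and maximum primary-key values for each group (the Group Range Index)."""
    groups = defaultdict(list)
    for rec in harmonized:
        groups[tuple(rec[d] for d in k_dims)].append(rec[primary_key])
    return {g: (min(keys), max(keys)) for g, keys in groups.items()}

def lookup_key_ranges(index, incoming, k_dims):
    """Scan stage: for each group actually present in the incoming data set, look up
    the key range guaranteed to contain all potentially matching harmonized records.
    An empty result means the incoming data set is completely new."""
    seen = {tuple(rec[d] for d in k_dims) for rec in incoming}
    return [index[g] for g in seen if g in index]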
The second data set from the plurality of data sets is retrieved, followed by annotating and transforming of the second data set. Then, the second data set is scanned by identifying the range of keys in the first data set containing records sharing key values with the second data set and thereby retrieving the matching subset from the first data set. Finally, the second dataset is joined with the matching subset so retrieved to achieve the data harmonization in real time. The joined data is stored into the HBase data store, whereby, even if partially harmonized data was previously present, fresh data is inserted against the same key (Step 11).
FIG. 5 illustrates the keyword matching overview in accordance with the principles of the invention. It is also called soft key correlation. The real challenge in transforming the data comes when data is represented differently across different data sources, for example as two different codes, each referencing the same product. This challenge is met by the introduction of the soft key correlation technique, implemented in the transformation grouping module 106, which matches specific values of two given columns from different tables potentially belonging to different databases. Approximate key-matching features are also available in various data integration packages. However, in most such cases, distance measures need to be chosen manually by developers, based on intuition or trial and error, along with corresponding threshold values for a matching pair of strings. However, there is no single similarity metric that applies to all problems, and neither is it easy for a developer to judge which metrics or thresholds to use, as these are data-dependent. Therefore, according to the present invention, similarity scores are computed based on a set of metrics and then these features are used to learn a discriminative statistical model that classifies any pair of strings into 'matching' or 'non-matching' categories. The invention takes a training set of matching and non-matching pairs as input and produces a classifier based on a learned model that combines many distance measures in a manner appropriate for the data-set at hand.
As shown in FIG. 5, various features are generated on a subset of manually labeled pairs taken from actual data sources for a particular harmonization problem, to serve as a training set for a supervised learning procedure. These features are obtained by resolution techniques including, though not limited to, 'overlap coefficient', 'Jaccard similarity', 'cosine similarity', 'soft TFIDF', 'Monge Elkan', 'Jaro Winkler' and 'Soundex', for learning the model. Since some of these are more important than others in a given problem-set, features that correlate poorly with the labels in the training set are dropped. Having selected the features to be used, a support vector machine (SVM) classifier using a Gaussian kernel is employed.
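The following is an illustrative Python sketch of such a soft key correlation classifier, assuming the scikit-learn library for the SVM with a Gaussian (RBF) kernel; only two of the listed similarity features are shown, and the library choice, feature subset and function names are assumptions rather than the disclosed implementation.

from sklearn.svm import SVC   # library choice is an assumption; the specification names no library

def jaccard(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B) if A | B else 0.0

def overlap_coefficient(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / min(len(A), len(B)) if A and B else 0.0

def features(a, b):
    # In the full scheme, soft TFIDF, Monge Elkan, Jaro Winkler, Soundex, etc. would be
    # added here, and features correlating poorly with the training labels dropped.
    return [jaccard(a, b), overlap_coefficient(a, b)]

def train_soft_key_classifier(labeled_pairs):
    """labeled_pairs: iterable of ((string_a, string_b), label), where label is 1 for a
    manually labeled 'matching' pair and 0 for a 'non-matching' pair."""
    X = [features(a, b) for (a, b), _ in labeled_pairs]
    y = [label for _, label in labeled_pairs]
    clf = SVC(kernel="rbf")   # SVM with a Gaussian (RBF) kernel
    clf.fit(X, y)
    return clf

def is_match(clf, a, b):
    return bool(clf.predict([features(a, b)])[0])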
In one preferred embodiment of the map-reduce implementation, wherein the map-reduce process is divided into two phases, 'map' and 'reduce', both of which run in parallel on many computing nodes in a cluster, the input data is divided into chunks and given to 'map' nodes to process by executing map 'tasks' in parallel. Each map task takes one input tuple at a time, identifies a key from it, and passes it on to the 'reduce' phase. A 'reduce' node receives all the tuples for an instance of the key and can choose to combine them in any manner, e.g., aggregation. Therefore, in the 'map' phase the key should be identified so as to enable the 'reduce' phase to perform the desired operation. Key features of the map-reduce programming paradigm are its simplicity, and the capability to process even petabytes of data in a linearly scalable manner. Further, Map-Reduce implementations of standard relational operators typically used in Type-I ETL tasks, as well as the 'Un-Pivot', 'Pivot' and 'Apportioning' operators that become useful in harmonizing unstructured data, are described herein below. For example, while processing survey data, it is often found that it arrives in 'un-pivoted' form, stored as questions and answers that may need to be pivoted for every different question. Alternatively, events extracted from social media may need to be appropriately apportioned across multiple time periods.
Atomic Operators in Map-reduce
'Selection': Within the map task that applies to every input tuple, a filter condition can be used to propagate only those tuples that satisfy the selection condition.
'Projection': During the map task for every tuple, the input tuple is merely tokenized based on its format (e.g., on the comma ',' for CSV format), retaining only those tokens as specified in the projection.
'Apportion': When many tuples are generated from a single tuple, certain measure columns in the source tuple need to be distributed/apportioned across the many tuples. For example, 'equal-division' apportions a single tuple with measure value x into n tuples with measure value x/n. Apportioning is easily accomplished in the map phase, with each mapper generating multiple outputs.
'Aggregation': When many tuples need to be combined into one tuple, the corresponding measure columns need to be reduced, using a specified reduction function (e.g., 'sum'), to a single value that will be included in the resulting tuple. Here both map and reduce phases are required; tuples that need to be merged are grouped by the map phase and aggregated in the reduce phase.
'Join': A join operation can then be performed through a combination of map and reduce tasks.
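A toy, single-process Python sketch of how the 'Selection' and 'Aggregation' operators map onto the map and reduce phases is shown below; the driver run_map_reduce and the sales example are illustrative assumptions, not the disclosed Map-Reduce implementation.

from itertools import groupby
from operator import itemgetter

def run_map_reduce(records, map_fn, reduce_fn):
    """Toy single-process driver: apply the map function to every input tuple,
    shuffle by key, then apply the reduce function once per key."""
    mapped = [kv for rec in records for kv in map_fn(rec)]
    mapped.sort(key=itemgetter(0))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

def map_sales(rec):
    # 'Selection' in the map phase: propagate only tuples satisfying the filter condition,
    # keyed so that the reduce phase can aggregate per store.
    if rec["units_sold"] > 0:
        yield (rec["store"], rec["units_sold"])

def reduce_sum(store, values):
    # 'Aggregation' in the reduce phase: reduce the grouped measure with 'sum'.
    return {"store": store, "units_sold": sum(values)}

# e.g. run_map_reduce(pos_rows, map_sales, reduce_sum) returns one aggregated tuple per store.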
Further, the Un-pivot and Pivot operations can be described by way of an example, wherein the legend refers, say, to a product and store code, followed by three weeks (W7, W8, and W9) and the corresponding sales figures for 'Coffee Mug' and 'ASDF coffee 100 gm' for said three weeks.
Now, Un-pivot takes a set of pivot columns as input, creates two additional columns referred to as the meta-data column and the measure column, and drops the pivot columns. Pivot columns are those congruent columns that report the same measure. For every tuple of the input data, the unpivot operation creates as many tuples as the number of pivot columns, resulting in an increased number of tuples. In this operation, the column headings of the pivot columns (e.g., 'W7', 'W8', and 'W9') become the values of the meta-data column, while all the cell values go into a single measure column. Un-pivot can be performed using the map phase alone. Each map task takes one tuple at a time and generates as many new tuples as the number of pivot columns. The map task only requires the column headings and sequence numbers of the pivot columns as input. Using this technique, un-pivot operations of many sets of pivot columns can be combined in a single map phase.
On the other hand, the 'Pivot' operation is the reverse of un-pivot, i.e., it takes two columns, the meta-data column and the measure column, as input, creates as many additional columns as the cardinality of the meta-data column, and finally drops the meta-data and measure columns. As a result of pivoting, the number of tuples in the data gets reduced by a factor that depends on the cardinality of the meta-data column. Both map and reduce phases are required to execute a pivot operation. Suppose, for the above example, there are N (= 4) columns in the input data and K (= 3) cardinal values in the meta-data column. Two of the columns will get dropped, therefore new columns should be added after the (N-2)th = 2nd position in the output tuple. Therefore a unique sequence number from the interval [(N-2+1), (N-2+K)] = [3, 5] is assigned to every cardinal value of the meta-data column.
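For illustration, the Un-pivot and Pivot operations on dictionary-shaped tuples can be sketched in Python as below; this is a single-process sketch (in the disclosed scheme Un-pivot uses the map phase alone and Pivot uses both map and reduce phases), and the column and function names are assumptions.

def unpivot(rows, pivot_cols, meta_col="week", measure_col="sales"):
    """Un-pivot: drop the pivot columns; for every input tuple emit one tuple per pivot
    column, putting the column heading (e.g. 'W7') into the meta-data column and the
    cell value into the measure column."""
    out = []
    for r in rows:
        base = {k: v for k, v in r.items() if k not in pivot_cols}
        for c in pivot_cols:
            out.append({**base, meta_col: c, measure_col: r[c]})
    return out

def pivot(rows, meta_col="week", measure_col="sales"):
    """Pivot: the reverse operation; create one column per distinct meta-data value and
    drop the meta-data and measure columns, reducing the number of tuples."""
    grouped = {}
    for r in rows:
        base = tuple(sorted((k, v) for k, v in r.items()
                            if k not in (meta_col, measure_col)))
        grouped.setdefault(base, {})[r[meta_col]] = r[measure_col]
    return [{**dict(base), **vals} for base, vals in grouped.items()]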
BEST MODE/EXAMPLE FOR WORKING OF INVENTION
A method is illustrated by way of an example to provide a comparative performance between different data harmonization techniques. It is to be understood that the claimed invention is not restricted to the said example only.
Example I
In accordance with the principles of the present invention, a case study of the consumer packaged goods (CPG) industry is presented which describes a typical data harmonization scenario: let a consumer-packaged-goods company receive point-of-sale data from various retailers as well as retail metrics collected through manually conducted sample surveys. It may also have access to events regarding the actions of competitors as gleaned from social media. FIG. 6 shows the schema of the four input sources and of the harmonized data as described below.
1. Point of Sale Data (POS): This data is collected at the checkout counter of retail stores, consolidated for one week and is collected from all the stores that the CPG Company supplies its products to. It contains SKU (Stock Keeping Unit, a code for each unique product for sale in a store), Store#, and #UnitsSold for four weeks of a month.
2. Survey Data - Share of Shelf (SOS): This and the next two data-sets (all survey data) are obtained from a fleet of surveyors who visit stores to conduct surveys. SOS data indicates what share of the total number of shelves in a store is given to each product or product category.
3. Survey Data - Out of Stock (OOS): Out of Stock data indicates whether or not an SKU was available for sale in a store on the date of a surveyor's store visit.
4. Survey Data - Miscellaneous (MISC): This data contains a set of survey questions, such as 'is visible from main entrance?', along with their answers. The question-answer pairs are associated with product names or product categories as observed in a given store on the date of a surveyor's store visit.
Harmonization Model of CPG Case Study
POS: SKU names present in the POS data need to be cleansed and then #UnitsSold for the various weeks of a month needs to be un-pivoted. The transformation sequence is: 'Cleanse' and 'UnPivot'. When executed using PIG, these are merged into one map-reduce job.
SOS: Product names reported in this dataset do not match with SKUs, sometimes have typing mistakes, and some of them correspond to shelf names and not product names. All the shelf names need to be transformed to SKUs. The transformation sequence is taken as: 'Many-to-One' (for survey date to week number of the year) and 'One-to-Many' (for shelf name to SKU names).
OOS: SKU names need to be transformed to the same scheme as used in the POS data, and the survey date needs to be transformed to the week number of the year. Only the second transformation is included for the purposes of the present experiments. The required transformation sequence is therefore 'Many-to-One' only, and amounts to one map-reduce job.
MISC: In this data source, product names are used similarly to the SOS data. It is required to create a pivot column for every question, as indicated in FIG. 6. Therefore the transformation sequence is taken as: 'Many-to-One' (for survey date to week number of the year), 'One-to-Many' (for product name to SKU names) and finally a 'Pivot' operation to create a column for every question asked. Similar to the SOS data, the optimal sequence for these operations was chosen as 'Pivot', 'Many-to-One' and finally 'One-to-Many'.
The requirement is to join these disparate data sources into one harmonized format, where for each week key performance indicators (such as UnitsSold or Revenue) are stored together. Since survey-data is manually entered and comes from multiple provider agencies, product and category names rarely match between data sources, and can also vary over time even from the same source. In actual practice, the
scenario is even more complex, involving data from social media as well as the need to interpolate between the most appropriate data points when survey data is too sparse.
FIG. 7 illustrates the comparative performance of traditional ETL vs the ATJ and ATSJI schemes. In FIG. 7, a comparison is presented of the data processed per second by the traditional ETL approach (Pharmacy case study) and by Data Harmonization (CPG case study) in batch mode as well as with incremental ETL-MR. As is evident, on comparable hardware and for ETL jobs of comparable complexity, significantly more data per second is processed using map-reduce as compared to the traditional ETL platform.
To study the additional cost of repeated harmonizations and range scans when harmonization is done incrementally using ATSJI, harmonization is performed for six months, each time in four increments, assuming that different data sources (e.g., POS, SOS, etc.) arrive at different times in the same month. The k dimensions required by ATSJI are chosen to be time and store#, the rationale being that data from each source itself often arrives in bits and pieces, for a few regions over a given time period.
FIG. 8 illustrates the comparative performance of Batch (ATJ) vs incremental (ATSJI) harmonization. It demonstrates two scenarios: first, when the data is harmonized in batch mode, marked as ATJ. In this case harmonized data can be generated only after all the required data-sets (POS, SOS, OOS and MISC) have arrived. The second scenario is that of Incremental ETL-MR, where data is harmonized as soon as it arrives. Thus, if data arrives in four different batches, it is harmonized four times. Even then, as shown in FIG. 8, the cumulative cost of four incremental ATSJI harmonizations is only 2.5 times that of the single batch ATJ procedure (i.e., not four times). Additionally, the time required to harmonize each increment is always less than the time required for the ATJ batch process, even with the overhead of the scan stage, and can sometimes be as much as 30-40% lower (as in the case of POS, OOS, and SOS data).
While considerable emphasis has been placed herein on the particular features of this invention, it will be appreciated that various modifications can be made, and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other modifications in the nature of the invention or the preferred embodiments will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.
WE CLAIM:
1. A computer implemented method for dynamically harmonizing a plurality of incongruent data sets with a harmonized data set in real-time, the method comprising:
receiving at one or more independent processors, the plurality of incongruent
data sets, each of said independent processor configured to process at least
one received data set;
annotating schema of the received data set by obtaining schema details and
renaming thereof;
transforming one or more dimensions or measures or a combination thereof
of the annotated dataset into a target scheme of representation by utilizing
soft key correlation technique, and including entire set of dimensional and
measure values of the annotated data set and the target scheme into the
transformed data set;
performing a sub grouping for n dimensions of a previously harmonized data
set into k partially overlapping dimensions by the processor;
scanning each of the k groups of the previously harmonized data set for range
of keys sharing key values with corresponding values of the transformed data
set, thereby retrieving a matching data subset therefrom; and
performing a full outer join operation upon the transformed data set and the
matching subset of the previously harmonized data set, wherein existing
values of said previously harmonized data are updated with latest values of
recently transformed data set.
2) The method of claim 1, wherein each of the independent processors executes the steps of annotation, transformation, scanning and joining to perform said respective predetermined operations simultaneously in a parallel fashion.
3) The method of claim 1, wherein the schema is altered based on pivot operations; unpivot operations or a combination thereof using a map reduce paradigm.
4) The method of claim 1, wherein the soft key correlation enables the transformation at the dimensional and accordingly the measure level using a mapping file, said mapping file being configured to match specific measure values of columns of tables integral to the annotated and the target scheme and compute a similarity score based on a statistical model derived from a set of features to further classify the values as matching or non matching.
5) The method of claim 4, wherein the features for machine learning implementation are selected by employing a group of resolution techniques comprising, but not limited to, overlap coefficient, Jaccard similarity, cosine similarity, soft TFIDF, Monge Elkan, Jaro Winkler and Soundex.
6) The method of claim 1, wherein the entire set of dimensional and measure values of the annotated data set and the target scheme is included into the transformed data set such that a null is assigned to the value found non-existing in the annotated data set, but existing in the target scheme.
7) The method of claim 1, wherein a group range index table is constructed during grouping to store minimum and maximum values of primary key in the harmonized data for each combination of the k grouping keys.
8) The method of claim 1, wherein the joined data is stored into HBase data store for maintaining newer versions of partially, incompletely or completely harmonized data set by inserting new measures against the same key.
9) A computer implemented method of harmonizing incongruent data obtained from plurality of disparate data sources, comprising:
storing a plurality of incoming datasets in a staging area until entire datasets for
a specified time period have arrived;
annotating schema of the stored datasets by capturing variations in the schema
and renaming said schema in accordance with a model specified in an
annotation file;
performing multiple transformation of the annotated datasets to obtain a
common scheme of representation by:
altering the renamed schema to obtain a congruent data structure having same measure values by performing multiple pivot or unpivot operations; mapping dimensional and subsequently measure values of the data structure to a target data structure using a mapping file, said mapping file being configured to match specific dimensional values of columns of tables integral to the annotated and the target data structures and compute a similarity score based on a statistical model derived from a set of features to further classify the values as matching or non matching; and
joining the matching values of the transformed data to achieve the harmonized
data set.
10) The method of claim 9, wherein prior to the alteration of schema an additional step of cleansing and validation is performed to remove extra characters.
11) The method of claim 9, wherein the schema is altered based on pivot operations; unpivot operations or a combination thereof using a map reduce paradigm.
12) The method of claim 9, wherein the variations in the schema of the incoming datasets are captured by a plurality of regular expressions.
13) The method of claim 9, wherein the features are selected by employing overlap coefficient, Jaccard similarity, cosine similarity, soft TFIDF, Monge Elkan, Jaro Winkler, Soundex, and other such resolution techniques.
14) The method of claim 9, wherein the classification of the values as matching or non matching is achieved by a support Vector machine classifier that uses a Gaussian kernel.
15) The method of claim 9, wherein the dimensional transformation is achieved by one-to-one, one-to-many, many-to-one association or a combination thereof based upon the schema details of the incoming data structure and the target data structure.
16) A data harmonization system to dynamically harmonize a plurality of incongruent data sets with a harmonized data set in real-time, comprising:
one or more independent processors configured to receive the plurality of
incongruent data sets, each of said independent processor configured to
process at least one received data set; each processor further configured to
cause:
an annotation module to annotate schema of the received data set by
obtaining schema details and renaming thereof;
a transformation grouping module to transform one or more dimensions or measures or a combination thereof of the annotated dataset obtained from the annotation module, into a target scheme of representation by utilizing soft key correlation technique; and thereon perform a sub grouping for n dimensions of a previously harmonized data set into k partially overlapping dimensions;
a scanning module to scan each of the k groups of the previously harmonized data set for range of keys sharing key values with corresponding values of the transformed data set to retrieve a matching data subset therefrom; and a compiler module in communication with the transformation grouping module and the scanning module, to perform a full outer join operation upon the transformed data set and the matching subset of the previously harmonized data set, wherein existing values of said previously harmonized data are updated with latest values of recently transformed data set.
17) The data harmonization system of claim 16, wherein each of the independent processor is enabled to cause the annotation module, the transformation grouping module, the scanning module and the compiler module to perform said respective predetermined operations simultaneously in a parallel fashion.
18) The data harmonization system of claim 16, wherein the transformation grouping module utilizes soft key correlation to enable the transformation at the dimensional and accordingly the measure level using a mapping file, said mapping file being configured to match specific measure values of columns of tables integral to the annotated and the target scheme and compute a similarity score based on a statistical model derived from a set of features to further classify the values as matching or non matching.
19) The data harmonization system of claim 16, wherein the transformation grouping module constructs a group range index table during grouping to store minimum and maximum values of primary key in the harmonized data for each combination of the k grouping keys.
20) The data harmonization system of claim 16, wherein the compiler module stores the joined data into HBase data store for maintaining newer versions of partially, incompletely or completely harmonized data set by inserting new measures against the same key.