Abstract: Disclosed herein is a system (104) for applying a data filtering technique to filter unwanted data from a data set and classifying the filtered data set into a plurality of distinct features, calculating a correlation score for each of the classified features with respect to a target variable, selecting a plurality of features, among the classified features, having a correlation score greater than or equal to a first pre-defined threshold, generating a feature correlation matrix for the selected plurality of features and grouping the selected plurality of features in one or more clusters based on a correlation value, selecting one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold, selecting at least two features from the one or more selected clusters having a highest correlation degree, and processing the at least two selected features to generate the useful insights. [To be published with FIG. 1]
FORM 2
THE PATENTS ACT, 1970
(39 OF 1970)
AND
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
“METHOD AND SYSTEM FOR GENERATING USEFUL INSIGHTS”
Zensar Technologies Limited, Plot#4 Zensar Knowledge Park, MIDC, Kharadi, Off Nagar
Road, Pune, Maharashtra – 411014, India
Nationality: India
The following specification particularly describes the invention and the manner in which it is to
be performed.
TECHNICAL FIELD
The present invention relates to the field of data processing, and more particularly to data analysis and dynamic visualization techniques for generating useful insights from a dataset.
BACKGROUND OF INVENTION
In today’s world, data is key to any organization’s day-to-day operations and overall performance, and is often regarded as an integral ingredient of the organization’s success. Organizations’ use of data is growing daily, with no foreseeable limit. Organizations use data to plan their day-to-day operations as well as to make crucial business decisions. Data thus being one of the central elements of today’s organizational requirements, numerous attempts have been made towards representing data in a user-friendly yet efficient manner.
Among other approaches, gathering and processing raw data and then representing it in graphical form has been considered a forward-looking approach, as this enables a person to perceive and understand relevant information quickly for any decision-making activity. Further, apart from their use in industrial decision-making, such graphical representations of data have always been valuable for studying scientific problems, as the technique not only enables scientists to learn and work with their data in its natural dimensional form, but also brings a new level of understanding.
Further, the graphical representation of data may provide insights relevant for scientific study, research, industrial decision-making processes, etc. Although multiple approaches exist for deriving meaningful data insights, existing approaches require extensive human intervention and effort to derive such insights. In the present-day scenario, the rapid proliferation of large amounts of data from various sources presents a grave challenge to creating insights using existing techniques. Many enterprises find it challenging to collect huge amounts of data and apply them in the proper context. As a result, such organizations fail to get the most out of their growing information resources. Moreover, traditional analytical tools have served a purpose, but they have several shortcomings that make them inadequate in today’s industrial environment: they do not scale with growing data and cannot provide the real-time insights needed to keep up with innovative competitors in fast-paced markets.
Thus, there exists an urgent need for a technique that addresses the above-mentioned shortcomings of traditional techniques, which require frequent human intervention at each step to extract relevant information from data, rely on manually performed operations such as filtering or relationship mapping between data points, and even then yield only limited and conventional forms of insights.
The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY OF THE INVENTION
The present disclosure overcomes one or more shortcomings of the prior art and provides additional advantages discussed throughout the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.
In one embodiment of the present disclosure, a computer implemented data processing method for generating insights from a data set is disclosed. The method comprises applying a data filtering technique to filter out unwanted data from the data set and classifying the filtered data set into a plurality of distinct features. The method further comprises calculating a correlation score for each of the classified features with respect to a target variable. The method further comprises selecting a plurality of features, among the classified features, having a correlation score greater than or equal to a first pre-defined threshold. The method further comprises generating a feature correlation matrix for the selected plurality of features and grouping the selected plurality of features in one or more clusters based on a correlation value, wherein the features having relatable correlation values are kept in the same cluster. The method further comprises selecting one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold. The method further comprises selecting at least two features from the one or more selected clusters having a highest correlation degree. The method further comprises processing the at least two selected features, using a pre-defined set of queries, to generate the useful insights.
In another embodiment of the present disclosure, the data filtering technique further comprises removing at least one of incomplete data, redundant data, and duplicate data from the data set.
In yet another embodiment of the present disclosure, the computer implemented data processing method further comprises performing a null hypothesis technique on the at least two selected features to ascertain that the at least two selected features have the highest correlation degree.
In yet another embodiment of the present disclosure, the f-score for each of the features of the selected one or more clusters is calculated by using a SelectKBest technique.
In yet another embodiment of the present disclosure, the computer implemented data processing method further comprises selecting at least two other features from the one or more selected clusters having a subsequent correlation degree. The method further comprises processing the at least two other selected features using the pre-defined set of queries, to generate other useful insights.
In yet another embodiment of the present disclosure, a data processing system for generating useful insights from a data set is disclosed. The data processing system comprises a memory and a processing unit operationally coupled with the memory. The processing unit is configured to apply a data filtering technique to filter out unwanted data from the data set and classify the filtered data set into a plurality of distinct features. The processing unit is further configured to calculate a correlation score for each of the classified features with respect to a target variable. The processing unit is further configured to select a plurality of features, among the classified features, having a correlation score greater than or equal to a first pre-defined threshold. The processing unit is further configured to generate a feature correlation matrix for the selected plurality of features and group the selected plurality of features in one or more clusters based on a correlation value, wherein the features having relatable correlation values are kept in the same cluster. The processing unit is further configured to select one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold. The processing unit is further configured to select at least two features from the one or more selected clusters having a highest correlation degree. The processing unit is further configured to process the at least two selected features, using a pre-defined set of queries, to generate the useful insights.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF DRAWINGS
The embodiments of the disclosure itself, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings. One or more embodiments are now described, by way of example only, with reference to the accompanying drawings in which:
FIG. 1 shows an exemplary environment 100 for generating insights from a data set, in accordance with an embodiment of the present disclosure;
FIG. 2 shows a block diagram illustrating data processing system 104 for generating useful insights from a data set, in accordance with an embodiment of the present disclosure;
FIG. 3 shows a flowchart illustrating a computer implemented data processing method 300 for generating insights from a data set, in accordance with an embodiment of the present disclosure.
The figures depict embodiments of the disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments
of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
In the present disclosure, the term “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject matter described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a device that comprises a list of components does not include only those components but may include other components not expressly listed or inherent to such setup or device. In other words, one or more elements in a system or method preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
The terms like “at least one” and “one or more” may be used interchangeably or in combination throughout the description.
While the present disclosure is illustrated in the context of generating insights from data, the data processing method and system, and aspects and features thereof, can also be used for any other application which requires extracting useful and relevant data from a huge data set.
Reference will now be made to the exemplary embodiments of the disclosure, as illustrated in the accompanying drawings. Wherever possible, same numerals will be used to refer to the same or like parts. Embodiments of the disclosure are described in the following paragraphs with reference to FIGs. 1 to 3.
FIG. 1 shows an exemplary environment 100 for generating insights from a data set, in accordance with an embodiment of the present disclosure. It must be understood by a person skilled in the art that the present disclosure may also be implemented in various environments other than as shown in FIG. 1. As shown in FIG. 1, the system 104 receives a data set 102. In one exemplary scenario, the data set 102 may be any amount of data from one or more sources, which may contain missing values, duplicate data, data with low variance, data with anomalies, etc. The system 104 receives the data set 102 in raw or crude form as an input data set for further processing using data processing techniques to generate useful insights, as disclosed in the present disclosure. The data processing technique will now be explained in conjunction with FIG. 2.
FIG. 2 shows a block diagram of a data processing system for generating useful insights from a data set, in accordance with an embodiment of the present disclosure. Although the present disclosure is explained considering that the system 104 is implemented on a computer system, it may be understood that the system 104 may be implemented in a variety of computing systems, such as a laptop computer, a server, a notebook, a workstation, a mainframe computer, a network server, a cloud-based computing environment, etc.
In one implementation, the system 104 may comprise a processing unit 202, an I/O interface 204 and a memory 206. The memory 206 may be communicatively coupled to the processing unit 202 and the I/O interface 204. In some embodiments, the processing unit 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processing unit 202 is configured to fetch and execute computer-readable instructions stored in the memory 206. The I/O interface 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 204 may enable the system 104 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 204 may facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or more ports for connecting many devices to one another or to another server.
In one embodiment of the present disclosure, the system 104 further includes a Data Query Subsystem 208 and a Dynamic Visualization Engine (DVE) 216. The Data Query Subsystem 208 further includes a Data Quality Assessment (DQA) unit 210, a Feature Clustering Unit 212 and a Query Generation unit 214. The Dynamic Visualization Engine (DVE) 216 further includes an Insight Generation unit 218. In one embodiment of the present disclosure, these entities 208-218 may comprise hardware components such as processors, microprocessors, microcontrollers, or application-specific integrated circuits for performing various operations of the system 104. In one embodiment, the entities 208-218 may be dedicated hardware units capable of executing one or more instructions stored in the memory 206 for performing various operations of the system 104. In another embodiment of the present disclosure, the entities 208-218 may be software modules stored in the memory 206 which may be executed by a processor. It must be understood by a person skilled in the art that the processor may perform all the functions of the entities 208-218 according to various embodiments of the present disclosure.
Referring back to FIG. 1, the environment 100 shows the system 104 that receives the data set 102 from one or more sources (not shown). The data set 102 may contain a wide variety of data including missing values, duplicate data, data with low variance, data with anomalies, etc. In other words, the system 104 may be configured to receive the data set 102 from a plurality of sources.
In one embodiment of the present disclosure, the data set 102 may include details related to human subjects such as customers, patients, etc. In another embodiment, the data set 102 may further include details related to non-living entities such as buildings, ships, vehicles, entries related to a civil project, etc. Once the data set 102 is received by the system 104, it is first passed through a Data Query Phase 106 which is implemented by the Data Query Subsystem 208. After the Data Query Subsystem 208 receives the data set 102, the data quality assessment unit 210 assesses the quality of the data set 102 by performing a data quality assessment (DQA) on the data set 102. Those skilled in the art will appreciate that data quality assessment is a technique for scientifically and statistically evaluating data by measuring particular features of the data to determine whether the data set 102 meets defined standards and is of the right type and quantity to support its intended use. In other words, the data quality assessment aims to
identify incorrect data, estimate the impact on the industrial processes, and implement corrective action.
In one embodiment of the present disclosure, the quality assessment unit 210 is configured to receive the entire data set 102 as input and provide a cleaned data set as output. The procedure of converting the entire data set 102 into the cleaned data set may include dropping variables that contain more than a pre-defined percentage of missing values, such as 40% or any other predefined percentage. Further, the data set 102 is scanned by the quality assessment unit 210 to check whether data type conversion is required for any of the entries present in the data set. For example, the entries may be converted into binary form or any other pre-defined data type to bring consistency among all the entries of the data set.
Further, the quality assessment unit 210 may be configured to check for any duplicate data or entries present in the data set. In case any duplicate entry exists, the duplicate entry is dropped from the data set 102. Likewise, a data item or entry which has only one value, where all instances share that same value, does not carry any relevant or valuable information and is removed along with the duplicate entries. In another embodiment, data having low variance, i.e., data that contains more than a pre-defined percentage of zeroes, is removed by the quality assessment unit 210. Further, the quality assessment unit 210 is configured to perform an anomaly detection scan and remove anomalies, if found, from the data set 102.
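By way of illustration only, these cleaning steps can be sketched in Python with pandas; the 40% missing-value cut-off, the zero-fraction cut-off, and the function name are assumptions for the sketch, not the patented implementation:

```python
# Illustrative sketch of the DQA cleaning steps (assumed thresholds).
import pandas as pd

def clean_data_set(df: pd.DataFrame, max_missing: float = 0.4,
                   max_zero_fraction: float = 0.95) -> pd.DataFrame:
    # Drop variables (columns) whose fraction of missing values exceeds
    # the pre-defined percentage (40% here).
    df = df.loc[:, df.isna().mean() <= max_missing]
    # Drop duplicate entries (rows).
    df = df.drop_duplicates()
    # Drop low-variance columns, i.e. columns containing more than the
    # pre-defined percentage of zeroes.
    zero_fraction = (df == 0).mean(numeric_only=True)
    df = df.drop(columns=zero_fraction[zero_fraction > max_zero_fraction].index)
    # Drop constant columns: an entry where all instances share one value
    # carries no information.
    return df.loc[:, df.nunique(dropna=False) > 1]
```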
Further, the quality assessment unit 210 may be configured to convert categorical variables present in the data set to numerical values using (i) “Ordinal Encoding”: where a data item or entry of the data set comprises a finite set of discrete values with a ranked ordering between values, ordinal encoding maps each unique label to an integer value. Ordinal encoding may be preferred if there is a known relationship between the categories. The quality assessment unit 210 may also convert categorical variables to numerical values using (ii) “Label Encoding”, which is used where a data item or entry of the data set comprises a finite set of discrete values with no relationship between values. The label encoding technique assigns close labels to similar categories, which leads to fewer splits in the trees and hence reduces execution time. Further, the quality assessment unit 210 may be configured to convert categorical variables present in the data set 102 to numerical values using (iii) “One-hot Encoding”, which creates one binary variable for each category and suits categories with no ordinal relationship. Further, the quality assessment unit 210 may be configured to use other techniques such as “Dummy Variable Encoding”, “Embeddings Technique”, etc.
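The three encoding options can be sketched as follows; a minimal example assuming pandas and scikit-learn, with hypothetical column names (“stage”, “hospital”):

```python
# Minimal sketch of the three encoding options (column names hypothetical).
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"stage": ["low", "medium", "high", "low"],
                   "hospital": ["A", "B", "A", "C"]})

# (i) Ordinal encoding: ranked categories mapped to ordered integers.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["stage_ord"] = ordinal.fit_transform(df[["stage"]]).ravel()

# (ii) Label encoding: each category gets an arbitrary integer label.
df["hospital_lbl"] = LabelEncoder().fit_transform(df["hospital"])

# (iii) One-hot encoding: one binary variable per category.
df = df.join(pd.get_dummies(df["hospital"], prefix="hospital"))
```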
Finally, after performing the afore-mentioned procedure, the quality assessment unit 210 outputs a cleaned data set for further processing. In another embodiment, the quality assessment unit 210 outputs a data quality assessment report containing one or more details such as an executive summary providing a high-level description of the data, a detailed description of the collections used to assess the quality of the selected data, a list of observations containing missing values with reasons for consideration of each of the entries, etc.
The above embodiment may be better understood by means of a data set 102 comprising several attributes or features of a cancer tumour of a patient (as shown in Table 1, below). In an example, after data type conversion of the data set 102, the attributes or features of the tumour may be represented as depicted in Table 1, below:
Attributes upon diagnosis Data type
radius_mean Float64
texture_mean Float64
perimeter_mean Float64
area_mean Float64
smoothness_mean Float64
compactness_mean Float64
concavity_mean Float64
concave points_mean Float64
symmetry_mean Float64
fractal_dimension_mean Float64
radius_se Float64
texture_se Float64
perimeter_se Float64
area_se Float64
smoothness_se Float64
compactness_se Float64
concavity_se Float64
concave points_se Float64
symmetry_se Float64
fractal_dimension_se Float64
radius_worst Float64
texture_worst Float64
perimeter_worst Float64
area_worst Float64
smoothness_worst Float64
compactness_worst Float64
concavity_worst Float64
concave points_worst Float64
symmetry_worst Float64
fractal_dimension_worst Float64
Table 1
As shown in the above table (i.e., Table 1), a cancer tumor may have features like radius_mean, perimeter_mean, area_mean, radius_se, perimeter_se, area_se, radius_worst, perimeter_worst, area_worst, etc. In an exemplary embodiment of the present disclosure, the quality assessment unit 210 converts the data type of each of the features to the “Float64” data type. It may be noted that the quality assessment unit 210 may convert the data to any other data type as well.
In yet another embodiment of the present disclosure, once the data quality assessment (DQA) unit 210 outputs the cleaned data set in accordance with the present disclosure as mentioned above, the feature clustering unit 212 receives the cleaned data set for further processing. In yet another embodiment of the present disclosure, the feature clustering unit 212 may perform attribute clustering on a group of attributes in such a way that similar attributes are clustered together in the same cluster. Further, it would be apparent to a person skilled in the art that different criteria may lead to different clustering results.
In another embodiment, an unsupervised attribute clustering technique may also be applied on the cleaned data set, whereby top-ranked non-redundant attributes or features are extracted from the cleaned dataset by using the “k-means” technique. It is to be noted that this unsupervised attribute clustering technique may be applied to cluster the features together based on their correlation. In yet another embodiment, several other unsupervised attribute clustering techniques are available which may also be applied on the cleaned data set. Such unsupervised attribute clustering techniques may include, but are not limited to, k-means (KM), self-organizing maps (SOM), fuzzy c-means based attribute clustering, and simultaneous clustering and attribute discrimination (SCAD). In addition to the above unsupervised attribute clustering techniques, a supervised attribute clustering technique may be applied on the cleaned data set.
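One plausible way to realize such correlation-based attribute clustering is sketched below with scikit-learn’s k-means; the 1 − |corr| distance definition and the cluster count are assumptions for illustration, not the patented code:

```python
# Sketch: group attributes whose correlation profiles are similar.
import pandas as pd
from sklearn.cluster import KMeans

def cluster_features(df: pd.DataFrame, n_clusters: int = 6) -> pd.Series:
    # Correlation-based distance: small for strongly correlated features.
    corr = df.corr(numeric_only=True)
    distance = 1.0 - corr.abs()
    # Each row of the distance matrix is one feature's profile; cluster rows.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(distance)
    return pd.Series(labels, index=corr.columns, name="cluster")
```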
In an exemplary embodiment of the present disclosure, the feature clustering unit 212 may be configured to perform the above-mentioned clustering technique on the features of a patient’s cancer tumor as mentioned in Table 1. The below table (Table 2) represents the clustering of the features into different clusters, i.e., clusters 0, 1, 2, …, 5:
Features dist_corr dist_corr_norm cluster
smoothness_se 0.666 0.54 1
fractal_dimension_mean 1.233 1.0 5
texture_se 0.756 0.614 1
symmetry_se 0.693 0.562 1
fractal_dimension_se 0.823 0.668 2
concavity_se 0.274 0.222 2
compactness_se 0.36 0.292 2
fractal_dimension_worst 0.711 0.577 5
symmetry_mean 0.693 0.562 5
smoothness_mean 0.672 0.545 5
concave points_se 0.696 0.564 2
texture_mean 0.182 0.147 3
symmetry_worst 0.852 0.691 5
smoothness_worst 0.673 0.546 5
texture_worst 0.182 0.147 3
area_se 0.575 0.466 0
perimeter_se 0.822 0.668 0
radius_se 0.792 0.642 0
compactness_worst 0.741 0.601 4
compactness_mean 0.522 0.423 4
concavity_worst 0.406 0.329 4
concavity_mean 0.418 0.339 4
area_mean 0.325 0.263 0
radius_mean 0.417 0.338 0
area_worst 0.33 0.267 0
perimeter_mean 0.369 0.3 0
radius_worst 0.427 0.346 0
concave points_mean 0.737 0.598 4
perimeter_worst 0.438 0.355 0
concave points_worst 0.503 0.408 4
Table 2
Further, the feature clustering unit 212 may be configured to cluster the features of Table 2 into different clusters based on the correlation values “dist_corr” and “dist_corr_norm” calculated for each of the features, as shown in Table 2 above. The feature clustering unit 212 may then be configured to group each of the features of Table 2 into separate clusters as shown in the below table (Table 3):
Cluster Features Feature count
0 radius_mean, perimeter_mean, area_mean, radius_se, perimeter_se, area_se, radius_worst, perimeter_worst, area_worst 9
1 texture_se, smoothness_se, symmetry_se 3
2 compactness_se, concavity_se, concave points_se, fractal_dimension_se 4
3 texture_mean, texture_worst 2
4 compactness_mean, concavity_mean, concave points_mean, compactness_worst, concavity_worst, concave points_worst 6
5 smoothness_mean, symmetry_mean, fractal_dimension_mean, smoothness_worst, symmetry_worst, fractal_dimension_worst 6
Table 3
In yet another embodiment of the present disclosure, the feature clustering unit 212 may be further configured to determine correlation between each of the features in each of the clusters.
To determine the correlation between each of the features, the feature clustering unit 212 may be configured to calculate a correlation score for each pair of features, and then for each of the combinations. In one implementation of the present disclosure, the Pearson Correlation Coefficient formula may be applied on selected features for determining the correlation between features. The correlation coefficient value always lies between -1 and +1. If the correlation coefficient value is positive, it indicates a similar and identical relation between the two features; otherwise it indicates dissimilarity between the two features. In one aspect, a zero correlation denotes that the correlation statistic does not indicate a relationship between the two features. To elucidate the above, consider an example where ‘x’ and ‘y’ are the two variables. The correlation coefficient can be calculated using the below formula:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$
Where:
“n” represents population size;
“Σx” represents sum of 1st feature values list;
“Σy” represents sum of 2nd feature values list;
“Σxy” represents sum of the product of 1st and 2nd features;
“Σx²” represents the sum of squares of the 1st feature values list; and
“Σy²” represents the sum of squares of the 2nd feature values list.
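For concreteness, the formula above transcribes directly into Python; the sketch below is equivalent, up to floating-point error, to numpy.corrcoef for two feature vectors:

```python
# Direct transcription of the Pearson formula defined above.
import numpy as np

def pearson_r(x, y) -> float:
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size                                     # population size "n"
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

# e.g. pearson_r(df["radius_mean"], df["perimeter_mean"]) should reproduce
# the 0.997866 entry of Table 4 for the example data set.
```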
In one embodiment of the present disclosure, the feature clustering unit 212 is configured to apply the above-mentioned technique on the features of each of the clusters. In an example implementation, the feature clustering unit 212 determines correlation scores for the features of cluster 0 of Table 3; the correlation scores are shown in the below table (Table 4):
                 radius_mean  perimeter_mean  area_mean  radius_se  perimeter_se  area_se  radius_worst  perimeter_worst  area_worst
radius_mean      1.000000  0.997866  0.987367  0.679090  0.674172  0.736964  0.989539  0.965137  0.941082
perimeter_mean   0.997866  1.000000  0.988607  0.691765  0.693135  0.744983  0.989476  0.970387  0.941660
area_mean        0.987367  0.986507  1.000000  0.732662  0.726628  0.800086  0.962746  0.959120  0.969213
radius_se        0.679090  0.691765  0.732562  1.000000  0.972794  0.951830  0.715065  0.719684  0.751548
perimeter_se     0.674172  0.693135  0.726628  0.972794  1.000000  0.937655  0.697201  0.721031  0.730713
area_se          0.735864  0.744983  0.800086  0.951830  0.937655  1.000000  0.757373  0.761213  0.811408
radius_worst     0.989539  0.989479  0.982746  0.715085  0.897201  0.757373  1.000000  0.993708  0.984015
perimeter_worst  0.985137  0.970387  0.959120  0.719664  0.721031  0.781213  0.993708  1.000000  0.977578
area_worst       0.941062  0.941550  0.959213  0.751548  0.730713  0.611408  0.984015  0.977578  1.000000
Table 4
In the same manner, the feature clustering unit 212 is configured to determine correlation scores for features of other clusters as well.
Once the feature clustering unit 212 determines the correlation between each of the features of each cluster, the feature clustering unit 212 may be further configured to calculate a correlation score for each of the classified features with respect to a target variable. In one embodiment of the present disclosure, the target variable may be a pre-defined value with respect to each of the attributes or features, and these values may be used to calculate the correlation score for each of the classified features. In an exemplary embodiment of the present disclosure, the attributes or features of the patient’s cancer tumor with respect to the corresponding target variable may be represented as shown in the below table (Table 5):
Features Correlation score with respect to target variable
smoothness_se -0.067016
fractal_dimension_mean -0.012838
texture_se -0.008303
symmetry_se -0.006522
fractal_dimension_se 0.077972
concavity_se 0.253730
compactness_se 0.292999
fractal_dimension_worst 0.323872
symmetry_mean 0.330499
smoothness_mean 0.358560
concave points_se 0.408042
texture_mean 0.415185
symmetry_worst 0.416294
smoothness_worst 0.421465
texture_worst 0.456903
area_se 0.548236
perimeter_se 0.556141
radius_se 0.567134
compactness_worst 0.590998
compactness_mean 0.596534
concavity_worst 0.659610
concavity_mean 0.696360
area_mean 0.708984
radius_mean 0.730029
area_worst 0.733825
perimeter_mean 0.742636
radius_worst 0.776454
concave points_mean 0.776614
perimeter_worst 0.782914
concave points_worst 0.793566
Table 5
In yet another embodiment of the present disclosure, the feature clustering unit 212 is further configured to select a plurality of features, among the classified features, having a correlation score greater than or equal to a first pre-defined threshold. In an exemplary embodiment of the present disclosure, the pre-defined threshold may be configured as an absolute value, for example 0.5; setting the threshold to the absolute value of 0.5 means considering input features only if the correlation score of the input feature with the target variable is greater than 0.5. In an exemplary embodiment of the present disclosure, applying the pre-defined value of 0.5 to the features of Table 5 yields the result shown in the below table (Table 6):
Features Correlation score with respect to target variable
area_se 0.548236
perimeter_se 0.556141
radius_se 0.567134
compactness_worst 0.590998
compactness_mean 0.596534
concavity_worst 0.659610
concavity_mean 0.696360
area_mean 0.708984
radius_mean 0.730029
area_worst 0.733825
perimeter_mean 0.742636
radius_worst 0.776454
concave points_mean 0.776614
perimeter_worst 0.782914
concave points_worst 0.793566
Table 6
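This first selection step can be sketched as a short filter over the correlation scores; pandas’ corrwith and the 0.5 threshold are illustrative assumptions:

```python
# Sketch of the first threshold step: keep features whose absolute
# correlation with the target meets the first pre-defined threshold.
import pandas as pd

def select_by_target_correlation(features: pd.DataFrame, target: pd.Series,
                                 threshold: float = 0.5) -> list[str]:
    scores = features.corrwith(target)   # correlation score per feature
    return scores[scores.abs() >= threshold].index.tolist()
```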
In yet another embodiment of the present disclosure, the feature clustering unit 212 may be further configured to select one or more clusters, among the plurality of clusters, containing the features having an f-score greater than or equal to a second pre-defined threshold value. In this regard, the feature clustering unit 212 may be configured to apply the “SelectKBest” method to select the features according to the “K” highest scores. The core idea here is to calculate a metric between the target and each feature, sort the features, and then select the K best features. A predefined score function, f_classif, computes the ANOVA F-value (i.e., the f-score value) between each feature and the target for classification tasks. The score measured for each feature indicates how well that particular feature discriminates between the two classes. The distance between the means of the class distributions forms the numerator (“N”). The population is taken into account in the denominator (“D”), a concept similar to the sample variances of the classes: instead of dividing each sum of squares by (sample_population - 1), all of the (sample_population - 1) terms are summed and the final value is divided by that sum. Consistent with this description, the numerator (“N”) and denominator (“D”) of the one-way ANOVA F-statistic may be defined as below:

$$N = \frac{\sum_{i=1}^{k} n_i\left(\bar{x}_i - \bar{x}\right)^{2}}{k - 1}, \qquad D = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij} - \bar{x}_i\right)^{2}}{\sum_{i=1}^{k}\left(n_i - 1\right)}$$

Where:
“k” represents the number of classes;
“n_i” represents the sample population of the i-th class;
“x_ij” represents the j-th sample value of the feature in the i-th class;
“x̄_i” represents the mean of the feature within the i-th class; and
“x̄” represents the overall mean of the feature.
Based on the above functions for “N” and “D”, the “f-score” value for each of the attributes is calculated, and may be defined as:

$$f\text{-}score = \frac{N}{D}$$
In an exemplary embodiment of the present disclosure, the f-score of each of the features of the patient’s cancer tumour may be represented in the manner as shown in below table (Table 7):
Features f-score values
smoothness_se 2.557968
fractal_dimension_mean 0.093459
texture_se 0.039095
symmetry_se 0.024117
fractal_dimension_se 3.468275
concavity_se 39.014482
compactness_se 53.247339
fractal_dimension_worst 66.443961
symmetry_mean 69.527444
smoothness_mean 83.651123
concave points_se 113.262760
texture_mean 118.096059
symmetry_worst 118.096059
smoothness_worst 122.472880
texture_worst 149.596905
area_se 243.651586
perimeter_se 253.897392
radius_se 268.840327
compactness_worst 304.341063
compactness_mean 313.233079
concavity_worst 436.691939
concavity_mean 533.793126
area_mean 573.060747
radius_mean 646.981021
area_worst 661.600206
perimeter_mean 697.235272
radius_worst 860.781707
concave points_mean 861.676020
perimeter_worst 897.944219
concave points_worst 964.385393
Table 7
Further, after the f-score value or ANOVA F-value is calculated by the feature clustering unit 212 for each of the features of each of the clusters, a “second” threshold value is determined; this threshold value “th” may be defined as below:
Where:
“th” represents optimal f_score value i.e., the threshold value; and
“n” = no. of features.
In an exemplary embodiment of the present disclosure, the value of “th” determined by the feature clustering unit 212 in accordance with the above is ‘283.32’. However, different values of “th” may be determined by the feature clustering unit 212 for different f-score values of the features. Thus, the feature clustering unit 212 may be further configured to select those features having an f-score value greater than or equal to “th” and discard or drop the remaining features from further processing.
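A sketch of this f-score selection step, assuming scikit-learn’s f_classif (the score function behind SelectKBest); since the derivation of “th” (283.32 in the example) is not reproduced here, the threshold is passed in as a parameter:

```python
# Sketch: score features with the ANOVA F-value and keep those at or
# above the second threshold "th".
import pandas as pd
from sklearn.feature_selection import f_classif

def select_by_f_score(X: pd.DataFrame, y: pd.Series, th: float) -> list[str]:
    f_scores, _ = f_classif(X, y)        # one ANOVA F-value per feature
    scores = pd.Series(f_scores, index=X.columns)
    return scores[scores >= th].index.tolist()

# SelectKBest(score_func=f_classif, k=...) yields the same scores when a
# fixed feature count, rather than a threshold, is wanted.
```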
In yet another embodiment of the present disclosure, the feature clustering unit 212 may be further configured to perform an “Ensemble Classifier Technique” on the selected features which are above the “th” value. An ensemble contains several sample populations and a group of predictors. This technique builds multiple subsets of a training set by using the bootstrap, i.e., sampling with replacement; samples are randomly generated from the training data with or without replacement. It is to be noted that an ensemble of decision trees, a natural extension of bagging, is well known for its ability to boost weak learners. It involves constructing many decision trees from bootstrap samples of the training dataset, as in bagging. Each individual tree predicts the records independently, and aggregating the predictions of a group of predictors often performs better than the best individual predictor. Bagging, or Bootstrap Aggregation, is a powerful, effective, and simple ensemble method. Bagging is only effective for unstable, non-linear models, i.e., models where a slight change in the training set can cause a significant difference in the model.
In one embodiment, the important features are determined by calculating a Gini importance, using the below formulation:
Gini = p1(1 - p1) + p2(1 - p2)
Where p1 and p2 are the probabilities of classes 0 and 1. The Gini index is minimized when either of the probabilities approaches zero, and the total decrease in the Gini index (node impurity) is calculated after each node split and then averaged over all trees. It is to be noted that the more the impurity decreases, the more critical the input feature is for data processing. In an exemplary embodiment of the present disclosure, the Ensemble Classifier score matrix for the features of the patient’s cancer tumour may be determined in the manner shown in the below table (Table 8):
Index Feature Score
0 radius_worst 0.507
1 area_worst 0.205
2 perimeter_worst 0.158
3 radius_mean 0.147
4 area_mean 0.113
5 perimeter_mean 0.099
6 radius_se 0.064
7 perimeter_se 0.042
8 concave points_mean 0.034
9 concave points_worst 0.030
10 area_se 0.029
11 concavity_mean 0.023
12 concavity_worst 0.020
13 concave points_se 0.007
14 compactness_mean 0.003
Table 8
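As an illustrative sketch of this ensemble step, a random forest’s impurity-based feature importances correspond to the averaged decrease in Gini index described above; the forest size and random seed are arbitrary assumptions:

```python
# Sketch: Gini-importance ranking via a bagged ensemble of decision trees.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def gini_importances(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    # feature_importances_ is the mean decrease in node impurity
    # (Gini index) over all trees, as described above.
    return pd.Series(forest.feature_importances_,
                     index=X.columns).sort_values(ascending=False)
```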
In yet another embodiment of the present disclosure, the feature clustering unit 212 may be further configured to perform a filter-out technique whereby the data set is analyzed methodically and only a subset of features is kept for model selection, thereby keeping the model simple yet effective. Established feature selection techniques are applied to the data set to help filter out the features that do not significantly contribute toward building a practical yet straightforward model.
In yet another embodiment of the present disclosure, the feature clustering unit 212 may be further configured to select features that significantly influence performance. In an example embodiment of the present disclosure, the feature clustering unit 212 is configured to determine the cluster(s) containing the features which may contribute to the generation of the most relevant insight. For example, the feature clustering unit 212 determines that the features of cluster 0 and cluster 4 contribute most to the generation of the most relevant insight. Thus, the feature groups of clusters 1, 2, 3 and 5 as shown in Table 3 above are filtered out, and only the features of clusters 0 and 4 are retained for further processing by the feature clustering unit 212. A correlation chart of the selected features is shown in the below table (Table 9):
| correlation id | range | correlated features | highly correlated feature |
| perimeter_mean | 60-69 | radius_se, perimeter_se | perimeter_mean |
| | 70-79 | area_se | area_se |
| | 80-89 | - | - |
| | 90-100 | radius_mean, area_mean, radius_worst, perimeter_worst, area_worst | radius_mean |
| radius_mean | 60-69 | radius_se, perimeter_se | radius_se |
| | 70-79 | area_se | area_se |
| | 80-89 | - | - |
| | 90-100 | perimeter_mean, area_mean, radius_worst, perimeter_worst, area_worst | perimeter_mean |
Table 9
Therefore, the feature clustering unit 212 may determine that the features perimeter_mean and radius_mean are highly correlated to each other and contribute to predicting whether the patient’s tumor is benign or malignant. In other words, the features perimeter_mean and radius_mean are the two most relevant or important features of the tumor for diagnosing the stage of the cancer tumor. In this exemplary embodiment of the present disclosure, no other combination of features would yield as much information about the tumor as the combination of the features perimeter_mean and radius_mean.
Further, the feature clustering unit 212 may be configured to calculate “t-score” and “p-value” values, based on which the feature clustering unit 212 may determine the category into which the outcome of the data falls. For example, to determine whether the patient’s tumor is benign or malignant, the feature clustering unit 212 may calculate the “t-score” and “p-value” values for the two selected features, i.e., perimeter_mean and radius_mean.
In one embodiment of the present disclosure, the feature clustering unit 212 may be configured to calculate the “t-score” and “p-value” values based on below technique:
D = sqrt(M_std**2 / no_of_M + B_std**2 / no_of_B)
t-score = (M_mean - B_mean) / D
p-value = 2 * (1 - norm.cdf(abs(t-score)))
Where:
“M_mean” represents mean value of 1st category Malignant;
“M_std” represents standard deviation of 1st category Malignant;
“B_mean” represents mean value of 2nd category Benign;
“B_std” represents standard deviation of 2nd category Benign;
“No_of_M” represents size of 1st category Malignant;
“No_of_B” represents size of 2nd category Benign; and
“norm.cdf” represents normal gaussian cumulative distribution function.
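The three formulas above translate directly into Python; a minimal sketch assuming arrays of the feature values for the Malignant and Benign groups, with scipy.stats.norm supplying the “norm.cdf” referenced in the text (the use of the sample standard deviation, ddof=1, is an assumption):

```python
# Direct implementation of the t-score / p-value formulas defined above.
import numpy as np
from scipy.stats import norm

def t_score_p_value(malignant, benign) -> tuple[float, float]:
    m = np.asarray(malignant, dtype=float)   # feature values, 1st category
    b = np.asarray(benign, dtype=float)      # feature values, 2nd category
    # D = sqrt(M_std**2/no_of_M + B_std**2/no_of_B), sample std (ddof=1).
    D = np.sqrt(m.std(ddof=1) ** 2 / m.size + b.std(ddof=1) ** 2 / b.size)
    t = (m.mean() - b.mean()) / D
    p = 2 * (1 - norm.cdf(abs(t)))           # two-sided p-value
    return t, p
```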
In an example embodiment of the present disclosure, when the feature clustering unit 212 applies the above technique on the features radius_mean and perimeter_mean to calculate the “t-score” and “p-value” values, the below table (Table 10) represents the outcome:
Feature t-score p-value
radius_mean 2.91 0.0036
perimeter_mean 3.146 0.0016
Table 10
Thus, based on the above determined values, the feature clustering unit 212 may determine whether the tumour is to be categorized as malignant or benign.
In yet another embodiment of the present disclosure, the feature clustering unit 212 is further configured to perform a null hypothesis technique on the selected at least two features to ascertain whether the selected features are the best possible combination of features. For example, the null hypothesis states that there is no significant difference between the radius_mean/perimeter_mean of the target categories Malignant and Benign; rejection of the null hypothesis indicates that there is a significant difference between the radius_mean/perimeter_mean of the target categories Malignant and Benign. This procedure increases the confidence in the selected features, and any possibility of error may be reduced drastically. If the p-value is < 0.05, the null hypothesis may be rejected; if the p-value is >= 0.05, the null hypothesis may be accepted. Generally, the level of statistical significance is expressed as a p-value, a statistical measure that helps determine whether the hypothesis is correct and that lies between 0 and 1. The smaller the p-value, the stronger the evidence and the more statistically significant the result; hence, rejection of the null hypothesis becomes more likely as the p-value becomes smaller.
In yet another embodiment of the present disclosure, the query generation unit 214 may be configured to receive the at least two selected features (for example, radius_mean and perimeter_mean) and apply the at least two selected features to a pre-defined set of queries which may be mapped to the selected features for the creation of data insights. In another embodiment, the set of queries may be expanded based on the number of selected features. The below table (Table 11) represents the set of queries generated for the two selected features, for example radius_mean and perimeter_mean:
Sl No. Query
1 select radius_mean, perimeter_mean from dataset;
2 select avg(radius_mean), avg(perimeter_mean) from dataset;
3 select radius_mean, perimeter_mean from dataset where radius_mean > avg(radius_mean) and perimeter_mean > avg(perimeter_mean);
4 select radius_mean, perimeter_mean from dataset where radius_mean < avg(radius_mean) and perimeter_mean < avg(perimeter_mean);
5 select avg(radius_mean) from dataset join (select avg(radius_mean) rad_avg, stddev_pop(radius_mean) rad_std from dataset) stats where radius_mean between stats.rad_avg - stats.rad_std and stats.rad_avg + stats.rad_std;
6 select avg(perimeter_mean) from dataset join (select avg(perimeter_mean) per_avg, stddev_pop(perimeter_mean) per_std from dataset) stats where perimeter_mean between stats.per_avg - stats.per_std and stats.per_avg + stats.per_std;
Table 11
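The query set of Table 11 lends itself to simple templating, so the same pre-defined queries can be instantiated for whichever feature pair is selected; a sketch covering the first four queries (the table name “dataset” is taken from Table 11, and the helper name is hypothetical):

```python
# Sketch: instantiate the pre-defined query templates (queries 1-4 of
# Table 11) for any two selected features.
QUERY_TEMPLATES = [
    "select {a}, {b} from dataset;",
    "select avg({a}), avg({b}) from dataset;",
    "select {a}, {b} from dataset where {a} > avg({a}) and {b} > avg({b});",
    "select {a}, {b} from dataset where {a} < avg({a}) and {b} < avg({b});",
]

def build_queries(feature_a: str, feature_b: str) -> list[str]:
    return [t.format(a=feature_a, b=feature_b) for t in QUERY_TEMPLATES]

# build_queries("radius_mean", "perimeter_mean") reproduces queries 1-4.
```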
The data processed by the above query mechanism of the query generation unit 214 is now ready for further processing and can be fed into phase 2 of the system 104, the Dynamic Visualization Phase 108, as an input data set.
In one embodiment of the present disclosure, the Dynamic Visualization Engine (DVE) 216 may be configured to receive the data output by the query generation unit 214. The Insight Generation unit 218 of the Dynamic Visualization Engine (DVE) 216 may be configured to generate meaningful and actionable insights. The output is a detailed information page showcasing the data statistics. The system understands and creates entity-specific details which a user can then save as an insight dashboard 110. The insight generation unit 218 performs a logical insight generation mechanism which is the core where all the visualization is done.
In one embodiment of the present disclosure, the Dynamic Visualization Engine (DVE) 216 may be configured to receive the formed dataset as input and, after performing various technical operations like relationship mapping, calculate various permutations and combinations to filter and generate insights from the data. It then suggests an insightful dashboard 110 to the user, where the user can get a clear view and understanding of the data by visualizing it on a set of insightful data plots, for example graphical widgets, histograms, Venn diagrams, lists, charts, or any other type of widget for convenient visual representation of the insightful data. While the data passes through the Dynamic Visualization Engine (DVE) 216, a data learning stage starts. Next, system logic and machine learning algorithms are applied to the fed data set to produce the insightful dashboard 110, which the user can use for data drilling and can save, deploy, download, and communicate in the form of files. The insightful dashboard 110 may be configured in such a way that the user can remove one or more widgets that the user considers not relevant for the analysis.
FIG. 3 illustrates a flow chart of a computer implemented data processing method for generating useful insights from a data set. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions may include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform specific functions or implement specific abstract data types.
The order in which the method 300 is described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the spirit and scope of the subject matter described. In one embodiment of the present disclosure, the system 104 is configured to perform the method 300.
At step 302, the method includes applying data filtering technique to filter out unwanted data from the data set and classifying the filtered data set into a plurality of distinct features as explained in above paragraphs. In one embodiment of the present disclosure, the quality assessment unit 210 is configured to perform the step 302 of the method 300.
At step 304, the method 300 may include calculating a correlation score for each of the classified features with respect to a target variable. At step 306, the method 300 may include selecting a plurality of features, among the classified features, having a correlation score greater than or equal to a first pre-defined threshold. At step 308, the method 300 includes generating a feature correlation matrix for the selected plurality of features and grouping the selected plurality of features in one or more clusters based on a correlation value, wherein the features having relatable correlation values are kept in the same cluster. At step 310, the method 300 includes selecting one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold. At step 312, the method 300 includes selecting at least two features from the one or more selected clusters having a highest correlation degree. In one embodiment of the present disclosure, the feature clustering unit 212 may be configured to perform the steps 304-312 of the method 300.
At step 314, the method 300 includes processing the at least two selected features, using a pre-defined set of queries, to generate the useful insights. In one embodiment of the present disclosure, the query generation unit 214 may be configured to perform the step of processing the at least two selected features using the pre-defined set of queries. In one embodiment of the present disclosure, the insight generation unit 218 may be configured to perform the step of generating useful insights in accordance with the present disclosure.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps
or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., is non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a graphic processing unit (GPU), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine.
Advantages of the embodiment of the present disclosure are illustrated herein.
In an embodiment, the present disclosure provides an efficient, improved data processing technique which requires minimal human intervention for producing insightful data points.
In an embodiment, the present disclosure provides an improved feature correlation technique which accurately, efficiently, and economically produces insights for crucial decision-making or analysis, even when an enormous amount of information is present in a dataset.
REFERENCE NUMERALS
Environment 100
Data set 102
System 104
Data Query Phase 106
Dynamic Visualization Phase 108
Insight Dashboard 110
Processing Unit 202
I/O Interface 204
Memory 206
Data Query Subsystem 208
Data Quality Assessment Unit 210
Feature Clustering Unit 212
Query Generation Unit 214
Dynamic Visualization Engine (DVE) 216
Insight Generation Unit 218
Method 300
The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention(s)" unless expressly specified otherwise.
The terms "including", "comprising", “having” and variations thereof mean "including but not limited to", unless expressly specified otherwise.
The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise.
The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise.
A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.
When a single device or article is described herein, it will be clear that more than one device/article (whether they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether they cooperate), it will be clear that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively
embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
We Claim:
1. A computer implemented data processing method for generating insights from a data set,
the method comprising:
applying data filtering technique to filter out unwanted data from the data set and classifying the filtered data set into a plurality of distinct features;
calculating a correlation score for each of the classified features with respect to a target variable;
selecting a plurality of features, among the classified features, having correlation score greater than or equal to a first pre-defined threshold;
generating a feature correlation matrix for the selected plurality of features and grouping the selected plurality of features in one or more clusters based on a correlation value, wherein the features having relatable correlation value are kept in same cluster;
selecting one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold;
selecting at least two features from the one or more selected clusters having a highest correlation degree; and
processing the at least two selected features, using a pre-defined set of queries, to generate the useful insights.
2. The method as claimed in claim 1, wherein the data filtering technique further comprises removing at least one of incomplete data, redundant data, and duplicate data from the data set.
3. The method as claimed in claim 1, further comprising performing a null hypothesis technique on the at least two selected features to ascertain that the at least two selected features have the highest correlation degree.
4. The method as claimed in claim 1, wherein the f-score for each of the features of the selected one or more clusters is calculated by using a SelectKBest technique.
5. The method as claimed in claim 1, further comprising:
selecting at least two other features from the one or more selected clusters having subsequent correlation degree; and
processing the at least two other selected features using the pre-defined set of queries, to generate other useful insights.
6. A data processing system for generating useful insights from a data set, the system
comprises:
a memory;
a processing unit operationally coupled with the memory, the processing unit configured to:
apply data filtering technique to filter out unwanted data from the data set and classify the filtered data set into a plurality of distinct features;
calculate a correlation score for each of the classified features with respect to a target variable;
select a plurality of features, among the classified features, having correlation score greater than or equal to a first pre-defined threshold;
generate a feature correlation matrix for the selected plurality of features and group the selected plurality of features in one or more clusters based on a correlation value, wherein the features having relatable correlation value are kept in same cluster;
select one or more clusters among the plurality of clusters containing the features having an f-score greater than or equal to a second predefined threshold;
select at least two features from the one or more selected clusters having a highest correlation degree; and
process the at least two selected features, using a pre-defined set of queries, to generate the useful insights.
7. The data processing system as claimed in claim 6, wherein the processing unit is
further configured to:
apply data filtering technique by removing at least one of incomplete data, redundant data, and duplicate data from the data set.
8. The system as claimed in claim 6, wherein the processing unit is further configured
to:
perform null hypothesis technique on the at least two selected features to ascertain that the at least two selected features have the highest correlation degree.
9. The system as claimed in claim 6, wherein the processing unit is further configured
to:
calculate the f-score for each of the features of the selected one or more clusters using SelectKBest technique.
10. The system as claimed in claim 6, wherein the processing unit is further configured
to:
select at least two other features from the one or more selected clusters having a subsequent correlation degree; and
process the at least two other selected features using the pre-defined set of queries, to generate other useful insights.