Methods And Systems For Solving Multi-Collinearity During Regression Model Training
Abstract:
This disclosure relates generally to methods and systems for solving the multi-collinearity problem during regression model training using Jacobi polynomials. Conventional techniques such as principal component regression (PCR) and partial least squares (PLS) dump the training data into memory, require high memory utilization, and take more execution time. The present disclosure solves the multi-collinearity problem through two solutions selected based on the size of the training data. In particular, the present disclosure reduces the multicollinearity of the design matrix in the regression analysis. The first solution uses the matrix-inverse of the design matrix transformed with the Jacobi polynomial basis. This is particularly useful for small training data, and accurately determines the model parameters of the regression model. In the second solution, used when the training data is very large, the training data is divided into small chunks and the design matrix for each chunk is obtained through the Jacobi polynomial basis.
[To be published with FIG. 2]
Inventors
1. LAWHATRE, Prashant
Tata Consultancy Services Limited, Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune, Maharashtra 411013, India
2. GHAISAS, Smita
Tata Consultancy Services Limited, Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune, Maharashtra 411013, India
Specification
Claims:
We Claim:
1. A processor-implemented method (300) for solving a multi-collinearity during training a regression model, the method comprising the steps of:
receiving, via one or more hardware processors, a training data, and a size of the training data, for training the regression model, wherein the training data is associated with a plurality of independent variables and a dependent variable, wherein at least two independent variables out of the plurality of independent variables are multi-collinear (302);
obtaining, via the one or more hardware processors, a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique (304);
performing, via the one or more hardware processors, one of (i) a first solution when the size of the training data is less than a predefined data size, and (ii) a second solution when the size of the training data is greater than or equal to the predefined data size (306), wherein
the first solution (306a) comprises:
transforming the training data associated with each independent variable of the plurality of independent variables, with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable, to obtain a transformed training data associated with each independent variable, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable (306a1);
arranging the transformed training data associated with each independent variable, to obtain a transformed design matrix associated with the plurality of independent variables (306a2); and
determining one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity (306a3); and
the second solution (306b) comprises:
dividing the training data into one or more batches, based on a predefined batch size (306b1);
transforming the training data associated with each independent variable, present in each batch, with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable for each batch, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable (306b2);
arranging the transformed training data associated with each independent variable for each batch to obtain the transformed design matrix associated with the plurality of independent variables for each batch (306b3); and
determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch of the one or more batches, using a matrix batched optimization technique, for validating the multi-collinearity (306b4).
2. The method as claimed in claim 1, wherein the model hypothesis technique estimates a hypothesis relation between the plurality of independent variables and the dependent variable to obtain the degree of each independent variable of the plurality of independent variables.
3. The method as claimed in claim 1, wherein the predefined batch size defines a number of samples of training data present in each batch of the one or more batches.
4. The method as claimed in claim 1, wherein determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using the matrix-inverse technique, comprises:
determining one or more projections for one or more pairs formed from the plurality of independent variables, using the transformed design matrix associated with the plurality of independent variables;
determining a common mapping that transforms the one or more projections, to an orthogonal vector space;
determining an independent mapping for each independent variable, by transforming the transformed design matrix associated with the plurality of independent variables, using the common mapping; and
determining the one or more model parameters of the regression model, by transforming the dependent variable, using the independent mapping associated with each independent variable of the plurality of independent variables.
5. The method as claimed in claim 1, wherein determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch at a time, of the one or more batches, using the matrix batched optimization technique, comprises:
initializing the regression model with one or more initial model parameters;
obtaining a predicted value of the dependent variable, based on the transformed design matrix associated with the plurality of independent variables for each batch, using the one or more initial model parameters of the regression model;
minimizing a value of a loss function of the regression model, wherein the loss function is defined as a deviation between the predicted value of the dependent variable and the actual value of the dependent variable; and
determining the one or more model parameters of the regression model, after the one or more batches are completed.
6. A system (100) for solving a multi-collinearity during training a regression model, the system comprising:
a memory (112) storing instructions;
one or more Input/Output (I/O) interfaces (116); and
one or more hardware processors (114) coupled to the memory (112) via the one or more I/O interfaces (116), wherein the one or more hardware processors (114) are configured by the instructions to:
receive a training data, and a size of the training data, for training the regression model, wherein the training data is associated with a plurality of independent variables and a dependent variable, wherein at least two independent variables out of the plurality of independent variables are multi-collinear;
obtain a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique;
perform one of (i) a first solution when the size of the training data is less than a predefined data size, and (ii) a second solution when the size of the training data is greater than or equal to the predefined data size, wherein
the first solution comprises:
transforming the training data associated with each independent variable of the plurality of independent variables, with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable, to obtain a transformed training data associated with each independent variable, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable;
arranging the transformed training data associated with each independent variable, to obtain a transformed design matrix associated with the plurality of independent variables; and
determining one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity; and
the second solution comprises:
dividing the training data into one or more batches, based on a predefined batch size;
transforming the training data associated with each independent variable, present in each batch, with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable for each batch, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable;
arranging the transformed training data associated with each independent variable for each batch to obtain the transformed design matrix associated with the plurality of independent variables for each batch; and
determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch of the one or more batches, using a matrix batched optimization technique, for validating the multi-collinearity.
7. The system as claimed in claim 6, wherein the one or more hardware processors (114) are configured to estimate a hypothesis relation between the plurality of independent variables and the dependent variable to obtain the degree of each independent variable of the plurality of independent variables, using the model hypothesis technique.
8. The system as claimed in claim 6, wherein the predefined batch size defines a number of samples of training data present in each batch of the one or more batches.
9. The system as claimed in claim 6, wherein the one or more hardware processors (114) are configured to determine the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using the matrix-inverse technique, by:
determining one or more projections for one or more pairs formed from the plurality of independent variables, using the transformed design matrix associated with the plurality of independent variables;
determining a common mapping that transforms the one or more projections, to an orthogonal vector space;
determining an independent mapping for each independent variable, by transforming the transformed design matrix associated with the plurality of independent variables, using the common mapping; and
determining the one or more model parameters of the regression model, by transforming the dependent variable, using the independent mapping associated with each independent variable of the plurality of independent variables.
10. The system as claimed in claim 6, wherein the one or more hardware processors (114) are configured to determine the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch at a time, of the one or more batches, using the matrix batched optimization technique, by:
initializing the regression model with one or more initial model parameters;
obtaining a predicted value of the dependent variable, based on the transformed design matrix associated with the plurality of independent variables for each batch, using the one or more initial model parameters of the regression model;
minimizing a value of a loss function of the regression model, wherein the loss function is defined as a deviation between the predicted value of the dependent variable and the actual value of the dependent variable; and
determining the one or more model parameters of the regression model, after the one or more batches are completed.
Dated this 20th day of December 2021
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHODS AND SYSTEMS FOR SOLVING MULTI-COLLINEARITY DURING REGRESSION MODEL TRAINING
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of Jacobi linear regression model training, and, more particularly, to methods and systems for solving the multi-collinearity problem during regression model training using Jacobi polynomials.
BACKGROUND
The multicollinearity problem arises when a dependent variable depends on multiple independent variables out of which at least two independent variables are dependent on each other. The multicollinearity problem arises in the training data associated with the multiple independent variables in machine learning, especially while training regression models. A classical regression model makes predictions using model parameters obtained after training with the training data. When the training data is not multi-collinear, the model parameters may be estimated easily and the predictions from the regression model may be accurate. But when the training data is multi-collinear, the predictions from the regression model may deviate widely from the actual values and may be less robust.
Conventional techniques such as principal component regression (PCR) and partial least squares (PLS) basically reduce the dimensions of the training data to solve the multi-collinearity problem. When the size of the training data is very large, the said conventional techniques dump the training data into memory and perform a matrix decomposition, which may allow the regression model to train properly and to make the predictions. This approach needs high memory utilization and takes more execution time. Further, when new training data arrives, the conventional regression models need re-training on the entire training data, which is time consuming and resource intensive.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
In an aspect, a processor-implemented method for solving a multi-collinearity during training a regression model, is provided. The method includes the steps of: receiving a training data, and a size of the training data, for training the regression model, wherein the training data is associated with a plurality of independent variables and a dependent variable, wherein at least two independent variables out of the plurality of independent variables are multi-collinear; obtaining a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique; performing one of (i) a first solution when the size of the training data is less than a predefined data size, and (ii) a second solution when the size of the training data is greater than or equal to the predefined data size, wherein the first solution comprises: transforming the training data associated with each independent variable of the plurality of independent variables, with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable, to obtain a transformed training data associated with each independent variable, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable, to obtain a transformed design matrix associated with the plurality of independent variables; and determining one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity; and the second solution comprises: dividing the training data into one or more batches, based on a predefined batch size; transforming the training data associated with each independent variable, present in each batch, with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable for each batch, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable for each batch to obtain the transformed design matrix associated with the plurality of independent variables for each batch; and determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch of the one or more batches, using a matrix batched optimization technique, for validating the multi-collinearity.
In another aspect, a system for solving a multi-collinearity during training a regression model, is provided. The system includes: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a training data, and a size of the training data, for training the regression model, wherein the training data is associated with a plurality of independent variables and a dependent variable, wherein at least two independent variables out of the plurality of independent variables are multi-collinear; obtain a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique; perform one of (i) a first solution when the size of the training data is less than a predefined data size, and (ii) a second solution when the size of the training data is greater than or equal to the predefined data size, wherein the first solution comprises: transforming the training data associated with each independent variable of the plurality of independent variables, with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable, to obtain a transformed training data associated with each independent variable, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable, to obtain a transformed design matrix associated with the plurality of independent variables; and determining one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity; and the second solution comprises: dividing the training data into one or more batches, based on a predefined batch size; transforming the training data associated with each independent variable, present in each batch, with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable for each batch, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable for each batch to obtain the transformed design matrix associated with the plurality of independent variables for each batch; and determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch of the one or more batches, using a matrix batched optimization technique, for validating the multi-collinearity.
In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive a training data, and a size of the training data, for training the regression model, wherein the training data is associated with a plurality of independent variables and a dependent variable, wherein at least two independent variables out of the plurality of independent variables are multi-collinear; obtain a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique; perform one of (i) a first solution when the size of the training data is less than a predefined data size, and (ii) a second solution when the size of the training data is greater than or equal to the predefined data size, wherein the first solution comprises: transforming the training data associated with each independent variable of the plurality of independent variables, with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable, to obtain a transformed training data associated with each independent variable, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable, to obtain a transformed design matrix associated with the plurality of independent variables; and determining one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity; and the second solution comprises: dividing the training data into one or more batches, based on a predefined batch size; transforming the training data associated with each independent variable, present in each batch, with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable for each batch, wherein the degree of transformation for each independent variable is selected based on the degree of each independent variable; arranging the transformed training data associated with each independent variable for each batch to obtain the transformed design matrix associated with the plurality of independent variables for each batch; and determining the one or more model parameters of the regression model based on the transformed design matrix associated with the plurality of independent variables for each batch of the one or more batches, using a matrix batched optimization technique, for validating the multi-collinearity.
In an embodiment, the hypothesis relation between the plurality of independent variables and the dependent variable is estimated, to obtain the degree of each independent variable of the plurality of independent variables, using the model hypothesis technique.
In an embodiment, the predefined batch size defines a number of samples of training data present in each batch of the one or more batches.
In an embodiment, the one or more model parameters of the regression model are determined based on the transformed design matrix associated with the plurality of independent variables, using the matrix-inverse technique, by: determining one or more projections for one or more pairs formed from the plurality of independent variables, using the transformed design matrix associated with the plurality of independent variables; determining a common mapping that transforms the one or more projections, to an orthogonal vector space; determining an independent mapping for each independent variable, by transforming the transformed design matrix associated with the plurality of independent variables, using the common mapping; and determining the one or more model parameters of the regression model, by transforming the dependent variable, using the independent mapping associated with each independent variable of the plurality of independent variables.
In an embodiment, the one or more model parameters of the regression model are determined based on the transformed design matrix associated with the plurality of independent variables for each batch at a time, of the one or more batches, using the matrix batched optimization technique, by: initializing the regression model with one or more initial model parameters; obtaining a predicted value of the dependent variable, based on the transformed design matrix associated with the plurality of independent variables for each batch, using the one or more initial model parameters of the regression model; minimizing a value of a loss function of the regression model, wherein the loss function is defined as a deviation between the predicted value of the dependent variable and the actual value of the dependent variable; and determining the one or more model parameters of the regression model, after the one or more batches are completed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 is an exemplary block diagram of a system for solving a multi-collinearity during training a regression model, in accordance with some embodiments of the present disclosure.
FIG. 2 is an exemplary block diagram illustrating modules of the system of FIG. 1 for solving the multi-collinearity during training the regression model, in accordance with some embodiments of the present disclosure.
FIG. 3A and FIG. 3B illustrate exemplary flow diagrams of a processor-implemented method for solving the multi-collinearity during training the regression model, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Regression problems such as classical protein structure prediction can be expressed as a linear combination of certain functions called basis functions. Aiming to gain useful insights into data with complex structure, the basis function of the model can be changed to get non-linear regression models. A variety of basis functions exist that can be used for regression, viz. the Fourier basis, Radial Basis Functions, and splines, where each basis function has its own advantage. Basis functions such as the Fourier basis have specialized applications. Fourier series-based regression utilizes trigonometric functions to capture the periodicity in the data. Fourier series-based regression has also been used to predict missing data in applications such as remote sensing analysis and the production of lowland rice irrigation. Spline regression is useful when the nature of the relationship between variables is highly non-linear. Regression splines select pivot points from the dataset and then model a smooth function to fit the data in the pivot-defined bins. Applications of spline regression include housing price prediction and flood risk estimation. The Radial Basis Functions (RBF) are particularly useful for capturing the local properties of the datapoints in feature space. The Gaussian basis, in particular, has various important analytical properties. The other common RBFs are the multi-quadratic and inverse multi-quadratic functions.
Consider a gas emission prediction problem: independent or explanatory variables like ambient temperature are required to determine the level of carbon monoxide (CO) or nitrogen oxides (NOx) emission. The values of these variables can be arranged into a matrix known as the data or design matrix. The design matrix is later used to determine the parameters of the model. The phenomenon of multicollinearity occurs when attributes (columns) of the data (design matrix), like turbine energy yield and exhaust pressure, are linearly dependent. It can cause a range of problems. Firstly, it can cause a large sampling variance that affects the precision of the estimation of regression coefficients. Secondly, it may cause regression coefficients to take diminishing values due to a large sampling variance. Thirdly, it may also result in widening the confidence interval of the estimates. Lastly, it may affect the robustness and stability of estimates, since the Ordinary Least Squares (OLS) estimation is sensitive to small changes in the attribute values.
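To make the symptom concrete, the following small NumPy illustration (not part of the specification; the variable names and values are invented for demonstration) builds a design matrix with two nearly collinear columns and shows the effect: an enormous condition number of X^T X, which makes the OLS estimate (X^T X)^{-1} X^T y numerically unstable and highly sensitive to noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
temp = rng.uniform(20.0, 35.0, m)                  # ambient temperature
energy = 3.0 * temp + rng.normal(0.0, 1e-4, m)     # turbine energy yield, almost collinear with temp
X = np.column_stack([np.ones(m), temp, energy])    # design matrix with an intercept column
y = 2.0 + 0.5 * temp + rng.normal(0.0, 0.1, m)     # dependent variable (e.g., NOx emission)

# A very large condition number of X^T X signals multicollinearity: the OLS
# coefficients become imprecise and unstable under small perturbations.
print("cond(X^T X):", np.linalg.cond(X.T @ X))
beta = np.linalg.solve(X.T @ X, X.T @ y)           # ordinary least squares estimate
print("OLS coefficients:", beta)
```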
To circumvent the multi-collinearity problem during regression model training, either Principal Components Regression (PCR) or Partial Least Squares (PLS) is used in the art. Both PCR and PLS essentially find orthogonal axes or a latent structure and then project the sample points on them. PCR determines the orthogonal axes by maximizing the variance in the data, while PLS finds attributes in the data that best describe the quantity to be predicted. PCR and PLS address both variability and correlation at the same time. However, in the big data regime, both classical PCR and PLS have some limitations. Primarily, when the training data is updated, the analysis requires re-calculation over all the available entities in the training data, which is inefficient. Also, when the training data is large, it is often impossible to load the data matrix, thereby making it difficult to apply singular value decomposition (SVD).
Hence, multi-collinearity causes a host of problems such as imprecision in estimation, less stable and robust estimates, diminished regression coefficients, and a widened confidence interval. Various regression models proposed in the art rely on the standard polynomial basis, which may not handle the multi-collinearity present in the training data.
The disclosed methods and systems of the present disclosure solve the multi-collinearity problem through two solutions selected based on the size of the training data. In particular, the present disclosure reduces the multicollinearity of the design matrix in the regression analysis. The first solution uses the matrix-inverse of the design matrix transformed with the Jacobi polynomial basis. This is particularly useful for small training data, for which it accurately determines the model parameters of the regression model. In the second solution, used when the training data is very large, the training data is divided into small chunks and the design matrix for each chunk is obtained through the Jacobi polynomial basis.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 3B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary systems and/or methods.
FIG. 1 is an exemplary block diagram of a system 100 for solving a multi-collinearity during training a regression model, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes or is otherwise in communication with one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104, the memory 102, and the I/O interface(s) 106 may be coupled to a system bus 108 or a similar mechanism.
The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer and the like. Further, the I/O interface(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases.
The I/O interface(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the I/O interface(s) 106 may include one or more ports for connecting a number of computing systems with one another or to another server computer. Further, the I/O interface(s) 106 may include one or more ports for connecting a number of devices to one another or to another server.
The one or more hardware processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes a plurality of modules 102a and a repository 102b for storing data processed, received, and generated by one or more of the plurality of modules 102a. The plurality of modules 102a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
The plurality of modules 102a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 100. The plurality of modules 102a may also be used as, signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 102a can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. In an embodiment, the plurality of modules 102a can include various sub-modules (not shown in FIG. 1). Further, the memory 102 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
The repository 102b may include a database or a data engine. Further, the repository 102b, amongst other things, may serve as a database or includes a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 102a. Although the repository 102b is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 102b can also be implemented external to the system 100, where the repository 102b may be stored within an external database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such external database may be periodically updated. For example, data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 102b may be distributed between the system 100 and the external database.
Referring collectively to FIG. 2 and FIG. 3A - 3B, components and functionalities of the system 100 are described in accordance with an example embodiment of the present disclosure. For example, FIG. 2 is an exemplary block diagram illustrating modules 200 of the system 100 of FIG. 1 for solving the multi-collinearity during training the regression model, in accordance with some embodiments of the present disclosure. As shown in FIG. 2, the modules 200 include a multi-collinearity solver 202 and a model parameter estimator 204. The multi-collinearity solver 202 further includes a model hypothesis module 202a and a Jacobi polynomial transformer 202b. The model parameter estimator 204 further includes a matrix-inverse solver 204a and an optimizer 204b. The multi-collinearity solver 202 is used to remove the multi-collinearity present in the training data, and the model parameter estimator 204 is used to estimate the model parameters based on the multi-collinearity-free training data.
FIG. 3A and FIG. 3B illustrate exemplary flow diagrams of a processor-implemented method 300 for solving the multi-collinearity during training the regression model, in accordance with some embodiments of the present disclosure. Although steps of the method 300 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
At step 302 of the method 300, the one or more hardware processors 104 of the system 100 are configured to receive a training data and a size of the training data, for training the regression model. The training data is associated with the plurality of independent variables and the dependent variable. More specifically, the dependent variable depends on the plurality of independent variables. The training data associated with the plurality of independent variables has the multi-collinearity problem. That is, at least two independent variables out of the plurality of independent variables are multi-collinear in nature. For example, one independent variable may be dependent on another independent variable, or one or more independent variables may be dependent on one or more other independent variables, or one or more independent variables may be dependent on one independent variable. For simplification, at least two independent variables may be dependent on each other. Determining such a dependent variable is always complex and the estimation may not converge. Hence, such multi-collinearity in the training data associated with the plurality of independent variables must be resolved to obtain multi-collinearity-free training data.
The size of the training data is represented either in terms of a number of bytes or in terms of a number of samples. For example, the size of the training data may be 2 gigabytes (GB) or 5,000 samples.
At step 304 of the method 300, the one or more hardware processors 104 of the system 100 are configured to obtain a degree of each independent variable of the plurality of independent variables, using a model hypothesis technique present in the model hypothesis module 202a of the multi-collinearity solver 202. The model hypothesis technique estimates a hypothesis relation between the plurality of independent variables and the dependent variable to obtain the degree of each independent variable of the plurality of independent variables. The hypothesis relation is in the form of the Jacobi polynomial basis function that defines the relation between the dependent variable and the plurality of independent variables. In an embodiment, the model hypothesis technique may be one of a histogram technique or a line plotting technique, which derives the hypothesis relation between the plurality of independent variables and the dependent variable, based on the training data associated with them. The hypothesis relation is a polynomial relation or expression. Then, from the derived hypothesis relation, the degree of each independent variable is obtained based on the degree (power) associated with each independent variable.
For example, if Y is the dependent variable and X_1, X_2, X_3 are the independent variables, and further an exemplary hypothesis relation between the dependent variable Y and the independent variables X_1, X_2, X_3 derived by the hypothesis technique is $Y = X_1^2 + X_2^3 + X_3^3$, then the degree of the independent variable X_1 is '2', the degree of the independent variable X_2 is '3' and the degree of the independent variable X_3 is '3'.
At step 306 of the method 300, the one or more hardware processors 104 of the system 100 are configured to perform one of (i) a first solution and (ii) a second solution, based on the size of the training data received at step 302 of the method 300. The first solution 306a is performed when the size of the training data is less than a predefined data size. For example, if the size of the training data is less than 64 GB, then the first solution 306a is performed on the training data to solve the multi-collinearity problem. The second solution 306b is performed when the size of the training data is greater than or equal to the predefined data size. For example, if the size of the training data is greater than or equal to 10,000 samples, then the second solution 306b is performed on the training data to solve the multi-collinearity problem.
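A minimal dispatch sketch of this size-based selection is given below. The threshold value and the helper functions `fit_matrix_inverse` and `fit_batched` are hypothetical names used only for illustration (they correspond to the sketches given further below); for brevity the sketch assumes a design matrix Z that is already Jacobi-transformed, whereas the method transforms each batch separately in the second solution.

```python
PREDEFINED_DATA_SIZE = 64 * 2**30   # assumed threshold of 64 GB, matching the example above

def fit_regression(Z, y, data_size_bytes, batch_size=64):
    """Select the first or the second solution based on the training-data size."""
    if data_size_bytes < PREDEFINED_DATA_SIZE:
        return fit_matrix_inverse(Z, y)            # first solution (step 306a)
    batches = [(Z[i:i + batch_size], y[i:i + batch_size])
               for i in range(0, len(y), batch_size)]
    return fit_batched(batches, Z.shape[1])        # second solution (step 306b)
```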
The first solution 306a to solve the multi-collinearity problem in the training data is further explained through steps 306a1 to 306a3. At step 306a1, the training data associated with each independent variable of the plurality of independent variables is transformed with Jacobi polynomials, using a Jacobi polynomial transformation technique, based on a degree of transformation for each independent variable. After the transformation, a transformed training data associated with each independent variable is obtained. The Jacobi polynomial transformation technique is present in the Jacobi polynomial transformer 202b of the multi-collinearity solver 202. The degree of transformation for each independent variable is selected based on the degree of each independent variable obtained at step 304 of the method 300. For example, if the degree of the independent variable is '3', then the degree of transformation for such independent variable is considered as '3' while transforming the training data associated with such independent variable to obtain the transformed training data. The transformed training data associated with each independent variable of the plurality of independent variables ensures multi-collinearity problem-free training data, and the transformed training data can safely be used for training the regression models.
In the Jacobi polynomial transformation technique, the transformation is iteratively computed in steps, with the number of steps equal to the degree determined in the model hypothesis technique. During each iteration, in the first step, the number of combinations of an aggregated degree, i.e., the sum of the degree and a first model parameter α, is determined with respect to the current iteration number, where the first model parameter α is considered for all the successive steps. In the second step, the number of combinations of an aggregated degree, i.e., the sum of the degree and a second model parameter β, is determined with respect to the remaining iteration number, where the second model parameter β is taken for all the completed steps. In the third step, each independent variable is eroded (reduced) by one unit, scaled to half, and then amplified (raised) to the number of completed steps. In the fourth step, each independent variable is coagulated (increased) by one unit, scaled to half, and then amplified (raised) to the number of remaining steps. After all the steps are completed, the effect of each iteration is aggregated to obtain the transformed data.
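This iterative description corresponds (up to indexing convention) to the explicit sum formula for the Jacobi polynomials, $P_n^{(\alpha,\beta)}(x) = \sum_{s=0}^{n} \binom{n+\alpha}{n-s}\binom{n+\beta}{s}\left(\frac{x-1}{2}\right)^{s}\left(\frac{x+1}{2}\right)^{n-s}$, where the two binomial factors are the "combinations of aggregated degree" and the two power terms are the eroded and coagulated variables. A minimal NumPy sketch of one transformed column is given below; the function name, the default α = β = 0, and the assumption that inputs have been rescaled to [−1, 1] are illustrative choices, not taken from the specification.

```python
import numpy as np
from scipy.special import binom  # generalized binomial coefficient

def jacobi_transform(x, degree, alpha=0.0, beta=0.0):
    """Evaluate the Jacobi polynomial P_degree^(alpha, beta) at each sample in x.

    x is assumed rescaled to [-1, 1], the interval on which the Jacobi
    polynomials are orthogonal.
    """
    x = np.asarray(x, dtype=float)
    n = degree
    result = np.zeros_like(x)
    for s in range(n + 1):                          # one term per step
        comb_alpha = binom(n + alpha, n - s)        # combinations of degree + alpha
        comb_beta = binom(n + beta, s)              # combinations of degree + beta
        eroded = ((x - 1.0) / 2.0) ** s             # reduce by one, halve, raise to completed steps
        coagulated = ((x + 1.0) / 2.0) ** (n - s)   # increase by one, halve, raise to remaining steps
        result += comb_alpha * comb_beta * eroded * coagulated
    return result

# The transformed design matrix stacks the Jacobi features of every independent
# variable up to its hypothesis degree, e.g. degrees 2 and 3 for x1 and x2:
# Z = np.column_stack([jacobi_transform(x1, k) for k in (1, 2)] +
#                     [jacobi_transform(x2, k) for k in (1, 2, 3)])
```

For α = β = 0 this reduces to the Legendre polynomials; the result can be cross-checked against `scipy.special.eval_jacobi(n, alpha, beta, x)`.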
However, to check whether the transformed training data is free of the multi-collinearity problem, it is important to validate the model parameters of the regression model after training. The model parameters of the regression model are obtained using the model parameter estimator 204. For this, at step 306a2, the transformed training data associated with each independent variable of the plurality of independent variables is arranged in a matrix form, to obtain a transformed design matrix associated with the plurality of independent variables.
Next, at step 306a3, one or more model parameters of the regression model after the training are determined based on the transformed design matrix associated with the plurality of independent variables, using a matrix-inverse technique, for validating the multi-collinearity problem. Determining the one or more model parameters of the regression model using the matrix-inverse technique is further explained through steps 306a3a to 306a3d. The matrix-inverse technique is present in the matrix-inverse solver 204a of the model parameter estimator 204.
At step 306a3a, one or more pairs are formed from the plurality of independent variables, wherein each pair comprises at least two independent variables out of the plurality of variables. Then, one or more projections for the one or more pairs are determined, from the plurality of independent variables, using the transformed design matrix associated with the plurality of independent variables.
At step 306a3b, a common mapping that transforms the one or more projections, to an orthogonal vector space, is determined. At step 306a3c, an independent mapping for each independent variable, is determined, by transforming the transformed design matrix associated with the plurality of independent variables, using the common mapping determined at step 306a3b.
Lastly, at step 306a3d, the one or more model parameters of the regression model are determined by transforming the dependent variable, using the independent mapping associated with each independent variable of the plurality of independent variables. The one or more model parameters of the regression model obtained at step 306a3d define the trained regression model after training with the transformed training data obtained at step 306a1. The trained regression model is validated to test its accuracy. If the accuracy of the trained regression model is more than a predefined threshold accuracy, then the regression model has been trained properly, which means the multi-collinearity problem present in the training data received at step 302 of the method 300 is solved.
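Steps 306a3a to 306a3d can be read as an orthogonalization of the transformed design matrix: the pairwise projections form the Gram matrix Z^T Z, the common mapping carries its columns to an orthogonal vector space, and the model parameters follow by projecting the dependent variable back through that mapping. A minimal NumPy sketch under this reading is shown below; the specification does not fix the particular factorization, so the use of a Cholesky factor here is an assumption.

```python
import numpy as np

def fit_matrix_inverse(Z, y):
    """Determine regression parameters from the transformed design matrix Z."""
    G = Z.T @ Z                  # 306a3a: projections for pairs of (transformed) variables
    R = np.linalg.cholesky(G).T  # 306a3b: common mapping; columns of Z @ inv(R) are orthonormal
    R_inv = np.linalg.inv(R)
    Q = Z @ R_inv                # 306a3c: independent mapping for each variable
    theta = R_inv @ (Q.T @ y)    # 306a3d: transform the dependent variable to get parameters
    return theta                 # algebraically equal to inv(Z.T @ Z) @ Z.T @ y
```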
The second solution 306b to solve the multi-collinearity problem in the training data is further explained through steps 306b1 to 306b4. At step 306b1, the training data is divided into one or more batches, based on a predefined batch size, as the training data in this case is very large. The predefined batch size defines a number of samples, out of the total samples of the training data, present in each batch of the one or more batches. The predefined batch size may be, for example, 32 samples, 64 samples, 128 samples, and so on.
At step 306b2, the training data associated with each independent variable of the plurality of independent variables, present in each batch, is transformed with the Jacobi polynomials, using the Jacobi polynomial transformation technique, based on the degree of transformation for each independent variable, to obtain the transformed training data associated with each independent variable, as mentioned at step 306a1 of the method 300. The Jacobi polynomial transformation technique is present in the Jacobi polynomial transformer 202b of the multi-collinearity solver 202. The degree of transformation for each independent variable is selected based on the degree of each independent variable obtained at step 304 of the method 300. For example, if the degree of the independent variable is '3', then the degree of transformation for such independent variable is considered as '3' while transforming the training data associated with such independent variable to obtain the transformed training data. The transformed training data associated with each independent variable of the plurality of independent variables for each batch ensures multi-collinearity problem-free training data, and the transformed training data associated with each batch can safely be used for training the regression models.
However, to check whether the transformed training data obtained at step 306b2 is free of the multi-collinearity problem, it is important to validate the model parameters of the regression model after training. The model parameters of the regression model are obtained using the model parameter estimator 204. For this, at step 306b3, the transformed training data associated with each independent variable of the plurality of independent variables for each batch is arranged in a matrix form, to obtain the transformed design matrix associated with the plurality of independent variables for each batch.
Next, at step 306b4, one or more model parameters of the regression model associated with each batch of the one or more batches are determined after the training. The one or more model parameters are determined based on the transformed design matrix associated with the plurality of independent variables, using the matrix batched optimization technique, for validating the multi-collinearity problem, as explained at step 306a3. The matrix batched optimization technique is present in the optimizer 204b of the model parameter estimator 204. Determining the one or more model parameters of the regression model using the matrix batched optimization technique in this case is further explained through steps 306b4a to 306b4d.
At step 306b4a, the regression model is initialized with one or more initial model parameters while training with the transformed training data. At step 306b4b, a predicted value of the dependent variable is obtained, based on the transformed design matrix associated with the plurality of independent variables for each batch, using the one or more initial model parameters of the regression model.
At step 306b4c, a value of a loss function of the regression model is minimized. The loss function is defined as a deviation between the predicted value of the dependent variable obtained at step 306b4b and the actual value of the dependent variable obtained from the transformed training data through the transformed design matrix. If the value of the loss function of the regression model is less than a predefined loss function threshold value, then the initial model parameters are adjusted and the adjusted model parameters are considered for the next batch to obtain the predicted value of the dependent variable for successive batches. In this manner, the regression model is trained with the transformed training data for all the batches.
At step 306b4d, the one or more model parameters of the regression model are determined after the one or more batches are completed. The one or more model parameters of the regression model obtained at step 306b4d define the trained regression model after training with the transformed training data obtained at step 306b2 for each batch. The trained regression model is validated to test its accuracy. If the accuracy of the trained regression model is more than the predefined threshold accuracy, then the regression model has been trained properly, which means the multi-collinearity problem present in the training data received at step 302 of the method 300 is solved.
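A minimal NumPy sketch of steps 306b4a to 306b4d as mini-batch gradient descent on a squared-error loss is given below; the learning rate, the epoch count, and the choice of gradient descent as the optimizer are assumptions, since the specification only requires that the loss be minimized batch by batch.

```python
import numpy as np

def fit_batched(batches, n_params, lr=1e-2, epochs=10):
    """Estimate parameters from Jacobi-transformed design matrices, one batch at a time."""
    theta = np.zeros(n_params)                        # 306b4a: initial model parameters
    for _ in range(epochs):
        for Z_b, y_b in batches:                      # Z_b: transformed design matrix of the batch
            y_pred = Z_b @ theta                      # 306b4b: predicted dependent variable
            residual = y_pred - y_b                   # deviation from the actual value
            grad = 2.0 * Z_b.T @ residual / len(y_b)  # gradient of the mean squared-error loss
            theta -= lr * grad                        # 306b4c: adjust parameters to reduce loss
    return theta                                      # 306b4d: parameters after all batches
```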
The matrix batched optimization technique utilizes a network model having a batch normalization layer, a Jacobi polynomial transformer layer, and a dense layer, to solve the multi-collinearity problem. The batch normalization layer is used to normalize each batch of the training data, the batch of the training data is transformed using the Jacobi polynomial transformer layer, and the dense layer is used to generate the network parameters of the regression model using the transformed training data.
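A sketch of such a network in PyTorch is shown below; the layer sizes, the values of α and β, and the use of the standard three-term recurrence inside the Jacobi transformer layer are assumptions made for illustration (the specification names only the three layers, not their implementation).

```python
import torch
import torch.nn as nn

class JacobiTransformer(nn.Module):
    """Expands each normalized feature into Jacobi features P_1..P_degree,
    computed with the standard three-term recurrence."""
    def __init__(self, degree, alpha=0.0, beta=0.0):
        super().__init__()
        self.degree, self.a, self.b = degree, alpha, beta

    def forward(self, x):
        a, b = self.a, self.b
        p = [torch.ones_like(x), ((a + b + 2.0) * x + (a - b)) / 2.0]  # P_0, P_1
        for n in range(1, self.degree):  # generate P_{n+1} from P_n and P_{n-1}
            c1 = 2.0 * (n + 1) * (n + a + b + 1) * (2 * n + a + b)
            c2 = (2 * n + a + b + 1) * (a * a - b * b)
            c3 = (2 * n + a + b + 1) * (2 * n + a + b) * (2 * n + a + b + 2)
            c4 = 2.0 * (n + a) * (n + b) * (2 * n + a + b + 2)
            p.append(((c2 + c3 * x) * p[-1] - c4 * p[-2]) / c1)
        return torch.cat(p[1:], dim=1)  # drop the constant P_0; the dense layer has a bias

def make_model(n_features, degree):
    return nn.Sequential(
        nn.BatchNorm1d(n_features),          # batch normalization layer
        JacobiTransformer(degree),           # Jacobi polynomial transformer layer
        nn.Linear(n_features * degree, 1),   # dense layer producing the prediction
    )
```

Note that batch normalization brings features to roughly zero mean and unit variance rather than exactly into [−1, 1]; a sketch can tolerate this, but a practical implementation may add an explicit rescaling before the Jacobi layer.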
Proof to show that Jacobi polynomials solve multicollinearity in datasets:
Jacobi polynomials have numerous applications, such as: activation functions in feed-forward networks, improving computational speed, forecasting precision and structural robustness; facilitating an acceleration technique in the gossip problem, improving its convergence; image moments for invariant description; and so on. These applications can be credited to certain analytical properties of these polynomials.
Notably, the Jacobi polynomials satisfy an orthogonality condition and have symmetry relations. Moreover, the Jacobi polynomials are a generalized case of the Chebyshev, Legendre, and Zernike polynomials. These properties of the Jacobi polynomials make it important to explore the possibility of using them as the basis of regression to solve the multi-collinearity.
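For reference, the orthogonality condition satisfied by the Jacobi polynomials over [−1, 1] (a standard result, restated here for completeness) is:

$$\int_{-1}^{1} (1-x)^{\alpha} (1+x)^{\beta}\, P_m^{(\alpha,\beta)}(x)\, P_n^{(\alpha,\beta)}(x)\, dx = 0, \qquad m \neq n,\ \alpha, \beta > -1,$$

and setting α = β = 0 recovers the Legendre polynomials, while α = β = −1/2 recovers the Chebyshev polynomials of the first kind.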
Consider a design matrix X
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \qquad x_i \in \mathbb{R}^n, \quad X \in \mathbb{R}^{m \times n}$$
Since X is multi-collinear,
$$\sum_{j \in J \setminus J'} s_j\, x_{ij} = x_{ij'} \qquad \forall\, i$$
where the s_j are scalars, J is the set of linearly independent column indices and J' is the set of linearly dependent column indices, which means:
$$\operatorname{rank}(X) < n$$