
Method And System For Data Selection From Multi Label Dataset To Create Balanced Training Dataset

Abstract: State of the art systems handling data balancing for ML techniques focus on single labels and not on multi-label classification. Some prior works discuss balancing training data for multi-label classification but generate data synthetically to manage the imbalance. Some prior works discuss having more data but do not mention how to pick balanced training data from the collected data in the case of multi-label classification. A method and system for data selection from a multi-label dataset to create a balanced training dataset enables balancing the data by breaking the data into the various combinations of classes and putting it back together: the problem is expressed as a set of constrained linear equations, solving the equations is converted into a minimization problem based on a customized gradient descent technique, and a solution vector is derived. The balanced dataset can be used for training a ML model for classification or other tasks. [To be published with FIG. 2A and 2B]


Patent Information

Filing Date: 11 March 2020
Publication Number: 38/2021
Publication Type: INA
Invention Field: ELECTRONICS
Email: kcopatents@khaitanco.com
Grant Date: 27 January 2025

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai 400021 Maharashtra, India

Inventors

1. PATNAIK, Amaresh
Tata Consultancy Services Limited Plot No 54, Second Floor, Intersil Building, SEEPZ, Andheri East Mumbai 400096 Maharashtra, India

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention: METHOD AND SYSTEM FOR DATA SELECTION FROM MULTI-LABEL DATASET TO CREATE BALANCED TRAINING DATASET
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD [001] The embodiments herein generally relate to the field of machine learning and, more particularly, to a method and system for data selection from a multi-label dataset to create a balanced training dataset for training a Machine Learning (ML) model for multi-label classification.
BACKGROUND
[002] Data classification algorithms in machine learning (ML) require training data to be balanced across the multiple classes that are present in a dataset. Balancing refers to the count of samples present for each class in the training dataset being approximately the same or in a comparable range. A balanced training dataset is critical for achieving high accuracy in ML based classification. Most of the existing methods focus on generating synthetic data to obtain a well-balanced training dataset. However, the accuracy of classification with an ML model trained using synthetically balanced data is low compared to scenarios where real data is selected appropriately to generate the balanced training dataset. Moreover, in areas where data is available in abundance (running into millions of rows), it is more appropriate to rely on real data than on synthetic data for achieving the highest possible accuracy in classification results.
[003] Moreover, selecting the training data for multi-label classification, wherein an input data sample can belong to multiple classes, is a further challenge. Herein, the expectation is to select the data such that an approximately equal number of samples per label is present in the selected training data. However, the available dataset, from which a training dataset is to be selected, is a multi-label dataset with each data sample having one or more labels, and the combination of labels is unordered and varies across data samples. In such scenarios, even manual intervention or supervised data selection is challenging and can hardly provide the right balance of all labels in the training dataset, considering the huge volume of data to be analyzed.
[004] The data balancing problem has been worked upon in literature. However, some of the prior works deal only with single labels and do not deal with multi-label classification, while some of the prior works discuss multi-label classification but generate data synthetically to manage the imbalance. Some of the prior works discuss having more data but do not mention how to pick balanced training data from the collected data in the case of multi-label classification.
SUMMARY [005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for creating a balanced training dataset for multi-label classification is provided. The method comprises acquiring from a database the multi-label dataset comprising a plurality of rows of imbalanced data, wherein each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset. Further, the method comprises processing the multi-label dataset to: 1) arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels; 2) determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group; 3) determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, each label is present in; 4) identify a minimum second count from the set of second counts, wherein the minimum second count equals a minimum value in the set of second counts; and 5) assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents a number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset. Furthermore, the method comprises generating a set of linear equations for the plurality of labels, wherein each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, and wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and

wherein each equation is equated to the minimum second count. Furthermore, the method comprises representing the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix representing the set of equations, x in the matrix equation is a vector representing the set of variables, and b in the matrix equation is the vector representing the minimum second count; determining a solution vector (s) for the set of variables in the vector (x) by minimizing a vector length ||Ax – b|| under constraints (x <= c) based on a customized gradient descent technique, wherein the constraints (c) is a vector representing the set of first counts. Furthermore, the method comprises creating the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum; and training a Machine Learning (ML) model for multi-label classification using the balanced training dataset.
[006] Further, determining of the solution vector (s) is performed using an iterative process to minimize the vector length (||Ax – b||) under constraints (x <= c) based on the customized gradient descent technique. The iterative process comprises: initializing a corresponding value in the solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label; and initializing a corresponding value in the solution vector (s) to the corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label. The iterative process further comprises calculating a gradient of the length of the vector (||Ax – b||) for a given value of the solution vector (s) for a particular index of the solution vector s[index], wherein: if the gradient is positive and if the value of the s[index] is greater than zero, decrementing the s[index] by 1; if the gradient is positive and if the value of the s[index] is zero, retaining s[index] as zero; if the gradient is negative and if the value of the s[index] is less than the corresponding value in the vector (c), incrementing the s[index] by 1; and if the gradient is negative and if the value of the s[index] is equal to the corresponding value in the vector (c), retaining the value of s[index]. Further, the step of calculating the gradient is repeated until a value of the solution vector (s) repeats a prior computed value.

[007] In another aspect, a system for creating a balanced training dataset for multi-label classification is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to acquire from a database the multi-label dataset comprising a plurality of rows of imbalanced data, wherein each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset. Further, the one or more hardware processors are configured to process the multi-label dataset to: 1) arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels; 2) determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group; 3) determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, each label is present in; 4) identify a minimum second count from the set of second counts, wherein the minimum second count equals a minimum value in the set of second counts; and 5) assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents the number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset. Furthermore, the one or more hardware processors are configured to generate a set of linear equations for the plurality of labels, wherein each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, and wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and wherein each equation is equated to the minimum second count. Furthermore, the one or more hardware processors are configured to represent the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix

representing the set of equations, x in the matrix equation is a vector representing the set of variables, and b in the matrix equation is the vector representing the minimum second count; determine a solution vector (s) for the set of variables in the vector (x) by minimizing the vector length ||Ax – b|| under constraints (x <= c) based on a customized gradient descent technique, wherein the constraints (c) is a vector representing the set of first counts. Furthermore, the one or more hardware processors are configured to create the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum; and train a Machine Learning (ML) model for multi-label classification using the balanced training dataset.
[008] Further, determining of the solution vector (s) is performed using an iterative process to minimize the vector length (||Ax – b||) under constraints (x <= c) based on the customized gradient descent technique. The iterative process comprises: initializing a corresponding value in the solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label; and initializing a corresponding value in the solution vector (s) to the corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label. The iterative process further comprises calculating a gradient of the length of the vector (||Ax – b||) for a given value of the solution vector (s) for a particular index of the solution vector s[index], wherein: if the gradient is positive and if the value of the s[index] is greater than zero, decrementing the s[index] by 1; if the gradient is positive and if the value of the s[index] is zero, retaining s[index] as zero; if the gradient is negative and if the value of the s[index] is less than the corresponding value in the vector (c), incrementing the s[index] by 1; and if the gradient is negative and if the value of the s[index] is equal to the corresponding value in the vector (c), retaining the value of s[index]. Further, the step of calculating the gradient is repeated until a value of the solution vector (s) repeats a prior computed value.
[009] In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a

method for creating a balanced training dataset for multi-label classification. The method comprises acquiring from a database the multi-label dataset comprising a plurality of rows of imbalanced data, wherein each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset. Further, the method comprises processing the multi-label dataset to: 1) arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels; 2) determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group; 3) determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, each label is present in; 4) identify a minimum second count from the set of second counts, wherein the minimum second count equals a minimum value in the set of second counts; and 5) assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents a number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset. Furthermore, the method comprises generating a set of linear equations for the plurality of labels, wherein each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, and wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and wherein each equation is equated to the minimum second count. Furthermore, the method comprises representing the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix representing the set of equations, x in the matrix equation is a vector representing the set of variables, and b in the matrix equation is the vector representing the minimum second count; determining a solution vector (s) for the set of variables in the vector (x) by minimizing the vector length ||Ax – b|| under constraints (x <= c) based on a customized gradient descent technique, wherein the

constraints (c) is a vector representing the set of first counts. Furthermore, the method comprises creating the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum; and training a Machine Learning (ML) model for multi-label classification using the balanced training dataset.
[0010] Further, determining of the solution vector (s) is performed using an iterative process to minimize the vector length (||Ax – b||) under constraints (x <= c) based on the customized gradient descent technique. The iterative process comprises: initializing a corresponding value in the solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label; and initializing a corresponding value in the solution vector (s) to the corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label. The iterative process further comprises calculating a gradient of the length of the vector (||Ax – b||) for a given value of the solution vector (s) for a particular index of the solution vector s[index], wherein: if the gradient is positive and if the value of the s[index] is greater than zero, decrementing the s[index] by 1; if the gradient is positive and if the value of the s[index] is zero, retaining s[index] as zero; if the gradient is negative and if the value of the s[index] is less than the corresponding value in the vector (c), incrementing the s[index] by 1; and if the gradient is negative and if the value of the s[index] is equal to the corresponding value in the vector (c), retaining the value of s[index]. Further, the step of calculating the gradient is repeated until a value of the solution vector (s) repeats a prior computed value.
[0011] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS [0012] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:

[0013] FIG. 1 is a functional block diagram of a system for data selection from a multi-label dataset to create a balanced training dataset for multi-label classification, in accordance with some embodiments of the present disclosure.
[0014] FIG. 2A and FIG. 2B illustrate a flow diagram of a method for data selection from a multi-label dataset to create a balanced training dataset for multi-label classification, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS [0015] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[0016] Embodiments herein provide a method and system for data selection from a multi-label dataset to create a balanced training dataset for training a ML model for multi-label classification. Unlike existing methods that synthetically generate a balanced training dataset, the method disclosed utilizes the available dataset to automatically select appropriate data such that there exists a balance among the labels present in the training dataset, effectively improving the classification accuracy of the ML model. The method disclosed represents the data balancing problem as a linear algebra problem by exploring the combinations and variation in the dataset. Further, the linear equation is solved using a customized gradient descent technique, under identified constraints derived from the variation in labels present across the dataset. A solution vector obtained for the linear equation enables selecting appropriate data samples such that the multiple labels present in the training dataset are balanced. Further, the training dataset so obtained is used to train

a Machine Learning (ML) model for multi-label classification. An example of a multi-label dataset is a patent database at a repository of any patent office. Herein, each data sample is a piece of patent literature, which may fall into one or more classes (typically, CPC classes identifying the technology domain the patent literature belongs to). Thus, selecting the right combination of patent literature samples so that all technology classes (CPC classes) have an almost equal representation in the training data is a challenging task, and a solution for the data selection is provided by the method disclosed. Another example of multi-label classification is a database of movies, wherein each movie may fall into multiple classes (herein, movie genres). In the travel domain, a destination could be classified as hill station, beach, religious, historical etc., where each destination can belong to multiple classes. Similarly, news articles can be classified as politics, sports, movies, business and so on, and each news article can belong to multiple classes at the same time. Being able to classify automatically by examining the given data means a huge reduction in human effort, at the same time helping customers to choose their items easily by filtering data as per their interest. Providing a balanced dataset during training ensures the training is not biased towards classes that have more rows of data, and thus increases the accuracy of automated prediction of classification.
[0017] Referring now to the drawings, and more particularly to FIGS. 1 through 2B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0018] FIG. 1 is a functional block diagram of a system 100 for data selection from a multi-label dataset to create the balanced training dataset for multi-label classification, in accordance with some embodiments of the present disclosure.
[0019] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware

processors is configured to execute functions of one or more functional blocks of the system 100.
[0020] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 is configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, a network cloud and the like.
[0021] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, a touch user interface (TUI) and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.
[0022] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. Thus, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
[0023] Further, the memory 102 may include a database 108, which may store the dataset to be analyzed to select data samples to generate the balanced

training dataset. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106. Functions of the components of system 100 are explained in conjunction with flow diagram of FIGS. 2A and 2B for data selection from a multi-label dataset to create the balanced training dataset for multi-label classification.
[0024] FIG. 2A and FIG. 2B illustrate a flow diagram of a method for data selection from a multi-label dataset to create a balanced training dataset for multi-label classification, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0025] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagram as depicted in FIG. 2A and FIG. 2B. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[0026] Referring to the steps of the method 200, at step 202, one or more hardware processors 104 are configured to acquire from a database such as the database 108, a multi-label dataset comprising a plurality of rows of imbalanced data. Each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset.
[0027] For example, the database 108 can be an X patent office database with a huge volume of granted patents and patent applications arranged as rows of data or data samples. The method 200 thus acquires from this huge volume a sample dataset, which comprises a plurality of rows or data samples from which the

balanced training dataset is to be selected. The plurality of rows, in the example herein, corresponds to patents and patent applications, wherein every patent or patent application falls into one or more technology classes (say, CPC classes). Thus, every patent or patent application may have one or more labels (CPC classes). A representative sample dataset acquired from a patent database is provided below in Table 1. However, it can be understood that the actual dataset acquired will have a very large number of rows: TABLE 1:

Sample patent row | Patent Text | CPC class (labels)
1 | … | A
2 | … | B
3 | … | B, C
4 | … | A
5 | … | C
6 | … | A, B, C
7 | … | A
8 | … | A, B
9 | … | A
10 | … | A, B
[0028] Once the multi-label dataset is obtained, at step 204 of the method 200, one or more hardware processors 104 are configured to process the multi-label dataset to:
1) Arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels.

2) Determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group.
3) Determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, each label is present in. For Table 1, the set of second counts is (label A has 7 rows, label B has 5 rows and label C has 3 rows).
4) Identify a minimum second count from the set of second counts, wherein the minimum second count equals a minimum value in the set of second counts. Thus, the minimum of the set of second counts for the sample dataset of Table 1 is the number of rows label C falls into, which is equal to 3.
5) Assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents a number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset.
The processed multi-label dataset for the sample dataset of Table 1 is provided in Table 2 below: TABLE 2:

# (group identified in the sample data) | Number of rows each group has in the sample dataset (first count) | Variable assigned to each group, indicating the number of rows of the group to be present in the balanced training dataset
GRP1 (A) | 4 | n1
GRP2 (B) | 1 | n2
GRP3 (C) | 1 | n3
GRP4 (A, B) | 2 | n4
GRP5 (B, C) | 1 | n5
GRP6 (A, B, C) | 1 | n6
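By way of illustration only, the processing at step 204 can be sketched in a few lines of Python; the row data below mirrors Table 1, and all names are illustrative rather than taken from the specification:

from collections import Counter

# Rows mirroring Table 1: (row number, set of class labels).
rows = [
    (1, {"A"}), (2, {"B"}), (3, {"B", "C"}), (4, {"A"}), (5, {"C"}),
    (6, {"A", "B", "C"}), (7, {"A"}), (8, {"A", "B"}), (9, {"A"}),
    (10, {"A", "B"}),
]

# Groups are unordered unique combinations of labels (frozensets as keys);
# first_counts gives the number of rows in each group (Table 2, first count).
first_counts = Counter(frozenset(labels) for _, labels in rows)

# second_counts gives, for each label, the number of rows it appears in.
second_counts = Counter(label for _, labels in rows for label in labels)

# The minimum second count: the scarcest label bounds the balanced dataset.
min_second_count = min(second_counts.values())

print(first_counts)      # {A}: 4, {A,B}: 2, {B}: 1, {C}: 1, {B,C}: 1, {A,B,C}: 1
print(second_counts)     # A: 7, B: 5, C: 3
print(min_second_count)  # 3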
[0029] Upon processing the multi-label dataset, at step 206 of the method 200, one or more hardware processors (104) are configured to generate a set of linear equations for the plurality of labels. Each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and wherein each equation is equated to the minimum second count. For the sample dataset, 3 rows containing each of the labels A, B and C are required:
n1 + n4 + n6 = 3 (coefficients of n2, n3, n5 = 0) (1)
n2 + n4 + n5 + n6 = 3 (coefficients of n1, n3 = 0) (2)
n3 + n5 + n6 = 3 (coefficients of n1, n2 = 0) (3)
[0030] At step 208 of the method 200, one or more hardware processors (104) are configured to represent the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix representing the set of equations, x in the matrix equation is a vector representing the set of variables, and b in the matrix equation is the vector representing the minimum second count. At step 210 of the method 200, the one or more hardware processors (104) are configured to determine a solution vector (s) for the set of variables in the vector (x) by minimizing the vector length ||Ax – b|| under constraints (x <= c), wherein the constraints (c) is a vector representing the set of first counts.

Ax = b bounded by constraints ‘c’ (5)

Constraints ‘c’ are as below
n1 <= 4
n2 <= 1
n3 <= 1
n4 <= 2
n5 <= 1
n6 <= 1 (6)
Minimize ||Ax – b|| bounded by constraints ‘c’, wherein ‘s’ is the solution vector
for the variables corresponding to x (7)
[0031] For the example considered herein, A, c and b are as follows (columns of A and entries of c ordered by groups GRP1 to GRP6, i.e., variables n1 to n6):

A =
[[1 0 0 1 0 1]
 [0 1 0 1 1 1]
 [0 0 1 0 1 1]]
c = [4 1 1 2 1 1]
b = [3 3 3]
[0032] The one or more hardware processors 104 are configured to determine the solution vector (s) using an iterative process to minimize the vector length (||Ax – b||) under constraints (x <= c) based on the customized gradient descent technique. Unlike the well-known usage of the gradient descent approach for minimizing a loss function while training a neural network, the method disclosed herein utilizes the customized gradient descent technique to minimize vector lengths to solve linear equations. In the conventional gradient descent approach, weights are reduced by the learning rate multiplied by the gradient. However, the method disclosed herein modifies the conventional gradient descent approach to increment or decrement by 1, or to not increment/decrement at all, depending on the gradient. Further, conventional gradient descent stops after a fixed number of steps, whereas the method herein introduces a stopping condition. Thus, the method disclosed herein customizes conventional gradient descent for solving the linear equation to rightly provide the solution vector. The steps of the iterative process to obtain the solution vector, illustrated in the code sketch following the list below, comprise:
a) Initializing a corresponding value in solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label.
b) Initializing a corresponding value in the solution vector(s) to the corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label.
c) Calculating a gradient of length of the vector (||Ax – b||) for a given value of the solution vector (s) for a particular index of the solution vector s[index].
if the gradient is positive and if value of the s[index] is greater than zero, decrementing the s[index] by 1,
if the gradient is positive and if value of the s[index] is zero, retain s[index] as zero (indicates the number of rows cannot be negative);
if the gradient is negative and if value of the s[index] is less than the corresponding value in vector (c), incrementing the s[index] by 1; and
if the gradient is negative and if value of the s[index] is equal to the corresponding value in vector (c), retain the value of s[index] (indicates the number of rows cannot exceed the corresponding value in the constraints vector).
d) Repeating the step of calculating the gradient until a value of the solution vector (s) repeats a prior computed value.
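As a non-limiting illustration, the iterative process above may be sketched in Python/NumPy as follows. The sketch assumes the whole solution vector is swept once per pass (the update order is not fixed by the disclosure), and the function and variable names are illustrative:

import numpy as np

def solve_balanced(A, b, c, group_sizes):
    # Minimize ||Ax - b|| subject to 0 <= x <= c using unit steps per
    # coordinate, per the customized gradient descent technique.
    # group_sizes[i] is the number of labels in group i (initialization only).
    s = np.where(np.asarray(group_sizes) > 1, c, 0).astype(float)
    seen = [s.copy()]
    while True:
        residual = A @ s - b
        # The sign of d||As - b||/ds[i] equals the sign of (A.T @ residual)[i].
        grad = A.T @ residual
        for i in range(len(s)):
            if grad[i] > 0 and s[i] > 0:
                s[i] -= 1          # step down; row counts cannot go below zero
            elif grad[i] < 0 and s[i] < c[i]:
                s[i] += 1          # step up; row counts cannot exceed c[i]
        # Stop once the solution vector repeats a prior computed value.
        if any(np.array_equal(s, prev) for prev in seen):
            return s
        seen.append(s.copy())

# Example of Table 2, with A, b and c as in paragraph [0031]
# (columns/variables n1..n6 correspond to GRP1..GRP6):
A = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
b = np.array([3.0, 3.0, 3.0])
c = np.array([4.0, 1.0, 1.0, 2.0, 1.0, 1.0])
print(solve_balanced(A, b, c, group_sizes=[1, 1, 1, 2, 2, 3]))
# -> [0. 0. 1. 1. 0. 0.]: one row each from GRP3 and GRP4

With this sweep order, the sketch reproduces the solution discussed in paragraph [0034] below.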
[0033] For the above example, explained below are the steps followed to determine the solution vector by minimizing the vector length ||Ax – b|| in accordance with the customized gradient descent approach.
Step 1. Determining the gradient of the vector length with respect to each of the variables
Let g = ||Ax – b|| = h^(1/2), where
h = (n1 + n4 + n6 - 3)^2 + (n2 + n4 + n5 + n6 - 3)^2 + (n3 + n5 + n6 - 3)^2
Differentiating, the gradient of g with respect to n1 is h^(-1/2) · p, where p = (n1 + n4 + n6 - 3).
Since h^(-1/2) is always positive, sign of gradient with respect to n1 = sign of p.
Hence sign of gradient w.r.t n1 = sign of (n1 + n4 + n6 - 3)
Similarly, sign of gradient w.r.t n2 = sign of (n2 + n4 + n5 + n6 - 3)
Sign of gradient w.r.t n3 = sign of (n3 + n5 + n6 - 3)
Sign of gradient w.r.t n4 = sign of [(n1 + n4 + n6 - 3) + (n2 + n4 + n5 + n6 - 3)]
Sign of gradient w.r.t n5 = sign of [(n2 + n4 + n5 + n6 - 3) + (n3 + n5 + n6 - 3)]
Sign of gradient w.r.t n6 = sign of [(n1 + n4 + n6 - 3) + (n2 + n4 + n5 + n6 - 3) + (n3 + n5 + n6 - 3)]
Step 2. Initial value of the solution vector: following the initialization rules, the single-label groups GRP1, GRP2 and GRP3 start at 0 and the multi-label groups GRP4, GRP5 and GRP6 start at their constraint values, giving
s = (n1, n2, n3, n4, n5, n6) = (0, 0, 0, 2, 1, 1)
Step 3. In each iteration, the gradient signs are evaluated at the current value of 's', and each variable is incremented or decremented by 1, or retained, according to the rules above.
Step 4. (Iteration terminates.) The value of 's' matches the value computed two steps back, hence the solution is:
s = (n1, n2, n3, n4, n5, n6) = (0, 0, 1, 1, 0, 0)
[0034] Thus, the number of rows of GRP1, GRP2, GRP5 and GRP6 is zero, and hence these groups need not be picked. However, for GRP3 and GRP4, one row per group is to be selected. As can be seen, considering the elements (labels) present in each group selected, the obtained training dataset has label A, label B and label C represented equally.

[0035] As can be understood, the above example is illustrative for the sake of explanation, and the method can be applied to any number of data samples.
[0036] Once the solution vector ’s’ is obtained, at step 212 of the method 200, the one or more hardware processors (104) are configured to create the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum.
[0037] At step 214 of the method 200, one or more hardware processors (104) are configured to train the ML model for multi-label classification using the balanced training dataset. Typically, for the example of patent classification mentioned herein, which involves text data, a pre-trained Word2Vec model can be used to vectorize the text. The vectorized text is then passed to a multilayer neural network used for multi-label classification. A ReLU activation function is used for the intermediate layers and a sigmoid activation function for the output layer. However, after balancing, any machine learning model architecture could be used. Once trained, the ML model is able to classify any test input, such as a patent text, into the associated one or more labels (such as one or more CPC classes).
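One possible realization of such a pipeline is sketched below, assuming averaged pre-trained Word2Vec vectors as features; the gensim model name, layer sizes and loss are illustrative choices not fixed by the specification:

import numpy as np
import gensim.downloader
from tensorflow import keras

# Assumption: a publicly available pre-trained Word2Vec model (300-d vectors);
# the specification does not name a specific model.
w2v = gensim.downloader.load("word2vec-google-news-300")

def vectorize(text):
    # One simple choice: average the vectors of in-vocabulary tokens.
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

num_labels = 10  # e.g., the 10 CPC subclasses appearing in the test results
model = keras.Sequential([
    keras.layers.Input(shape=(300,)),
    keras.layers.Dense(128, activation="relu"),   # ReLU intermediate layers
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(num_labels, activation="sigmoid"),  # sigmoid output
])
# Binary cross-entropy treats each label independently, as multi-label needs.
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_balanced, y_multi_hot, epochs=10, batch_size=32)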
[0038] As mentioned, the problem of data balancing for multi-label classification is challenging. Even when the data balancing problem is represented as a linear system of equations, only a rightly selected approach to solve it will lead to an optimal data balancing solution. There exist many challenges in solving the data balancing equation, stated below, which the method addresses by using the customized gradient descent technique to solve the linear equation representing the data balancing problem.

a) With a straightforward approach, the problem can be solved by taking rows (data samples) in the order in which they occur in the database and stopping when a balanced dataset is obtained. However, with a database having a large volume of unordered imbalanced data, the success of this approach depends entirely on the order in which data is present in the database, and it may fail to provide a truly balanced dataset.
b) Another simple approach may select one row from each combination (of one or more labels), filling in all combinations at the same pace; however, it is observed that even this approach depends on getting the rows (types of data samples) in the right order.
c) In another approach, all rows from the combinations having multiple classes can be taken first, and then rows from the combinations having a single class (label) can be taken. Such an approach can work for simpler cases, but for more complex cases it requires taking either a negative number of rows or more than the existing number of rows.
d) A brute force method of trying all combinations of possible values for the solution vector of the linear equation representing the balancing problem is computationally impossible even for a small number of classes.
e) Further, manual examination of the equations can be performed, and they may be solved for a small number of variables; however, it is observed that as the number of variables increases, a manual solution becomes almost impossible.
f) Another approach can be reducing the A matrix to row echelon form; however, in such a scenario there exist too many free variables. The free variables cannot take any possible value because of the set of constraints represented by the vector 'c'.
g) LU factorization to determine the pivot and non-pivot columns of the A matrix can be used; however, it is observed that the scipy library in Python gives the lower matrix in just a triangular form and not in row echelon form. Even if a proper LU matrix is obtained, the problem of too many constrained free variables remains.

[0039] Thus, the method disclosed enables balancing the data by breaking the data into the various combinations of classes and putting it back together: the problem is expressed as a set of constrained linear equations, solving the equations is converted into a minimization problem, and an approach based on gradient descent works out the solution vector.
[0040] The method disclosed provides a fully automatic, computation and time efficient approach, which is independent of the order in which data samples in the dataset are encountered. Thus, it provides an optimally balanced training dataset selected from large datasets.
[0041] TEST RESULTS: The balanced training dataset obtained, using the method and the system disclosed herein, from a sample dataset associated with a patent database is described below.
[0042] First, the method disclosed herein follows a data preparation step to automatically identify a dataset (data samples) to be processed to extract the balanced training dataset. The data preparation includes: 1) identify the number of data rows wanted per CPC class (label) from the database (for example, for 32 CPC subclasses, it is observed that 1000 rows per subclass gives good results; for different use cases the number may be different; this also depends on the availability of that many rows in the database); and 2) find out how many classes have that many rows in the database. If a row is picked, all the CPC classes (labels) present in that row are processed. Thus, rows where all the CPC classes present in the row are not of interest can be ignored.
[0043] After data preparation, the system 100 loops through all the rows and collects all occurrences of all CPC classes in an array. Thereafter, the system 100 finds out how many distinct classes are there in the array, and how many times each class is present in the array. Here is a sample output of this step. Note the minimum value for a CPC class (label), which is 1223 (the minimum second count) here. Distribution: [('G06F', 12029), ('H04L', 8882), ('H04W', 4517), ('H04N', 3351), ('G06Q', 2560), ('A61B', 2181), ('G06T', 1944), ('G06K', 1715), ('H04B', 1391), ('H04M', 1223)]

[0044] The system 100 then loops through all the rows again, collects all groups (combinations of labels) in an array as they occur, and identifies how many distinct combinations (groups) are there in the array and how many times each combination (number of rows per group) is present in the array. Here is a sample output of this step, along with a serial number given to every combination.
[0045] [(0, (('G06F',), 7574)), (1, (('H04L',), 3175)), (2, (('G06F', 'H04L'), 1866)), (3, (('A61B',), 1830)), (4, (('H04N',), 1804)), (5, (('H04W',), 1455)), (6, (('H04L', 'H04W'), 1417)), (7, (('G06Q',), 1025)), (8, (('G06T',), 635)), (9, (('G06K',), 553)), (10, (('G06F', 'G06Q'), 534)), (11, (('H04B', 'H04L'), 434)), (12, (('G06T', 'H04N'), 316)), (13, (('H04B', 'H04W'), 311)), (14, (('H04B', 'H04L', 'H04W'), 292)), (15, (('G06K', 'G06T'), 288)), (16, (('H04M',), 278)), (17, (('G06F', 'H04N'), 270)), (18, (('G06F', 'G06K'), 235)), (19, (('G06F', 'G06Q', 'H04L'), 228)), (20, (('H04L', 'H04N'), 221)), (21, (('G06Q', 'H04L'), 218)), (22, (('G06F', 'H04L', 'H04W'), 199)), (23, (('G06F', 'G06T'), 195)), (24, (('H04L', 'H04M', 'H04W'), 172)), (25, (('H04M', 'H04W'), 158)), (26, (('H04B',), 146)), (27, (('A61B', 'G06T'), 135)), (28, (('G06K', 'H04N'), 134)), (29, (('H04L', 'H04M'), 133)), (30, (('G06F', 'H04M'), 123)), (31, (('G06K', 'G06T', 'H04N'), 96)), (32, (('A61B', 'G06F'), 70)), (33, (('G06Q', 'H04L', 'H04W'), 65)), (34, (('G06F', 'H04L', 'H04N'), 65)), (35, (('G06F', 'H04W'), 63)), (36, (('G06F', 'G06K', 'G06T'), 60)), (37, (('G06Q', 'H04N'), 59)), (38, (('G06K', 'G06Q'), 55)), (39, (('G06Q', 'H04W'), 52)), (40, (('G06F', 'G06T', 'H04N'), 42)), (41, (('G06F', 'G06K', 'H04N'), 42)), (42, (('H04L', 'H04N', 'H04W'), 41)), (43, (('G06Q', 'H04M'), 34)), (44, (('H04B', 'H04M'), 31)), (45, (('G06F', 'H04M', 'H04W'), 25)), (46, (('G06F', 'H04L', 'H04M', 'H04W'), 25)), (47, (('A61B', 'G06K', 'G06T'), 25)), (48, (('A61B', 'G06K'), 24)), (49, (('G06F', 'G06K', 'G06Q'), 22)), (50, (('G06F', 'G06K', 'H04L'), 22)), (51, (('G06F', 'H04B'), 21)), (52, (('H04L', 'H04M', 'H04N'), 20)), (53, (('H04N', 'H04W'), 19)), (54, (('G06F', 'H04L', 'H04M'), 19)), (55, (('H04M', 'H04N'), 19)), (56, (('G06K', 'H04B'), 18)), (57, (('G06F', 'G06Q', 'G06T'), 17)), (58, (('G06F', 'G06Q', 'H04L', 'H04W'), 17)), (59, (('G06Q', 'H04L', 'H04M'), 17)), (60, (('H04B', 'H04N'), 17)), (61, (('A61B', 'G06F', 'G06T'), 16)), (62, (('G06Q', 'H04L', 'H04N'), 16)), (63, (('G06F', 'G06Q', 'H04N'), 15)), (64, (('H04B', 'H04M', 'H04W'), 14)), (65,

(('G06Q', 'G06T'), 14)), (66, (('G06F', 'H04B', 'H04L'), 14)), (67, (('G06F', 'G06K', 'G06T', 'H04N'), 14)), (68, (('G06F', 'G06T', 'H04L'), 14)), (69, (('G06Q', 'H04M', 'H04W'), 14)), (70, (('G06F', 'G06Q', 'H04L', 'H04M', 'H04W'), 13)), (71, (('G06Q', 'H04L', 'H04M', 'H04W'), 13)), (72, (('G06F', 'G06Q', 'H04L', 'H04N'), 11)), (73, (('A61B', 'H04N'), 10)), (74, (('G06F', 'H04B', 'H04W'), 10)), (75, (('G06F', 'H04M', 'H04N'), 10)), (76, (('H04B', 'H04L', 'H04M', 'H04W'), 9)), (77, (('G06F', 'H04L', 'H04N', 'H04W'), 9)), (78, (('A61B', 'G06F', 'G06K'), 8)), (79, (('G06K', 'H04L'), 8)), (80, (('G06F', 'G06Q', 'H04M'), 8)), (81, (('G06F', 'G06Q', 'H04W'), 8)), (82, (('G06K', 'H04W'), 8)), (83, (('G06K', 'G06Q', 'G06T'), 8)), (84, (('H04B', 'H04L', 'H04M'), 8)), (85, (('A61B', 'G06T', 'H04N'), 7)), (86, (('A61B', 'H04W'), 7)), (87, (('G06F', 'H04B', 'H04M'), 7)), (88, (('A61B', 'G06F', 'G06Q'), 7)), (89, (('H04B', 'H04L', 'H04N', 'H04W'), 6)), (90, (('G06T', 'H04L'), 6)), (91, (('G06F', 'G06Q', 'H04L', 'H04M'), 6)), (92, (('G06F', 'H04N', 'H04W'), 5)), (93, (('A61B', 'H04L', 'H04W'), 5)), (94, (('A61B', 'G06Q'), 4)), (95, (('H04B', 'H04L', 'H04N'), 4)), (96, (('G06F', 'G06K', 'H04L', 'H04W'), 4)), (97, (('G06F', 'G06Q', 'G06T', 'H04N'), 4)), (98, (('A61B', 'G06F', 'G06K', 'G06T'), 4)), (99, (('G06F', 'H04B', 'H04L', 'H04W'), 4)), (100, (('G06F', 'G06K', 'G06Q', 'H04L'), 4)), (101, (('A61B', 'G06F', 'H04L'), 4)), (102, (('G06F', 'H04B', 'H04M', 'H04W'), 4)), (103, (('G06F', 'G06K', 'G06Q', 'H04N'), 4)), (104, (('G06F', 'G06T', 'H04L', 'H04N'), 4)), (105, (('G06F', 'G06Q', 'H04M', 'H04W'), 4)), (106, (('G06F', 'G06K', 'G06Q', 'G06T'), 4)), (107, (('G06Q', 'H04B', 'H04L', 'H04W'), 4)), (108, (('G06F', 'G06K', 'G06Q', 'H04L', 'H04W'), 4)), (109, (('G06K', 'H04L', 'H04N'), 3)), (110, (('G06K', 'G06Q', 'G06T', 'H04N'), 3)), (111, (('H04L', 'H04M', 'H04N', 'H04W'), 3)), (112, (('G06K', 'H04B', 'H04W'), 3)), (113, (('H04B', 'H04N', 'H04W'), 3)), (114, (('G06F', 'G06K', 'H04L', 'H04M'), 3)), (115, (('G06F', 'H04L', 'H04M', 'H04N'), 3)), (116, (('G06K', 'G06Q', 'H04L'), 3)), (117, (('H04M', 'H04N', 'H04W'), 3)), (118, (('G06F', 'G06Q', 'G06T', 'H04L'), 3)), (119, (('G06K', 'G06Q', 'H04B'), 2)), (120, (('A61B', 'H04L'), 2)), (121, (('G06K', 'H04B', 'H04L'), 2)), (122, (('A61B', 'G06F', 'H04B'), 2)), (123, (('G06F', 'G06K', 'G06Q', 'H04B'), 2)), (124, (('G06F', 'G06K', 'H04M', 'H04N', 'H04W'), 2)), (125, (('A61B', 'G06F', 'H04L', 'H04W'), 2)), (126, (('G06K', 'H04M'), 2)), (127, (('G06F', 'G06K', 'H04L', 'H04N'), 2)), (128, (('G06T', 'H04B'),

2)), (129, (('G06Q', 'H04B'), 2)), (130, (('A61B', 'G06K', 'G06T', 'H04N'), 2)), (131, (('A61B', 'G06F', 'G06K', 'G06T', 'H04N'), 2)), (132, (('G06F', 'G06T', 'H04M'), 2)), (133, (('A61B', 'H04B', 'H04W'), 2)), (134, (('G06F', 'G06K', 'H04M', 'H04W'), 2)), (135, (('G06F', 'G06K', 'G06Q', 'G06T', 'H04N'), 2)), (136, (('G06Q', 'H04L', 'H04M', 'H04N'), 2)), (137, (('G06F', 'H04B', 'H04L', 'H04M', 'H04W'), 2)), (138, (('G06F', 'G06K', 'H04M'), 2)), (139, (('G06K', 'H04L', 'H04W'), 2)), (140, (('A61B', 'G06F', 'G06Q', 'H04L'), 2)), (141, (('G06F', 'H04L', 'H04M', 'H04N', 'H04W'), 2)), (142, (('G06F', 'G06K', 'G06Q', 'H04L', 'H04N'), 2)), (143, (('G06Q', 'G06T', 'H04N'), 2)), (144, (('G06F', 'G06K', 'H04W'), 2)), (145, (('G06F', 'H04M', 'H04N', 'H04W'), 2)), (146, (('G06T', 'H04L', 'H04N'), 2)), (147, (('G06F', 'G06Q', 'H04B', 'H04L', 'H04M', 'H04W'), 2)), (148, (('G06K', 'H04N', 'H04W'), 1)), (149, (('G06Q', 'H04N', 'H04W'), 1)), (150, (('G06F', 'G06Q', 'H04L', 'H04M', 'H04N'), 1)), (151, (('G06F', 'G06Q', 'H04M', 'H04N'), 1)), (152, (('A61B', 'G06F', 'G06T', 'H04N'), 1)), (153, (('G06K', 'G06T', 'H04N', 'H04W'), 1)), (154, (('G06F', 'G06K', 'G06Q', 'G06T', 'H04L', 'H04W'), 1)), (155, (('G06T', 'H04B', 'H04M'), 1)), (156, (('G06K', 'G06T', 'H04L'), 1)), (157, (('G06F', 'G06K', 'G06T', 'H04W'), 1)), (158, (('H04B', 'H04M', 'H04N', 'H04W'), 1)), (159, (('G06K', 'H04B', 'H04N'), 1)), (160, (('G06K', 'H04L', 'H04M', 'H04W'), 1)), (161, (('G06F', 'G06K', 'G06T', 'H04L', 'H04N', 'H04W'), 1)), (162, (('G06F', 'G06K', 'G06T', 'H04L', 'H04N'), 1)), (163, (('G06Q', 'G06T', 'H04L', 'H04N'), 1)), (164, (('G06F', 'G06Q', 'G06T', 'H04M', 'H04N'), 1)), (165, (('G06F', 'G06K', 'H04M', 'H04N'), 1)), (166, (('G06F', 'G06K', 'G06Q', 'H04B', 'H04L', 'H04M', 'H04W'), 1)), (167, (('G06Q', 'H04L', 'H04M', 'H04N', 'H04W'), 1)), (168, (('A61B', 'G06F', 'H04L', 'H04M'), 1)), (169, (('G06K', 'H04B', 'H04L', 'H04W'), 1)), (170, (('G06F', 'G06T', 'H04M', 'H04N'), 1)), (171, (('A61B', 'G06F', 'G06Q', 'H04L', 'H04M', 'H04N', 'H04W'), 1)), (172, (('A61B', 'G06F', 'H04M'), 1)), (173, (('G06F', 'H04B', 'H04L', 'H04M'), 1)), (174, (('G06F', 'G06Q', 'H04L', 'H04M', 'H04N', 'H04W'), 1)), (175, (('G06K', 'H04L', 'H04M', 'H04N', 'H04W'), 1)), (176, (('G06F', 'G06K', 'H04L', 'H04M', 'H04W'), 1)), (177, (('A61B', 'G06K', 'G06Q', 'G06T'), 1)), (178, (('A61B', 'G06F', 'G06Q', 'H04L', 'H04W'), 1)), (179, (('G06F', 'G06Q', 'H04L', 'H04N', 'H04W'), 1)), (180, (('G06Q', 'H04B', 'H04W'), 1)), (181, (('A61B', 'H04B',

'H04L'), 1)), (182, (('G06F', 'G06K', 'G06T', 'H04L'), 1)), (183, (('G06K', 'H04M', 'H04N'), 1)), (184, (('G06Q', 'H04M', 'H04N', 'H04W'), 1)), (185, (('G06K', 'G06Q', 'H04W'), 1)), (186, (('A61B', 'G06F', 'G06K', 'H04W'), 1)), (187, (('G06K', 'G06Q', 'H04N'), 1)), (188, (('G06F', 'G06T', 'H04L', 'H04W'), 1)), (189, (('G06F', 'G06Q', 'G06T', 'H04L', 'H04N'), 1)), (190, (('G06K', 'H04M', 'H04W'), 1)), (191, (('A61B', 'G06K', 'H04M'), 1)), (192, (('G06T', 'H04M'), 1)), (193, (('A61B', 'G06F', 'H04N'), 1)), (194, (('G06K', 'G06Q', 'H04L', 'H04W'), 1)), (195, (('G06K', 'G06Q', 'H04M'), 1)), (196, (('G06K', 'G06Q', 'G06T', 'H04B', 'H04M', 'H04W'), 1)), (197, (('G06K', 'G06Q', 'H04B', 'H04W'), 1)), (198, (('G06F', 'G06K', 'G06T', 'H04M', 'H04N'), 1)), (199, (('G06F', 'G06Q', 'G06T', 'H04W'), 1)), (200, (('G06K', 'H04B', 'H04L', 'H04M', 'H04W'), 1)), (201, (('A61B', 'G06Q', 'H04N'), 1)), (202, (('G06F', 'G06T', 'H04W'), 1)), (203, (('G06F', 'H04B', 'H04N'), 1)), (204, (('G06F', 'G06Q', 'H04B', 'H04M'), 1))]
[0046] Initialize the matrix A to all zeros, the vector b to all minimum values and the vector c to all zeros. The number of rows for A will be equal to the number of classes (labels) and the number of columns will be equal to the number of groups or combinations. The length of vector b will be the number of subclasses and the length of vector c will be the number of groups. Loop through all the classes and all the groups (nested loop). If a class is present in a group, set the corresponding element in A to 1. Next loop through all the groups and set each value of c to the count of the corresponding element in the group (the count comes from the previous step). Here is the sample output for the first 18 columns of A, and the vectors b and c. A =
[[1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1]
 [0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0]
 [0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]]
b =
[1223 1223 1223 1223 1223 1223 1223 1223 1223 1223] c =
[7574 3175 1866 1830 1804 1455 1417 1025 635 553 534 434 316 311 292 288 278 270 235 228 221 218 199 195 172 158 146 135 134 133 123 96 70 65 65 63 60 59 55 52 42 42
41 34 31 25 25 25 24 22 22 21 20 19 19 19
18 17 17 17 17 16 16 15 14 14 14 14 14 14
13 13 11 10 10 10 9 9 8 8 8 8 8 8
8 7 7 7 7 6 6 6 5 5 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 3 3 3
3 3 3 3 3 3 3 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1]
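The construction described in paragraph [0046] can be sketched as follows, assuming the two counting passes above produced dictionaries of label counts and group counts; the helper name and the group ordering are illustrative:

import numpy as np

def build_system(label_counts, group_counts):
    # label_counts: {label: second count}; group_counts: {frozenset: first count}.
    labels = sorted(label_counts)
    # Serial-number the groups by descending count, as in the sample output.
    groups = sorted(group_counts, key=group_counts.get, reverse=True)
    A = np.zeros((len(labels), len(groups)))
    for i, label in enumerate(labels):
        for j, group in enumerate(groups):
            if label in group:      # class present in group -> coefficient 1
                A[i, j] = 1
    b = np.full(len(labels), min(label_counts.values()))  # minimum second count
    c = np.array([group_counts[g] for g in groups])       # first counts
    return A, b, c, groups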
[0047] Initialize the solution vector s to an empty array. Loop through all the groups. If the length of the group is one label only, append a zero to s. Otherwise, append the count for that group as found in step 4 to s.
[0048] Solve for s using gradient descent. Here is a sample solution vector s after solving. s =
[247, 0, 486, 938, 512, 146, 0, 897, 293, 526, 1, 267, 316, 311, 181, 288, 278, 1, 1, 1, 0, 0, 1, 1, 122, 158, 146, 135, 134, 119, 123, 96, 26, 1, 1, 63, 1, 59, 55, 52, 1, 1, 1, 34, 31, 25, 25, 25, 24, 1, 1, 21, 1, 1, 19, 19, 18, 1, 1,

17, 17, 15, 0, 1, 14, 14, 14, 1, 1, 14, 2, 3, 1, 10, 10, 10, 9, 1, 8, 0, 8, 1, 1, 8, 8, 7, 7, 7, 7, 1, 0, 1, 1, 1, 4, 4, 1, 1, 4, 4, 1, 1, 4, 3, 1, 4, 1, 4, 1, 0, 3, 1, 3, 3, 1, 1, 0, 3, 1, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0049] Pick the balanced data by picking from each group a number of rows equaling the corresponding value from s. Here is the distribution of classes in the balanced dataset used for training.
Distribution: [('H04L', 1343), ('H04N', 1257), ('G06K', 1257), ('G06T', 1245),
('G06Q', 1244), ('A61B', 1233), ('H04W', 1226), ('G06F', 1224), ('H04B', 1107),
('H04M', 1107)]
['H04L', 'H04N', 'G06K', 'G06T', 'G06Q', 'A61B', 'H04W', 'G06F', 'H04B', 'H04M']
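A sketch of this selection step follows, assuming rows are (text, labels) pairs in database order; which rows within a group are taken is not prescribed by the specification, so the sketch takes the first ones encountered:

def pick_balanced_rows(rows, groups, s):
    # For each group, take the first s[j] rows whose label set equals it.
    remaining = {g: int(k) for g, k in zip(groups, s)}
    balanced = []
    for text, labels in rows:
        g = frozenset(labels)
        if remaining.get(g, 0) > 0:
            balanced.append((text, labels))
            remaining[g] -= 1
    return balanced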
[0050] It can be observed from the test results that even though the number of classes (labels) is high, the method provides an optimally balanced training dataset.
[0051] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0052] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination

thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0053] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0054] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0055] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0056] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

WE CLAIM:
1. A method for creating a balanced training dataset for multi-label classification, the method comprising:
acquiring (202), by one or more hardware processors, from a database a multi-label dataset comprising a plurality of rows of imbalanced data, wherein each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset;
processing (204), by the one or more hardware processors, the multi-label dataset to:
1) arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels;
2) determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group;
3) determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, in which each label is present;
4) identify a minimum second count from the set of second counts, wherein the minimum second count equals the minimum value in the set of second counts; and
5) assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents a number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset;
generating (206), by the one or more hardware processors, a set of linear equations for the plurality of labels, wherein each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, and wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and wherein each equation is equated to the minimum second count;
representing (208), by the one or more hardware processors, the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix representing the set of equations, x in the matrix equation is a vector representing the set of variables, b in the matrix equation is the vector representing the minimum second count;
determining (210), by the one or more hardware processors, a solution vector (s) for the set of variables in the vector (x) by minimizing the vector length ||Ax – b|| under constraints (x <= c) based on a customized gradient descent technique, wherein the constraint (c) is a vector representing the set of first counts; and
creating (212), by the one or more hardware processors, the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum.
2. The method as claimed in claim 1, further comprising training (214), by the one or more hardware processors, a Machine Learning (ML) model for multi-label classification using the balanced training dataset.
3. The method as claimed in claim 1, wherein determining of the solution vector (s) is performed using an iterative process to minimize the vector length ||Ax – b|| under constraints (x <= c) based on the customized gradient descent technique, and wherein the iterative process comprises the steps of:

initializing a corresponding value in the solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label;
initializing a corresponding value in the solution vector (s) to a corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label;
calculating a gradient of the vector length ||Ax – b|| for a given value of the solution vector (s) at a particular index of the solution vector, s[index];
if the gradient is positive and the value of s[index] is greater than zero, decrementing s[index] by 1; if the gradient is positive and the value of s[index] is zero, retaining s[index] as zero;
if the gradient is negative and the value of s[index] is less than the corresponding value in the vector (c), incrementing s[index] by 1; and
if the gradient is negative and the value of s[index] is equal to the corresponding value in the vector (c), retaining the value of s[index]; and
repeating the step of calculating the gradient until a value of the solution vector (s) repeats a prior computed value.
4. A system (100) for creating a balanced training dataset for multi-label classification, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
acquire from a database the multi-label dataset comprising a plurality of rows of imbalanced data, wherein each row among the plurality of rows comprises data with a combination of one or more labels among a plurality of labels present in the multi-label dataset;
process the multi-label dataset to:
1) arrange the plurality of rows into a plurality of groups, wherein each group among the plurality of groups is defined as an unordered unique combination of one or more labels;
2) determine a set of first counts, wherein each first count among the set of first counts indicates the total number of rows, among the plurality of rows, present in each group;
3) determine a set of second counts, wherein each second count among the set of second counts corresponds to each label among the plurality of labels, and indicates the total number of rows, among the plurality of rows, in which each label is present;
4) identify a minimum second count from the set of second counts, wherein the minimum second count equals the minimum value in the set of second counts; and
5) assign a variable to each group to generate a set of variables associated with the plurality of groups, wherein the variable represents a number of rows of each group to be selected from the plurality of rows of the imbalanced data for creating the balanced training dataset;
generate a set of linear equations for the plurality of labels, wherein each equation among the set of linear equations is generated for a corresponding label among the plurality of labels, and wherein each equation for the label is a summation of one or more variables, among the set of variables, that are associated with the corresponding label, and wherein each equation is equated to the minimum second count;
represent the set of equations, the set of variables and the minimum second count as a matrix equation (Ax=b), wherein A in the matrix equation is a two dimensional coefficient matrix representing the set of equations, x in the matrix equation is a vector representing the set of variables, b in the matrix equation is the vector representing the minimum second count;
determine a solution vector (s) for the set of variables in the vector (x) by minimizing the vector length ||Ax – b|| under constraints (x <= c) based on a customized gradient descent technique, wherein the constraint (c) is a vector representing the set of first counts; and
create the balanced training dataset by selecting the number of rows of each group based on corresponding values of the solution vector (s), wherein the solution vector (s) is a value of the vector (x) such that the vector length ||Ax – b|| is at a minimum.
5. The system (100) as claimed in claim 4, wherein the one or more hardware processors (104) are further configured to train a Machine Learning (ML) model for multi-label classification using the balanced training dataset.
6. The system as claimed in claim 4, wherein the one or more hardware processors are configured to determine the solution vector (s) using an iterative process to minimize the vector length (||Ax – b||) under constraints (x <= c) based on the customized gradient descent technique, wherein the steps of the iterative process comprise:
initializing a corresponding value in the solution vector (s) to 0 for a first group among the plurality of groups if the group is associated with only one label;
initializing a corresponding value in the solution vector (s) to a corresponding value in the vector (c) if a second group among the plurality of groups is associated with more than one label;
calculating a gradient of the vector length (||Ax – b||) for a given value of the solution vector (s) at a particular index of the solution vector, s[index];
if the gradient is positive and the value of s[index] is greater than zero, decrementing s[index] by 1;
if the gradient is positive and the value of s[index] is zero, retaining s[index] as zero;
if the gradient is negative and the value of s[index] is less than the corresponding value in the vector (c), incrementing s[index] by 1; and
if the gradient is negative and the value of s[index] is equal to the corresponding value in the vector (c), retaining the value of s[index]; and
repeating the step of calculating the gradient until a value of the solution vector (s) repeats a prior computed value.
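
The grouping and matrix-equation construction recited in claim 1 can be sketched, for illustration only, in Python; the function name build_matrix_equation and the (row_id, labels) input layout are assumptions, not the claimed implementation:

from collections import defaultdict
import numpy as np

def build_matrix_equation(rows):
    """rows: list of (row_id, labels) pairs, where labels is an iterable
    of label strings. Returns label combinations, groups, A, b and c."""
    groups = defaultdict(list)  # unordered unique label combination -> row ids
    for row_id, labels in rows:
        groups[frozenset(labels)].append(row_id)
    combos = list(groups)
    all_labels = sorted({lab for combo in combos for lab in combo})
    # First counts: number of rows present in each group (constraint vector c).
    c = np.array([len(groups[combo]) for combo in combos])
    # Second counts: number of rows each label is present in.
    second = {lab: sum(len(groups[cb]) for cb in combos if lab in cb)
              for lab in all_labels}
    b_min = min(second.values())  # the minimum second count
    # One linear equation per label: the variables of all groups containing
    # that label sum to the minimum second count, giving Ax = b.
    A = np.array([[1 if lab in combo else 0 for combo in combos]
                  for lab in all_labels])
    b = np.full(len(all_labels), b_min)
    return combos, groups, A, b, c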
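
Likewise, the iterative process of claims 3 and 6 admits a short sketch, assuming A, b and c come from the function above and single_label_mask marks groups having exactly one label; the gradient of the squared length ||Ax - b||^2 is the standard 2 * A^T (Ax - b), of which only the per-index sign is used:

import numpy as np

def solve_constrained(A, b, c, single_label_mask):
    """Minimise ||Ax - b|| under 0 <= x <= c by unit coordinate steps."""
    # Initialisation per claim 3: 0 for single-label groups, c[i] otherwise.
    s = np.where(single_label_mask, 0, c).astype(int)
    seen = {tuple(s)}
    while True:
        grad = A.T @ (A @ s - b)  # only the sign at each index matters
        for i in range(len(s)):
            if grad[i] > 0 and s[i] > 0:
                s[i] -= 1  # positive gradient: step down, floored at 0
            elif grad[i] < 0 and s[i] < c[i]:
                s[i] += 1  # negative gradient: step up, capped at c[i]
        state = tuple(s)
        if state in seen:  # s repeated a prior computed value: stop
            return s
        seen.add(state)

On a hypothetical toy dataset the two sketches compose end to end:

rows = [(0, ['A']), (1, ['A', 'B']), (2, ['B']), (3, ['A', 'B']), (4, ['B'])]
combos, groups, A, b, c = build_matrix_equation(rows)
mask = np.array([len(cb) == 1 for cb in combos])
s = solve_constrained(A, b, c, mask)  # here s = [1, 2, 1]: each label gets 3 rows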

Documents

Application Documents

# Name Date
1 202021010409-FORM 1 [11-03-2020(online)].pdf 2020-03-11
2 202021010409-COMPLETE SPECIFICATION [11-03-2020(online)].pdf 2020-03-11
3 202021010409-DRAWINGS [11-03-2020(online)].pdf 2020-03-11
4 202021010409-FIGURE OF ABSTRACT [11-03-2020(online)].jpg 2020-03-11
5 202021010409-DECLARATION OF INVENTORSHIP (FORM 5) [11-03-2020(online)].pdf 2020-03-11
6 202021010409-STATEMENT OF UNDERTAKING (FORM 3) [11-03-2020(online)].pdf 2020-03-11
7 202021010409-FORM 18 [11-03-2020(online)].pdf 2020-03-11
8 202021010409-REQUEST FOR EXAMINATION (FORM-18) [11-03-2020(online)].pdf 2020-03-11
9 Abstract1.jpg 2020-03-14
10 202021010409-Proof of Right [17-06-2020(online)].pdf 2020-06-17
11 202021010409-FORM-26 [16-10-2020(online)].pdf 2020-10-16
12 202021010409-FER.pdf 2021-11-02
13 202021010409-FER_SER_REPLY [21-01-2022(online)].pdf 2022-01-21
14 202021010409-CLAIMS [21-01-2022(online)].pdf 2022-01-21
15 202021010409-OTHERS [21-01-2022(online)].pdf 2022-01-21
16 202021010409-US(14)-HearingNotice-(HearingDate-09-12-2024).pdf 2024-11-18
17 202021010409-Correspondence to notify the Controller [29-11-2024(online)].pdf 2024-11-29
18 202021010409-FORM-26 [06-12-2024(online)].pdf 2024-12-06
19 202021010409-Written submissions and relevant documents [17-12-2024(online)].pdf 2024-12-17
20 202021010409-IntimationOfGrant27-01-2025.pdf 2025-01-27
21 202021010409-PatentCertificate27-01-2025.pdf 2025-01-27

Search Strategy

1 searchstrategyE_28-10-2021.pdf

ERegister / Renewals

3rd: 04 Feb 2025 (From 11/03/2022 To 11/03/2023)
4th: 04 Feb 2025 (From 11/03/2023 To 11/03/2024)
5th: 04 Feb 2025 (From 11/03/2024 To 11/03/2025)
6th: 04 Feb 2025 (From 11/03/2025 To 11/03/2026)