
A System And A Method For Content Prediction And Classification Through Diverse Model Integration

Abstract: The present embodiment provides a system and a computer-implemented method for content prediction and classification through diverse model integration. The system includes a processing layer, a model generation layer, a specific heuristic rules layer, a configuration layer and an integration layer. The present system and the computer-implemented method are components of a larger pipeline / system of models that can be applied to a specific business outcome. Reference Figure 1


Patent Information

Application #
Filing Date
14 December 2020
Publication Number
24/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
jalanastha64@gmail.com
Parent Application

Applicants

SUMYAG DATA SCIENCES PVT LTD
D603, Mantri Serenity, Doddakallasandra, Bangalore, 560062, Karnataka, India

Inventors

1. VISHWANATH RAMDAS
D603 Mantri Serenity Apts, Doddakallasandra, Bangalore, 560062, Karnataka, India
2. CHANDRA MAHENDRA VIKRAM SINGH
Flat E-102, Wing 1, Saravana Tranquil Heights, Vinayak Nagar, Vidyaranyapura, Bengaluru, 560097, Karnataka, India

Specification

Claims:
1. A generic system in data science pipeline for real-time prediction and classification of an unstructured document content, the system comprising:
a processing layer configured to process a vectorized unstructured document as a DataFrame (100), wherein the DataFrame (100) comprises a plurality of columns representing features and a plurality of rows representing content records;
a model generation layer configured to generate a plurality of models (115) for predicting and classifying content types in the unstructured document;
a specific heuristic rules layer for predicting and classifying content types using an array transform;
a configuration layer configured to provide input features, weights, algorithms and flags for processing the array transforms and the plurality of models (115); and
an integration layer (200) configured to combine the outputs of the array transforms and the plurality of models (115) to generate a unified content classification output (130).

2. The generic system as claimed in claim 1, wherein the input features are standardized and normalized across the plurality of models (115).

3. The generic system as claimed in claim 1, wherein the plurality of models and the array transform generates an intermediary class output (120, 122, 124).

4. The generic system as claimed in claim 1, wherein the integration layer performs a sanctity validation and ensures that the features are available for integration.

5. The generic system as claimed in claim 1, wherein the unified content classification output is merged into the Data Frame.

6. A computer-implemented method for generating and integrating multiple models for content classification, the method comprising:
representation of a vectorized form of an unstructured document as a DataFrame (202), wherein the DataFrame (202) has a plurality of rows and a plurality of columns;
generation of a plurality of models and an array transform based on a different content classification;
integration of the plurality of models and the array transform into a multi-dimensional array for generating a unified content classification output (225); and
application of the model weights from the model configuration file (210) onto the multi-dimension array using a configuration driven aggregation function (220) for generating a final classification matrix (225).

7. The computer-implemented method as claimed in claim 6, wherein the unified content classification output is merged into the Data Frame.

8. The computer-implemented method as claimed in claim 6, wherein the method can be generically called from anywhere in the data science pipeline, multiple times with differing configuration files.
Description:
FIELD OF INVENTION
The present embodiment relates to the field of computational systems used in data science, artificial intelligence and machine learning, and more particularly relates to a system and a method for content analysis, prediction and classification in an unstructured document.
BACKGROUND OF INVENTION
Data science offers a diverse choice of models, methods and frameworks for tasks such as prediction, classification, clustering and segmentation. These models include heuristic rules, Bayesian and count-based probabilities, optimization functions and neural networks.
Each model brings its own strengths and weaknesses to the modeler who shapes and predicts the outcome. While each model individually has a strong basis in hypothesis testing for a specific prediction or outcome, each model applied in isolation carries a residual error rate. The error rates stem from the methodology, input data and testing used to build the model. The errors can be false positives or false negatives; they are difficult to overcome and therefore prevent accurate results and outcomes.
Additionally, unstructured data in documents come in a variety of formats and do not follow any specific template. Therefore, content prediction within the unstructured document becomes a very difficult task.
Several prior-art approaches attempt to address the above-mentioned problem. For example, one prior-art approach proposes an automated meta-analysis system with a moving protocol, where multiple studies are combined through incremental model development, feeding the outputs of one model into the next model in a sequential process. The sequential integration is done during model development to enhance the results of the subsequent model. However, the available methods do not provide accurate prediction, classification, clustering or segmentation. They also fail to combine the variety of models with human heuristics to generate optimal content prediction, analysis and classification.
Therefore, there is a need for a system and a method that addresses a multi-model approach for content analysis and classification via integration of a variety of models driven by meta/higher level analysis into a single output. Further, there is also a need for the system and the method that helps computationally combine different scientific/analytical models to address the opportunity for eliminating biases/errors in individual isolated models and thereby helps in achieving higher accuracy.
SUMMARY OF THE INVENTION
As mentioned, there is a need for a system and a method for the accurate content analysis and classification through integration of a variety of models driven by meta/higher level analysis.
In an aspect, a generic system in the data science pipeline for real-time prediction and classification of an unstructured document content is provided. The generic system includes a processing layer, a model generation layer, a specific heuristic rules layer, a configuration layer and an integration layer. The processing layer processes a vectorized unstructured document as a DataFrame, wherein the DataFrame comprises a plurality of columns representing features and a plurality of rows representing content records. The model generation layer is configured to generate a plurality of models for predicting and classifying content types in the unstructured document. The specific heuristic rules layer is configured to predict and classify content types using an array transform. The configuration layer is configured to provide input features, weights, algorithms and flags for processing the array transforms and the plurality of models. The integration layer combines the outputs of the plurality of models and the array transform to generate a unified content classification output.
In another aspect, a computer-implemented method for generating and integrating multiple models for content classification is provided. The computer-implemented method includes the following steps: representation of a vectorized unstructured document as a DataFrame, wherein the DataFrame has a plurality of rows and a plurality of columns; generation of a plurality of models and an array transform based on a different content classification; integration of the plurality of models and the array transform into a multi-dimensional array for generating a unified content classification output; and application of the model weights from the model configuration file onto the multi-dimensional array using a configuration-driven aggregation function for generating a final classification matrix.

The preceding is a simplified summary to provide an understanding of some aspects of embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
Figure 1 illustrates a block diagram of a generic system for real-time prediction and classification of an unstructured document content, according to an embodiment herein; and
Figure 2 illustrates a block diagram of a computer-implemented method for generating and integrating a plurality of models for content classification, according to an embodiment herein.
To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures.
DETAILED DESCRIPTION
As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to.
The phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
The term “a” or “an” entity refers to one or more of that entity. As such, the terms “a” (or “an”), “one or more” and “at least one” can be used interchangeably herein. It is also to be noted that the terms “comprising”, “including”, and “having” can be used interchangeably.
Figure 1 illustrates the block diagram of the generic system for real-time prediction and classification of the unstructured document content. The generic system is able to predict unstructured content in a document regardless of format, authorship and template. In an embodiment, the generic system combines machine learning and human heuristics for real-time prediction and classification of the unstructured document content. The generic system includes a processing layer, a model generation layer, a specific heuristic rules layer, a configuration layer and an integration layer.
The processing layer processes a vectorized unstructured document as a DataFrame (100). In an embodiment, the unstructured document content is represented as a vectorized document. In an embodiment, the DataFrame (100) comprises a plurality of rows and a plurality of columns. In an embodiment, the plurality of columns represents features. In an embodiment, the plurality of rows represents content records.
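As an illustration only, the row and column layout described above can be sketched with the pandas library. The feature names and record labels below are hypothetical and do not come from the specification.

```python
import pandas as pd

# Hypothetical vectorized document: one row per content record (for example a
# line or text block) and one column per numeric feature.
# All feature names here are invented for illustration.
records = [
    {"token_count": 4, "has_digits": 1, "is_upper": 0},
    {"token_count": 12, "has_digits": 0, "is_upper": 0},
    {"token_count": 2, "has_digits": 0, "is_upper": 1},
]
df = pd.DataFrame(records, index=["rec_0", "rec_1", "rec_2"])

print(df.shape)  # (3, 3): three content records, three features
```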
The model generation layer applies a multiple-model approach in predicting and classifying content types in the unstructured document. In an embodiment, the model generation layer generates the plurality of models (115) based on the different content classifications. In an embodiment, the plurality of models (115) is based on approaches such as, but not limited to, Bayesian prediction, perceptron and neural network optimization. In an embodiment, the plurality of models (115) enhances the input DataFrame (100). The model generation layer standardizes and normalizes the output values to enable proper integration of models.
In an embodiment, intermediary class outputs (122, 124) are generated within a modeling class when the plurality of models is used inside a specific modeling class, for example token-based naive Bayes.
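The specification does not state which standardization or normalization is applied to the model outputs. One possible reading, shown here purely as an assumption, is min-max scaling so that scores from models with different ranges become comparable before integration. The function name `min_max_normalize` is hypothetical.

```python
import numpy as np

def min_max_normalize(scores):
    """Scale a score matrix to the range [0, 1] (assumed normalization)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # A constant matrix carries no ranking information; map it to zeros.
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Hypothetical raw outputs of one model (records x classes), e.g. log-odds.
log_odds = np.array([[-2.0, 2.0], [0.0, 1.0]])
normalized = min_max_normalize(log_odds)
print(normalized)
```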
The specific heuristic rules layer is configured to predict and classify content types using an array transform. In an embodiment, the specific heuristic rules layer applies a heuristic rule set. In an embodiment, the heuristic rule set is stored as a configuration file (105) having features as rows and output content classes as columns, with the tabular data being the weights supplied as input to the array transform. In an embodiment, the data-array transform (110) is combined with the heuristic configuration (105) to generate an intermediary DataFrame.
In an embodiment, the intermediary DataFrame is standardized and normalized to generate output classes for the content, which are merged into the DataFrame. The standardization and normalization generate new features in the DataFrame (120), thereby generating a heuristic-based content classification.
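A minimal sketch of such a heuristic array transform follows, assuming the transform is a matrix product of the feature columns with a feature-by-class weight table and that normalization means scaling each row to sum to one. Both are assumptions, and the feature names, class names and the `heur_` prefix are hypothetical.

```python
import pandas as pd

# Input DataFrame: rows are content records, columns are features.
features = pd.DataFrame(
    {"has_digits": [1, 0, 0], "is_upper": [0, 0, 1]},
    index=["rec_0", "rec_1", "rec_2"],
)

# Heuristic configuration: features as rows, output content classes as
# columns, cell values are the rule weights (all values invented).
heuristic_cfg = pd.DataFrame(
    {"heading": [0.0, 1.0], "amount": [1.0, 0.0]},
    index=["has_digits", "is_upper"],
)

# Array transform (assumed to be a matrix product): records x classes.
raw = features[heuristic_cfg.index].dot(heuristic_cfg)

# Assumed normalization: each row sums to 1; all-zero rows stay at zero.
row_sums = raw.sum(axis=1).replace(0, 1)
scores = raw.div(row_sums, axis=0).add_prefix("heur_")

# Merge the heuristic class scores back as new feature columns.
enhanced = features.join(scores)
print(enhanced)
```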
In an embodiment, the model generation layer and the specific heuristic rules layer involve the configuration layer and file, which provide the input features, weights, algorithms and flags for processing the array transforms. In an embodiment, the configuration layer and file provide the instructions on how to execute based on the context, including model integration weights, standardization and normalization, and the handling of missing values and labels.
In an embodiment, the configuration DataFrame object of weights (210) contains model aliases on the column axis [model .. model2 ...] and classification entities on the rows [class A .. class B ...], thereby providing a weighted mapping for each respective model source. Each of these weights also assists in enabling or disabling a model, completely or in part, during the integration process for generating the final classification scores.
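The weights table described above might look like the following sketch. The model aliases, class names and values are hypothetical, and treating a zero weight as "disabled" is an assumption about how enabling and disabling is expressed.

```python
import pandas as pd

# Hypothetical weights table: model aliases on the column axis,
# classification entities on the rows.
weights = pd.DataFrame(
    {"bayes": [0.5, 0.5], "perceptron": [0.3, 0.0], "heuristic": [0.2, 0.5]},
    index=["class_A", "class_B"],
)

# Assumption: a zero weight switches a model off for that class only, and a
# column of zeros would switch the model off completely.
is_off = weights.eq(0)
fully_disabled = [m for m in weights.columns if is_off[m].all()]
partly_disabled = [
    m for m in weights.columns if is_off[m].any() and not is_off[m].all()
]

print(fully_disabled)   # []
print(partly_disabled)  # ['perceptron']
```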
The integration layer (200) combines the outputs of the plurality of models and the data-array transform to generate a unified content classification output (130). In an embodiment, the integration layer (200) uses meta-analysis/higher analysis and 3-dimensional array stacks to combine the plurality of models and the data-array transform for generating the final content classification scores.

In an embodiment, the integration layer (200) performs a sanctity check on the class labels of the plurality of models. If any class label is missing in the inbound DataFrame (202), the integration layer adds default values (=0) and generates the missing class labels so that the arrays are consistent across sources.
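One way to sketch this check, assuming each model output is a pandas DataFrame with one column per class label, is to reindex the columns against the expected label set and fill the gaps with zero. The label names are hypothetical.

```python
import pandas as pd

# Full set of class labels expected by the integration step (invented names).
expected_classes = ["class_A", "class_B", "class_C"]

# Hypothetical output of one model that never emitted class_C.
model_out = pd.DataFrame(
    {"class_A": [0.7, 0.1], "class_B": [0.3, 0.9]},
    index=["rec_0", "rec_1"],
)

# Add the missing label with the default value 0 so that every source ends
# up with the same columns in the same order.
aligned = model_out.reindex(columns=expected_classes, fill_value=0.0)
print(aligned)
```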

The outputs of the plurality of model classes (120, 122, 124, …) are combined, integrating the model classes into the final content classification to generate an enhanced output DataFrame (130). In an embodiment, the output DataFrame (130) contains the new class label as a column feature.

In an embodiment, the generic system iteratively takes in multiple stages of the multiple model integration, as needed by the context and the modeler, using configuration files to generate the final content classification. In an embodiment, the model integration layer (205) iteratively takes in the plurality of models and stacks the plurality of models into a multidimensional array (215) having the same number of output columns as the model with the highest dimension.
In an embodiment, the integration layer has the configuration layer and file. In an embodiment, the integration layer has the configuration-driven activation function with the model weights from the model configuration file (210). In an embodiment, the model weights are applied onto the multidimensional stack using a configuration-driven aggregation function (220), thereby generating the final classification matrix (225), which yields the final scores for a particular classification output.
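The stacking and weighting described above can be sketched with NumPy. The axis order (models, records, classes), the per-model-per-class weight layout and the choice of a weighted sum as the aggregation function are all assumptions, since the specification leaves the aggregation function to the configuration. The score values are invented.

```python
import numpy as np

# Hypothetical normalized outputs of three models, each records x classes.
bayes = np.array([[0.8, 0.2], [0.4, 0.6]])
perceptron = np.array([[0.6, 0.4], [0.1, 0.9]])
heuristic = np.array([[1.0, 0.0], [0.0, 1.0]])

# 3-dimensional stack with assumed axis order (models, records, classes).
stack = np.stack([bayes, perceptron, heuristic])

# Model weights as they might be read from the configuration,
# laid out as (models, classes).
weights = np.array([[0.5, 0.5], [0.3, 0.0], [0.2, 0.5]])

def aggregate(stack, weights):
    """Weighted sum over the model axis (one possible aggregation)."""
    return np.einsum("mrc,mc->rc", stack, weights)

final_matrix = aggregate(stack, weights)  # shape: (records, classes)
print(final_matrix)
```

Because the weights are indexed by both model and class, setting a single entry to zero removes one model's vote for one class while leaving its other votes intact.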

Figure 2 illustrates the block diagram of the computer-implemented method for generating and integrating the plurality of models for content classification. The computer implemented method includes the following steps:

The first step of the computer-implemented method includes the representation of the DataFrame (202) as the vectorized form of an unstructured document. In an embodiment, the DataFrame (202) has a plurality of rows and a plurality of columns. In an embodiment, the plurality of columns represents features. In an embodiment, the plurality of rows represents content records.
The second step of the computer-implemented method includes the generation of the plurality of models based on the different content classification. In an embodiment, the plurality of models includes a variety of models such as, but not limited to, Bayesian prediction, perceptron, neural network optimization and heuristic model. In an embodiment, the plurality of models enhances the input DataFrame.

In an embodiment, the heuristic rule set is stored as a configuration file (105) having features as rows and output content classes as columns and the tabular data being the weights of the array transform as an input. In an embodiment, the data-array transform (110) is combined with the heuristic configuration (105) to generate an intermediary DataFrame. In an embodiment, the input features, weights, algorithms and flags are added for processing the plurality of models.
The third step of the computer-implemented method includes the integration of the plurality of models into a multi-dimensional array for generating the unified content classification output. In an embodiment, the unified content classification output is generated through meta-analysis/higher analysis and 3-dimensional array stacks.

In an embodiment, default values (=0) are added if any class label is missing in the inbound DataFrame (202), and the missing class labels are generated so that the arrays are consistent across sources.

In an embodiment, the outputs of the plurality of model classes (120, 122, 124, …) are combined, integrating the model classes into the final content classification to generate an enhanced output DataFrame (130). In an embodiment, the output DataFrame (130) contains the new class label as a column feature.

In an embodiment, the model integration layer (205) iteratively takes in the plurality of models and stacks the plurality of models into a multi-dimensional array (215) having the same number of output columns as the model with the highest dimension.

The fourth step of the computer-implemented method includes the application of the model weights from the model configuration file (210) onto the multi-dimension array using a configuration driven aggregation function (220) for generating a final classification matrix (225). In an embodiment, the final classification matrix (225) yields the final scores for the particular classification output.
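As a sketch of how the final scores could become a class label merged into the DataFrame, the example below picks the highest-scoring class per record. Selecting the maximum is an assumption, since the specification only states that the matrix yields the final scores. The column name `content_class` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Input DataFrame with one invented feature column.
df = pd.DataFrame({"token_count": [4, 12]}, index=["rec_0", "rec_1"])

# Hypothetical final classification matrix (records x classes).
classes = ["class_A", "class_B"]
final_matrix = np.array([[0.78, 0.10], [0.23, 0.80]])

# Label each record with its highest-scoring class (assumed decision rule)
# and merge the label into the DataFrame as a new column.
scores = pd.DataFrame(final_matrix, index=df.index, columns=classes)
df["content_class"] = scores.idxmax(axis=1)
print(df)
```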

The present system and the computer-implemented method can be run multiple times within the data science pipeline for every content classification type that needs to be integrated and generated. The present system and the computer-implemented method are capable of predicting, identifying and classifying the specific set of content types from the unstructured document. In an embodiment, the combined results of the plurality of models are more effective at predicting the content, with lower error rates, than the models applied in isolation. The present system and the computer-implemented method are components of a larger pipeline / system of models that can be applied to a specific business outcome.

Moreover, though the description of the present invention has included description of one or more embodiments, configurations, or aspects and certain variations and modifications, other variations, combinations, and modifications are within the scope of the present invention, e.g., as may be within the skill and knowledge of those in the art, after understanding the present disclosure. It is intended to obtain rights which include alternative embodiments, configurations, or aspects to the extent permitted, including alternate, interchangeable and/or equivalent structures, functions, ranges or steps to those claimed, whether or not such alternate, interchangeable and/or equivalent structures, functions, ranges or steps are disclosed herein, and without intending to publicly dedicate any patentable subject matter.

Documents

Application Documents

# Name Date
1 202041054195-STATEMENT OF UNDERTAKING (FORM 3) [14-12-2020(online)].pdf 2020-12-14
2 202041054195-REQUEST FOR EXAMINATION (FORM-18) [14-12-2020(online)].pdf 2020-12-14
3 202041054195-PROOF OF RIGHT [14-12-2020(online)].pdf 2020-12-14
4 202041054195-POWER OF AUTHORITY [14-12-2020(online)].pdf 2020-12-14
5 202041054195-FORM FOR STARTUP [14-12-2020(online)].pdf 2020-12-14
6 202041054195-FORM FOR SMALL ENTITY(FORM-28) [14-12-2020(online)].pdf 2020-12-14
7 202041054195-FORM 18 [14-12-2020(online)].pdf 2020-12-14
8 202041054195-FORM 1 [14-12-2020(online)].pdf 2020-12-14
9 202041054195-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [14-12-2020(online)].pdf 2020-12-14
10 202041054195-EVIDENCE FOR REGISTRATION UNDER SSI [14-12-2020(online)].pdf 2020-12-14
11 202041054195-DRAWINGS [14-12-2020(online)].pdf 2020-12-14
12 202041054195-DECLARATION OF INVENTORSHIP (FORM 5) [14-12-2020(online)].pdf 2020-12-14
13 202041054195-COMPLETE SPECIFICATION [14-12-2020(online)].pdf 2020-12-14
14 202041054195-FER.pdf 2023-10-19

Search Strategy

1 SearchHistoryE_17-10-2023.pdf