ABSTRACT
Method and system for performing adaptive document classification. The present invention relates to the field of data analytics and more particularly to performing analytics of data present in a document to classify the document. Embodiments herein disclose a method and system for adaptive document classification assisted by an offline means, wherein the classification can be performed on a device being used to classify the document in real-time and the classification is improved in a continuous manner by learning from feedback. FIG. 5
STATEMENT OF CLAIMS
We claim:
1. A method for classifying a document, said method comprising
performing a preliminary classification of a document;
performing an offline classification of said document;
performing a simulation of said preliminary classification;
comparing classification performed by said offline classification and classification performed by said simulation of said preliminary classification;
providing feedback to said server, if classification performed by said offline classification and classification performed by said simulation of said preliminary classification do not match; and
updating at least one of classification model used for said preliminary classification, classification model used for said simulation and classification model used for said offline classification, based on said feedback.
2. The method, as claimed in claim 1, wherein said preliminary classification comprises at least one of
performing a rule based classification of said document using at least one rule; and
performing machine learning based classification of said document, wherein said machine learning based classification further comprises of
classifying said document into at least one class by a plurality of classifiers, wherein said plurality of classifiers run in a parallel manner; and
determining a classification of said document by combining outputs of said plurality of classifiers.
3. The method, as claimed in claim 2, wherein said method further comprises of updating said machine learning based classification based on feedback provided by said server.
4. The method, as claimed in claim 1, wherein said offline classification comprises of classifying said document based on information extracted from said document.
5. The method, as claimed in claim 1, wherein providing feedback further comprises of analyzing said document, classification performed by said offline classification and classification performed by said simulation using at least one of an automated means, a manual means, and a combination of an automated means and a manual means.
6. The method, as claimed in claim 1, wherein said method further comprises of
generating at least one rule; and
updating said at least one rule on said device.
7. The method, as claimed in claim 1, wherein updating classification model used for said preliminary classification is done in a phased manner at pre-defined intervals.
8. A system for classifying a document, said system comprising of
at least one device configured for
performing a preliminary classification of a document;
a server configured for
performing an offline classification of said document;
performing a simulation of said preliminary classification;
comparing classification performed by said offline classification and classification performed by said simulation of said preliminary classification;
an analysis module configured for
providing feedback to said server, if classification performed by said offline classification and classification performed by said simulation of said preliminary classification do not match; and
said server further configured for
updating at least one of classification model used for said preliminary classification, classification model used for said simulation and classification model used for said offline classification, based on said feedback.
9. The system, as claimed in claim 8, wherein said device is further configured for performing said preliminary classification by
performing a rule based classification of said document using at least one rule; and
performing machine learning based classification of said document, wherein said machine learning based classification further comprises of
classifying said document into at least one class by a plurality of classifiers, wherein said plurality of classifiers run in a parallel manner; and
determining a classification of said document by combining outputs of said plurality of classifiers.
10. The system, as claimed in claim 9, wherein said device is further configured for updating said machine learning based classification based on feedback provided by said server.
11. The system, as claimed in claim 8, wherein said server is further configured for performing said offline classification by classifying said document based on information extracted from said document.
12. The system, as claimed in claim 8, wherein said analysis module is further configured for providing feedback by analyzing said document, classification performed by said offline classification and classification performed by said simulation using at least one of an automated means, a manual means, and a combination of an automated means and a manual means.
13. The system, as claimed in claim 8, wherein said analysis module is further configured for generating at least one rule; and said server is further configured for updating said at least one rule on said device.
14. The system, as claimed in claim 8, wherein said server is further configured for updating classification model used for said preliminary classification in a phased manner at pre-defined intervals.
Dated 02nd July 2015
Signature:
Name: Kalyan Chakravarthy
Patent Agent
FORM 2
The Patent Act 1970
(39 of 1970)
&
The Patent Rules, 2005
COMPLETE SPECIFICATION
(SEE SECTION 10 AND RULE 13)
TITLE OF THE INVENTION
“Method and system for performing adaptive document classification on a computing device”
APPLICANTS:
Name Nationality Address
SAMSUNG R&D Institute India - Bangalore Private Limited India # 2870, Orion Building, Bagmane Constellation Business Park, Outer Ring Road, Doddanekundi Circle, Marathahalli Post, Bangalore-560037, India
The following specification particularly describes and ascertains the nature of this invention and the manner in which it is to be performed:-
FIELD OF INVENTION
[001] The present invention relates to the field of data analytics on computing devices and more particularly to performing analytics of data present in a document to classify the documents into a set of predefined classes on computing devices.
BACKGROUND OF INVENTION
[002] Currently a user accesses documents on his device. The documents can comprise webpages (such as news sites, media content, knowledge content, and so on), emails, SMSs (Short Messaging Service), IMs (Instant Messages), RSS (Rich Site Summary) feeds and so on. In many cases, these documents need to be processed and analyzed on the device for reasons such as filtering unwanted content (such as adult content and so on), retrieving information from specific documents (belonging to a specific category and so on), showing targeted advertisements, building specialized vertical search engines and so on. For example, in the case of a family, the parents do not wish their children to view webpages related to adult content, violence, and so on. In another example, in the case of an organization, the organization does not wish its employees to view webpages not related to their work (such as social networks, online merchants, news sites, media and so on). In such cases, the documents (webpages in this example) need to be filtered to remove any webpages that need to be blocked, before enabling a user to view the webpage.
[003] A current solution uses classification at server level, wherein a server can be configured to classify the document, before making the document visible to the user. However, such solutions can result in a delay in the user being able to view the documents, after a request for such a document is made by the user, due to the time taken by the server to perform a classification before presenting the document to the user and the network latency involved.
[004] Other solutions implement classification at the device level, wherein the device used by the user can classify a document, before making the document visible to the user. However, such solutions are limited due to the low processing power and low memory available on the device (as compared to the server based solutions). Hence, less complex classification models are used for performing classification at the device level, which can adversely affect the accuracy of the classification.
[005] Also, documents and their contents can be dynamic in nature (with the content, language, terms and so on which are used in documents subject to change) and so the classification also needs to be done dynamically, to ensure that the documents can be classified in an accurate manner.
OBJECT OF INVENTION
[006] The principal object of the embodiments herein is to propose a method and system for adaptive document classification assisted by an offline means, wherein the document classification can be performed on a device being used to classify the document in real-time.
[007] Another object of the invention is to propose a method and system for adaptive document classification assisted by an offline means, wherein the classification can be performed on a device being used to classify the document in real-time and classification is improved in a continuous manner by learning using a feedback and analysis mechanism.
SUMMARY
[008] Accordingly the invention provides a method for classifying a document, the method comprising performing a preliminary classification of a document; performing an offline classification of the document; performing a simulation of the preliminary classification; comparing classification performed by the offline classification and classification performed by the simulation of the preliminary classification; providing feedback to the server, if classification performed by the offline classification and classification performed by the simulation of the preliminary classification do not match; and updating at least one of classification model used for the preliminary classification, classification model used for the simulation and classification model used for the offline classification, based on the feedback.
[009] Accordingly the invention provides a system for classifying a document, the system comprising of at least one device configured for performing a preliminary classification of a document; a server configured for performing an offline classification of the document; performing a simulation of the preliminary classification; comparing classification performed by the offline classification and classification performed by the simulation of the preliminary classification; an analysis module configured for providing feedback to the server, if classification performed by the offline classification and classification performed by the simulation of the preliminary classification do not match; and the server further configured for updating at least one of classification model used for the preliminary classification, classification model used for the simulation and classification model used for the offline classification, based on the feedback.
[0010] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF FIGURES
[0011] This invention is illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[0012] FIG. 1 depicts a system for performing classification of documents, according to embodiments as disclosed herein;
[0013] FIG. 2 depicts a device for enabling a user to access a document, according to embodiments as disclosed herein;
[0014] FIG. 3 depicts a server, according to embodiments as disclosed herein;
[0015] FIG. 4 depicts an analysis module for providing feedback, according to embodiments as disclosed herein;
[0016] FIG. 5 depicts the process of performing classification of an incoming URL, according to embodiments as disclosed herein;
[0017] FIG. 6 is a flowchart illustrating the process of performing the preliminary classification, according to embodiments as disclosed herein;
[0018] FIG. 7 is a flowchart depicting the process of the offline classification, according to embodiments as disclosed herein;
[0019] FIG. 8 is a flowchart depicting the process of providing feedback and generating rules, according to embodiments as disclosed herein;
[0020] FIG. 9 is a flowchart illustrating the process of the server updating the device based on feedback provided by the analysis module, according to embodiments as disclosed herein; and
[0021] FIG. 10 is a flowchart illustrating the process of the server updating the device based on rule(s) provided by the analysis module, according to embodiments as disclosed herein.
DETAILED DESCRIPTION OF INVENTION
[0022] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0023] The embodiments herein achieve a method and system for adaptive document classification on a device, assisted by an offline means and adapted using a feedback mechanism. Referring now to the drawings, and more particularly to FIGS. 1 through 10, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
[0024] Embodiments herein disclose a method and system for classifying a document using a two-stage level of classification. The first stage of classification or preliminary classification is performed on a device being used to access the document in an online manner, and is configured for providing a very low latency. The second stage of classification or offline classification comprises using an offline module (a server) using suitable techniques such as machine learning to classify documents with a very high accuracy level. Embodiments herein also disclose a feedback mechanism; wherein the classification performed at the first stage and second stage are updated based on feedback provided using at least one of a manual and automated means. The two stages of classification need not be performed in a sequential manner. The preliminary classification can run in real-time on the device, whereas the server can perform the offline classification at periodic intervals and/or on pre-defined events occurring.
[0025] FIG. 1 depicts a system for performing classification of documents, according to embodiments as disclosed herein. FIG. 1 depicts a device 101, a server 102, and an analysis module 103. The device 101 can be a device that can be used by a user of the device 101 to access a document. The device 101 can be a computer, a laptop, a tablet, a mobile phone, a smart phone, a wearable computing device, or any other device configured to enable the user of the device to access a document. The device 101 can use a suitable communication means such as a wired means, a wireless means and so on, to access the document. The server 102 can be a module connected to the device 101. The server 102 can be connected to the device 101 using a suitable connection means such as the internet, a private network, a public network, and so on. The analysis module 103 can be a module connected to the server 102. In an embodiment herein, the analysis module 103 can be present within the server 102. In an embodiment herein, the analysis module 103 can be a distinct module from the server 102. In an embodiment herein, the analysis module 103 can be partly present within the server 102.
[0026] The document herein can refer to any document, which is accessible to the device 101 and at least one server. In an example herein, the document can be a webpage, an IM (Instant Message), a SMS (Short Messaging Service), a RSS (Rich Site Summary) feed, emails, attachments to emails, text files, media files comprising of text (such as subtitles, and so on), presentations, slideshows or any other form of data, which can comprise of text.
[0027] The device 101 can receive a request to access a document (wherein the request can be received from a user, from an application residing on the device being used by the user, or from the device without any inputs from the user (such as automatically fetching emails, and so on)). The device 101 can receive the request directly or from another device being used by the user. The device 101 can perform the preliminary classification in an online manner and with a very low latency. The server 102 can perform the offline classification using suitable techniques such as machine learning to classify documents with a very high accuracy level. The server 102 can also simulate the techniques used by the device 101 for performing the classification. The server 102 can further compare the classifications performed using the different techniques. The analysis module 103 can provide feedback based on the comparison performed by the server 102, using automated and/or manual means. The analysis module 103 can also generate at least one rule, based on the classifications. The feedback provided by the analysis module 103 can be used to improve the classification performed by the device 101 and the server 102.
[0028] FIG. 2 depicts a device for enabling a user to access a document, according to embodiments as disclosed herein. The device 101, as depicted, comprises of a rule based classification engine 201, a machine learning based classification engine 202, a controller 203, a communication interface 204, and a memory 205. The communication interface 204 can fetch a document, on receiving a request. The communication interface 204 can enable the device 101 to interact with at least one external module, such as the server 102, the analysis module 103, and so on. The memory 205 can be a non-volatile memory and can be used for storing data, related to the classification such as rules and so on. The memory 205 can be a memory internal to the device 101. The memory 205 can be an expandable storage means, accessible to the device 101. In an embodiment herein, the memory 205 can be a database.
[0029] On receiving a request for a document, the controller 203 can fetch the document and associated content (such as embedded data, metadata, properties and so on), using the communication interface 204. The rule based classification engine 201 can perform classification of the document using at least one suitable rule such as domain knowledge, blacklists of documents, whitelists of documents, and so on. The rule based classification engine 201 can also use a previously performed classification of the document, to perform the classification. The rule based classification engine 201 can also use a previously performed classification of similar documents, to perform the classification. If the document is a webpage, the rule based classification engine 201 can perform the classification using the URL (Uniform Resource Locator) of the requested document and/or contents of the document.
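The rule based classification described above can be illustrated with a minimal sketch. The rule store, host names, keywords and class labels below are hypothetical and are chosen only for illustration; the embodiments do not prescribe any particular rule format or order of evaluation.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical rule store; host names and class labels are illustrative only.
BLACKLIST = {"adult.example": "adult"}
WHITELIST = {"news.example": "news"}
KEYWORD_RULES = {"football": "sports"}


def rule_based_classify(url: str, text: str = "") -> Optional[str]:
    """Return a class label if any rule fires, else None (no rule matched)."""
    host = (urlparse(url).hostname or "").lower()
    # URL list rules are checked first because they need no document content.
    for rules in (BLACKLIST, WHITELIST):
        if host in rules:
            return rules[host]
    # Fall back to simple keyword rules on the document contents.
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES.items():
        if keyword in lowered:
            return label
    return None
```

In this sketch a return value of None means that no rule applied, in which case the document could be left to the machine learning based classification engine 202.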
[0030] The rules used by the rule based classification engine 201 can be updated, based on inputs. The analysis module 103 and/or the server 102 can provide the inputs. On receiving information related to the updated rules, the controller 203 can store the information in the memory 205. The rule based classification engine 201 can use the updated rules when required to perform a classification.
[0031] The machine learning based classification engine 202 can first check the probability that the requested document belongs to a particular class. The user of the device 101, an administrator of the classification, the server 102, and so on can define the classes. If the document is a webpage, the examples of the classes can be sports, entertainment, technology, news, health, and so on. If the document is an IM, the examples of the classes can be office, family, friends, and so on. The user of the device 101, the administrator, and so on can modify the classes at any instant. The machine learning based classification engine 202 can comprise of a plurality of classifiers, wherein the classifiers run in parallel with each other. In an embodiment, the number of classifiers can be equal to the number of classes. Each of the classifiers determines whether the requested document belongs to a specific class, in a parallel manner and produces a result. The result of the classifier can be in the form of a YES/NO, whether the requested document belongs to the class corresponding to the classifier. The classifier can comprise of at least one indeterminate state, wherein the classifier can produce the indeterminate state as the result, on the classifier being unable to classify the document. The machine learning based classification engine 202 can combine the results from the plurality of classifiers and determine a final classification for the document.
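The arrangement of one classifier per class, run in parallel and then combined, can be sketched as follows. This is a sketch under stated assumptions: the keyword based classifiers stand in for trained models, and the combination rule (the first class, in a fixed order, whose classifier answers YES) is assumed here because the embodiments do not fix a particular combination rule.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Callable, Dict, Optional, Set


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    INDETERMINATE = "indeterminate"


Classifier = Callable[[str], Verdict]


def keyword_classifier(keywords: Set[str]) -> Classifier:
    """Hypothetical stand-in for a trained per-class model."""
    def classify(text: str) -> Verdict:
        words = set(text.lower().split())
        if not words:
            return Verdict.INDETERMINATE  # nothing to base a decision on
        return Verdict.YES if words & keywords else Verdict.NO
    return classify


def classify_document(text: str, classifiers: Dict[str, Classifier]) -> Optional[str]:
    """Run one classifier per class in parallel and combine the verdicts."""
    with ThreadPoolExecutor() as pool:
        futures = {label: pool.submit(clf, text) for label, clf in classifiers.items()}
        verdicts = {label: future.result() for label, future in futures.items()}
    # Assumed combination rule: first class (in dict order) that answered YES.
    for label, verdict in verdicts.items():
        if verdict is Verdict.YES:
            return label
    return None  # every classifier answered NO or INDETERMINATE


CLASSIFIERS = {
    "sports": keyword_classifier({"football", "cricket"}),
    "technology": keyword_classifier({"smartphone", "processor"}),
}
```

For example, classify_document("cricket final tonight", CLASSIFIERS) selects the sports class, while an empty document yields None because every classifier reports the indeterminate state.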
[0032] The machine learning based classification engine 202 can be updated, based on inputs. The inputs can be provided by the server 102 and/or the user of the device 101. On receiving the inputs, the controller 203 can store the information in the memory 205. The machine learning based classification engine 202 can perform classification, based on the inputs.
[0033] The classification performed by the rule based classification engine 201 and the machine learning based classification engine 202 can be provided to the controller 203. The controller 203 can use the classification for purposes such as for filtering unwanted content (such as adult content and so on), for information retrieval from specific documents (belonging to a specific category and so on), for showing targeted advertisements, for building specialized vertical search engines and so on.
[0034] FIG. 3 depicts a server, according to embodiments as disclosed herein. The server 102 can comprise of an offline classifier 301, a controller 302, a training module 303, a memory 304, and a communication interface 305. The communication interface 305 can enable the server to fetch the document. The communication interface 305 can enable the server 102 to interact with at least one external module, such as the device 101, the analysis module 103, and so on. The memory 304 can be a memory storage location, wherein the memory 304 can be a pure database, a memory store, an electronic storage location and so on. The memory 304 can be located locally with the server 102. The memory 304 can be located remotely from the server 102 and the server 102 can communicate with the memory 304 using a suitable means such as LAN, a private network, a WAN, the Internet, Wi-Fi and so on. The memory 304 can comprise of rule(s), classification models, information related to previously classified documents and so on.
[0035] The controller 302 can receive information related to the documents from the device 101. The information can comprise of the location of the document (in the form of a URL (Uniform Resource Locator), a FTP (File Transfer Protocol) link, the location of the server (such as an email server, IM server) which contains the document, and so on). The information can also comprise of the document. In an embodiment herein, if the document is a webpage, the controller 302 can receive the URL of the webpage and the controller 302 can fetch the content of the webpage from the web. The controller 302 can pass the information related to the documents, the documents, and so on to the offline classifier 301. The offline classifier 301 can extract information from the documents including the contents of the document (text, media – images, videos and so on) and associated content (such as metadata, properties, and so on). The offline classifier 301 can also extract information such as hyperlinks present in the document; other documents that link to this document and so on. Based on the extracted information, the offline classifier 301 can classify the document, using a suitable machine learning means such as multi-layer neural networks, support vector machines, and so on. The offline classifier 301 can communicate the classification, as performed, to the controller 302.
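For a webpage, the extraction of text and hyperlinks mentioned above can be sketched with the Python standard library HTML parser. The class and function names are hypothetical, and the sketch deliberately ignores media, metadata and inbound links, which the offline classifier 301 can also use.

```python
from html.parser import HTMLParser
from typing import Dict, List, Union


class LinkAndTextExtractor(HTMLParser):
    """Collects visible text fragments and href targets of anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract_features(html: str) -> Dict[str, Union[str, List[str]]]:
    """Return the text and outgoing links of a webpage as raw features."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "links": parser.links}
```

The extracted text and links would then be turned into features for whichever machine learning means the server uses; that step is outside this sketch. Note that this simple parser also collects the contents of script and style elements as text, which a fuller implementation would filter out.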
[0036] The controller 302 can simulate the preliminary classification, as performed by the device 101. The controller 302 can classify the document using the same classification methodology and/or rules as used by the device 101.
[0037] The controller 302 can compare the classifications, namely the simulation of the preliminary classification and the classification performed by the offline classifier 301. Based on the comparison, the controller 302 can tag the document either for creating rules for the rule based classification engine 201 (present in the device 101) or for feeding back into the training data of the device 101 to boost the performance of the machine learning based classification engine 202. In an example, consider a scenario where the simulation of the preliminary classification classifies a particular document as belonging to a particular class and the offline classifier 301 classifies the same document as belonging to another class. In this scenario, the controller 302 can provide information related to the classification and the document to the analysis module 103. The controller 302 can also accept input from the user that the classification of the document by the device 101 is not correct. The controller 302 can also receive feedback from the user and/or the device 101, regarding the correct classification of the document.
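The comparison step can be sketched as below. The Mismatch record and the function name are hypothetical; the sketch only shows that a document is forwarded for analysis when the two classifications differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mismatch:
    """Hypothetical record passed to the analysis module on disagreement."""
    document_id: str
    simulated_label: Optional[str]
    offline_label: Optional[str]


def compare_classifications(
    document_id: str,
    simulated_label: Optional[str],
    offline_label: Optional[str],
) -> Optional[Mismatch]:
    """Return a Mismatch when the two classifications differ, else None."""
    if simulated_label == offline_label:
        return None  # classifications agree; nothing to analyse
    return Mismatch(document_id, simulated_label, offline_label)
```

A non-None result corresponds to the case in which the information related to the classification and the document is provided to the analysis module 103.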
[0038] In an embodiment herein, the training module 303 can perform re-training of the models used for classification using feedback data (received by the controller 302 from the analysis module 103 and sent to the training module 303). The training module 303 can use a suitable means such as Adaptive Boosting (AdaBoost) or any other equivalent means. The training module 303 can scale the weights of the misclassified instances (known through feedback from the analysis module 103 and/or the user of the device 101) and correctly classified instances appropriately through an algorithm (such as AdaBoost, bagging algorithms and so on) to improve the performance of the classification performed by the device 101. Once the training module 303 has improved the performance of at least one feature (examples of the features can be an n-gram (such as uni-grams, bi-grams, tri-grams, and so on)) of the online classifier above a threshold, the controller 302 can incrementally update the improvements on to the device 101. In an embodiment herein, the training module 303 can use feedback provided by the analysis module 103 about misclassifications previously performed by the device 101 and/or the server 102, assigning a higher weightage to the feedback about the previous misclassifications while using the feedback to improve the classification.
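The scaling of instance weights can be illustrated with the standard discrete AdaBoost update, in which misclassified instances are scaled up by exp(alpha) and correctly classified instances are scaled down by exp(-alpha), where alpha = 0.5 * ln((1 - error) / error). This is a sketch of the textbook rule only; the function name is hypothetical, and the restriction of the weighted error to the open interval (0, 0.5) is an assumption made here so that alpha is positive.

```python
import math
from typing import List


def adaboost_reweight(weights: List[float], misclassified: List[bool]) -> List[float]:
    """One round of the discrete AdaBoost instance-weight update."""
    total = sum(weights)
    weights = [w / total for w in weights]  # work with a normalised distribution
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 0.5:
        # Assumption of this sketch: the classifier is better than chance.
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    norm = sum(scaled)
    return [w / norm for w in scaled]
```

After one round, the misclassified instances together carry half of the total weight, so the next model trained on the reweighted data concentrates on the documents identified as misclassified by the feedback.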
[0039] The controller 302 can perform the update in an incremental manner so that only those features of the classifier used by the device 101 for which there is a significant change in the parameter values will be updated (wherein the degree of significance can be defined by the user or any other authorized person such as an administrator and so on), thereby resulting in a reduction in the data transfer required for updating the model on the device 101. The controller 302 can perform the update at pre-defined intervals of time.
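The incremental update can be sketched as a comparison of per-feature parameter values against a significance threshold. The dictionary representation of the model, the feature names and the threshold value are assumptions made for illustration.

```python
from typing import Dict


def incremental_update(
    device_params: Dict[str, float],
    retrained_params: Dict[str, float],
    threshold: float,
) -> Dict[str, float]:
    """Return only the features that are new or changed by more than threshold."""
    delta: Dict[str, float] = {}
    for feature, new_value in retrained_params.items():
        old_value = device_params.get(feature)
        if old_value is None or abs(new_value - old_value) > threshold:
            delta[feature] = new_value
    return delta


# Hypothetical n-gram weights held on the device and produced by re-training.
device_model = {"goal": 0.80, "match": 0.10}
retrained_model = {"goal": 0.81, "match": 0.45, "league": 0.30}
update = incremental_update(device_model, retrained_model, threshold=0.05)
device_model.update(update)  # only the delta is transferred and applied
```

Here only the entries for "match" and "league" are sent to the device, since the change in "goal" is below the threshold. The same delta could be applied to the model used for the simulation so that it continues to reflect the model on the device.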
[0040] The controller 302 can further update and/or introduce rules to the device 101, based on inputs from the analysis module 103. Embodiments herein can update the rules onto the device 101 in an incremental manner, thereby resulting in a reduction in the data transfer required for updating the rules on the device 101. Embodiments herein can update the rules onto the device 101 at pre-defined intervals.
[0041] The controller 302 can maintain the classification model used for the simulation in a manner, so as to reflect the classification model used by the device 101 for the preliminary classification. The controller 302 can reflect the updates performed on the device 101, into the classification model used for the simulation.
[0042] FIG. 4 depicts an analysis module for providing feedback, according to embodiments as disclosed herein. The analysis module 103, as depicted, comprises a rules generation engine 401 and a classification evaluation module 402. The analysis module 103 can receive the information related to the classification and the documents from the server 102. The information related to the document can comprise the document, the location of the document, further analysis done by the server 102 on the document, and so on. The information related to the classification can comprise the results of the simulation of the preliminary classification, the classification done by the server 102, the result of the comparison of the simulation of the preliminary classification and the classification done by the server 102, information provided by a user related to the classification, and so on. The classification evaluation module 402 can examine documents, which have been classified by the server 102. The classification evaluation module 402 can verify the classifications in an independent manner by comparing the classifications. In an embodiment herein, the classification evaluation module 402 can verify the classifications by comparing the classifications in an automatic manner. In an embodiment herein, the classification evaluation module 402 can verify the classifications by comparing the classifications manually. In an embodiment herein, the classification evaluation module 402 can verify the classifications by comparing the classifications using a combination of automatic and manual means. If the classification evaluation module 402 identifies a document that has been misclassified (by the simulation of the preliminary classification and/or the offline classification), the classification evaluation module 402 can provide the feedback to the server 102. The feedback can comprise the correct classification of the document, the classification model that misclassified the document (which can be either the model in the server 102 or the simulation of the preliminary classification) and so on.
[0043] Once the classification evaluation module 402 has verified the classification, the rules generation engine 401 can create at least one rule based on the verified classification. The rules can be based on facts or domain knowledge related to the document. For example, a rule could be based on the name of the document, the sender of an email, the creator of a document, the presence of specific words within the document, and so on. The rules generation engine 401 can generate the rules in an independent manner. In embodiments herein, the rules generation engine 401 can generate the rules in an automatic manner, manually, or using a combination of automatic and manual means. A generated rule can be a new rule or an update of an existing rule. On creating the rules, the rules generation engine 401 can provide the created rule(s) to the device 101 (using the server 102).
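The rule types listed above (sender of an email, presence of specific words) can be illustrated with a minimal sketch. The names `Document`, `Rule`, `sender_rule`, `keyword_rule`, `upsert_rule` and `classify_with_rules` are hypothetical and are not taken from the specification; the sketch also assumes that a rule is identified by an id, so that a rule with an existing id is treated as an update of that rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Document:
    # Hypothetical document record; field names are assumptions.
    name: str
    sender: str
    text: str


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule: a predicate over a document plus the class it assigns.
    rule_id: str
    label: str
    matches: Callable[[Document], bool]


def sender_rule(rule_id: str, sender: str, label: str) -> Rule:
    """Rule based on the sender of an email (case-insensitive match)."""
    return Rule(rule_id, label, lambda doc: doc.sender.lower() == sender.lower())


def keyword_rule(rule_id: str, keyword: str, label: str) -> Rule:
    """Rule based on the presence of a specific word within the document."""
    return Rule(rule_id, label, lambda doc: keyword.lower() in doc.text.lower())


def upsert_rule(rules: Dict[str, Rule], rule: Rule) -> Dict[str, Rule]:
    """Return a copy of the rule set with the rule added, or replaced if its id exists."""
    updated = dict(rules)
    updated[rule.rule_id] = rule
    return updated


def classify_with_rules(rules: Dict[str, Rule], doc: Document) -> Optional[str]:
    """Return the label of the first matching rule, or None if no rule matches."""
    for rule in rules.values():
        if rule.matches(doc):
            return rule.label
    return None
```

In this sketch a new rule and an update of an existing rule go through the same `upsert_rule` call, which is one simple way to model the two cases described above.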
[0044] FIG. 5 depicts the process of performing classification of an incoming document, according to embodiments as disclosed herein. The device 101 can receive a request to access a document. The rule based classification engine 201 and the machine learning based classification engine 202 can perform a preliminary classification within the device 101 with a very low latency (which can typically be less than 100 milliseconds). The device 101 can provide the information related to the document to the server 102. The server 102 can simulate the preliminary classification performed by the device 101. The server 102 can also perform the offline classification, using suitable techniques such as machine learning, to classify documents with a very high accuracy level. The server 102 can compare the simulation of the preliminary classification and the offline classification. If there is a mismatch between the simulation of the preliminary classification and the offline classification, the server 102 can provide information regarding the classification done by the device 101 and the server 102 to the analysis module 103. The analysis module 103 can analyze the classification (using at least one of an automated analysis, a manual analysis, and a combination of a manual and automated analysis) and can provide feedback to the server 102 based on the analysis. The analysis module 103 can further generate/update rules based on the analysis, and can provide inputs to the device 101 regarding the generated/updated rules. The server 102, on receiving the feedback from the analysis module 103, can retrain the classification models present in the server 102 and can update the device 101 based on the feedback. The update of the device 101 can be done in an incremental, phased manner, and can be done at pre-defined intervals. The server 102, on receiving the generated/updated rules from the analysis module 103, can update the rules on the device 101. The update of the rules on the device 101 can likewise be done in an incremental, phased manner, and can be done at pre-defined intervals.
[0045] FIG. 6 is a flowchart illustrating the process of performing the preliminary classification, according to embodiments as disclosed herein. On receiving a request for a document, the device 101 fetches (601) the document and associated content (such as embedded data and so on). The rule based classification engine 201 classifies (602) the document using at least one rule, which can be based on domain knowledge, blacklists of documents, whitelists of documents, a previously performed classification of the document, and so on. The machine learning based classification engine 202 determines (603) at least one class to which the requested document belongs, using a plurality of classifiers that run in a parallel manner. The machine learning based classification engine 202 then determines (604) a final classification for the document by combining the outputs of the plurality of classifiers. The various actions in method 600 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 6 may be omitted.
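The flow of FIG. 6 can be sketched as follows. This is an illustrative sketch only: the function names, the "blocked"/"allowed" labels, the use of a thread pool for parallelism and the majority vote used to combine classifier outputs are all assumptions, as is the choice to let a rule match take precedence over the machine learning result. The specification does not fix any of these details.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

# Hypothetical classifier type: takes document text, returns a class label.
Classifier = Callable[[str], str]


def rule_based_classify(text: str,
                        blacklist: Sequence[str],
                        whitelist: Sequence[str]) -> Optional[str]:
    """Step 602 (sketch): classify using blacklist/whitelist terms, if any match."""
    lowered = text.lower()
    if any(term in lowered for term in blacklist):
        return "blocked"
    if any(term in lowered for term in whitelist):
        return "allowed"
    return None


def ml_classify(text: str, classifiers: Sequence[Classifier]) -> str:
    """Steps 603-604 (sketch): run the classifiers in parallel, then combine by majority vote."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        votes = list(pool.map(lambda clf: clf(text), classifiers))
    return Counter(votes).most_common(1)[0][0]


def preliminary_classify(text: str,
                         classifiers: Sequence[Classifier],
                         blacklist: Sequence[str] = (),
                         whitelist: Sequence[str] = ()) -> str:
    """Assumption: a rule match decides the class; otherwise the ML vote is used."""
    rule_label = rule_based_classify(text, blacklist, whitelist)
    if rule_label is not None:
        return rule_label
    return ml_classify(text, classifiers)
```

Majority voting is only one possible way of combining the classifier outputs; weighted voting or a learned combiner would fit the same structure.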
[0046] FIG. 7 is a flowchart depicting the process of the offline classification, according to embodiments as disclosed herein. The server 102 receives (701) the information related to the documents from the device 101. The server 102 extracts (702) information from the documents, including the contents of the document (text and media such as images, videos and so on), hyperlinks present in the document, other documents that link to this document, and so on. Based on the extracted information, the server 102 classifies (703) the document using two classification models. The first classification model can be the simulation of the preliminary classification. The second classification model can use a suitable machine learning means such as multi-layer neural networks, support vector machines, and so on for performing the classification. The server 102 checks (704) if the classifications done by the two classification models match, by comparing the simulation of the preliminary classification and the offline classification as performed by the server 102. If the classifications done by the two models match, the server 102 tags (705) the document so that it can be used for creating rules for the rule based classification engine 201 (present in the device 101), or for feeding back into the training data of the device 101 to boost the performance of the machine learning based classification engine 202. If the classifications done by the two models do not match, the server 102 provides (706) information related to the classification and the documents to the analysis module 103. The various actions in method 700 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 7 may be omitted.
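Steps 703 to 706 can be sketched as a small comparison routine. The names `Comparison` and `compare_classifications`, and the use of two plain lists to stand in for the tagged documents and the documents sent to the analysis module 103, are assumptions made for illustration; the two models are passed in as ordinary callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comparison:
    # Hypothetical record of one server-side comparison.
    document: str
    simulated_label: str   # result of the simulated preliminary classification
    offline_label: str     # result of the offline classification

    @property
    def match(self) -> bool:
        return self.simulated_label == self.offline_label


def compare_classifications(document: str,
                            simulate_preliminary: Callable[[str], str],
                            classify_offline: Callable[[str], str],
                            agreed: List[Comparison],
                            for_review: List[Comparison]) -> Comparison:
    """Classify with both models (703), compare (704), then tag (705) or escalate (706)."""
    result = Comparison(document,
                        simulate_preliminary(document),
                        classify_offline(document))
    if result.match:
        agreed.append(result)       # candidate for rule creation or training data
    else:
        for_review.append(result)   # handed to the analysis module
    return result
```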
[0047] FIG. 8 is a flowchart depicting the process of providing feedback and generating rules, according to embodiments as disclosed herein. The analysis module 103 receives (801) the documents, information related to the documents, and the classification of the documents as done by the simulation of the preliminary classification and the offline classification. The analysis module 103 analyzes (802) the document and checks (803) if the document has been misclassified. The analysis module 103 performs the analysis and the check using at least one of automated means, manual means or a combination of automated and manual means. If the analysis module 103 detects that the document has been misclassified, the analysis module 103 provides (804) feedback to the server 102. The feedback can comprise the correct classification of the document, the methodologies or criteria used for determining the classification of the document, and so on. If the document has not been misclassified, the analysis module 103 checks (805) if at least one rule can be generated based on the classification done for the document. If at least one rule can be generated, the analysis module 103 generates (806) the rule(s) and provides (807) the rule(s) to the server 102. The various actions in method 800 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 8 may be omitted.
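The branch structure of FIG. 8 can be sketched as below. The `Feedback` record, the `analyze` function, and the `reviewer` and `make_rule` callables are hypothetical: `reviewer` stands in for whatever automated and/or manual means determines the correct classification, and `make_rule` stands in for rule generation and may return None when no rule can be generated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    # Hypothetical feedback record sent to the server.
    correct_label: str
    misclassified_by: List[str]  # "simulation" and/or "offline"


def analyze(document: str,
            simulated_label: str,
            offline_label: str,
            reviewer: Callable[[str], str],
            make_rule: Callable[[str, str], Optional[str]],
            ) -> Tuple[Optional[Feedback], Optional[str]]:
    """Return (feedback, None) on misclassification (803-804), else (None, rule) (805-807)."""
    correct = reviewer(document)  # step 802: automated and/or manual analysis
    labels = {"simulation": simulated_label, "offline": offline_label}
    wrong = [name for name, label in labels.items() if label != correct]
    if wrong:
        return Feedback(correct, wrong), None
    # No misclassification: try to derive a rule from the verified classification.
    return None, make_rule(document, correct)
```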
[0048] FIG. 9 is a flowchart illustrating the process of the offline classification module updating the device based on feedback provided by the analysis module, according to embodiments as disclosed herein. On receiving (901) feedback from the analysis module 103, the server 102 checks (902) the feedback to determine whether the classification done by the simulation or the classification done by the offline classification was incorrect. If the classification done by the offline classification was incorrect, the server 102 retrains (903) the classification model used for the offline classification. If the classification done by the simulation was incorrect, the server 102 retrains (904) the model used for the simulation in the server 102 and updates (905) the classification model used in the preliminary classification by the device 101. The server 102 can update the classification model used in the preliminary classification by the device 101 in a phased manner at pre-defined intervals. The server 102 can retrain the model used for the simulation in such a manner that the model used for the simulation and the model used by the device 101 for performing the preliminary classification remain the same. The various actions in method 900 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 9 may be omitted.
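The routing of feedback in FIG. 9 can be sketched with two small classes. Everything here is an assumption made for illustration: the `Server` and `Device` classes, the use of an integer version number as a stand-in for a retrained model, and the pull-based `pull_update` call that models an update applied at a pre-defined interval rather than immediately.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    # Version counters stand in for the actual models (an assumption).
    offline_version: int = 1
    simulation_version: int = 1
    pending_device_updates: List[int] = field(default_factory=list)

    def handle_feedback(self, misclassified_by: List[str]) -> None:
        """Steps 902-905 (sketch): retrain whichever model the feedback names."""
        if "offline" in misclassified_by:
            self.offline_version += 1            # stands in for retraining (903)
        if "simulation" in misclassified_by:
            self.simulation_version += 1         # stands in for retraining (904)
            # Queue the device update (905) instead of pushing it immediately.
            self.pending_device_updates.append(self.simulation_version)


@dataclass
class Device:
    model_version: int = 1

    def pull_update(self, server: Server) -> None:
        """Called at a pre-defined interval; applies the latest queued model, if any."""
        if server.pending_device_updates:
            self.model_version = server.pending_device_updates[-1]
            server.pending_device_updates.clear()
```

After `pull_update` runs, the device model and the simulation model in this sketch carry the same version, which mirrors the requirement that the two remain the same.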
[0049] FIG. 10 is a flowchart illustrating the process of the offline classification module updating the device based on rule(s) provided by the analysis module, according to embodiments as disclosed herein. On receiving (1001) generated rule(s) from the analysis module 103, the server 102 retrains (1002) the simulation model of the preliminary classification present in the server 102 using the rules and also updates (1003) the device 101. The various actions in method 1000 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 10 may be omitted.
[0050] Embodiments herein can keep track of dynamic content available in the documents. For example, a webpage which spreads spam constantly changes the way the webpage is written, in terms of the content, the terms used, the language and so on, to avoid detection by filtering algorithms. In another example, technical product review webpages, forums and so on keep changing their terminology as new technology trends set in. In such scenarios, embodiments as disclosed can classify the documents in an effective manner, as the server 102 and the analysis module 103 can monitor the changes in content and update the machine learning models accordingly.
[0051] The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. The elements shown in Figs. 1, 2, 3 and 4 can be at least one of a hardware device, or a combination of hardware device and software module.
[0052] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
STATEMENT OF CLAIMS
We claim:
1. A method for classifying a document, said method comprising
performing a preliminary classification of a document;
performing an offline classification of said document;
performing a simulation of said preliminary classification;
comparing classification performed by said offline classification and classification performed by said simulation of said preliminary classification;
providing feedback to said server, if classification performed by said offline classification and classification performed by said simulation of said preliminary classification do not match; and
updating at least one of classification model used for said preliminary classification, classification model used for said simulation and classification model used for said offline classification, based on said feedback.
2. The method, as claimed in claim 1, wherein said preliminary classification comprises at least one of
performing a rule based classification of said document using at least one rule; and
performing machine learning based classification of said document, wherein said machine learning based classification further comprises of
classifying said document into at least one class by a plurality of classifiers, wherein said plurality of classifiers run in a parallel manner; and
determining a classification of said document by combining outputs of said plurality of classifiers.
3. The method, as claimed in claim 2, wherein said method further comprises of updating said machine learning based classification based on feedback provided by said server.
4. The method, as claimed in claim 1, wherein said offline classification comprises of classifying said document based on information extracted from said document.
5. The method, as claimed in claim 1, wherein providing feedback further comprises of analyzing said document, classification performed by said offline classification and classification performed by said simulation using at least one of an automated means, a manual means, and a combination of an automated means and a manual means.
6. The method, as claimed in claim 1, wherein said method further comprises of
generating at least one rule; and
updating said at least one rule on said device.
7. The method, as claimed in claim 1, wherein updating classification model used for said preliminary classification is done in a phased manner at pre-defined intervals.
8. A system for classifying a document, said system comprising of
at least one device configured for
performing a preliminary classification of a document;
a server configured for
performing an offline classification of said document;
performing a simulation of said preliminary classification;
comparing classification performed by said offline classification and classification performed by said simulation of said preliminary classification;
an analysis module configured for
providing feedback to said server, if classification performed by said offline classification and classification performed by said simulation of said preliminary classification do not match; and
said server further configured for
updating at least one of classification model used for said preliminary classification, classification model used for said simulation and classification model used for said offline classification, based on said feedback.
9. The system, as claimed in claim 8, wherein said device is further configured for performing said preliminary classification by
performing a rule based classification of said document using at least one rule; and
performing machine learning based classification of said document, wherein said machine learning based classification further comprises of
classifying said document into at least one class by a plurality of classifiers, wherein said plurality of classifiers run in a parallel manner; and
determining a classification of said document by combining outputs of said plurality of classifiers.
10. The system, as claimed in claim 9, wherein said device is further configured for updating said machine learning based classification based on feedback provided by said server.
11. The system, as claimed in claim 8, wherein said server is further configured for performing said offline classification by classifying said document based on information extracted from said document.
12. The system, as claimed in claim 8, wherein said analysis module is further configured for providing feedback by analyzing said document, classification performed by said offline classification and classification performed by said simulation using at least one of an automated means, a manual means, and a combination of an automated means and a manual means.
13. The system, as claimed in claim 8, wherein said analysis module is further configured for generating at least one rule; and said server is further configured for updating said at least one rule on said device.
14. The system, as claimed in claim 8, wherein said server is further configured for updating classification model used for said preliminary classification in a phased manner at pre-defined intervals.
Dated 02nd July 2015
Signature:
Name: Kalyan Chakravarthy
Patent Agent
ABSTRACT
Method and system for performing adaptive document classification. The present invention relates to the field of data analytics and more particularly to performing analytics of data present in a document to classify the document. Embodiments herein disclose a method and system for adaptive document classification assisted by an online means, wherein the classification can be performed on a device being used to classify the document in real-time and classification is improved in a continuous manner by learning from feedback.
FIG. 5
| # | Name | Date |
|---|---|---|
| 1 | Form5.pdf | 2015-07-06 |
| 2 | FORM3.pdf | 2015-07-06 |
| 3 | Form 2.pdf | 2015-07-06 |
| 4 | Drawings_CS.pdf | 2015-07-06 |
| 5 | abstract 3394-CHE-2015.jpg | 2015-09-30 |
| 6 | 3394-CHE-2015-Power of Attorney-110915.pdf | 2015-11-23 |
| 7 | 3394-CHE-2015-Form 1-110915.pdf | 2015-11-23 |
| 8 | 3394-CHE-2015-Correspondence-110915.pdf | 2015-11-23 |
| 9 | 3394-CHE-2015-FORM-26 [15-03-2018(online)].pdf | 2018-03-15 |
| 10 | 3394-CHE-2015-FORM-26 [16-03-2018(online)].pdf | 2018-03-16 |
| 11 | 3394-CHE-2015-FER.pdf | 2019-12-13 |
| 12 | 3394-CHE-2015-OTHERS [01-06-2020(online)].pdf | 2020-06-01 |
| 13 | 3394-CHE-2015-FER_SER_REPLY [01-06-2020(online)].pdf | 2020-06-01 |
| 14 | 3394-CHE-2015-CORRESPONDENCE [01-06-2020(online)].pdf | 2020-06-01 |
| 15 | 3394-CHE-2015-CLAIMS [01-06-2020(online)].pdf | 2020-06-01 |
| 16 | 3394-CHE-2015-ABSTRACT [01-06-2020(online)].pdf | 2020-06-01 |
| 17 | 3394-CHE-2015-US(14)-HearingNotice-(HearingDate-10-01-2024).pdf | 2023-12-05 |
| 18 | 3394-CHE-2015-FORM-26 [04-01-2024(online)].pdf | 2024-01-04 |
| 19 | 3394-CHE-2015-Correspondence to notify the Controller [04-01-2024(online)].pdf | 2024-01-04 |
| 20 | 3394-CHE-2015-Annexure [04-01-2024(online)].pdf | 2024-01-04 |
| 21 | 3394-CHE-2015-Written submissions and relevant documents [22-01-2024(online)].pdf | 2024-01-22 |
| 22 | 3394-CHE-2015-Annexure [22-01-2024(online)].pdf | 2024-01-22 |
| 23 | 3394-CHE-2015-Annexure [22-01-2024(online)]-1.pdf | 2024-01-22 |
| 24 | 3394-CHE-2015-PatentCertificate25-01-2024.pdf | 2024-01-25 |
| 25 | 3394-CHE-2015-IntimationOfGrant25-01-2024.pdf | 2024-01-25 |
| 1 | SearchStrategyMatrix_12-12-2019.pdf |