Abstract: This disclosure relates generally to data management, and more particularly to a method and system for managing redundant, obsolete, and trivial (ROT) data. In one embodiment, a method for managing ROT documents is disclosed. The method includes receiving a document and classifying the document, using a document classification model, as either a normal document or a ROT document, along with a confidence score. The document classification model may be a domain contextualized machine learning model. The method further includes managing the document according to a document management policy based on the classification and the confidence score. Figure 2
Claims: WE CLAIM
1. A method of managing redundant, obsolete, and trivial (ROT) documents, the method comprising:
receiving, by a ROT data management device, a document;
classifying, by the ROT data management device, the document into a normal document or a ROT document along with a confidence score using a document classification model, wherein the document classification model is a domain contextualized machine learning model; and
managing, by the ROT data management device, the document according to a document management policy based on the classification and the confidence score.
2. The method of claim 1, further comprising building the document classification model by learning a relationship between domain knowledge and document attributes using a machine learning process.
3. The method of claim 2, wherein the relationship is learnt by analyzing the domain knowledge and the document attributes of a set of documents in a training data set.
4. The method of claim 2, wherein the document attributes comprise at least one of document metadata or a content category of the document.
5. The method of claim 2, wherein the domain knowledge comprises at least one of a document retention policy, a document handling policy, or a document confidentiality policy for a domain, and wherein the domain comprises at least one of a healthcare domain, a finance domain, a utility domain, a retail domain, or an e-commerce domain.
6. The method of claim 1, wherein the document management policy comprises at least one of:
deleting the ROT document with the confidence score equaling or above a first pre-defined threshold,
marking the ROT document with the confidence score below the first pre-defined threshold for further analysis,
marking the normal document with the confidence score below a second pre-defined threshold for further analysis, or
storing the normal document with the confidence score equaling or above the second pre-defined threshold.
7. The method of claim 1, further comprising forecasting a usage pattern and a usability pattern of the document based on an analysis of the document, wherein the usability pattern of the document corresponds to a criticality of the document.
8. The method of claim 7, wherein the forecasting is based on at least one of a frequency of access to the document, a history of modifications to the document, or a number of other documents that have reference to the document.
9. The method of claim 7, further comprising storing the normal document in a multi-tiered storage architecture based on the usage pattern and the usability pattern, wherein a less critical and less frequently used normal document is stored in a low-cost storage while a frequently used priority document is stored in a high-cost storage.
10. A system for managing redundant, obsolete, and trivial (ROT) documents, the system comprising:
a ROT data management device comprising at least one processor and a computer-readable medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
receiving a document;
classifying the document into a normal document or a ROT document along with a confidence score using a document classification model, wherein the document classification model is a domain contextualized machine learning model; and
managing the document according to a document management policy based on the classification and the confidence score.
11. The system of claim 10, wherein the operations further comprise building the document classification model by learning a relationship between domain knowledge and document attributes using a machine learning process, and wherein the relationship is learnt by analyzing the domain knowledge and the document attributes of a set of documents in a training data set.
12. The system of claim 11, wherein the document attributes comprise at least one of document metadata or a content category of the document, wherein the domain knowledge comprises at least one of a document retention policy, a document handling policy, or a document confidentiality policy for a domain, and wherein the domain comprises at least one of a healthcare domain, a finance domain, a utility domain, a retail domain, or an e-commerce domain.
13. The system of claim 10, wherein the document management policy comprises at least one of:
deleting the ROT document with the confidence score equaling or above a first pre-defined threshold,
marking the ROT document with the confidence score below the first pre-defined threshold for further analysis,
marking the normal document with the confidence score below a second pre-defined threshold for further analysis, or
storing the normal document with the confidence score equaling or above the second pre-defined threshold.
14. The system of claim 10, wherein the operations further comprise forecasting a usage pattern and a usability pattern of the document based on at least one of a frequency of access to the document, a history of modifications to the document, or a number of other documents that have reference to the document, and wherein the usability pattern of the document corresponds to a criticality of the document.
15. The system of claim 14, wherein the operations further comprise storing the normal document in a multi-tiered storage architecture based on the usage pattern and the usability pattern, and wherein a less critical and less frequently used normal document is stored in a low-cost storage while a frequently used priority document is stored in a high-cost storage.
Dated this 12th day of February, 2018
R Ramya Rao
Of K&S Partners
Agent for the Applicant
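Illustrative sketch (not part of the claims): the threshold-based document management policy recited in claims 6 and 13 may be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation. The names Label, Action, Classification, and apply_policy are hypothetical, the confidence score is assumed to lie between 0 and 1, and the default threshold values are placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    NORMAL = "normal"
    ROT = "rot"


class Action(Enum):
    DELETE = "delete"
    REVIEW = "mark_for_further_analysis"
    STORE = "store"


@dataclass(frozen=True)
class Classification:
    """Output of a (hypothetical) document classification model."""
    label: Label
    confidence: float  # assumption: score normalised to [0.0, 1.0]


def apply_policy(
    result: Classification,
    rot_threshold: float = 0.9,     # "first pre-defined threshold" (placeholder value)
    normal_threshold: float = 0.8,  # "second pre-defined threshold" (placeholder value)
) -> Action:
    """Map a classification and its confidence score to a management action."""
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must lie in [0.0, 1.0]")

    if result.label is Label.ROT:
        # ROT at or above the first threshold is deleted; otherwise it is
        # marked for further analysis.
        if result.confidence >= rot_threshold:
            return Action.DELETE
        return Action.REVIEW

    # Normal at or above the second threshold is stored; otherwise it is
    # marked for further analysis.
    if result.confidence >= normal_threshold:
        return Action.STORE
    return Action.REVIEW
```

For example, under these placeholder thresholds, apply_policy(Classification(Label.ROT, 0.95)) yields Action.DELETE, while a ROT classification with a score of 0.6 is only marked for further analysis. Keeping the two thresholds separate mirrors the claims, which allow deletion to require a different confidence level than storage.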
Description: TECHNICAL FIELD
This disclosure relates generally to data management, and more particularly to a method and system for managing redundant, obsolete, and trivial (ROT) data.