Method For Identifying Confidential Data Using Unsupervised Machine Learning In Data Leakage Prevention

Abstract: In today’s business world, many organizations use information systems to manage their confidential information, and protecting that information is critical. Data leakage has become an important threat, especially leakage caused by insiders of the organization. Data Leakage Prevention (DLP) is one of the methods for effectively preventing data leakage. A DLP system stops the transfer of confidential data from the organization’s network to the outside world, so DLP solutions must be able to identify and protect confidential data within the organization. A content-aware DLP solution can read all the data contained within a file, identify confidential data and protect the organization’s data. Content-aware DLP solutions that use context information classify confidential data properly and provide more protection to the organization’s data. The proposed invention prevents data leakage caused by insiders of the organization by using the context of the content. The existing data leakage prevention methods (keyword-based, phrase-based and statistical) identify the confidentiality of a document based on specific keywords, phrases or statistical values. The keyword-based and phrase-based methods ignore the context of the keyword, while statistical methods ignore the content of the analyzed text. 5 claims & 1 Figure

Patent Information

Application #
Filing Date
11 December 2021
Publication Number
05/2022
Publication Type
INA
Invention Field
COMMUNICATION
Status
Email
ipfc@mlrinstitutions.ac.in
Parent Application

Applicants

MLR Institute of Technology
Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad

Inventors

1. Dr. P. Subhashini
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
2. Dr. K Srinivas Rao
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
3. Dr. P Chinnasamy
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
4. Dr. A Kiran
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
5. Mr. Kashi Sai Prasad
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
6. Mrs. Soleti Navya
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
7. Ms. N Sandhya Rani
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad
8. Mrs. Appam Ashwini
Department of Computer Science and Engineering, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043, Medchal–District, Hyderabad

Specification

Claims: The scope of the invention is defined by the following claims:

Claim:
1. A system/method for confidential data identification using an unsupervised machine learning algorithm in data leakage prevention, said system/method comprising the steps of:
a) The system maintains the confidential and non-confidential documents in separate repositories in a training phase (1).
b) The Language Model (2) is created for confidential and non-confidential documents by two different processes, Cluster Creation (3) and Confidential Terms Detection (4).
c) The trained documents are tested in the testing phase (5).
d) The testing phase starts with document processing (6) and term frequency vector creation (7).
e) These are followed by similar cluster identification (8), confidentiality value calculation (9) and, in the final stage, detection (10).
2. As per claim 1, the training phase maintains the confidential and non-confidential documents in different repositories.
3. According to claim 1, separate clusters are created for confidential and non-confidential documents.
4. As per claim 1, the term probability is calculated for each term in the clusters, and a confidential score is calculated based on the occurrence of a term in confidential and non-confidential clusters. Finally, confidential terms are identified for each confidential cluster.
5. As per claim 1, the terms of the tested document are compared with the confidential terms of claim 4 to detect the confidentiality of the tested document.

Description:
Field of Invention
The present invention focuses on the confidentiality service provided by information security. For every organization, protecting confidential data from leakage is an important issue. The most difficult task is to identify data leakage caused by insiders of the organization.
Background of the Invention
Recent keyword-based methods detect data leakage by looking for specific keywords and phrases only. If a document does not contain the predefined set of keywords, the keyword-based method treats it as a non-confidential document. Keyword-based methods ignore the context of the keywords, because they combine confidential and non-confidential document terms before looking for the specific keywords. Statistical methods ignore the content of the analyzed text when detecting the confidentiality of the test document. As a result, the existing keyword-based and statistical confidentiality detection methods suffer from a high false positive rate (FPR).
Katz, Gilad et al. [2014] (Information Sciences, Elsevier, 107-128) described a method called CoBAn: A Context Based Model For Data Leakage Prevention. Abbadi and Alawneh [2008] (Proceedings, International Conference on Emerging Security Information System and Technology, 99-106) proposed a framework that allows authorized users to access sensitive information from inside or outside an organization’s premises. Zilberman, Polina et al. [2011] (ISI, IEEE) focused on the analysis of group communication for preventing data leakage via email in the organization. Hanan Alhindi et al. [2021] (Journal of Internet Services and Information Security (JISIS), volume: 11, number:, pp. 78-99) detected altered sensitive data leakage effectively by combining semantic similarity and semantic relevance metrics, which are based on an ontology. This approach uses the fact that a large portion of network traffic is repeated or constrained by protocol specifications.
The system uses network-based sensors that process network traffic to produce information-use events. Yaseen and Panda [2009] (IEEE Conference on Computational Science and Engineering, 450-455) used a data-centric approach; this method uses dependency graphs based on domain-expert knowledge. These dependency graphs are used to predict the ability of a user to infer sensitive information that might harm the organization from information that he/she has already obtained. The patent US9626528B2 focused on the classification of documents based on enforcement techniques. The patent US20150254469A1 provides a data leakage prevention system, method, and computer program product for preventing a predefined type of operation on predetermined data. The document classification application creates one or more communicative discourse trees from the discourse trees by matching each elementary discourse unit in a discourse tree that has a verb to a verb signature.
The objective of the proposed invention is to prevent data leakage caused by insiders of the organization using the content of the document. The existing data leakage prevention methods (keyword-based, phrase-based and statistical) identify the confidentiality of a document based on specific keywords, phrases or statistical values.
Summary of the Invention
In light of the above-mentioned drawbacks in the prior art, the present invention aims to identify confidential data in data leakage prevention. Once a document is marked as confidential, it is possible to prevent that document from leaking.
The proposed content matching methods are designed in such a way that confidential data is prevented from leaking. Content matching methods thus help in preventing data leakage attacks.
Brief Description of Drawings
The invention will be described in detail with reference to the exemplary embodiments shown in the figures wherein:
Figure 1 Architecture of the methodology.
Detailed Description of the Invention
Data Leakage Prevention (DLP) is one of the methods for effectively preventing data leakage. A DLP system stops the transfer of confidential data from the organization’s network to the outside world. DLP solutions must be able to identify and protect confidential data within the organization. A content-aware DLP solution can read all the data contained within a file, identify confidential data and protect the organization’s data.
The proposed method, “Confidential Data Identification using Unsupervised Machine Learning Technique in Data Leakage Prevention”, takes the best of both the keyword and statistical approaches and identifies the confidential content of documents with a low false positive rate. It also determines the level of threat that a leak presents to the organization and proposes a score for determining the level of the data leakage.
The proposed method consists of two phases: a training phase and a testing phase. During the training phase, clusters of documents with confidential terms are generated. In the testing phase, each tested document is assigned to clusters, and its content is then matched against each cluster’s respective confidential terms to determine the confidentiality of the document.
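The testing-phase steps of document processing, term frequency vector creation and similar cluster identification can be illustrated with a minimal sketch. The specification does not name a tokenizer or a similarity measure, so the regular-expression tokenizer, the use of cosine similarity, the function names and the example centroids below are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def term_frequency_vector(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Counter returns 0 for terms it has not seen, so missing terms add nothing.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_cluster(document_text, cluster_centroids):
    """Return the name of the cluster whose centroid is closest to the document."""
    doc_vec = term_frequency_vector(document_text)
    return max(
        cluster_centroids,
        key=lambda name: cosine_similarity(doc_vec, cluster_centroids[name]),
    )


# Hypothetical centroids that a training phase might have produced.
centroids = {
    "finance": Counter({"salary": 3, "budget": 2, "payroll": 2}),
    "sports": Counter({"match": 3, "team": 2, "score": 2}),
}
print(most_similar_cluster("The payroll budget lists every salary", centroids))
```

In this sketch a tested document is assigned to a single closest cluster; a variant that keeps every cluster above a similarity cut-off would match the description of a document being assigned to several clusters.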
The proposed method maintains the confidential and non-confidential documents separately and identifies confidential data through confidential terms. Confidential terms serve as an initial indication of the presence of confidential content in a document. The proposed method uses a language modeling technique to identify confidential terms.
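One possible reading of language-model-based confidential term identification is sketched below. It assumes a unigram model with add-one smoothing and scores each term by the ratio of its probability in the confidential documents to its probability in the non-confidential documents. The specification does not give the exact formula, so the ratio, the smoothing, the threshold value and all names are assumptions.

```python
from collections import Counter


def term_probabilities(documents, vocabulary):
    """Unigram language model with add-one smoothing over a shared vocabulary."""
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values()) + len(vocabulary)
    return {term: (counts[term] + 1) / total for term in vocabulary}


def confidential_terms(confidential_docs, non_confidential_docs, threshold=1.5):
    """Keep terms whose P(term | confidential) / P(term | non-confidential) exceeds the threshold."""
    vocabulary = {t for doc in confidential_docs + non_confidential_docs for t in doc}
    p_conf = term_probabilities(confidential_docs, vocabulary)
    p_non = term_probabilities(non_confidential_docs, vocabulary)
    scores = {t: p_conf[t] / p_non[t] for t in vocabulary}
    return {t: s for t, s in scores.items() if s > threshold}


# Hypothetical, already-tokenized documents from the two repositories.
conf = [["merger", "price", "merger"], ["merger", "terms"]]
non = [["price", "list"], ["public", "price", "notice"]]
terms = confidential_terms(conf, non)
print(sorted(terms))  # ['merger', 'terms']
```

Because "price" is more frequent in the non-confidential documents than in the confidential ones, its ratio falls below the threshold and it is not treated as a confidential term, which is the behavior that keeping the two repositories separate is meant to enable.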
The training phase covers the process of language model creation. The testing phase covers the process of identifying similar clusters for a tested document and the process of detecting the confidentiality of the tested document. A data leakage prevention system uses content matching methods to control data leakage in the organization. The content matching methods classify the organization’s documents into confidential and non-confidential. This invention proposes a data leakage prevention solution using content-based matching methods. Content-based methods identify the confidentiality of a document based on the content of the document, and context-based methods identify the confidentiality of a document based on the content and its context. The proposed invention suggests four models for data leakage prevention using content matching and context-based content matching methods.
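The final testing-phase steps, confidentiality value calculation and detection, might look like the following sketch. The specification does not state how the confidentiality value is computed; summing the scores of the matched confidential terms and comparing the sum with a fixed threshold is an assumption, as are the function names and the example scores.

```python
def confidentiality_value(document_terms, cluster_term_scores):
    """Sum the scores of the cluster's confidential terms that occur in the document."""
    present = set(document_terms)
    return sum(score for term, score in cluster_term_scores.items() if term in present)


def is_confidential(document_terms, cluster_term_scores, threshold):
    """Flag the document when its confidentiality value reaches the threshold."""
    return confidentiality_value(document_terms, cluster_term_scores) >= threshold


# Hypothetical confidential-term scores for the cluster the document was assigned to.
cluster_scores = {"merger": 4.0, "terms": 2.0}

print(is_confidential(["the", "merger", "closes", "friday"], cluster_scores, threshold=3.0))
print(is_confidential(["lunch", "menu"], cluster_scores, threshold=3.0))
```

Each distinct term is counted once here regardless of how often it appears; weighting by term frequency would be an equally plausible choice.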
Our invention identifies the confidentiality of a document using a content-based language methodology. The performance of the work has been evaluated using Reuters news articles, Enron email and a real-time dataset collected from the JNTUH Academic Audit Cell. On all three datasets, the proposed model successfully reduced the false positive rate. The performance of the proposed method is compared with a baseline method, CoBAn-No Context.
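For reference, the false positive rate used as the evaluation measure is the standard ratio FP / (FP + TN), where a positive is a document flagged as confidential. The sketch below computes it from boolean labels; the labels shown are made up for illustration and are not results from the datasets named above.

```python
def false_positive_rate(true_labels, predicted_labels):
    """FPR = FP / (FP + TN), where True means 'confidential'."""
    pairs = list(zip(true_labels, predicted_labels))
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    if fp + tn == 0:
        return 0.0  # no non-confidential documents to misclassify
    return fp / (fp + tn)


# Illustrative labels only: three non-confidential documents, one wrongly flagged.
truth = [True, False, False, False, True]
predicted = [True, True, False, False, False]
print(false_positive_rate(truth, predicted))
```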
5 Claims & 1 Figure

Documents

Application Documents

# Name Date
1 202141057659-REQUEST FOR EARLY PUBLICATION(FORM-9) [11-12-2021(online)].pdf 2021-12-11
2 202141057659-FORM-9 [11-12-2021(online)].pdf 2021-12-11
3 202141057659-FORM FOR SMALL ENTITY(FORM-28) [11-12-2021(online)].pdf 2021-12-11
4 202141057659-FORM FOR SMALL ENTITY [11-12-2021(online)].pdf 2021-12-11
5 202141057659-FORM 1 [11-12-2021(online)].pdf 2021-12-11
6 202141057659-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [11-12-2021(online)].pdf 2021-12-11
7 202141057659-EVIDENCE FOR REGISTRATION UNDER SSI [11-12-2021(online)].pdf 2021-12-11
8 202141057659-EDUCATIONAL INSTITUTION(S) [11-12-2021(online)].pdf 2021-12-11
9 202141057659-DRAWINGS [11-12-2021(online)].pdf 2021-12-11
10 202141057659-COMPLETE SPECIFICATION [11-12-2021(online)].pdf 2021-12-11