Method And System For Automatic Data Security And Loss Protection

Abstract: The invention relates to method (200) and system (100) for automatic data security and loss protection. The method (200) includes transforming (202) a document having a complex document format into a transformed document having a machine readable format; extracting (204) content from the transformed document using an Optical Character Recognition (OCR) technique; identifying (206) one of a presence or an absence of sensitive data within the content; determining (208) a risk type for the sensitive data using a rule-based engine upon identifying the presence of the sensitive data; remediating (210) the sensitive data based on the risk type to generate a remediated document, by applying at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique; and generating (212) a final document upon a successful validation of the remediated document based on a plurality of predefined validation rules.

Patent Information

Application #

Filing Date

30 November 2021

Publication Number

50/2021

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

docketing@inventip.in

Parent Application

Applicants

HCL Technologies Limited

806, Siddharth, 96, Nehru Place, New Delhi - 110019, INDIA

Inventors

1. Rajat Kapoor

HCL Technologies Limited Texas, USA, 210-668-8520 Texas USA

2. Saravana Jayaraman

HCL Technologies Limited Texas, USA, 210-749-9246 Texas USA

Specification

Generally, the invention relates to processing of documents. More specifically, the invention relates to a method and system for automatic data security and loss protection.
BACKGROUND
[002] Traditional manual approaches may use various content editing tools to manually redact and refine documents and other unstructured data formats for compliance with data security standards and other business-critical requirements. However, the traditional approaches face various challenges. For example, the traditional approaches lack support for Mixed Object Document Content Architecture (MODCA) document formats. In other words, the existing editing tools leveraged by Information Technology (IT) teams may not provide transformation, redaction or remediation, and validation support for MODCA document formats.
[003] Moreover, a manual redaction process for MODCA documents may be time consuming, inefficient, and unproductive. Additionally, a manual validation process of content or documents may be unreliable and insecure for confidential documents. Further, the manual approaches may not comply with the current ecosystem. Also, the existing tools may not be cost effective due to the high recurring cost of manual redaction and refinement.
[004] Therefore, there is a need for a system and method for automatic data security and loss protection that may support all types of document formats (including MODCA documents) and overcome the above-discussed drawbacks.
SUMMARY
[005] In one embodiment, a method for automatic data security and loss protection is disclosed. The method may include transforming a document having a complex document format into a transformed document having a machine readable format. The machine readable format may include at least one of a Portable Document Format (PDF) and an image, and the complex document format may include a Mixed Object Document Content Architecture (MODCA) format. The method may further include extracting content from the transformed document using an Optical Character Recognition (OCR) technique. The method may further include identifying one of a presence or an absence of sensitive data within the content. The sensitive data correspond to noncompliant data violating data privacy guidelines. The method may further include determining a risk type for the sensitive data using a rule-based engine upon identifying the presence of the sensitive data. It should be noted that rules may be framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor. The method may further include remediating the sensitive data based on the risk type to generate a remediated document, by applying at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique. The method may further include generating a final document upon a successful validation of the remediated document based on a plurality of predefined validation rules.
[006] In another embodiment, a system for automatic data security and loss protection is disclosed. The system may include a processor and a memory communicatively coupled to the processor. The memory may store processor-executable instructions, which, on execution, may cause the processor to transform a document having a complex document format into a transformed document having a machine readable format. The machine readable format includes at least one of a Portable Document Format (PDF) and an image, and the complex document format may include a Mixed Object Document Content Architecture (MODCA) format. The processor-executable instructions, on execution, may further cause the processor to extract content from the transformed document using an Optical Character Recognition (OCR) technique. The processor-executable instructions, on execution, may further cause the processor to identify one of a presence or an absence of sensitive data within the content. It should be noted that the sensitive data corresponds to noncompliant data violating data privacy guidelines. The processor-executable instructions, on execution, may further cause the processor to determine a risk type for the sensitive data using a rule-based engine upon identifying the presence of the sensitive data. It should be noted that the rules may be framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor. The processor-executable instructions, on execution, may further cause the processor to remediate the sensitive data based on the risk type to generate a remediated document, by applying at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique. The processor-executable instructions, on execution, may further cause the processor to generate a final document upon a successful validation of the remediated document based on a plurality of predefined validation rules.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The present application can be best understood by reference to the following description taken in conjunction with the accompanying drawing figures, in which like parts may be referred to by like numerals.
[009] FIG. 1 illustrates a block diagram of a system for automatic data security and loss protection, in accordance with some embodiments of the present disclosure.
[010] FIG. 2 illustrates a flow diagram of an exemplary process for automatic data security and loss protection, in accordance with some embodiments of the present disclosure.
[011] FIG. 3 illustrates a flow diagram of an exemplary process for automatic transformation and remediation of MODCA documents, in accordance with some embodiments of the present disclosure.
[012] FIG. 4 illustrates a flow diagram of an exemplary process for automatic validation of a remediated MODCA document for Payment Card Industry Data Security Standards (PCI DSS) compliance, in accordance with some embodiments of the present disclosure.
[013] FIG. 5 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS

[014] The following description is presented to enable a person of ordinary skill in the art to make and use the invention and is provided in the context of particular applications and their requirements. Various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Moreover, in the following description, numerous details are set forth for the purpose of explanation. However, one of ordinary skill in the art will realize that the invention might be practiced without the use of these specific details. In other instances, well-known structures and devices are shown in block diagram form in order not to obscure the description of the invention with unnecessary detail. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[015] While the invention is described in terms of particular examples and illustrative figures, those of ordinary skill in the art will recognize that the invention is not limited to the examples or figures described. Those skilled in the art will recognize that the operations of the various embodiments may be implemented using hardware, software, firmware, or combinations thereof, as appropriate. For example, some processes can be carried out using processors or other digital circuitry under the control of software, firmware, or hard-wired logic. (The term "logic" herein refers to fixed hardware, programmable logic and/or an appropriate combination thereof, as would be recognized by one skilled in the art to carry out the recited functions.) Software and firmware can be stored on computer-readable storage media. Some other processes can be implemented using analog circuitry, as is well known to one of ordinary skill in the art. Additionally, memory or other storage, as well as communication components, may be employed in embodiments of the invention.
[016] Referring now to FIG. 1, a system 100 for automatic data security and loss protection is illustrated, in accordance with some embodiments of the present disclosure. In some embodiments, the system 100 may include a risk mitigation device 102. The system 100 may further include sources 102a of a document 104. The sources 102a may include, but are not limited to, a repository, a mainframe, legacy applications, and web services. The risk mitigation device 102 may be configured to transform, remediate, refine, and validate complex documents (such as documents in Mixed Object Document Content Architecture (MODCA) format) on a private cloud or on-premise. Moreover, in some embodiments, the risk mitigation device 102 may support all types of document formats including MICROSOFT OFFICE document format, text (TXT) format, Revisable Form Text (RFT) document format, and other text formats like MODCA.
[017] The risk mitigation device 102 may read MODCA document formats, and discover and classify a type of sensitive data present in a document (e.g., payment card information, personal data). By way of an example, the risk mitigation device 102 may protect confidential information such as credit card numbers of clients (e.g., clients of financial organizations). Further, the risk mitigation device 102 may ensure that security standards are met while protecting the confidential information. In short, the risk mitigation device 102 provides a pro-active and bi-directional approach to critical information for protecting individuals and businesses from unauthorized information sharing, and may thereby reduce the number of outbound information breaches that organizations may experience.
[018] The risk mitigation device 102 may perform various operations to provide data security and loss protection. As illustrated in FIG. 1, the risk mitigation device 102 may include a transformation module 106, a content extraction module 108, a sensitive data identification module 110, a risk type determination module 112, and a remediation module 114. Further, the risk mitigation device 102 may also include a data store to store data and intermediate results generated by the risk mitigation device 102.
[019] The transformation module 106 may be configured to receive the document 104 having a complex document format. It should be noted that the complex document format may be a Mixed Object Document Content Architecture (MODCA) format. In some embodiments, the document may be at least one of a structured document, an unstructured document, and a semi-structured document. The MODCA format may include a Presentation Text Object Content Architecture (PTOCA) format, an Image Object Content Architecture (IOCA) format, a Graphic Object Content Architecture (GOCA) format, a Bar Code Object Content Architecture (BCOCA) format, and a Font Object Content Architecture (FOCA) format.
[020] Further, the transformation module 106 may be configured to generate a transformed document from the received document 104. The transformed document may have a machine readable format. The machine readable format may be at least one of a Portable Document Format (PDF) and an image. The transformation module 106 is further communicatively coupled to the content extraction module 108.
[021] The content extraction module 108 may be configured to extract content from the transformed document. It should be noted that, for extraction, the content extraction module 108 may use an Optical Character Recognition (OCR) technique. The sensitive data identification module 110 may be configured to identify one of a presence or an absence of sensitive data within the content. The sensitive data may correspond to noncompliant data violating data privacy guidelines. The sensitive data may also correspond to confidential information of individuals or businesses. The confidential information may include Protected Health Information (PHI), Personally Identifiable Information (PII), and confidential information associated with Payment Card Industry Data Security Standards (PCI DSS), Banking, Financial Services and Insurance (BFSI) information, life science information, and healthcare information.
[022] In some embodiments, if an embedded image object is identified in the document, text extraction from the image object is a prerequisite. For this purpose, an OCR service may be leveraged to extract the text. After the text is extracted, the sensitive data discovery process may be followed.
[023] Further, the risk type determination module 112 may be configured for determining a risk type for the sensitive data. For the risk type determination, a rule-based engine may be utilized by the risk type determination module 112. And, rules may be framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor. In some embodiments, a risk value of the sensitive data may be compared with pre-defined threshold values. The risk type may be one of a high risk, a low risk, a medium risk, or a no risk category. The risk type determination module 112 may be further operatively coupled to the remediation module 114.
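The threshold comparison described above can be sketched as follows. This is a minimal illustration in Python and not the disclosed implementation: the risk values, the threshold values, and the names RiskType, RULE_SCORES, THRESHOLDS, and classify_risk are assumptions introduced only for this example.

```python
from enum import Enum


class RiskType(Enum):
    NO_RISK = "no risk"
    LOW = "low risk"
    MEDIUM = "medium risk"
    HIGH = "high risk"


# Hypothetical rule table: one risk value per category of sensitive data.
# In practice the rules would be framed from domain guidelines, data
# privacy guidelines, and associated risk factors.
RULE_SCORES = {
    "card_number": 0.9,
    "health_record": 0.8,
    "email_address": 0.4,
    "postal_code": 0.1,
}

# Hypothetical pre-defined thresholds as (lower bound, risk type),
# checked from the highest bound to the lowest.
THRESHOLDS = [
    (0.7, RiskType.HIGH),
    (0.3, RiskType.MEDIUM),
    (0.0, RiskType.LOW),
]


def classify_risk(category: str) -> RiskType:
    """Map a sensitive-data category to a risk type by threshold comparison."""
    score = RULE_SCORES.get(category)
    if score is None:
        # No rule matched, so the item is treated as not sensitive.
        return RiskType.NO_RISK
    for lower_bound, risk_type in THRESHOLDS:
        if score >= lower_bound:
            return risk_type
    return RiskType.NO_RISK
```

Under these assumed values, classify_risk("card_number") falls in the high risk category, while a category with no matching rule falls in the no risk category.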
[024] The remediation module 114 may be configured to remediate the sensitive data based on the risk type to generate a remediated document. The remediation module 114 may use at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique.
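By way of illustration only, three of the remediation techniques named above may be sketched in Python as shown below. The mapping from risk type to technique, the in-memory token vault, and all function names are assumptions made for this sketch. Automatic encryption is left out because it would require a cryptography library and key management that the description does not specify.

```python
import secrets
from typing import Dict

# In-memory token vault, for illustration only. A real deployment would
# need a secured, access-controlled store.
_TOKEN_VAULT: Dict[str, str] = {}


def redact(value: str) -> str:
    """Redaction: drop the value and leave a fixed marker in its place."""
    return "[REDACTED]"


def mask(value: str, visible: int = 4) -> str:
    """Masking: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return "X" * hidden + value[hidden:]


def tokenize(value: str) -> str:
    """Tokenization: swap the value for a random token kept in the vault."""
    token = "tok_" + secrets.token_hex(8)
    _TOKEN_VAULT[token] = value
    return token


def detokenize(token: str) -> str:
    """Recover the original value for a token issued by tokenize()."""
    return _TOKEN_VAULT[token]


# Hypothetical mapping from risk type to remediation technique.
TECHNIQUE_BY_RISK = {
    "high": redact,
    "medium": tokenize,
    "low": mask,
}


def remediate(text: str, sensitive_value: str, risk: str) -> str:
    """Replace every occurrence of sensitive_value according to its risk type."""
    technique = TECHNIQUE_BY_RISK.get(risk)
    if technique is None:
        # Unknown or "no risk" type: leave the text unchanged.
        return text
    return text.replace(sensitive_value, technique(sensitive_value))
```

The choice of which technique serves which risk type is a policy decision; the table above is only one possible arrangement.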
[025] In some embodiments, the risk mitigation device 102 may further include a validation module (not shown in FIG. 1). The validation module may be configured for generating a final document 116. In other words, a final document 116 may be generated upon a successful validation of the remediated document based on a plurality of predefined validation rules. It should be noted that the validation module may send input to the transformation module 106 or the remediation module 114 until the validation is successful. Thus, transformation and remediation may be performed iteratively by the respective modules, for the document having sensitive data, until the validation is successful.
[026] By way of an example, PCI DSS compliance regulation for validation of MODCA documents and other document formats may include control objectives and corresponding PCI DSS requirements. For a control objective "protect cardholder data", the corresponding PCI DSS requirements may be "protect stored cardholder data" and "encrypt transmission of cardholder data across open, public networks". Further, for a control objective "implement strong access control measures", the corresponding PCI DSS requirements may be "restrict access to cardholder data by business need-to-know", "assign a unique ID to each person with computer access", and "restrict physical access to cardholder data". For a control objective "maintain an information security policy", the corresponding PCI DSS requirement may be "maintain a policy that addresses information security".
[027] It should be noted that the risk mitigation device 102 may be implemented in programmable hardware devices such as programmable gate arrays, programmable array logic, programmable logic devices, or the like. Alternatively, the risk mitigation device 102 may be implemented in software for execution by various types of processors. An identified engine/module of executable code may, for instance, include one or more physical or logical blocks of computer instructions which may, for instance, be organized as a component, module, procedure, function, or other construct. Nevertheless, the executables of an identified engine/module need not be physically located together but may include disparate instructions stored in different locations which, when joined logically together, comprise the identified engine/module and achieve the stated purpose of the identified engine/module. Indeed, an engine or a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
[028] As will be appreciated by one skilled in the art, a variety of processes may be employed for automatic data security and loss protection. For example, the exemplary system 100 and associated risk mitigation device 102 may provide automatic data security and loss protection, by the process discussed herein. In particular, as will be appreciated by those of ordinary skill in the art, control logic and/or automated routines for performing the techniques and steps described herein may be implemented by the system 100 and the associated risk mitigation device 102 either by hardware, software, or combinations of hardware and software. For example, suitable code may be accessed and executed by the one or more processors on the system 100 to perform some or all of the techniques described herein. Similarly, application specific integrated circuits (ASICs) configured to perform some or all the processes described herein may be included in the one or more processors on the system 100.
[029] Referring now to FIG. 2, an exemplary process for automatic data security and loss protection is depicted via a flow diagram 200, in accordance with some embodiments of the present disclosure. Each step of the process may be performed by a risk mitigation device (similar to the risk mitigation device 102). FIG. 2 is explained in conjunction with FIG. 1.
[030] At step 202, a document having a complex document format may be transformed into a transformed document. The transformed document may have a machine readable format. The machine readable format may include at least one of a Portable Document Format (PDF) and an image. The complex document format may include a Mixed Object Document Content Architecture (MODCA) format. It should be noted that a transformation module (similar to the transformation module 106) of the risk mitigation device may be used to transform the document.
[031] The MODCA format may further include a Presentation Text Object Content Architecture (PTOCA) format, an Image Object Content Architecture (IOCA) format, a Graphic Object Content Architecture (GOCA) format, a Bar Code Object Content Architecture (BCOCA) format, and a Font Object Content Architecture (FOCA) format. The document may be a structured document, in accordance with some embodiments of the present invention. In some embodiments, the document may be an unstructured document. In some embodiments, the document may be a semi-structured document.
[032] The document may include, but is not limited to, a MICROSOFT OFFICE document, a text (TXT) document, a Revisable Form Text (RFT) document, and an image document.

[033] At step 204, content from the transformed document may be extracted using a content extraction module (similar to the content extraction module 108). In order to extract the content, in some embodiments, an Optical Character Recognition (OCR) technique may be used. Thereafter, one of a presence or an absence of sensitive data within the content may be identified, at step 206. A sensitive data identification module (similar to the sensitive data identification module 110) may execute this step. Here, the sensitive data may correspond to noncompliant data violating data privacy guidelines. In some embodiments, the sensitive data may correspond to confidential information of individuals or businesses. The confidential information may include, but is not limited to, Protected Health Information (PHI), Personally Identifiable Information (PII), and confidential information associated with Payment Card Industry Data Security Standards (PCI DSS), Banking, Financial Services and Insurance (BFSI) information, life science information, and healthcare information.
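The identification at step 206 can be pictured with a small Python sketch that scans already-extracted text for payment card numbers. The regular expression, the restriction to card numbers, and the names CARD_PATTERN, find_card_numbers, and has_sensitive_data are assumptions for illustration; the description does not specify how detection is performed, and a pattern alone may over-match other long digit sequences.

```python
import re
from typing import List

# Hypothetical pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens, which covers common payment card number layouts.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def find_card_numbers(content: str) -> List[str]:
    """Return every card-like number found in the extracted content."""
    return CARD_PATTERN.findall(content)


def has_sensitive_data(content: str) -> bool:
    """Report presence (True) or absence (False) of card-like numbers."""
    return bool(find_card_numbers(content))
```

In this sketch the OCR step is assumed to have already produced plain text, so the functions operate on a string rather than on a PDF or an image.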
[034] At step 208, a risk type for the sensitive data may be determined upon identification of the presence of the sensitive data, using a risk type determination module (the risk type determination module 112). It should be noted that a rule-based engine may be used to determine the risk type. Here, rules may be framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor. In some embodiments, a risk value of the sensitive data may be compared with pre-defined threshold values. The risk type may include one of a high risk, a low risk, a medium risk, or a no risk category.
[035] At step 210, the sensitive data may be remediated based on the risk type. As a result, a remediated document may be generated. It should be noted that at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique may be used to perform remediation.
[036] Thereafter, at step 212, a final document may be generated upon a successful validation of the remediated document. For validation, a plurality of predefined validation rules may be considered. When the validation is unsuccessful, transformation and remediation for a document having sensitive data may be performed iteratively until the validation is successful.
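The flow of steps 202 to 212, including the repeat-until-valid behaviour, may be sketched in Python as below. The sketch is an assumption-laden outline rather than the disclosed implementation: each step is passed in as a hypothetical callable, the content is handled as plain text, and the max_attempts bound is added here so that the loop always terminates.

```python
from typing import Callable, List


def process_document(
    raw: bytes,
    transform: Callable[[bytes], bytes],
    extract: Callable[[bytes], str],
    identify: Callable[[str], List[str]],
    remediate: Callable[[str, List[str]], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Run transform, extract, identify, remediate, and validate in order,
    repeating the cycle until validation succeeds or attempts run out."""
    for _ in range(max_attempts):
        transformed = transform(raw)                # step 202
        content = extract(transformed)              # step 204
        findings = identify(content)                # step 206
        if not findings:
            # Absence of sensitive data: nothing to remediate.
            return content
        remediated = remediate(content, findings)   # steps 208 and 210
        if validate(remediated):                    # step 212
            return remediated
        # Validation failed: feed the remediated content back for another pass.
        raw = remediated.encode("utf-8")
    raise RuntimeError("validation did not succeed within max_attempts")
```

Passing the steps in as callables keeps the loop independent of any particular OCR engine or rule set, which the description leaves open.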
[037] Referring now to FIG. 3, an exemplary process for automatic transformation and remediation of MODCA documents is depicted via a flow diagram 300, in accordance with some embodiments of the present disclosure. FIG. 3 is explained in conjunction with FIGS. 1-2.
[038] At step 302, a MODCA document (i.e., a document in MODCA format) may be received. The document may be in at least one of a Presentation Text Object Content Architecture (PTOCA) format, an Image Object Content Architecture (IOCA) format, a Graphic Object Content Architecture (GOCA) format, a Bar Code Object Content Architecture (BCOCA) format, and a Font Object Content Architecture (FOCA) format.
[039] After that, at step 304, the MODCA document may be transformed into a transformed file (such as a PDF or an image). At step 306, content from the transformed file may be extracted using an Optical Character Recognition (OCR) technique. At step 308, a sensitive data discovery process may be performed as per compliance requirements (for example, PCI compliance). In some embodiments, sensitive data or unsecured data (that may be exposed data) in the extracted content may be identified based on domain specific data privacy guidelines.
[040] Further, at step 310, PCI non-compliance may be checked. In case of identification of the PCI non-compliance, a non-compliance remediation process may be performed, at step 312a. Otherwise, the process may end, at step 312b. In some embodiments, a validation process may be performed when the PCI non-compliance is not identified.
[041] It should be noted that a rule-based classification of sensitive data/unsecured data may be performed, based on different domains, their guidelines, and associated risk factors. Further, in some embodiments, a risk category of sensitive data/unsecured data may be determined from a plurality of predefined categories (such as a low risk category, a high risk category, and a medium risk category).
[042] Further, in some embodiments, a rule-based remediation process may be performed for sensitive data/unsecured data upon identification of non-compliance through at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique, based on the category of the sensitive data/unsecured data.
[043] Referring now to FIG. 4, an exemplary process for automatic validation of a remediated MODCA document for Payment Card Industry Data Security Standards (PCI DSS) compliance is depicted via a flow diagram 400, in accordance with some embodiments of the present disclosure. FIG. 4 is explained in conjunction with FIG. 3.
[044] At step 402, a transformed and redacted document (for example, a document after remediation through a redaction technique) may be received. At step 404, content from the transformed and redacted document may be extracted using an Optical Character Recognition (OCR) technique. Thereafter, at step 406, a rule-based validation may be performed.
[045] At step 408, it may be checked whether the validation is successful. In case the validation is successful, a validation report may be generated, at step 410. Otherwise, the document may be reprocessed for transformation and remediation, at step 412.
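A rule-based validation of the kind performed at steps 406 to 410 might look like the following Python sketch. The two rules, the ValidationReport structure, and the function name validate_content are hypothetical and chosen only to illustrate checking extracted content against a plurality of predefined rules; they are not taken from the disclosure.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ValidationReport:
    passed: bool
    failed_rules: List[str] = field(default_factory=list)


# Hypothetical validation rules. Each rule returns True when the
# extracted content complies with it.
VALIDATION_RULES: Dict[str, Callable[[str], bool]] = {
    # No run of 13 to 19 digits that still looks like a full card number.
    "no_unmasked_card_number": lambda text: re.search(
        r"\b\d(?:[ -]?\d){12,18}\b", text
    ) is None,
    # No field labelled as a card verification value.
    "no_cvv_label": lambda text: re.search(
        r"\bCVV\b", text, re.IGNORECASE
    ) is None,
}


def validate_content(content: str) -> ValidationReport:
    """Apply every rule and collect the names of the rules that fail."""
    failed = [name for name, rule in VALIDATION_RULES.items() if not rule(content)]
    return ValidationReport(passed=not failed, failed_rules=failed)
```

A report with passed set to False would correspond to the branch at step 412, where the document is reprocessed for transformation and remediation.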
[046] By way of an example, for the redaction process of a card number (for example, a card number "4451-2172-9841-4368"), it may be checked if the card number is identifiable in the extracted content or not. Upon identification of the card number, it may be processed as "4451-XXXX-XXXX-4368". Otherwise, a rule based redaction process may be performed and the output "4451-XXXX-XXXX-4368" may be processed.
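The card number example above can be reproduced with a short Python sketch. The regular expression and the function name redact_card_numbers are assumptions, and the sketch only handles numbers written as four hyphen-separated groups of four digits, as in the example.

```python
import re

# Matches card numbers written as four hyphen-separated groups of four
# digits, as in the example "4451-2172-9841-4368".
CARD_GROUPS = re.compile(r"\b(\d{4})-(\d{4})-(\d{4})-(\d{4})\b")


def redact_card_numbers(content: str) -> str:
    """Keep the first and last groups and replace the middle two with X."""
    return CARD_GROUPS.sub(
        lambda m: f"{m.group(1)}-XXXX-XXXX-{m.group(4)}", content
    )
```

Applied to text containing "4451-2172-9841-4368", the function yields "4451-XXXX-XXXX-4368"; applying it again to the redacted text leaves it unchanged, since the masked groups no longer match the pattern.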
[047] The disclosed methods and systems may be implemented on a conventional or a general-purpose computer system, such as a personal computer (PC) or server computer. Referring now to FIG. 5, an exemplary computing system 500 that may be employed to implement processing functionality for various embodiments (e.g., as a SIMD device, client device, server device, one or more processors, or the like) is illustrated. Those skilled in the relevant art will also recognize how to implement the invention using other computer systems or architectures. The computing system 500 may represent, for example, a user device such as a desktop, a laptop, a mobile phone, personal entertainment device, DVR, and so on, or any other type of special or general-purpose computing device as may be desirable or appropriate for a given application or environment. The computing system 500 may include one or more processors, such as a processor 502 that may be implemented using a general or special purpose processing engine such as, for example, a microprocessor, microcontroller or other control logic. In this example, the processor 502 is connected to a bus 504 or other communication medium. In some embodiments, the processor 502 may be an Artificial Intelligence (AI) processor, which may be implemented as a Tensor Processing Unit (TPU), a graphics processing unit, or a custom programmable solution such as a Field-Programmable Gate Array (FPGA).
[001] The computing system 500 may also include a memory 506 (main memory), for example, Random Access Memory (RAM) or other dynamic memory, for storing information and instructions to be executed by the processor 502. The memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor 502. The computing system 500 may likewise include a read only memory ("ROM") or other static storage device coupled to bus 504 for storing static information and instructions for the processor 502.
[002] The computing system 500 may also include a storage device 508, which may include, for example, a media drive 510 and a removable storage interface. The media drive 510 may include a drive or other mechanism to support fixed or removable storage media, such as a hard disk drive, a floppy disk drive, a magnetic tape drive, an SD card port, a USB port, a micro USB, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive. A storage media 512 may include, for example, a hard disk, magnetic tape, flash drive, or other fixed or removable medium that is read by and written to by the media drive 510. As these examples illustrate, the storage media 512 may include a computer-readable storage medium having stored therein particular computer software or data.
[003] In alternative embodiments, the storage devices 508 may include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into the computing system 500. Such instrumentalities may include, for example, a removable storage unit 514 and a storage unit interface 516, such as a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory module) and memory slot, and other removable storage units and interfaces that allow software and data to be transferred from the removable storage unit 514 to the computing system 500.
[004] The computing system 500 may also include a communications interface 518. The communications interface 518 may be used to allow software and data to be transferred between the computing system 500 and external devices. Examples of the communications interface 518 may include a network interface (such as an Ethernet or other NIC card), a communications port (such as, for example, a USB port or a micro USB port), Near Field Communication (NFC), etc. Software and data transferred via the communications interface 518 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by the communications interface 518. These signals are provided to the communications interface 518 via a channel 520. The channel 520 may carry signals and may be implemented using a wireless medium, wire or cable, fiber optics, or other communications medium. Some examples of the channel 520 may include a phone line, a cellular phone link, an RF link, a Bluetooth link, a network interface, a local or wide area network, and other communications channels.
[005] The computing system 500 may further include Input/Output (I/O) devices 522. Examples may include, but are not limited to a display, keypad, microphone, audio speakers, vibrating motor, LED lights, etc. The I/O devices 522 may receive input from a user and also display an output of the computation performed by the processor 502. In this document, the terms "computer program product" and "computer-readable medium" may be used generally to refer to media such as, for example, the memory 506, the storage devices 508, the removable storage unit 514, or signal(s) on the channel 520. These and other forms of computer-readable media may be involved in providing one or more sequences of one or more instructions to the processor 502 for execution. Such instructions, generally referred to as "computer program code" (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system 500 to perform features or functions of embodiments of the present invention.
[006] In an embodiment where the elements are implemented using software, the software may be stored in a computer-readable medium and loaded into the computing system 500 using, for example, the removable storage unit 514, the media drive 510 or the communications interface 518. The control logic (in this example, software instructions or computer program code), when executed by the processor 502, causes the processor 502 to perform the functions of the invention as described herein.
[048] Thus, the present disclosure may overcome the drawbacks of the traditional systems discussed above. The disclosed method and system may enable enterprises to redact and refine any digital content, such as documents and exclusive formats such as MODCA formats, in cloud or on-premise environments through an intelligent, faster, and reliable approach for data security and loss protection, thereby providing benefits over manual updates. The disclosure provides intelligent classification and extraction, for example, the classification of documents based on type, fields, etc. Information may be extracted from the documents for easy search and organization. Additionally, the disclosure may provide an intelligent validation process.
[049] Moreover, the disclosure may also use a redaction technique for remediation. Thus, the traditional "stop and block" nature of data loss protection solutions may be overcome by the automatic removal of the exact content that breaks the policy, leaving the rest of the communication to continue unhindered and avoiding the delay of valid business communications.
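By way of a non-limiting illustration, the removal of only the policy-violating content may be sketched as follows. This is a minimal sketch and not the disclosed implementation: the pattern names, the regular expressions, and the `redact` helper are assumptions introduced here for illustration, and actual rules would be framed from the applicable guidelines.

```python
import re

# Hypothetical policy patterns (assumptions for illustration only).
POLICY_PATTERNS = {
    "card_number": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str, mask: str = "X") -> str:
    """Replace only the matching spans; the rest of the text is left intact."""
    for pattern in POLICY_PATTERNS.values():
        # Each match is overwritten with mask characters of the same length.
        text = pattern.sub(lambda m: mask * len(m.group(0)), text)
    return text


message = "Invoice approved. Card 4111 1111 1111 1111 on file; ship by Friday."
print(redact(message))
```

In this sketch, only the matched card number is overwritten, so the remainder of the message may continue to its recipient without being blocked.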
[050] Further, the disclosure helps in removing potentially confidential content from documents which may already have been stored on an enterprise content management system, meant for inbound and outbound communication, enabling the secure flow of business. Also, for many, the ever-increasing challenge of receiving documents with embedded Persistent Threats (PTs) may be easily overcome by removing active content from all received documents. The required information gets through unhindered, while the malware may be blocked.
[051] The risk mitigation device explained in the present disclosure may be compliant with privacy policies like Protected Health Information (PHI), Payment Card Industry Data Security Standards (PCI DSS), and Personal Identifiable Information (PII) regulations. It may also be leveraged across various verticals, such as BFSI, Life Sciences & Healthcare, and Government (Legal / Justice etc.).
[052] The disclosure provides a reusable solution for content management and other digital initiative projects. The risk mitigation device 102 may be reused in multiple projects looking for a low-cost (i.e., open source) document content redaction, refinement, and validation solution, and may be offered as a service to multiple clients.
[053] The disclosure may provide various advantages, including reduced redaction cost as compared to tools available in the market, reduced effort through automated validation, higher throughput through batch processing, increased control effectiveness in meeting compliance goals, and reduced risk, as automated validation is less error prone.

[054] It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processors or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.
[055] Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention.
[056] Furthermore, although individually listed, a plurality of means, elements or process steps may be implemented by, for example, a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.

CLAIMS
What is claimed is:
1. A method (200) for automatic data security and loss protection, the method (200) comprising:
transforming (202), by a risk mitigation device (102), a document (104) having a complex document format into a transformed document having a machine readable format, wherein the machine readable format comprises at least one of a Portable Document Format (PDF) and an image, and wherein the complex document format comprises a Mixed Object Document Content Architecture (MODCA) format;
extracting (204), by the risk mitigation device (102), content from the transformed document using an Optical Character Recognition (OCR) technique;
identifying (206), by the risk mitigation device (102), one of a presence or an absence of sensitive data within the content, wherein the sensitive data corresponds to noncompliant data violating data privacy guidelines;
determining (208), by the risk mitigation device (102) and upon identifying the presence of the sensitive data, a risk type for the sensitive data using a rule-based engine, wherein rules are framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor;
remediating (210), by the risk mitigation device (102), the sensitive data based on the risk type to generate a remediated document, by applying at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique; and
generating (212), by the risk mitigation device (102), a final document (116) upon a successful validation of the remediated document based on a plurality of predefined validation rules.
2. The method (200) of claim 1, wherein the document is at least one of a structured document, an unstructured document, and a semi-structured document, and wherein the document comprises at least one of a MICROSOFT OFFICE document, a text (TXT) document, a Revisable Form Text (RFT) document, and an image document.

3. The method (200) of claim 1, wherein the MODCA format further comprises a Presentation Text Object Content Architecture (PTOCA) format, an Image Object Content Architecture (IOCA) format, a Graphic Object Content Architecture (GOCA) format, a Bar Code Object Content Architecture (BCOCA) format, and a Font Object Content Architecture (FOCA) format.
4. The method (200) of claim 1, wherein the sensitive data corresponds to confidential information of individuals or businesses, and wherein the confidential information comprises Protected Health Information (PHI), Personal Identifiable Information (PII), and confidential information associated with Payment Card Industry Data Security Standards (PCI DSS), a Banking Financial Services and Insurance (BFSI) information, a life science information, and a healthcare information.
5. The method (200) of claim 1, wherein determining the risk type further comprises comparing a risk value of the sensitive data with pre-defined threshold values, wherein the risk type comprises one of a high risk, a low risk, a medium risk, or a no risk category.
6. The method (200) of claim 1, further comprising iteratively performing transformation and remediation for a document having sensitive data, until the validation is successful.
7. The method (200) of claim 1, wherein the data privacy guideline is associated with at least one of a General Data Protection Regulation (GDPR), a Health Insurance Portability and Accountability Act (HIPAA), a California Consumer Privacy Act (CCPA), a Payment Card Industry Data Security Standards (PCI DSS), a PHI, a PII, and a BFSI.
8. A system (100) for automatic data security and loss protection, the system (100) comprising:
a processor; and

a memory communicatively coupled to the processor, wherein the memory stores processor-executable instructions, which, on execution, cause the processor to:
transform (202) a document having a complex document format into a transformed document having a machine readable format, wherein the machine readable format comprises at least one of a Portable Document Format (PDF) and an image, and wherein the complex document format comprises a Mixed Object Document Content Architecture (MODCA) format;
extract (204) content from the transformed document using an Optical Character Recognition (OCR) technique;
identify (206) one of a presence or an absence of sensitive data within the content, wherein the sensitive data corresponds to noncompliant data violating data privacy guidelines;
determine (208), upon identifying the presence of the sensitive data, a risk type for the sensitive data using a rule-based engine, wherein rules are framed based on at least one of a domain guideline, a data privacy guideline, and an associated risk factor;
remediate (210) the sensitive data based on the risk type to generate a remediated document, by applying at least one of a redaction technique, a masking technique, a tokenization technique, and an automatic encryption technique; and
generate (212) a final document upon a successful validation of the remediated document based on a plurality of predefined validation rules.
9. The system (100) of claim 8, wherein the processor-executable instructions further cause the processor to determine the risk type further by comparing a risk value of the sensitive data with pre-defined threshold values, wherein the risk type comprises one of a high risk, a low risk, a medium risk, or a no risk category.
10. The system (100) of claim 8, wherein the processor-executable instructions further cause the processor to iteratively perform transformation and remediation for a document having sensitive data, until the validation is successful.
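By way of a non-limiting illustration only, the threshold comparison of claims 5 and 9 and the iterative processing of claims 6 and 10 may be sketched as follows. The threshold values, the function names, the `max_rounds` limit, and the toy remediation and validation callables are all assumptions introduced here; the claims do not specify any of them.

```python
import re
from typing import Callable


def risk_type(risk_value: float, low: float = 0.1,
              medium: float = 0.4, high: float = 0.7) -> str:
    """Map a risk value to a category using hypothetical thresholds."""
    if risk_value >= high:
        return "high"
    if risk_value >= medium:
        return "medium"
    if risk_value >= low:
        return "low"
    return "no risk"


def remediate_until_valid(document: str,
                          remediate: Callable[[str], str],
                          validate: Callable[[str], bool],
                          max_rounds: int = 5) -> str:
    """Repeat remediation until validation succeeds (bounded, as an assumption)."""
    for _ in range(max_rounds):
        document = remediate(document)
        if validate(document):
            return document
    raise RuntimeError("validation did not succeed within max_rounds")


# Toy callables: mask one run of four or more digits per pass, then re-check.
def one_pass(doc: str) -> str:
    return re.sub(r"\d{4,}", "####", doc, count=1)


def is_clean(doc: str) -> bool:
    return re.search(r"\d{4,}", doc) is None


final = remediate_until_valid("acct 12345 and 67890", one_pass, is_clean)
print(risk_type(0.5), "|", final)
```

The bounded loop is a design choice of this sketch, so that a document that cannot be validated is surfaced as an error rather than processed indefinitely.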

Documents

Application Documents

#	Name	Date
1	202111055501-STATEMENT OF UNDERTAKING (FORM 3) [30-11-2021(online)].pdf	2021-11-30
2	202111055501-REQUEST FOR EXAMINATION (FORM-18) [30-11-2021(online)].pdf	2021-11-30
3	202111055501-REQUEST FOR EARLY PUBLICATION(FORM-9) [30-11-2021(online)].pdf	2021-11-30
4	202111055501-PROOF OF RIGHT [30-11-2021(online)].pdf	2021-11-30
5	202111055501-POWER OF AUTHORITY [30-11-2021(online)].pdf	2021-11-30
6	202111055501-FORM-9 [30-11-2021(online)].pdf	2021-11-30
7	202111055501-FORM 18 [30-11-2021(online)].pdf	2021-11-30
8	202111055501-FORM 1 [30-11-2021(online)].pdf	2021-11-30
9	202111055501-DRAWINGS [30-11-2021(online)].pdf	2021-11-30
10	202111055501-FIGURE OF ABSTRACT [30-11-2021(online)].jpg	2021-11-30
11	202111055501-DECLARATION OF INVENTORSHIP (FORM 5) [30-11-2021(online)].pdf	2021-11-30
12	202111055501-COMPLETE SPECIFICATION [30-11-2021(online)].pdf	2021-11-30
13	202111055501-FER.pdf	2022-05-30
14	202111055501-OTHERS [30-11-2022(online)].pdf	2022-11-30
15	202111055501-FER_SER_REPLY [30-11-2022(online)].pdf	2022-11-30
16	202111055501-CORRESPONDENCE [30-11-2022(online)].pdf	2022-11-30
17	202111055501-COMPLETE SPECIFICATION [30-11-2022(online)].pdf	2022-11-30
18	202111055501-CLAIMS [30-11-2022(online)].pdf	2022-11-30
19	202111055501-US(14)-HearingNotice-(HearingDate-27-11-2025).pdf	2025-11-12
20	202111055501-FORM-26 [24-11-2025(online)].pdf	2025-11-24
21	202111055501-Correspondence to notify the Controller [24-11-2025(online)].pdf	2025-11-24
22	202111055501-US(14)-ExtendedHearingNotice-(HearingDate-02-12-2025)-1130.pdf	2025-11-25

Search Strategy

1	SearchhistoryE_30-05-2022.pdf
2	202111055501_SearchStrategyAmended_E_search_strategyAE_29-08-2025.pdf