Abstract: Nowadays, platforms are advocating the use of passphrases with the aim of providing a more secure yet memorable form of authentication, as passphrases offer ease of remembering and improved adaptability to password policies without compromising usability. Existing password detection methods fail to detect passphrases due to the distinct nature of passphrases, as they involve the use of multiple words, symbols, numbers, and special characters. The present disclosure provides a method and a system for detecting passphrases in plain text. The system first receives a plurality of files. Then, the system filters the files based on file attributes to obtain potential files. Thereafter, a sensitivity analysis of each potential file is performed based on sensitivity indicators to obtain a sensitivity score for the potential file. Further, the system generates a set of context from text present in each potential file. Finally, the system utilizes the set of context and the sensitivity score of each potential file to identify a set of potential passphrases present in the text using a pre-trained machine learning based language model. [To be published with FIG. 3]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR DETECTING PASSPHRASES IN PLAIN TEXT
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001]
The disclosure herein generally relates to text processing, and, more particularly, to a method and a system for detecting passphrases in plain text.
BACKGROUND
[002]
In today’s digital era, passwords play a crucial role as they are used as fundamental authentication mechanisms for numerous services provided by a multitude of service providers. The service providers generally ask users to create strong and complex passwords for security purposes. However, users often face the challenge of remembering unique passwords for each service they use.
[003]
To overcome this issue, the service providers are now encouraging users to use passphrases instead of passwords. A passphrase is a combination of letters, words, numbers, and symbols. The passphrase is designed to strike a balance between memorability and security.
[004]
However, as individuals frequently engage with services offered by multiple vendors, the challenge of remembering multiple passwords/passphrases still exists. In an attempt to address this memorization burden, the users commonly resort to storing their passwords or passphrases in plaintext. However, leaving these sensitive credentials unprotected poses a significant risk. In the event of a security breach, where an attacker gains access to plaintext passwords/passphrases, the potential consequences extend beyond financial losses. Hence, it becomes imperative to identify and secure passwords or passphrases stored in plaintext by implementing appropriate security measures.
[005]
Currently, various approaches are available for detecting passwords in plaintext. However, methodologies devised for password detection are not directly applicable to passphrase detection due to the distinct nature of passphrases, as they involve the use of multiple words, symbols, numbers, and special characters.
SUMMARY
[006]
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a method for detecting passphrases in plain text. The method comprises receiving, by a system via one or more hardware processors, a plurality of files present in a user device, wherein the user device is associated with a user; filtering, by the system via the one or more hardware processors, the plurality of files based on a set of file attributes to obtain a set of potential files; performing, by the system via the one or more hardware processors, a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators, wherein a sensitivity score is assigned to each of the potential files from the set of potential files based on the sensitivity analysis; generating, by the system via the one or more hardware processors, a set of context from text present in each potential file of the set of potential files using a constituency tree based technique; and identifying, by the system via the one or more hardware processors, a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using a pre-trained machine learning based language model.
[007]
In an embodiment, the method comprises: determining, by the system via the one or more hardware processors, whether the user is a valid user using an authentication mechanism; and displaying, by the system via the one or more hardware processors, the set of potential passphrases upon determining that the user is the valid user.
[008]
In an embodiment, upon determining that the user is an invalid user, masking, by the system via the one or more hardware processors, each potential passphrase present in the text of each potential file; and displaying, by the system via the one or more hardware processors, masked potential passphrases on the user device.
[009]
In an embodiment, the method comprises: providing, by the system via the one or more hardware processors, an explanation for each potential passphrase of the set of potential passphrases; evaluating, by the system via the one or more hardware processors, a strength of each potential passphrase of the set of potential passphrases based on a predefined set of criteria; and displaying, by the system via the one or more hardware processors, the strength of each potential passphrase along with the associated potential passphrase.
[010]
In an embodiment, the method comprises: receiving, by the system via the one or more hardware processors, at least one feedback and at least one comment on one or more potential passphrases present in the set of potential passphrases; and storing, by the system via the one or more hardware processors, the at least one feedback and the at least one comment in a user feedback store.
[011]
In an embodiment, the method comprises: fine-tuning, by the system via the one or more hardware processors, the pre-trained machine learning based language model based on a plurality of feedbacks and comments present in the user feedback store using a fine-tuning scheduler, wherein the fine-tuning scheduler follows an iterative process in which one or more parameters and one or more hyperparameters of the pre-trained machine learning based language model are updated in each iteration until the pre-trained machine learning based language model accurately identifies the set of potential passphrases.
[012]
In an embodiment, the constituency tree based technique comprises: training, by the system via the one or more hardware processors, a syntactic embedding model based on one or more constituency parse trees present in a passphrase constituency tree database using a graph embedding algorithm; applying, by the system via the one or more hardware processors, a sliding window technique on the text present in each potential file based on a pre-defined window size to obtain one or more types of content present in the text, wherein a plurality of text windows are created for the text present in each potential file based on the pre-defined window size, and wherein the type of content present in each text window is obtained using the sliding window technique; assigning, by the system via the one or more hardware processors, a plurality of part-of-speech (POS) tags to a window text present in each text window; identifying, by the system via the one or more hardware processors, one or more matches between the window text and one or more POS patterns present in a passphrase pattern database based on the plurality of assigned POS tags; storing, by the system via the one or more hardware processors, the identified one or more matches in a phrase list, wherein the phrase list comprises one or more phrases; identifying, by the system via the one or more hardware processors, a context window for each phrase in the phrase list; creating, by the system via the one or more hardware processors, a constituency tree for at least one context window matching with a predefined set of context windows; for each created constituency tree, computing, by the system via the one or more hardware processors, an embedding for an associated constituency tree using the trained syntactic embedding model; comparing, by the system via the one or more hardware processors, the embedding of each constituency tree with an embedding of a phrase constituency tree created for each phrase in the phrase list, wherein a similarity score is obtained for each comparison; for each similarity score, determining, by the system via the one or more hardware processors, whether the associated similarity score is greater than a predefined similarity score threshold; and for each phrase whose similarity score is found to be greater than the predefined similarity score threshold, appending, by the system via the one or more hardware processors, the associated phrase and the context window to the set of context.
[013]
In another aspect, there is provided a system for detecting passphrases in plain text. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a plurality of files present in a user device, wherein the user device is associated with a user; filter the plurality of files based on a set of file attributes to obtain a set of potential files; perform a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators, wherein a sensitivity score is assigned to each of the potential files from the set of potential files based on the sensitivity analysis; generate a set of context from text present in each potential file of the set of potential files using a constituency tree based technique; and identify a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using a pre-trained machine learning based language model.
[014]
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors perform passphrase detection in plain text by receiving, by a system via one or more hardware processors, a plurality of files present in a user device, wherein the user device is associated with a user; filtering, by the system via the one or more hardware processors, the plurality of files based on a set of file attributes to obtain a set of potential files; performing, by the system via the one or more hardware processors, a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators, wherein a sensitivity score is assigned to each of the potential files from the set of potential files based on the sensitivity analysis; generating, by the system via the one or more hardware processors, a set of context from text present in each potential file of the set of potential files using a constituency tree based technique; and identifying, by the system via the one or more hardware processors, a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using a pre-trained machine learning based language model.
[015]
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[016]
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[017]
FIG. 1 is an example representation of an environment, related to at least some example embodiments of the present disclosure.
[018]
FIG. 2 illustrates an exemplary block diagram of a system for detecting passphrases in plain text, in accordance with an embodiment of the present disclosure.
[019]
FIG. 3 illustrates a schematic block diagram representation of the processors associated with the system of FIG. 2 for detecting passphrases in the plain text, in accordance with an embodiment of the present disclosure.
[020]
FIG. 4 illustrates a schematic block diagram representation of a passphrase detector module associated with the system of FIG. 2 and FIG. 1 for generating the set of context from text present in each potential file, in accordance with an embodiment of the present disclosure.
[021]
FIG. 5 illustrates an exemplary flow diagram of a method for detecting passphrases in plain text, in accordance with an embodiment of the present disclosure.
[022]
FIG. 6 illustrates a schematic representation of a pre-trained machine learning based model, in accordance with an embodiment of the present disclosure.
[023]
FIG. 7 illustrates a schematic representation showing creation of a constituency tree database, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[024]
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[025]
Passwords have become ubiquitous in the digital age, as they serve as a fundamental method for secure access to various online platforms, devices, and services. Whether we are logging into an email account, accessing social media profiles, or conducting online banking transactions, passwords play a crucial role in verifying our identity.
[026]
If the attackers are successful in acquiring user credentials/valid passwords, they might gain access to secure systems and may escalate their access privileges to an administrator or superuser level. With the widespread use of numerous services from various providers, individuals, both for personal use and within the organizations, face the challenge of creating and remembering distinct passwords for each service. Consequently, users and employees often resort to storing their credentials in plaintext on their respective systems. Over time, these stored credentials may be forgotten and left without appropriate care. Any compromise of such information can result in significant financial and reputational losses for individuals or organizations. Therefore, the critical task of detecting plaintext-stored credentials becomes imperative.
[027]
Further, as per the National Institute of Standards and Technology (NIST) guidelines, users are recommended to have a password of a minimum length of 8 characters, incorporating at least one uppercase letter, lowercase letter, numeric digit, and special character. However, with the increasing number of services utilized by individuals today, remembering complex passwords for each service/platform that they use can be daunting. Despite their complexity, traditional passwords remain susceptible to dictionary attacks. So, after recognizing these vulnerabilities, the NIST guidelines from 2017 advocate the use of passphrases as an alternative. The passphrases consist of longer sequences of words or a combination of words and characters, aiming to provide a more secure yet memorable form of authentication. They generally offer the advantages of ease of remembering, better alignment with human cognition, higher entropy, improved adaptability to password policies without compromising usability, and the like.
[028]
However, despite the many advantages, the passphrases present challenges when it comes to detecting them in plaintext. Detecting a passphrase becomes notably complex when it includes whitespace (" ") characters.
[029]
For instance, discerning the passphrase ‘ants are awesome!’ proves more intricate than ‘ants$are$awesome!’. The latter passphrase can be treated as a single word by existing password detection methods, and based on the entropy of a word, the existing password detection methods can identify it as a password or passphrase. However, the former passphrase i.e., ‘ants are awesome!’ comprises multiple words separated by whitespace, making it impossible to treat the entire passphrase as a single word. The existing password detection methods fail to detect passphrases in such cases.
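For illustration only, the limitation can be reproduced with a minimal, self-contained Python sketch (not part of the disclosed method): a conventional detector that tokenizes on whitespace and flags high-entropy tokens sees ‘ants$are$awesome!’ as a single candidate but splits ‘ants are awesome!’ into ordinary low-entropy English words:

import math
from collections import Counter

def shannon_entropy(token: str) -> float:
    # Shannon entropy in bits per character of a single token.
    counts = Counter(token)
    n = len(token)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

for line in ["ants$are$awesome!", "ants are awesome!"]:
    for token in line.split():
        print(f"{token!r}: {shannon_entropy(token):.2f} bits/char")
# The '$'-joined variant survives as one token, while the whitespace variant
# is broken into short common words that evade a per-token entropy check.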
[030]
The passphrases can consist of varying word counts and may closely resemble normal English words, which makes it nearly impossible to devise a generic strategy that is capable of identifying sets of words as part of a single passphrase. Further, detection of such passphrases without any additional context can result in an alarmingly high number of false positives.
[031]
Additionally, as the passphrases are designed for easy memorization, they often contain words from daily activities i.e., widely used and familiar terms, and personal information. These words, in isolation, are not inherently linked to sensitive information. However, in a certain context, they may carry sensitive information.
[032]
So, techniques that can efficiently detect passphrases with different variations, including passphrases containing whitespace in plain text, are still to be explored.
[033]
Embodiments of the present disclosure overcome the above-mentioned disadvantages by providing a method and a system for detecting passphrases in plain text. The system of the present disclosure first receives a plurality of files that are present in a user device. Then, the system filters the plurality of files based on a set of file attributes to obtain a set of potential files. Thereafter, the sensitivity analysis of each potential file is performed based on one or more sensitivity indicators to obtain a sensitivity score for each potential file. Further, the system generates a set of context from text present in each potential file using a constituency tree based technique. Finally, the system utilizes the generated set of context along with the sensitivity score of each potential file to identify a set of potential passphrases present in the text using a pre-trained machine learning based language model.
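For illustration only and without limiting the scope of the invention, the overall pipeline may be sketched in Python as below; every helper name (filter_files, score_sensitivity, build_context, detect_passphrases) is a hypothetical stand-in for a module described in the following paragraphs:

from pathlib import Path

def filter_files(files):                  # stands in for file-attribute filtering
    return [f for f in files if f.suffix.lower() == ".txt"]

def score_sensitivity(path):              # stands in for the sensitivity analysis
    return 0.5

def build_context(text):                  # stands in for context generation
    return [text]

def detect_passphrases(contexts, score):  # stands in for the language model
    return []

def find_passphrases(files):
    results = {}
    for f in filter_files(files):
        contexts = build_context(f.read_text(errors="ignore"))
        results[f] = detect_passphrases(contexts, score_sensitivity(f))
    return results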
[034]
In the present disclosure, the system and the method use the constituency tree based technique in which linguistic features, such as Part-of-speech (POS) tags and constituency trees, are used for extracting potential passphrases and for building relevant context surrounding the potential passphrases from the plaintext, thereby ensuring accurate identification of passphrases containing whitespace characters. The system and the method first perform filtering based on certain file attributes to eliminate irrelevant files, thereby ensuring reduced computational overhead while ensuring a usable system. Further, the system uses the passphrase patterns database in which the unique POS patterns are categorized into a plurality of categories. The categorization of the unique POS patterns reduces the time consumed for performing the passphrase detection and also ensures efficient searching of relevant text in a file.
[035]
Referring now to the drawings, and more particularly to FIGS. 1 through 7, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[036]
FIG. 1 illustrates an exemplary representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, performing sensitivity analysis, generating a set of context, identifying a set of potential passphrases, etc. The environment 100 generally includes a system 102, an electronic device 106 (hereinafter also referred to as a user device 106), each coupled to, and in communication with (and/or with access to) a network 104. It should be noted that one user device is shown for the sake of explanation; there can be any number of user devices.
[037]
The network 104 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1, or any combination thereof.
[038]
Various entities in the environment 100 may connect to the network 104 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, or any combination thereof.
[039]
The user device 106 is associated with a user (e.g., a general computer user/an employee in a government or private organization) who prefers to store passwords/passphrases in a text file for memorization purposes. Examples of the user device 106 include, but are not limited to, a personal computer (PC), a mobile phone, a tablet device, a Personal Digital Assistant (PDA), a server, a voice activated assistant, a smartphone, and a laptop.
[040]
The system 102 includes one or more hardware processors and a memory. The system 102 is first configured to receive a plurality of files via the network 104 from the user device 106. The system 102 then filters the plurality of files based on a set of file attributes to obtain a set of potential files. Thereafter, the system 102 performs a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators using a neural network based model. The neural network based model assigns a sensitivity score to each potential file based on the sensitivity analysis of a respective potential file.
[041]
Further, the system 102 generates a set of context from text present in each potential file of the set of potential files using a constituency tree based technique. In an embodiment, the system 102 may use a contextual entropy based technique for generating the set of context. In another embodiment, the system 102 may use a perplexity based technique for generating the set of context. In yet another embodiment, the system 102 may use a chunk summary based technique for generating the set of context. The constituency tree based technique, the contextual entropy based technique, the perplexity based technique and the chunk summary based technique are explained in detail with reference to FIG. 4.
[042]
Once the set of context and the sensitivity score of each potential file are available, the system 102 identifies a set of potential passphrases present in the text using a pre-trained machine learning based language model, based on the set of context and the assigned sensitivity score of each potential file.
[043]
The process of detecting passphrases in plain text is explained in detail with reference to FIG. 4.
[044]
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. Additionally, or alternatively, a set of systems (e.g., one or more systems) or a set of devices (e.g., one or more devices) of the environment 100 may perform one or more functions described as being performed by another set of systems or another set of devices of the environment 100 (e.g., refer scenarios described above).
[045]
FIG. 2 illustrates an exemplary block diagram of the system 102 for detecting passphrases in plain text, in accordance with an embodiment of the present disclosure. In some embodiments, the system 102 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture. In some embodiments, the system 102 may be implemented in a server system. In some embodiments, the system 102 may be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, and the like.
[046]
In an embodiment, the system 102 includes one or more processors 204, communication interface device(s) or input/output (I/O) interface(s) 206, and one or more data storage devices or memory 202 operatively coupled to the one or more processors 204. The one or more processors 204 may be one or more software processing modules and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 102 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[047]
The I/O interface device(s) 206 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[048]
The memory 202 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 208 can be stored in the memory 202, wherein the database 208 may comprise, but is not limited to, a set of file attributes, the constituency tree based technique, an authentication mechanism, the pre-trained machine learning based language model, a syntactic embedding model, a sliding window technique, a passphrase pattern database, a passphrase constituency tree database, a predefined similarity score threshold, a user feedback store, one or more processes and the like. The memory 202 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 202 and can be utilized in further processing and analysis.
[049]
It is noted that the system 102 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the system 102 may include fewer or more components than those depicted in FIG. 2.
[050]
FIG. 3, with reference to FIGS. 1 and 2, illustrates a schematic block diagram representation 300 of the processors 204 associated with the system 102 of FIG. 2 for detecting passphrases in the plain text, in accordance with an embodiment of the present disclosure.
[051]
In one embodiment, the one or more processors 204 includes a receiving module 302, a filtering module 304, a file sensitivity predictor module 306, a passphrase detector module 308, a passphrase explanation module 310, a passphrase strength indicator module 312, a leakage prevention module 314, and a fine-tuning scheduler 316.
[052]
The receiving module 302 includes suitable logic and/or interfaces for receiving a plurality of files that are present in a user device, such as the user device 106 associated with a user.
[053]
The filtering module 304 is in communication with the receiving module 302. The filtering module 304 includes suitable logic and/or interfaces for receiving the plurality of files received by the receiving module 302. In an embodiment, the filtering module 304 is configured to filter the plurality of files based on a set of file attributes to obtain a set of potential files. In at least one example embodiment, the set of file attributes refer to attributes that may help in filtering non-essential files among the plurality of files. For instance, certain files of the plurality of files are seldom accessed by users, such as system registry files, virtual memory files (Pagefile.sys), and hibernation files (Hiberfil.sys) in Windows operating systems. Similarly, in Linux and macOS, files like system-wide configuration files, system binaries and admin commands, and system and application logs are typically untouched by users. So, the chances of a user storing a password in system files are negligible. Hence, the system files can be ignored. Further, passwords or passphrases are generally not stored in files exceeding a few kilobytes or megabytes in size. So, the size of the file can also be a file attribute in determining potential files where passphrases can be stored. Similarly, there are many other file attributes that are utilized by the filtering module 304 to obtain the set of potential files.
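For illustration only, an attribute-based filter may be sketched as below; the path prefixes, the permitted extensions, and the 5 MB size cap are illustrative assumptions rather than values fixed by the disclosure:

from pathlib import Path

SYSTEM_PREFIXES = ("/proc", "/sys", "/usr/bin", "/var/log", "C:\\Windows")
PLAIN_TEXT_SUFFIXES = {".txt", ".md", ".csv", ".log", ""}
MAX_SIZE_BYTES = 5 * 1024 * 1024   # illustrative size cap

def is_potential_file(path: Path) -> bool:
    if str(path).startswith(SYSTEM_PREFIXES):
        return False                                    # system files are rarely user-edited
    if path.suffix.lower() not in PLAIN_TEXT_SUFFIXES:
        return False                                    # keep plain-text formats only
    try:
        return path.stat().st_size <= MAX_SIZE_BYTES    # skip files too large for credentials
    except OSError:
        return False

def filter_files(files):
    return [f for f in files if is_potential_file(f)]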
[054]
The filtering of the unnecessary files from the passphrase detection analysis significantly reduces computational overhead of the system 102 while ensuring a usable system.
[055]
The file sensitivity predictor module 306 is in communication with the filtering module 304. The file sensitivity predictor module 306 is configured to perform a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators. In an embodiment, the one or more sensitivity indicators are external attributes or context that offer additional insights into a file, such as file location and access patterns. In particular, the external attributes or context aid the user and the operating system in comprehending the file's purpose, usage, or significance. For instance, the directory or folder where a file resides is part of its external context, providing insights into the file's organizational structure, purpose, or relationships with other files. In at least one example embodiment, the file sensitivity predictor module 306 uses trained deep learning models or machine learning models or language models to recognize locations in the file system that might contain sensitive information. Examples of the model include, but are not limited to, neural networks, deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), Long Short-Term Memory (LSTM) networks, Transformers and their derivative architectures, and Generative Adversarial Networks (GANs).
[056]
The file sensitivity predictor module 306 also assigns a sensitivity score to each potential file based on the sensitivity analysis of a respective potential file.
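For illustration only, a hand-weighted stand-in for the trained sensitivity model may be sketched as below; the indicator features and weights are assumptions chosen for the sketch, not parameters of the trained models named above:

import time
from pathlib import Path

SENSITIVE_DIR_HINTS = ("desktop", "documents", "notes", "backup")
SENSITIVE_NAME_HINTS = ("password", "passphrase", "credential", "login", "secret")

def sensitivity_score(path: Path) -> float:
    score = 0.0
    if any(h in str(path.parent).lower() for h in SENSITIVE_DIR_HINTS):
        score += 0.3                                    # file location indicator
    if any(h in path.name.lower() for h in SENSITIVE_NAME_HINTS):
        score += 0.5                                    # file name indicator
    days_since_access = (time.time() - path.stat().st_atime) / 86400
    if days_since_access < 30:
        score += 0.2                                    # recent access pattern indicator
    return min(score, 1.0)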
[057]
The passphrase detector module 308 is in communication with the file sensitivity predictor module 306. The passphrase detector module 308 is configured to detect singular as well as multiple occurrences of passphrases present within the set of potential files based on the file content and the sensitivity scores provided by the file sensitivity predictor module 306 using a pre-trained machine learning based language model. The passphrase detector module 308 is explained in detail with reference to FIG. 4.
[058]
The passphrase explanation module 310 is in communication with the passphrase detector module 308. The passphrase explanation module 310 is configured to recognize all potential passphrases present in each potential file. The passphrase explanation module 310 is also configured to provide an explanation of why a particular string or phrase is identified as a passphrase. In one embodiment, the explanation provided by the passphrase explanation module 310 enhances the intuitiveness of the system 102 as it enables the user to analyze the behavior of the machine learning based language model and provide feedback in case of any unexpected outcomes.
[059]
The passphrase strength indicator module 312 is in communication with the passphrase detector module 308. The passphrase strength indicator module 312 is configured to evaluate the strength of each potential passphrase based on various criteria, such as length, complexity, and uniqueness. The passphrase strength indicator module 312 is also configured to provide feedback to the user about the strength of the potential passphrase in the form of a visual indicator/message.
[060]
In an embodiment, the passphrase strength indicator module 312 may use a color coding scheme (e.g., red for weak, yellow for medium, green for strong) to convey the strength of the detected passphrase to the user. In another embodiment, the passphrase strength indicator module 312 may use a textual feedback (e.g., "weak," "medium," "strong") to convey the strength of the detected passphrase to the user. The color coding scheme/textual feedback may help the user in understanding the requirements for creating a secure passphrase.
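For illustration only, a minimal strength heuristic over the criteria named above (length, complexity, and uniqueness) may be sketched as below; the thresholds are illustrative assumptions:

import re

def passphrase_strength(phrase: str) -> str:
    words = phrase.split()
    score = 0
    score += len(phrase) >= 16                           # length
    score += bool(re.search(r"[0-9]", phrase))           # complexity: digits
    score += bool(re.search(r"[^A-Za-z0-9 ]", phrase))   # complexity: symbols
    score += len(set(w.lower() for w in words)) >= 4     # uniqueness of words
    return ["weak", "weak", "medium", "medium", "strong"][score]

print(passphrase_strength("ants are awesome!"))   # -> medium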
[061]
The leakage prevention module 314 is in communication with the passphrase detector module 308. The leakage prevention module 314 is configured to mask the potential passphrases present in the text of each potential file to reduce the risk of unauthorized access in the event of a security breach. In an embodiment, the leakage prevention module 314 ensures the detected passphrases are obscured and are visible only upon the provision of appropriate credentials. In particular, the passphrases that are vital for accessing various services are protected from unauthorized access by masking each potential passphrase present in the text of each potential file. The leakage prevention module 314 ensures that the set of potential passphrases are displayed only upon determining that the user is the valid user, which further ensures that a balance is present between robust security measures and user-friendly sensitive file access.
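For illustration only, the masking step may be sketched as below, replacing every detected passphrase occurrence in the file text with a fixed-width mask before display:

def mask_passphrases(text: str, passphrases: list) -> str:
    for p in sorted(passphrases, key=len, reverse=True):   # mask longest matches first
        text = text.replace(p, "*" * 8)
    return text

text = "github login\nuser: alice\npass: ants are awesome!"
print(mask_passphrases(text, ["ants are awesome!"]))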
[062]
The fine-tuning scheduler 316 is in communication with the passphrase detector module 308 and the user feedback store.
[063]
In an embodiment, the user feedback store serves as a centralized hub where valuable insights and comments from the users are systematically stored. The user feedback store is integrated with the fine-tuning scheduler 316 so that the fine-tuning scheduler 316 can refine and enhance the performance of the machine learning based language model through a continuous feedback loop. In particular, the user feedback store acts as a reservoir for a diverse range of user inputs, such as comments, suggestions, and observations related to the behavior, accuracy, and overall user experience of the system 102. In at least one example embodiment, the user feedback store is also configured to categorize and analyze user feedback systematically using advanced analytics and natural language processing techniques which are then used by the system 102 for identifying patterns and recurring themes, and for extracting valuable insights from user-generated content stored in the user feedback store.
[064]
The fine-tuning scheduler 316 is configured to fine-tune the pre-trained machine learning based language model based on feedback and comments present in the user feedback store. The fine-tuning scheduler 316 follows an iterative process in which one or more parameters and one or more hyperparameters of the pre-trained machine learning based language model are updated in each iteration until the pre-trained machine learning based language model accurately identifies the set of potential passphrases present in the text of at least one potential file. The iterative process ensures a continuous improvement of the pre-trained machine learning based language model with the evolving needs and expectations of users.
[065]
FIG. 4, with reference to FIGS. 1-3, illustrates a schematic block diagram representation 350 of the passphrase detector module 308 associated with the system 102 of FIG. 2 and FIG. 1 for generating the set of context from the text present in each potential file, in accordance with an embodiment of the present disclosure.
[066]
In an embodiment, the passphrase detector module 308 uses the pre-trained machine learning based language model, language model configuration files (LM config), a syntactic embedding model, a linguistic chunk builder, the passphrase constituency tree database and the passphrase pattern database for generating the set of context from the text present in each potential file.
[067]
In an embodiment, the pre-trained machine learning based language model (shown with reference to FIG. 6) is trained to understand and generate human-like language. The machine learning based language model is trained on large datasets containing text from a diverse range of sources, such as books, articles, websites, password and passphrase corpora, and other text corpora. Once the machine learning based language model is trained, the pre-trained machine learning based language model may generate human-like text by predicting the most likely sequence of words based on a provided context. In at least one example embodiment, the pre-trained language models, such as generative pre-trained transformer (GPT-3), Google Bard, large language model meta artificial intelligence (LLaMA), Anthropic Claude etc., that are pre-trained on massive datasets can be used by the passphrase detector module 308.
[068]
The pre-trained machine learning based language model is designed to predict the probability of a sequence of words or characters given the context of the preceding words or characters. The machine learning based language model also learns the patterns, relationships, and structures present in the language in order to understand the context of the words in a sentence.
[069]
In at least one example embodiment, the system 102 captures the syntactic, semantic, and pragmatic aspects of language, which allows the pre-trained machine learning based language model to generate coherent and contextually relevant text.
[070]
In an embodiment, the language model configuration files encompass a range of parameters and settings that define the architecture, hyperparameters, and behavior of the pre-trained machine learning based language model. The language model configuration files may incorporate hyperparameters like learning rate, batch size, epochs, optimizer type, loss function, evaluation metrics, maximum sequence length, and other relevant parameters. It should be noted that the contents of the language model configuration file may differ based on the underlying architecture and the implementation framework.
[071]
In an embodiment, the linguistic chunk builder is configured to build linguistic chunks and to extract only the text that is relevant to user credentials. It should be noted that the chunk refers to a discrete and manageable portion or segment of the larger text that might be present in a potential file. The breaking down of a long text into smaller chunks may help the system 102 in various ways, such as ease of processing, efficient analysis, and improved performance of the system 102.
[072]
In at least one example embodiment, the linguistic chunk builder is configured to generate a set of context from the text present in each potential file of the set of potential files using a context generation technique among one or more context generation techniques that may be available to the system 102. Examples of the one or more context generation techniques include, but are not limited to, the constituency tree based technique, the contextual entropy based technique, the perplexity based technique and the chunk summary based technique.
[073]
In an embodiment, the linguistic chunk builder uses the constituency tree based technique for generating the set of context. In the constituency tree based technique, the linguistic chunk builder requires a passphrase patterns database, a passphrase constituency tree database, a syntactic embedding model and the pre-trained machine learning based language model.
[074]
In one embodiment, the passphrase patterns database is created using passphrases that have been leaked in the public domain. Upon utilizing the passphrase patterns database, a plurality of distinctive patterns commonly used by users when creating passphrases can be identified. Each entry in the passphrase patterns database exclusively consists of strings representing passphrases. In particular, for creating the passphrase patterns database, a Part of Speech (POS) tag is assigned to each entry present in a passphrase dataset. Subsequently, the unique POS sequences present within the dataset are identified. For instance, the POS output for the passphrase "sponge bob square pants" is a sequence of noun and adjective tags (e.g., NN NN JJ NNS).
[075]
In the passphrase patterns database, the unique POS patterns are categorized into the following groups: a) passphrases containing words + numbers + special characters, b) passphrases containing words + numbers, and c) passphrases containing words only. The categorization of the unique POS patterns may help in reducing the time consumed for performing the passphrase detection.
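For illustration only, the creation of the passphrase patterns database may be sketched with the NLTK POS tagger as below; the three-way categorization follows the groups listed above and the sample passphrases are hypothetical:

import re
import nltk   # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

def pos_pattern(passphrase):
    tokens = nltk.word_tokenize(passphrase)
    return tuple(tag for _, tag in nltk.pos_tag(tokens))

def category(passphrase):
    has_num = bool(re.search(r"\d", passphrase))
    has_special = bool(re.search(r"[^\w\s]", passphrase))
    if has_num and has_special:
        return "words + numbers + special characters"
    if has_num:
        return "words + numbers"
    return "words only"

patterns_db = {}
for p in ["sponge bob square pants", "correct horse battery staple", "2 cats on a m@t"]:
    patterns_db.setdefault(category(p), set()).add(pos_pattern(p))
print(patterns_db)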
[076]
In one embodiment, the passphrase constituency tree database is created to store context around possible passphrases that are present within a text. The constituency parse trees are leveraged to build the context of the passphrases.
[077]
In constituency parsing, sentences are generally analyzed by breaking them down into sub-phrases known as constituents. So, to construct the passphrase constituency tree database, examples from two distinct datasets are combined. A first dataset, denoted as A1, comprises the unique POS patterns identified during the creation of the passphrase patterns database. For the second dataset, A2, diverse patterns observed in how users store their credentials within plain text files are utilized. Examples of some diverse patterns include, but are not limited to, 1) username followed by a passphrase in the next line, 2) application name on the first line, followed by the username on the subsequent line, and the passphrase on the line following that, 3) username followed by the passphrase on a single line, and 4) email address followed by the passphrase on the subsequent line, and the like.
[078]
Thereafter, to construct constituency parse trees for passphrase structures, constituency trees are generated for the patterns identified in dataset A2. The patterns represent various ways that are used by the users to store credentials. Further, from the collection of patterns, the passphrases are excluded and are replaced with the constituency structures of passphrases obtained from the passphrase patterns database. As in the passphrase pattern database, the POS sequences are transformed into constituency trees by encapsulating the sequences under a Noun Phrase (NP) parent node. Finally, a strategy that explores different combinations from the datasets A1 and A2 is employed to create a larger dataset. The larger dataset then serves as the training data for a constituency tree based syntactic embedding model, which is subsequently utilized to contextualize potential passphrases within the text. A schematic representation showing creation of the constituency tree database is shown with reference to FIG. 7.
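For illustration only, the NP encapsulation may be sketched with the nltk Tree type as below; the POS sequence and the "username then passphrase" layout are hypothetical examples:

from nltk.tree import Tree

pos_seq = ["NN", "NN", "JJ", "NNS"]   # hypothetical unique POS pattern from dataset A1
passphrase_np = Tree("NP", [Tree(tag, ["w%d" % i]) for i, tag in enumerate(pos_seq)])

# Hypothetical A2 layout: username followed by the passphrase.
layout = Tree("S", [Tree("NP", [Tree("NN", ["username"])]), passphrase_np])
layout.pretty_print()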
[079]
In an embodiment, in the constituency tree based technique, first the syntactic embedding model is trained using constituency parse trees extracted from the passphrase constituency tree database using a graph embedding algorithm. In an embodiment, without limiting the scope of the invention, the graph embedding algorithm used for training the syntactic embedding model is a Node2Vec algorithm. Then, a sliding window technique is applied on the text present in each potential file based on a pre-defined window size to obtain one or more types of content present in the text. In at least one example embodiment, a window size equal to the maximum number of characters allowed for a passphrase according to the NIST guidelines is selected. A plurality of text windows are created for the text present in each potential file based on the pre-defined window size, and the type of content present in each text window is obtained using the sliding window technique. In particular, within each text window, it is checked whether the window consists solely of words; words and numbers; or words, numbers, and special characters.
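For illustration only, training of the syntactic embedding model may be sketched as below using the networkx and node2vec packages; the tree-to-graph flattening, the hyperparameters, and the mean-pooling of node vectors into a single tree embedding are assumptions of the sketch rather than details fixed by the disclosure:

import networkx as nx
from nltk.tree import Tree
from node2vec import Node2Vec   # pip install node2vec

def add_tree(graph, tree, node_id="0"):
    # Flatten a constituency tree into an undirected graph of labelled positions.
    label = tree.label() if isinstance(tree, Tree) else str(tree)
    graph.add_node(node_id, label=label)
    if isinstance(tree, Tree):
        for i, child in enumerate(tree):
            child_id = node_id + "." + str(i)
            add_tree(graph, child, child_id)
            graph.add_edge(node_id, child_id)

g = nx.Graph()
add_tree(g, Tree.fromstring(
    "(S (NP (NN username)) (NP (JJ correct) (NN horse) (NN battery) (NN staple)))"))
n2v = Node2Vec(g, dimensions=32, walk_length=10, num_walks=50, workers=1)
model = n2v.fit(window=5, min_count=1)     # gensim Word2Vec over biased random walks
tree_vec = model.wv.vectors.mean(axis=0)   # mean-pool node vectors into one tree embedding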
[080]
Thereafter, a plurality of part-of-speech (POS) tags are assigned to the window text present in each text window and the closest matches from the passphrase patterns database are identified. In one embodiment, the text window may contain text unrelated to passphrases, so in such cases, partial matching is performed with the passphrase patterns database. The identified matches are then stored in a phrase list.
[081]
Further, a context window is generated for each phrase in the phrase list. In an embodiment, the context window refers to potential window patterns that can exist in general. Examples of the potential window patterns include, but are not limited to, a Username and a Password in the same line; the Username and the Password in different lines; and an application name, the Username and the Password in consecutive lines. So, each context window surrounding a potential window pattern 𝐿𝑘 is examined and if the potential file contains a structure resembling any of the context windows, the constituency tree for the matched context window is created. Finally, the embeddings for the constituency trees are computed using the trained syntactic embedding model and the computed embeddings are compared with the embedding of the constituency tree of 𝐿𝑘 to obtain a similarity score. In an embodiment, without limiting the scope of the invention, the similarity score is computed using a cosine similarity technique. If the similarity score is found to exceed a predefined similarity score threshold τ, then the context window with the respective phrase is appended to the set of context.
[082]
Once the process is carried out for each phrase present in the phrase list, the set of context is forwarded to the machine learning based language model which then identifies the passphrases present in the extracted text.
[083]
An algorithm for the constituency tree based technique is defined below:
Input: Text T, Passphrase patterns database D1, Passphrase constituency tree database D2, window patterns PW, Embedding model Memb trained on D2, threshold 𝜏
Output: final_context
Step 1: Select window size W = # max characters in a passphrase suggested by NIST
Step 2: final_context = []
Step 3: Slide window of size W over Text T until all characters in T are analyzed:
Step 4: Lphrase = [] # List of phrases that match
Step 5: Assign POS tags to text in W and find matching patterns with entries in D1
Step 6: Add all matched entries into list Lphrase
Step 7: For each phrase P in Lphrase: # Context building
Step 8: Check whether any window pattern in PW is suitable for context consideration
Step 9: For each window in PW:
Step 10: Compute constituency tree of text in the window
Step 11: d = compute embedding of the constituency tree
Step 12: Compare embedding of P with d using cosine similarity
Step 13: If score is greater than 𝜏, then add text in window with P into final_context list
Step 14: If final_context list is empty, add P into final_context
Return final_context
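For illustration only, a direct Python rendering of the above algorithm is given below; pos_tags(), pattern_match(), and tree_embedding() are hypothetical stubs standing in for the POS tagger, the patterns database lookup, and the trained syntactic embedding model:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pos_tags(text):                      # stub POS tagger
    return ["NN"] * len(text.split())

def pattern_match(tags, d1):             # stub lookup against patterns database D1
    return [" ".join(tags)] if tags else []

def tree_embedding(span):                # stub for the trained syntactic embedding model
    return np.ones(8)

def build_context(text, d1, window_patterns, tau=0.8, w=64):
    final_context = []
    for start in range(0, max(len(text) - w + 1, 1), w):   # Step 3: slide window of size W
        window = text[start:start + w]
        for p in pattern_match(pos_tags(window), d1):      # Steps 4-6: collect Lphrase
            matched = False
            for ctx in window_patterns:                    # Steps 8-12: try context windows
                if cosine(tree_embedding(ctx), tree_embedding(p)) > tau:
                    final_context.append((ctx, p))         # Step 13: keep window text with P
                    matched = True
            if not matched:
                final_context.append((None, p))            # Step 14: fall back to bare phrase
    return final_context

print(build_context("user alice pass ants are awesome!", {}, ["user+pass same line"]))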
[084]
In another embodiment, the linguistic chunk builder uses the contextual entropy based technique for generating the set of context. In general, contextual entropy refers to an amount of uncertainty or randomness present in the context surrounding a particular word or a sequence of words in a text. Basically, the contextual entropy measures how predictable the surrounding words are given the context. For instance, consider a sentence "Alice loves to play piano and sing songs." Now, let the focus be on the word "piano" and let the context of the word "piano" be examined. The context here includes the words that appear nearby, which are "play", "and", and "sing." The contextual entropy of "piano" in this sentence may be relatively low because the surrounding words provide strong contextual clues about what type of activity Alice is doing. The word "play" suggests an activity involving instruments, and "sing" adds the musical context. Thus, given the context, it may be highly predictable that "piano" would appear in this sentence, showing high semantic similarity. Now consider a different sentence, "Correct horse battery staple". Here, the context surrounding the word "battery" includes "Correct", "horse", and "staple". The words in this context are not related to battery in any sense. In fact, these words are completely random and there is no semantic relation between the words in the considered sentence. Hence, a low contextual entropy indicates that the surrounding words strongly predict the appearance of the word in question, while a high contextual entropy suggests that the context provides less predictive information about the word in question.
[085]
So, the correlation between the two different concepts of entropy and semantic similarity is used in the context of natural language processing, and the contextual entropy based technique proposes to use the semantic similarity to measure the contextual entropy.
[086]
In the contextual entropy based technique, first a sliding window technique is applied on the text present in each potential file based on the pre-defined window size. As a result, the plurality of text windows are created for the text present in each potential file based on the pre-defined window size. Then, the text present in each text window is tokenized into a set of words. Thereafter, for each word present in the text, a word embedding of the word is computed along with the word embeddings of its neighbouring words using a pre-trained embedding model. Further, a cosine similarity score is computed between the embedding of the word and the embeddings of its neighbouring words. Furthermore, a minimum cosine similarity score is calculated, and the minimum cosine similarity score is appended to an entropy score list. Finally, an entropy score is calculated by aggregating the scores in the entropy score list and the calculated entropy score is then compared with a predefined entropy threshold. Upon determining that the entropy score is less than the predefined entropy threshold, the text is appended to the set of context.
[087]
An algorithm for the contextual entropy based technique is defined below:
Input: Text T, Embedding model Memb
Output: score_entropy
Step 1: Tokenize the Text T into words.
Step 2: L_score = [] # list for storing entropy scores of words in T
Step 3: For each word w in T:
Step 4: Compute the word embedding of w along with the word embeddings of its neighboring words within T.
Step 5: Compute the cosine similarity between the word embedding of w and the word embeddings of its neighboring words.
Step 6: Select the minimum similarity score and add it to L_score
Step 7: score_entropy = Aggregated scores of L_score
Return score_entropy
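For illustration only, the algorithm may be rendered in Python as below; the embedding table EMB is a random stand-in for the pre-trained embedding model, and the aggregation is a simple mean of the per-word minimum neighbour similarities:

import numpy as np

rng = np.random.default_rng(0)
EMB = {w: rng.standard_normal(16) for w in "correct horse battery staple".split()}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def entropy_score(text, radius=2):
    words = [w for w in text.lower().split() if w in EMB]   # Step 1
    l_score = []                                            # Step 2
    for i, w in enumerate(words):                           # Step 3
        neighbours = words[max(0, i - radius):i] + words[i + 1:i + 1 + radius]
        sims = [cos(EMB[w], EMB[n]) for n in neighbours]    # Steps 4-5
        if sims:
            l_score.append(min(sims))                       # Step 6
    return float(np.mean(l_score)) if l_score else 0.0      # Step 7

# A low score (mutually dissimilar neighbours) falls below the predefined
# entropy threshold and marks the window as passphrase-like context.
print(entropy_score("correct horse battery staple"))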
[088]
In yet another embodiment, the linguistic chunk builder uses the perplexity-based technique for generating the set of context. In general, the perplexity serves as a metric to assess language models. In particular, it provides a means to evaluate the quality of written language, sentences, or words. The perplexity quantifies the degree of uncertainty a machine learning model exhibits when assigning probabilities to a given text. A lower perplexity score indicates that the machine learning model is more confident in the input resembling text from the input data distribution.
[089]
In the perplexity-based technique, it is assumed that the machine learning based language model is trained on corpora of text that represent human language. The perplexity-based technique is similar to the contextual entropy-based method. The only difference is that instead of computing a score for each word and its surrounding context in a text window, the machine learning based language model is utilized by the linguistic chunk builder to compute the perplexity score of the entire text in the window. And, if the perplexity score is found to be greater than a predefined perplexity threshold, then the text window is appended to the set of context.
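For illustration only, the perplexity check may be sketched as below with GPT-2 as one concrete choice of language model; the model choice and any threshold value are assumptions of the sketch:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss   # mean token cross-entropy
    return float(torch.exp(loss))

# Random word sequences score far higher than fluent prose, so windows whose
# perplexity exceeds the predefined threshold are appended to the set of context.
for t in ["Alice loves to play piano.", "correct horse battery staple"]:
    print(t, "->", round(perplexity(t), 1))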
[090]
Further, in another embodiment, the linguistic chunk builder uses the chunk summary based technique for generating the set of context. In the chunk summary based technique, the linguistic chunk builder breaks the text present in each potential file into a set of chunks. It should be noted that a chunk size is decided based on a use case, a maximum input length of the machine learning based language model, and/or other model parameters of the machine learning based language model. Thereafter, the linguistic chunk builder summarizes the information present in each chunk and uses it as a context for the next chunk. This process is iteratively performed until the end of the text present in each potential file. In particular, the process is iteratively performed until all the chunks present in the potential file are analyzed. The analyzed information is then appended to the set of context. In another embodiment, the created set of chunks can be directly passed to the machine learning based language model for passphrase detection.
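For illustration only, the chunk summary loop may be sketched as below; summarize() is a hypothetical stand-in for the machine learning based language model and the 1,000-character chunk size is an assumption tied to the model's input limit:

def summarize(text):
    # Placeholder for the machine learning based language model summary call.
    return text[:120]

def chunked_context(text, chunk_size=1000):
    contexts, carry = [], ""
    for start in range(0, len(text), chunk_size):
        chunk = carry + text[start:start + chunk_size]   # carry previous summary
        carry = summarize(chunk)                         # summary seeds the next chunk
        contexts.append(chunk)
    return contexts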
[091]
Once the set of context from the text present in each potential file of the set of potential files is generated using any of the techniques mentioned above, the generated set of context is passed to the pre-trained machine learning based language model which uses the set of context and the assigned sensitivity score of each potential file to come up with the set of potential passphrases that might be present in the text.
[092]
FIG. 5, with reference to FIGS. 1 - 4, illustrates an exemplary flow diagram of a method for detecting passphrases in plain text, in accordance with an example embodiment of the invention. The method 500 may use the system 102 of FIGS. 1 and 2 for execution. In an embodiment, the system 102 comprises one or more data storage devices or the memory 202 operatively coupled to the one or more hardware processors 204 and is configured to store instructions for execution of steps of the method 500 by the one or more hardware processors 204. The sequence of steps of the flow diagram may not be necessarily executed in the same order as they are presented. Further, one or more steps may be grouped together and performed in form of a single step, or one step may have several sub-steps that may be performed in parallel or in sequential manner. The steps of the method of the present disclosure will now be explained with reference to the components of the system 102 as depicted in FIG. 2 and FIG. 1.
[093]
At step 502 of the present disclosure, the one or more hardware processors 206 of the system 102 receive a plurality of files present in a user device, such as the user device 106. The user device is associated with a user.
[094]
At step 504 of the present disclosure, the one or more hardware processors 206 of the system 102 filter the plurality of files based on a set of file attributes to obtain a set of potential files. In at least one example embodiment, the set of file attributes refers to file attributes that may help in filtering out non-essential files among the plurality of files. For instance, certain files of the plurality of files are seldom accessed by users, such as system registry files, virtual memory files (Pagefile.sys), and hibernation files (Hiberfil.sys) in Windows operating systems.
Similarly, in Linux and macOS, files like system-wide configuration files, system binaries and admin commands, and system and application logs are typically untouched by users. Hence, the chances of a user storing a password in system files are negligible. So, the system 102 uses the set of file attributes to filter out the non-essential files, such as system files, among the plurality of files.
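By way of a non-limiting illustration, a hedged sketch of this attribute-based filtering is shown below; the skip lists are illustrative stand-ins for the set of file attributes and are not exhaustive:

from pathlib import Path

SYSTEM_NAMES = {"pagefile.sys", "hiberfil.sys"}        # Windows examples from above
SYSTEM_PREFIXES = ("/etc/", "/usr/bin/", "/var/log/")  # Linux/macOS examples

def filter_potential_files(paths: list[str]) -> list[str]:
    potential = []
    for p in paths:
        if Path(p).name.lower() in SYSTEM_NAMES:
            continue  # seldom user-edited Windows system files
        if any(p.startswith(prefix) for prefix in SYSTEM_PREFIXES):
            continue  # system configuration, binaries, and logs
        potential.append(p)  # everything else remains a potential file
    return potential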
[095]
At step 506 of the present disclosure, the one or more hardware processors 206 of the system 102 perform a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators. In an embodiment, the one or more sensitivity indicators are external attributes or context that offer additional insights into a potential file, such as file location and access patterns. So, the hardware processors 206 assign a sensitivity score to each of the potential files from the set of potential files based on the sensitivity analysis.
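A minimal sketch of such sensitivity scoring from external attributes is given below; the indicator weights and location hints are hypothetical and would be tuned in a real deployment:

import os
import time

def sensitivity_score(path: str) -> float:
    score = 0.0
    lowered = path.lower()
    # File location indicator: user-curated locations are more likely to
    # hold stored credentials (hints below are illustrative assumptions).
    if any(hint in lowered for hint in ("desktop", "documents", "notes")):
        score += 0.5
    # Access pattern indicator: recently accessed files score higher.
    try:
        age_days = (time.time() - os.stat(path).st_atime) / 86400
        if age_days < 30:
            score += 0.5
    except OSError:
        pass  # unreadable files keep their location-based score only
    return score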
[096]
At step 508 of the present disclosure, the one or more hardware processors 206 of the system 102 generate a set of context from text present in each potential file of the set of potential files using a constituency tree based technique.
[097]
In the constituency tree based technique, the one or more hardware processors 206 of the system 102 first train a syntactic embedding model based on one or more constituency parse trees present in the passphrase constituency tree database using a graph embedding algorithm. Then, the one or more hardware processors 206 of the system 102 apply a sliding window technique on the text present in each potential file based on the pre-defined window size to obtain one or more types of content present in the text. In particular, the system 102 creates a plurality of text windows for the text present in each potential file based on the pre-defined window size. The type of content present in each text window is obtained using the sliding window technique.
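A hedged sketch of the sliding window step follows, assuming word-level windows of a pre-defined size and a stride of one word, both of which are assumptions rather than values fixed by this disclosure:

def text_windows(text: str, window_size: int = 8) -> list[str]:
    # Create overlapping word-level windows over the text of a potential file.
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(len(words) - window_size + 1)]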
[098]
Thereafter, the one or more hardware processors 206 of the system 102 assign part-of-speech (POS) tags to the window text present in each text window. The one or more matches between the window text and one or more POS patterns present in the passphrase pattern database are then identified based on the assigned POS tags. The identified one or more matches are stored in a phrase list. In an embodiment, the phrase list may comprise one or more phrases.
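By way of a non-limiting illustration, a minimal sketch of the POS-pattern matching is shown below, assuming NLTK's default tagger and a two-entry stand-in for the passphrase pattern database; the patterns themselves are illustrative only:

import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

PASSPHRASE_POS_PATTERNS = [
    ("NN", "NN", "JJ", "NNS"),  # illustrative pattern, e.g., noun-heavy phrases
    ("JJ", "NN", "NN"),         # illustrative pattern only
]

def match_window(window_text: str, phrase_list: list[str]) -> None:
    # Tag the window text and scan for any known passphrase POS pattern.
    tags = tuple(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(window_text)))
    for pattern in PASSPHRASE_POS_PATTERNS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == pattern:
                phrase_list.append(window_text)  # store the matched window text
                return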
[099]
Further, the one or more hardware processors 206 of the system 102 identify a context window for each phrase in the phrase list. Then, a constituency tree is created for at least one context window matching with a predefined set of context windows.
[0100]
For each created constituency tree, the one or more hardware processors 206 of the system 102 compute an embedding for a respective constituency tree using the trained syntactic embedding model. The embedding of each constituency tree is then compared with an embedding of a phrase constituency tree created for each phrase in the phrase list. The one or more hardware processors 206 of the system 102 also provide a similarity score for each comparison.
[0101]
Then, for each similarity score, the one or more hardware processors 206 of the system 102 determine whether the respective similarity score is greater than a predefined similarity score threshold.
[0102]
Finally, for each phrase whose similarity score is found to be greater than the predefined similarity score threshold, the one or more hardware processors 206 of the system 102 append the respective phrase and the context window to the set of context.
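A hedged sketch of this similarity check is given below, assuming the tree embeddings are already available as vectors from the graph embedding algorithm; the cosine measure and the threshold value are assumptions, not requirements of the disclosure:

import numpy as np

SIMILARITY_THRESHOLD = 0.8  # illustrative value

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def append_matches(phrases, context_embeddings, phrase_embeddings, context_windows):
    # Keep a phrase and its context window when the two embeddings agree.
    set_of_context = []
    for phrase, ctx_emb, phr_emb, window in zip(
            phrases, context_embeddings, phrase_embeddings, context_windows):
        if cosine(ctx_emb, phr_emb) > SIMILARITY_THRESHOLD:
            set_of_context.append((phrase, window))
    return set_of_context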
[0103]
At step 510 of the present disclosure, the one or more hardware processors 206 of the system 102 identify a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using the pre-trained machine learning based language model.
[0104]
In an embodiment, once the set of potential passphrases is identified, the one or more hardware processors 206 of the system 102 first determine whether the user is a valid user using an authentication mechanism. Upon determining that the user is the valid user, the one or more hardware processors 206 of the system 102 display the set of potential passphrases on the user device 106.
[0105]
In case the user is determined to be an invalid user, the one or more hardware processors 206 of the system 102 mask each potential passphrase present in the text of each potential file. The masked potential passphrases are then displayed on the user device 106.
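A minimal sketch of the masking step for an invalid user follows; the masking style (keep the first character of each word, star the rest) is an assumption for illustration:

def mask_passphrase(passphrase: str) -> str:
    # Mask each word, preserving only its first character.
    return " ".join(w[0] + "*" * (len(w) - 1) if len(w) > 1 else "*"
                    for w in passphrase.split())

# e.g., mask_passphrase("sponge bob square pants") -> "s***** b** s***** p****"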
[0106]
In an embodiment, the one or more hardware processors 206 of the system 102 also provide an explanation for each potential passphrase of the set of potential passphrases using the passphrase explanation module 310.
[0107]
In an embodiment, the one or more hardware processors 206 of the system 102 also evaluate the strength of each potential passphrase of the set of potential passphrases based on a predefined set of criteria using the passphrase strength indicator module 312. The passphrase strength is then displayed on the user device 106 along with the respective potential passphrase.
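A hedged sketch of such strength evaluation is shown below; the criteria used (overall length, word count, and character variety) are illustrative stand-ins for the predefined set of criteria:

import string

def passphrase_strength(passphrase: str) -> str:
    score = 0
    score += len(passphrase) >= 16                             # overall length
    score += len(passphrase.split()) >= 4                      # multiple words
    score += any(c.isdigit() for c in passphrase)              # contains digits
    score += any(c in string.punctuation for c in passphrase)  # contains symbols
    return ["weak", "fair", "good", "strong", "strong"][score]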
[0108]
In an embodiment, the one or more hardware processors 206 of the system 102 may receive at least one feedback and at least one comment on the one or more potential passphrases present in the set of potential passphrases. The at least one feedback and the at least one comment are then stored in the user feedback store.
[0109]
In an embodiment, the one or more hardware processors 206 of the system 102 may fine-tune the pre-trained machine learning based language model based on the feedback and the comments present in the user feedback store using the fine-tuning scheduler 316. As explained earlier, the fine-tuning scheduler follows an iterative process in which one or more parameters and one or more hyperparameters of the pre-trained machine learning based language model are updated in each iteration until the pre-trained machine learning based language model accurately identifies the set of potential passphrases.
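A structural sketch of this iterative loop is given below; the training and evaluation callables, target accuracy, and iteration cap are assumptions that would be supplied by the fine-tuning scheduler in practice:

def fine_tune(model, feedback_store, train_step, evaluate,
              target_accuracy=0.95, max_iterations=10):
    # Iterate until the model's detections agree with stored user feedback.
    for _ in range(max_iterations):
        for example in feedback_store:
            train_step(model, example)  # update parameters/hyperparameters
        if evaluate(model, feedback_store) >= target_accuracy:
            break  # model now identifies the potential passphrases accurately
    return model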
[0110]
FIG. 7 illustrates a schematic representation showing creation of the constituency tree database, in accordance with an embodiment of the present disclosure.
[0111]
As seen in FIG. 7, a distinct tree is generated for each of a username "user123" and a passphrase "sponge bob square pants" using currently available password detection tools. However, to establish context, a unified parse tree, i.e., a constituency tree, needs to be created for each pattern.
[0112]
For this, the generated trees are merged with supplementary nodes that are contextually appropriate. For instance, in the above example, the username and passphrase are delimited by a newline character ("\n"). Consequently, the trees are merged accordingly, as seen in FIG. 7. It should be noted that this is just one example representation; depending on the context and pattern, many trees can be linked using various types of nodes, and thus the constituency tree database will be created using the created constituency trees.
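By way of a non-limiting illustration, a hedged sketch of merging the per-pattern parse trees into a single constituency tree is shown below using nltk.Tree; the node labels (USER, PASS, DELIM, CRED) are illustrative assumptions, not labels taken from FIG. 7:

from nltk import Tree  # requires the nltk package

# Trees for the individual patterns.
username_tree = Tree("USER", [Tree("NN", ["user123"])])
passphrase_tree = Tree("PASS", [Tree("NN", ["sponge"]), Tree("NN", ["bob"]),
                                Tree("JJ", ["square"]), Tree("NNS", ["pants"])])

# Merge the trees with a supplementary node for the newline delimiter ("\n"
# is shown as an escaped literal so the printed tree stays readable).
merged = Tree("CRED", [username_tree, Tree("DELIM", ["\\n"]), passphrase_tree])
merged.pretty_print()  # the merged tree is what the tree database would store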
[0113]
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0114]
As discussed earlier, the existing password detection methods fail to detect passphrases. So, to overcome the disadvantages, embodiments of the present disclosure provide a method and a system for detecting passphrases in plain text. More specifically, the system and the method use the constituency tree based technique in which linguistic features, such as part-of-speech (POS) tags and the constituency tree, are used for extracting potential passphrases and for building relevant context surrounding the potential passphrases from the plaintext, thereby ensuring accurate identification of passphrases containing whitespace characters. The system and the method first perform filtering based on certain file attributes to eliminate irrelevant files, thereby reducing computational overhead while ensuring a usable system. Further, the system uses the passphrase patterns database in which the unique POS patterns are categorized into a plurality of categories. The categorization of the unique POS patterns reduces the time consumed for performing the passphrase detection and also ensures efficient searching of relevant text in a file.
[0115]
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[0116]
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include, but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0117]
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0118]
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD-ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0119]
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the 20 following claims.
We Claim:
1. A processor implemented method (500), comprising:
receiving (502), by a system via one or more hardware processors, a plurality of files present in a user device, wherein the user device is associated with a user;
filtering (504), by the system via the one or more hardware processors, the plurality of files based on a set of file attributes to obtain a set of potential files;
performing (506), by the system via the one or more hardware processors, a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators, wherein a sensitivity score is assigned to each of the potential files from the set of potential files based on the sensitivity analysis;
generating (508), by the system via the one or more hardware processors, a set of context from text present in each potential file of the set of potential files using a constituency tree based technique; and
identifying (510), by the system via the one or more hardware processors, a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using a pre-trained machine learning based language model.
2. The processor implemented method (500) as claimed in claim 1,
comprising:
determining, by the system via the one or more hardware processors, whether the user is a valid user using an authentication mechanism; and
displaying, by the system via the one or more hardware processors, the set of potential passphrases upon determining that the user is the valid user.
3. The processor implemented method (500) as claimed in claim 2,
comprising:
upon determining that the user is an invalid user,
masking, by the system via the one or more hardware processors, each potential passphrase present in the text of each potential file; and
displaying, by the system via the one or more hardware processors, masked potential passphrases on the user device.
4. The processor implemented method (500) as claimed in claim 2,
comprising:
providing, by the system via the one or more hardware processors, an explanation for each potential passphrase of the set of potential passphrases;
evaluating, by the system via the one or more hardware processors, a strength of each potential passphrase of the set of potential passphrases based on a predefined set of criteria; and
displaying, by the system via the one or more hardware processors, the strength of each potential passphrase along with the associated potential passphrase.
5. The processor implemented method (500) as claimed in claim 1,
comprising:
receiving, by the system via the one or more hardware processors, at least one feedback and at least one comment on one or more potential passphrases present in the set of potential passphrases; and
storing, by the system via the one or more hardware processors, the at least one feedback and the at least one comment in a user feedback store.
6. The processor implemented method (500) as claimed in claim 5,
comprising:
fine-tuning, by the system via the one or more hardware processors, the pre-trained machine learning based language model based on a plurality of feedbacks and comments present in the user feedback store using a fine-tuning scheduler, wherein the fine-tuning scheduler follows an iterative process in which one or more parameters and one or more hyperparameters of the pre-trained machine learning based language model are updated in each iteration until the pre-
trained machine learning based language model accurately identifies the set of potential passphrases.
7. The processor implemented method (500) as claimed in claim 1, wherein
the constituency tree based technique comprises:
training, by the system via the one or more hardware processors, a syntactic embedding model based on one or more constituency parse trees present in a passphrase constituency tree database using a graph embedding algorithm;
applying, by the system via the one or more hardware processors, a sliding window technique on the text present in each potential file based on a pre-defined window size to obtain one or more types of content present in the text, wherein a plurality of text windows are created for the text present in each potential file based on the pre-defined window size, and wherein the type of content present in each text window is obtained using the sliding window technique;
assigning, by the system via the one or more hardware processors, a plurality of part-of-speech (POS) tags to a window text present in each text window;
identifying, by the system via the one or more hardware processors, one or more matches between the window text and one or more POS patterns present in a passphrase pattern database based on the plurality of assigned POS tags;
storing, by the system via the one or more hardware processors, the identified one or more matches in a phrase list, wherein the phrase list comprises one or more phrases;
identifying, by the system via the one or more hardware processors, a context window for each phrase in the phrase list;
creating, by the system via the one or more hardware processors, a constituency tree for at least one context window matching with a predefined set of context windows;
for each created constituency tree, computing, by the system via the one or more hardware processors, an embedding for an associated constituency tree using the trained syntactic embedding model;
comparing, by the system via the one or more hardware processors, the embedding of each constituency tree with an embedding of a phrase constituency tree created for each phrase in the phrase list, wherein a similarity score is obtained for each comparison;
for each similarity score, determining, by the system via the one or more hardware processors, whether the associated similarity score is greater than a predefined similarity score threshold; and
for each phrase whose similarity score is found to be greater than the predefined similarity score threshold, appending, by the system via the one or more hardware processors, the associated phrase and the context window to the set of context.
8. A system (102), comprising:
a memory (202) storing instructions;
one or more communication interfaces (206); and
one or more hardware processors (204) coupled to the memory (202) via the one or more communication interfaces (206), wherein the one or more hardware processors (204) are configured by the instructions to:
receive a plurality of files present in a user device, wherein the user device is associated with a user;
filter the plurality of files based on a set of file attributes to obtain a set of potential files;
perform a sensitivity analysis of each potential file of the set of potential files based on one or more sensitivity indicators, wherein a sensitivity score is assigned to each of the potential files from the set of potential files based on the sensitivity analysis;
generate a set of context from text present in each potential file of the set of potential files using a constituency tree based technique; and
identify a set of potential passphrases present in the text based on the set of context and the assigned sensitivity score of each potential file using a pre-trained machine learning based language model.
9. The system as claimed in claim 8, wherein the one or more hardware
processors (204) are configured by the instructions to:
determine whether the user is a valid user using an authentication mechanism; and
display the set of potential passphrases upon determining that the user is the valid user.
10. The system as claimed in claim 9, wherein the one or more hardware
processors (204) are configured by the instructions to:
upon determining that the user is an invalid user,
mask each potential passphrase present in the text of each potential file; and
display masked potential passphrases on the user device.
11. The system as claimed in claim 9, wherein the one or more hardware
processors (204) are configured by the instructions to:
provide an explanation for each potential passphrase of the set of potential
passphrases;
evaluate a strength of each potential passphrase of the set of potential passphrases based on a predefined set of criteria; and
display the strength of each potential passphrase along with the associated potential passphrase.
12. The system as claimed in claim 8, wherein the one or more hardware
processors (204) are configured by the instructions to:
receive at least one feedback and at least one comment on one or more potential passphrases present in the set of potential passphrases; and
store the at least one feedback and the at least one comment in a user feedback store.
13. The system as claimed in claim 12, wherein the one or more hardware
processors (204) are configured by the instructions to:
fine-tune the pre-trained machine learning based language model based on a plurality of feedbacks and comments present in the user feedback store using a fine-tuning scheduler, wherein the fine-tuning scheduler follows an iterative process in which one or more parameters and one or more hyperparameters of the pre-trained machine learning based language model are updated in each iteration until the pre-trained machine learning based language model accurately identifies the set of potential passphrases.
14. The system as claimed in claim 8, wherein the constituency tree based
technique comprises:
train a syntactic embedding model based on one or more constituency parse trees present in a passphrase constituency tree database using a graph embedding algorithm;
apply a sliding window technique on the text present in each potential file based on a pre-defined window size to obtain one or more types of content present in the text, wherein a plurality of text windows are created for the text present in each potential file based on the pre-defined window size, and wherein the type of content present in each text window is obtained using the sliding window technique;
assign a plurality of part-of-speech (POS) tags to a window text present in each text window;
identify one or more matches between the window text and one or more POS patterns present in a passphrase pattern database based on the plurality of assigned POS tags;
store the identified one or more matches in a phrase list, wherein the phrase list comprises one or more phrases;
identify a context window for each phrase in the phrase list;
create a constituency tree for at least one context window matching with a predefined set of context windows;
for each created constituency tree, compute an embedding for an associated constituency tree using the trained syntactic embedding model;
compare the embedding of each constituency tree with an embedding of a phrase constituency tree created for each phrase in the phrase list, wherein a similarity score is obtained for each comparison;
for each similarity score, determine whether the associated similarity score is greater than a predefined similarity score threshold; and
for each phrase whose similarity score is found to be greater than the predefined similarity score threshold, append the associated phrase and the context window to the set of context.