
Method and System for Machine Learning Based Computer System Validation (CSV)

Abstract: This disclosure relates generally to machine learning based computer system validation (CSV). Typically, CSV is carried out in a semi-automatic manner, leading to time-inefficiency and human errors. The disclosed method and system provide a machine learning (ML) based solution for automating the authoring and review of documents for CSV compliance. The method includes creation of a pharma domain specific knowledge graph using NLP techniques. The knowledge graph is utilized for authoring and review of documents. The system receives input documents and extracts metadata associated with functional testing requirements therefrom. The system queries the knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements. The set of knowledge elements is obtained based at least on a prediction of semantic similarity between the metadata and knowledge elements using an NLP model. One or more CSV artifacts may be created using the knowledge elements. [To be published with FIGS. 4A-4B]


Patent Information

Application #
Filing Date
21 June 2021
Publication Number
51/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
kcopatents@khaitanco.com
Parent Application
Patent Number
Legal Status
Grant Date
2025-01-27
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. SAHOO, Nihar Ranjan
Tata Consultancy Services Limited SDF V, Santacruz Electronic Export Processing Zone, Andheri (East), Mumbai Maharashtra India 400096
2. AYALASOMAYAJULA, Kavitha
Tata Consultancy Services Limited SDF V, Santacruz Electronic Export Processing Zone, Andheri (East), Mumbai Maharashtra India 400096
3. KSHIRSAGAR, Mahesh
Tata Consultancy Services Limited SDF V, Santacruz Electronic Export Processing Zone, Andheri (East), Mumbai Maharashtra India 400096
4. SHARMA, Sonam
Tata Consultancy Services Limited Galaxy embassy Park Tower C, Plot No A44, Sector 62, Noida Uttar Pradesh India 201309

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR MACHINE LEARNING BASED COMPUTER SYSTEM VALIDATION (CSV)
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD [001] The disclosure herein generally relates to the field of machine learning based computer system validation, and, more particularly, to a method and system for machine learning based computer system validation (CSV) in the pharmaceutical domain.
BACKGROUND
[002] The structured and systematic development of information technology (IT) applications necessarily follows stringent adherence to software development lifecycle (SDLC) processes, which not only leads to building of robust applications, but also to manageable maintenance/enhancements of these IT applications. Documentation is core to the SDLC process. This documentation around IT applications is very stringent in industries like Pharmaceutical (Pharma). Due to such stringent documentation requirements, the efforts involved in building IT applications for the Pharmaceutical industry are typically 1.5 times more than the efforts required for building IT applications for domains other than the Pharmaceutical domain. Hence, any automation around documentation, be it creation of documentation or verification and validation of the documents, is of high importance for Pharmaceutical projects.
[003] While the core of the SDLC does not change, the thoroughness and comprehensiveness of the SDLC processes and involved contents are on a far higher side in Pharma. There are Pharma regulations which guide the thoroughness and comprehensiveness of the SDLC processes and contents, both at the requirements phase and also during the testing phase.
[004] In the Pharma domain, this validation of documentation is termed Computer Systems Validation (CSV), which ensures that the computer system meets the purpose it is designed for. The intent is to ensure the system meets a set of defined requirements consistently every time, as intended.
[005] Currently, the process of authoring and reviewing SDLC artefacts of IT projects has been completely manual across industries. It is the subject matter expert (SME) who understands the contents of the Pharma regulations and derives the actionable insights. The current process is predominantly a manual process and thus has obvious limitations linked to subjectivity and coverage.
SUMMARY [006] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for machine learning based computer system validation (CSV) is provided. The method includes receiving, via one or more hardware processors, one or more input documents comprising a plurality of requirements for a computer system validation (CSV) process. Further, the method includes extracting, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using natural language processing (NLP) techniques, via the one or more hardware processors. Furthermore, the method includes querying, via the one or more hardware processors, a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements, the set of knowledge elements obtained based at least on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model. Herein, constructing the domain specific knowledge graph of regulations is based on a plurality of entities and a plurality of relationships between the plurality of entities associated with the pharma domain. Also, the knowledge graph establishes embeddings between the plurality of entities using a graph embedding algorithm, and facilitates deriving inferences using a path finding algorithm. The information is extracted from the knowledge graph in the form of regulation extraction rules that need to be complied with while addressing functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and an organization of rules in multiple groups.
The method further includes creating, via the one or more hardware processors, one or more CSV artifacts based on the set of knowledge elements, wherein creating the one or more CSV artifacts comprises building, via the one or more hardware processors, a plurality of groups from the knowledge graph based on K-means clustering for the plurality of entities, wherein the plurality of groups are associated with a first set of dimensions of the regulations; and establishing, based on deterministic rules and probabilistic rules, a correlation between the regulations and the SDLC process by associating the plurality of groups to one of a set of dimensions of SDLC via the one or more hardware processors. Herein, establishing the correlation based on the probabilistic rules comprises parsing, using NLP, an enterprise model comprising the set of dimensions associated with SDLC to extract candidate entities for the knowledge base in the form of triples; extracting frequent predicate paths from the candidate triples using association rule mining algorithms to extract a set of subgraphs that define the regulations to be mapped to the set of dimensions, wherein each subgraph of the set of subgraphs is associated with a score; and selecting a subgraph from the set of subgraphs based on the score. Further, the method includes validating, via the one or more hardware processors, applicability of the regulation for the functional testing requirements by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation.
[007] In another aspect, a system for machine learning based computer system validation (CSV) is provided. The system includes a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive one or more input documents comprising a plurality of requirements for a computer system validation (CSV) process. Further, the one or more hardware processors are configured by the instructions to extract, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using NLP. Furthermore, the one or more hardware processors are configured by the instructions to query a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements, the set of knowledge elements obtained based at least on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model, wherein the one or more hardware processors are configured by the instructions to construct the domain specific knowledge graph of regulations based on a plurality of entities and a plurality of relationships between the plurality of entities associated with the pharma domain. Also, the one or more hardware processors are configured by the instructions to establish embeddings between the plurality of entities in the knowledge graph using a graph embedding algorithm, and to derive inferences using a path finding algorithm. Moreover, the information is extracted from the knowledge graph in the form of regulation extraction rules that need to be complied with while addressing functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and an organization of rules in multiple groups. The one or more hardware processors are configured by the instructions to create one or more CSV artifacts based on the set of knowledge elements. To create the one or more CSV artifacts, the one or more hardware processors are further configured by the instructions to build a plurality of groups from the knowledge graph based on K-means clustering for the plurality of entities, wherein the plurality of groups are associated with a first set of dimensions of the regulations; and establish, based on deterministic rules and probabilistic rules, a correlation between the regulations and the SDLC process by associating the plurality of groups to one of a set of dimensions of SDLC.
The one or more hardware processors are configured by the instructions to establish the correlation based on the probabilistic rules by parsing, using NLP, an enterprise model comprising the set of dimensions associated with SDLC to extract candidate entities for the knowledge base in the form of triples; extracting frequent predicate paths from the candidate triples using association rule mining algorithms to extract a set of subgraphs that define the regulations to be mapped to the set of dimensions, wherein each subgraph of the set of subgraphs is associated with a score; and selecting a subgraph from the set of subgraphs based on the score. The one or more hardware processors are configured by the instructions to validate applicability of the regulation for the functional testing requirements by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation.
[008] In yet another aspect, a non-transitory computer readable medium for a method for machine learning based computer system validation (CSV) is provided. The method includes receiving, via one or more hardware processors, one or more input documents comprising a plurality of requirements for a computer system validation (CSV) process. Further, the method includes extracting, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using NLP, via the one or more hardware processors. Furthermore, the method includes querying, via the one or more hardware processors, a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements, the set of knowledge elements obtained based at least on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model. Herein, constructing the domain specific knowledge graph of regulations is based on a plurality of entities and a plurality of relationships between the plurality of entities associated with the pharma domain. Also, the knowledge graph establishes embeddings between the plurality of entities using a graph embedding algorithm, and facilitates deriving inferences using a path finding algorithm. The information is extracted from the knowledge graph in the form of regulation extraction rules that need to be complied with while addressing functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and an organization of rules in multiple groups.
The method further includes creating, via the one or more hardware processors, one or more CSV artifacts based on the set of knowledge elements, wherein creating the one or more CSV artifacts comprises building, via the one or more hardware processors, a plurality of groups from the knowledge graph based on K-means clustering for the plurality of entities, wherein the plurality of groups are associated with a first set of dimensions of the regulations; and establishing, based on deterministic rules and probabilistic rules, a correlation between the regulations and the SDLC process by associating the plurality of groups to one of a set of dimensions of SDLC via the one or more hardware processors. Herein, establishing the correlation based on the probabilistic rules comprises parsing, using NLP, an enterprise model comprising the set of dimensions associated with SDLC to extract candidate entities for the knowledge base in the form of triples; extracting frequent predicate paths from the candidate triples using association rule mining algorithms to extract a set of subgraphs that define the regulations to be mapped to the set of dimensions, wherein each subgraph of the set of subgraphs is associated with a score; and selecting a subgraph from the set of subgraphs based on the score. Further, the method includes validating, via the one or more hardware processors, applicability of the regulation for the functional testing requirements by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation.
[009] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[010] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[011] FIG. 1 illustrates an example network implementation of a system for machine learning based computer system validation (CSV), in accordance with an example embodiment.
[012] FIG. 2 illustrates a flow diagram for a method for building a knowledge graph for ML based CSV according to some embodiments of the present disclosure.
[013] FIG. 3 illustrates example representation of a process for extracting subgraphs for building the logic rules and reasoning in the knowledge graph, in accordance with an example embodiment.
[014] FIGS. 4A-4B illustrate a flow diagram for a method for ML based CSV according to some embodiments of the present disclosure.

[015] FIG. 5 illustrates considerations based on both quantitative and qualitative validation of the disclosed system according to some embodiments of the present disclosure.
[016] FIG. 6 illustrates a process for review of deliverables in accordance with an example embodiment of the present disclosure.
[017] FIG. 7 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS [018] Across the Pharma value chain there are strict regulatory norms, and IT systems must adhere to them. Based on the functionality of an IT system, its scope and its type, the applicable norms for that IT system must be arrived at from those regulations. While regulations exist across the Pharma value chain, the IT systems too have multiple dimensions: Business (Functional Coverage, Criticality, Sensitivity, Complexity), Operations (Security, Confidentiality, Uptime) and Technology (Infrastructure Products, Architecture, Deployment Mode). Typically, the SMEs manually arrive at the applicable regulatory norms using their knowledge and experience. However, such solutions are time consuming, unproductive, and prone to human errors.
[019] Various embodiments described herein provide a method and system for automating the CSV process by the use of a machine learning model. In particular, the disclosed method and system utilize a set of rules, natural language processing (NLP), machine learning algorithms and deep learning techniques to extract data for reviews. Moreover, the proposed system is also capable in terms of extensibility, flexibility and adaptability, as will be described further in conjunction with the following figures.
[020] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[021] Referring now to the drawings, and more particularly to FIG. 1 through 7, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[022] FIG. 1 illustrates an example network implementation 100 of a system 102 for machine learning based computer system validation (CSV), in accordance with an example embodiment. As previously described, across the Pharma value chain, there are strict regulatory norms and IT systems must adhere to them. Based on the functionality of the IT systems, their scope and their type, the applicable norms for those IT systems must be arrived at from those regulations. While regulations exist across the Pharma value chain, the IT systems too have multiple dimensions: Business (Functional Coverage, Criticality, Sensitivity, Complexity), Operations (Security, Confidentiality, Uptime) and Technology (Infrastructure Products, Architecture, Deployment Mode). The disclosed system 102 is configured to build an exhaustive knowledgebase (also referred to herein as a knowledge graph), and to build a global language model to cover a plurality of regulatory norms and the associated business, technology and operations dimensions of the SDLC process. In an embodiment, the system 102 utilizes a set of rules, NLP, machine learning algorithms and deep learning techniques to build the knowledge graph and query the knowledge graph for automated extraction of actionable insights. Further, the disclosed system is capable of automating conformity checks linked to validation of test results by associating the regulatory norms and the business, operations and technology dimensions of SDLC with test results, and predicting similarity therebetween, as will be described further in the description below.

[023] Although the present disclosure is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems 104, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 102 may be accessed through one or more devices 106-1, 106-2... 106-N, collectively referred to as devices 106 hereinafter, or applications residing on the devices 106. Examples of the devices 106 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation and the like. The devices 106 are communicatively coupled to the system 102 through a network 108.
[024] In an embodiment, the network 108 may be a wireless or a wired network, or a combination thereof. In an example, the network 108 can be implemented as a computer network, as one of the different types of networks, such as a virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 108 may include a variety of network devices, including routers, bridges, servers, computing devices, and storage devices. The network devices within the network 108 may interact with the system 102 through communication links.
[025] As discussed above, the system 102 may be implemented in a computing device 104, such as a hand-held device, a laptop or other portable computer, a tablet computer, a mobile phone, a PDA, a smartphone, and a desktop computer. The system 102 may also be implemented in a workstation, a mainframe computer, a server, and a network server. In an embodiment, the system 102 may be coupled to a data repository, for example, a repository 112. The repository 112 may store data processed, received, and generated by the system 102. In an alternate embodiment, the system 102 may include the data repository 112.

[026] The network implementation 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of devices 106, such as a smartphone, with the server 104, and accordingly with the database 112, using any communication link including the Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 102 is implemented to operate as a stand-alone device. In another embodiment, the system 102 may be implemented to work as a loosely coupled device to a smart computing environment.
[027] As described above, the system creates a knowledge graph for pharma regulations, as will be described with reference to FIG. 2 below.
[028] FIG. 2 illustrates a flow diagram of a method for building a knowledge graph for pharma regulations to be used in ML based CSV process, in accordance with an example embodiment.
[029] The knowledge graph facilitates automated extraction of actionable insights from documents associated with pharma regulations. Herein, the term actionable insights refers to regulatory statements extracted from the regulatory document that facilitate extraction of a set of triples, such that each triple comprises a Concept-Relation-Concept and provides a graph of connected triples across the document explaining their relations. For example, consider a regulatory statement – “The integrity of the data and the integrity of the protocols should be maintained when making changes to the computerized system”. The aforementioned regulatory statement provides a set of connected triples, Integrity-Maintain-Data and Integrity-Maintain-Protocol. The triples provide an actionable insight that Data and Protocol are both related and are an integral part of the Change Control Process, and thus the aforementioned triples may be directed to the Change Control Artifact of the SDLC.
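As a minimal illustration (the data structure and names below are hypothetical, not from the specification), the connected triples from the example statement can be indexed into a small adjacency map and queried to surface which concepts share a relation:

```python
# Toy sketch: indexing Concept-Relation-Concept triples into a small graph
# and querying which concepts are linked through a shared subject.
from collections import defaultdict

def build_triple_graph(triples):
    """Index triples as an adjacency map: concept -> [(relation, concept)]."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

# Triples extracted from: "The integrity of the data and the integrity of
# the protocols should be maintained when making changes ..."
triples = [
    ("Integrity", "Maintain", "Data"),
    ("Integrity", "Maintain", "Protocol"),
]
graph = build_triple_graph(triples)

# Data and Protocol hang off the same concept, so they are related and can
# be routed to the same SDLC artifact (here, Change Control).
related = [obj for _, obj in graph["Integrity"]]
print(related)  # ['Data', 'Protocol']
```

In a full implementation the graph would of course hold many such triples per document, with the adjacency map replaced by a proper graph store.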
[030] The process of deriving actionable insights begins at step 202, wherein inferencing of the regulations is performed. The regulations are inferenced to understand the said pharma regulations for both the semantics and the context. The regulatory information extraction and transformation being mostly domain specific, there are challenges in parsing of text and tokenization as to formatting, abbreviations and domain specific references. The interpretation of text is attuned to the context with a limited domain vocabulary. In addition, the sentence structure in the text is complex and non-linear, which makes extraction of information difficult. The solution is to overcome these challenges and make the process automated in such a way that the parser works for every kind of regulation irrespective of its domain, to reduce rework in the future. The disclosed method and system provide a methodology that may combine a rule-based system with components of a standard NLP pipeline and an unsupervised machine learning algorithm to define a generalized framework for analysis.
[031] Herein, instead of using NLP directly, the disclosed method first trains the NLP model (for example, using a Spacy model) with domain specific words provided in rules’ form as a dictionary. The training of the NLP model helps further in tokenization and POS tagging because the model parses the text with an understanding of domain specific words. The disclosed method uses NLP pipeline components that are relevant to the framework, including, for example, a tokenizer, lemmatizer, POS tagger and dependency parser.
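The idea of supplying domain specific words as rules before tokenization can be sketched as below. The dictionary entries and canonical forms are invented for illustration, and the actual system trains a Spacy model rather than doing plain string replacement:

```python
import re

# Hypothetical rule dictionary mapping domain terms to canonical tokens,
# applied before tokenization so the downstream parser sees known words.
DOMAIN_RULES = {
    "GxP": "good_practice_regulation",
    "CSV": "computer_system_validation",
}

def tokenize(text, rules=DOMAIN_RULES):
    """Rewrite known domain terms first, then split into word tokens."""
    for term, canonical in rules.items():
        text = text.replace(term, canonical)
    return re.findall(r"[A-Za-z_]+", text)

tokens = tokenize("CSV artifacts must follow GxP norms.")
print(tokens)
```

With Spacy, the equivalent effect would be achieved by adding the domain vocabulary as tokenizer special cases or entity ruler patterns instead of pre-rewriting the text.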
[032] The proposed method and system achieve three important objectives: (1) extraction of the information in the form of rules as defined in the regulations, (2) identification of the entities that are responsible for compliance, and (3) organization of rules in multiple homogeneous groups.
[033] For inferencing the regulations, the disclosed method includes extraction of a set of regulation extraction rules from the input documents (training data). In an embodiment, the system performs automatic extraction of obligatory sentences. The solution parses the whole text from the documents (training data) and extracts only those sentences that define a regulation extraction rule. This set of regulation extraction rules is to be complied with while having reviews or validations. To recognize said regulation extraction rules, verbs that signify obligations may be identified in the sentences. For instance, a library of synonyms such as WordNet may facilitate extraction of verbs signifying obligations. The method may include querying and extracting the synonyms of verbs which express obligations, e.g., shall, should, must, oblige and others.
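The obligatory-sentence filter described above can be sketched as follows; the obligation verb list here is a hand-picked stand-in for the WordNet-derived synonyms the text mentions:

```python
import re

# Stand-in obligation vocabulary; in practice this would be expanded via
# WordNet synonym queries for verbs like "oblige".
OBLIGATION_VERBS = {"shall", "should", "must", "oblige", "obliged", "required"}

def extract_obligatory_sentences(text):
    """Split text into sentences and keep only those containing an
    obligation verb, i.e., candidate regulation extraction rules."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s for s in sentences
        if OBLIGATION_VERBS & {w.lower() for w in re.findall(r"[A-Za-z]+", s)}
    ]

doc = ("The system stores audit logs. "
       "Records must be retained for five years. "
       "Access should be restricted to authorized users.")
obligatory = extract_obligatory_sentences(doc)
for s in obligatory:
    print(s)
```

Only the two sentences carrying "must" and "should" survive the filter; the purely descriptive first sentence is dropped.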

[034] The method further includes extraction of subjects/entities with domain adaptation so as to recognize the entities that are responsible for complying with the extracted regulations. The domain terms may be identified in a semi-supervised manner. For instance, the disclosed system may recognize domain terms or concepts based on inputs from the subject matter experts (SME), regulatory norms and organization standards. Such concepts may be utilized by a parsing engine to identify subject and/or object phrases that stipulate who and/or what needs to perform the verb action. Here, for complex sentences with multiple phrases with the verb, subjects that are mostly noun phrases (nsubj in Spacy) may be extracted. In an embodiment, the disclosed system utilizes Dependency Tree and Breadth First Search techniques to find all the tokens that are related to the subject by a parent-child relationship, thereby facilitating obtaining the set of words involved in defining the entities responsible to comply with the extracted rules.
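The Breadth First Search over the dependency tree can be sketched as below, assuming the parent-child token relations have already been produced by a parser (the tree here is hand-built for the example sentence rather than coming from Spacy):

```python
from collections import deque

def subject_span(children, subject):
    """BFS over a parent->children dependency tree to collect every token
    reachable from the subject head, i.e., the full subject phrase."""
    seen, queue = {subject}, deque([subject])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hand-built tree for "The integrity of the data should be maintained":
# 'integrity' is the subject head (nsubj); the determiner and the
# 'of'-phrase hang off it as descendants.
children = {
    "integrity": ["The", "of"],
    "of": ["data"],
    "data": ["the"],
}
span = subject_span(children, "integrity")
print(sorted(span))
```

The collected span ("The integrity of the data") names the entity responsible for complying with the obligation in the sentence.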
[035] With the extraction of the entities and of the verbs which act as relations between the entities, the knowledge graph is constructed. The knowledge graph acts as the knowledge base that maintains the regulations in their original form and the relations between entities (and mentions). It establishes embeddings between the nodes using a graph embedding algorithm (node2vec/DeepWalk), thereby facilitating any inferencing on the graph using a path finding algorithm. Graph-based K-means clustering for the entities is performed to build homogeneous groups from the graph. The aforementioned method further facilitates updating the knowledge graph in case any change in regulations is identified, as the disclosed system has the capability to understand what has changed, where, and how it can impact the Pharma SDLC demands.
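The grouping step can be sketched as below, assuming node embeddings (such as those produced by node2vec) are already available as numeric points; a plain K-means then partitions the entities into homogeneous groups. The embeddings and initial centers are invented for illustration:

```python
import math

def kmeans(points, centers, iters=10):
    """Plain K-means over precomputed node embeddings. Deterministic here
    because the initial centers are passed in explicitly."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each embedding to its nearest center.
            i = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return clusters

# Toy 2D "embeddings" for six regulation entities forming two clear groups.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
groups = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print([len(g) for g in groups])  # [3, 3]
```

Real embeddings would be higher-dimensional vectors learned from random walks over the graph, but the clustering logic is unchanged.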
[036] In an embodiment, updating the knowledge graph comprises extracting the set of regulation defining rules from updated documents associated with the regulation. Further, the entities that are responsible for complying with the extracted regulations are extracted. Thereafter, a relationship is defined between the set of regulations and the extracted entities to update the knowledge graph. Such updating of the knowledge graph facilitates maintaining traceability.

[037] Once the knowledge graph is created, the disclosed method establishes a correlation between the regulations and the SDLC process. The method includes associating the regulatory norms to a plurality of dimensions including a business dimension, an operations dimension and a technology dimension of the SDLC. Associating the regulatory norms to the plurality of dimensions is based on both deterministic rules and probabilistic business rules. The probabilistic business rules are determined by leveraging case-based reasoning so as to establish the correlation between the regulations and the SDLC process. The determination of the probabilistic rules results in an associated confidence score. Herein, the case-based reasoning (CBR) approach executes a new problem by retrieving cases that are similar to it. The approach predicts attributes for an entity by gathering reasoning paths from similar entities in the knowledge base (e.g., the knowledge graph). The prediction is from a probabilistic model that estimates the likelihood, or the confidence score, that a path is effective at answering a query about the given entity.
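A minimal sketch of scoring a reasoning path by its confidence, i.e., how often the path retrieved the correct attribute for similar entities in past cases. The path names and counts below are hypothetical:

```python
def path_confidence(path, answered, attempted):
    """Confidence that a reasoning path answers a query: the fraction of
    similar cases where following the path gave the correct attribute."""
    return answered.get(path, 0) / attempted[path]

# Hypothetical statistics gathered from similar entities in the graph:
# how often each predicate path was tried and how often it succeeded.
attempted = {("regulates", "appliesTo"): 10, ("relatedTo",): 8}
answered  = {("regulates", "appliesTo"): 9,  ("relatedTo",): 2}

scores = {p: path_confidence(p, answered, attempted) for p in attempted}
best = max(scores, key=scores.get)
print(best, scores[best])  # ('regulates', 'appliesTo') 0.9
```

The highest-scoring path is the one the CBR step would follow to predict the SDLC dimension for a new regulation.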
[038] At step 204 of method 200, the process of establishing the correlation between the regulations and SDLC is performed, and includes understanding and transformation of the plurality of dimensions including the business, operations and technical dimensions of SDLC using NLP. In an embodiment, the plurality of dimensions may be understood by parsing the enterprise model containing details on the business, operations and technical dimensions of SDLC. Herein, for each dimension of the plurality of dimensions, the documents relevant to the dimension may be parsed. For instance, for the business dimension, the business model document may be parsed using NLP and understood. Further, the candidate entities and their mentions are extracted with relations, for them to be available for the knowledge base. Herein, parsing and entity extraction help in understanding the candidate entities in the graph. In an embodiment, the candidate entities are to be extracted using the same NLP model which is used for knowledge base creation.
[039] Upon extraction of the candidate entities, logic rules and reasoning are built into the knowledge graph, as illustrated with reference to FIG. 3. In an embodiment, for building the logic rules and reasoning, a logical-rules-with-subgraph-mining approach is used. The candidate triples 302 are passed from one of the dimensions of the SDLC (business, operations, technology) to query the knowledge graph and extract frequent predicate paths 304, thereby extracting a set of related subgraphs 306. Herein, the candidate triples comprise entities that are competing to be predicted as the correct entity for a phrase; these are the input for querying the graph. Each subgraph of the set of subgraphs may be associated with a score. Herein, the score associated with a subgraph is calculated based on the frequency of matches of the triples from the SDLC dimensional space with the enterprise model. The subgraph with the highest score may define the followed/compliant regulations to be mapped to the dimension (operations, technology or business). Herein, to extract frequent pattern paths for each predicate relation from the candidate triples in the knowledge graph, the method initially starts by finding 1-predicate paths for the given predicate and then iteratively finds the nearest k-predicate paths from the (k-1)-predicate paths. There may be endless paths in the knowledge graph, and hence pruning is done to reduce the complexity of reaching a conclusion. Herein, the approach is to find all the Frequent Predicate Paths (FPPs), and then discover Frequent Path Cycles (FPCs) by checking which FPPs can form predicate cycles. Since this is computationally intensive, the search space of predicate paths is trimmed. For a starting predicate, two 1-predicate paths are first generated and evaluated to determine whether they are frequent. After that, a loop retrieves frequent k-predicate paths by extending frequent (k-1)-predicate paths iteratively. In each iteration, once the frequent k-predicate paths are found, FPCs are finalized from them. Consider an example of a regulatory statement: “The integrity of the data and the integrity of the protocols should be maintained when making changes to the computerized system”.
The system provides a set of connected triples, for example, Integrity-maintain-Data and Integrity-maintain-Protocol. This provides an insight that Data and Protocol are both related and are an integral part of the change control process. This would direct these triples to the Change Control artifact of the SDLC.
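The iterative frequent-predicate-path mining described above can be sketched as follows. This is a simplified, assumed rendering: the toy triples follow the Integrity/Data/Protocol example, support is counted over path groundings, and the FPC (cycle-detection) step is omitted for brevity.

```python
from collections import defaultdict

# Toy triples extracted from regulatory text; names follow the example above.
triples = [
    ("Integrity", "maintain", "Data"),
    ("Integrity", "maintain", "Protocol"),
    ("Data", "part_of", "ChangeControl"),
    ("Protocol", "part_of", "ChangeControl"),
]

adj = defaultdict(list)                 # head entity -> [(predicate, tail)]
for h, r, t in triples:
    adj[h].append((r, t))

def extend(prev):
    """Extend each frequent (k-1)-predicate path by one more predicate."""
    out = defaultdict(list)             # predicate path -> (start, end) groundings
    for path, pairs in prev.items():
        for start, end in pairs:
            for r, t in adj.get(end, []):
                out[path + (r,)].append((start, t))
    return out

def frequent_paths(min_support=2, max_len=3):
    # Start from 1-predicate paths, then iteratively extend, pruning
    # infrequent paths to trim the search space (cycle detection omitted).
    paths = defaultdict(list)
    for h, r, t in triples:
        paths[(r,)].append((h, t))
    result = {}
    k = 1
    while paths and k <= max_len:
        paths = {p: g for p, g in paths.items() if len(g) >= min_support}
        result.update(paths)
        paths = extend(paths)
        k += 1
    return result

fpp = frequent_paths()
print(sorted(fpp))  # [('maintain',), ('maintain', 'part_of'), ('part_of',)]
```

The pruning inside the loop is what keeps the search tractable: a k-predicate path is only generated from (k-1)-predicate paths that already met the support threshold, mirroring the trimming of the predicate-path search space described in the paragraph above.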
[040] The aforementioned logical rules and reasoning help in the analysis of new regulations and in building the business model when an enterprise has historical regulations that are identified on the above SDLC dimensions. The method may further include building case-based reasoning using the historical data. The steps to build case-based reasoning are as follows:
▪ The graph embedding (as described previously) is used for case-based reasoning and leveraged for vector representations of domain words. Herein, the vector representation is made using the frequency of the domain terms and their synonyms in the regulatory document and the SDLC dimension document space.
▪ Sentence embeddings of new regulations are computed to obtain cosine distances.
▪ Finally, the method includes calculating a similarity rank and finding the most closely matched regulations.
▪ This helps in matching newer, similar regulations from the historical data with the SDLC dimension space.
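The steps above can be sketched as a small ranking routine. As a stand-in for the described embeddings, this assumed example uses plain term-frequency vectors (matching the "frequency of the domain terms" wording) and cosine similarity; the regulation IDs and texts are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical historical regulations already mapped to SDLC dimensions.
historical = {
    "REG-001": "integrity of data must be maintained during system changes",
    "REG-002": "access to the system must be restricted to authorized personnel",
}
new_regulation = ("data integrity should be maintained when changing "
                  "the computerized system")

def embed(text):
    # Frequency-based vector over terms, per the description above.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rank historical regulations by similarity to the new regulation.
query = embed(new_regulation)
ranked = sorted(historical,
                key=lambda rid: cosine(query, embed(historical[rid])),
                reverse=True)
print(ranked[0])  # REG-001 is the closest historical match
```

A production system would use trained sentence embeddings rather than raw term frequencies, but the ranking logic (embed, score by cosine, sort) is the same.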
[041] The above methodology aims at building a semantic, logic-based information representation so that the reasoning can be fully automated.
[042] At step 206, the method 200 includes providing recommendations based on actionable insights. For instance, the method may output a decision on how a regulatory norm is applicable for a project. In addition, the method includes outputting what the severity/criticality should be. The severity/criticality may have a score associated with it. For instance, the score for the severity or criticality of a regulatory norm (regulation) may be categorized as low, medium or high, judging the severity of the risk from the likelihood of occurrence and the probability of detection to attain an overall risk level. In an embodiment, the severity/criticality of a regulatory norm may be predefined in a repository of the system. The method may utilize severity/criticality ratings from previous executions for similar domains (operations, technology or business, along with regulatory norms), store such ratings/scores in a repository, and build a robust recommendation system. Based on the similar process and score, the system may validate the test script of any deliverable and recommend compliance. Using the knowledge graph (as described with reference to FIG. 2), the disclosed method and system may be utilized for machine learning based computer system validation. A method for machine learning based computer system validation is described further with reference to FIGS. 4A-4B.
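The severity/criticality scoring described above can be illustrated with a small risk-level lookup combining likelihood of occurrence and probability of detection. The numeric scales and band boundaries below are assumptions for illustration; the disclosure defines only the low/medium/high categorization.

```python
# Illustrative risk-level lookup in the spirit of the severity/criticality
# scoring above; scales and thresholds are assumed, not from the disclosure.

def risk_level(likelihood, detectability):
    """likelihood: 1 (rare) .. 5 (frequent); detectability: 1 (easy) .. 5 (hard)."""
    score = likelihood * detectability   # higher score = riskier
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

print(risk_level(2, 2))  # low
print(risk_level(3, 3))  # medium
print(risk_level(4, 4))  # high
```

In the recommendation flow, such levels from previous executions would be stored in the repository and retrieved for similar regulatory norms.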
[043] FIGS. 4A-4B illustrate a flow chart of a method 400 for machine learning based computer system validation (CSV), in accordance with an example embodiment. Operations of the flowchart, and combinations of operations in the flowchart, may be implemented by various means, such as hardware, firmware, a processor, circuitry and/or another device associated with the execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions which embody the procedures described in various embodiments may be stored by at least one memory device of a system and executed by at least one processor in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embodies means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 400 are described with the help of system 102. However, the operations of the method 400 can be described and/or practiced using any other system.
[044] As previously described, the disclosed method combines a rule-based system with components of a standard natural language processing (NLP) pipeline and an unsupervised machine learning algorithm to define a generalized framework for analysis. The method for machine learning based computer system validation to achieve the aforementioned objective is described in detail below.
[045] At 402, the method 400 includes receiving one or more input documents having a plurality of requirements for a computer system validation (CSV) process. The input documents include, for example, the Operations Qualification Document, Performance Qualification Document, Business Requirements Document, Design Document, and others. In an embodiment, the input documents may be test results associated with the SDLC process for pharma regulations. The plurality of requirements may be associated with compliance with pharma regulations. In an embodiment, the one or more input documents may be in different formats, including but not limited to PDF, Excel and Word formats. The test results, if in image form (post-executed test scripts), may be processed by computer vision techniques to bring them into readable form in natural language. The processed input documents in readable format are further processed using NLP techniques, as will be described later in the description. The contents in other forms (PDF, Excel) may be converted and pre-processed with NLP techniques. The extracted contents are further pre-processed by extracting entities therefrom, deriving relations between the entities, and forming triples. For example, at 404, the method 400 includes extracting, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using NLP. At 406, the method 400 includes querying a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements. The set of knowledge elements is obtained based on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model. As previously described, the domain specific knowledge graph of regulations is constructed based on the plurality of entities and a plurality of relationships between the plurality of entities associated with the pharma domain. The knowledge graph establishes embeddings between the plurality of entities using a graph embedding algorithm and facilitates inferencing using a path finding algorithm.
The information is extracted from the knowledge graph in the form of regulation extraction rules that need to be complied with for the functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and an organization of rules into multiple groups. The method of extracting entities, deriving relations between the entities and forming triples is explained previously with reference to FIG. 2; hence, for brevity of description, the discussion of said method is omitted.
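The entity-relation-triple pre-processing step can be sketched with a rule-based pattern over the regulatory statement quoted earlier. This is an assumed, minimal stand-in for the NLP pipeline: the passive-verb pattern and the "attribute of the object" noun-phrase rule are illustrative heuristics, not the disclosed model.

```python
import re

def extract_triples(sentence):
    """Form (subject, predicate, object) triples from sentences of the shape
    '<attribute> of the <object> ... should be <verb>ed' (a toy heuristic)."""
    s = sentence.lower()
    # Passive main verb, e.g. "should be maintained" -> predicate "maintain".
    verb = re.search(r"\bbe (\w+?)(?:ed|d)\b", s)
    if not verb:
        return []
    pred = verb.group(1)
    # Noun phrases like "integrity of the data" -> (attribute, object) pairs.
    pairs = re.findall(r"(\w+) of the (\w+)", s)
    return [(attr.capitalize(), pred, obj.capitalize()) for attr, obj in pairs]

sentence = ("The integrity of the data and the integrity of the protocols "
            "should be maintained when making changes to the computerized system")
found = extract_triples(sentence)
print(found)
# [('Integrity', 'maintain', 'Data'), ('Integrity', 'maintain', 'Protocols')]
```

A real pipeline would use dependency parsing rather than regular expressions, but the output shape — subject-predicate-object triples ready for the knowledge graph — is the same.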
[046] At 408, the method 400 includes creating one or more CSV artifacts based on the set of knowledge elements. Creating the one or more CSV artifacts includes building, at 410, a plurality of groups from the knowledge graph based on K-means clustering of the plurality of entities, wherein the plurality of groups are associated with each dimension of the first set of dimensions (business, technology and operations) of the regulations. At 412, based on deterministic rules and probabilistic rules, a correlation between the regulations and the SDLC process is established by associating the plurality of groups with one dimension from amongst the set of dimensions of the SDLC.
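The grouping step at 410 can be sketched with a plain K-means pass over toy entity embeddings. The entity names, the 2-D coordinates, and the deterministic first-k initialization are all assumptions for illustration; the disclosure specifies only that entities are grouped via K-means.

```python
import numpy as np

# Toy 2-D graph embeddings for six entities; coordinates are illustrative.
entities = ["AuditTrail", "ESignature", "Backup", "Restore", "Training", "SOP"]
X = np.array([[0.0, 1.0], [0.1, 0.9],   # one natural group
              [1.0, 0.0], [0.9, 0.1],   # a second group
              [0.5, 0.5], [0.5, 0.6]])  # a third group

def kmeans(X, k, iters=10):
    """Plain K-means with a deterministic first-k initialization."""
    centroids = X[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # Move each non-empty centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, 3)
groups = {}
for name, lab in zip(entities, labels):
    groups.setdefault(int(lab), []).append(name)
print(groups)
```

Each resulting group would then be associated with one dimension (business, technology or operations) in the correlation step at 412.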
[047] In an embodiment, establishing the correlation based on the probabilistic rules includes parsing, using NLP, an enterprise model comprising the set of dimensions associated with the SDLC to extract candidate entities for the knowledge base in the form of triples. The enterprise model comprises the knowledge graph created from the regulatory and SDLC multi-dimensional artifacts. Further, the frequent predicate paths are extracted from the candidate triples using association rule mining algorithms to extract a set of subgraphs that defines the regulations to be mapped to the set of dimensions. Each subgraph of the set of subgraphs is associated with a score. Herein, the score is calculated based on the frequency of matches of the triples from the SDLC dimensional space with the enterprise model. The method further includes selecting a subgraph from the set of subgraphs based on the score. The applicability of the regulation for the functional testing requirements is validated by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation at 414. In an embodiment, the output of the test results associated with the functional test requirement is validated in terms of how the actionable insight arrived at earlier is gainfully managed. Herein, the triples from the test result artifacts of the SDLC are used as a query, using case-based reasoning, to check compliance. On the basis of the severity/criticality of the regulatory norms, the validation includes defined thresholds, below which an exception is raised, non-compliance is noted, and the case is directed to an SME for final validation. Also, a checklist of deliverables for the SDLC phases is recommended as per the risk profile of the project, and an automated compliance assessment is made on the basis of the availability of the artifacts of the project.
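The subgraph scoring and threshold-based exception handling described above can be sketched as follows. All names, triples and the threshold value are hypothetical; the disclosure specifies only that scores count triple matches against the SDLC dimensional space and that sub-threshold cases are routed to an SME.

```python
# Triples observed in the SDLC dimensional space (illustrative).
sdlc_triples = {
    ("Integrity", "maintain", "Data"),
    ("Integrity", "maintain", "Protocol"),
    ("Access", "restrict", "System"),
}

# Candidate subgraphs mined from the knowledge graph (illustrative).
subgraphs = {
    "ChangeControl": {("Integrity", "maintain", "Data"),
                      ("Integrity", "maintain", "Protocol")},
    "Security":      {("Access", "restrict", "System"),
                      ("Password", "expire", "Periodically")},
}

def score(subgraph):
    """Frequency of matches between subgraph triples and the SDLC space."""
    return len(subgraph & sdlc_triples)

# The highest-scoring subgraph defines the regulations mapped to the dimension.
best = max(subgraphs, key=lambda name: score(subgraphs[name]))
print(best)  # ChangeControl (2 matching triples vs 1)

THRESHOLD = 2  # assumed severity-based threshold
for name, sg in subgraphs.items():
    if score(sg) < THRESHOLD:
        print(f"exception: {name} routed to SME for final validation")
```

Set intersection keeps the match count exact; in practice the severity/criticality of the regulatory norm would set `THRESHOLD` per regulation rather than globally.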

[048] In an example scenario, to perform the risk-based approach to regulatory validation, the existing GAMP 5 (Good Automated Manufacturing Practice) guidelines are referred to. A GxP impact assessment is carried out to determine whether the computer system has an impact on product quality, patient safety or data integrity. These defined guidelines are augmented and verified in the computer systems being built. All GxP-based computer systems must comply with applicable regulatory requirements.
[049] The validation framework focuses on a risk assessment that considers the purpose of the software being built in the project, i.e., it understands the validation process based on the nature of the project and product, and the risk profile. The system built is purely automated, with the intent to reduce human effort. It aims to provide intelligence with the available static information and is built on top of the available information and regulatory documents. The validation framework is built on top of the established regulatory framework as per the baselined ISPE GAMP 5 (International Society for Pharmaceutical Engineering, Good Automated Manufacturing Practice). It provides best practices for rule-based systems and the flexibility to accommodate AI functionality. FIG. 5 explains the considerations based on both quantitative and qualitative validation of the proposed system, aligned with the GAMP 5 risk-based validation of AI based systems.
[050] FIG. 6 illustrates a process for review of deliverables in accordance with an example embodiment of the present disclosure. At 602, the Regulatory Norms, Operation, Technology and Business Requirements and Test Script Documents are accessed. At 604, the method 600 includes pre-processing documents (PDF, Excel, Word, Images). At 606, the method 600 includes parsing the documents. At 608, the method 600 includes concept and entity-relationship modelling. At 610, the method 600 includes building the knowledge graph. At 612, the method 600 includes extracting traceability among dimensions of documents. At 614, the method 600 includes performing risk profile compliance assessment. The steps of the aforementioned method are described previously in detail with reference to FIGS. 2, 3, 4A-4B and hence the details thereof are omitted herein for the brevity of description.

[051] The implementation details to achieve validation comprise a planning step, requirements & specification, risk management, change management, data selection, model development, and testing. The planning captures guidelines, scope, performance thresholds and a fallback plan, a model validation plan, a knowledge graph validation plan, a plan to validate new models, a plan to validate existing models for bias, a plan for when new AI-based functionality is added to the model, and a plan to build templates for future planning. Requirements & specification includes validation of results and rules with manual intervention and SME input. A flagging technique is to be built with the percentage of risk involved, which helps in assessing the performance of each component. Risk management includes a component required for model validation, namely concept drift, i.e., dealing with changes in data over time; consideration of the unknown errors encountered in automated systems; and consideration of the risk taken when most things depend on an AI model, which is always error-prone. Given the risk taken in building a probabilistic model, it needs to be validated to make things transparent, with a deterministic approach provided wherever required. Under change management, the model may be retrained and the knowledge graph may be re-built as per the requirements. The built models may be under version control. The software system may be built such that, when applying change management, everything is tested end-to-end. For data selection, the richness and completeness of the data, along with its quality, are to be checked. The same methodology for data cleanliness is followed in the training and testing phases. Model development includes the recommended usage of pre-trained models, along with techniques to reduce bias in the data. Model building practices like training and testing, with the study of metrics like precision, recall, F-score and accuracy, are followed for validation. For testing, recommended techniques are to be followed for testing the system as a whole. For AI model testing, all measurement metrics can be taken up for building a robust model.
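The concept-drift check named in the risk-management step can be sketched as a simple baseline comparison. The feature values, the z-score test, and the threshold of three standard errors are assumptions for illustration; many other drift detectors exist.

```python
import statistics

# Minimal concept-drift check: flag drift when the mean of a monitored
# feature shifts by more than an assumed number of standard errors from
# the training baseline. All values below are illustrative.

def drift_detected(train_values, live_values, z_threshold=3.0):
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    live_mu = statistics.mean(live_values)
    # Standard error of the live mean under the training distribution.
    stderr = sigma / len(live_values) ** 0.5
    return abs(live_mu - mu) > z_threshold * stderr

train = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
print(drift_detected(train, [0.50, 0.51, 0.49, 0.52]))  # False: stable
print(drift_detected(train, [0.80, 0.82, 0.79, 0.81]))  # True: drifted
```

In the validation framework, a positive drift signal would trigger the change-management path (model retraining and knowledge graph rebuild).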
[052] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[053] Various embodiments provide a method and system for machine learning based computer system validation, specifically for the pharma domain. Typically, the structured and systematic development of IT applications necessarily follows stringent adherence to SDLC processes, which not only leads to the building of robust applications but also to manageable maintenance/enhancement of these IT applications. Documentation is core to the SDLC process. The documentation around IT applications is very stringent in industries like pharma, due to which the effort involved in building IT applications for pharma is typically 1.5 times higher than usual. Hence, any automation around documentation, be it the creation of documentation or the verification and validation of documents, is of high importance for pharma projects.
[054] While the core of the SDLC does not change, the thoroughness and comprehensiveness of the SDLC processes and the involved contents are far higher in pharma. There are pharma regulations which guide the thoroughness and comprehensiveness of the SDLC processes and contents, both at the requirements phase and during the testing phase. In pharma, this automation is named Computer System Validation (CSV), which ensures that the system meets the purpose it is designed for. The intent is to ensure the system consistently meets a set of defined requirements every time, as intended.
[055] The disclosed system leverages the use of AI/ML based techniques for CSV. In an embodiment, the disclosed method and system facilitate authoring and verifying SDLC artifacts of the CSV process. An important aspect of the disclosed method and system is the building of a pharma domain specific knowledge graph using NLP techniques.
[056] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[057] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[058] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[059] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[060] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method for machine learning based computer system validation (CSV) comprising:
receiving, via one or more hardware processors, one or more input documents comprising a plurality of requirements for a computer system validation (CSV) process;
extracting, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using a plurality of Natural Language Processing (NLP) techniques, via the one or more hardware processors;
querying, via the one or more hardware processors, a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements, wherein the set of knowledge elements is obtained based at least on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model,
wherein constructing the domain specific knowledge graph of regulations is based on the plurality of entities and a plurality of relationships between the plurality of entities associated with the pharmaceutical domain, and
wherein the knowledge graph establishes embedding between the plurality of entities using a Graph embedding algorithm, and facilitates in deriving inferencing using a path finding algorithm, and
wherein information is extracted from the knowledge graph in form of regulation extraction rules that need to be complied with while having functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and organization of rules in multiple groups;
creating, via the one or more hardware processors, one or more CSV artifacts based on the set of knowledge elements, wherein creating the one or more CSV artifacts comprises:

building, via the one or more hardware processors, a plurality of groups from the knowledge graph based on K-means clustering for the plurality of entities, wherein the plurality of groups are associated with a first set of dimensions of the regulations;
establishing, based on deterministic rules and probabilistic rules, a correlation between the regulations and SDLC process by associating the plurality of groups to one of a set of dimensions of SDLC, via the one or more hardware processors, wherein establishing the correlation based on the probabilistic rules comprises:
parsing, using the NLP techniques, an enterprise model comprising the set of dimensions associated with SDLC to extract candidate entities for the knowledge base in form of triples,
extracting the frequent predicate paths from candidate triples using association rules mining algorithms to extract a set of subgraphs that defines the regulations to be mapped to the set of dimensions, wherein each subgraph of the plurality of subgraphs is associated with a score; and
selecting a subgraph from the set of subgraphs based on the score; and
validating, via the one or more hardware processors, applicability of the regulation for the functional testing requirements by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation.
2. The processor implemented method of claim 1, wherein the set of dimensions comprises business, technology and operations dimensions.
3. The processor implemented method of claim 1, further comprising updating the knowledge graph upon determination of change in the regulations.
4. A system (701), comprising:
a memory (715) storing instructions;

one or more communication interfaces (707); and
one or more hardware processors (702) coupled to the memory (715) via the one or more communication interfaces (707), wherein the one or more hardware processors (702) are configured by the instructions to:
receive one or more input documents comprising a plurality of requirements for a computer system validation (CSV) process;
extract, from the one or more input documents, metadata associated with functional testing requirements from amongst the plurality of requirements in the CSV process using a plurality of Natural Language Processing (NLP) techniques;
query a domain specific knowledge graph based on the metadata to obtain a set of knowledge elements associated with the functional testing requirements, the set of knowledge elements obtained based at least on a prediction of semantic similarity between the metadata and the set of knowledge elements using an NLP model,
wherein constructing the domain specific knowledge graph of regulations is based on the plurality of entities and a plurality of relationships between the plurality of entities associated with pharmaceutical domain, and
wherein the knowledge graph establishes embedding between the plurality of entities using a Graph embedding algorithm, and facilitates in deriving inferencing using a path finding algorithm, and
wherein information is extracted from the knowledge graph in form of regulation extraction rules that needs to be complied while having functional testing requirements as defined in the regulations, a plurality of entities that are responsible for compliance, and organization of rules in multiple groups;
create one or more CSV artifacts based on the set of knowledge elements, wherein to create the one or more CSV artifacts, the one or more hardware processors are configured by the instructions to:
build a plurality of groups from the knowledge graph based on K-means clustering for the plurality of entities, wherein the plurality of groups
are associated with a first set of dimensions of the regulations;

establish, based on deterministic rules and probabilistic rules, a correlation between the regulations and SDLC process by associating the plurality of groups to one of a set of dimensions of SDLC via the one or more hardware processors, wherein establishing the correlation based on the probabilistic rules comprises:
parse, using the NLP techniques, an enterprise model comprising the set of dimensions associated with SDLC to extract candidate entities for the knowledge base in form of triples;
extract the frequent predicate paths from candidate triples using association rules mining algorithms to extract a set of subgraphs that defines the regulations to be mapped to the set of dimensions, wherein each subgraph of the plurality of subgraphs is associated with a score; and
select a subgraph from the set of subgraphs based on the score; and validate applicability of the regulation for the functional testing requirements by mapping the selected subgraph with a meaning of the selected subgraph and determining a criticality of the regulation.
5. The system of claim 4, wherein the set of dimensions comprises business, technology and operations dimensions.
6. The system of claim 4, further comprising updating the knowledge graph upon determination of change in the regulations.

Documents

Application Documents

# Name Date
1 202121027796-STATEMENT OF UNDERTAKING (FORM 3) [21-06-2021(online)].pdf 2021-06-21
2 202121027796-REQUEST FOR EXAMINATION (FORM-18) [21-06-2021(online)].pdf 2021-06-21
3 202121027796-PROOF OF RIGHT [21-06-2021(online)].pdf 2021-06-21
4 202121027796-FORM 18 [21-06-2021(online)].pdf 2021-06-21
5 202121027796-FORM 1 [21-06-2021(online)].pdf 2021-06-21
6 202121027796-FIGURE OF ABSTRACT [21-06-2021(online)].jpg 2021-06-21
7 202121027796-DRAWINGS [21-06-2021(online)].pdf 2021-06-21
8 202121027796-DECLARATION OF INVENTORSHIP (FORM 5) [21-06-2021(online)].pdf 2021-06-21
9 202121027796-COMPLETE SPECIFICATION [21-06-2021(online)].pdf 2021-06-21
10 202121027796-FORM-26 [22-10-2021(online)].pdf 2021-10-22
11 Abstract1..jpg 2021-12-03
12 202121027796-FER.pdf 2023-03-06
13 202121027796-FER_SER_REPLY [08-08-2023(online)].pdf 2023-08-08
14 202121027796-DRAWING [08-08-2023(online)].pdf 2023-08-08
15 202121027796-COMPLETE SPECIFICATION [08-08-2023(online)].pdf 2023-08-08
16 202121027796-CLAIMS [08-08-2023(online)].pdf 2023-08-08
17 202121027796-US(14)-HearingNotice-(HearingDate-14-11-2024).pdf 2024-10-10
18 202121027796-Correspondence to notify the Controller [07-11-2024(online)].pdf 2024-11-07
19 202121027796-Written submissions and relevant documents [28-11-2024(online)].pdf 2024-11-28
20 202121027796-PatentCertificate27-01-2025.pdf 2025-01-27
21 202121027796-IntimationOfGrant27-01-2025.pdf 2025-01-27

Search Strategy

1 202121027796_searchE_03-03-2023.pdf

ERegister / Renewals

3rd: 04 Feb 2025 (from 21/06/2023 to 21/06/2024)
4th: 04 Feb 2025 (from 21/06/2024 to 21/06/2025)
5th: 06 May 2025 (from 21/06/2025 to 21/06/2026)