Abstract: AUTOMATED RISK ANALYSIS FOR CONTRACT DOCUMENTS. The present disclosure provides a system and a method for performing risk analysis for contract documents. The method comprises receiving a contract document to be processed. The method further comprises pre-processing the contract document to extract text therefrom. The method further comprises implementing a clause classification model to extract and classify clauses from the text of the contract document into one or more predefined categories. The method further comprises performing risk analysis of each of the clauses to determine risk indicators based on the presence or absence of one or more clauses and/or relevant entities in the clauses, and on predefined rules associated with the one or more predefined categories. The method further comprises outputting a result of the risk analysis in the form of a traffic light report. FIG. 4
AUTOMATED RISK ANALYSIS FOR CONTRACT DOCUMENTS
FIELD OF THE PRESENT DISCLOSURE
[0001] The present disclosure generally relates to natural language processing, and particularly to systems and methods for performing risk analysis for contract documents.
BACKGROUND
[0002] Most modern enterprises have a large number of contracts in force at any given time. A contract document defines the scope of obligations and benefits with regard to the external and internal parties involved. For example, a non-disclosure agreement (NDA) is a type of binding contract document between two or more parties that prevents sensitive information from being shared with others. Enterprises may regularly add new NDA contracts for each new business deal, for example, with a customer, a contractor, a vendor, or the like. It may be understood that enterprises may be exposed to potential liabilities if an NDA contract document is not properly drafted. Thus, there is a need to review these contract documents for various obligations, including criminal, governmental, or tort obligations, and for provisions governing assignments and indemnity, as a part of risk analysis by the enterprise.
[0003] Contract document review may be described as a process of reviewing the content of documents to identify information relevant to one or more topics. For example, NDAs typically contain numerous binding clauses which need to be analysed. Such contract document review is typically performed in order to understand contractual obligations, navigate client or customer relationships, and understand compliance risk. The task of reviewing draft contract documents, for instance as part of due diligence before execution, is traditionally performed by humans, specifically legal professionals such as Advocates. Such traditional contract document review is a highly time-intensive, expensive, and subjective process. Further, such traditional contract document review involving manual extraction of rules and provisions, as may be performed by legal professionals, may contribute to increased inefficiency. Oftentimes, the number of NDAs that a company deals with on a regular basis can be very large, and such inefficiencies may lead to delays in contract execution, which in turn may affect business operations and revenues.
[0004] Therefore, in light of the foregoing discussion, there exists a need to overcome problems associated with the traditional contract document review process, and provide an automated system and method for performing risk analysis for contract documents.
SUMMARY
[0005] In an aspect, a method for performing risk analysis for contract documents is disclosed. The method comprises receiving a contract document to be processed. The method further comprises pre-processing the contract document to extract text therefrom. The method further comprises implementing a clause classification model to extract and classify clauses from the text of the contract document into one or more predefined categories. The method further comprises performing risk analysis of each of the clauses to determine risk indicators based on the presence or absence of one or more clauses and/or relevant entities in the clauses, and on predefined rules associated with the one or more predefined categories. The method further comprises outputting a result of the risk analysis in the form of a traffic light report.
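By way of a non-limiting illustration, the overall workflow of the aspect above may be sketched as follows. All categories, rules, and keyword checks in this sketch are hypothetical toy stand-ins for the trained models described in the embodiments; they are not the disclosed models themselves.

```python
# Hypothetical sketch of the claimed workflow: classify each clause, apply
# predefined per-category rules, and emit a traffic-light risk indicator.
# Toy keyword heuristics stand in for the trained models described below.

PREDEFINED_RULES = {
    # category -> entities that must be present for the clause to be low risk
    "confidentiality": {"term", "parties"},
    "indemnity": {"liability_cap"},
}

def classify_clause(clause):
    """Toy stand-in for the clause classification model (keyword match)."""
    text = clause.lower()
    if "confidential" in text:
        return "confidentiality"
    if "indemnif" in text:
        return "indemnity"
    return "other"

def extract_entities(clause):
    """Toy stand-in for the named-entity extraction step."""
    text = clause.lower()
    found = set()
    if "years" in text:
        found.add("term")
    if "cap" in text:
        found.add("liability_cap")
    return found

def risk_indicator(clause):
    """RED if every required entity is missing, YELLOW if some are, else GREEN."""
    required = PREDEFINED_RULES.get(classify_clause(clause))
    if required is None:
        return "GREEN"  # no predefined rules attached to this category
    missing = required - extract_entities(clause)
    if not missing:
        return "GREEN"
    return "RED" if missing == required else "YELLOW"

clauses = [
    "Confidential information shall be protected for five years.",
    "Each party shall indemnify the other for all losses.",
]
print({clause: risk_indicator(clause) for clause in clauses})
```

The sketch shows the claimed dependence of the risk indicator on both the presence/absence of entities and the predefined rules for the clause category.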
[0006] In one or more embodiments, the clause classification model is trained on text from a plurality of sample contract documents by preparing data from the text using one or more data processing techniques selected from stemming, lemmatization, vectorization, and Parts-Of-Speech (POS) tagging.
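As a simplified, non-limiting illustration of the data-preparation step, the following sketch applies a naive suffix-stripping stemmer and bag-of-words vectorization; a production system would typically use a library such as NLTK or spaCy, and the suffix list and vocabulary here are hypothetical.

```python
# Toy illustration of data preparation: stemming followed by bag-of-words
# vectorization over a fixed vocabulary (both hypothetical simplifications).

def stem(token):
    """Naive suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "tion", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def vectorize(sentences, vocabulary):
    """Return bag-of-words count vectors over a fixed vocabulary."""
    vectors = []
    for sentence in sentences:
        stems = [stem(t) for t in sentence.lower().split()]
        vectors.append([stems.count(term) for term in vocabulary])
    return vectors

sentences = ["Obligations of the receiving party", "The party receiving information"]
vocabulary = ["obligation", "receiv", "party"]
print(vectorize(sentences, vocabulary))
```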
[0007] In one or more embodiments, the clause classification model is based on one or more machine learning techniques selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, and Transfer Learning using transformer-based models.
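For illustration, a minimal multinomial Naive Bayes classifier, one of the techniques named above, can be sketched as follows. The training clauses and category labels are hypothetical, and an actual implementation might equally use an SVM, LSTM, or transformer model as enumerated in this embodiment.

```python
# Minimal multinomial Naive Bayes clause classifier with Laplace smoothing.
# Training samples and labels are hypothetical illustrations only.
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: iterable of (clause_text, category) pairs."""
    word_counts, priors = defaultdict(Counter), Counter()
    for text, label in samples:
        priors[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, priors

def predict(word_counts, priors, text):
    vocab = {w for counter in word_counts.values() for w in counter}
    total_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in priors.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)
        for word in text.lower().split():
            # Laplace-smoothed per-class word likelihood
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [
    ("confidential information shall not be disclosed", "confidentiality"),
    ("the receiving party shall keep information confidential", "confidentiality"),
    ("this agreement shall be governed by the laws of india", "governing_law"),
    ("disputes shall be resolved under the laws of england", "governing_law"),
]
word_counts, priors = train(samples)
print(predict(word_counts, priors, "all confidential material shall not be disclosed"))
```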
[0008] In one or more embodiments, the method further comprises implementing a co-reference resolution model for resolving co-references in the extracted clauses, wherein the co-reference resolution model is trained using one or more of rule-based modelling techniques, mention-based modelling techniques, and clustering-based modelling techniques.
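One narrow rule-based case of co-reference resolution, normalizing a defined-term back-reference (e.g. "the Company") to the canonical party name, may be sketched as below. The alias table is hypothetical, and the mention-based and clustering-based techniques named in this embodiment would handle far more general co-references.

```python
# Rule-based sketch of one co-reference case: substituting each defined-term
# alias with its canonical party name (hypothetical alias table).
import re

def resolve(clauses, aliases):
    """Replace every alias (e.g. 'the Company') with its canonical party name."""
    resolved = []
    for clause in clauses:
        for alias, canonical in aliases.items():
            clause = re.sub(re.escape(alias), canonical, clause, flags=re.IGNORECASE)
        resolved.append(clause)
    return resolved

clauses = ["The Disclosing Party shall notify the Company within 30 days."]
print(resolve(clauses, {"the company": "Acme Corp."}))
```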
[0009] In one or more embodiments, the method further comprises implementing a Named Entity Recognition (NER) model to extract entities in the clauses of the contract document, wherein the NER model is trained by implementing one or more machine learning techniques, selected from conditional random fields (CRFs), Bidirectional LSTM (BiLSTM), Convolutional Neural Network (CNN), Embeddings from Language Models (ELMo), Stanford Natural Language Processing (NLP), GPT-3, and GPT-4.
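A pattern-based stand-in for the NER step can illustrate the kind of entities a risk rule might check for; the entity labels and regular expressions below are hypothetical, and the trained models enumerated in this embodiment would replace them in practice.

```python
# Pattern-based stand-in for NER: tag durations and governing-law
# jurisdictions in a clause (hypothetical labels and patterns).
import re

ENTITY_PATTERNS = {
    "DURATION": r"\b\d+\s+(?:day|month|year)s?\b",
    "JURISDICTION": r"\blaws of ([A-Z][a-z]+)\b",
}

def extract_entities(clause):
    """Return (label, matched_text) pairs for every pattern hit."""
    entities = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in re.finditer(pattern, clause):
            entities.append((label, match.group(0)))
    return entities

clause = "This NDA remains in force for 5 years and is governed by the laws of India."
print(extract_entities(clause))
```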
[0010] In one or more embodiments, the method further comprises implementing a clause correctness classification model for predicting correctness of clauses of the contract document, wherein the clause correctness classification model is trained by implementing one or more machine learning techniques selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, and Transfer Learning using transformer-based models.
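An unsupervised variant of correctness checking (cf. the unsupervised training method of FIG. 13) can be sketched as scoring a drafted clause by its cosine similarity to a reference clause and flagging low similarity as potentially incorrect; the reference text and threshold below are hypothetical.

```python
# Unsupervised sketch of clause-correctness checking via bag-of-words cosine
# similarity to a reference clause (hypothetical reference and threshold).
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def correctness(clause, reference, threshold=0.5):
    """True if the clause is close enough to the reference wording."""
    sim = cosine(Counter(clause.lower().split()), Counter(reference.lower().split()))
    return sim >= threshold

reference = "the receiving party shall keep all confidential information secret"
print(correctness("the receiving party shall keep confidential information secret", reference))
```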
[0011] In one or more embodiments, the result of the risk analysis is output in the form of a traffic light report marking the risk of each clause in the contract document in different colours, including RED for high-risk clauses, YELLOW for medium-risk clauses, and GREEN for low-risk clauses.
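The traffic-light mapping described above can be sketched as a simple rendering step; the clause categories and risk levels fed in here are hypothetical outputs of the preceding analysis.

```python
# Sketch of the traffic-light report: map each analysed clause's risk level
# to a colour band (inputs are hypothetical analysis results).
RISK_TO_COLOUR = {"high": "RED", "medium": "YELLOW", "low": "GREEN"}

def traffic_light_report(analysed_clauses):
    """analysed_clauses: list of (clause_category, risk_level) pairs."""
    return [
        f"[{RISK_TO_COLOUR[risk]}] {category}"
        for category, risk in analysed_clauses
    ]

for line in traffic_light_report(
    [("indemnity", "high"), ("confidentiality", "medium"), ("term", "low")]
):
    print(line)
```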
[0012] In another aspect, a system for performing risk analysis for contract documents is disclosed. The system comprises an interface configured to receive a contract document to be processed. The system further comprises a processing unit configured to: pre-process the contract document to extract text therefrom; implement a clause classification model to extract and classify clauses from the text of the contract document into one or more predefined categories; and perform risk analysis of each of the clauses to determine risk indicators based on the presence or absence of one or more clauses and/or relevant entities in the clauses, and on predefined rules associated with the one or more predefined categories.
[0013] In one or more embodiments, the clause classification model is trained on text from a plurality of sample contract documents by preparing data from the text using one or more data processing techniques selected from stemming, lemmatization, vectorization, and Parts-Of-Speech (POS) tagging. Further, the clause classification model is based on one or more machine learning techniques selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, and Transfer Learning using transformer-based models.
[0014] In one or more embodiments, the processing unit is further configured to implement a co-reference resolution model for resolving co-references in the extracted clauses, wherein the co-reference resolution model is trained using one or more of rule-based modelling techniques, mention-based modelling techniques, and clustering-based modelling techniques.
[0015] In one or more embodiments, the processing unit is further configured to implement a Named Entity Recognition (NER) model to extract entities in the clauses of the contract document, wherein the NER model is trained by implementing one or more machine learning techniques, selected from conditional random fields (CRFs), Bidirectional LSTM (BiLSTM), Convolutional Neural Network (CNN), Embeddings from Language Models (ELMo), Stanford Natural Language Processing (NLP), GPT-3, and GPT-4.
[0016] In one or more embodiments, the processing unit is further configured to implement a clause correctness classification model for predicting correctness of clauses of the contract document, wherein the clause correctness classification model is trained by implementing one or more machine learning techniques selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, and Transfer Learning using transformer-based models.
[0017] In one or more embodiments, the result of the risk analysis is output in the form of a traffic light report marking the risk of each clause in the contract document in different colours, including RED for high-risk clauses, YELLOW for medium-risk clauses, and GREEN for low-risk clauses.
[0018] In yet another aspect, an apparatus comprising a server, a user device and a computer program stored in a memory is disclosed, with the computer program being configured together with the server and the user device to control the apparatus to perform the method as described above.
[0019] In still another aspect, a computer program comprising computer executable program code is disclosed, which when executed controls a computer to perform the method as described above.
BRIEF DESCRIPTION OF THE FIGURES
[0020] For a more complete understanding of example embodiments of the present disclosure, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
[0021] FIG. 1 illustrates a system that may reside on and may be executed by a computer, which may be connected to a network, in accordance with one or more exemplary embodiments of the present disclosure;
[0022] FIG. 2 illustrates a diagrammatic view of a server, in accordance with one or more exemplary embodiments of the present disclosure;
[0023] FIG. 3 illustrates a diagrammatic view of a user device, in accordance with one or more exemplary embodiments of the present disclosure;
[0024] FIG. 4 illustrates an exemplary flow diagram providing an overall process workflow for implementation of the present system for performing risk analysis for contract documents, in accordance with one or more exemplary embodiments of the present disclosure;
[0025] FIG. 5 illustrates a schematic of a modelling pipeline for generating a clause classification model as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0026] FIG. 6 illustrates a schematic of an inference pipeline for implementing the generated clause classification model, as per the modelling pipeline of FIG. 5, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0027] FIG. 7 illustrates a schematic of a modelling pipeline for generating a co-reference resolution model as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0028] FIG. 8 illustrates a schematic of an inference pipeline for implementing the generated co-reference resolution model, as per the modelling pipeline of FIG. 7, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0029] FIG. 9 illustrates a schematic of a modelling pipeline for generating an NER model as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0030] FIG. 10 illustrates a schematic of an inference pipeline for implementing the generated NER model, as per the modelling pipeline of FIG. 9, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0031] FIG. 11 illustrates a schematic of a modelling pipeline for generating a clause correctness classification model using a supervised training method as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0032] FIG. 12 illustrates a schematic of an inference pipeline for implementing the generated clause correctness classification model, as per the modelling pipeline of FIG. 11, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0033] FIG. 13 illustrates a schematic of a modelling pipeline for generating a clause correctness classification model using an unsupervised training method as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0034] FIG. 14 illustrates a schematic of an inference pipeline for implementing the generated clause correctness classification model, as per the modelling pipeline of FIG. 13, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0035] FIG. 15 illustrates a schematic of a continuous training framework for the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
[0036] FIGS. 16A-16F provide a representative processing workflow for some of steps involved in performing risk analysis for contract documents for a first exemplary text, in accordance with one or more exemplary embodiments of the present disclosure; and
[0037] FIGS. 17A-17E provide a representative processing workflow for some of steps involved in performing risk analysis for contract documents for a second exemplary text, in accordance with one or more exemplary embodiments of the present disclosure.
DETAILED DESCRIPTION
[0038] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure is not limited to these specific details.
[0039] Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
[0040] Furthermore, in the following detailed description of the present disclosure, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present disclosure.
[0041] Embodiments described herein may be discussed in the general context of computer-executable instructions residing on some form of computer-readable storage medium, such as program modules, executed by one or more computers or other devices. By way of example, and not limitation, computer-readable storage media may comprise non-transitory computer-readable storage media and communication media; non-transitory computer-readable media include all computer-readable media except for a transitory, propagating signal. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or distributed as desired in various embodiments.
[0042] Some portions of the detailed description that follows are presented and discussed in terms of a process or method. Although steps and sequencing thereof are disclosed in figures herein describing the operations of this method, such steps and sequencing are exemplary. Embodiments are well suited to performing various other steps or variations of the steps recited in the flowchart of the figure herein, and in a sequence other than that depicted and described herein. Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as transactions, bits, values, elements, symbols, characters, samples, pixels, or the like.
[0043] In some implementations, any suitable computer usable or computer readable medium (or media) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-usable, or computer-readable, storage medium (including a storage device associated with a computing device) may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable medium may include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a digital versatile disk (DVD), a static random access memory (SRAM), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, a media such as those supporting the internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be a suitable medium upon which the program is stored, scanned, compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of the present disclosure, a computer-usable or computer-readable, storage medium may be any tangible medium that can contain or store a program for use by or in connection with the instruction execution system, apparatus, or device.
[0044] In some implementations, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. In some implementations, such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. In some implementations, the computer readable program code may be transmitted using any appropriate medium, including but not limited to the internet, wireline, optical fibre cable, RF, etc. In some implementations, a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
[0045] In some implementations, computer program code for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. However, the computer program code for carrying out operations of the present disclosure may also be written in conventional procedural programming languages, such as the "C" programming language, PASCAL, or similar programming languages, as well as in scripting languages such as JavaScript, PERL, or Python. In the present implementations, the languages and frameworks used for training may include one or more of Python, TensorFlow, Bazel, C, or C++. Further, a decoder in the user device (as will be discussed) may use C, C++, or any processor-specific ISA. Furthermore, assembly code inside C/C++ may be utilized for specific operations. Also, an ASR (automatic speech recognition) engine and a G2P (grapheme-to-phoneme) decoder, along with the entire user system, can be run on embedded Linux (any distribution), Android, iOS, Windows, or the like, without any limitations. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the internet using an Internet Service Provider).
In some implementations, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs) or other hardware accelerators, micro-controller units (MCUs), or programmable logic arrays (PLAs) may execute the computer readable program instructions/code by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0046] In some implementations, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus (systems), methods and computer program products according to various implementations of the present disclosure. Each block in the flowchart and/or block diagrams, and combinations of blocks in the flowchart and/or block diagrams, may represent a module, segment, or portion of code, which comprises one or more executable computer program instructions for implementing the specified logical function(s)/act(s). These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer program instructions, which may execute via the processor of the computer or other programmable data processing apparatus, create the ability to implement one or more of the functions/acts specified in the flowchart and/or block diagram block or blocks or combinations thereof. It should be noted that, in some implementations, the functions noted in the block(s) may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
[0047] In some implementations, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0048] In some implementations, the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed (not necessarily in a particular order) on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts (not necessarily in a particular order) specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0049] Referring to example implementation of FIG. 1, there is shown a system 100 that may reside on and may be executed by a computer (e.g., computer 12), which may be connected to a network (e.g., network 14) (e.g., the internet or a local area network). Examples of computer 12 may include, but are not limited to, a personal computer(s), a laptop computer(s), mobile computing device(s), a server computer, a series of server computers, a mainframe computer(s), or a computing cloud(s). In some implementations, each of the aforementioned may be generally described as a computing device. In certain implementations, a computing device may be a physical or virtual device. In many implementations, a computing device may be any device capable of performing operations, such as a dedicated processor, a portion of a processor, a virtual processor, a portion of a virtual processor, portion of a virtual device, or a virtual device. In some implementations, a processor may be a physical processor or a virtual processor. In some implementations, a virtual processor may correspond to one or more parts of one or more physical processors. In some implementations, the instructions/logic may be distributed and executed across one or more processors, virtual or physical, to execute the instructions/logic. Computer 12 may execute an operating system, for example, but not limited to, Microsoft Windows; Mac OS X; Red Hat Linux, or a custom operating system. (Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States, other countries or both; Mac and OS X are registered trademarks of Apple Inc. in the United States, other countries or both; Red Hat is a registered trademark of Red Hat Corporation in the United States, other countries or both; and Linux is a registered trademark of Linus Torvalds in the United States, other countries or both).
[0050] In some implementations, the instruction sets and subroutines of system 100, which may be stored on storage device, such as storage device 16, coupled to computer 12, may be executed by one or more processors (not shown) and one or more memory architectures included within computer 12. In some implementations, storage device 16 may include but is not limited to: a hard disk drive; a flash drive, a tape drive; an optical drive; a RAID array (or other array); a random-access memory (RAM); and a read-only memory (ROM).
[0051] In some implementations, network 14 may be connected to one or more secondary networks (e.g., network 18), examples of which may include but are not limited to: a local area network; a wide area network; or an intranet, for example.
[0052] In some implementations, computer 12 may include a data store, such as a database (e.g., relational database, object-oriented database, triplestore database, etc.) and may be located within any suitable memory location, such as storage device 16 coupled to computer 12. In some implementations, data, metadata, information, etc. described throughout the present disclosure may be stored in the data store. In some implementations, computer 12 may utilize any known database management system such as, but not limited to, DB2, in order to provide multi-user access to one or more databases, such as the above noted relational database. In some implementations, the data store may also be a custom database, such as, for example, a flat file database or an XML database. In some implementations, any other form(s) of a data storage structure and/or organization may also be used. In some implementations, system 100 may be a component of the data store, a standalone application that interfaces with the above noted data store and/or an applet / application that is accessed via client applications 22, 24, 26, 28. In some implementations, the above noted data store may be, in whole or in part, distributed in a cloud computing topology. In this way, computer 12 and storage device 16 may refer to multiple devices, which may also be distributed throughout the network.
[0053] In some implementations, computer 12 may execute application 20 for performing risk analysis for contract documents (as discussed later in more detail). In some implementations, system 100 and/or application 20 may be accessed via one or more of client applications 22, 24, 26, 28. In some implementations, system 100 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within application 20, a component of application 20, and/or one or more of client applications 22, 24, 26, 28. In some implementations, application 20 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within system 100, a component of system 100, and/or one or more of client applications 22, 24, 26, 28. In some implementations, one or more of client applications 22, 24, 26, 28 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within and/or be a component of system 100 and/or application 20. Examples of client applications 22, 24, 26, 28 may include, but are not limited to, a standard and/or mobile web browser, an email application (e.g., an email client application), a textual and/or a graphical user interface, a customized web browser, a plugin, an Application Programming Interface (API), or a custom application. The instruction sets and subroutines of client applications 22, 24, 26, 28, which may be stored on storage devices 30, 32, 34, 36, coupled to user devices 38, 40, 42, 44, may be executed by one or more processors and one or more memory architectures incorporated into user devices 38, 40, 42, 44.
[0054] In some implementations, one or more of storage devices 30, 32, 34, 36, may include but are not limited to: hard disk drives; flash drives; tape drives; optical drives; RAID arrays; random access memories (RAM); and read-only memories (ROM). Examples of user devices 38, 40, 42, 44 (and/or computer 12) may include, but are not limited to, a personal computer (e.g., user device 38), a laptop computer (e.g., user device 40), a smart/data-enabled cellular phone (e.g., user device 42), a notebook computer (e.g., user device 44), a tablet (not shown), a server (not shown), a television (not shown), a smart television (not shown), a media (e.g., video, photo, etc.) capturing device (not shown), and a dedicated network device (not shown). User devices 38, 40, 42, 44 may each execute an operating system, examples of which may include but are not limited to, Android, Apple iOS, Mac OS X, Red Hat Linux, or a custom operating system.
[0055] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of system 100 (and vice versa). Accordingly, in some implementations, system 100 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or system 100.
[0056] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of application 20 (and vice versa). Accordingly, in some implementations, application 20 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or application 20. As one or more of client applications 22, 24, 26, 28, system 100, and application 20, taken singly or in any combination, may effectuate some or all of the same functionality, any description of effectuating such functionality via one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof, and any described interaction(s) between one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof to effectuate such functionality, should be taken as an example only and not to limit the scope of the disclosure.
[0057] In some implementations, one or more of users 46, 48, 50, 52 may access computer 12 and system 100 (e.g., using one or more of user devices 38, 40, 42, 44) directly through network 14 or through secondary network 18. Further, computer 12 may be connected to network 14 through secondary network 18, as illustrated with phantom link line 54. System 100 may include one or more user interfaces, such as browsers and textual or graphical user interfaces, through which users 46, 48, 50, 52 may access system 100.
[0058] In some implementations, the various user devices may be directly or indirectly coupled to a communication network, such as communication network 14 or communication network 18, hereinafter simply referred to as network 14 and network 18, respectively. For example, user device 38 is shown directly coupled to network 14 via a hardwired network connection. Further, user device 44 is shown directly coupled to network 18 via a hardwired network connection. User device 40 is shown wirelessly coupled to network 14 via wireless communication channel 56 established between user device 40 and wireless access point (i.e., WAP) 58, which is shown directly coupled to network 14. WAP 58 may be, for example, an IEEE 802.11a, 802.11b, 802.11g, Wi-Fi, RFID, and/or Bluetooth (including Bluetooth Low Energy) device that is capable of establishing wireless communication channel 56 between user device 40 and WAP 58. User device 42 is shown wirelessly coupled to network 14 via wireless communication channel 60 established between user device 42 and cellular network / bridge 62, which is shown directly coupled to network 14.
[0059] In some implementations, some or all of the IEEE 802.11x specifications may use Ethernet protocol and carrier sense multiple access with collision avoidance (i.e., CSMA/CA) for path sharing. The various 802.11x specifications may use phase-shift keying (i.e., PSK) modulation or complementary code keying (i.e., CCK) modulation, for example. Bluetooth (including Bluetooth Low Energy) is a telecommunications industry specification that allows, e.g., mobile phones, computers, smart phones, and other electronic devices to be interconnected using a short-range wireless connection. Other forms of interconnection (e.g., Near Field Communication (NFC)) may also be used.
[0060] The system 100 may include a server (such as server 200, as shown in FIG. 2) for performing risk analysis for contract documents (as will be described later in more detail). Herein, FIG. 2 is a block diagram of an example of the server 200 capable of implementing embodiments according to the present disclosure. In one embodiment, an application server as described herein may be implemented on exemplary server 200. In the example of FIG. 2, the server 200 includes a processing unit 205 (sometimes, referred to as CPU 205) for running software applications (such as, the application 20 of FIG. 1) and optionally an operating system. As illustrated, the server 200 further includes a database 210 (hereinafter, referred to as memory 210) which stores applications and data for use by the processing unit 205. Storage 215 provides non-volatile storage for applications and data and may include fixed disk drives, removable disk drives, flash memory devices, and CD-ROM, DVD-ROM or other optical storage devices. An optional user input device 220 includes devices that communicate user inputs from one or more users to the server 200 and may include keyboards, mice, joysticks, touch screens, etc. A communication or network interface 225 is provided which allows the server 200 to communicate with other computer systems via an electronic communications network, including wired and/or wireless communication and including an Intranet or the Internet. In one embodiment, the server 200 receives instructions and user inputs from a remote computer through communication interface 225. Communication interface 225 can comprise a transmitter and receiver for communicating with remote devices. An optional display device 250 may be provided which can be any device capable of displaying visual information in response to a signal from the server 200. 
The components of the server 200, including the processing unit 205, memory 210, data storage 215, user input devices 220, communication interface 225, and the display device 250, may be coupled via one or more data buses 260.
[0061] In the embodiment of FIG. 2, a graphics system 230 may be coupled with the data bus 260 and the components of the server 200. The graphics system 230 may include a physical graphics processing unit (GPU) 235 and graphics memory. The GPU 235 generates pixel data for output images from rendering commands. The physical GPU 235 can be configured as multiple virtual GPUs that may be used in parallel (concurrently) by a number of applications or processes executing in parallel. For example, mass scaling processes for rigid bodies or a variety of constraint solving processes may be run in parallel on the multiple virtual GPUs. Graphics memory may include a display memory 240 (e.g., a framebuffer) used for storing pixel data for each pixel of an output image. In another embodiment, the display memory 240 and/or additional memory 245 may be part of the memory 210 and may be shared with the processing unit 205. Alternatively, the display memory 240 and/or additional memory 245 can be one or more separate memories provided for the exclusive use of the graphics system 230. In another embodiment, the graphics system 230 includes one or more additional physical GPUs 255, similar to the GPU 235. Each additional GPU 255 may be adapted to operate in parallel with the GPU 235. Each additional GPU 255 generates pixel data for output images from rendering commands. Each additional physical GPU 255 can be configured as multiple virtual GPUs that may be used in parallel (concurrently) by a number of applications or processes executing in parallel, e.g., processes that solve constraints. Each additional GPU 255 can operate in conjunction with the GPU 235, for example, to simultaneously generate pixel data for different portions of an output image, or to simultaneously generate pixel data for different output images. 
Each additional GPU 255 can be located on the same circuit board as the GPU 235, sharing a connection with the GPU 235 to the data bus 260, or each additional GPU 255 can be located on another circuit board separately coupled with the data bus 260. Each additional GPU 255 can also be integrated into the same module or chip package as the GPU 235. Each additional GPU 255 can have additional memory, similar to the display memory 240 and additional memory 245, or can share the memories 240 and 245 with the GPU 235. It is to be understood that the circuits and/or functionality of GPU as described herein could also be implemented in other types of processors, such as general-purpose or other special-purpose coprocessors, or within a CPU.
[0062] The system 100 may also include a user device 300 (as shown in FIG. 3). In embodiments of the present disclosure, the user device 300 may embody a smartphone, a personal computer, a tablet, or the like. Herein, FIG. 3 is a block diagram of an example of the user device 300 capable of implementing embodiments according to the present disclosure. In the example of FIG. 3, the user device 300 includes a processing unit 305 (hereinafter, referred to as CPU 305) for running software applications (such as, the application 20 of FIG. 1) and optionally an operating system. A user input device 320 is provided which includes devices that communicate user inputs from one or more users and may include keyboards, mice, joysticks, touch screens, and/or microphones. Further, a network interface 325 is provided which allows the user device 300 to communicate with other computer systems (e.g., the server 200 of FIG. 2) via an electronic communications network, including wired and/or wireless communication and including the Internet. The user device 300 may also include a decoder 355, which may be any device capable of decoding (decompressing) data that may be encoded (compressed). A display device 350 may be provided which may be any device capable of displaying visual information, including information received from the decoder 355. In particular, as will be described below, the display device 350 may be used to display visual information received from the server 200 of FIG. 2. The components of the user device 300 may be coupled via one or more data buses 360.
[0063] It may be seen that compared to the server 200 in the example of FIG. 2, the user device 300 in the example of FIG. 3 may have fewer components and less functionality. However, the user device 300 may include other components, for example, in addition to those described above. In general, the user device 300 may be any type of device that has one or more of display capability and the capability to receive inputs from a user and send such inputs to the server 200. However, it may be appreciated that the user device 300 may have additional capabilities beyond those just mentioned.
[0064] Referring to FIG. 4, illustrated is an exemplary flow diagram of a method (as represented by reference numeral 400) providing an overall process workflow (with the two terms “method” and “process workflow” being interchangeably used hereinafter) for performing risk analysis for contract documents. Herein, the overall process workflow 400 is implemented by the present system 100, as described. As shown, the process workflow 400 includes a step 402 for receiving the contract document(s) to be processed. For the purposes of the present disclosure, the system 100 may provide an interface (as described) for a user to upload the contract document(s) to be processed. Herein, the interface may be provided via the user device 300 (as described). The interface may allow the user to upload the said contract document(s) directly from the user device, or from a cloud platform, like Google Drive, OneDrive, Dropbox, etc., by implementing the corresponding APIs (as known in the art) without any limitations. In other examples, the system 100 may be configured to automatically fetch the contract document(s) from a data repository of the user (like an enterprise client), given the access thereto, to directly process each newly added contract document for performing risk analysis. Hereinafter, the steps of the process workflow 400 may be understood to be executed by the processing unit 205 as described without any limitations.
[0065] Further, as shown, the process workflow 400 includes a step 404 to perform initial pre-processing of the received contract document(s) to make them usable for further processing as per embodiments of the present disclosure. In the present implementations, the contract document(s) as received may be converted to a suitable format, such as, but not limited to, PDF for further processing. Herein, the received contract document(s) may be in the form of scanned images with machine-unreadable text. In such cases, the received contract document(s) may be pre-processed using Optical Character Recognition (OCR) techniques to convert the text in the contract document(s) to machine-readable form for further processing. In an example, the machine-readable text from each contract document is extracted into a separate file (such as, a text file) in order to be further processed for risk analysis. Such a pre-processing step may utilize available resources, such as, but not limited to, Python libraries, AWS Textract, Azure Computer Vision, etc.
[0066] Now, as discussed, a given contract document may include a plurality of clauses. For performing the risk analysis for the given contract document, it may be required to analyse each clause therein separately (whether independently of, or in conjunction with, other clauses) to determine risk indicators, which in turn may be based on the presence or absence of clauses (based on the contract type), and/or the presence or absence of relevant entities in the clauses, etc. (as discussed later in more detail). For this purpose, first the clauses need to be classified to allow for corresponding analysis thereof. It may be appreciated by a person skilled in the art that the clauses may be classified under certain predefined categories, such as clause(s) related to: (i) Parties, (ii) Purpose, (iii) Confidential Information, (iv) Recipient’s Treatment of Confidential Information, (v) Tangible Confidential Information, (vi) Exceptions to Confidential Information, (vii) Information that was available in the public domain, (viii) Information that is obtained other than through a breach of confidentiality, (ix) Information disclosure compelled by legal process, (x) Information that was developed independently, (xi) Term, (xii) No License, (xiii) Governing Law, (xiv) Equitable Relief, (xv) Entire Agreement, (xvi) No Assignment, (xvii) Severability, (xviii) Notices, (xix) No Implied Waiver, (xx) Headings and Interpretation, etc. There may be different risks associated with each of the clause categories, and thus each of such clause types may need to be separately analysed; and for that purpose, the plurality of clauses in the given contract document may first need to be classified as per the clause category/type.
[0067] As shown, the process workflow 400 includes a step 406 for extracting and classifying clauses from the text of the given contract document. The present system 100 may implement a clause classification model (which may be part of the machine learning model, and hereinafter sometimes simply referred to as “classification model” without any limitations) for said purpose of extracting and classifying clauses from the text of the given contract document. Herein, the implemented machine learning model may first be trained, and then may be utilized for providing inferences. A simple model using a limited number of clauses per type may first be trained, and then be used to label clauses using a much bigger dataset. The labelled clauses may then be reviewed by experts and corrected. The corrected labelled dataset may then be used to develop the final model. The clause classification model may utilize one or more machine learning techniques for its implementation. In the present examples, the clause classification model may be based on one or more machine learning techniques, selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformers based models, such as BERT, GPT3, GPT4, etc., and the like. Such algorithmic extraction may use techniques like splitting paragraphs from the text, identifying keywords from a list of predefined keywords using phrase matching in the spaCy library, automatic labelling of the text, and the like. Implementation of the clause classification model may help to reduce manual labelling work.
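By way of a non-limiting illustration only, one of the listed candidate techniques (a multinomial Naive Bayes classifier over bag-of-words features) may be sketched in plain Python as follows. The training clauses and the two category labels below are hypothetical and far smaller than any practical training dataset; this is a sketch of the general technique, not the disclosure's actual model.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]

class NaiveBayesClauseClassifier:
    """Multinomial Naive Bayes over bag-of-words features, with add-one smoothing."""

    def fit(self, clauses, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        self.vocab = set()
        for text, label in zip(clauses, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label in self.class_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.class_counts[label] / total_docs)
            total_tokens = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                score += math.log(
                    (self.word_counts[label][tok] + 1)
                    / (total_tokens + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training examples for two of the clause categories.
train = [
    ("This Agreement shall be governed by the laws of the State of Delaware.", "Governing Law"),
    ("The governing law of this Agreement is the law of England and Wales.", "Governing Law"),
    ("Recipient shall hold all Confidential Information in strict confidence.", "Confidential Information"),
    ("Confidential Information means any non-public information disclosed by a party.", "Confidential Information"),
]
clf = NaiveBayesClauseClassifier().fit([t for t, _ in train], [lbl for _, lbl in train])
print(clf.predict("This Agreement is governed by the laws of New York."))  # → Governing Law
```

In practice, any of the other listed techniques (SVM, CNNs, LSTMs, transformer-based transfer learning) could be substituted behind the same fit/predict interface.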
[0068] Referring to FIG. 5, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 500) for generating a clause classification model, as per embodiments of the present disclosure. Herein, the clause classification model is trained on text from a plurality of sample contract documents. As shown, in the modelling pipeline 500, at step 502, the extracted text from the contract document(s) is first inputted. At step 504, the modelling pipeline 500 involves data preparation for extracting clauses. Herein, the data preparation includes manual/algorithmic extraction of clauses. Herein, the manual extraction may involve legal professional(s) manually separating each clause from each of the sample contract documents in a training dataset. The algorithmic extraction may involve implementing available algorithm(s) to automate the data preparation task. At step 506, the modelling pipeline 500 involves data labelling for different clause types. This step 506 of data labelling may be performed by legal professionals (expert annotators). In the present implementation, the data labelling may result in a training list of various clauses, which may be provided as a sample dataset of correct clauses (as may be reviewed by legal professional(s)), and/or a sample dataset including a mix of correct clauses (labelled as such) and incorrect clauses (labelled as such). At step 508, the modelling pipeline 500 involves data cleaning. Herein, the data cleaning may include text processing to prepare complete sentences from text, stop-word removal, removal of punctuations/numbers, etc. At step 510, the modelling pipeline 500 involves data preparation. Herein, the data preparation may include data normalization, including data stemming (a crude heuristic process that chops off word endings) and/or data lemmatization (which uses a vocabulary and morphological analysis of words to return the proper base form).
The data preparation may also include data vectorization, which extracts distinct features from the text for the model to train on. In the present examples, the data vectorization may be achieved by using one or more techniques, such as TFIDF (term frequency–inverse document frequency), frequency vectorization, word embeddings, word encoding using transformer architectures, and the like, as known in the art. The data preparation may further include POS (Parts-Of-Speech) tagging for different languages and retention of specific types of POS for training. Such techniques for the data preparation may be contemplated by a person skilled in the art, and thus have not been described in detail herein for the brevity of the present disclosure. In more advanced deep learning methods, word embeddings may be used to represent text. These word embeddings may come from custom/pretrained models like BERT, GPT3, GPT4, etc., or they may be trained from scratch using contract documents.
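As a non-limiting illustration of the TFIDF vectorization mentioned above, a minimal plain-Python sketch (using a smoothed inverse document frequency) might look as follows; the two toy clause texts are hypothetical, and practical implementations would typically rely on an existing library:

```python
import math
from collections import Counter

def tfidf_vectorize(documents):
    """Compute TF-IDF vectors (smoothed idf) for a list of raw text documents."""
    docs = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] / len(doc) * idf[term] for term in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectorize([
    "confidential information shall remain confidential",
    "the governing law is delaware law",
])
```

Terms that appear in only one document receive a higher idf weight, so distinctive clause vocabulary dominates the resulting feature vectors.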
[0069] Further, at step 512, the modelling pipeline 500 involves using the data (as a result of the above steps) for training of the clause classification model. At step 514, the modelling pipeline 500 may further involve checking a performance of the implemented (trained) clause classification model. This may be achieved by feeding the clause classification model with test text, and evaluating the output(s) therefrom manually by legal professionals. If the results are satisfactory (as represented by block 516), the clause classification model may be considered for deployment. In the present examples, the clause classification model may be deployed using cloud platforms as either server-based or serverless architectures, without any limitations. If the results are not satisfactory, the modelling pipeline 500 may move to step 518 (as shown), which involves further training of the clause classification model using supplemental data generated by using data augmentation techniques. Herein, the data augmentation may include adding new clauses manually and/or programmatically, generating new clauses using transformers based models such as BERT, RoBERTa, GPT3, GPT4, and the like. The supplemental data generated by data augmentation (in the step 518) may be processed using data preparation techniques (as discussed in the step 510), to be further fed to the clause classification model for its training as discussed in the step 512 of the modelling pipeline 500.
[0070] Next, referring to FIG. 6, illustrated is a schematic of an inference pipeline (as represented by reference numeral 600) for implementing the clause classification model. As shown, in the inference pipeline 600, at step 602, the contract document(s), which may be in the form of a PDF file, DOCX/DOC file, etc., is first inputted. At step 604, the inference pipeline 600 involves extracting text from the contract document(s) by using OCR techniques, i.e., converting the text into machine-readable form. At step 606, the inference pipeline 600 involves data cleaning. Herein, the data cleaning may include text processing to split the extracted text into sentences and also, if required, to prepare complete sentences from the extracted text. The data cleaning may further include removal of punctuations and stop-words from the extracted text. At step 608, the inference pipeline 600 involves data preparation. Herein, the data preparation may include data normalization, including data stemming and/or data lemmatization; data vectorization using one or more techniques, such as TFIDF (term frequency–inverse document frequency), frequency vectorization, word embeddings, word encoding using transformer architectures, and the like; POS tagging; etc. (as described in the preceding paragraphs with reference to the step 510 of the modelling pipeline 500 of FIG. 5).
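For illustration only, the data cleaning of the kind described at the step 606 (sentence splitting, punctuation removal, and stop-word removal) may be sketched in plain Python as follows; the stop-word list is an abbreviated, hypothetical one, and the sample text is likewise hypothetical:

```python
import re
import string

# Abbreviated, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "or", "by"}

def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(sentence):
    # Remove punctuation, lowercase, and drop stop-words.
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.translate(table).lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

text = ("The Recipient shall keep the Confidential Information secret. "
        "This Agreement is governed by the laws of Delaware.")
sentences = split_sentences(text)
cleaned = [clean_sentence(s) for s in sentences]
```

Production cleaning would need to handle abbreviations, numbered headings, and OCR artefacts, which this naive splitter does not.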
[0071] Further, at step 610, the inference pipeline 600 involves using the data (as a result of the above steps) to be fed to the clause classification model (as trained using the modelling pipeline 500 of FIG. 5). At step 612 of the inference pipeline 600, the clause classification model provides extracted clauses which may be classified as per the clause categories/types to allow for corresponding analysis thereof (as described). Herein, the clause classification model may categorize each sentence into a particular clause type. Specifically, the clause classification model may assign each sentence a probability score for each clause type. Since the model calculates a probability for every clause type for each sentence, the clause categorization (best clause type) corresponds to the clause type with the maximum probability score. Further, in the present embodiments, to take care of false positives, a threshold is defined for the probability score to determine if a given sentence belongs to any of the defined clause types. That is, the given sentence may only be categorized into one of the clause types if the corresponding probability score is above the given threshold; otherwise, the given sentence may be classified as none. Further, in the step 612 of the inference pipeline 600, the sentences of the same clause type are then aggregated into a single paragraph to get the final list of extracted clauses from the text. It may be appreciated that additional rules may be implemented to remove false positives as per the requirements. In alternate methods, the text can be split into paragraphs and each paragraph may then be used to predict the clause type using the classification model, and further a threshold may be used to reduce false positives.
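The thresholding and aggregation logic of the step 612 may be sketched, by way of simplified illustration, as follows; the per-sentence probability scores below are hypothetical stand-ins for the output of an actual classification model:

```python
def assign_clause_type(probabilities, threshold=0.5):
    """Return the clause type with the maximum probability score, or
    None when the best score falls below the threshold (to curb false positives)."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else None

def aggregate_clauses(sentences, predictions):
    """Aggregate sentences of the same clause type into a single paragraph."""
    clauses = {}
    for sentence, clause_type in zip(sentences, predictions):
        if clause_type is not None:
            clauses.setdefault(clause_type, []).append(sentence)
    return {ctype: " ".join(parts) for ctype, parts in clauses.items()}

# Hypothetical per-sentence probability scores standing in for model output.
sentences = [
    "This Agreement is governed by the laws of Delaware.",
    "Disputes shall be resolved in the courts of Delaware.",
    "The parties may amend this Agreement in writing.",
]
scores = [
    {"Governing Law": 0.91, "Term": 0.05},
    {"Governing Law": 0.78, "Term": 0.10},
    {"Governing Law": 0.32, "Term": 0.41},  # best score below threshold → classified as none
]
predictions = [assign_clause_type(s) for s in scores]
clauses = aggregate_clauses(sentences, predictions)
```

The third sentence is classified as none because even its best score falls below the threshold, matching the false-positive handling described above.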
[0072] Referring back to FIG. 4, in the present embodiments, the process workflow 400 may further include implementing a co-reference resolution procedure (represented as step 408 in FIG. 4). Co-reference resolution is the task of finding all expressions that refer to the same entity in a text. In the present process workflow 400, the text is sanitized for possible co-references before entity extraction (as explained later in the description). Herein, the co-reference resolution procedure may be executed either before or after the clause classification procedure (as described in the step 612 of the inference pipeline 600 of FIG. 6).
[0073] Referring to FIG. 7, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 700) for generating a co-reference resolution model, as per embodiments of the present disclosure. As shown, in the modelling pipeline 700, at step 702, the contract document(s) or the extracted text is first inputted. At step 704, the modelling pipeline 700 involves building a training corpus by manually labelling the co-references in the text using annotation tools like Prodigy, Inception, Doccano, Brat, etc. At step 706, the modelling pipeline 700 involves training of the co-reference resolution model using one or more of rule-based modelling techniques, mention-based modelling techniques, and clustering-based modelling techniques, as known in the art. Further, referring to FIG. 8, illustrated is a schematic of an inference pipeline (as represented by reference numeral 800) for implementing the generated co-reference resolution model (as described in reference to the modelling pipeline 700 of FIG. 7). Herein, the inference pipeline 800 involves inputting extracted clauses from the clause classification model as described (as being represented by block 802 in FIG. 8). The inference pipeline 800 further involves implementing the co-reference resolution model as described for resolving co-references (as being represented by block 804 in FIG. 8). Herein, an output of the inference pipeline 800 is in the form of resolved clauses after resolution of possible co-references therein (as being represented by block 806 in FIG. 8).
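While a trained co-reference resolution model may use the mention-based or clustering-based techniques noted above, the basic substitution step can be illustrated with a deliberately simple rule-based sketch in plain Python. The alias table below is hypothetical; in practice it would be derived from the definitions section of the contract document, and real co-reference resolution is considerably more involved:

```python
import re

def resolve_coreferences(clause, aliases):
    """Replace alias mentions (e.g. defined short-form party names)
    with their canonical entity names using whole-word matching."""
    resolved = clause
    for alias, canonical in aliases.items():
        resolved = re.sub(rf"\b{re.escape(alias)}\b", canonical, resolved)
    return resolved

# Hypothetical alias table derived from the definitions in an NDA.
aliases = {"Recipient": "Acme Corp.", "Discloser": "Beta LLC"}
clause = "Recipient shall return all materials received from Discloser."
print(resolve_coreferences(clause, aliases))
# → Acme Corp. shall return all materials received from Beta LLC.
```

Sanitizing the text this way before entity extraction means the downstream NER step sees canonical party names rather than defined short forms or pronouns.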
[0074] Again, referring back to FIG. 4, in the present embodiments, the process workflow 400 may further include implementing a Named Entity Recognition (NER) model (represented as step 410 in FIG. 4). Named-entity recognition is a task of information extraction that seeks to locate and classify named entities mentioned in the extracted text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc., which, as may be contemplated, is required for any kind of analysis of a given contract document. In the present process workflow 400, the NER model is implemented to extract important details from the extracted relevant clauses. In particular, specific entities need to be extracted from specific clauses of interest. In the present embodiments, the process workflow 400 may utilize a combined NER model for a group of entities at a time, or separate NER models for each entity type, or a mix of both, without any limitations.
[0075] Referring to FIG. 9, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 900) for generating an NER model, as per embodiments of the present disclosure. As shown, at step 902, the modelling pipeline 900 involves using a training corpus of various clause types (as identified). In the present examples, the training corpus dataset may need a large number of documents, typically more than 1000 per entity, in order to deal with data imbalance. At step 904, the modelling pipeline 900 involves using an annotation tool, such as Prodigy, Inception, Doccano, Brat, etc., to come up with IOB labels (which are a format for chunks, and are similar to part-of-speech tags but can denote the inside, outside, and beginning of a chunk) for the used training corpus. At step 906, the modelling pipeline 900 involves generating the NER model using the annotated training corpus by implementing machine learning techniques including, but not limited to, CRFs (conditional random fields), BiLSTM (Bidirectional LSTM), CNN, ELMo (Embeddings from Language Models), Stanford NLP, or other transformer-based models (e.g., BERT, spaCy, GPT3, GPT4 or any other similar models), etc., without any limitations. Further, referring to FIG. 10, illustrated is a schematic of an inference pipeline (as represented by reference numeral 1000) for implementing the generated NER model (as described in reference to the modelling pipeline 900 of FIG. 9). Herein, the inference pipeline 1000 involves inputting extracted clauses from the clause classification model as described (as being represented by block 1002 in FIG. 10). The inference pipeline 1000 further involves implementing the NER model as described for recognizing entities in the text (as being represented by block 1004 in FIG. 10). In the present implementation, the text is pre-processed as per the requirement of the NER model. 
Herein, an output of the inference pipeline 1000 is in the form of extracted entity details from the clauses (as being represented by block 1006 in FIG. 10). As may be appreciated, the entity details may include information like parties, term, jurisdiction, effective date, type of NDA (mutual / one-sided), etc.
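For illustration only, entity extraction of this kind may be sketched with a simple rule-based (regular-expression) stand-in in plain Python; the patterns and the sample clause below are hypothetical, and a statistical NER model as described above would generalize far better than hand-written patterns:

```python
import re

# Hypothetical patterns for a few entity types of interest in an NDA.
ENTITY_PATTERNS = {
    "EFFECTIVE_DATE": r"effective (?:as of |date of )?(\w+ \d{1,2}, \d{4})",
    "TERM": r"term of (\w+ \(\d+\) years?)",
    "GOVERNING_LAW": r"laws? of (?:the State of )?([A-Z][a-z]+)",
}

def extract_entities(clause):
    """Return {entity_type: matched value} for every pattern that fires."""
    entities = {}
    for entity_type, pattern in ENTITY_PATTERNS.items():
        match = re.search(pattern, clause)
        if match:
            entities[entity_type] = match.group(1)
    return entities

clause = ("This Agreement is effective as of January 1, 2024, has a term of "
          "three (3) years, and is governed by the laws of the State of Delaware.")
entities = extract_entities(clause)
```

A combined model for a group of entities, or one model per entity type, would fill the same output structure: a mapping from entity type to extracted value.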
[0076] Yet again referring back to FIG. 4, in the present embodiments, the process workflow 400 includes performing clause correctness classification (represented as step 412 in FIG. 4), which is a precursor to performing risk analysis for the extracted clauses. In the present embodiments, as illustrated in FIG. 4, the clause-level correctness classification may be performed directly on the extracted clauses (as from the step 406), and the entity-level correctness classification may be performed on the extracted clauses with resolved co-references and recognized entities (as from the step 410). The present disclosure provides two methodologies for performing clause correctness classification, including a supervised method (as described in reference to FIGS. 11 and 12), and an unsupervised method (as described in reference to FIGS. 13 and 14).
[0077] Referring to FIG. 11, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 1100) for generating a clause correctness classification model using a supervised training method, as per embodiments of the present disclosure. As shown, in the modelling pipeline 1100, at step 1102, a sample set of contract documents is inputted for training purposes. At step 1104, the modelling pipeline 1100 involves data preparation using manual/algorithmic extraction of clauses from the said sample set of contract documents; this data preparation procedure has been described in the preceding paragraphs and thus is not repeated herein for the brevity of the present disclosure. At step 1106, the modelling pipeline 1100 involves data labelling for identifying different clause types by legal professionals. Again, this data labelling procedure has been described in the preceding paragraphs and thus is not repeated herein for the brevity of the present disclosure. At step 1108, the modelling pipeline 1100 involves data cleaning for text processing to prepare complete sentences from the extracted text, removal of punctuations/numerals, removal of stop-words, etc. Again, this data cleaning procedure has been described in the preceding paragraphs and thus is not repeated herein for the brevity of the present disclosure.
[0078] Further, at step 1110, the modelling pipeline 1100 involves feature generation, which includes extraction of features specific to various clauses. This feature generation procedure may be achieved by using techniques like Regex, Named Entities, Matcher, etc., as known in the art. At step 1112, the modelling pipeline 1100 involves data preparation after all required pre-processing, which may involve data normalization, data vectorization, POS tagging, etc. (as discussed). Further, at step 1114, the modelling pipeline 1100 involves training the clause correctness classification model for identifying clause correctness by using techniques like Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, Transfer Learning using transformer-based models, and the like (as also discussed in preceding paragraphs). As would be contemplated by a person skilled in the art, the given procedure for training of the clause correctness classification model in reference to the modelling pipeline 1100 of FIG. 11 may be categorized under a supervised learning method.
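As a minimal sketch of steps 1110-1114, assuming TF-IDF vectorization and logistic regression (one of the several listed options), a supervised clause correctness classifier might look as follows. The training clauses and the 0/1 labels are hypothetical stand-ins for professionally labelled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled data: 1 = correct clause, 0 = defective clause.
clauses = [
    "the receiving party shall keep the confidential information secret",
    "the receiving party shall not disclose confidential information to third parties",
    "confidential information may be shared with anyone at any time",
    "the receiving party may publish the confidential information freely",
]
labels = [1, 1, 0, 0]

# TF-IDF vectorization (step 1112) feeding a logistic-regression
# classifier (step 1114), chained as one pipeline.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(clauses, labels)

# Predict correctness (0 or 1) for an unseen, hypothetical clause.
pred = model.predict(["the receiving party shall not disclose the information"])[0]
```

The same fitted pipeline also serves at inference time, since `predict` re-applies the vectorizer before classifying.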
[0079] Further, referring to FIG. 12, illustrated is a schematic of an inference pipeline (as represented by reference numeral 1200) for implementing the clause correctness classification model generated as described in reference to the modelling pipeline 1100 of FIG. 11. Herein, the inference pipeline 1200 involves inputting extracted clauses for prediction of their correctness (as being represented by block 1202 in FIG. 12). The inference pipeline 1200 further involves implementing the co-reference resolution model as described for resolving co-references in the text (as being represented by block 1204 in FIG. 12). The inference pipeline 1200 further involves performing feature extraction and implementing the clause correctness classification model as described (as being represented by block 1206 in FIG. 12). Herein, an output of the inference pipeline 1200 is in the form of clause correctness predictions, i.e., a prediction in the form of a correctness score or the like for each clause indicative of its correctness (as being represented by block 1208 in FIG. 12).
[0080] Referring to FIG. 13, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 1300) for generating a clause correctness classification model using an unsupervised training method, as per embodiments of the present disclosure. As shown, in the modelling pipeline 1300, at step 1302, a sample set of contract documents with all clauses therein being correct (for example, based on manual checking by a legal professional) is inputted for training purposes. Further, the text in such sample set of contract documents may be pre-labelled and pre-cleaned. At step 1304, the modelling pipeline 1300 involves data preparation using manual/algorithmic extraction of clauses from the said sample set of contract documents; such data preparation procedure has been described in the preceding paragraphs and thus is not repeated herein for the brevity of the present disclosure. At step 1306, the modelling pipeline 1300 involves training the clause correctness classification model for identifying clause correctness by using techniques like Word2Vec, Doc2Vec, pretrained transformer models (as per the inputted clauses), and the like. As would be contemplated by a person skilled in the art, the given procedure for training of the clause correctness classification model in reference to the modelling pipeline 1300 of FIG. 13 may be categorized under an unsupervised learning method.
[0081] Further, referring to FIG. 14, illustrated is a schematic of an inference pipeline (as represented by reference numeral 1400) for implementing the clause correctness classification model generated as described in reference to the modelling pipeline 1300 of FIG. 13. Herein, the inference pipeline 1400 involves inputting extracted clauses for prediction of their correctness (as being represented by block 1402 in FIG. 14). The inference pipeline 1400 further involves data preparation for extracting clauses as described in the preceding paragraphs (and as being represented by block 1404 in FIG. 14). The inference pipeline 1400 further involves calculating clause similarity in relation to the correct clauses as inputted, using clause embeddings or similarity measurement methods like cosine similarity (as being represented by block 1406 in FIG. 14). The inference pipeline 1400 further involves performing feature extraction and implementing the clause correctness classification model as described (as being represented by block 1408 in FIG. 14). Herein, an output of the inference pipeline 1400 is in the form of clause correctness predictions, i.e., a prediction in the form of a correctness score or the like for each clause indicative of its correctness.
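The similarity calculation at block 1406 can be illustrated with a simple bag-of-words cosine similarity, scoring a clause by its best match against the known-correct clauses. The clauses below are hypothetical, and a production system would use learned clause embeddings (e.g., Doc2Vec or transformer vectors) rather than raw word counts.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def correctness_score(clause, correct_clauses):
    """Score a clause by its best similarity to any known-correct clause."""
    cv = bow(clause)
    return max(cosine(cv, bow(c)) for c in correct_clauses)

# Hypothetical known-correct clauses from step 1302.
correct = [
    "the receiving party shall keep the confidential information secret",
    "this agreement shall be governed by the laws of india",
]
score = correctness_score("the receiving party shall keep the information secret", correct)
```

A clause close to a known-correct clause scores near 1.0; an unrelated clause scores near 0.0, flagging it for review.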
[0082] Now, once the predictions in terms of correctness of clauses in a given contract document are received, the risk analysis for the said given contract document may be performed. In the present disclosure, the risk analysis involves using results from the clause extraction model, the NER model, the clause classification model, and the clause correctness classification model, to arrive at a composite risk indicator. It may be understood that the higher the number of correct clauses in the given contract document, and the higher the correctness scores of those clauses, the lower the risk factor for the given contract document; and vice-versa. In the present embodiments, the risk indicators may further be based on the presence or absence of clauses, and/or the presence or absence of entities in the extracted clauses, etc., as per predefined rules. Another rule that can be implemented for quantification of risk is cross-clause analysis. For example, the presence of a governing law clause in the absence of a jurisdiction clause is a risk. Similarly, a document containing representations and warranties should have indemnity clauses; their absence is a risk. All such criteria are considered in the calculation of risk, and such additional risk indicators may also be considered for performing the risk analysis for the given contract document. Further, in the present examples, the results of the risk analysis for the given contract document may be presented (outputted) in the form of a report, which may be generated using a standard reporting template. In an example, the said report may be in the form of a traffic light output which may mark the risk analysis for each of the clauses in the given contract document in different colours, like RED colour for high risk clauses, YELLOW colour for medium risk clauses, and GREEN colour for low risk clauses.
In an example, the said report may be either in DOC format or PDF format, or the like, without any limitations.
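The rule-based risk quantification and traffic-light output described above can be sketched as follows. The clause-category names, score weights, and colour thresholds here are illustrative assumptions, not values specified by the disclosure.

```python
# Cross-clause rules from the disclosure: if the first clause is present
# but the second is absent, that is a risk finding.
CROSS_CLAUSE_RULES = [
    ("governing_law", "jurisdiction", "governing law without jurisdiction clause"),
    ("representations_and_warranties", "indemnity", "representations without indemnity clause"),
]

def risk_colour(score):
    """Map a composite risk score in [0, 1] to a traffic-light colour.
    The 0.6 / 0.3 thresholds are illustrative assumptions."""
    if score >= 0.6:
        return "RED"
    if score >= 0.3:
        return "YELLOW"
    return "GREEN"

def risk_report(present_clauses, correctness):
    """present_clauses: set of clause categories found in the document.
    correctness: mapping of clause category -> correctness score in [0, 1]."""
    findings = []
    for trigger, required, message in CROSS_CLAUSE_RULES:
        if trigger in present_clauses and required not in present_clauses:
            findings.append(message)
    # Composite indicator: cross-clause violations plus low clause correctness.
    avg_correct = (sum(correctness.values()) / len(correctness)) if correctness else 0.0
    score = min(1.0, 0.4 * len(findings) + (1.0 - avg_correct))
    return {"colour": risk_colour(score), "score": round(score, 2), "findings": findings}

# Hypothetical document: both cross-clause rules are violated.
report = risk_report(
    {"governing_law", "representations_and_warranties", "confidentiality"},
    {"governing_law": 0.9, "representations_and_warranties": 0.8, "confidentiality": 0.95},
)
```

Here two rule violations push the composite score into the RED band, even though the individual clauses score well on correctness.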
[0083] The present disclosure may further provide a scheme for continuous training of the utilized models for performing risk analysis for the contract document, in order to improve them over time. That is, once deployed, the models need to be continuously re-trained to improve their performance. For this purpose, the present system 100 may provide a user-interface allowing the user to indicate whether a given model prediction is accurate or not. For example, the clause types may be verified by a competent person who may further identify the right clause type, or may identify whether a particular detail is properly extracted or not, and then correct the prediction. Referring to FIG. 15, illustrated is a schematic of a continuous training framework (as represented by reference numeral 1500), as described. As shown, at step 1502, the continuous training framework 1500 involves generating the prediction using the utilized model. Herein, the model in perspective may be any one of the clause extraction model, the NER model, the clause classification model, and the clause correctness classification model, as utilized. At step 1504, the continuous training framework 1500 involves receiving user input(s) to correct the prediction(s), if required. At step 1506, the continuous training framework 1500 involves re-training of the utilized model using the user input(s) (representative of correct prediction(s)). At step 1508, the continuous training framework 1500 may further involve verification of the re-trained model for performance improvement. This step 1508 may be carried out manually by a legal professional (as involved). If the re-trained model is verified to have improved performance, at step 1510, such re-trained (improved) model is deployed, replacing the corresponding existing model, for further implementation as per embodiments of the present disclosure.
In an example, the continuous training framework 1500 may be used to continuously re-train models on a schedule, for example once every week, once every month, etc. If the re-trained model provides better results (based on verification), it may automatically replace the corresponding existing model; otherwise, the existing model is kept in use.
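The deploy-if-improved decision at steps 1506-1510 can be sketched as follows. The training and evaluation callables, and the toy numeric "models", are stand-ins for the real re-training and verification procedures.

```python
def retrain_and_maybe_deploy(current_model, train_fn, evaluate_fn, corrections):
    """Retrain on user corrections (step 1506), verify (step 1508),
    and deploy only if the validation score improves (step 1510)."""
    candidate = train_fn(corrections)
    current_score = evaluate_fn(current_model)
    candidate_score = evaluate_fn(candidate)
    if candidate_score > current_score:
        return candidate, True      # deploy the re-trained model
    return current_model, False     # keep the existing model

# Toy stand-ins: a "model" here is just its accuracy value, and each
# user correction is assumed to add a small improvement.
deployed, replaced = retrain_and_maybe_deploy(
    current_model=0.80,
    train_fn=lambda corrections: 0.80 + 0.01 * len(corrections),
    evaluate_fn=lambda model: model,
    corrections=["fixed clause A", "fixed clause B"],
)
```

With two corrections, the candidate scores 0.82 against the incumbent's 0.80, so the framework would swap it in; with no improvement, the existing model would be retained.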
[0084] FIGS. 16A-16F provide a representative processing workflow for some of the steps described in the preceding paragraphs using a first exemplary text, according to embodiments of the present disclosure. FIG. 16A provides examples of manual labelling of clauses for a sample set of clauses. FIG. 16B provides methodology for data cleaning for the said first exemplary text by utilizing stop words. FIG. 16C provides methodology for data augmentation using the said first exemplary text. FIG. 16D provides methodology for POS tagging for the said first exemplary text. FIG. 16E provides methodology for vectorization for the said first exemplary text. FIG. 16F provides methodology for an inference pipeline for classification for the said first exemplary text.
[0085] FIGS. 17A-17E provide a representative processing workflow for some of the steps described in the preceding paragraphs using a second exemplary text, according to embodiments of the present disclosure. FIG. 17A provides methodology for pre-processing and annotation for the said second exemplary text. FIG. 17B provides methodology for data cleaning for the said second exemplary text. FIG. 17C provides methodology for vectorization for the said second exemplary text. FIG. 17D provides methodology for co-reference resolution for the said second exemplary text. FIG. 17E provides methodology for named entity recognition for the said second exemplary text.
[0086] The present disclosure also relates to a system (such as, the system 100 including the server 200 and the user device 300) for performing risk analysis for contract documents. Various embodiments and variants disclosed above, with respect to the aforementioned method 400 as per the first aspect, apply mutatis mutandis to the present system 100. It may be appreciated that, for the given purpose, the described components of the system 100 may be considered interconnected with each other, and the steps as described above for the method 400 are generally sequential in nature.
[0087] The embodiments disclosed herein may be implemented through an apparatus comprising the server 200, the user device 300 and a computer program stored in a memory 210 of the server 200, wherein the computer program is configured together with the server 200 and the user device 300 to control the apparatus to perform the method 400 for performing risk analysis for contract documents. The embodiments disclosed herein may additionally and/or alternatively be implemented through computer executable program code running on at least one hardware device and performing functions to control the elements shown in FIGS. 4-15, which include blocks that can be at least one of a hardware device, or a combination of a hardware device and a software module, to perform the method 400 for performing risk analysis for contract documents.
[0088] Thereby, the present disclosure provides systems and methods to be used for risk analysis of contract documents, specifically NDA documents. As may be appreciated, NDA documents typically have a lot of binding clauses which may be required to be analysed by expert advocates. Oftentimes the number of NDAs that a company deals with can be very large, leading to delays in contract execution. The present disclosure solves this problem by automating the risk analysis for such contract documents using machine learning (ML) techniques. The present disclosure uses NDA documents provided by the clients in any format (doc, pdf, etc.), processes them using ML techniques, carries out a risk analysis, and finally provides a detailed risk report. To summarize, the present disclosure provides automated risk analysis of contract documents (such as, but not limited to, NDA documents) and summarization of the risk analysis in the form of a report; and thus overcomes the need for availability of a legal professional (such as, an advocate) to go through the contract document each time a given contract document is to be executed.
[0089] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. While the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the present disclosure. It is therefore intended that the scope of the patent rights be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the patent rights, which is set forth in the following claims.
CLAIMS:
WE CLAIM:
1. A method for performing risk analysis for contract documents, comprising:
receiving a contract document to be processed;
pre-processing the contract document to extract text therefrom;
implementing a clause classification model to extract and classify clauses from the text of the contract document into one or more predefined categories;
performing risk analysis of each one of the clauses to determine risk indicators based on the presence or absence of one or more clauses and/or relevant entities in the clauses, and predefined rules associated with the one or more predefined categories; and
outputting result of risk analysis in form of a report.
2. The method as claimed in claim 1, wherein the clause classification model is trained on text from a plurality of sample contract documents by preparing data from the text using one or more of data processing techniques, selected from stemming, lemmatization, vectorization, Parts-Of-Speech (POS) tagging.
3. The method as claimed in claim 1, wherein the clause classification model is based on one or more machine learning techniques, selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformers based models.
4. The method as claimed in claim 1 further comprising implementing a co-reference resolution model for resolving co-references in the extracted clauses, wherein the co-reference resolution model is trained using one or more of rule-based modelling techniques, mention-based modelling techniques, and clustering-based modelling techniques.
5. The method as claimed in claim 1 further comprising implementing a Named Entity Recognition (NER) model to extract entities in the clauses of the contract document, wherein the NER model is trained by implementing one or more machine learning techniques, selected from conditional random fields (CRFs), Bidirectional LSTM (BiLSTM), Convolutional Neural Network (CNN), Embeddings from Language Models (ELMo), Stanford Natural Language Processing (NLP), GPT3, GPT4.
6. The method as claimed in claim 1 further comprising implementing a clause correctness classification model for predicting correctness of clauses of the contract document, wherein the clause correctness classification model is trained by implementing one or more machine learning techniques, selected from techniques like Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformers based models.
7. The method as claimed in claim 1, wherein the output of the result of risk analysis is in the form of a traffic light report marking risk for each of the clauses in the contract document in different colours, including RED colour for high risk clauses, YELLOW colour for medium risk clauses, and GREEN colour for low risk clauses.
8. A system for performing risk analysis for contract documents, comprising:
an interface configured to receive a contract document to be processed; and
a processing unit configured to:
pre-process the contract document to extract text therefrom;
implement a clause classification model to extract and classify clauses from the text of the contract document into one or more predefined categories;
perform risk analysis of each one of the clauses to determine risk indicators based on the presence or absence of one or more clauses and/or relevant entities in the clauses, and predefined rules associated with the one or more predefined categories; and
output result of risk analysis in form of a report.
9. The system as claimed in claim 8, wherein the clause classification model is trained on text from a plurality of sample contract documents by preparing data from the text using one or more of data processing techniques, selected from stemming, lemmatization, vectorization, Parts-Of-Speech (POS) tagging, and wherein the clause classification model is based on one or more machine learning techniques, selected from Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformers based models.
10. The system as claimed in claim 8, wherein the processing unit is further configured to implement a co-reference resolution model for resolving co-references in the extracted clauses, wherein the co-reference resolution model is trained using one or more of rule-based modelling techniques, mention-based modelling techniques, and clustering-based modelling techniques.
11. The system as claimed in claim 8, wherein the processing unit is further configured to implement a Named Entity Recognition (NER) model to extract entities in the clauses of the contract document, wherein the NER model is trained by implementing one or more machine learning techniques, selected from conditional random fields (CRFs), Bidirectional LSTM (BiLSTM), Convolutional Neural Network (CNN), Embeddings from Language Models (ELMo), Stanford Natural Language Processing (NLP), GPT3, GPT4.
12. The system as claimed in claim 8, wherein the processing unit is further configured to implement a clause correctness classification model for predicting correctness of clauses of the contract document, wherein the clause correctness classification model is trained by implementing one or more machine learning techniques, selected from techniques like Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformers based models.
13. The system as claimed in claim 8, wherein the output of the result of risk analysis is in the form of a traffic light report marking risk for each of the clauses in the contract document in different colours, including RED colour for high risk clauses, YELLOW colour for medium risk clauses, and GREEN colour for low risk clauses.
14. An apparatus comprising a server, a user device and a computer program stored in a memory, the computer program being configured together with the server and the user device to control the apparatus to perform the method according to any one of claims 1-7.
15. A computer program comprising computer executable program code, when executed the program code controls a computer to perform the method according to any one of claims 1-7.