Abstract: SYSTEM AND METHOD FOR TRAINING MACHINE LEARNING MODEL FOR PROCESSING OF CONTRACT DOCUMENTS The present disclosure provides a system and a method for training a machine learning model for processing of contract documents. The method comprises implementing a pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document. The method further comprises configuring an interface to: provide a list of the one or more clause types in the contract document; allow a user to select one of the one or more clause types from the list; display text from the contract document corresponding to the selected one of the one or more clause types; and allow the user to either confirm correctness of, or to edit, the corresponding one of the one or more clause types with respect to the displayed text. The method further comprises utilizing the text with the corresponding confirmed or edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model. FIG. 7
Description:SYSTEM AND METHOD FOR TRAINING MACHINE LEARNING MODEL FOR PROCESSING OF CONTRACT DOCUMENTS
FIELD OF THE PRESENT DISCLOSURE
[0001] The present disclosure generally relates to machine learning and processing, and particularly to a system and a method for training a machine learning model for processing of contract documents.
BACKGROUND
[0002] Companies increasingly hold a large corpus of historical legal documents that they need to store, maintain, and act upon until the documents expire. If these documents are simply dumped into a data store, locating them when they need to be referred to becomes laborious. Particularly, if the documents are still active, it is important to be able to refer to them from time to time to address customer queries or deal with legal proceedings. For instance, most modern enterprises have a large number of contracts in force at any given time. A contract document defines the scope of obligations and benefits with regard to the external and internal parties involved. For example, a non-disclosure agreement (NDA) is a type of binding contract document between two or more parties that prevents sensitive information from being shared with others. Enterprises may regularly be adding new NDA contracts for each new business deal, for example, with a customer, a contractor, a vendor, or the like. Contract document review may be described as a process of reviewing content of documents to identify information relevant to one or more topics. For example, NDAs typically have many binding clauses which need to be analysed. Such contract document review is typically performed in order to understand contractual obligations, navigate client or customer relationships, and understand compliance risk.
[0003] The process of sifting through the documents and finding the right information can potentially be automated. Putting in place an efficient data management system with obligation management functionalities requires that concepts be identified in each document, so that when a user needs to extract information regarding a particular contract that was signed with a particular party, such information is readily accessible. As another use case, in traditional organizations where legal professionals stayed with a company for many years, historical information regarding engagements with certain parties of interest was retained in their memory. With changing work cultures and increased employee churn, the continuity of information is difficult to establish and historical information about a particular engagement is lost. In such scenarios, having a smart repository where contracts can be stored and easily accessed is very important. Such smart repositories should have features such as data storage, multi-format file management, searchability across the documents using multiple criteria, obligation management, etc. Advanced machine learning algorithms are necessary to handle some of these tasks.
[0004]
[0005] In the last few years, machine learning has been increasingly employed for processing of contract documents, such as for contract document review. This can significantly reduce the human effort involved in the review process, thus increasing productivity. However, a limitation of the machine learning approach is that, once a machine learning model is deployed in production, its performance starts degrading because the model is sensitive to changes in the real world, and user behaviour keeps changing with time. Although all machine learning models decay, the speed of decay varies with time. This decay is mostly caused by data drift, concept drift, or both. Therefore, there is a need for continuous training of machine learning models to keep them updated and to address the performance deficit of machine learning models in information extraction from contract documents. Continuous training is an aspect of machine learning operations that automatically and continuously retrains machine learning models to adapt to changes in the data before they are redeployed. The trigger for a re-build can be a data change, a model change, or a code change.
[0006] Currently there are no efficient tools that may streamline the process of training a machine learning model, especially any tool tailored specifically for implementation of the trained machine learning model for processing of contract documents. Therefore, in light of the foregoing discussion, there exists a need to overcome problems associated with the traditional machine learning model training process, and provide a system and a method for training a machine learning model for processing of contract documents in an efficient and user-friendly manner.
SUMMARY
[0007] The present disclosure aims to provide a smart repository used to maintain, store, and render searchable historical documents. The primary objective of the present disclosure is to provide a smart contract management tool with a repository which, in turn, can be used to query information from the stored documents. It may be understood that the performance of the search engine for the searchability of (a) contract clauses in legal documents, (b) parameters in legal documents like party names, effective date, term, jurisdiction etc., (c) financial details in financial documents like balance sheets, revenue statements, P&L statements etc., to mention a few, depends solely on the performance of the inbuilt machine learning or Artificial Intelligence (AI) algorithm that forms the heart of the information extraction process. The present disclosure proposes a platform for generalized dynamic review of documents and automatic finetuning of the machine learning model for smart document repositories. In particular, the proposed platform has two main objectives: (i) a framework for review of documents by expert users to identify information from documents uploaded on to the smart repository, and (ii) a framework for re-training the machine learning model using the reviewed information.
[0008] The present framework has multiple steps that are looped over time to train and automatically finetune the machine learning models for a given requirement. Herein, the first step entails information identification by users, data pre-processing, and training of the model, that is, the process involved in developing a base model. The second step covers information extraction and passing of the extracted information to the search engine. The third step covers reviewing the predictions for new documents, correcting these predictions, and retraining the machine learning models repeatedly to iteratively improve such models. The platform is “dynamic” due to the continuous nature of the process and the dynamic updating of AI models over time.
[0009] Thereby, the present disclosure solves the problems associated with extraction of information from documents to be used in smart document repositories, performance deficit of machine learning models in information extraction from documents, automatic training of machine learning models and deployment into production, as well as ad-hoc information identification and extraction based on the user’s requirement. Conventionally, such problems have been addressed by training the machine learning models for a certain use case, but not within a generalizable framework.
[0010] In an aspect, the present disclosure provides a system for training a machine learning model for processing of contract documents. The system comprises an interface. The system further comprises a database having the machine learning model as a pre-trained instance of the machine learning model stored therein. The system further comprises a processing unit. The processing unit is configured to implement the pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document. The processing unit is further configured to configure the interface to: provide a list of the one or more clause types in the contract document; allow a user to select one of the one or more clause types from the list; display text from the contract document corresponding to the selected one of the one or more clause types; and provide one or more options correspondent to each of the one or more clause types to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text or to edit the corresponding one of the one or more clause types with respect to the displayed text. The processing unit is further configured to utilize the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model. The processing unit is further configured to store the updated instance of the machine learning model as the machine learning model in the database.
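By way of a non-limiting illustration, the confirm-or-edit review step described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names ReviewItem and build_retraining_examples, and the representation of each training example as a (text, clause type) pair, are hypothetical and are introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReviewItem:
    """One displayed text segment with its predicted clause type (hypothetical structure)."""
    text: str
    predicted_clause_type: str
    confirmed: bool = False                   # the user confirmed the prediction
    edited_clause_type: Optional[str] = None  # the user replaced the prediction


def build_retraining_examples(items: List[ReviewItem]) -> List[Tuple[str, str]]:
    """Collect (text, clause type) pairs from confirmed and edited items only."""
    examples = []
    for item in items:
        if item.edited_clause_type is not None:
            # An edit takes precedence over the model's prediction.
            examples.append((item.text, item.edited_clause_type))
        elif item.confirmed:
            examples.append((item.text, item.predicted_clause_type))
        # Items that were neither confirmed nor edited are skipped.
    return examples


reviewed = [
    ReviewItem("Each party shall keep the information secret.",
               "confidentiality", confirmed=True),
    ReviewItem("This agreement is governed by the laws of India.",
               "term", edited_clause_type="governing law"),
    ReviewItem("This agreement remains in force for two years.", "term"),
]
retraining_examples = build_retraining_examples(reviewed)
```

In this sketch, only text that a reviewer has explicitly confirmed or corrected reaches the re-training set; how the pairs are then used to re-train the model is left open.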
[0011] In one or more embodiments, the processing unit is further configured to configure the interface to allow the user to select a portion of the text being displayed and assign a clause type thereto; and utilize the text with the assigned clause type for re-training the machine learning model.
[0012] In one or more embodiments, the processing unit is further configured to configure the interface to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document; implement the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document; implement the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document; configure the interface to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document; and authorize the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model.
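As a hedged illustration of the comparison described above, the sketch below runs two model instances over the same sample text and places their clause types side by side. The function compare_instances and the two keyword-based stand-in models are assumptions made for this example only; an actual instance of the machine learning model would take their place.

```python
from typing import Callable, Dict, List

# A model instance is represented here as a callable: text -> clause type.
ClauseModel = Callable[[str], str]


def compare_instances(segments: List[str],
                      pretrained: ClauseModel,
                      updated: ClauseModel) -> List[Dict[str, object]]:
    """Return one row per text segment with both instances' clause types."""
    rows = []
    for segment in segments:
        first = updated(segment)      # "first clause type" (updated instance)
        second = pretrained(segment)  # "second clause type" (pre-trained instance)
        rows.append({"text": segment,
                     "first_clause_type": first,
                     "second_clause_type": second,
                     "agree": first == second})
    return rows


def pretrained_instance(segment: str) -> str:
    # Stand-in model: recognises termination clauses only.
    return "termination" if "terminate" in segment.lower() else "other"


def updated_instance(segment: str) -> str:
    # Stand-in model: additionally recognises confidentiality clauses.
    lowered = segment.lower()
    if "terminate" in lowered:
        return "termination"
    if "confidential" in lowered:
        return "confidentiality"
    return "other"


sample_segments = [
    "Either party may terminate this agreement on 30 days notice.",
    "The recipient shall keep all confidential information secret.",
]
rows = compare_instances(sample_segments, pretrained_instance, updated_instance)
```

The rows where the two instances disagree are the ones a reviewer would likely inspect before choosing which instance to deploy.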
[0013] In one or more embodiments, the processing unit is further configured to configure the interface to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type; and configure the interface to allow a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model. Herein, the first-level reviewer is different from the second-level reviewer.
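The separation between the two reviewer levels may, for instance, be expressed as a simple permission table. The sketch below is illustrative only; the role names, the action names, and the is_allowed helper are hypothetical and do not come from the disclosure.

```python
from enum import Enum
from typing import Dict, Set


class Role(Enum):
    FIRST_LEVEL = "first_level_reviewer"
    SECOND_LEVEL = "second_level_reviewer"


# Hypothetical action names; each role is given a distinct set of actions.
PERMISSIONS: Dict[Role, Set[str]] = {
    Role.FIRST_LEVEL: {"select_clause_type", "confirm_clause_type", "edit_clause_type"},
    Role.SECOND_LEVEL: {"compare_instances", "deploy_instance"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the given interface action."""
    return action in PERMISSIONS.get(role, set())
```

For example, is_allowed(Role.FIRST_LEVEL, "deploy_instance") evaluates to False under this table, so only a second-level reviewer can authorize deployment.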
[0014] In one or more embodiments, the processing unit is further configured to configure the interface to mask confidential information in the text of the contract document being displayed.
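One simple way such masking could be approached is pattern-based substitution before the text is displayed. The sketch below is an assumption for illustration: the disclosure does not specify how confidential information is detected, and the two patterns (e-mail addresses and dollar amounts), the mask token, and the function name mask_confidential are hypothetical.

```python
import re

# Illustrative patterns only; a real deployment would define its own.
MASK_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # e-mail addresses
    re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?"),   # dollar amounts such as $12,500.00
]


def mask_confidential(text: str, mask: str = "[MASKED]") -> str:
    """Replace every match of the configured patterns with the mask token."""
    for pattern in MASK_PATTERNS:
        text = pattern.sub(mask, text)
    return text


masked = mask_confidential(
    "Contact jane.doe@example.com; the fee is $12,500.00 per year."
)
# masked == "Contact [MASKED]; the fee is [MASKED] per year."
```

Masking is applied only to the displayed copy in this sketch; the underlying contract text is left unchanged.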
[0015] In one or more embodiments, the processing unit is further configured to configure the interface to highlight a portion of the text from the contract document corresponding to the selected one of the identified one or more clause types thereof.
[0016] In one or more embodiments, the processing unit is further configured to check for duplication of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model.
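A minimal sketch of one possible comparison is given below, assuming that a document counts as already used when its text matches a previously used document after whitespace and letter case are normalized. The fingerprinting approach, the class name TrainingCorpusIndex, and its methods are hypothetical; the disclosure does not prescribe a particular comparison technique.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lower-casing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TrainingCorpusIndex:
    """Remembers fingerprints of documents already used for training."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, text: str) -> bool:
        return fingerprint(text) in self._seen

    def add(self, text: str) -> None:
        self._seen.add(fingerprint(text))


index = TrainingCorpusIndex()
index.add("This Non-Disclosure Agreement is made on 1 January.")
```

This detects exact matches up to formatting differences only; near-duplicate documents with small wording changes would need a similarity measure instead of a hash.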
[0017] In one or more embodiments, the processing unit is further configured to implement the machine learning model to process a set of contract documents containing contract documents of different contract types and containing different clause types, and determine a frequency of occurrence of one of the different clause types vis-à-vis one of the different contract types; and generate a library containing relevant different clause types for each of the different contract types based on the determined frequency of occurrence.
[0018] In one or more embodiments, the processing unit is further configured to configure the interface to receive an input from the user indicating drafting of a particular contract type; and utilize the generated library to suggest relevant clause types for the said particular contract type.
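The frequency-based library and the drafting suggestion described in the two preceding paragraphs may be sketched as follows. The sketch assumes that each processed contract document is reduced to a contract type and a list of identified clause types, and that a clause type is "relevant" when it occurs in at least a chosen fraction of the documents of that contract type; the threshold value and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_clause_library(documents: List[Tuple[str, List[str]]],
                         min_frequency: float = 0.5) -> Dict[str, List[str]]:
    """Map each contract type to the clause types seen in enough of its documents."""
    doc_counts = Counter()
    clause_counts = defaultdict(Counter)
    for contract_type, clause_types in documents:
        doc_counts[contract_type] += 1
        for clause_type in set(clause_types):  # count once per document
            clause_counts[contract_type][clause_type] += 1
    library = {}
    for contract_type, total in doc_counts.items():
        library[contract_type] = sorted(
            clause_type
            for clause_type, count in clause_counts[contract_type].items()
            if count / total >= min_frequency
        )
    return library


def suggest_clause_types(library: Dict[str, List[str]], contract_type: str) -> List[str]:
    """Return the relevant clause types for the contract type being drafted."""
    return library.get(contract_type, [])


processed = [
    ("NDA", ["confidentiality", "term", "governing law"]),
    ("NDA", ["confidentiality", "governing law"]),
    ("NDA", ["confidentiality", "non-solicitation"]),
    ("Lease", ["rent", "term"]),
    ("Lease", ["rent", "maintenance"]),
]
library = build_clause_library(processed)
suggestions = suggest_clause_types(library, "NDA")
# suggestions == ["confidentiality", "governing law"]
```

Counting each clause type once per document keeps a single contract that repeats a clause from inflating its frequency.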
[0019] In one or more embodiments, the processing unit is further configured to configure the interface to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and train the machine learning model to determine entities complementary to the defined entity in a target contract document.
[0020] In another aspect, the present disclosure provides a method for training a machine learning model for processing of contract documents. The method comprises implementing a pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document. The method further comprises configuring an interface to provide a list of the one or more clause types in the contract document. The method further comprises configuring the interface to allow a user to select one of the one or more clause types from the list. The method further comprises configuring the interface to display text from the contract document corresponding to the selected one of the one or more clause types. The method further comprises configuring the interface to provide one or more options correspondent to each of the one or more clause types to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text or to edit the corresponding one of the one or more clause types with respect to the displayed text. The method further comprises utilizing the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model.
[0021] In one or more embodiments, the method further comprises configuring the interface to allow the user to select a portion of the text being displayed and assign a clause type thereto; and utilizing the text with the assigned clause type for re-training the machine learning model.
[0022] In one or more embodiments, the method further comprises configuring the interface to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document; implementing the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document; implementing the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document; configuring the interface to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document; and authorizing the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model.
[0023] In one or more embodiments, the method further comprises configuring the interface to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type; and configuring the interface to allow a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model. Herein, the first-level reviewer is different from the second-level reviewer.
[0024] In one or more embodiments, the method further comprises configuring the interface to mask confidential information in the text of the contract document being displayed.
[0025] In one or more embodiments, the method further comprises configuring the interface to highlight a portion of the text from the contract document corresponding to the selected one of the identified one or more clause types thereof.
[0026] In one or more embodiments, the method further comprises checking for duplication of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model.
[0027] In one or more embodiments, the method further comprises implementing the machine learning model to process a set of contract documents containing contract documents of different contract types and containing different clause types, and determine a frequency of occurrence of one of the different clause types vis-à-vis one of the different contract types; and generating a library containing relevant different clause types for each of the different contract types based on the determined frequency of occurrence.
[0028] In one or more embodiments, the method further comprises configuring the interface to receive an input from the user indicating drafting of a particular contract type; and utilizing the generated library to suggest relevant clause types for the said particular contract type.
[0029] In one or more embodiments, the method further comprises configuring the interface to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and training the machine learning model to determine entities complementary to the defined entity in a target contract document.
[0030] In yet another aspect, the present disclosure provides a computer program comprising computer executable program code which, when executed, controls a computing system to perform the method as described in the preceding paragraphs.
[0031] Embodiments of the present disclosure substantially eliminate or at least partially address the aforementioned problems in the prior art, and enable an efficient and user-friendly system and method for training a machine learning model for processing of contract documents.
[0032] Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative embodiments construed in conjunction with the appended claims that follow.
[0033] It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
BRIEF DESCRIPTION OF THE FIGURES
[0034] The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
[0035] For a more complete understanding of example embodiments of the present disclosure, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1 illustrates a system that may reside on and may be executed by a computer, which may be connected to a network, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 2 illustrates a diagrammatic view of a server, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 3 illustrates a diagrammatic view of a user device, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 4 illustrates an exemplary flow diagram providing an overall process workflow for implementation of the present system for processing of contract documents, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 5 illustrates a schematic of a modelling pipeline for generating a clause classification model as implemented in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 6 illustrates a schematic of an inference pipeline for implementing the generated clause classification model, as per the modelling pipeline of FIG. 5, in the process workflow of FIG. 4, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 7 illustrates a high-level schematic of a workflow for training the machine learning model for processing of contract documents, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 8 illustrates an exemplary interface implemented to provide a list of the one or more clause types in the contract document, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 9 illustrates an exemplary interface implemented for deployment of one of the instances of the machine learning model, in accordance with one or more exemplary embodiments of the present disclosure;
FIG. 10 illustrates a simplified schematic of a continuous training framework for a machine learning model for processing of contract documents, in accordance with one or more exemplary embodiments of the present disclosure; and
FIG. 11 illustrates a flowchart listing steps involved in a method for training a machine learning model for processing of contract documents, in accordance with one or more exemplary embodiments of the present disclosure.
[0036] In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION
[0037] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure is not limited to these specific details.
[0038] Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
[0039] Furthermore, in the following detailed description of the present disclosure, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present disclosure.
[0040] Embodiments described herein may be discussed in the general context of computer-executable instructions residing on some form of computer-readable storage medium, such as program modules, executed by one or more computers or other devices. By way of example, and not limitation, computer-readable storage media may comprise non-transitory computer-readable storage media and communication media; non-transitory computer-readable media include all computer-readable media except for a transitory, propagating signal. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or distributed as desired in various embodiments.
[0041] Some portions of the detailed description that follows are presented and discussed in terms of a process or method. Although steps and sequencing thereof are disclosed in figures herein describing the operations of this method, such steps and sequencing are exemplary. Embodiments are well suited to performing various other steps or variations of the steps recited in the flowchart of the figure herein, and in a sequence other than that depicted and described herein. Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those utilizing physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as transactions, bits, values, elements, symbols, characters, samples, pixels, or the like.
[0042] In some implementations, any suitable computer usable or computer readable medium (or media) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-usable, or computer-readable, storage medium (including a storage device associated with a computing device) may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable medium may include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a digital versatile disk (DVD), a static random access memory (SRAM), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, a media such as those supporting the internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be a suitable medium upon which the program is stored, scanned, compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of the present disclosure, a computer-usable or computer-readable, storage medium may be any tangible medium that can contain or store a program for use by or in connection with the instruction execution system, apparatus, or device.
[0043] In some implementations, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. In some implementations, such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. In some implementations, the computer readable program code may be transmitted using any appropriate medium, including but not limited to the internet, wireline, optical fibre cable, RF, etc. In some implementations, a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
[0044] In some implementations, computer program code for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. However, the computer program code for carrying out operations of the present disclosure may also be written in conventional procedural programming languages, such as the "C" programming language, PASCAL, or similar programming languages, as well as in scripting languages such as JavaScript, PERL, or Python. In present implementations, the languages and tools used for training may include Python, TensorFlow, Bazel, C, and C++. Further, a decoder in the user device (as will be discussed) may use C, C++ or any processor specific ISA. Furthermore, assembly code inside C/C++ may be utilized for specific operations. Also, an ASR (automatic speech recognition) and G2P decoder, along with the entire user system, can be run on embedded Linux (any distribution), Android, iOS, Windows, or the like, without any limitations. The program code may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the internet using an Internet Service Provider).
In some implementations, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs) or other hardware accelerators, micro-controller units (MCUs), or programmable logic arrays (PLAs) may execute the computer readable program instructions/code by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0045] In some implementations, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus (systems), methods and computer program products according to various implementations of the present disclosure. Each block in the flowchart and/or block diagrams, and combinations of blocks in the flowchart and/or block diagrams, may represent a module, segment, or portion of code, which comprises one or more executable computer program instructions for implementing the specified logical function(s)/act(s). These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer program instructions, which may execute via the processor of the computer or other programmable data processing apparatus, create the ability to implement one or more of the functions/acts specified in the flowchart and/or block diagram block or blocks or combinations thereof. It should be noted that, in some implementations, the functions noted in the block(s) may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
[0046] In some implementations, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0047] In some implementations, the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed (not necessarily in a particular order) on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts (not necessarily in a particular order) specified in the flowchart and/or block diagram block or blocks or combinations thereof.
[0048] Referring to example implementation of FIG. 1, there is shown a system 100 that may reside on and may be executed by a computer (e.g., computer 12), which may be connected to a network (e.g., network 14) (e.g., the internet or a local area network). Examples of computer 12 may include, but are not limited to, a personal computer(s), a laptop computer(s), mobile computing device(s), a server computer, a series of server computers, a mainframe computer(s), or a computing cloud(s). In some implementations, each of the aforementioned may be generally described as a computing device. In certain implementations, a computing device may be a physical or virtual device. In many implementations, a computing device may be any device capable of performing operations, such as a dedicated processor, a portion of a processor, a virtual processor, a portion of a virtual processor, portion of a virtual device, or a virtual device. In some implementations, a processor may be a physical processor or a virtual processor. In some implementations, a virtual processor may correspond to one or more parts of one or more physical processors. In some implementations, the instructions/logic may be distributed and executed across one or more processors, virtual or physical, to execute the instructions/logic. Computer 12 may execute an operating system, for example, but not limited to, Microsoft Windows; Mac OS X; Red Hat Linux, or a custom operating system. (Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States, other countries or both; Mac and OS X are registered trademarks of Apple Inc. in the United States, other countries or both; Red Hat is a registered trademark of Red Hat Corporation in the United States, other countries or both; and Linux is a registered trademark of Linus Torvalds in the United States, other countries or both).
[0049] In some implementations, the instruction sets and subroutines of system 100, which may be stored on storage device, such as storage device 16, coupled to computer 12, may be executed by one or more processors (not shown) and one or more memory architectures included within computer 12. In some implementations, storage device 16 may include but is not limited to: a hard disk drive; a flash drive, a tape drive; an optical drive; a RAID array (or other array); a random-access memory (RAM); and a read-only memory (ROM).
[0050] In some implementations, network 14 may be connected to one or more secondary networks (e.g., network 18), examples of which may include but are not limited to: a local area network; a wide area network; or an intranet, for example.
[0051] In some implementations, computer 12 may include a data store, such as a database (e.g., relational database, object-oriented database, triplestore database, etc.) and may be located within any suitable memory location, such as storage device 16 coupled to computer 12. In some implementations, data, metadata, information, etc. described throughout the present disclosure may be stored in the data store. In some implementations, computer 12 may utilize any known database management system such as, but not limited to, DB2, in order to provide multi-user access to one or more databases, such as the above noted relational database. In some implementations, the data store may also be a custom database, such as, for example, a flat file database or an XML database. In some implementations, any other form(s) of a data storage structure and/or organization may also be used. In some implementations, system 100 may be a component of the data store, a standalone application that interfaces with the above noted data store and/or an applet / application that is accessed via client applications 22, 24, 26, 28. In some implementations, the above noted data store may be, in whole or in part, distributed in a cloud computing topology. In this way, computer 12 and storage device 16 may refer to multiple devices, which may also be distributed throughout the network.
[0052] In some implementations, computer 12 may execute application 20 for training a machine learning model for processing of contract documents (as discussed later in more detail). In some implementations, system 100 and/or application 20 may be accessed via one or more of client applications 22, 24, 26, 28. In some implementations, system 100 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within application 20, a component of application 20, and/or one or more of client applications 22, 24, 26, 28. In some implementations, application 20 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within system 100, a component of system 100, and/or one or more of client applications 22, 24, 26, 28. In some implementations, one or more of client applications 22, 24, 26, 28 may be a standalone application, or may be an applet / application / script / extension that may interact with and/or be executed within and/or be a component of system 100 and/or application 20. Examples of client applications 22, 24, 26, 28 may include, but are not limited to, a standard and/or mobile web browser, an email application (e.g., an email client application), a textual and/or a graphical user interface, a customized web browser, a plugin, an Application Programming Interface (API), or a custom application. The instruction sets and subroutines of client applications 22, 24, 26, 28, which may be stored on storage devices 30, 32, 34, 36, coupled to user devices 38, 40, 42, 44, may be executed by one or more processors and one or more memory architectures incorporated into user devices 38, 40, 42, 44.
[0053] In some implementations, one or more of storage devices 30, 32, 34, 36, may include but are not limited to: hard disk drives; flash drives, tape drives; optical drives; RAID arrays; random access memories (RAM); and read-only memories (ROM). Examples of user devices 38, 40, 42, 44 (and/or computer 12) may include, but are not limited to, a personal computer (e.g., user device 38), a laptop computer (e.g., user device 40), a smart/data-enabled, cellular phone (e.g., user device 42), a notebook computer (e.g., user device 44), a tablet (not shown), a server (not shown), a television (not shown), a smart television (not shown), a media (e.g., video, photo, etc.) capturing device (not shown), and a dedicated network device (not shown). User devices 38, 40, 42, 44 may each execute an operating system, examples of which may include but are not limited to, Android, Apple iOS, Mac OS X; Red Hat Linux, or a custom operating system.
[0054] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of system 100 (and vice versa). Accordingly, in some implementations, system 100 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or system 100.
[0055] In some implementations, one or more of client applications 22, 24, 26, 28 may be configured to effectuate some or all of the functionality of application 20 (and vice versa). Accordingly, in some implementations, application 20 may be a purely server-side application, a purely client-side application, or a hybrid server-side / client-side application that is cooperatively executed by one or more of client applications 22, 24, 26, 28 and/or application 20. As one or more of client applications 22, 24, 26, 28, system 100, and application 20, taken singly or in any combination, may effectuate some or all of the same functionality, any description of effectuating such functionality via one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof, and any described interaction(s) between one or more of client applications 22, 24, 26, 28, system 100, application 20, or combination thereof to effectuate such functionality, should be taken as an example only and not to limit the scope of the disclosure.
[0056] In some implementations, one or more of users 46, 48, 50, 52 may access computer 12 and system 100 (e.g., using one or more of user devices 38, 40, 42, 44) directly through network 14 or through secondary network 18. Further, computer 12 may be connected to network 14 through secondary network 18, as illustrated with phantom link line 54. System 100 may include one or more user interfaces, such as browsers and textual or graphical user interfaces, through which users 46, 48, 50, 52 may access system 100.
[0057] In some implementations, the various user devices may be directly or indirectly coupled to communication network, such as communication network 14 and communication network 18, hereinafter simply referred to as network 14 and network 18, respectively. For example, user device 38 is shown directly coupled to network 14 via a hardwired network connection. Further, user device 44 is shown directly coupled to network 18 via a hardwired network connection. User device 40 is shown wirelessly coupled to network 14 via wireless communication channel 56 established between user device 40 and wireless access point (i.e., WAP) 58, which is shown directly coupled to network 14. WAP 58 may be, for example, an IEEE 802.11a, 802.11b, 802.11g, Wi-Fi, RFID, and/or Bluetooth (including Bluetooth Low Energy) device that is capable of establishing wireless communication channel 56 between user device 40 and WAP 58. User device 42 is shown wirelessly coupled to network 14 via wireless communication channel 60 established between user device 42 and cellular network / bridge 62, which is shown directly coupled to network 14.
[0058] In some implementations, some or all of the IEEE 802.11x specifications may use Ethernet protocol and carrier sense multiple access with collision avoidance (i.e., CSMA/CA) for path sharing. The various 802.11x specifications may use phase-shift keying (i.e., PSK) modulation or complementary code keying (i.e., CCK) modulation, for example. Bluetooth (including Bluetooth Low Energy) is a telecommunications industry specification that allows, e.g., mobile phones, computers, smart phones, and other electronic devices to be interconnected using a short-range wireless connection. Other forms of interconnection (e.g., Near Field Communication (NFC)) may also be used.
[0059] The system 100 may include a server (such as server 200, as shown in FIG. 2) for training a machine learning model for processing of contract documents (as will be described later in more detail). In the present implementations, the present system 100 may be embodied as the server 200. Herein, FIG. 2 is a block diagram of an example of the server 200 capable of implementing embodiments according to the present disclosure. In one embodiment, an application server as described herein may be implemented on exemplary server 200. In the example of FIG. 2, the server 200 includes a processing unit 205 for running software applications (such as, the application 20 of FIG. 1) and optionally an operating system. As illustrated, the server 200 further includes a database 210 which stores applications and data for use by the processing unit 205. Storage 215 provides non-volatile storage for applications and data and may include fixed disk drives, removable disk drives, flash memory devices, and CD-ROM, DVD-ROM or other optical storage devices. An optional user input device 220 includes devices that communicate user inputs from one or more users to the server 200 and may include keyboards, mice, joysticks, touch screens, etc. A communication or network interface 225 is provided which allows the server 200 to communicate with other computer systems via an electronic communications network, including wired and/or wireless communication and including an Intranet or the Internet. In one embodiment, the server 200 receives instructions and user inputs from a remote computer through communication interface 225. Communication interface 225 can comprise a transmitter and receiver for communicating with remote devices. An optional display device 250 may be provided which can be any device capable of displaying visual information in response to a signal from the server 200. 
The components of the server 200, including the processing unit 205, the database 210, the data storage 215, the user input devices 220, the communication interface 225, and the display device 250, may be coupled via one or more data buses 260.
[0060] In the embodiment of FIG. 2, a graphics system 230 may be coupled with the data bus 260 and the components of the server 200. The graphics system 230 may include a physical graphics processing unit (GPU) 235 and graphics memory. The GPU 235 generates pixel data for output images from rendering commands. The physical GPU 235 can be configured as multiple virtual GPUs that may be used in parallel (concurrently) by a number of applications or processes executing in parallel. For example, mass scaling processes for rigid bodies or a variety of constraint solving processes may be run in parallel on the multiple virtual GPUs. Graphics memory may include a display memory 240 (e.g., a framebuffer) used for storing pixel data for each pixel of an output image. In another embodiment, the display memory 240 and/or additional memory 245 may be part of the database 210 and may be shared with the processing unit 205. Alternatively, the display memory 240 and/or additional memory 245 can be one or more separate memories provided for the exclusive use of the graphics system 230. In another embodiment, the graphics system 230 includes one or more additional physical GPUs 255, similar to the GPU 235. Each additional GPU 255 may be adapted to operate in parallel with the GPU 235. Each additional GPU 255 generates pixel data for output images from rendering commands. Each additional physical GPU 255 can be configured as multiple virtual GPUs that may be used in parallel (concurrently) by a number of applications or processes executing in parallel, e.g., processes that solve constraints. Each additional GPU 255 can operate in conjunction with the GPU 235, for example, to simultaneously generate pixel data for different portions of an output image, or to simultaneously generate pixel data for different output images.
Each additional GPU 255 can be located on the same circuit board as the GPU 235, sharing a connection with the GPU 235 to the data bus 260, or each additional GPU 255 can be located on another circuit board separately coupled with the data bus 260. Each additional GPU 255 can also be integrated into the same module or chip package as the GPU 235. Each additional GPU 255 can have additional memory, similar to the display memory 240 and additional memory 245, or can share the memories 240 and 245 with the GPU 235. It is to be understood that the circuits and/or functionality of GPU as described herein could also be implemented in other types of processors, such as general-purpose or other special-purpose coprocessors, or within a CPU.
[0061] The system 100 may also include a user device 300 (as shown in FIG. 3). In embodiments of the present disclosure, the user device 300 may embody a smartphone, a personal computer, a tablet, or the like. Herein, FIG. 3 is a block diagram of an example of the user device 300 capable of implementing embodiments according to the present disclosure. In the example of FIG. 3, the user device 300 includes a processor 305 (hereinafter, referred to as CPU 305) for running software applications (such as, the application 20 of FIG. 1) and optionally an operating system. A user input device 320 is provided which includes devices that communicate user inputs from one or more users and may include keyboards, mice, joysticks, touch screens, and/or microphones. Further, a network interface 325 is provided which allows the user device 300 to communicate with other computer systems (e.g., the server 200 of FIG. 2) via an electronic communications network, including wired and/or wireless communication and including the Internet. The user device 300 may also include a decoder 355, which may be any device capable of decoding (decompressing) data that may be encoded (compressed). A display device 350 may be provided which may be any device capable of displaying visual information, including information received from the decoder 355. In particular, as will be described below, the display device 350 may provide an interface (with the two terms being interchangeably used), such that the interface 350 is configured to display information received from the server 200 of FIG. 2. The components of the user device 300 may be coupled via one or more data buses 360.
[0062] It may be seen that compared to the server 200 in the example of FIG. 2, the user device 300 in the example of FIG. 3 may have fewer components and less functionality. However, the user device 300 may include other components, for example, in addition to those described above. In general, the user device 300 may be any type of device that has one or more of display capability and the capability to receive inputs from a user and send such inputs to the server 200. However, it may be appreciated that the user device 300 may have additional capabilities beyond those just mentioned.
[0063] Referring to FIG. 4, illustrated is an exemplary flow diagram providing an overall process workflow (as represented by reference numeral 400) for implementation of the present system 100 for processing of contract documents. As shown, the process workflow 400 includes a step 402 for receiving the contract document(s) to be processed. For the purposes of the present disclosure, the system 100 may provide an interface (as described) for a user to upload the contract document(s) to be processed. Herein, the interface may allow the user to upload the said contract document(s) directly from the user device, or from a cloud platform, like Google Drive, OneDrive, Dropbox, etc. by implementing the corresponding APIs (as known in the art) without any limitations. In other examples, the system 100 may be configured to automatically fetch the contract document(s) from a data repository of the user (like an enterprise client), given the access thereto, to directly process each newly added contract document.
[0064] Further, as shown, the process workflow 400 includes a step 404 to perform initial pre-processing of the received contract document(s) to make those usable for further processing as per embodiments of the present disclosure. In the present implementations, the contract document(s) as received may be converted to a suitable format, such as, but not limited to, PDF for further processing. Herein, the received contract document(s) may be in the form of scanned images with machine unreadable text. In such cases, the received contract document(s) may be pre-processed using Optical Character Recognition (OCR) techniques to convert the text in the contract document(s) to machine readable form for further processing. In an example, the machine readable text from each contract document is extracted into a separate file (such as, a text file). Such pre-processing step may utilize available resources, such as, but not limited to, Python libraries, AWS Textract, Azure computer vision, etc.
[0065] Now, as discussed, a given contract document may include a plurality of clauses. It may be appreciated by a person skilled in the art that the clauses may be classified under certain predefined categories, such as clause(s) related to: (i) Parties, (ii) Purpose, (iii) Confidential Information, (iv) Recipient’s Treatment of Confidential Information, (v) Tangible Confidential Information, (vi) Exceptions to Confidential Information, (vii) Information that was available in the public domain, (viii) Information that is obtained other than through a breach of confidentiality, (ix) Information disclosure compelled by legal process, (x) Information that was developed independently, (xi) Term, (xii) No License, (xiii) Governing Law, (xiv) Equitable Relief, (xv) Entire Agreement, (xvi) No Assignment, (xvii) Severability, (xviii) Notices, (xix) No Implied Waiver, (xx) Headings and Interpretation, etc.
[0066] As shown, the process workflow 400 includes a step 406 for extracting and classifying clauses from the text of the given contract document. The present system 100 may implement a clause classification model (which may be part of the machine learning model, and hereinafter sometimes simply referred to as “classification model” without any limitations) for said purpose of extracting and classifying clauses from the text of the given contract document. Herein, the implemented machine learning model may first be trained, and then may be utilized for providing inferences by processing the text from the contract document(s) for clause classification. A simple model using a limited number of clauses per type may be first trained, and then be used to label clauses using a much bigger dataset. The labelled clauses may then be reviewed by experts and corrected. The corrected labelled dataset may then be used to develop the final model. The clause classification model may utilize one or more machine learning techniques for its implementation. In the present examples, the clause classification model may be based on any one of Naive Bayes classifier, Logistic regression, Support Vector Machine (SVM) Algorithm, Neural Embeddings based classifier, Convolutional Neural Networks (CNNs), Long Short Term Memory (LSTM) models, Transfer Learning using transformer functions, and the like. Such algorithmic extraction may use techniques like splitting paragraphs from the text, identifying keywords from a list of predefined keywords using phrase matching in the spaCy library, automatic labelling of the text, and the like. Implementation of the clause classification model may help to reduce manual labelling work.
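The keyword-based automatic labelling described above may be sketched as follows. This is a minimal illustration only, using plain substring matching from the Python standard library in place of the phrase matching of the spaCy library; the keyword lists and the names CLAUSE_KEYWORDS, split_paragraphs and auto_label are hypothetical and are not taken from the present disclosure.

```python
import re

# Hypothetical keyword lists; in practice these would be curated by legal experts.
CLAUSE_KEYWORDS = {
    "Governing Law": ["governed by", "laws of"],
    "Term": ["shall remain in effect", "term of this agreement"],
    "Severability": ["held to be invalid", "unenforceable"],
}


def split_paragraphs(text):
    """Split raw contract text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def auto_label(paragraph):
    """Return the clause type with the most keyword hits, or None if no keyword matches."""
    lowered = paragraph.lower()
    scores = {
        clause: sum(lowered.count(keyword) for keyword in keywords)
        for clause, keywords in CLAUSE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = (
    "This Agreement shall be governed by the laws of the State of X.\n\n"
    "If any provision is held to be invalid, the remainder stays in force.\n\n"
    "The parties met on Tuesday."
)
# Each paragraph is paired with a provisional label for later expert review.
labels = [(auto_label(p), p) for p in split_paragraphs(sample)]
```

Labels produced in this manner are provisional and would still be reviewed and corrected by experts before being used to develop the final model.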
[0067] Referring to FIG. 5, illustrated is a schematic of a modelling pipeline (as represented by reference numeral 500) for generating a clause classification model, as per embodiments of the present disclosure. As shown, in the modelling pipeline 500, at step 502, the extracted text from the contract document(s) is first inputted. At step 504, the modelling pipeline 500 involves data preparation for extracting clauses. Herein, the data preparation includes manual and/or algorithmic extraction of clauses. The manual extraction may involve legal professional(s) manually separating each clause from each of the sample contract documents in a training dataset. The algorithmic extraction may involve implementing available algorithm(s) to automate the data preparation task. At step 506, the modelling pipeline 500 involves data labelling for different clause types. This step 506 of data labelling may be performed by legal professionals (expert annotators). In the present implementation, the data labelling may result in a training list of various clauses, which may be provided as a sample dataset of correct clauses (as may be reviewed by legal professional(s)), and/or a sample dataset including a mix of correct clauses (labelled as such) and incorrect clauses (labelled as such). At step 508, the modelling pipeline 500 involves data cleaning. Herein, the data cleaning may include text processing to prepare complete sentences from text, stop-word removal, removal of punctuation and numbers, etc. At step 510, the modelling pipeline 500 involves data preparation. Herein, the data preparation may include data normalization, including data stemming (a crude heuristic process that chops off the ends of words) and/or data lemmatization (which reduces words to their base forms using a vocabulary and morphological analysis of words).
The data preparation may also include data vectorization which is used to get some distinct features out of the text for the model to train on. In the present examples, the data vectorization may be achieved by using one or more of techniques, such as TFIDF (term frequency–inverse document frequency), frequency vectorization, word embeddings, word encoding using transformer architectures, and the like, as known in the art. The data preparation may further include POS (Parts-Of-Speech) tagging for different languages and retention of specific types of POS for training. Such techniques for the data preparation may be contemplated by a person skilled in the art, and thus have not been described in detail herein for the brevity of the present disclosure.
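By way of a non-limiting illustration, the data cleaning, stemming and TFIDF vectorization referred to above may be sketched as follows using only the Python standard library. The stop-word list, the suffix rules and the function names clean, crude_stem and tfidf are assumptions made for this sketch; an actual implementation may rely on established text-processing libraries instead.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "be", "by", "this", "shall", "in"}


def clean(text):
    """Lowercase the text, drop punctuation and digits, and remove stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Chop a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tfidf(docs):
    """Return (vocabulary, vectors) with weights tf * log(N / df)."""
    tokenised = [[crude_stem(t) for t in clean(d)] for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenised for t in set(doc))
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors
```

In this weighting, a term that appears in every clause receives a weight of zero, so the resulting vectors emphasise the terms that distinguish one clause from another.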
[0068] Further, at step 512, the modelling pipeline 500 involves using the data (as result of above steps) for training of the clause classification model. At step 514, the modelling pipeline 500 may further involve checking a performance of the implemented (trained) clause classification model. This may be achieved by feeding the clause classification model with test text, and having legal professionals manually evaluate the output(s) therefrom. If the results are satisfactory (as represented by block 516), the clause classification model may be considered for deployment. In the present examples, the clause classification model may be deployed using cloud platforms as either server-based or serverless architectures, without any limitations. If the results are not satisfactory, the modelling pipeline 500 may move to step 518 (as shown), which involves further training of the clause classification model using supplemental data generated by using data augmentation techniques. Herein, the data augmentation may include adding new clauses manually and/or programmatically, generating new clauses using transformer functions, etc. The supplemental data generated by data augmentation (in the step 518) may be processed using data preparation techniques (as discussed in the step 510), to be further fed to the clause classification model for its training as discussed in the step 512 of the modelling pipeline 500.
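The loop of training, performance checking and augmentation (steps 512 to 518) may be sketched as follows, taking a Naive Bayes classifier as one of the listed options. This is a simplified illustration under stated assumptions: the disclosure describes manual evaluation by legal professionals, whereas the sketch substitutes an automatic accuracy check against a hypothetical threshold, and the names NaiveBayes, accuracy, train_until_satisfactory and the augment callback are introduced here for illustration only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of test clauses whose predicted type matches the expected type."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


def train_until_satisfactory(train, test, augment, threshold=0.9, max_rounds=3):
    """Train, check performance, and add augmented data while results fall short."""
    texts, labels = list(train[0]), list(train[1])
    for _ in range(max_rounds):
        model = NaiveBayes().fit(texts, labels)      # step 512
        score = accuracy(model, *test)               # step 514 (automated stand-in)
        if score >= threshold:                       # block 516
            break
        extra_texts, extra_labels = augment(texts, labels)  # step 518
        texts += extra_texts
        labels += extra_labels
    return model, score
```

The augment callback is expected to return additional clause texts and their labels, for example clauses added manually or generated programmatically, which are then included in the next training round.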
[0069] Referring back to FIG. 4, in the present embodiments, the process workflow 400 may further include implementing a co-reference resolution procedure (represented as step 408 in FIG. 4). Co-reference resolution is the task of finding all expressions that refer to the same entity in a text. In the present process workflow 400, the text is sanitized for possible co-references before entity extraction (as explained later in the description). The process workflow 400 may further include implementing a Named Entity Recognition (NER) model (represented as step 410 in FIG. 4). Named-entity recognition is a task of information extraction that seeks to locate and classify named entities mentioned in the extracted text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc., which, as may be contemplated, is required for any kind of analysis of a given contract document. In the present process workflow 400, the NER model is implemented to extract important details from the extracted relevant clauses. In particular, specific entities need to be extracted from specific clauses of interest. In the present embodiments, the process workflow 400 may utilize a combined NER model for a group of entities at a time, or separate NER models for each entity type, or a mix of both, without any limitations. The process workflow 400 may further include performing clause correction classification (represented as step 412 in FIG. 4), which is a precursor to processing of the extracted clauses. In the present embodiments, as illustrated in FIG. 4, the clause-level correction classification may be performed directly on the extracted clauses (as from the step 406), and the entity-level correction classification may be performed on the extracted clauses with resolved co-references and recognized entities (as from the step 410).
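The kind of output expected from entity extraction on a clause may be illustrated with the following sketch. It is a rule-based stand-in using regular expressions, not a trained NER model as contemplated in the present disclosure; the entity labels, the patterns and the name extract_entities are hypothetical.

```python
import re

# Hypothetical patterns for a few of the pre-defined entity categories.
PATTERNS = {
    "MONEY": re.compile(r"(?:USD|INR|\$)\s?\d[\d,]*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "DURATION": re.compile(r"\b\d+\s+(?:day|month|year)s?\b", re.IGNORECASE),
}


def extract_entities(clause_text):
    """Return (label, matched text, start offset) tuples in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(clause_text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])
```

A trained NER model would replace the patterns above while producing results of a similar shape, namely an entity category together with the matched text and its position in the clause.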
[0070] Referring to FIG. 6, illustrated is a schematic of an inference pipeline (as represented by reference numeral 600) for implementing the clause classification model. As shown, in the inference pipeline 600, at step 602, the contract document(s), which may be in the form of a PDF file, a DOCX/DOC file, etc., is first received. At step 604, the inference pipeline 600 involves extracting text from the contract document(s) by using OCR techniques, i.e., converting the text into machine readable form. At step 606, the inference pipeline 600 involves data cleaning. Herein, the data cleaning may include text processing to split the extracted text into sentences and also, if required, to prepare complete sentences from the extracted text. The data cleaning may further include removal of punctuation and stop-words from the extracted text. At step 608, the inference pipeline 600 involves data preparation. Herein, the data preparation may include data normalization, including data stemming and/or data lemmatization; data vectorization using one or more techniques, such as TFIDF (term frequency–inverse document frequency), frequency vectorization, word embeddings, word encoding using transformer architectures, and the like; POS tagging; etc. (as described in the preceding paragraphs with reference to the step 510 of the modelling pipeline 500 of FIG. 5). Further, at step 610, the inference pipeline 600 involves feeding the data (resulting from the above steps) to the clause classification model (as trained using the modelling pipeline 500 of FIG. 5). At step 612 of the inference pipeline 600, the clause classification model provides extracted clauses which may be classified as per the clause categories/types to allow for corresponding analysis thereof (as described). Herein, the clause classification model may categorize each sentence into a particular clause type. Further, in the step 612 of the inference pipeline 600, the sentences of the same clause type are then aggregated into a single paragraph to get the final list of extracted clauses from the text. It may be appreciated that additional rules may be implemented to remove false positives as per the requirements.
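By way of a non-limiting illustration, the sentence splitting, cleaning, per-sentence classification and aggregation of the steps 606 to 612 may be sketched as follows. The stop-word list, the keyword rules and the function names are hypothetical and are introduced only for this sketch; in particular, the keyword lookup merely stands in for the trained clause classification model and is not the model of the present disclosure.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list and keyword rules (assumptions for illustration).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "shall", "be"}
KEYWORD_RULES = {
    "confidentiality": {"confidential", "disclose"},
    "termination": {"terminate", "termination"},
}

def split_sentences(text):
    """Split extracted text into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean(sentence):
    """Lower-case the sentence, drop punctuation and stop-words (step 606)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def classify_sentence(tokens):
    """Stand-in for the clause classification model (step 610)."""
    for clause_type, keywords in KEYWORD_RULES.items():
        if keywords & set(tokens):
            return clause_type
    return None  # sentence does not belong to any clause of interest

def extract_clauses(text):
    """Aggregate sentences of the same clause type into one paragraph (step 612)."""
    grouped = defaultdict(list)
    for sentence in split_sentences(text):
        clause_type = classify_sentence(clean(sentence))
        if clause_type is not None:
            grouped[clause_type].append(sentence)
    return {ct: " ".join(sentences) for ct, sentences in grouped.items()}
```

In such a sketch, sentences of one clause type need not be adjacent in the contract document; they are joined in their order of appearance.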
[0071] The present disclosure further provides a scheme for continuous training of the utilized machine learning model for processing of the contract documents, in order to improve the model over time. That is, once deployed, the models need to be continuously re-trained/updated to improve their performance. For this purpose, the present system 100 may provide a user interface with a provision for the user to indicate whether a given model prediction is accurate or not. For example, the clause types will be verified by a competent person, who may identify the right clause type or identify whether a particular detail has been properly extracted, and then correct the prediction.
[0072] Referring to FIG. 7, illustrated is a high-level schematic of a workflow (as represented by numeral 700) for (continuous) training of the machine learning model for processing of contract documents. As shown, the workflow 700 includes three separate processes 710, 720, 730 which may occur simultaneously or non-simultaneously without any limitations. Herein, the process 710 relates to initial training of the machine learning model for development of a (base) pre-trained instance of the machine learning model (as represented by numeral 724); the process 720 relates to implementation of the pre-trained instance of the machine learning model for information extraction; and the process 730 relates to re-training the machine learning model to generate an updated instance of the machine learning model for model improvement.
[0073] As shown, in the process 710, at step 712, user inputs related to training the machine learning model are received. Herein, the user inputs may be in the form of training dataset comprising information for training the machine learning model. In the present implementation of the system 100, such user inputs may be received via the user input device 320 of the user device 300. At step 714, the process 710 includes data preparation and pre-processing. Such steps of data preparation and pre-processing have been explained in the preceding paragraphs in reference to the modelling pipeline 500 (specifically, steps 502 to 510) of FIG. 5, and thus not repeated herein for brevity of the present disclosure. Further, at step 716, the process 710 includes initial training of the machine learning model. Such step of initial training of the machine learning model has been explained in the preceding paragraphs in reference to the modelling pipeline 500 (specifically, steps 512 to 518) of FIG. 5, and thus not repeated herein for brevity of the present disclosure.
[0074] Also, as shown in FIG. 7, in the process 720, at step 722, a contract document for information extraction is received. In some embodiments, the processing unit (such as, the processing unit 205) is configured to check for duplicity of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model. That is, in order to ensure that there is no duplication of documents, the backend algorithm compares documents that are being uploaded and checks for the underlying text with existing documents and disallows the user from uploading them. This saves the effort which may have otherwise been involved in unnecessary labelling of the contract document. Further, in the process 720, at step 724, the received contract document is processed via the pre-trained instance of the machine learning model. At step 726, the process 720 includes extracting information from the received contract document. In the present implementation, the processing unit (such as, the processing unit 205) is configured to implement the pre-trained instance of the machine learning model 724 to parse the received contract document and identify one or more clause types in the received contract document. Further, at step 728, the process 720 includes generating results in the form of a list of the one or more clause types in the contract document. Such steps of implementation of the machine learning model have been explained in the preceding paragraphs in reference to the inference pipeline 600 of FIG. 6, and thus not repeated herein for brevity of the present disclosure.
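As a minimal sketch of the duplication check described above, the underlying text of an uploaded contract document may be normalized and hashed, and the upload disallowed if the same hash has been seen before. The class and function names are hypothetical, and the use of an exact SHA-256 match on normalized text is an assumption of this sketch; the present disclosure does not specify the comparison technique, and near-duplicate detection would need a similarity measure instead.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the underlying text, ignoring letter case and whitespace layout."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class DocumentRegistry:
    """Hypothetical in-memory record of documents already used for training."""

    def __init__(self):
        self._seen = set()

    def try_upload(self, text):
        """Return True if the document is accepted, False if it is a duplicate."""
        digest = fingerprint(text)
        if digest in self._seen:
            return False  # disallow upload; avoids unnecessary labelling
        self._seen.add(digest)
        return True
```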
[0075] Further, as shown in FIG. 7, in the process 730, at step 732, the contract document may be reviewed in respect of the extracted information therefrom. Herein, the process 730 may include allowing the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the text of the contract document or to edit the corresponding one of the one or more clause types with respect to the text of the contract document. At step 734, the process 730 includes utilizing the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model.
[0076] Referring to FIG. 8, illustrated is an exemplary interface 800 (as part of the interface 350) implemented to provide (display) a list of the one or more clause types in the contract document, for reference of the user. As shown, the interface 800 includes a first window 802 configured to display text of the contract document. In an embodiment, the processing unit 205 is configured to configure the interface 800 to mask confidential information in the text of the contract document being displayed, and thereby ensures that the data used for training the machine learning model is anonymized. That is, before review, the contract documents may automatically be anonymized to comply with, for example, GDPR requirements. Such confidential information may be personal information, such as party names, addresses, email IDs, phone numbers, etc. of the parties involved in the contract document, which may be identified using techniques as known in the art, and may be blurred or redacted (black-lined) from the displayed text in the first window 802. In an example, the interface 800 may provide an “Anonymize” button 804 which may trigger masking of the confidential information in the text of the contract document being displayed. In an example implementation, the interface 800 may be configured to only display the text of the contract document once the confidential information has been anonymized. The interface 800 may also provide controls 806 for navigating between different pages of the contract document. The interface 800 may also provide a “contract type” drop down menu 808 to allow the user to select a particular contract type for processing, as per requirements of training the machine learning model. The interface 800 may also provide a toggle 810 to put the current process in a training mode, or otherwise.
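By way of example only, masking of confidential information before display may be sketched with pattern matching, as below. The regular expressions, the placeholder strings and the `mask_confidential` function are assumptions made for this sketch; they are simplistic and would not reliably cover all email or phone formats, and the present disclosure leaves the identification technique open ("techniques as known in the art").

```python
import re

# Illustrative patterns only (assumptions); real documents need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_confidential(text, party_names=()):
    """Replace emails, phone numbers and known party names with placeholders."""
    masked = EMAIL.sub("[EMAIL]", text)
    masked = PHONE.sub("[PHONE]", masked)
    for name in party_names:
        # Party names are assumed to be supplied, e.g., by an NER model.
        masked = re.sub(re.escape(name), "[PARTY]", masked, flags=re.IGNORECASE)
    return masked
```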
[0077] Furthermore, the interface 800 may provide a panel 812 which may be utilized to provide a list of the one or more clause types in the contract document (as identified). In an implementation, the panel 812 may display all the clauses in the text as being displayed in the first window 802. In the present embodiments, the interface 800 is configured to allow the user to select one of the one or more clause types from the list as being displayed in the panel 812. Once a particular clause type has been selected, such as “Clause 1” in the exemplary illustration of FIG. 8, the interface 800 is configured to display text from the contract document corresponding to the selected one of the one or more clause types (i.e., “Clause 1” in present case). In an example, the interface 800 may provide a second window 814 which may display the text from the contract document corresponding to the selected clause type, for reference of the user. In another example, the interface 800 may be configured to highlight portion of the text from the contract document corresponding to the selected one of identified one or more clause types thereof, in the first window 802 itself, which may be in addition to displaying only that portion of text in the second window 814 (as discussed). It may be understood that multiple clauses of interest can be highlighted in the contract document. The resulting document may be displayed as an additional page with legends corresponding to those highlighted clauses. Further, the legends may be marked at the location they are highlighted in the document in a distinct colour for reference of the user.
[0078] Furthermore, the interface 800 may be configured for correcting and/or confirming the identified clause types in the contract document, so as to be used for re-training the machine learning model if required. Herein, features like highlighting of predictions from the machine learning model in the window, selecting new clause value or entity value by highlighting the text on the document, removing certain portions of clauses by selecting those portions on the document, etc. can be implemented to edit/correct the clauses. For this purpose, the interface 800 is configured to provide one or more options correspondent (adjacent) to each of the one or more clause types, as being displayed in the panel 812, to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text (in the second window 814) or to edit the corresponding one of the one or more clause types with respect to the displayed text (in the second window 814). Herein, the said one or more options may be (i) “OK” to confirm correctness of the corresponding clause type, and (ii) “Edit” to edit the corresponding clause type. Furthermore, the interface 800 may also provide an option via a text box 816 (or a list, and the like) to allow the user to correct the clause type of the text being displayed, such as, in case “Edit” option has been selected. Herein, the user may input, by typing (or by selecting from a list), a “correct” clause type corresponding to the text being displayed. In the present examples, the interface 800 may also allow the user to delete and/or edit text corresponding to the corrected clause type, if required. In some embodiments, the interface 800 may further be configured to allow the user to select a portion of the text being displayed and assign a clause type thereto. 
For this purpose, the user may select a portion of text, like a sentence, from the text being displayed in the first window 802, which may then be displayed in the second window 814 for confirmation, and the user may input the clause type of the selected text, e.g., in the text box 816, to assign the inputted clause type thereto.
[0079] Furthermore, the interface 800 may also provide a “Save clause” option 818 to enable the user to assign the (corrected) clause type to the text being displayed, overriding the identified clause type therefor. The interface 800 may also provide options 820 to save the selected one of the identified clauses and the corresponding text as a training dataset in the database (such as, the database 210). The interface 800 may further provide a “Train model” button 822 to use the defined training dataset for training the machine learning model. That is, herein, the processing unit 205 configures the interface 800 to utilize the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model. Further, herein, the processing unit 205 configures the interface 800 to utilize the text with the assigned clause type for re-training the machine learning model. The interface 800 may also provide a “View results” button 824 to cause implementation of the re-trained machine learning model (as generated herein) and display the results generated therefrom, i.e., the list of the one or more clause types in the contract document.
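For illustration, the manner in which confirmed and edited clause types may be turned into a training dataset can be sketched as follows. The `Review` record, its field names and the `to_training_examples` function are hypothetical; the sketch only shows that text with a confirmed clause type keeps the predicted label, while text with an edited clause type takes the corrected label.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Review:
    """One reviewed clause as captured from the interface (hypothetical schema)."""
    text: str
    predicted_type: str
    action: str                       # "ok" (confirmed) or "edit" (corrected)
    corrected_type: Optional[str] = None

def to_training_examples(reviews):
    """Build (text, clause_type) pairs for re-training from reviewed clauses."""
    examples = []
    for review in reviews:
        if review.action == "ok":
            examples.append((review.text, review.predicted_type))
        elif review.action == "edit" and review.corrected_type:
            examples.append((review.text, review.corrected_type))
        # Records without a usable label are left out of the training dataset.
    return examples
```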
[0080] Referring to FIG. 9, illustrated is an exemplary interface 900 (as part of the interface 350) implemented for deployment of one of the instances of the machine learning model. As shown, the interface 900 provides an option 902 to choose two or more instances of the machine learning model to be compared with each other. In the illustrated example, the interface 900 has been shown to allow for comparison of two instances of the machine learning model. Herein, the processing unit 205 is configured to configure the interface 900 to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document. The two chosen instances of the machine learning model, such as the pre-trained instance of the machine learning model and the updated instance of the machine learning model, as utilized may be implemented to generate corresponding results for processing of the sample contract document. Herein, the processing unit 205 is configured to implement the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document, and implement the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document.
[0081] Further, the processing unit 205 is configured to configure the interface 900 to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document. As shown in FIG. 9, the interface 900 provides a first window 904 which may display comparison of confusion matrix for the generated results of the two chosen instances of the machine learning model. Further, the interface 900 provides a second window 906 which may display comparison of classification reports for the generated results of the two chosen instances of the machine learning model. Such comparison may allow the user to compare and review performance of the two chosen instances of the machine learning model, and consequently be able to select one of the two chosen instances of the machine learning model for deployment.
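As a non-limiting sketch, the confusion matrix and the per-class figures of a classification report for one instance of the machine learning model may be computed as below, and the same functions may be applied to the results of each of the two chosen instances for comparison. The function names are assumptions of this sketch, and the reference clause types (`y_true`) are assumed to come from the reviewed sample contract document.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count occurrences of each (actual, predicted) clause type pair."""
    return Counter(zip(y_true, y_pred))

def per_class_report(y_true, y_pred):
    """Precision and recall per clause type, as in a classification report."""
    cm = confusion_matrix(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_positive = cm[(label, label)]
        predicted = sum(n for (_, p), n in cm.items() if p == label)
        actual = sum(n for (t, _), n in cm.items() if t == label)
        report[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return report

def accuracy(y_true, y_pred):
    """Fraction of sentences assigned the correct clause type."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```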
[0082] Herein, the processing unit 205 is configured to configure the interface 900 to authorize the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model. As shown in FIG. 9, the interface 900 provides a dialog box 908 to allow the user to select one of the two chosen instances of the machine learning model for deployment. In some examples, the interface 900 may further provide a third window 910 which may display details of the selected instance of the machine learning model for reference of the user. The interface 900 may further provide a text box 912 to optionally assign a version number to the selected instance of the machine learning model (which may otherwise be automatically assigned). The interface 900 may further provide a “Save model” button 914 which may be used to configure the processing unit 205 to store the updated instance of the machine learning model as the machine learning model in the database 210.
[0083] Referring to FIG. 10, illustrated is a simplified schematic of a continuous training framework (as represented by reference numeral 1000) for a machine learning model for processing of contract documents, as described. As shown, at step 1002, the continuous training framework 1000 involves generating the prediction using the utilized machine learning model. Herein, the machine learning model in perspective is the clause classification model, as utilized. At step 1004, the continuous training framework 1000 involves receiving user input(s) to correct the prediction(s), if required. At step 1006, the continuous training framework 1000 involves retraining of the utilized model using the user input(s) (representative of correct prediction(s)). At step 1008, the continuous training framework 1000 may further involve verification of the re-trained model for performance improvement. This step 1008 may be carried out manually by a legal professional (as involved). If the re-trained model is verified to have improved performance, at step 1010, such re-trained (improved) model is deployed, replacing the corresponding existing model, for further implementation as per embodiments of the present disclosure. In an example, the continuous training framework 1000 may be used to continuously re-train models as scheduled, for example once every week, once every month, etc. If the re-trained model provides better results (based on verification), it may automatically replace the corresponding existing model; otherwise, the existing model continues to be used.
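The retrain, verify and deploy logic of the steps 1006 to 1010 may, for instance, be sketched as follows. The model here is a toy lookup table and the verification is an accuracy score on a hold-out set; both are assumptions made purely so that the sketch is self-contained, since in the present disclosure the model is the clause classification model and the verification may be carried out manually.

```python
def predict(model, text):
    """Toy model (assumption): a lookup table from text to clause type."""
    return model.get(text, "unknown")

def retrain(model, corrections):
    """Step 1006: return a new model that also incorporates user corrections."""
    updated = dict(model)
    updated.update(corrections)
    return updated

def evaluate(model, holdout):
    """Step 1008: fraction of hold-out examples labelled correctly."""
    hits = sum(predict(model, text) == label for text, label in holdout)
    return hits / len(holdout)

def training_cycle(deployed, corrections, holdout):
    """Step 1010: deploy the re-trained model only if it performs better."""
    candidate = retrain(deployed, corrections)
    if evaluate(candidate, holdout) > evaluate(deployed, holdout):
        return candidate
    return deployed  # otherwise the existing model continues to be used
```

A scheduled job (for example weekly or monthly) could call `training_cycle` with the corrections accumulated since the previous run.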
[0084] In the present embodiments, the processing unit 205 is further configured to configure the interface 350 to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type (as described in reference to FIG. 8), and a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model (as described in reference to FIG. 9). Herein, the first-level reviewer may have control to load contract documents into the framework, review them and save them. Once saved, the clause values / entity values get overwritten in the production database (such as, the database 210 that serves the production application). The encrypted reviewed information saved by the reviewer will be accessible to the second-level reviewer for training and deployment of the machine learning models. Herein, the second-level reviewer may have control to review documents (correct extracted clauses, other critical information useful for search engine), train models based on reviewed data and deploy them. The first-level reviewer may also have control to schedule automatic model training and deployment. In the present implementations, usually, the first-level reviewer is different from the second-level reviewer. However, in other implementations, the first-level reviewer may be the same as the second-level reviewer without departing from the spirit and the scope of the present disclosure.
[0085] In the present embodiments, the processing unit 205 is further configured to implement the machine learning model to process a set of contract documents containing contract documents of different contract types and containing different clause types. That is, the processing unit 205 is configured for automatic extraction of clauses through continuous (legal) scraping of open-source documents using keyword searches on databases. The processing unit 205 is further configured to determine a frequency of occurrence of one of the different clause types vis-à-vis one of the different contract types, and generate a library containing relevant different clause types for each of the different contract types based on the determined frequency of occurrence. Herein, for a particular type of clause, a set of important keywords is created. Then, open-source databases, such as Edgar, are scraped for occurrence of this set of words, in which simultaneous occurrence of some or most of the keywords indicates a potential clause description. Once a sizeable number of such clauses has been extracted, they may be manually reviewed for correctness. A temporary clause classification model is created for the clause type, and multiple such models may be built over time.
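As a minimal sketch of the keyword co-occurrence test described above, a paragraph may be flagged as a potential clause description when a sufficient share of the keywords for a clause type occurs in it together. The function names and the `min_fraction` threshold of one half are assumptions of this sketch; the present disclosure only states that "some or most" of the keywords should co-occur.

```python
import re

def keyword_hits(paragraph, keywords):
    """Return the subset of keywords that occur as whole words in the paragraph."""
    tokens = set(re.findall(r"[a-z]+", paragraph.lower()))
    return {k for k in keywords if k in tokens}

def candidate_clauses(paragraphs, keywords, min_fraction=0.5):
    """Keep paragraphs in which at least min_fraction of the keywords co-occur."""
    threshold = min_fraction * len(keywords)
    return [p for p in paragraphs if len(keyword_hits(p, keywords)) >= threshold]
```

The candidates so obtained would still be subject to the manual review for correctness mentioned above.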
[0086] In one or more embodiments, the processing unit is further configured to configure the interface to receive an input from the user indicating drafting of a particular contract type; and utilize the generated library to suggest relevant clause types for the said particular contract type. It may be appreciated by a person skilled in the art that there are multiple ways of writing a good clause. A user may need to have all such set of clauses in a library in order to ensure that new clauses could be identified seamlessly from a newly drafted document. For continuous improvement of the recommendations for clauses, it is important to create and add the new clauses into the clause library. The clause library consists of a list of clauses relevant to a particular type of contract document.
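By way of illustration, the generation of a clause library from the frequency of occurrence of clause types per contract type, and its use for suggesting clause types during drafting, may be sketched as follows. The function names, the input format and the `min_frequency` cut-off are hypothetical; frequency is taken here as the share of documents of a contract type in which the clause type appears, which is one possible reading of the frequency referred to above.

```python
from collections import Counter, defaultdict

def build_clause_library(documents, min_frequency=0.5):
    """documents: iterable of (contract_type, clause_types) pairs."""
    doc_counts = Counter()
    clause_counts = defaultdict(Counter)
    for contract_type, clause_types in documents:
        doc_counts[contract_type] += 1
        clause_counts[contract_type].update(set(clause_types))
    library = {}
    for contract_type, counts in clause_counts.items():
        total = doc_counts[contract_type]
        # Most frequent clause types first; ties broken alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        library[contract_type] = [c for c, n in ranked if n / total >= min_frequency]
    return library

def suggest_clauses(library, contract_type):
    """Suggest relevant clause types for the contract type being drafted."""
    return library.get(contract_type, [])
```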
[0087] In one or more embodiments, the processing unit is further configured to configure the interface to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and train the machine learning model to determine entities complementary to the defined entity in a target contract document. It may be understood that when documents are uploaded into a smart repository, tags need to be manually set. If hundreds of documents are being uploaded at a time, setting up tags is a painful process. Since identifying tags, document types, effective dates, renewal dates, etc. is essential to the functioning of the smart repository, automation of the process of tag identification using the machine learning model would be helpful. Herein, an NER based algorithm is used to extract tags for the uploaded documents. Some tags of interest are: party1, party2 and effective date, which are available in the party clause of a contract; and contract type, which may, for example, be available in the heading of the document and the term/termination clause of the contract document. The proposed process of dynamic review and the present machine learning model training platform may be used to train the tag extraction model, which may further be fine-tuned over time. Firstly, the entities are identified by a reviewer in corresponding clauses. That is, the first step is to train a model with some data. Ideally, the user may need to identify where the entity can be found in the entire document, and use that paragraph only for training the machine learning model. The reviewer may identify required information for 20-30 such clauses, and the training process is initiated to train the model. Then, a prediction module may identify hidden information from the remaining thousands of clauses. A list of such paragraphs across multiple documents may then be used for training the model. In an example, separate models may be created to identify each or a group of these entities by training the corresponding NER models. The NER algorithm is trained by using data with tags in an IOB form (short for inside, outside, beginning), extracted from various documents. The machine learning model may be trained using a variety of models that are available in the open domain. The process can be iterated multiple times till a desired level of accuracy is achieved. Once the models are trained, tag extraction is carried out on relevant clauses using the inference pipeline. The users can extract information of their interest using this framework. This is an intuitive and quick way of enabling users to extract custom information from text data.
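For reference, the IOB form mentioned above marks the first token of an entity with a "B-" tag, subsequent tokens of the same entity with an "I-" tag, and all other tokens with "O". A minimal sketch of converting reviewer-identified entity spans into IOB-tagged training data is given below; the `to_iob` function and the span format (token indices with an exclusive end) are assumptions of this sketch.

```python
def to_iob(tokens, spans):
    """spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * len(tokens)            # outside any entity by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # inside the entity
    return list(zip(tokens, tags))
```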
[0088] The present disclosure also relates to a method for training a machine learning model for processing of contract documents. Various embodiments and variants disclosed above, with respect to the aforementioned system 100 as per the first aspect, apply mutatis mutandis to the present method.
[0089] FIG. 11 illustrates a flowchart listing steps involved in a method 1100 for training a machine learning model for processing of contract documents, in accordance with one or more exemplary embodiments of the present disclosure. Herein, at step 1102, the method 1100 comprises implementing a pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document. At step 1104, the method 1100 further comprises configuring an interface 350 to provide a list of the one or more clause types in the contract document. At step 1106, the method 1100 further comprises configuring the interface 350 to allow a user to select one of the one or more clause types from the list. At step 1108, the method 1100 further comprises configuring the interface 350 to display text from the contract document corresponding to the selected one of the one or more clause types. At step 1110, the method 1100 further comprises configuring the interface 350 to provide one or more options correspondent to each of the one or more clause types to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text or to edit the corresponding one of the one or more clause types with respect to the displayed text. At step 1112, the method 1100 further comprises utilizing the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model.
[0090] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to allow the user to select a portion of the text being displayed and assign a clause type thereto; and utilizing the text with the assigned clause type for re-training the machine learning model.
[0091] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document; implementing the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document; implementing the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document; configuring the interface 350 to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document; and authorizing the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model.
[0092] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type; and configuring the interface to allow a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model. Herein, the first-level reviewer is different from the second-level reviewer.
[0093] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to mask confidential information in the text of the contract document being displayed.
[0094] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to highlight portion of the text from the contract document corresponding to the selected one of identified one or more clause types thereof.
[0095] In one or more embodiments, the method 1100 further comprises checking for duplicity of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model.
[0096] In one or more embodiments, the method 1100 further comprises implementing the machine learning model to process a set of contract documents containing contract documents of different contract types and containing different clause types, and determine a frequency of occurrence of one of the different clause types vis-à-vis one of the different contract types; and generating a library containing relevant different clause types for each of the different contract types based on the determined frequency of occurrence.
[0097] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to receive an input from the user indicating drafting of a particular contract type; and utilizing the generated library to suggest relevant clause types for the said particular contract type.
[0098] In one or more embodiments, the method 1100 further comprises configuring the interface 350 to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and training the machine learning model to determine entities complementary to the defined entity in a target contract document.
[0099] The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. In particular, the elements shown in FIGS. 4-10 include blocks, each of which can be at least one of a hardware device, or a combination of a hardware device and a software module, and which may be implemented for training a machine learning model for processing of contract documents.
[00100] Thereby, the present disclosure provides systems and methods to be used for training a machine learning model for processing of contract documents. In particular, the present disclosure provides a solution to the problem of continuous training of machine learning models, keeping them updated so as to address the performance deficit of machine learning models in information extraction from contract documents. Continuous training is an aspect of machine learning operations that automatically and continuously retrains machine learning models to adapt to changes in the data before they are redeployed. The trigger for a re-build can be a data change, a model change, or a code change.
[00101] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the present disclosure.
Claims: WE CLAIM:
1. A system for training a machine learning model for processing of contract documents, the system comprising:
an interface;
a database having the machine learning model as a pre-trained instance of the machine learning model stored therein; and
a processing unit configured to:
implement the pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document;
configure the interface to:
provide a list of the one or more clause types in the contract document;
allow a user to select one of the one or more clause types from the list;
display text from the contract document corresponding to the selected one of the one or more clause types; and
provide one or more options corresponding to each of the one or more clause types to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text or to edit the corresponding one of the one or more clause types with respect to the displayed text;
utilize the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model; and
store the updated instance of the machine learning model as the machine learning model in the database.
2. The system as claimed in claim 1, wherein the processing unit is further configured to:
configure the interface to allow the user to select a portion of the text being displayed and assign a clause type thereto; and
utilize the text with the assigned clause type for re-training the machine learning model.
3. The system as claimed in claim 1, wherein the processing unit is further configured to:
configure the interface to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document;
implement the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document;
implement the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document;
configure the interface to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document; and
authorize the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model.
4. The system as claimed in claim 3, wherein the processing unit is further configured to:
configure the interface to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type; and
configure the interface to allow a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model,
wherein the first-level reviewer is different from the second-level reviewer.
5. The system as claimed in claim 1, wherein the processing unit is further configured to configure the interface to mask confidential information in the text of the contract document being displayed.
6. The system as claimed in claim 1, wherein the processing unit is further configured to configure the interface to highlight a portion of the text from the contract document corresponding to the selected one of the identified one or more clause types.
7. The system as claimed in claim 1, wherein the processing unit is further configured to check for duplicity of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model.
8. The system as claimed in claim 1, wherein the processing unit is further configured to:
configure the interface to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and
train the machine learning model to determine entities complementary to the defined entity in a target contract document.
9. A method for training a machine learning model for processing of contract documents, the method comprising:
implementing a pre-trained instance of the machine learning model to parse a contract document and identify one or more clause types in the contract document;
configuring an interface to:
provide a list of the one or more clause types in the contract document;
allow a user to select one of the one or more clause types from the list;
display text from the contract document corresponding to the selected one of the one or more clause types; and
provide one or more options corresponding to each of the one or more clause types to allow the user to either confirm correctness of the corresponding one of the one or more clause types with respect to the displayed text or to edit the corresponding one of the one or more clause types with respect to the displayed text; and
utilizing the text with corresponding confirmed clause type and the text with corresponding edited clause type for re-training the machine learning model to generate an updated instance of the machine learning model.
10. The method as claimed in claim 9 further comprising:
configuring the interface to allow the user to select a portion of the text being displayed and assign a clause type thereto; and
utilizing the text with the assigned clause type for re-training the machine learning model.
11. The method as claimed in claim 9 further comprising:
configuring the interface to authorize the user to compare the updated instance of the machine learning model with the pre-trained instance of the machine learning model using a sample contract document;
implementing the updated instance of the machine learning model to parse the sample contract document and identify one or more first clause types in the sample contract document;
implementing the pre-trained instance of the machine learning model to parse the sample contract document and identify one or more second clause types in the sample contract document;
configuring the interface to display text from the sample contract document, at least one of the one or more first clause types corresponding to the displayed text from the sample contract document, and at least one of the one or more second clause types corresponding to the displayed text from the sample contract document; and
authorizing the user to deploy either the updated instance of the machine learning model or the pre-trained instance of the machine learning model.
12. The method as claimed in claim 11 further comprising:
configuring the interface to allow a first-level reviewer as the user to select one of the one or more clause types, and to confirm the correctness and/or edit the clause type; and
configuring the interface to allow a second-level reviewer as the user to authorize the comparison and the deployment of an instance of the machine learning model,
wherein the first-level reviewer is different from the second-level reviewer.
13. The method as claimed in claim 9 further comprising configuring the interface to mask confidential information in the text of the contract document being displayed.
14. The method as claimed in claim 9 further comprising configuring the interface to highlight a portion of the text from the contract document corresponding to the selected one of the identified one or more clause types.
15. The method as claimed in claim 9 further comprising checking for duplicity of the contract document being utilized for training the machine learning model by comparing the contract document to existing contract documents having previously been used for training the machine learning model.
16. The method as claimed in claim 9 further comprising:
configuring the interface to allow the user to define an entity from the displayed text from the contract document corresponding to the selected one of the one or more clause types; and
training the machine learning model to determine entities complementary to the defined entity in a target contract document.
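By way of a non-limiting illustration only, the collection of confirmed and edited clause types for re-training, and the check for previously used contract documents, may be sketched as follows. The sketch assumes that each reviewed clause is held as a simple record and that a repeated contract document is detected by an exact match on whitespace- and case-normalised text; the names `ClauseReview`, `to_training_examples`, and `is_duplicate` are hypothetical, and the claims do not prescribe this comparison technique.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClauseReview:
    """Hypothetical record of one clause shown to the user on the interface."""
    text: str                           # clause text displayed from the contract document
    predicted_type: str                 # clause type identified by the pre-trained instance
    edited_type: Optional[str] = None   # None means the user confirmed the prediction


def to_training_examples(reviews: list[ClauseReview]) -> list[tuple[str, str]]:
    """Pair each clause text with its confirmed or edited clause type."""
    return [(r.text, r.edited_type or r.predicted_type) for r in reviews]


def is_duplicate(document_text: str, seen_hashes: set[str]) -> bool:
    """Report whether a document was already used for training.

    Assumption: an exact match on normalised text; near-duplicate
    detection would need a different comparison.
    """
    normalised = " ".join(document_text.lower().split())
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

In such a sketch, the list returned by `to_training_examples` would be supplied to whatever re-training routine produces the updated instance of the machine learning model, and documents for which `is_duplicate` returns true would be skipped.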
| # | Name | Date |
|---|---|---|
| 1 | 202241073796-Proof of Right [20-12-2022(online)].pdf | 2022-12-20 |
| 2 | 202241073796-FORM FOR STARTUP [20-12-2022(online)].pdf | 2022-12-20 |
| 3 | 202241073796-FORM FOR SMALL ENTITY(FORM-28) [20-12-2022(online)].pdf | 2022-12-20 |
| 4 | 202241073796-FORM FOR SMALL ENTITY [20-12-2022(online)].pdf | 2022-12-20 |
| 5 | 202241073796-FORM 1 [20-12-2022(online)].pdf | 2022-12-20 |
| 6 | 202241073796-FIGURE OF ABSTRACT [20-12-2022(online)].pdf | 2022-12-20 |
| 7 | 202241073796-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [20-12-2022(online)].pdf | 2022-12-20 |
| 8 | 202241073796-EVIDENCE FOR REGISTRATION UNDER SSI [20-12-2022(online)].pdf | 2022-12-20 |
| 9 | 202241073796-DRAWINGS [20-12-2022(online)].pdf | 2022-12-20 |
| 10 | 202241073796-DECLARATION OF INVENTORSHIP (FORM 5) [20-12-2022(online)].pdf | 2022-12-20 |
| 11 | 202241073796-COMPLETE SPECIFICATION [20-12-2022(online)].pdf | 2022-12-20 |
| 12 | 202241073796-Request Letter-Correspondence [13-02-2023(online)].pdf | 2023-02-13 |
| 13 | 202241073796-Form 1 (Submitted on date of filing) [13-02-2023(online)].pdf | 2023-02-13 |
| 14 | 202241073796-Covering Letter [13-02-2023(online)].pdf | 2023-02-13 |
| 15 | 202241073796-FORM-26 [16-02-2023(online)].pdf | 2023-02-16 |