Abstract: A method and system are provided for optimal compute infrastructure utilization for in silico analysis in the life science domain in a High Performance Computing (HPC) environment. The method comprises characterizing a plurality of applications to be used with reference to the infrastructure available for the life science application analysis workflow; profiling the workflow to distinguish between the plurality of applications; harnessing utilization cycles of the infrastructure available for the life science application analysis workflow; assigning the available infrastructure to at least one application out of the plurality of applications; and, if required, automatically preparing the cloud environment and shifting the load to the cloud. Further, the present application also provides a method and system for analyzing life science data and enabling an automatic analysis workflow for managing life science data, including storage, archival, retrieval and search.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR OPTIMAL COMPUTE INFRASTRUCTURE UTILIZATION FOR WORKFLOW EXECUTION IN LIFE SCIENCE
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[0001] The present application claims priority from Indian provisional specification no. 3761/MUM/2015 filed on 3rd October, 2015, the complete disclosure of which, in its entirety, is herein incorporated by reference.
FIELD OF THE INVENTION
[0002] The present application generally relates to the field of workflow management in a high performance computing (HPC) environment. More particularly, but not specifically, the invention provides a method and system for optimally utilizing available infrastructure for in silico analysis for life science and workflow automation in a High Performance Computing (HPC) environment.
BACKGROUND OF THE INVENTION
[0003] A typical in silico analysis workflow in a High Performance Computing (HPC) environment can be highly complex. The HPC environment also involves a plurality of life science applications. These applications require a huge amount of infrastructure, with precision and accuracy, for their operation. The optimal utilization of the available infrastructure is always a challenge. For example, a Next Generation Sequencing (NGS) analysis workflow may involve data running into petabytes, which may be multi-located and may have to be physically shipped. In NGS analysis, the alignment step is compute, data and memory intensive, involving multi-billion read alignment; the subsequent step is annotation, which operates on large datasets of SAM/BAM/VCF files.
[0004] Similarly, in another example, a high throughput virtual screening (HTVS) workflow will have millions or billions of input files which will be subjected to molecular docking. Molecular docking generates a number of result files corresponding to each input file. The HTVS workflow is an embarrassingly parallel application; hence individual jobs can be run in parallel. Unlike the NGS workflow, the HTVS workflow does not generate huge datasets; however, it requires infrastructure with good I/O characteristics and a large number of parallel computing units to process millions or billions of small molecules. Each input file has to be submitted as an individual job that runs in parallel. Ultimately, the visualization step should enable one to view the data with minimum data movement.
[0005] A system supporting in silico analysis for life science should be capable of supporting elastic computation and of combining serial and parallel execution of a plurality of tasks; the absence of such capability may lead to suboptimal use of the available computational infrastructure resources. As data management is a challenge, a system supporting in silico analysis for life science should also accommodate variable data sizes across the workflow and ensure efficient storage and retrieval of metadata.
[0006] The prior art literature has never explored a problem-based model for optimal infrastructure utilization for an in silico analysis for life science workflow. Some of the prior art literature describes automation of the workflow only for Next Generation Sequencing (NGS) analysis. However, none of the prior art literature discloses a generic method or system for optimal infrastructure utilization for life science computational workflows, or for efficient storage, archival, retrieval and management of data which may run into petabytes. Thereby, optimal infrastructure utilization for life science computational workflow automation in an HPC environment is still considered one of the biggest challenges of the technical domain.
OBJECTIVES OF THE INVENTION
[0007] In accordance with the present disclosure, the primary objective is to provide a method and system for optimal compute infrastructure utilization for life science computational workflow automation.
[0008] Another objective of the disclosure is to provide a method and system for providing problem based model for optimal compute infrastructure utilization for life science applications workflow automation in a High Performance Computing (HPC) environment.
[0009] Yet another objective of the disclosure is to provide a method and system for analyzing life science data.
[0010] Still another objective of the disclosure is to provide a method and system for managing life science data, enabling efficient storage, archival, and retrieval of data pertaining to the representative computational workflows.
[0011] Still another objective of the disclosure is to provide a method and system for minimizing user specific errors while handling Life science data.
[0012] Still another objective of the disclosure is to provide a method and system for providing problem based model for optimal compute infrastructure utilization for Next Generation Sequencing (NGS) analysis workflow automation in a High Performance Computing (HPC) environment.
[0013] Still another objective of the disclosure is to provide a method and system for optimal compute infrastructure utilization for High Throughput Virtual Screening (HTVS) workflow using molecular docking technique.
[0014] Other objects and advantages of the present disclosure will be more apparent from the following description when read in conjunction with the accompanying figures, which are not intended to limit the scope of the present disclosure.
SUMMARY OF THE INVENTION
[0015] The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below.
[0016] In view of the foregoing, an embodiment herein provides a system for optimal compute infrastructure utilization for a life science application. The system comprises an application characterization module, a cloud infrastructure preparation module, a workflow profiling module, a utilization cycle harnessing module, an infrastructure assignment module, a life science data analytical module and a life science data management module. The application characterization module characterizes a plurality of applications to be used for the life science application workflow corresponding to life science data. The cloud infrastructure preparation module prepares cloud infrastructure environment based on the life science data if the plurality of applications are executed on the cloud infrastructure. The workflow profiling module profiles the life science application workflow to distinguish between the plurality of applications, if the plurality of applications are executed on at least one of a local infrastructure, a server or available infrastructure. The utilization cycle harnessing module harnesses the utilization cycles of the infrastructure available for the life science application workflow. The infrastructure assignment module dynamically assigns infrastructure available for the life science application workflow to at least one application out of the plurality of applications based on the profiling of the life science application workflow. The life science data analytical module analyzes the life science data. The life science data management module manages the life science data by enabling efficient storage, archival and retrieval and search of the life science data.
[0017] Another embodiment provides a method for optimal compute infrastructure utilization for a life science application. Initially, a plurality of applications to be used for the life science application workflow is characterized corresponding to life science data. At the next step, it is checked by the processor whether the plurality of applications is to be executed on a cloud infrastructure. At the next step, the cloud infrastructure environment is prepared based on the life science data if the plurality of applications are to be executed on the cloud infrastructure. At the next step, it is checked by the processor whether the plurality of applications is to be executed on at least one of a local infrastructure, a server or available infrastructure. At the next step, the life science application workflow is profiled to distinguish between the plurality of applications, if the plurality of applications are executed on at least one of a local infrastructure, a server or available infrastructure. At the next step, the utilization cycles of the infrastructure available for the life science application workflow are harnessed. At the next step, the infrastructure available for the life science application workflow is dynamically assigned to at least one application out of the plurality of applications based on the profiling of the life science application workflow. At the next step, the life science data is analyzed. And finally, the life science data is managed by enabling efficient storage, archival, retrieval and search of the life science data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
[0019] Fig. 1 is a flowchart illustrating the steps involved for optimal compute infrastructure utilization for a life science application, in accordance with an embodiment of the disclosure;
[0020] Fig. 2 shows a block diagram illustrating a system architecture for optimal compute infrastructure utilization for the life science application, in accordance with an embodiment of the disclosure;
[0021] Fig. 3 is a flowchart illustrating the steps involved for optimal compute infrastructure utilization for next generation sequencing analysis, in accordance with an embodiment of the disclosure;
[0022] Fig. 4 shows a block diagram illustrating the steps for RNA – sequence analysis (next generation sequencing analysis), in accordance with an embodiment of the disclosure;
[0023] Fig. 5 is a flowchart illustrating the steps involved for optimal compute infrastructure utilization for a high throughput virtual screening workflow, in accordance with an embodiment of the disclosure;
[0024] Fig. 6 shows a block diagram illustrating system architecture for optimal compute infrastructure utilization for high throughput virtual screening workflow, in accordance with an embodiment of the disclosure; and
[0025] Fig. 7 shows a flowchart illustrating the steps for molecular docking for high throughput virtual screening workflow, in accordance with an embodiment of the disclosure.
DETAILED DESCRIPTION OF THE INVENTION
[0026] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0027] Referring now to the drawings, and more particularly to Fig. 1, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0028] Referring to Fig. 1, a flowchart 100 illustrates a method for optimal compute infrastructure utilization for a life science application. The process starts at step 102, where a plurality of applications to be used with reference to the infrastructure available for the life science application workflow is characterized corresponding to life science data. At step 104, it is checked by the processor (202) whether the plurality of applications is to be executed on a cloud infrastructure. At step 106, a cloud infrastructure environment is prepared based on the life science data if the plurality of applications are to be executed on the cloud infrastructure. At step 108, it is checked whether the plurality of applications is to be executed on at least one of a local infrastructure, a server or available infrastructure. At step 110, the life science application workflow is profiled to distinguish between the plurality of applications. At step 112, utilization cycles of the infrastructure available for the life science application analysis workflow are harnessed. At step 114, the infrastructure available for the life science application analysis workflow is dynamically assigned to at least one application out of the plurality of applications based on the profiling of the life science application workflow. At step 116, the life science data is analyzed. The process ends at step 118, where the life science data is managed by enabling efficient storage, archival, retrieval and search of the life science data.
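The decision flow of steps 102-118 above may be sketched as follows. This is a minimal, hypothetical illustration; every function name and the simple "cpu_heavy" characterization are assumptions made for clarity, not part of the claimed implementation.

```python
# A minimal sketch of the decision flow of steps 102-118 in Fig. 1.
# Every function and value here is a hypothetical stand-in for the
# corresponding module of the system; none is prescribed by the specification.

def characterize(app):
    # Step 102: record resource needs per application (illustrative values).
    return {"cpu_heavy": app in ("alignment", "docking")}

def run_workflow(apps, data, use_cloud):
    characterized = {app: characterize(app) for app in apps}       # step 102
    if use_cloud:                                                  # step 104
        target = "cloud"                                           # step 106: prepare cloud env
    else:                                                          # step 108
        # step 110: profile the workflow to distinguish the applications
        heavy = [a for a, c in characterized.items() if c["cpu_heavy"]]
        # steps 112-114: harness idle cycles, assign cluster to heavy apps
        target = "cluster" if heavy else "workstation"
    # steps 116-118: analyze, then store/archive/retrieve results
    return {"target": target, "analyzed": len(data)}

print(run_workflow(["alignment", "annotation"], ["s1", "s2"], use_cloud=False))
# → {'target': 'cluster', 'analyzed': 2}
```

The cloud branch (steps 104-106) and the local branch (steps 108-114) are mutually exclusive in this sketch, mirroring the check performed by the processor (202).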
[0029] Referring to Fig. 2, a block diagram illustrates a system architecture for optimal compute infrastructure utilization for the life science application. In an embodiment of the present invention, a system (200) is provided for optimal compute infrastructure utilization for an in silico life science analysis workflow in a High Performance Computing (HPC) environment.
[0030] The system (200) for optimal compute infrastructure utilization for an in silico life science analysis workflow in a High Performance Computing (HPC) environment comprises a processor (202); a data bus coupled to said processor (202); a memory (204); and a computer-usable medium embodying computer code, said computer-usable medium being coupled to said data bus, said computer program code comprising instructions executable by said processor (202) and configured for operating an application characterization module (206); a cloud infrastructure preparation module (208); a workflow profiling module (210); a utilization cycle harnessing module (212); an infrastructure assignment module (214); a life science data analytical module (216); and a life science data management module (218).
[0031] In an embodiment of the present disclosure, the application characterization module (206) is adapted for characterizing a plurality of applications to be used with reference to the infrastructure available for the in silico analysis in life science workflow. Characterization of the plurality of applications to be used with reference to the infrastructure available for the analysis workflow may be done manually or automatically. The infrastructure available for the in silico life science analysis workflow is selected from a group comprising, but not limited to, CPU, memory, network, server, cloud servers and input/output.
[0032] In an embodiment of the present disclosure, the cloud infrastructure preparation module (208) is configured to prepare the cloud infrastructure environment based on the life science data if the processor (202) checks that the plurality of applications are to be executed on the cloud infrastructure. In another embodiment of the disclosure, the cloud infrastructure preparation module (208) may also check if the plurality of applications needs to be partially executed on the cloud infrastructure.
[0033] In an embodiment of the present disclosure, the life science workflow profiling module (210) is adapted for profiling of the life science application analysis workflow for distinguishing between the plurality of applications based on respective resource requirements for the analysis workflow. The profiling of the workflow for distinguishing between the plurality of applications based on respective resource requirements for the life science application may be done manually or automatically.
[0034] In another embodiment of the present disclosure, the utilization cycle harnessing module (212) is adapted for harnessing utilization cycles of the infrastructure available for the in silico life science application analysis workflow. The utilization cycle of the infrastructure available for the analysis workflow may comprise CPU cycles of a plurality of idle workstations along with at least one corresponding cluster.
[0035] In another embodiment of the present disclosure, the infrastructure assignment module (214) is adapted for assigning the infrastructure available for the in silico life science application analysis workflow, wherein at least one cluster is assigned to at least one intensive application out of the plurality of applications pertaining to the analysis workflow, and at least one workstation is assigned to at least one lighter application out of the plurality of applications pertaining to the analysis workflow.
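The assignment policy of module (214) — clusters for intensive applications, idle workstations for lighter ones — can be sketched as a simple heuristic. The threshold value and the "est_cpu_hours" resource field are illustrative assumptions, not prescribed by the specification.

```python
# Hedged sketch of the assignment policy of the infrastructure assignment
# module (214): intensive applications go to a cluster, lighter ones to an
# idle workstation. Threshold and field names are hypothetical.

CPU_HOURS_THRESHOLD = 100  # assumed cut-off between "intensive" and "lighter"

def assign(applications):
    """Map each application name to 'cluster' or 'workstation' by estimated load."""
    return {
        app["name"]: "cluster" if app["est_cpu_hours"] >= CPU_HOURS_THRESHOLD
        else "workstation"
        for app in applications
    }

apps = [
    {"name": "tophat_alignment", "est_cpu_hours": 5000},  # compute intensive
    {"name": "cufflinks",        "est_cpu_hours": 4},     # lighter step
]
print(assign(apps))  # → {'tophat_alignment': 'cluster', 'cufflinks': 'workstation'}
```

In practice the estimates would come from the profiling performed by the workflow profiling module (210) rather than being supplied by hand.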
[0036] In another embodiment of the present disclosure, the life science data analytical module (216) is adapted for analyzing life science data.
[0037] In another embodiment of the present disclosure, the life science data management module (218) is adapted for enabling efficient storage, archival, retrieval and search of data pertaining to the life science data. The data management module (218) is further adapted for tagging the life science data for validation and faster access, wherein the life science application data may be accessed selectively and shared accordingly. The data may be archived during runtime. The data management module (218) is furthermore adapted for tracking the origin and processing history of the life science data. Thereby, the data management module (218) increases the user's efficiency and minimizes user specific errors while handling life science data.
[0038] According to an embodiment of the disclosure, the system (200) can be explained with the help of the example of a Next Generation Sequencing (NGS) analysis workflow using an RNA sequence analysis workflow and a High Throughput Virtual Screening (HTVS) workflow using molecular docking.
[0039] Referring to Fig. 3, a flowchart 300 illustrates the method for optimal compute infrastructure utilization for a next generation RNA sequencing analysis workflow. The process starts at step 302, where a plurality of applications to be used with reference to the infrastructure available for the Next Generation Sequencing (NGS) analysis workflow is characterized. At step 304, the NGS analysis workflow is profiled to distinguish between the plurality of applications. At step 306, utilization cycles of the infrastructure available for the NGS analysis workflow are harnessed. At step 308, the infrastructure available for the NGS analysis workflow is assigned to the at least one application out of the plurality of applications. At step 310, the NGS data is analyzed. The process ends at step 312, where the NGS data is managed by enabling efficient storage, archival, retrieval and search of the NGS data.
[0040] In an exemplary embodiment of the present invention, a method and system are provided for optimal compute infrastructure utilization for a Next Generation Sequencing (NGS) analysis workflow, wherein the NGS analysis workflow is an RNA sequence analysis workflow, and for automation thereof. An Integrated Rule Oriented Data System (iRODS), in combination with HTCondor, a scheduler, is used for optimal compute infrastructure utilization for the NGS analysis of the RNA sequence workflow and automation thereof. The Integrated Rule Oriented Data System (iRODS) is used for efficient management of input and output data, enabling data to be tagged, providing a single point of access for data, and appropriately routing and sharing data pertaining to the life science data across multiple users. The Integrated Rule Oriented Data System (iRODS) is further adapted for tagging files that are uploaded into the Integrated Rule Oriented Data System (iRODS). A rule may be configured in the Integrated Rule Oriented Data System (iRODS) to automatically tag the files uploaded into the Integrated Rule Oriented Data System (iRODS) based on recognized extensions. For example, a test.fastq file may be attached a tag "file_type : fastq", enabling searching of desired files from a large pool of data in an efficient manner.
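The extension-based auto-tagging rule described above may be illustrated as follows. In a real deployment this logic would be expressed as an iRODS rule attaching metadata attribute-value pairs on upload; the Python form and the extension-to-tag map below are assumptions made purely for illustration.

```python
# Illustrative re-creation of the iRODS auto-tagging rule: files uploaded with
# a recognized extension receive a "file_type" tag for efficient later search.
# The extension map is an assumption; a real system would use an iRODS rule.

EXTENSION_TAGS = {
    ".fastq": "fastq",
    ".fasta": "fasta",
    ".sam": "sam",
    ".bam": "bam",
    ".vcf": "vcf",
}

def auto_tag(filename):
    """Return the metadata tags to attach on upload, keyed by attribute name."""
    for ext, file_type in EXTENSION_TAGS.items():
        if filename.endswith(ext):
            return {"file_type": file_type}
    return {}  # unrecognized extension: leave the file untagged

print(auto_tag("test.fastq"))  # → {'file_type': 'fastq'}
```

A downstream job script can then validate its inputs by checking that the expected "file_type" tag is present before downloading, as the alignment modules below do.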
[0041] According to an embodiment of the disclosure, the NGS analysis workflow pertaining to RNA Sequence analysis workflow is shown in the block diagram of Fig. 4. The RNA Sequence analysis workflow comprises of a brain sequence alignment module (402); an UHR sequence alignment module (404); an identify filter SNP module (406); and a cufflink module (408).
[0042] In another embodiment of the present disclosure, the brain sequence alignment module (402) is adapted for validating the input files using a brain sequence alignment script for determining whether proper tags are attached or not, wherein the "file_type" tag is checked; downloading a plurality of input files from the Integrated Rule Oriented Data System (iRODS) and placing them into the proper directory using a job script after successful validation; uploading an output folder/directory appended with the current time stamp into the Integrated Rule Oriented Data System (iRODS) on successful completion of Tophat; and attaching two more tags to the output folder/directory uploaded into the Integrated Rule Oriented Data System (iRODS), "file_type : brain_alignment_output" and "time : time_stamp".
[0043] In another embodiment of the present invention, the UHR sequence alignment module (404) is adapted for validating the input files using a UHR sequence alignment script for determining whether proper tags are attached or not, wherein the "file_type" tag is checked; downloading a plurality of input files from the Integrated Rule Oriented Data System (iRODS) and placing them into the proper directory using a job script after successful validation; uploading an output folder/directory appended with the current time stamp into the Integrated Rule Oriented Data System (iRODS) on successful completion of Tophat; and attaching two more tags to the output folder/directory uploaded into the Integrated Rule Oriented Data System (iRODS), "file_type : uhr_alignment_output" and "time : time_stamp".
[0044] In another embodiment of the present invention, the identify filter SNP module (406) is adapted for querying the required files for SNP, which requires input from both the brain sequence alignment module (402) and the UHR sequence alignment module (404). The identify filter SNP module (406) is adapted for querying the Integrated Rule Oriented Data System (iRODS) for getting the desired input files, such as the brain sequence alignment output with the latest time_stamp and the same for the UHR sequence alignment; downloading the queried output files from the Integrated Rule Oriented Data System (iRODS) and performing the SNP identification over them; uploading the output folder, appended with the current time stamp, into the Integrated Rule Oriented Data System (iRODS); and attaching two more tags to the output folder/directory uploaded into the Integrated Rule Oriented Data System (iRODS), "file_type : snp_output" and "time : time_stamp".
[0045] In another embodiment of the present invention, the cufflink module (408) is adapted for querying the required files for cufflink, which requires input from both the brain sequence alignment module (402) and the UHR sequence alignment module (404). The cufflink module (408) is adapted for querying the Integrated Rule Oriented Data System (iRODS) for getting the desired input files, such as the brain sequence alignment output with the latest time_stamp and the same for the UHR sequence alignment; downloading the queried output files from the Integrated Rule Oriented Data System (iRODS) and storing said queried output files on a local storage for performing the cufflink analysis over them; uploading the output folder into the Integrated Rule Oriented Data System (iRODS); attaching two more tags to the output folder/directory uploaded into the Integrated Rule Oriented Data System (iRODS), "file_type : cufflink_output" and "time : time_stamp"; and sharing results with a plurality of users by granting them read and write permissions. Further, the cufflink module (408) is adapted to launch the cufflink job over at least one workstation, as it does not have high compute or memory requirements and the amount of input data transfer required is also small.
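The "latest time_stamp" query performed by modules (406) and (408) can be sketched as a selection over tagged entries. The in-memory catalogue below stands in for the iRODS metadata catalogue; the entry names and time-stamp format are illustrative assumptions.

```python
# Sketch of the query used by the SNP (406) and cufflink (408) modules:
# among tagged collections, pick the alignment output with the latest
# time stamp. The catalogue list is a stand-in for iRODS metadata.

def latest_output(catalogue, file_type):
    """Return the name of the newest collection carrying the given file_type tag."""
    matches = [c for c in catalogue if c["file_type"] == file_type]
    if not matches:
        return None
    # YYYYMMDD-style stamps sort correctly as strings (illustrative assumption)
    return max(matches, key=lambda c: c["time"])["name"]

catalogue = [
    {"name": "brain_out_20150901", "file_type": "brain_alignment_output", "time": "20150901"},
    {"name": "brain_out_20151002", "file_type": "brain_alignment_output", "time": "20151002"},
    {"name": "snp_out_20151001",   "file_type": "snp_output",             "time": "20151001"},
]
print(latest_output(catalogue, "brain_alignment_output"))  # → brain_out_20151002
```

The same query, run with the UHR tag, supplies the second input required by both downstream modules.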
[0046] Referring to Fig. 5, a flowchart illustrates the method for optimal compute infrastructure utilization for a High Throughput Virtual Screening (HTVS) workflow using molecular docking. The process starts at step 502, where a plurality of applications to be used with reference to the infrastructure available for the HTVS workflow is characterized. At step 504, the cloud execution environment is prepared based on the HTVS workflow requirement. At step 506, the infrastructure available for the HTVS analysis workflow is dynamically assigned to the at least one application out of the plurality of applications. At step 508, the HTVS workflow is executed. At step 510, the cloud execution environment is pulled down on completion of the HTVS workflow. At step 512, the HTVS data is analyzed. The process ends at step 514, where the HTVS data is managed by enabling efficient storage, archival, and retrieval of the HTVS data.
[0047] Referring to Fig. 6, a block diagram illustrates the system architecture for optimal compute infrastructure utilization for the High Throughput Virtual Screening (HTVS) workflow using molecular docking. The system comprises an HTVS application characterization module (602), a HiPerCB dynamic infrastructure provisioning module (604), an HTVS infrastructure assignment module (606), an HTVS data analytical module (608) and an HTVS data management module (610). The HTVS application characterization module (602) is configured to characterize the plurality of applications to be used with reference to the available cloud infrastructure. It should be appreciated that the function of the HTVS application characterization module (602) is the same as that of the application characterization module (206) described earlier. The HiPerCB dynamic infrastructure provisioning module (604), or HiPerCB module (604), is configured to automatically prepare the cloud execution environment based on the HTVS workflow resource requirement. The HiPerCB module (604) is used to provision virtual machine instances on demand to run the workflow. It also pulls down the execution environment once execution of the workflow is completed. In another embodiment of the disclosure, the plurality of applications may be partially executed on the cloud infrastructure. The HTVS infrastructure assignment module (606) is configured to assign the cloud infrastructure available for the HTVS analysis workflow. It uses the virtual machine instances deployed by the HiPerCB module (604) as compute resources. It should be appreciated that the function of the HTVS infrastructure assignment module (606) is the same as that of the infrastructure assignment module (214) described earlier. The HTVS data analytical module (608) is configured to analyze molecular docking data. It should be appreciated that the function of the HTVS data analytical module (608) is the same as that of the life science data analytical module (216) described earlier.
The HTVS data management module (610) is configured to store, archive, retrieve and search data generated by the workflow. This module is also used to track historical data. Thereby, the HTVS data management module (610) improves the user's efficiency by reducing search time and minimizes user specific errors while handling molecular docking data. It should be appreciated that the function of the HTVS data management module (610) is the same as that of the life science data management module (218) described earlier.
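The provision-run-teardown lifecycle of the HiPerCB module (604) — spinning up virtual machine instances on demand and pulling the environment down once the workflow completes — can be sketched as a context manager. The Provisioner class, instance naming and counts are hypothetical; a real module would call a cloud provider API.

```python
# Hypothetical sketch of the HiPerCB module (604) lifecycle: provision VM
# instances on demand, run the workflow, and pull down the environment on
# completion (or failure). Provisioner is a stand-in for a cloud API client.

from contextlib import contextmanager

class Provisioner:
    def __init__(self):
        self.instances = []

    def provision(self, n):
        self.instances = [f"vm-{i}" for i in range(n)]
        return self.instances

    def teardown(self):
        self.instances = []

@contextmanager
def cloud_environment(provisioner, n_instances):
    """Guarantee teardown of the execution environment even if the workflow fails."""
    vms = provisioner.provision(n_instances)
    try:
        yield vms
    finally:
        provisioner.teardown()  # pull down the environment (step 510)

p = Provisioner()
with cloud_environment(p, 3) as vms:
    print(len(vms))       # → 3 instances available during workflow execution
print(len(p.instances))   # → 0, environment pulled down after completion
```

Wrapping the workflow in a context manager ensures the environment is released even when a docking job raises an exception, which matters for cost control on pay-per-use cloud infrastructure.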
[0048] According to an embodiment of the disclosure, Fig. 7 is a flowchart 700 showing the different steps of the High Throughput Virtual Screening (HTVS) workflow, which are implemented on HPC and cloud using the HiPerCB module (604). The various steps of the HTVS workflow are executed as a single job for a receptor-ligand pair. Multiple such jobs are submitted for a number of ligands on the available infrastructure, i.e., HPC + cloud. Initially, at step 702, the data is preprocessed. The preprocessing involves ligand preparation and receptor preparation: the molecular docking inputs are preprocessed using MGLTools to convert the receptor protein and ligand files to the AutoDock (molecular docking) tool compatible format. At step 704, the parameters are defined for grid preparation and molecular docking using MGLTools to generate the grid parameter file (GPF) and the docking parameter file (DPF) respectively. These files are used to run the following steps of grid preparation and molecular docking. At step 706, the grid is prepared over the receptor protein using the AutoGrid command with the help of the grid parameter file (GPF) prepared in the previous step; it defines the computation space for the docking calculation. At step 708, molecular docking is performed using the AutoDock command with the help of the docking parameter file (DPF). It performs the docking computation using a genetic algorithm and generates the docking output in a .dlg file. And finally, at step 710, the docking results generated in the previous step are summarized using MGLTools on the basis of binding energy and relevant hydrogen bonds between the ligand and the receptor protein.
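The per-pair commands assembled at steps 706-708 may be sketched as follows. The commands are only constructed, not executed, and the file naming scheme is an illustrative assumption; the `-p` (parameter file) and `-l` (log file) options follow the standard autogrid4/autodock4 command-line usage.

```python
# Sketch of the command lines for one receptor-ligand pair, corresponding to
# grid preparation (step 706) and docking (step 708). File names are
# hypothetical; commands are built but not run here.

def docking_commands(receptor, ligand):
    """Build the per-pair commands for grid preparation and molecular docking."""
    gpf = f"{receptor}.gpf"           # grid parameter file from step 704
    dpf = f"{receptor}_{ligand}.dpf"  # docking parameter file from step 704
    return [
        f"autogrid4 -p {gpf} -l {receptor}.glg",           # step 706: grid prep
        f"autodock4 -p {dpf} -l {receptor}_{ligand}.dlg",  # step 708: docking
    ]

cmds = docking_commands("receptor", "lig0001")
print(cmds[1])  # → autodock4 -p receptor_lig0001.dpf -l receptor_lig0001.dlg
```

One such command pair constitutes a single job; the scheduler submits one job per ligand, which is what makes the screen embarrassingly parallel across millions of input files.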
[0049] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims. The embodiment thus provides the system and method for optimal compute infrastructure utilization for the life science application in a high performance computing environment.
[0050] It is, however, to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0051] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0052] The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
[0053] A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
[0054] Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
[0055] A representative hardware environment for practicing the embodiments may include a hardware configuration of an information handling/computer system in accordance with the embodiments herein. The system herein comprises at least one processor or central processing unit (CPU). The CPUs are interconnected via system bus to various devices such as a random access memory (RAM), read-only memory (ROM), and an input/output (I/O) adapter. The I/O adapter can connect to peripheral devices, such as disk units and tape drives, or other program storage devices that are readable by the system. The system can read the inventive instructions on the program storage devices and follow these instructions to execute the methodology of the embodiments herein.
[0056] The system further includes a user interface adapter that connects a keyboard, mouse, speaker, microphone, and/or other user interface devices such as a touch screen device (not shown) to the bus to gather user input. Additionally, a communication adapter connects the bus to a data processing network, and a display adapter connects the bus to a display device which may be embodied as an output device such as a monitor, printer, or transmitter, for example.
[0057] The preceding description has been presented with reference to various embodiments. Persons having ordinary skill in the art and technology to which this application pertains will appreciate that alterations and changes in the described structures and methods of operation can be practiced without meaningfully departing from the principle, spirit and scope.
CLAIMS:
1. A method for optimal compute infrastructure utilization for life science applications, the method comprising processor-implemented steps of:
characterizing a plurality of applications to be used for the life science application workflow corresponding to life science data;
checking if the plurality of applications are to be executed on a cloud infrastructure;
preparing cloud infrastructure environment based on the life science data if the plurality of applications are to be executed on the cloud infrastructure;
checking if the plurality of applications are to be executed on at least one of a local infrastructure, a server or available infrastructure;
profiling the life science application workflow to distinguish between the plurality of applications and harnessing the utilization cycles of the infrastructure available for the life science application workflow, if the plurality of applications are executed on at least one of a local infrastructure, a server or available infrastructure;
dynamically assigning infrastructure available for the life science application workflow to at least one application out of the plurality of applications based on the profiling of the life science application workflow;
analyzing the life science data; and
managing the life science data by enabling efficient storage, archival, retrieval and search of the life science data.
2. The method of claim 1, wherein the life science application is next generation sequencing (NGS) and the life science data is NGS data.
3. The method of claim 1, wherein the life science application is high throughput virtual screening (HTVS) and the life science data is HTVS data.
4. The method of claim 1 wherein the optimal infrastructure utilization is performed in a high performance computing (HPC) environment.
5. The method of claim 1, wherein the characterization is performed either manually or automatically.
6. The method of claim 1, wherein the plurality of applications comprises at least one resource intensive application and at least one lighter application in terms of their operational requirements.
7. The method of claim 1, wherein a cluster is assigned to the resource intensive application and workstations are assigned to the lighter applications.
8. The method of claim 1, wherein the characterization of the plurality of applications is with reference to CPU, memory, network or I/O utilization.
9. The method of claim 1 wherein the profiling is performed based on the respective resource requirement for the NGS workflow analysis.
10. The method of claim 1, wherein the Next Generation Sequencing (NGS) analysis workflow is RNA Sequence analysis workflow.
11. A system for optimal compute infrastructure utilization for a life science application, the system comprising:
an application characterization module configured to characterize a plurality of applications to be used for the life science application workflow corresponding to life science data;
a cloud infrastructure preparation module configured to prepare cloud infrastructure environment based on the life science data if the plurality of applications are executed on the cloud infrastructure;
a workflow profiling module configured to profile the life science application workflow to distinguish between the plurality of applications, if the plurality of applications are executed on at least one of a local infrastructure, a server or available infrastructure;
a utilization cycle harnessing module configured to harness the utilization cycles of the infrastructure available for the life science application workflow;
an infrastructure assignment module configured to dynamically assign infrastructure available for the life science application workflow to at least one application out of the plurality of applications based on the profiling of the life science application workflow;
a life science data analytical module configured to analyze the life science data; and
a life science data management module configured to manage the life science data by enabling efficient storage, archival, retrieval and search of the life science data.
12. A method for optimal compute infrastructure utilization for next generation sequencing (NGS) data, the method comprising processor-implemented steps of:
characterizing a plurality of applications to be used for a next generation sequencing (NGS) analysis workflow corresponding to the NGS data;
profiling the NGS analysis workflow to distinguish between the plurality of applications;
harnessing the utilization cycles of the infrastructure available for NGS analysis workflow;
assigning infrastructure available for NGS analysis workflow to at least one application out of the plurality of applications based on the profiling of the NGS analysis workflow;
analyzing the NGS data; and
managing the NGS data by enabling efficient storage, archival, retrieval and search of NGS data.
13. A method for optimal compute infrastructure utilization for a high throughput virtual screening (HTVS) workflow, the method comprising processor-implemented steps of:
characterizing a plurality of applications to be executed on a cloud infrastructure;
preparing cloud execution environment based on the HTVS workflow requirement;
dynamically assigning prepared infrastructure for HTVS workflow to at least one application out of the plurality of applications;
executing the HTVS workflow;
pulling down the cloud execution environment on the completion of the workflow;
analyzing the HTVS data; and
managing the HTVS data by enabling efficient storage, archival, retrieval and search of HTVS data.
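The assignment logic recited in the claims above (characterizing applications, routing resource-intensive applications to a cluster and lighter applications to workstations per claims 6-8, with a cloud path when required) can be sketched as follows. This is a hypothetical illustration only: the class names, module boundaries, and the CPU-hours threshold are not taken from the specification.

```python
# Hypothetical sketch of the infrastructure-assignment logic recited in
# the claims: characterize each application, then route resource-intensive
# applications to the cluster and lighter applications to workstations,
# shifting cloud-eligible applications to a prepared cloud environment.
# All names and the threshold value are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class AppProfile:
    name: str
    cpu_hours: float       # characterization w.r.t. CPU utilization
    needs_cloud: bool      # result of the cloud-eligibility check

def assign_infrastructure(apps, heavy_threshold=100.0):
    """Map each characterized application to cloud, cluster, or workstation."""
    assignment = {}
    for app in apps:
        if app.needs_cloud:
            # cloud environment is prepared and the load shifted to it
            assignment[app.name] = "cloud"
        elif app.cpu_hours >= heavy_threshold:
            assignment[app.name] = "cluster"       # resource-intensive
        else:
            assignment[app.name] = "workstation"   # lighter application
    return assignment
```

In a full implementation, the profile would also cover memory, network and I/O utilization, and assignments would be revisited dynamically as utilization cycles of the available infrastructure are harnessed.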
| # | Name | Date |
|---|---|---|
| 1 | Form 3 [03-10-2015(online)].pdf | 2015-10-03 |
| 2 | Drawing [03-10-2015(online)].pdf | 2015-10-03 |
| 3 | Description(Provisional) [03-10-2015(online)].pdf | 2015-10-03 |
| 4 | Form 3 [01-10-2016(online)].pdf | 2016-10-01 |
| 5 | Form 18 [01-10-2016(online)].pdf | 2016-10-01 |
| 6 | Drawing [01-10-2016(online)].pdf | 2016-10-01 |
| 7 | Description(Complete) [01-10-2016(online)].pdf | 2016-10-01 |
| 8 | Assignment [01-10-2016(online)].pdf | 2016-10-01 |
| 9 | Form-2(Online).pdf | 2018-08-11 |
| 10 | Form-18(Online).pdf | 2018-08-11 |
| 11 | ABSTRACT1.jpg | 2018-08-11 |
| 12 | 3761-MUM-2015-Power of Attorney-220316.pdf | 2018-08-11 |
| 13 | 3761-MUM-2015-Correspondence-201115.pdf | 2018-08-11 |
| 14 | 3761-MUM-2015-Form 1-201115.pdf | 2018-08-11 |
| 15 | 3761-MUM-2015-Correspondence-220316.pdf | 2018-08-11 |
| 16 | 3761-MUM-2015-FER.pdf | 2020-06-10 |
| 17 | 3761-MUM-2015-CLAIMS [09-12-2020(online)].pdf | 2020-12-09 |
| 18 | 3761-MUM-2015-COMPLETE SPECIFICATION [09-12-2020(online)].pdf | 2020-12-09 |
| 19 | 3761-MUM-2015-FER_SER_REPLY [09-12-2020(online)].pdf | 2020-12-09 |
| 20 | 3761-MUM-2015-OTHERS [09-12-2020(online)].pdf | 2020-12-09 |
| 21 | 3761-MUM-2015-US(14)-HearingNotice-(HearingDate-02-01-2024).pdf | 2023-12-05 |
| 22 | 3761-MUM-2015-Correspondence to notify the Controller [31-12-2023(online)].pdf | 2023-12-31 |
| 23 | 3761-MUM-2015-FORM-26 [31-12-2023(online)].pdf | 2023-12-31 |
| 24 | 3761-MUM-2015-FORM-26 [31-12-2023(online)]-1.pdf | 2023-12-31 |
| 25 | 3761-MUM-2015-Written submissions and relevant documents [12-01-2024(online)].pdf | 2024-01-12 |
| 26 | 3761-MUM-2015-PatentCertificate07-03-2024.pdf | 2024-03-07 |
| 27 | 3761-MUM-2015-IntimationOfGrant07-03-2024.pdf | 2024-03-07 |
| 28 | D1_NPLE_10-06-2020.pdf | |
| 29 | SearchStrategyMatrix3761MUM2015E_10-06-2020.pdf | |