
Method And System For Legacy Code Transformation

Abstract: This disclosure relates to a method and a system for facilitating legacy code transformation. The method includes receiving (301) legacy code data (801) and a natural language document from one or more data sources. Each of the one or more data sources is one of an external data source or an internal data source. Further, the method includes generating (302) a first natural language output (807) based on the legacy code data (801) through a first LLM (804), and a second natural language output based on the natural language document through a second LLM; fine-tuning (303) one of the first LLM (804) or the second LLM based on the first natural language output (807) and the second natural language output, through a third LLM (810); and generating (304) a natural language specification document (811) based on the first natural language output (807) and the second natural language output through the third LLM (810).


Patent Information

Application #
Filing Date
09 August 2023
Publication Number
28/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

INFOSYS LIMITED
44, Infosys Avenue, Electronics City, Hosur Road, Bangalore, 560100, Karnataka

Inventors

1. Srinivas Jaggumantri
B 08 Patterns, Good Earth Malhar, Kambipura, Bangalore
2. Madhavi Latha Padakanti
105 Vars Ferndale Apts, 1st Main Rd Kodihalli, Hal II Stage, Bangalore 560008
3. Nareshkumar Manoharan
B115 Prime City Apartments, Doddathogur, Electronics City phase -1, Bangalore, KA - 560100

Specification

Technical Field
[001] This disclosure relates generally to legacy code transformation, and
more particularly to a method and a system for legacy code transformation through
generative AI models.
Background
[002] Software systems in various organizations, such as banking, insurance, or government organizations, built using legacy programming languages (for example, the common business-oriented language (COBOL)), pose significant challenges in terms of maintenance, scalability, and integration with modern technologies. As organizations strive to modernize their software systems, there is a growing need for efficient and accurate methods to transform legacy codebases into modernized code languages.
[003] Financial institutions often face significant challenges with their aging legacy systems, particularly those heavily dependent on mainframe platforms and the COBOL programming language. These systems, which constitute approximately 43% of all banking systems, present obstacles for modernization efforts. Traditional large-scale modernization programs may stretch over a decade and have a low success rate, which creates hesitation among banks to embark on such endeavors.
[004] One of the key technical problems encountered in legacy language transformation is the manual or line-by-line code conversion process, which lacks a comprehensive understanding of a code's context. This approach often results in monolithic code structures that are difficult to maintain, and the migration process becomes time-consuming and costly. Code refactoring becomes a considerable undertaking, further hindering the modernization efforts.
[005] To understand the legacy landscape, several tools have been developed for extracting business rules and providing inventory and dead code analysis. However, these tools only assist with manual efforts and do not offer end-to-end automation. In large-scale legacy modernization programs, reverse engineers work closely with business analysts to manually create detailed specification documents, which may be a labor-intensive and error-prone process.
[006] Moreover, financial institutions encounter challenges when modernizing legacy batch processes. These processes are often tightly coupled and complex, making it difficult to trace dependencies between different components. The monolithic nature of batch systems further complicates the transition to real-time operations, impeding efforts to enhance straight-through processing.
[007] Recent advancements in generative Artificial Intelligence (AI) techniques, such as codex transformers and unsupervised neural machine translation (NMT), have shown promise in understanding, generating, and translating source code. However, these models have predominantly been trained on modern programming languages like Java and Python, and their applicability to legacy languages is limited.
[008] There is, therefore, a need in the present state of art for techniques to address the challenges faced by various organizations in legacy language transformation. The proposed techniques may focus on the transformation of legacy codebases, such as COBOL, to a modernized code language, such as Java or Python, ensuring improved system agility while reducing the timelines typically associated with such transformations.
SUMMARY
[009] In one embodiment, a method for facilitating legacy code transformation is disclosed. In one example, the method may include receiving legacy code data and at least one natural language document from one or more data sources. Further, the method may include generating a first natural language output based on the legacy code data through a first Large Language Model (LLM), and a second natural language output based on the at least one natural language document through a second LLM. The first natural language output may include domain context or code explanation corresponding to the legacy code data, and the second natural language output may include extracted knowledge from the at least one natural language document. Further, the method may include fine-tuning at least one of the first LLM or the second LLM based on the first natural language output and the second natural language output, through a third LLM. Further, the method may include generating a natural language specification document corresponding to the legacy code data based on the first natural language output and the second natural language output through the third LLM.
[010] In one embodiment, a system for facilitating legacy code transformation is disclosed. In one example, the system may include a processor and a computer-readable medium communicatively coupled to the processor. The computer-readable medium may store processor-executable instructions, which, on execution, may cause the processor to receive legacy code data and at least one natural language document from one or more data sources. Further, the processor-executable instructions, on execution, may further cause the processor to generate a first natural language output based on the legacy code data through a first Large Language Model (LLM), and a second natural language output based on the at least one natural language document through a second LLM. The first natural language output may include domain context or code explanation corresponding to the legacy code data, and the second natural language output may include extracted knowledge from the at least one natural language document. Further, the processor-executable instructions, on execution, may further cause the processor to fine-tune at least one of the first LLM or the second LLM based on the first natural language output and the second natural language output, through a third LLM. Further, the processor-executable instructions, on execution, may further cause the processor to generate a natural language specification document corresponding to the legacy code data based on the first natural language output and the second natural language output through the third LLM.
[011] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[012] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, explain the disclosed principles.
[013] FIG. 1 is a block diagram of an environment for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure;
[014] FIG. 2 is a block diagram of a computing device for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure;
[015] FIG. 3 is a flow diagram of an exemplary process for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure;
[016] FIG. 4 is a flow diagram of an exemplary process for training a first LLM, in accordance with an exemplary embodiment of the present disclosure;
[017] FIG. 5 is a flow diagram of an exemplary process for generating a first natural language output, in accordance with an exemplary embodiment of the present disclosure;
[018] FIG. 6 is a flow diagram of an exemplary process for fine-tuning at least one of a first LLM and a second LLM, in accordance with an exemplary embodiment of the present disclosure;
[019] FIG. 7 is a flow diagram of an exemplary process for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure;
[020] FIG. 8 is a diagram that illustrates transformation of legacy code data to a modernized code, in accordance with an exemplary embodiment of the present disclosure;
[021] FIG. 9 is a diagram that illustrates generation of a natural language specification document corresponding to legacy code data, in accordance with an exemplary embodiment of the present disclosure; and
[022] FIG. 10 is a diagram that illustrates training of a first LLM, in accordance with an exemplary embodiment of the present disclosure.
DETAILED DESCRIPTION
[023] Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[024] FIG. 1 is a block diagram of an environment 100 for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure.
[025] The environment 100 may include a user device 101 and a computing device 102. The user device 101 and the computing device 102 are configured to communicate with each other via a communication network 103. Examples of the communication network 103 may include, but are not limited to, a wireless fidelity (Wi-Fi) network, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, and a combination thereof.
[026] As will be described in greater detail in conjunction with FIGS. 2 – 10, in order to transform the legacy code (for example, COBOL) to a modernized code (for example, Java or Python), initially, the communication network 103 may facilitate data exchange between the user device 101 and the computing device 102. Specifically, the computing device 102 receives data (for example, legacy code data and one or more natural language documents) from the user device 101 via the communication network 103.
[027] As will be appreciated by those skilled in the art, the techniques described herein are not limited to the transformation of COBOL codebases alone, but encompass a broader scope of legacy codebases. These techniques are designed to facilitate the transition from various legacy languages, including but not limited to COBOL, to modernized code languages such as Java or Python. It is understood that legacy systems may be written in diverse programming languages, and the techniques described herein are adaptable and extensible to address the challenges posed by different legacy codebases.
[028] The user device 101 may include a legacy codebase for storing the legacy code data, internal data sources for storing internal natural language documents, and external data sources for storing external natural language documents. Examples of the user device 101 may include a smartphone, a tablet, a laptop, a desktop, a notebook, a mobile phone, an application server, or the like.
[029] The legacy code data encompasses various elements associated with the legacy codebase, including but not limited to, online and batch programs, copybooks, job control language (JCL), control cards, scripts, stored procedures, and schedules. The internal natural language document includes unstructured data such as lifecycle documents, standard operating procedures (SOPs), use cases, configuration management database (CMDB) data, design documents, incident management systems, human-generated emails, blogs, knowledge repositories, and knowledge transfer sessions. The external natural language document includes industry references, standard documents, and reference frameworks.
[030] The computing device 102 may further utilize at least two distinct large language models (LLMs) (for example, a first LLM and a second LLM). The first LLM may generate a first natural language output based on the legacy code data, providing domain context or code explanations relevant to the legacy code. This expanded code explanation provides a detailed and human-readable representation of the code's functionality and structure.
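For illustration, the first LLM's role might be sketched as follows. The prompt template, the helper names (`build_explanation_prompt`, `explain_legacy_code`), and the stubbed model call are illustrative assumptions, not part of the disclosure:

```python
COBOL_FRAGMENT = """\
IDENTIFICATION DIVISION.
PROGRAM-ID. PAYCALC.
PROCEDURE DIVISION.
    COMPUTE WS-NET-PAY = WS-GROSS-PAY - WS-TAX.
"""

def build_explanation_prompt(legacy_code, domain_hint=""):
    """Assemble the instruction handed to the first LLM."""
    parts = [
        "Explain the following legacy COBOL code in plain English.",
        "Describe its purpose, inputs, outputs, and business meaning.",
    ]
    if domain_hint:
        parts.append("Domain context: " + domain_hint)
    parts.append("--- CODE ---")
    parts.append(legacy_code)
    return "\n".join(parts)

def explain_legacy_code(legacy_code, llm=None, domain_hint=""):
    """Produce the first natural language output; the model call is stubbed."""
    prompt = build_explanation_prompt(legacy_code, domain_hint)
    if llm is None:
        # Placeholder standing in for the real first LLM.
        return "[first natural language output for prompt of %d chars]" % len(prompt)
    return llm(prompt)

print(explain_legacy_code(COBOL_FRAGMENT, domain_hint="payroll"))
```

Any model callable accepting a prompt string could be passed as `llm`; the sketch only shows how legacy code and domain context would be combined into the model's input.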
[031] Alternatively, the first LLM may generate summaries of code comments in natural language, based on the code comments present in the legacy code data. These summaries are then incorporated into the first natural language output, offering concise and informative descriptions of the code's purpose and behaviour.
[032] Meanwhile, the second LLM may generate a second natural language
output based on the at least one natural language document, extracting knowledge and
information contained within the documents.
[033] In some embodiments, the second natural language output may be derived from a combination of both the internal and external natural language documents. In such embodiments, the computing device 102 may employ two distinct LLMs (such as an internal document processing LLM and an external document processing LLM), one for generating the internal natural language output and another for generating the external natural language output. Alternatively, the computing device 102 may employ a single LLM (e.g., the second LLM) to generate both the internal and external natural language outputs.
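The single-LLM and two-LLM arrangements can be sketched in one routing function. The `Document` class, the routing logic, and the tagged stub models are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Document:
    source: str  # "internal" (SOPs, design docs) or "external" (standards)
    text: str

def second_natural_language_output(documents, internal_llm, external_llm=None):
    """Route each document to the appropriate extraction model.

    If external_llm is None, a single second LLM handles both source types;
    otherwise internal and external documents go to specialized models.
    """
    outputs = []
    for doc in documents:
        if doc.source == "external" and external_llm is not None:
            outputs.append(external_llm(doc.text))
        else:
            outputs.append(internal_llm(doc.text))
    return "\n".join(outputs)

docs = [
    Document("internal", "SOP: the nightly batch posts ledger entries."),
    Document("external", "Industry reference: ISO 20022 payment messages."),
]
# Stub LLMs that tag their output so the routing is visible.
merged = second_natural_language_output(
    docs, lambda t: "INTERNAL: " + t, lambda t: "EXTERNAL: " + t)
print(merged)
```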
[034] Once the first natural language output and the second natural language output are generated, the computing device 102 may further fine-tune at least one of the first LLM or the second LLM based on the first natural language output and the second natural language output.
[035] To fine-tune the at least one of the first LLM or the second LLM, the computing device 102 may employ a third LLM. The third LLM may be an Artificial Intelligence (AI) based convergent LLM that may be dedicated to analyzing and identifying gaps between the first natural language output and the second natural language output.
[036] Once the gaps are identified, they are fed back to the LLMs of the respective data sources. Human experts, such as developers or domain specialists, may review the natural language outputs to provide feedback on their accuracy, completeness, and context relevance. This human-assisted feedback may serve as valuable information to fine-tune the LLM models. The feedback may help to identify areas where the natural language outputs may be improved, and the LLM models may be adjusted accordingly to enhance their language understanding and generation capabilities.
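As a minimal sketch of the gap identification step, a simple lexical diff can stand in for the third LLM's semantic comparison; the `identify_gaps` helper and the sample outputs are assumptions for illustration only:

```python
def identify_gaps(first_output, second_output):
    """Return terms present in one natural language output but not the other.

    A plain lexical diff stands in for the third LLM's semantic gap analysis.
    """
    code_terms = set(first_output.lower().split())
    doc_terms = set(second_output.lower().split())
    return {
        "in_docs_not_in_code_view": sorted(doc_terms - code_terms),
        "in_code_view_not_in_docs": sorted(code_terms - doc_terms),
    }

gaps = identify_gaps(
    "computes net pay from gross pay and tax",
    "net pay must exclude tax and pension contributions",
)
print(gaps["in_docs_not_in_code_view"])  # terms the code view never mentions
```

In the disclosed system such gaps would be fed back, together with human-expert review, to refine the first and second LLMs.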
[037] To improve the performance of the at least one of the first LLM or the second LLM based on the feedback received from the gap analysis, the computing device 102 may modify one or more parameters of the at least one of the first LLM or the second LLM. This modification may include adjusting internal settings, weights, and configurations of the LLMs to better suit the specific task of legacy code transformation and natural language document understanding.
[038] The process of modification may be iterative, and for this purpose the computing device 102 may fine-tune the at least one of the first LLM or the second LLM multiple times using the feedback to achieve continuous improvement. By adjusting the parameters, each of the first LLM and the second LLM may become more contextually aware, capturing domain-specific knowledge and generating more accurate and relevant natural language outputs.
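The iterative measure-gap / adjust-parameters cycle might be captured in a small loop like the following; the function names, the numeric "gap score", and the halving adjustment are toy assumptions, not the disclosed fine-tuning procedure:

```python
def fine_tune_until_converged(params, gap_score, adjust, threshold=0.1, max_rounds=10):
    """Repeat the measure-gap / adjust-parameters cycle until the gap is small."""
    for _ in range(max_rounds):
        score = gap_score(params)
        if score <= threshold:
            break
        params = adjust(params, score)
    return params

# Toy stand-ins: the "gap" is simply the parameter value itself, and each
# adjustment halves it, mimicking feedback-driven refinement.
final = fine_tune_until_converged(
    1.0,
    gap_score=lambda p: p,
    adjust=lambda p, s: p / 2,
)
print(final)
```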
[039] Further, the computing device 102 may generate a natural language specification document corresponding to the legacy code data based on the first natural language output and the second natural language output through the third LLM. The natural language specification document may provide essential guidelines, rules, and requirements necessary for the transformation process. It may capture a domain context, code explanations, and extracted knowledge, ensuring that the modernization effort aligns with the intended objectives and requirements. It should be noted that each of the first LLM, the second LLM, and the third LLM may be an encoder-decoder transformer architecture-based generative AI model.
[040] The encoder-decoder transformer architecture is a powerful and widely used framework for natural language processing tasks. In this architecture, the encoder component processes input data and converts it into a fixed-size representation, capturing contextual information and domain-specific knowledge. The decoder component takes the fixed-size representation as input and generates an output sequence, such as a natural language output.
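The self-attention mechanism at the heart of such transformer models can be illustrated with a minimal scaled dot-product attention step over toy vectors (pure-Python, for exposition only; production models operate on learned high-dimensional projections):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """One attention step: each query mixes the values, weighted by key similarity."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Three 2-dimensional token vectors attending over themselves.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out)
```

Each output row is a convex combination of the value vectors, which is how attention lets the model weigh relevant parts of the input.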
[041] By adopting the encoder-decoder transformer architecture, each of the LLM models may effectively handle the complexities of understanding and generating natural language representations. The transformer architecture, with its self-attention mechanism, allows the LLM models to focus on relevant parts of the input and efficiently capture long-range dependencies within the data.
[042] As generative AI models, the LLMs may generate human-like text that is contextually relevant and coherent. They may understand the nuances of both the legacy code data and the natural language documents, facilitating accurate conversion and specification generation.
[043] Additionally, the computing device 102 may utilize a code-generating generative AI model to generate modern code data corresponding to the legacy code data based on the natural language specification document. The modern code data may be a transformation of the legacy code data in a modernized code language.
[044] More specifically, the code-generating generative AI model interprets the natural language specification document and translates it into a desired modern code language, such as Java. The modern code data generated by this process represents a modernized version of the original legacy code.
[045] The modern code data aligns with the guidelines and requirements specified in the natural language specification document, ensuring that the transformed code complies with the intended modernization goals. This modernized code may further be deployed and integrated within an organization's updated software infrastructure, contributing to the overall enhancement and efficiency of the legacy system.
[046] FIG. 2 is a block diagram 200 of a computing device 102 for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure. FIG. 2 is explained in conjunction with elements from FIG. 1. The computing device 102 may include a processing circuitry 201 and a memory 202 communicatively coupled to the processing circuitry 201 via a communication bus 203. The memory 202 may store processor instructions. The processor instructions, when executed by the processing circuitry 201, may cause the processing circuitry 201 to implement one or more embodiments of the present disclosure. The memory 202 may include a processing module 204, a large language model (LLM) module 205, and a database 206.
[047] The database 206 may store legacy code data and natural language documents (for example, internal natural language documents and external natural language documents). Once the computing device 102 receives the legacy code data and at least one natural language document from the database 206, the processing module 204 may pre-process the legacy code data and the at least one natural language document.
[048] To further elaborate, the pre-processing may include segregating a COBOL code within the legacy code data into distinct elements, including variables, file operations, SQL/DB operations, function blocks, user interactions, and comments. The pre-processing is aimed at organizing and categorizing the elements of the COBOL code, providing a structured representation of its different functionalities. By segregating the code into specific elements, such as variables, file operations, etc., the processing module 204 prepares the data for further analysis and transformation. The pre-processed data may act as a foundation for the generation of the first natural language output and the extraction of knowledge for the subsequent stages of the legacy code transformation.
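The segregation step described above could be sketched with a simple keyword-based pass. A real pre-processor would use a proper COBOL parser; the bucket names, keyword lists, and the assumption that comments start with `*` are simplifications for illustration:

```python
def segregate_cobol(source):
    """Bucket COBOL lines into the element categories named above (sketch)."""
    buckets = {"comments": [], "variables": [], "file_operations": [],
               "sql_db_operations": [], "other": []}
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            buckets["comments"].append(line)
        elif line.split()[0] in ("01", "05", "77"):
            # Common data-item level numbers signal variable declarations.
            buckets["variables"].append(line)
        elif "EXEC SQL" in line:
            buckets["sql_db_operations"].append(line)
        elif line.split()[0] in ("OPEN", "READ", "WRITE", "CLOSE"):
            buckets["file_operations"].append(line)
        else:
            buckets["other"].append(line)
    return buckets

SAMPLE = """\
* CALCULATE NET PAY FOR EACH EMPLOYEE
01 WS-GROSS-PAY PIC 9(7)V99.
OPEN INPUT EMPLOYEE-FILE
READ EMPLOYEE-FILE INTO WS-EMP-REC
EXEC SQL UPDATE PAYROLL SET NET = :WS-NET END-EXEC
COMPUTE WS-NET = WS-GROSS-PAY - WS-TAX
"""
result = segregate_cobol(SAMPLE)
print({k: len(v) for k, v in result.items()})
```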
[049] The LLM module 205 may include a first LLM, a second LLM, a third LLM, and a code-generating generative AI model. The first LLM may be configured to generate a first natural language output based on the legacy code data. This output may include domain context or code explanations that correspond to the legacy code data. In other words, the first LLM may convert the legacy codebase into a human-readable format, providing valuable information related to the functionality and structure of the legacy code.
[050] The second LLM may generate a second natural language output based on the at least one natural language document. This output may include extracted knowledge and information from the natural language document. By analyzing the natural language document, the second LLM may capture relevant data and context to be used in the transformation process.
[051] More specifically, the second LLM processes one or more internal natural language documents, which may include various types of unstructured data, such as lifecycle documents, SOPs, use cases, CMDB data, design documents, incident management systems, human-generated emails, blogs, knowledge repositories, and knowledge transfer sessions. The second LLM analyzes these internal documents and extracts pertinent information relevant to the legacy codebase.
[052] Additionally, the second LLM also processes one or more external natural language documents, which may include industry references, standard documents, and reference frameworks. These external documents often provide valuable industry-specific standards, best practices, and guidelines that are crucial for the modernization process.
[053] Through the analysis of both internal and external natural language documents, the second LLM generates the second natural language output, capturing important knowledge and domain-specific details relevant to the legacy code transformation. It should be noted that the second natural language output may be generated either by a single second LLM or by a combination of an internal document processing LLM and an external document processing LLM.
[054] In other words, in one embodiment, the second LLM may be designed to handle the analysis of diverse document types and extract relevant information from both internal and external sources. This single second LLM may generate the second natural language output, combining information from both types of documents.
[055] In another embodiment, the second natural language output may be generated by a combination of LLMs, each specialized in processing a specific type of document. This means that there may be one LLM focused on analyzing and extracting information from internal natural language documents (such as SOPs, design documents, etc.), and another LLM specialized in processing external natural language documents (such as industry references and standards). The combination of the outputs from these specialized LLMs results in the complete second natural language output.
[056] Both approaches have their advantages and may be selected based on factors such as the complexity and variety of the natural language documents, the scale of the legacy codebase, and the specific requirements of the legacy language transformation project.
[057] Further, the third LLM may fine-tune at least one of the first LLM or the second LLM based on the first natural language output and the second natural language output. The fine-tuning may include updating parameters and weights of the first LLM and the second LLM based on the feedback derived from the first and second natural language outputs.
[058] The fine-tuning process may be essential for ensuring that the LLMs align more accurately with the specific requirements and context of the legacy code transformation task. By learning from the generated natural language outputs, the LLMs become better equipped to produce contextually relevant and accurate outputs in subsequent iterations of the transformation process. The process of fine-tuning is explained in detail in conjunction with FIG. 6.
[059] Further, the third LLM may generate a natural language specification document corresponding to the legacy code data based on the first natural language output and the second natural language output. The natural language specification document may act as a detailed guide that encapsulates essential information from both the legacy code data and the extracted knowledge from the natural language documents. By combining the first and second natural language outputs, the third LLM generates a detailed specification that outlines the necessary steps and guidelines for the modernization of the legacy code.
[060] The content of the natural language specification document may include, but is not limited to, transformation requirements, modernization strategies, design patterns, architectural considerations, and other critical elements required for a successful transformation of the legacy code into a modernized code language.
[061] Further, the code-generating generative AI model may generate modern code data corresponding to the legacy code data based on the natural language specification document. The modern code data may be a transformation of the legacy code data in a modernized code language.
[062] By interpreting the detailed specifications and guidelines outlined in the natural language specification document, the code-generating generative AI model may generate modern code that aligns with the requirements and objectives of the modernization process. This code generation includes various aspects, such as refactoring, optimization, code restructuring, and incorporating best practices in the modernized code.
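One way the specification document might be fed to the code-generating model is by flattening its sections into a generation request. The section names and the `build_codegen_prompt` helper are hypothetical, chosen only to mirror the content listed in paragraph [060]:

```python
def build_codegen_prompt(spec, target_language="Java"):
    """Flatten selected specification sections into a code-generation request."""
    lines = ["Generate %s code satisfying the specification below." % target_language]
    for heading in ("transformation_requirements", "design_patterns",
                    "architectural_considerations"):
        if heading in spec:
            lines.append("== " + heading.replace("_", " ").title() + " ==")
            lines.append(spec[heading])
    return "\n".join(lines)

spec = {
    "transformation_requirements": "Replace file-based batch I/O with REST endpoints.",
    "design_patterns": "Use the repository pattern for persistence.",
}
prompt = build_codegen_prompt(spec)
print(prompt)
```

The resulting prompt would then be passed to whatever code-generating generative AI model the system employs.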
[063] In some embodiments, the LLM module 205 may be responsible for training the first LLM using a training dataset. The training may be performed using a self-supervised learning technique. The training dataset may include a source code dataset and a natural language specification corresponding to the source code dataset.
[064] In some embodiments, the LLM module 205 may train the first LLM
to configure it for generating code explanation corresponding to the legacy code data
in the first natural language output. To accomplish this, the training dataset may include
legacy code language information along with their respective explanations.
[065] Additionally, in some embodiments, the LLM module 205 may train the first LLM to configure it for generating domain context in the first natural language output. The training dataset utilized for this purpose may include textual data relevant to the domain context.
[066] It should be noted that all such aforementioned modules 204 – 205 may be represented as a single module or a combination of different modules. Further, as will be appreciated by those skilled in the art, each of the modules 204 – 205 may reside, in whole or in parts, on one device or multiple devices in communication with each other. In some embodiments, each of the modules 204 – 205 may be implemented as a dedicated hardware circuit comprising a custom application-specific integrated circuit (ASIC) or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. Each of the modules 204 – 205 may also be implemented in a programmable hardware device such as a field programmable gate array (FPGA), programmable array logic, programmable logic device, and so forth. Alternatively, each of the modules 204 – 205 may be implemented in software for execution by various types of processors (e.g., the processing circuitry 201). An identified module of executable code may, for instance, include one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executables of an identified module or component need not be physically located together, but may include disparate instructions stored in different locations which, when joined logically together, include the module and achieve the stated purpose of the module. Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
[067] As will be appreciated by one skilled in the art, a variety of processes may be employed for facilitating legacy code transformation. For example, the exemplary computing device 102 may facilitate transformation of the legacy code to a modernized code by the processes discussed herein. In particular, as will be appreciated by those of ordinary skill in the art, control logic and/or automated routines for performing the techniques and steps described herein may be implemented by the computing device 102 either by hardware, software, or combinations of hardware and software. For example, suitable code may be accessed and executed by the one or more processors on the computing device 102 to perform some or all of the techniques described herein. Similarly, application specific integrated circuits (ASICs) configured to perform some or all of the processes described herein may be included in the one or more processors on the computing device 102.
[068] FIG. 3 is a flow diagram that illustrates an exemplary process 300 for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure. FIG. 3 is explained in conjunction with elements from FIGS. 1 and 2. In an embodiment, the process 300 may be implemented by the computing device 102. The process 300 may include receiving legacy code data and at least one natural language document from one or more data sources, at step 301.
[069] Further, the process 300 may include pre-processing the legacy code
data and the at least one natural language document, at step 302. Further, the process
300 may include generating a first natural language output based on the legacy code
data through a first LLM, and a second natural language output based on the at least
one natural language document through a second LLM, at step 303. The first natural
language output may include domain context or code explanation corresponding to the
legacy code data, and the second natural language output may include extracted
knowledge from the at least one natural language document.
[070] Further, the process 300 may include fine-tuning at least one of the
first LLM or the second LLM based on the first natural language output and the second
natural language output, through a third LLM, at step 304. A process of fine-tuning the
at least one of the first LLM or the second LLM is explained in detail in conjunction
with FIG. 6.
[071] Further, the process 300 may include generating a natural language
specification document corresponding to the legacy code data based on the first natural
language output and the second natural language output through the third LLM, at step
305. In some embodiments, an alternative approach may be employed, where the
process 300 may generate a Domain-Specific Language (DSL) specification document.
Similar to the natural language specification document, the DSL specification
document may also be based on the first natural language output and the second natural
language output, achieved through the third LLM.
[072] Both types of specification documents, whether in natural language or in a Domain-Specific Language, may serve as detailed and comprehensive guidelines for modernizing the legacy codebase. Each may include critical information, requirements, and recommendations derived from the legacy code data and from the information obtained through the analysis of natural language documents. These specification documents become invaluable references for the subsequent stages of the legacy language transformation process, guiding developers throughout the modernization journey.
[073] Further, the process 300 may include generating modern code data
corresponding to the legacy code data based on the natural language specification
document through a code-generating generative AI model, at step 306.
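By way of a non-limiting illustration, the flow of steps 301 to 306 may be sketched in Python as follows. Every function below is a hypothetical placeholder standing in for the corresponding LLM or code-generating model; no actual model is invoked:

```python
# Placeholder "LLM" calls: each stands in for a real model invocation
# (e.g., a fine-tuned CodeT5-style model). All names are hypothetical.
def first_llm(legacy_code: str) -> str:
    """Step 303 (code side): domain context / code explanation."""
    return f"Explanation of legacy code ({len(legacy_code)} chars)"

def second_llm(document: str) -> str:
    """Step 303 (document side): knowledge extracted from the document."""
    return f"Knowledge extracted from document ({len(document)} chars)"

def third_llm(code_output: str, doc_output: str) -> str:
    """Step 305: converge both outputs into one specification document."""
    return "SPECIFICATION:\n" + code_output + "\n" + doc_output

def code_generator(specification: str) -> str:
    """Step 306: generate modern code from the specification."""
    return "// modern code generated from specification"

def transform(legacy_code: str, document: str) -> str:
    """Steps 301-306 of process 300, end to end."""
    code_output = first_llm(legacy_code)       # step 303
    doc_output = second_llm(document)          # step 303
    spec = third_llm(code_output, doc_output)  # step 305
    return code_generator(spec)                # step 306

modern = transform("IDENTIFICATION DIVISION. ...", "SOP: account opening ...")
print(modern)
```

The fine-tuning of step 304 is omitted from this sketch; it operates on the intermediate outputs rather than on the data flow shown here.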
[074] FIG. 4 is a flowchart that illustrates an exemplary process 400 for training a first LLM, in accordance with an exemplary
embodiment of the present disclosure. FIG. 4 is explained in conjunction with elements
from FIGS. 1, 2, and 3. In an embodiment, the process 400 may be implemented by the
computing device 102. The process 400 may include training the first LLM using a
training dataset through a self-supervised learning technique, at step 401.
[075] The self-supervised learning technique may enable the first LLM to
learn from the training dataset without the need for explicit labels. The training dataset
may include a set of source code data (representing the legacy codebase) and their
corresponding natural language specifications. During the training process, the first
LLM may utilize a relationship between the source code and its associated natural
language specifications to learn and map the two, effectively acquiring the ability to
generate natural language outputs that may correspond to the legacy codebase.
[076] In an embodiment, to train the first LLM, the process 400 may include
configuring the first LLM to generate the code explanation corresponding to the legacy
code data in the first natural language output, at step 402. The training dataset may
include legacy code language information and corresponding explanation.
[077] Alternatively, to train the first LLM, the process 400 may include
configuring the first LLM to generate the domain context in the first natural language
output, at step 403. The training dataset may include textual data. Through this training process, the first LLM may produce contextually relevant and accurate language outputs, laying a foundation for subsequent stages of the legacy code transformation.
[078] FIG. 5 is a flowchart that illustrates an exemplary process 500 for generating a first natural language output, in accordance
with an exemplary embodiment of the present disclosure. FIG. 5 is explained in
conjunction with elements from FIGS. 1, 2, 3, and 4. In an embodiment, the process
500 may be implemented by the computing device 102. As previously explained in
reference to FIG. 3, the first natural language output may be generated based on the
legacy code data, at step 303.
[079] In an embodiment, to generate the first natural language output, the
process 500 may include generating expanded code functions corresponding to the
legacy code data in natural language, at step 501. The first natural language output may
include the expanded code functions.
[080] These expanded code functions provide a comprehensive and detailed
representation of the functionality and operations present within the legacy codebase.
The first natural language output may be enhanced with these expanded code functions,
ensuring that it includes an exhaustive and detailed account of the legacy codebase's
operations and capabilities. By incorporating the expanded code functions into the first
natural language output, the output becomes more informative and expressive,
capturing the complexities of the legacy code.
[081] Alternatively, in some embodiments, to generate the first natural
language output, the process 500 may include generating summaries of code comments
in the legacy code data in natural language, at step 502. The first natural language
output may include the summaries. These summaries may capture the essence of the
comments present within the legacy codebase and present them in a concise and
understandable format.
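As a simplified, non-limiting sketch of step 502, the snippet below extracts comment lines from fixed-format COBOL, where an asterisk in column 7 marks a comment. The truncation-based `summarize` is merely a placeholder for the LLM-generated summary:

```python
def extract_comments(cobol_source: str) -> list[str]:
    """Collect comment lines from fixed-format COBOL.

    In fixed-format COBOL, an asterisk in column 7 (index 6) marks a
    comment line. Free-format dialects differ; this is a simplification.
    """
    comments = []
    for line in cobol_source.splitlines():
        if len(line) > 6 and line[6] == "*":
            comments.append(line[7:].strip())
    return comments

def summarize(comments: list[str], max_words: int = 12) -> str:
    """Crude stand-in for an LLM summary: keep the first few words."""
    words = " ".join(comments).split()
    return " ".join(words[:max_words])

source = """\
000100* Compute monthly interest for each active account
000200 IDENTIFICATION DIVISION.
000300* Skip accounts flagged as dormant
"""
print(extract_comments(source))
```

In the actual system, the summarization step would of course be performed by the first LLM rather than by word truncation.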
[082] When integrated into the first natural language output, these
summaries may enhance its clarity and readability by providing valuable information
from the code comments. The inclusion of code comment summaries may allow
developers to gain a quick understanding of the intentions, explanations, and context
embedded in the legacy code. This concise representation of code comments in the first
natural language output may help in understanding and analysing the legacy codebase,
facilitating a smoother modernization process.
[083] FIG. 6 is a flowchart that illustrates an exemplary process 600 for fine-tuning at least one of a first LLM and a second LLM, in
accordance with an exemplary embodiment of the present disclosure. FIG. 6 is
explained in conjunction with elements from FIGS. 1, 2, 3, 4, and 5. In an embodiment,
the process 600 may be implemented by the computing device 102. As explained
earlier in reference to FIG. 3, the at least one of the first LLM and the second LLM
may be fine-tuned, at step 304.
[084] Therefore, to fine-tune the at least one of the first LLM and the second
LLM, the process 600 may include performing, via the third LLM, a gap analysis based
on the first natural language output and the second natural language output, at step 601.
[085] Further, the process 600 may include identifying, via the third LLM,
one or more gaps in at least one of the first natural language output or the second natural
language output based on the gap analysis, at step 602. During the gap analysis, the
third LLM may identify any discrepancies or gaps that may exist between the first
natural language output and the second natural language output, as compared to the
information derived from the legacy codebase.
[086] Further, the process 600 may include providing, via the third LLM, a
feedback to at least one of the first LLM or the second LLM based on the identified
one or more gaps, at step 603. The fine-tuning may be based on human-assisted
feedback corresponding to the first natural language output and the second natural
language output. In particular, the identified gaps may be manually addressed through
a feedback loop that involves human experts.
[087] Further, the process 600 may include modifying one or more parameters of the at least one of the first LLM or the second LLM based on the feedback, at step 604. The modification may include updating weights, adjusting internal configurations, and refining learned representations within the LLMs, guided by the feedback received from the gap analysis and the human-assisted review of the natural language outputs.
[088] When the feedback loop identifies discrepancies or inadequacies in the
first and second natural language outputs generated by the first and second LLMs, it
indicates areas where the LLM models may be enhanced. To address these areas of
improvement, the fine-tuning process may adjust the internal parameters of the LLMs.
By doing so, the LLMs’ natural language generation capabilities may be refined,
leading to more accurate, contextually relevant, and high-quality representations of the
legacy code and associated natural language documents.
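A minimal sketch of steps 601 to 604 follows, assuming the natural language outputs have been reduced to sets of extracted facts (a hypothetical representation). The gap analysis then becomes a set difference, and the reviewer callable stands in for the human-assisted feedback:

```python
def gap_analysis(code_facts: set[str], doc_facts: set[str]) -> dict[str, set[str]]:
    """Steps 601-602: facts present in one output but missing from the other."""
    return {
        "missing_from_code_output": doc_facts - code_facts,
        "missing_from_doc_output": code_facts - doc_facts,
    }

def feedback_loop(code_facts, doc_facts, reviewer):
    """Steps 603-604: route identified gaps through a reviewer.

    `reviewer` maps each gap to an accepted correction; in the actual
    system this is the human-in-the-loop step, and the corrections would
    drive parameter updates in the first or second LLM.
    """
    gaps = gap_analysis(code_facts, doc_facts)
    corrections = {g: reviewer(g) for gs in gaps.values() for g in gs}
    return gaps, corrections

code_facts = {"validates account number", "posts ledger entry"}
doc_facts = {"validates account number", "notifies customer"}
gaps, corrections = feedback_loop(code_facts, doc_facts, lambda g: f"add: {g}")
print(gaps["missing_from_code_output"])
```

The actual parameter modification of step 604 happens inside the LLMs and is therefore not represented in this sketch.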
[089] FIG. 7 is a flowchart that illustrates an exemplary process 700 for facilitating legacy code transformation, in accordance with an exemplary embodiment of the present disclosure. FIG. 7 is explained in conjunction
with elements from FIGS. 1, 2, 3, 4, 5, and 6. In an embodiment, the process 700 may
be implemented by the computing device 102. As mentioned earlier in reference to
FIG. 3, the process 300 may include receiving legacy code data and at least one natural
language document from one or more data sources, at step 301.
[090] Since the one or more data sources may include an external data source or an internal data source, the present process 700 is explained in relation to natural language documents received from either the external data source or the internal data source. At step 701, the process 700 may include receiving the
legacy code data, at least one internal natural language document from one or more
internal data sources, and at least one external natural language document from one or
more external data sources.
[091] Further, the process 700 may include generating the first natural language
output based on the legacy code data through the first LLM, an internal natural
language output based on the at least one internal natural language document through
an internal document processing LLM, and an external natural language output based
on the at least one external natural language document through an external document
processing LLM, at step 702.
[092] Further, the process 700 may include fine-tuning at least one of the first
LLM, the internal document processing LLM, or the external document processing
LLM based on the first natural language output, the internal natural language output,
and the external natural language output, through the third LLM, at step 703.
[093] Further, the process 700 may include generating the natural language
specification document corresponding to the legacy code data based on the first natural
language output, the internal natural language output, and the external natural language
output through the third LLM, at step 704.
[094] In some embodiments, once the natural language specification
document is generated, the process 700 may further include generating modern code
data corresponding to the legacy code data based on the natural language specification
document through a code-generating generative AI model. The modern code data may
be a transformation of the legacy code data in a modernized code language.
[095] FIG. 8 is a diagram that illustrates transformation of a legacy code
data to a modernized code, in accordance with an exemplary embodiment of the present
disclosure. FIG. 8 is explained in conjunction with elements from FIGS. 1, 2, 3, 4, 5,
6, and 7. In order to transform the legacy code data to the modernized code, initially,
the legacy code data 801, an internal natural language document 802, and an external
natural language document 803 may be provided as inputs to three distinct Large Language Models (LLMs), i.e., the first LLM 804, the internal document processing LLM 805, and the external document processing LLM 806.
[096] These LLMs may process the input data and generate corresponding
natural language outputs. Specifically, the first LLM model 804 may generate a first
natural language output 807 based on the legacy code data 801, while the internal
document processing LLM 805 may generate the internal natural language output 808
based on the internal natural language document 802. Similarly, the external document
processing LLM 806 may generate an external natural language output 809 based on
the external natural language document 803.
[097] Subsequently, a third LLM 810 (for example, an artificial intelligence
(AI) based convergent LLM model), may receive the first natural language output 807,
the internal natural language output 808, and the external natural language output 809
as inputs. The third LLM 810 may then perform a fine-tuning process on at least one
of the first LLM 804, the internal document processing LLM 805, or the external
document processing LLM 806. This fine-tuning may be based on human-assisted
feedback 812 corresponding to the first natural language output 807, the internal natural
language output 808, and the external natural language output 809, ensuring that the
LLM models improve and align their language generation capabilities.
[098] Further, the third LLM 810 may generate a natural language
specification document 811 corresponding to the legacy code data 801 based on the
first natural language output 807, the internal natural language output 808 and the
external natural language output 809. The generated natural language specification
document 811 (preferably, in English language) outlines detailed guidelines, rules, and
requirements for modernizing the legacy code.
[099] Once the natural language specification document 811 is generated, a code-generating generative AI model 813 may further generate modern code data
corresponding to the legacy code data 801. The modern code data represents a
transformation of the legacy code into a modernized code language 814 (such as, Java),
driven by the guidelines and requirements specified in the natural language
specification document 811.
[0100] During the code generation process, the code-generating generative AI model 813 (such as Alphacode, Codex, etc.) uses the natural language specification document 811 to produce the modern code. However, as AI models are not perfect and may have limitations or biases, human assistance plays a crucial role in verifying the quality of the output. Human assistance for validation 815 may include checking if the
generated modern code meets the desired transformation goals, adheres to coding best
practices, and complies with any specific guidelines or requirements. If any issues,
errors, or improvements are identified during the review, human experts provide
feedback and corrections to refine the code generation process.
[0101] In some embodiments, the natural language specification document
811 may serve additional purposes beyond guiding the modern code generation
process. In particular, it may be utilized to generate and recommend test cases that may
be aligned with the specified functionality and requirements in the legacy code. The
natural language specification document 811 may help to ensure that the modernized
code meets the desired performance and functionality standards.
[0102] Additionally, the natural language specification document 811 may be
utilized to generate a dependency graph. This graph may represent the relationships
and interdependencies between various components and functions in the modernized
code. By visualizing these dependencies, developers may better understand the code
structure and identify potential bottlenecks or areas for optimization.
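As a non-limiting illustration of such a dependency graph, the sketch below builds an adjacency map from hypothetical (component, dependency) pairs, as might be extracted from the specification document, and flags high fan-in components as potential bottlenecks:

```python
def build_dependency_graph(spec_entries):
    """Build an adjacency map: component -> set of components it depends on.

    `spec_entries` is a list of (component, depends_on) pairs; this input
    shape is assumed for illustration, not defined by the disclosure.
    """
    graph = {}
    for component, dependency in spec_entries:
        graph.setdefault(component, set()).add(dependency)
        graph.setdefault(dependency, set())
    return graph

def find_bottlenecks(graph, threshold=2):
    """Components many others depend on: candidates for close review."""
    fan_in = {node: 0 for node in graph}
    for deps in graph.values():
        for dep in deps:
            fan_in[dep] += 1
    return {node for node, count in fan_in.items() if count >= threshold}

entries = [
    ("InterestCalc", "AccountStore"),
    ("StatementGen", "AccountStore"),
    ("StatementGen", "InterestCalc"),
]
graph = build_dependency_graph(entries)
print(sorted(find_bottlenecks(graph)))
```

The component names are invented; a real graph would be derived from the entities and process flows recorded in the specification document.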
[0103] FIG. 9 is a diagram that illustrates generation of a natural language
specification document corresponding to legacy code data, in accordance with an
exemplary embodiment of the present disclosure. FIG. 9 is explained in conjunction
with elements from FIGS. 1, 2, 3, 4, 5, 6, 7, and 8. The natural language specification
document may be obtained by extracting valuable information from various data
sources that may include unstructured documents, human-generated documents, and
industry reference documents. This information may then be compared with the
knowledge extracted from the legacy codebase. Further a gap analysis may be
performed to identify areas where the extracted knowledge may be lacking or
incomplete. This gap analysis may be carried out by an AI model (such as the third LLM), which may carefully examine the outputs and identify the discrepancies. Based on the gap analysis, the third LLM may generate the natural language
specification document.
[0104] In more detail, the process of generating the natural language specification document is explained with reference to FIG. 9. The generation process may be divided into several key steps:
[0105] At an initial step, legacy code data 901, including complete code 907 such as programs, copybooks, scripts, JCLs, etc., may undergo a multi-step transformation process using transformer-based generative AI models (such as CodeT5, GPT, or models available on HuggingFace). These AI models are specifically trained on COBOL code to comprehend its structure and semantics. In this step, the entire legacy code data 901
may be fed into the transformer-based generative AI model (for example, a first LLM).
[0106] In addition to the legacy code data 901, the first LLM 910 may also
receive input from a current data model 902, which includes schema and metadata 908.
By incorporating the current data model 902, the first LLM 910 may gain a deeper
understanding of existing data system, enhancing its transformation capabilities.
[0107] Besides the legacy code data 901 and current data model 902,
the transformer-based generative AI models (such as, an internal document processing
LLM 911 and an external document processing LLM 912) may also receive data
from internal natural language documents 903 and external natural language
documents 904. The internal natural language documents 903 may include various
sources, such as standard operating procedures of the bank, life-cycle documents,
internal knowledge repository, project-related documents, and incident reports. On the
other hand, the external natural language documents 904 may include industry
references such as the Banking Industry Architecture Network (BIAN) framework and
regulatory or legal documents from external bodies.
[0108] Before inputting the legacy code data 901 into the first LLM 910, a pre-processing step may be performed to organize the code effectively. During this pre-processing, the legacy code data 901 may be segregated into schedule/trigger 905, interfaces and file 906, and the complete code 907. The complete code 907 may further be classified into variables, file operations, SQL/DB operations, function blocks, user actions, and comments. This pre-processing step may prepare the legacy code for in-depth analysis and transformation, streamlining the subsequent stages of the process.
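The classification of the complete code 907 into the listed categories may be sketched, in a highly simplified form, with keyword rules such as the following; a production classifier trained on COBOL would be far richer:

```python
import re

# Simple keyword rules approximating the categories named in the
# pre-processing step; the patterns are illustrative heuristics only.
CATEGORY_RULES = [
    ("comments", re.compile(r"^.{6}\*")),                 # '*' in column 7
    ("sql_db_operations", re.compile(r"\bEXEC\s+SQL\b", re.IGNORECASE)),
    ("file_operations", re.compile(r"\b(OPEN|READ|WRITE|CLOSE)\b", re.IGNORECASE)),
    ("variables", re.compile(r"^\s*\d{2}\s+\S+\s+PIC\b", re.IGNORECASE)),
    ("function_blocks", re.compile(r"\bPERFORM\b", re.IGNORECASE)),
]

def classify_line(line: str) -> str:
    """Return the first matching category for a COBOL source line."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(line):
            return category
    return "other"

print(classify_line("       01 WS-AMOUNT PIC 9(7)V99."))
```

Rule order matters here: comment lines are recognized first so that keywords inside comments are not misclassified.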
[0109] In an exemplary embodiment, the internal and external documents
may also undergo pre-processing 909 before they are input to the internal and external
document processing LLMs. The pre-processing of the internal and external documents
may include several pre-processing steps, including indexing, crawling, and sentence
vectorization.
[0110] The indexing step may include creating a structured representation of
the documents, wherein unique identifiers may be assigned to each document, and
essential information such as document titles, authors, dates, and keywords may be extracted and stored in an index or database. This may facilitate efficient retrieval and
access to specific documents based on their attributes.
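A minimal sketch of this indexing step follows, assuming documents arrive as dictionaries with title, author, date, and text fields (a hypothetical input shape); the keyword heuristic merely stands in for a real extraction pipeline:

```python
import re

def build_index(documents):
    """Assign unique IDs and extract simple attributes for each document.

    Keywords here are just frequent words of five or more letters; real
    indexing would use a proper NLP pipeline.
    """
    index = {}
    for doc_id, doc in enumerate(documents):
        words = re.findall(r"[a-z]{5,}", doc["text"].lower())
        keywords = sorted(set(words), key=words.count, reverse=True)[:5]
        index[doc_id] = {
            "title": doc["title"],
            "author": doc["author"],
            "date": doc["date"],
            "keywords": keywords,
        }
    return index

def find_by_keyword(index, keyword):
    """Retrieve document IDs whose extracted keywords include `keyword`."""
    return [i for i, entry in index.items() if keyword in entry["keywords"]]

docs = [{"title": "Account SOP", "author": "Ops", "date": "2021-04-01",
         "text": "Procedure for account opening and account closure."}]
index = build_index(docs)
print(find_by_keyword(index, "account"))
```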
[0111] The crawling step may include employing a software program,
referred to as a web crawler or spider, to systematically navigate through websites or
online sources. The web crawler may visit web pages, extract relevant content, follow hyperlinks to other pages, and store the acquired data for further processing. The
crawling process may gather pertinent textual information from different repositories,
databases, or websites.
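As a non-limiting sketch of the crawling step, the snippet below uses Python's standard `html.parser` to extract the hyperlinks a crawler would follow; the network fetch itself is omitted, since in practice each page would first be downloaded (e.g., with urllib) before being fed to the parser:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hyperlinks from a fetched page, as a crawler's first step."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the hyperlinks the crawler would enqueue next.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/regulations">Rules</a> <a href="/bian">BIAN</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)
```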
[0112] The sentence vectorization step relates to an application of natural
language processing (NLP) techniques. It may convert individual sentences from the
internal and external documents into numerical representations (such as, vectors)
suitable for ML models. Each word in a sentence may be converted into a numerical
vector, and these word vectors may be combined to form a single vector representing
the entire sentence. This representation may allow for effective processing of textual
data by AI models (such as the internal document processing LLM 911 and the external
document processing LLM 912).
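A minimal bag-of-words sketch of sentence vectorization follows; real systems would use learned word embeddings pooled into a sentence vector, so the count vectors here are only illustrative:

```python
def sentence_vector(sentence, vocabulary):
    """Bag-of-words vector: one count per vocabulary word.

    Each position in the result counts how often the corresponding
    vocabulary term occurs in the sentence.
    """
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["account", "interest", "monthly", "report"]
v1 = sentence_vector("Monthly interest is posted to the account", vocabulary)
v2 = sentence_vector("The report lists interest per account", vocabulary)
# A raw dot product of the two count vectors gives a crude similarity
dot = sum(a * b for a, b in zip(v1, v2))
print(v1, dot)
```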
[0113] Further, the first LLM 910 may generate a first natural language output
913 based on the legacy code data 901. The internal document processing LLM 911
may generate an internal natural language output 914 based on the internal natural
language documents 903. The external document processing LLM 912 may generate
an external natural language output 915, based on the external natural language
documents 904.
[0114] Further, the first natural language output 913, the internal natural
20 language output 914, and the external natural language output 915 may be fed as input
to a third LLM 916. The third LLM 916 may fine-tune at least one of the first LLM
910, the internal document processing LLM 911, or the external document processing
LLM 912. In the fine-tuning process, the output of domain context from the legacy code (such as the first natural language output 913) and the knowledge extracted from the documents (such as the internal natural language output 914 and the external natural language output 915) may be compared for gap analysis.
[0115] The gap analysis may be conducted to identify differences and
variations between the information derived from the internal and external documents
and the information gathered from the legacy codebase. The gap analysis outcome may
be fed back into the generative AI models (such as the first LLM 910, the internal
document processing LLM 911, or the external document processing LLM 912) as a
feedback loop. This feedback loop may be an essential part of the iterative process that
drives continuous improvement and refinement of the generative AI models. It should
be noted that the feedback may be a human-assisted feedback 918. The feedback loop
facilitates the modification of one or more parameters in the first LLM 910, the internal
document processing LLM 911, or the external document processing LLM 912 based
on the gaps identified during the analysis.
[0116] By incorporating this feedback into the LLMs, the generative AI
model may learn from its errors and iteratively improve its performance. This
continuous learning process may allow the LLM models to become more accurate and
contextually relevant in generating the natural language outputs, ultimately resulting in
a more comprehensive and precise natural language specification document.
[0117] Upon completing the fine-tuning process, the third LLM 916 may
generate a natural language specification document 917 corresponding to the legacy
code data 901 based on the first natural language output 913, the internal natural language output 914, and the external natural language output 915.
[0118] The natural language specification document 917 plays a vital role in
modernizing the legacy codebase. This specification document may include various
essential elements crucial for understanding and transforming the legacy code into a
modernized code language. It includes detailed descriptions of entities, representing
objects or concepts relevant to the domain being modeled, along with their associated
attributes. Additionally, the natural language specification document 917 may outline
various functions performed by the legacy code, the rules governing its behavior, and
the events that trigger specific actions. Process flows may be laid out to describe the
sequence of steps executed to achieve specific outcomes, and functional clusters may
be employed to group related functions together based on similarity or purpose. Lastly,
a bounded context may define the scope and context within which the legacy code
operates. Collectively, these elements form a structured representation of the legacy
codebase, facilitating its transformation and ensuring the modern code accurately
reflects its functionalities and behaviors.
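The elements enumerated above may be represented in memory, for illustration, roughly as follows; the field names are assumptions made for this sketch and are not drawn from the actual system:

```python
from dataclasses import dataclass, field

# A possible in-memory shape for the elements the specification document
# is described as containing: entities with attributes, functions, rules,
# events, process flows, functional clusters, and a bounded context.
@dataclass
class Entity:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class SpecificationDocument:
    entities: list = field(default_factory=list)
    functions: list = field(default_factory=list)
    rules: list = field(default_factory=list)
    events: list = field(default_factory=list)
    process_flows: list = field(default_factory=list)
    functional_clusters: dict = field(default_factory=dict)
    bounded_context: str = ""

spec = SpecificationDocument(bounded_context="retail-deposits")
spec.entities.append(Entity("Account", ["number", "balance", "status"]))
spec.rules.append("Dormant accounts accrue no interest")
spec.functional_clusters["interest"] = ["ComputeInterest", "PostInterest"]
print(spec.bounded_context, len(spec.entities))
```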
[0119] FIG. 10 is a diagram that illustrates training of the first LLM 1002, in
accordance with an exemplary embodiment of the present disclosure. FIG. 10 is
explained in conjunction with elements from FIGS. 1, 2, 3, 4, 5, 6, 7, 8, and 9. The first
LLM 1002 may be trained using a training dataset through a self-supervised learning
technique. The training dataset may include a source code dataset (e.g., COBOL
programs, copybooks, scripts, JCLs, etc.,) and natural language specification (e.g.,
25 unstructured documents, human-generated documentations, industry references, SOPs,
internal knowledge repository, project documents, application-specific documents,
incident reports, etc.,) corresponding to the source code dataset.
[0120] The training of the first LLM 1002 may include two approaches. The
first approach may be to configure the first LLM 1002 to generate a code explanation
1003 in the first natural language output. For this purpose, the training dataset includes
legacy code language information along with corresponding explanations. The second
approach may be to configure the first LLM 1002 to generate a domain context 1004
in the first natural language output. In this case, the training dataset includes textual
data.
[0121] To elaborate on the training process of the first LLM 1002, the
complete pre-processed code 1001, including COBOL programs, COBOL copybooks,
and related components, may be fed as input to a pre-trained sequence-to-sequence
transformer-based generative model with an encoder-decoder architecture (i.e., the first
LLM 1002). The first LLM 1002 may utilize a self-supervised learning technique to further refine its understanding of the legacy code. During this training process, the first
first LLM 1002 may focus on expanding code functions and summarizing code
comments, leading to a creation of a domain context and entities representation.
[0122] In the training of the first LLM 1002, various datasets such as ‘The
Pile’, ‘CodeSearchNet’, ‘CodeXGLUE’, ‘Concode’, etc., which may include extensive
source code and corresponding natural language descriptions may be utilized. One
example of a pre-trained generative model may be a ‘CodeT5’ model, that may be
accessible on ‘HuggingFace’ playground.
[0123] The approach may utilize the pre-trained generative model (e.g., CodeT5), which already possesses knowledge of programming languages. By employing unsupervised learning on a relatively larger dataset that includes over 1000 legacy code components, a foundational model may be created. This foundational model (i.e., the first LLM 1002) may be adapted in various manners. In this process, two adaptations may be created (one may be the code explanation 1003, and the other may be the domain context 1004), resulting in three distinct outputs. Two of the three outputs may be obtained from the code explanation 1003: a text document explaining the code, and the functionality and process flows across COBOL files derived using call dependencies. The third output, obtained from summarization of the domain context 1004, may include the text context. Based on these outputs, a Domain-Specific Language (DSL) specification document may be constructed.
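By way of a non-limiting illustration, the assembly of the three outputs into a DSL specification document may be sketched as follows; the DSL syntax (EXPLAIN/FLOW/CONTEXT sections) is invented for illustration, as the disclosure does not define a concrete grammar:

```python
def build_dsl_specification(code_explanation, process_flows, domain_context):
    """Assemble the three model outputs into a toy DSL document.

    Inputs: explanation lines, process flows (lists of program names in
    call order), and a domain-context summary string.
    """
    lines = ["SPEC v1"]
    lines.append("EXPLAIN:")
    lines.extend("  " + line for line in code_explanation)
    lines.append("FLOW:")
    lines.extend("  " + " -> ".join(flow) for flow in process_flows)
    lines.append("CONTEXT:")
    lines.append("  " + domain_context)
    return "\n".join(lines)

spec = build_dsl_specification(
    code_explanation=["PROG-A validates the account number"],
    process_flows=[["PROG-A", "PROG-B", "PROG-C"]],
    domain_context="retail banking: deposits and interest posting",
)
print(spec)
```

The program names and context string are invented; in the actual system these inputs would come from the code explanation 1003 and the domain context 1004.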
[0124] Once the legacy code is understood and represented as the DSL
specification document, the existing LLM models trained on the source code may be
utilized to transform it into modernized programming languages such as Java or
Python. This approach enables a seamless transformation of legacy code data into the
modernized code language.
[0125] As will be also appreciated, the above described techniques may take
the form of computer or controller implemented processes and apparatuses for
practicing those processes. The disclosure can also be embodied in the form of
computer program code containing instructions embodied in tangible media, such as floppy diskettes, solid state drives, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and
executed by a computer or controller, the computer becomes an apparatus for practicing
the invention. The disclosure may also be embodied in the form of computer program
code or signal, for example, whether stored in a storage medium, loaded into and/or
executed by a computer or controller, or transmitted over some transmission medium,
such as over electrical wiring or cabling, through fiber optics, or via electromagnetic
radiation, wherein, when the computer program code is loaded into and executed by a
computer, the computer becomes an apparatus for practicing the invention. When
implemented on a general-purpose microprocessor, the computer program code
segments configure the microprocessor to create specific logic circuits.
[0126] Thus, the disclosed method and system try to overcome the technical problem of understanding and extracting information from complex legacy codebases, which has been a challenge for larger banks and financial institutions undergoing
large-scale transformation programs. By employing generative AI models, such as the
LLMs and code-generating AI models, this approach offers several significant
advantages to financial services organizations. One of the key advantages is a
substantial reduction in effort, timeline, and cost associated with reverse engineering
legacy systems. With the ability to generate natural language outputs and domain
45 context from the legacy code data and various natural language documents, the manual
effort required for deciphering complex legacy code and documentation is greatly
minimized. This leads to a streamlined and efficient transformation process, enabling
organizations to accelerate their modernization initiatives.
[0127] Another notable advantage is the reduced dependency on legacy skills
and the avoidance of vendor lock-in with legacy platforms. Traditional methods of
understanding legacy code often rely heavily on specific skills and expertise in outdated
programming languages like COBOL. By utilizing generative AI models, this solution
allows organizations to shift away from legacy skill dependencies, providing a more
flexible and future-proof approach to legacy modernization.
[0128] Moreover, the disclosed techniques assist in planning business
capabilities to be delivered incrementally within large-scale transformation programs.
By generating natural language specification documents and domain models, financial
services organizations may gain a comprehensive understanding of their legacy
codebase, enabling better planning and prioritization of modernization efforts. This
incremental delivery approach helps in avoiding disruptions and facilitates a smooth
and systematic transition to modernized systems.
[0129] Additionally, the expedited time-to-market provided by the disclosed
techniques is a significant benefit. The use of generative AI models allows for faster
and more accurate comprehension of legacy code and documents, leading to quicker
decision-making and code modernization. As a result, financial institutions may speed
up their digital transformation initiatives, enhancing their competitive edge and
25 responsiveness to rapidly changing market demands.
[0130] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps provide solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself, as the claimed steps provide a technical solution to a technical problem.
[0131] The specification has described a method and system for facilitating legacy code transformation. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0132] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0133] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
We Claim:
1. A method for facilitating legacy code transformation, the method comprising:
receiving (301), by a computing device (102), legacy code data (801) and at least one natural language document from one or more data sources;
generating (302), by the computing device (102), a first natural language output (807) based on the legacy code data (801) through a first Large Language Model (LLM) (804), and a second natural language output based on the at least one natural language document through a second LLM, wherein the first natural language output (807) comprises domain context or code explanation corresponding to the legacy code data (801), and wherein the second natural language output comprises extracted knowledge from the at least one natural language document;
fine-tuning (303), by the computing device (102), at least one of the first LLM (804) or the second LLM based on the first natural language output (807) and the second natural language output, through a third LLM (810); and
generating (304), by the computing device (102), a natural language specification document (811) corresponding to the legacy code data (801) based on the first natural language output (807) and the second natural language output through the third LLM (810).
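
The four steps of claim 1 can be pictured as a small orchestration pipeline. The sketch below is editorial illustration only, not part of the claimed subject matter: `first_llm`, `second_llm`, and `third_llm` are hypothetical stubs standing in for the claimed Large Language Models, and the fine-tuning step (303) is elided.

```python
# Illustrative sketch of the claim-1 pipeline (steps 301-304).
# The three *_llm callables are hypothetical stubs, not real models.

def first_llm(legacy_code: str) -> str:
    # Would return domain context / code explanation (first output 807).
    return f"Code explanation: {legacy_code}"

def second_llm(nl_document: str) -> str:
    # Would return knowledge extracted from the document (second output).
    return f"Extracted knowledge: {nl_document}"

def third_llm(first_out: str, second_out: str) -> str:
    # Would consolidate both outputs into the specification document (811).
    return f"SPECIFICATION\n{first_out}\n{second_out}"

def run_pipeline(legacy_code: str, nl_document: str) -> str:
    first_out = first_llm(legacy_code)       # step 302, first output
    second_out = second_llm(nl_document)     # step 302, second output
    # Step 303 (fine-tuning via the third LLM) is omitted in this sketch.
    return third_llm(first_out, second_out)  # step 304
```

The stubs simply echo their inputs; in practice each would be a call to a separately trained model, and the third LLM's consolidation would drive the fine-tuning loop of claim 7.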
2. The method of claim 1, wherein each of the first LLM (804), the second LLM, and the third LLM (810) is an encoder-decoder transformer architecture-based generative Artificial Intelligence (AI) model, and wherein each of the one or more data sources is one of an external data source or an internal data source.
3. The method of claim 2, further comprising:
receiving, by the computing device (102), the legacy code data (801), at least one internal natural language document (802) from one or more internal data sources, and at least one external natural language document (803) from one or more external data sources;
generating, by the computing device (102), the first natural language output (807) based on the legacy code data (801) through the first LLM (804), an internal natural language output (808) based on the at least one internal natural language document (802) through an internal document processing LLM (805), and an external natural language output (809) based on the at least one external natural language document (803) through an external document processing LLM (806);
fine-tuning, by the computing device (102), at least one of the first LLM (804), the internal document processing LLM (805), or the external document processing LLM (806) based on the first natural language output, the internal natural language output (808), and the external natural language output (809), through the third LLM (810); and
generating, by the computing device (102), the natural language specification document (811) corresponding to the legacy code data (801) based on the first natural language output, the internal natural language output (808), and the external natural language output (809) through the third LLM (810).
4. The method of claim 1, further comprising pre-processing, by the computing device (102), the legacy code data (801) and the at least one natural language document.
5. The method of claim 1, further comprising training, by the computing device (102), the first LLM (804) using a training dataset through a self-supervised learning technique, wherein the training dataset comprises a source code dataset and natural language specification corresponding to the source code dataset, and wherein training the first LLM (804) comprises at least one of:
configuring, by the computing device (102), the first LLM (804) to generate the code explanation corresponding to the legacy code data (801) in the first natural language output (807), wherein the training dataset comprises legacy code language information and corresponding explanation; or
configuring, by the computing device (102), the first LLM (804) to generate the domain context in the first natural language output (807), wherein the training dataset comprises textual data.
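
Claim 5 recites a self-supervised learning technique without fixing its form. For the encoder-decoder transformer architecture of claim 2, one common choice is T5-style span corruption, where random token spans are replaced with sentinel markers and the model learns to reconstruct them. The sketch below is an editorial illustration of that objective under those assumptions; the sentinel format and masking probabilities are not taken from the specification.

```python
import random

def span_corrupt(tokens, mask_prob=0.15, seed=0):
    """Sketch of T5-style span corruption for self-supervised training:
    masked spans become sentinels in the input; the target lists the
    sentinels with the tokens they replaced."""
    rng = random.Random(seed)
    inputs, targets = [], []
    sid = 0
    i = 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            # Start a masked span; extend it while a fair coin keeps landing.
            span = [tokens[i]]
            i += 1
            while i < len(tokens) and rng.random() < 0.5:
                span.append(tokens[i])
                i += 1
            inputs.append(f"<extra_id_{sid}>")
            targets.append(f"<extra_id_{sid}> " + " ".join(span))
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)
```

Applied to the paired source code and specification text of the claim-5 training dataset, corrupted inputs and their targets form the (input, label) examples consumed by a standard sequence-to-sequence training loop.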
6. The method of claim 1, wherein generating the first natural language output (807) based on the legacy code data (801) further comprises at least one of:
generating, by the computing device (102), expanded code functions corresponding to the legacy code data (801) in natural language, wherein the first natural language output (807) comprises the expanded code functions; or
generating, by the computing device (102), summaries of code comments in the legacy code data (801) in natural language, wherein the first natural language output (807) comprises the summaries.
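
Before code comments can be summarized as in claim 6, they must be located in the legacy source. As an editorial illustration, assuming COBOL fixed-format source (the background names COBOL as a typical legacy language), comment lines carry an asterisk in the indicator column (column 7); the helper below collects them for downstream summarization. The fixed-format handling is a simplification, not the claimed method.

```python
def extract_cobol_comments(source: str) -> list:
    """Collect COBOL fixed-format comment lines (an '*' in column 7,
    i.e. index 6) so a summarization model can turn them into natural
    language summaries. Free-format and inline comments are ignored
    in this simplified sketch."""
    comments = []
    for line in source.splitlines():
        if len(line) >= 7 and line[6] == "*":
            comments.append(line[7:].strip())
    return comments
```

The returned comment list would be concatenated into a prompt for the first LLM (804), whose summary then forms part of the first natural language output (807).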
7. The method of claim 1, wherein the fine-tuning is based on human-assisted feedback corresponding to the first natural language output (807) and the second natural language output, and wherein fine-tuning at least one of the first LLM (804) and the second LLM further comprises:
performing, by the computing device (102) and via the third LLM (810), a gap analysis based on the first natural language output (807) and the second natural language output;
identifying, by the computing device (102) and via the third LLM (810), one or more gaps in at least one of the first natural language output (807) or the second natural language output based on the gap analysis;
providing, by the computing device (102) and via the third LLM (810), feedback to at least one of the first LLM (804) or the second LLM based on the identified one or more gaps; and
modifying, by the computing device (102), one or more parameters of the at least one of the first LLM (804) or the second LLM based on the feedback.
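
The gap-analysis loop of claim 7 can be sketched in miniature. This is an editorial illustration only: here "gap analysis" is approximated by a crude term-overlap check and "modifying one or more parameters" by adjusting a toy dictionary entry, whereas the claim contemplates a third LLM performing the analysis and genuine model-parameter updates.

```python
def gap_analysis(first_out: str, second_out: str) -> list:
    """Toy stand-in for the third LLM's gap analysis: flag terms present
    in the document knowledge (second output) but absent from the code
    explanation (first output)."""
    doc_terms = set(second_out.lower().split())
    code_terms = set(first_out.lower().split())
    return sorted(doc_terms - code_terms)

def fine_tune_step(params: dict, first_out: str, second_out: str) -> dict:
    """One iteration of the claim-7 loop: analyze, identify gaps, feed
    back, and modify (toy) parameters based on the feedback."""
    gaps = gap_analysis(first_out, second_out)
    if gaps:
        params = dict(params, feedback=gaps,
                      coverage_weight=params.get("coverage_weight", 1.0)
                      + 0.1 * len(gaps))
    return params
```

When the two outputs agree, the parameters are returned unchanged; otherwise the identified gaps are recorded as feedback and a hypothetical coverage weight is nudged upward.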
8. The method of claim 1, further comprising:
generating, by the computing device (102), a Domain-Specific Language (DSL) specification document based on the first natural language output (807) and the second natural language output through the third LLM (810); and
generating modern code data corresponding to the legacy code data (801) based on the natural language specification document (811) through a code-generating generative AI model (813), wherein the modern code data is a transformation of the legacy code data (801) in a modernized code language (814).
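
The final stage of claim 8 hands the specification document (811) to a code-generating model (813). As an editorial illustration, one plausible interface is a prompt carrying the target modernized language (814) and the specification text; the prompt template and the stub model below are assumptions, not the claimed implementation.

```python
def build_codegen_prompt(spec: str, target_language: str) -> str:
    """Assemble a prompt for a code-generating model. The template is a
    hypothetical example, not taken from the specification."""
    return (
        f"Target language: {target_language}\n"
        "Generate code implementing the following specification:\n"
        f"{spec}"
    )

def code_generating_model(prompt: str) -> str:
    # Hypothetical stub standing in for the code-generating generative
    # AI model (813); a real system would call a trained model here.
    return f"// generated from prompt ({len(prompt)} chars)"
```

Driving the stub end to end shows the data flow: specification in, modern code data out, with the target language fixed up front.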
9. A system for facilitating legacy code transformation, the system comprising:
a processing circuitry (201); and
a memory (202) communicatively coupled to the processing circuitry (201), wherein the memory (202) stores processor instructions, which when executed by the processing circuitry (201), cause the processing circuitry (201) to:
receive legacy code data (801) and at least one natural language document from one or more data sources;
generate a first natural language output (807) based on the legacy code data (801) through a first Large Language Model (LLM) (804), and a second natural language output based on the at least one natural language document through a second LLM, wherein the first natural language output (807) comprises domain context or code explanation corresponding to the legacy code data (801), and wherein the second natural language output comprises extracted knowledge from the at least one natural language document;
fine-tune at least one of the first LLM (804) or the second LLM based on the first natural language output (807) and the second natural language output, through a third LLM (810); and
generate a natural language specification document (811) corresponding to the legacy code data (801) based on the first natural language output (807) and the second natural language output through the third LLM (810).
10. The system of claim 9, wherein each of the first LLM (804), the second LLM, and the third LLM (810) is an encoder-decoder transformer architecture-based generative Artificial Intelligence (AI) model, and wherein each of the one or more data sources is one of an external data source or an internal data source.
11. The system of claim 9, wherein the processor instructions, on execution, further cause the processing circuitry (201) to:
generate a Domain-Specific Language (DSL) specification document based on the first natural language output (807) and the second natural language output through the third LLM (810); and
generate modern code data corresponding to the legacy code data (801) based on the natural language specification document (811) through a code-generating generative AI model (813), wherein the modern code data is a transformation of the legacy code data (801) in a modernized code language (814).

Documents

Application Documents

# Name Date
1 202341053446-STATEMENT OF UNDERTAKING (FORM 3) [09-08-2023(online)].pdf 2023-08-09
2 202341053446-REQUEST FOR EXAMINATION (FORM-18) [09-08-2023(online)].pdf 2023-08-09
3 202341053446-PROOF OF RIGHT [09-08-2023(online)].pdf 2023-08-09
4 202341053446-POWER OF AUTHORITY [09-08-2023(online)].pdf 2023-08-09
5 202341053446-FORM 18 [09-08-2023(online)].pdf 2023-08-09
6 202341053446-FORM 1 [09-08-2023(online)].pdf 2023-08-09
7 202341053446-DRAWINGS [09-08-2023(online)].pdf 2023-08-09
8 202341053446-DECLARATION OF INVENTORSHIP (FORM 5) [09-08-2023(online)].pdf 2023-08-09
9 202341053446-COMPLETE SPECIFICATION [09-08-2023(online)].pdf 2023-08-09