Abstract: Automated translation of legacy code to modern programming languages is the need of the hour for modernizing enterprise systems. Traditional rule-based tools perform statement-wise translation, overlooking possible modularization and refactoring of source code that would yield readable target code. State-of-the-art code models, in turn, encounter difficulties with regard to logical correctness and readability during translation. The present disclosure provides a large language model-based system and method that leverage temperature sampling and refinement-based strategies to ensure logical correctness of transpilation and maximize readability of the target code. More specifically, sequential refinement of the transpiled high-resource source code with execution-guided logic feedback is performed, followed by application of LLM-based readability feedback/criteria, to yield better performance in terms of logical correctness and readability.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
TRANSPILING SOURCE CODE TO FUNCTIONALLY CORRECT AND READABLE CODE IN A TARGET LANGUAGE USING LARGE LANGUAGE MODELS
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to source code, and, more particularly, to transpiling source code to functionally correct and readable code in a target language using large language models.
BACKGROUND
Automated translation of legacy code to modern programming languages is the need of the hour for modernizing enterprise systems. Traditional rule-based tools for this perform statement-wise translation, overlooking possible modularization and refactoring of a source code to translate to human-readable target code. State-of-the-art models in the domain of code encounter difficulties with regard to logical correctness and readability when directly translating low-resource source code to target code.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
For example, in one aspect, there is provided a processor implemented method for transpiling source code to functionally correct and readable code in a target language using large language models. The method comprises receiving, via one or more hardware processors, a first source code in a first language; transpiling, by using a temperature sampling via the one or more hardware processors, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases, wherein each second source code in the set of second source codes is translated to a second language; and performing a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally corrected and readability-refined source code, wherein performing the first level of functional correctness and readability refinement comprises: determining a current functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes; ranking the set of refined second source codes to obtain a set of ranked source codes; applying one or more large language model (LLM)-based task-specific readability criteria on a first ranked source code amongst the set of ranked source codes to obtain one or more specific portions of the first ranked source code, wherein the one or more specific portions indicate one or more problematic aspects of readability; and performing the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functionally corrected and readability-refined source code.
In an embodiment, the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment.
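For illustration, the task-specific readability criteria listed above could be supplied to an LLM as an explicit checklist inside a feedback prompt. The following is a minimal, non-limiting sketch; the prompt wording and the `build_readability_prompt` helper are illustrative assumptions, not part of the claimed implementation:

```python
# Illustrative sketch: assembling an LLM feedback prompt from the
# task-specific readability criteria. The prompt text is an assumption;
# any chat/completion client could consume the returned string.
READABILITY_CRITERIA = [
    "overall readability", "abstraction", "modularization", "dead code",
    "logical segmentation", "complexity", "comments", "naming convention",
    "testability", "alignment",
]

def build_readability_prompt(target_code: str) -> str:
    criteria = "\n".join(f"- {c}" for c in READABILITY_CRITERIA)
    return (
        "Review the code below against each criterion and list the specific "
        "portions (e.g., line ranges) that are problematic for readability:\n"
        f"{criteria}\n\nCode:\n{target_code}"
    )
```

The LLM's response to such a prompt would identify the specific portions indicating problematic aspects of readability, which are then targeted for refinement.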
In an embodiment, the step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) an absence of error(s) in each second source code, or (ii) one or more errors present in each second source code.
In an embodiment, the one or more errors comprise at least one of one or more Compilation Errors (CE), one or more Runtime Errors (RE), one or more Function Errors (FE), and one or more Non-Termination Errors (NTE).
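One way to bucket an executed second source code into the four error classes above is to observe its compile step, run step, exit status, and output on a sample test case. The sketch below keeps the classification as a pure function over those observations (which could come from, e.g., a compiler invocation and a timed run); it is illustrative only:

```python
# Hedged sketch: mapping the outcome of compiling and running one translated
# sample onto the error classes named above. The inputs are assumed to come
# from a compile step and a run step executed with a wall-clock timeout.
def classify_error(compiled: bool, timed_out: bool, exit_code: int,
                   stdout: str, expected_output: str) -> str:
    if not compiled:
        return "CE"    # Compilation Error: target code does not build
    if timed_out:
        return "NTE"   # Non-Termination Error: run exceeded the time budget
    if exit_code != 0:
        return "RE"    # Runtime Error: crashed during execution
    if stdout.strip() != expected_output.strip():
        return "FE"    # Function Error: ran, but output disagrees with test
    return "OK"        # no error detected on this sample test case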
In an embodiment, the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors.
In an embodiment, the method further comprises performing a second level of functional correctness and functional refinement of the obtained functionally corrected and refined code by performing one of: (i) identifying the obtained functionally corrected and refined code as the final source code based on an associated output of the obtained functionally corrected and refined code and the sample test cases; and (ii) determining the functional correctness of the obtained functionally corrected and refined code to (a) identify the first ranked sourced code as the final source code, or (b) perform the functional refinement of the obtain the functionally corrected and refined code based on the sample test cases to obtain the final source code.
In an embodiment, the first source code is a low-resource procedural source code, and each second source code in the set of second source code is a high resource object-oriented programming language code.
In another aspect, there is provided a processor implemented system for transpiling source code to functionally correct and readable in target language using large language models. The system comprises: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a first source code in a first language; transpile, by using a temperature sampling, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases, wherein each second source code in the set of the second source codes is translated to a second language; and performing a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally and readability refined source code, wherein the first level of functional correctness and refinement comprises: determining a functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes; ranking the set of refined second source codes to obtain a set of ranked sourced codes; applying one or more large language model (LLM)-based task-specific readability criteria on a first ranked sourced code amongst the set of ranked sourced codes to obtain one or more specific portions of the first ranked sourced code, wherein the one or more specific portions indicate one or more problematic aspects of readability; and performing the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functional corrected and refined source code.
In an embodiment, the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment.
In an embodiment, the step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) an absence of error(s) in each second source code, or (ii) one or more errors present in each second source code.
In an embodiment, the one or more errors comprise at least one of one or more Compilation Errors (CE), one or more Runtime Errors (RE), one or more Function Errors (FE), and one or more Non-Termination Errors (NTE).
In an embodiment, the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors.
In an embodiment, the one or more hardware processors are further configured by the instructions to perform a second level of functional correctness and functional refinement of the obtained functionally corrected and refined code by performing one of: (i) identifying the obtained functionally corrected and refined code as the final source code based on an associated output of the obtained functionally corrected and refined code and the sample test cases; and (ii) determining the functional correctness of the obtained functionally corrected and refined code to (a) identify the first ranked sourced code as the final source code, or (b) perform a functional refinement of the obtain the functionally corrected and refined code based on the sample test cases to obtain the final source code.
In an embodiment, the first source code is a low-resource procedural source code, and each second source code in the set of second source code is a high resource object-oriented programming language code.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause transpiling source code to functionally correct and readable in target language using large language models by receiving a first source code in a first language; transpiling, by using a temperature sampling, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases, wherein each second source code in the set of second source codes is translated to a second language; and performing a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally and readability refined source code, wherein performing the first level of functional correctness and refinement comprises: determining a current functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes; ranking the set of refined second source codes to obtain a set of ranked sourced codes; applying one or more large language model (LLM)-based task-specific readability criteria on a first ranked sourced code amongst the set of ranked sourced codes to obtain one or more specific portions of the first ranked sourced code, wherein the one or more specific portions indicate one or more problematic aspects of readability; and performing the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functional corrected and refined source code.
In an embodiment, the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment.
In an embodiment, wherein the step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) an absence of error(s) in each second source code, or (ii) one or more errors present in each second source code.
In an embodiment, the one or more errors comprise at least one of one or more Compilation Errors (CE), one or more Runtime Errors (RE), one or more Function Errors (FE), and one or more Non-Termination Errors (NTE).
In an embodiment, the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors.
In an embodiment, the one or more instructions which when executed by the one or more hardware processors further cause performing a second level of functional correctness and functional refinement of the obtained functionally corrected and refined code by performing one of: (i) identifying the obtained functionally corrected and refined code as the final source code based on an associated output of the obtained functionally corrected and refined code and the sample test cases; and (ii) determining the functional correctness of the obtained functionally corrected and refined code to (a) identify the first ranked sourced code as the final source code, or (b) perform the functional refinement of the obtain the functionally corrected and refined code based on the sample test cases to obtain the final source code.
In an embodiment, the first source code is a low-resource procedural source code, and each second source code in the set of second source code is a high resource object-oriented programming language code.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 depicts an exemplary system for transpiling source code to functionally correct and readable in target language using large language models, in accordance with an embodiment of the present disclosure.
FIG. 2 depicts an exemplary high level block diagram of the system for transpiling source code to functionally correct and readable in target language using large language models, in accordance with an embodiment of the present disclosure.
FIG. 3 depicts an exemplary flow chart illustrating a method for transpiling source code to functionally correct and readable in target language using large language models, using the systems of FIGS. 1-2, in accordance with an embodiment of the present disclosure.
FIGS. 4A through 4Cdepict contribution of various components of the system 100 such as a transpilation module, a first logic refinement module, a readability refinement module, and a second logic refinement module towards correction of logical errors, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Converting codebases from legacy to modern languages is time consuming and costly (e.g., refer “Baptiste Roziere et al. 2020. Unsupervised translation of programming languages. Advances in Neural Information Processing Systems 33 (2020), 20601–20611.”). Most of the traditional rule-based approaches require knowledge of both source and target languages to design hand-crafted rules (e.g., refer “Aniketh Malyala et al. 2023. On ML-Based Program Translation: Perils and Promises. arXiv preprint arXiv:2302.10812 (2023).”). To overcome this, machine learning (ML) models were proposed to learn the statistical alignments between programming languages (e.g., refer “Karan Aggarwal et al. 2015. Using machine translation for converting python 2 to python 3 code. Technical Report. PeerJ PrePrints.”, and “Guillaume Lample et al. 2018. Phrase-Based & Neural Unsupervised Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5039–5049.”). Approaches of Neural Machine Translation (NMT) are limited by the availability of parallel corpora (e.g., refer “Wasi Uddin Ahmad et al. 2023. Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages. arXiv:2205.11116 [cs.CL]”, and “Xinyun Chen et al. 2018. Tree-to-tree neural networks for program translation. Advances in neural information processing systems 31 (2018).”).
Transcoder (e.g., refer “Baptiste Roziere et al. 2020. Unsupervised translation of programming languages. Advances in Neural Information Processing Systems 33 (2020), 20601–20611.”) apply unsupervised NMT for languages with a small parallel corpora. Lachaux et al. (e.g., “Marie-Anne Lachaux et al. 2021. DOBF: A deobfuscation pre-training objective for programming languages. Advances in Neural Information Processing Systems 34 (2021), 14967–14979.”) proposed a better pre-training mechanism specific to programming languages. Roziere et al. (e.g., refer “Baptiste Roziere et al. 2021. Leveraging automated unit tests for unsupervised code translation. arXiv preprint arXiv:2110.06773 (2021).”) use unit test cases to improve code translation performance. Recently, LLMs pre-trained with code such as Codex (e.g., refer “Mark Chen et al. 2021. Evaluating Large Language Models Trained on Code. arXiv e-prints (2021), arXiv–2107.”), and PaLM (e.g., refer “Aakanksha Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]”) have been used for code-translation. Athiwaratkun et al. (e.g., refer “Ben Athiwaratkun et al. 2022. Multi-lingual Evaluation of Code Generation Models. In The Eleventh International Conference on Learning Representations.”) (MBXP) train their own LLMs and test their zero-shot and few-shot, monolingual and multilingual translation capabilities. Pan et al. compare recent LLMs and code-LLMs for high-resource languages: starcoder (e.g., refer “Raymond Li et al. 2023. StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]”), llama2 (e.g., refer “Hugo Touvron et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]”), codegen (e.g., refer “Erik Nijkamp et al. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv:2203.13474 [cs.LG]”), codegeex (e.g., refer “Qinkai Zheng et al. 2023. 
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. arXiv:2303.17568 [cs.LG]”), and GPT-4 (e.g., refer “OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]”) to show that GPT-4 performs the best. The system and method emulated their taxonomy for error and bug analysis. Though there exist some rule-based works on COBOL to Java translation (e.g., refer “Harry M Sneed and Katalin Erdoes. 2013. Migrating AS400-COBOL to Java: a report from the field. In 2013 17th European Conference on Software Maintenance and Reengineering. IEEE, 231–240.”, “Aditya Trivedi and Ugrasen Suman. 2013. Design of a Reverse Engineering Model (A Case Study of COBOL to Java Migration). International Journal of Computer Applications 79, 5 (2013).”) most of the latest neural and LLM-based approaches are not tested for translation of low-resource legacy to modern languages, which is challenging as mentioned above.
Code readability is defined as the ease of understanding and maintaining it (e.g., refer “Sergey Troshin and Nadezhda Chirkova. 2022. Probing Pretrained Models of Source Codes. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 371–383.”), they evaluate readability based on meaningfulness of variable and function names, quality of comments, and the readability of algorithmic implementation. Initial works on readability extract features from code blocks and train a classifier with different types of features like structural (average number of parenthesis, identifier(s), etc.), entropy, and Halstread’s volume (e.g., refer “Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software engineering 36, 4 (2009), 546–558.”, and “Daryl Posnett et al. 2011. A simpler model of software readability. In Proceedings of the 8th working conference on mining software repositories. 73–82.”). Dorn (e.g., refer “Jonathan Dorn. 2012. A general software readability model. (2012).”) defines a more extensive set of features based on four aspects: visual, spatial, alignment and linguistic. Scalabrino et al. (e.g., refer “Simone Scalabrino et al. 2018. A comprehensive model for code readability. Journal of Software: Evolution and Process 30, 6 (2018), e1958.”) introduce additional textual features based on lexicon analysis. The above defined features might not be sufficient to estimate the ease of understanding. In addition, it needs high-level abstractions (e.g., refer “Simone Scalabrino et al. 2019. Automatically assessing code understandability. IEEE Transactions on Software Engineering 47, 3 (2019), 595–613.”). In the method of the present disclosure, the system 100 includes abstraction and modularization aspects, which are very specific to and essential for translation from legacy to modern languages. Madaan et al. (e.g., refer “Aman Madaan et al. 2023. 
Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).”) propose a self-refinement framework with readability-specific feedback generated using the same LLM to improve the performance for code generation. As opposed to this, in the method of the present disclosure, the system explicitly specifies transpilation specific readability criteria for more comprehensive feedback and refinement.
As digital landscapes evolve, efficiently modernizing legacy codes becomes necessary. Manual efforts demand expertise in both legacy and modern languages. Hence, there is a need for automated legacy to modern language translation solutions for correct, efficient, and scalable navigation of legacy systems. In the present disclosure, systems and methods focus on translating COBOL code to Java. There has been extensive work done on code translation with LLMs for high-resource languages such as Java, Python, etc. Low resource languages such as COBOL do not get much exposure during pretraining, as it is subsequently observed that direct COBOL to Java translation (no additional guidance) results, in terms of execution accuracy (????) by state-of-the-art code LLMs such as WizardCoder (e.g., refer “Ziyang Luo et al. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]”), are quite poor (19.57% and 8.70%, respectively), yielding logically incorrect Java codes as showcased in Table 1. In addition to the COBOL’s low-resource setting, the lower performance is due to differences in paradigms and syntax. COBOL’s verbose English-like syntax, contrasting with Java’s concise C++-like syntax, presents a challenge in mapping COBOL statements to abstract high-level Java constructs. Adapting COBOL’s procedural and imperative paradigm to Java’s object-oriented one requires intricate restructuring for more modular and reusable Java code. A COBOL code snippet can be mapped to a Java library or function, requiring refactoring. As illustrated in Table 1, the direct LLM-based translated Java code has a flatter (non-modularized) structure with no function calls (readinput, calculateAnswer, calculateSum) and refactored (abstracted) code (input.split()).
Although there exist syntax-directed tools for this task, these are rule-based translations. A program is represented as an AST (Abstract Syntax Tree) and translated to target language using handwritten rules (e.g., refer “Aniketh Malyala et al. 2023. On ML-Based Program Translation: Perils and Promises. arXiv preprint arXiv:2302.10812 (2023).”). One such tool for COBOL to Java translation is opencobol2java4. Unlike LLMs, though, these tools produce logically correct Java programs (no compilation or execution errors), they tend to produce statement-wise translations as illustrated in Table 1 (complete code is on Github page). This does not produce optimally modularized Java code with semantically meaningful code segments, which are required for maintainability and comprehension. Moreover, such tools require tedious maintenance of translation rules and translators for each of the several dialects of COBOL leading to project delays. To address these issues, for COBOL to Java translation, systems and methods of the present disclosure focus on both aspects logical correctness and human readability of the target Java code. The present disclosure implements an LLM-based method that leverages the following: (i) temperature sampling and code refinement, which have proven to be beneficial for other code related tasks, viz. Code Generation, Repair, Optimization, Software Testing, etc., (ii) the high resource nature of the target language, Java, for refinement and (iii) availability of Input and Output (I/O) of Test Cases.
Through empirical analysis with two best-performing code LLMs (WizardCoder), the method’s best-performing workflow achieves 86.34% execution accuracy and 0.646 readability score on CodeNet, validating the following claims: (i) Temperature sampling aids COBOL to Java translation to get better Java code in terms of logical correctness and readability. (ii) Aspect (logical correctness and readability) specific refinements of translated Java code yield better code than trying to ensure logical correctness and readability as a part of COBOL to Java translation task. (iii) Existing LLM-based translation approach by Pan et al. (e.g., refer “Rangeet Pan et al. 2023. Understanding the Effectiveness of Large Language Models in Code Translation. arXiv e-prints (2023), arXiv–2308.”), targeting only high-resource programming languages, refines the target code in the context of the source code. As opposed to this, Java being high-resource, the system and method of the present disclosure claim that refinement of translated Java code without having low-resource COBOL code in the context (refinement of translated Java code) yields better performance, as compared to refinement of target Java code with the source COBOL code in the context (refinement of translation task), where low-resource COBOL acts as a distractor for refinement. (iv) Aspect (logical correctness and readability) specific sequential refinements lead to better Java code than combined refinement for both aspects.
Automated translation of legacy code to modern programming languages is the need of the hour for modernizing enterprise systems. This work specifically addresses automated COBOL to Java translation. Traditional rule-based tools for this perform statement-wise translation, overlooking possible modularization and refactoring of the source COBOL code to translate to human-readable target Java code. Present disclosure investigation reveals that state-of-the-art Large Language Models (LLMs) in the domain of code encounter difficulties with regard to logical correctness and readability when directly translating low-resource COBOL code to Java. To address these challenges, present disclosure provides an LLM-based workflow/method, leveraging temperature sampling and refinement-based strategies, to not only ensure logical correctness of the translation but also maximize the readability of the target Java code. The present disclosure exploits the fact that, due to their extensive exposure to human-written Java codes during pre-training, the LLMs are more equipped with profound comprehension and capability for refining translated Java codes than COBOL to Java translation. With a dataset sourced from CodeNet, the system and method of the present disclosure perform sequential refinement of the translated high-resource Java code with execution-guided logic feedback followed by LLM-based readability feedback. The present disclosure demonstrates that this approach yields better performance in terms of logical correctness (81.99% execution accuracy) and readability (0.610 score), than LLM based translation with test cases and readability guidance (60.25% and 0.539) or refinement of the translation task itself (77.95% and 0.572).
Referring now to the drawings, and more particularly to FIGS. 1 through 4C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 depicts an exemplary system 100 for transpiling source code to functionally correct and readable in target language using large language models, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices (e.g., smartphones, tablet phones, mobile communication devices, and the like), workstations, mainframe computers, servers, a network cloud, and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic-random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises information pertaining to input source code (e.g., COBOL - common business-oriented language – also referred to as first source code), a corresponding second source code (also referred to as target code,), one or more sample test cases, functional correctness and readability refinement techniques, and the like The database 108 further comprises one or more large language model (LLM)-based task-specific readability criteria, error types associated with the second source code and the like. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
FIG. 2, with reference to FIG. 1, depicts an exemplary high level block diagram of the system 100 for transpiling source code to functionally correct and readable in target language using large language models, in accordance with an embodiment of the present disclosure.
FIG. 2 illustrates the method described herein using one or more modules in the order: Translation/Transpilation Module (TM), a first Logic Refinement Module (LRM1), a Readability Refinement Module (RRM), and a second Logic Refinement Module (LRM2), where LRMi is the i-th LRM instance. Thus, it performs transpilation followed by sequential aspect-specific refinements. For the method, the system 100 uses the best-performing TM configuration. The system 100 feeds a COBOL code to TM and generates K samples of Java code (JavaT) using temperature sampling. Out of these K samples, the erroneous ones are passed to LRM1 to rectify the detected errors discussed above. The logically refined K samples (JavaL) are again compiled, executed, and ranked based on their logical correctness. The logically correct ones get the highest rank, followed by the ones with FE, followed by RE or NTE, followed by CE. The readability scores (RS) are used as tiebreakers. For example, consider that there are N<=K samples with no errors (logically correct); if N > 1, the sample from N with the greatest RS gets the highest rank. The Java code with the highest rank (JavaL*) is forwarded to RRM. As logical correctness gets priority over readability, the Java code generated as an output of RRM (JavaLR) is again checked for logical correctness. If it is logically incorrect and JavaL* is logically correct, the system 100 sticks to JavaL* as the final output. Otherwise, if both JavaL* and JavaLR are logically incorrect, the system 100 feeds JavaLR to LRM2 to ensure logical correctness of the resulting Java code (JavaLRL).
FIG. 3, with reference to FIGS. 1-2, depicts an exemplary flow chart illustrating a method for transpiling source code to functionally correct and readable in target language using large language models, using the systems of FIGS. 1-2, in accordance with an embodiment of the present disclosure. In an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, the block diagram of the system 100 depicted in FIG. 2, and the flow diagram as depicted in FIG. 3. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
At step 202 of the method of the present disclosure, the one or more hardware processors 104 receive a first source code in a first language. For instance, at step 202 a COBOL source code is received by the system 100. The expressions ‘first source code’ and ‘COBOL source code’/‘COBOL code’ may be interchangeably used herein. The first source code is sourced from the CodeNet dataset (e.g., refer “Ruchir Puri et al. 2021. CodeNet: A Large-Scale AI for Code Dataset for Learning a diversity of coding tasks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).”), which is derived from the code submission platforms AIZU and AtCoder. It has 4053 problem statements with multiple submissions in 55 programming languages. This is the only public dataset having COBOL samples with test cases. The system 100 and the method of the present disclosure investigate the closed transformer-based pre-trained large language model's exposure to CodeNet as a part of pre-training data. It is found that although the closed transformer-based pre-trained large language model can generate some statistics for CodeNet, it cannot complete partially provided problem statements or COBOL/Java codes in the dataset. This suggests that it might have seen some web description of CodeNet, but not the actual data in terms of problem statements and codes. The system 100 and the method of the present disclosure select the problem statements that have at least one accepted COBOL submission and randomly select a submission for each problem, along with provided I/O test cases, resulting in 322 COBOL code samples. This set is consistent across all settings in Table 2. The system and method of the present disclosure do not treat the Java submissions for that problem as ground truth, as the metric is independent of the same. As the system 100 and method of the present disclosure work in a zero-shot setting, all the COBOL codes are used as test samples.
The number of COBOL code lines (ranging from 11 to 358) serves as a criterion of code difficulty. The distribution of codes over difficulty is as follows: (i) Very Easy: 73, (ii) Easy: 58, (iii) Medium: 69, (iv) Hard: 57, (v) Extra Hard: 65. Below illustrates an exemplary first source code (e.g., COBOL source code) serving as an input to the system 100 at step 202:
First source code (COBOL source code)
IDENTIFICATION DIVISION.
PROGRAM-ID. test1.
ENVIRONMENT DIVISION.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 INP PIC X(100).
01 A PIC 9999999999.
01 B PIC 9999999999.
01 C PIC X(100).
01 REN PIC 9999999.
01 RE PIC ZZZZZZZZ9.
01 temp1 PIC 9999.
01 temp2 PIC 9999.
01 N PIC 9999.
PROCEDURE DIVISION.
MAIN.
ACCEPT INP.
PERFORM VARYING temp1 FROM 1 BY 1 UNTIL INP(temp1:1)=SPACE
END-PERFORM.
SUBTRACT 1 FROM temp1.
MOVE INP(1:temp1) TO A.
ADD 2 TO temp1.
PERFORM VARYING temp2 FROM temp1 BY 1 UNTIL INP(temp2:1)=SPACE
END-PERFORM.
COMPUTE N=temp2 - temp1.
MOVE INP(temp1:N) TO B.
IF A >= B THEN
SUBTRACT B FROM A
ELSE
MOVE 0 TO A
END-IF.
MOVE A TO RE.
DISPLAY RE.
STOP RUN.
Referring to steps of FIG. 3, at step 204 of the method of the present disclosure, the one or more hardware processors 104 transpile, by using a temperature sampling, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases. In one embodiment, each second source code in (or amongst) the set of second source codes is translated to a second language (e.g., Java language). It is to be understood by a person having ordinary skill in the art or person skilled in the art that the above examples of the first source code and the second source code shall not be construed as limiting the scope of the present disclosure. The expressions ‘second source code’ and ‘Java source code’ (or also referred to as Java code, target code, destination code, target source code, destination source code, and the like) may be interchangeably used herein. Depending on the requirement and application of the system 100 and method, the first source code can be a low-resource procedural source code (e.g., COBOL, FORTRAN, Report Program Generator (RPG), and the like) that requires transpiling to another language such as a high-resource object-oriented programming language source code (e.g., Java, Python, C#, C++, and the like). In other words, the first source code is a low-resource procedural source code, and each second source code in the set of second source codes is a high-resource object-oriented programming language code.
The temperature sampling parameter (e.g., a hyperparameter) is often tuned for large language models (LLMs). Lowering the temperature sampling parameter results in a sharper probability distribution, leading to more conservative and predictable samples, where high-probability tokens are more likely to be chosen. Conversely, increasing the temperature sampling parameter leads to a flatter probability distribution, encouraging the LLMs to produce more diverse and creative outputs by giving lower-probability tokens a higher chance of being sampled. In the present disclosure, temperature sampling refers to the process of generating K outputs with the temperature set to a relatively high value (0.7 here). This increases the probability of generating functionally correct Java code after functional refinement using execution-guided feedback. It is to be understood by a person having ordinary skill in the art or person skilled in the art that the above temperature setting of value 0.7 shall not be construed as limiting the scope of the present disclosure and may vary depending on the implementation and/or application. The system 100 feeds a COBOL code to the transpiling module (TM), which generates K samples of Java code (JavaT) using temperature sampling. The above step 204 is better understood by way of the following description:
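The effect of the temperature parameter on the token distribution may be illustrated by a minimal, non-limiting sketch of temperature-scaled softmax sampling (the class and method names below are hypothetical and do not form part of any LLM implementation of the present disclosure):

```java
import java.util.Random;

public class TemperatureSampling {
    // Convert raw token logits into a probability distribution after dividing
    // by the temperature: a higher temperature flattens the distribution,
    // a lower temperature sharpens it toward the top token.
    static double[] softmax(double[] logits, double temperature) {
        double[] probs = new double[logits.length];
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l / temperature);
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] / temperature - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    // Draw one token index from the temperature-scaled distribution.
    static int sample(double[] logits, double temperature, Random rng) {
        double[] probs = softmax(logits, temperature);
        double r = rng.nextDouble(), cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (r < cumulative) return i;
        }
        return probs.length - 1;
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, 0.1};
        // At T = 0.7 the top token dominates less than at T = 0.1,
        // which is why repeated generation at T = 0.7 yields K diverse samples.
        System.out.printf("T=0.7: %.3f%n", softmax(logits, 0.7)[0]);
        System.out.printf("T=0.1: %.3f%n", softmax(logits, 0.1)[0]);
    }
}
```

Generating K samples then amounts to invoking the (hypothetical) sampler K times at temperature 0.7.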
There are four components which create distinct configurations of prompts for TM, viz. (i) BASE, (ii) I/O, (iii) INS, and (iv) INS_def. The primary objective of BASE is to evaluate the LLM’s inherent ability to perform direct translation. For this, the system 100 and the method of the present disclosure keep the prompt minimal, devoid of additional aids. With I/O, the system 100 enhances TM by incorporating sample I/O pairs of test cases in the prompt, which can aid TM, resulting in Java code that may accurately replicate the I/O behavior of the original COBOL code (giving expected output for given input as well as adhering to I/O format). INS provides basic instructions to produce logically correct and readable Java code, whereas INS_def consists of carefully crafted guidelines to preserve the logical correctness and readability of the translated code. Since readability can often be ambiguous, certain definitions related to readability criteria applicable for code translation are included in this component. These criteria are: Overall Readability, Abstraction, Modularization, Dead (Unnecessary) code, Logical segmentation, Complexity, Comments, Naming convention, Testability and Alignment. Some of these criteria, such as abstraction, modularization, logical segmentation, and unnecessary code, are specifically designed to address readability concerns of COBOL-to-Java translation. Attaching distinct combinations of the above-explained components to BASE leads to the following six prompt configurations for TM: (i) TM_BASE (e.g., refer to Github page prompt 1 – not shown in FIGS.), (ii) TM_(BASE+I/O) (refer to Github page prompt 2 – not shown in FIGS.), (iii) TM_(BASE+INS) (refer to Github page prompt 3 – not shown in FIGS.), (iv) TM_(BASE+INS_def) (refer to Github page prompt 4 – not shown in FIGS.), (v) TM_(BASE+I/O+INS) (refer to Github page prompt 5 – not shown in FIGS.), and (vi) TM_(BASE+I/O+INS_def).
The input to TM is COBOL code, and the output is the translated Java code with a selected prompt configuration. As an example, the system 100 illustrates the prompt for the TM_(BASE+I/O+INS_def) configuration in Prompt 1. Prompts for the rest of the configurations are illustrated on the Github Page. Below illustrates the at least one selected prompt configuration of the system 100.
Prompt 1: TM_(BASE+I/O+INS_def) Transpiling prompt:
Transpile the following COBOL code to Java to satisfy the following sample test cases without hardcoding inputs while taking the following guidelines into consideration. {Output_format}
SAMPLE TEST CASES: {test_cases}
GUIDELINES:
The logic of the code should be preserved when transpiling from COBOL to Java.
The resulting Java code should be readable in nature.
{READABILITY_DEFINITIONS}
COBOL CODE: {cobol}
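The placeholder filling of the above transpiling prompt may be sketched, in a non-limiting manner, as follows (the builder method and parameter names are hypothetical; the outputFormat and readabilityDefinitions arguments are illustrative stand-ins for the actual {Output_format} and {READABILITY_DEFINITIONS} text):

```java
import java.util.List;

public class TranspilePrompt {
    // Assemble the TM_(BASE+I/O+INS_def) prompt of Prompt 1 from its parts.
    static String build(String cobol, List<String> testCases,
                        String outputFormat, String readabilityDefinitions) {
        StringBuilder sb = new StringBuilder();
        sb.append("Transpile the following COBOL code to Java to satisfy the ")
          .append("following sample test cases without hardcoding inputs while ")
          .append("taking the following guidelines into consideration. ")
          .append(outputFormat).append("\n");
        sb.append("SAMPLE TEST CASES: ").append(String.join("\n", testCases)).append("\n");
        sb.append("GUIDELINES:\n");
        sb.append("The logic of the code should be preserved when transpiling from COBOL to Java.\n");
        sb.append("The resulting Java code should be readable in nature.\n");
        sb.append(readabilityDefinitions).append("\n");
        sb.append("COBOL CODE: ").append(cobol);
        return sb.toString();
    }

    public static void main(String[] args) {
        String prompt = build("DISPLAY RE.",
                List.of("Sample Input: 8 13 / Sample Output: 2"),
                "{Output_format}", "{READABILITY_DEFINITIONS}");
        System.out.println(prompt);
    }
}
```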
Below illustrate exemplary sample test cases:
Example 1
Problem Statement
Say, in a village (XYZ), there are 999 towers that are 1, (1+2), (1+2+3), ..., (1+2+3+...+999) meters high from west to east, at intervals of 1 meter. It had been snowing for a while before it finally stopped. For some two adjacent towers located 1 meter apart, the lengths of the parts of those towers that are not covered with snow were measured, and the results are a) meters for the west tower, and b) meters for the east tower. Assuming that the depth of snow cover and the altitude are the same everywhere in the village, find the amount of the snow cover. Assume also that the depth of the snow cover is always at least 1 meter.
Constraints
1 <= a < b < 499500 (= 1 + 2 + 3 + … + 999). All values in input are integers. There is no input that contradicts the assumption.
Input
Input is given from Standard Input in the following format: a b
Output
If the depth of the snow cover is x meters, print x as an integer.
Sample Input
8 13
Sample Output
2
Example 2
Problem Statement
A sandglass runs for X seconds. The sand drops from the upper bulb at a rate of 1 gram per second. That is, the upper bulb initially contains X grams of sand. How many grams of sand will the upper bulb contain after t seconds?
Constraints
1 <= X <= 10^9; 1 <= t <= 10^9. X and t are integers.
Input
The input is given from Standard Input in the following format: X t
Output
Print the amount of sand in the upper bulb after t seconds.
Sample Input
100 17
Sample Output
83
Example 3
Problem Statement
The window of Takahashi’s room has a width of A. There are two curtains hung over the window, each of which has a horizontal length of B. (Vertically, the curtains are long enough to cover the whole window.) We will close the window so as to minimize the total horizontal length of the uncovered part of the window. Find the total horizontal length of the uncovered parts of the window then.
Constraints
1 < A < 100 ; 1 < B < 100; A and B are integers.
Input
Input is given from Standard Input in the following format: A B
Output
Print the total horizontal length of the uncovered parts of the window.
Sample Input
12 4
Sample Output
4
Below illustrates one sample second source code amongst the set of second source codes generated by the system 100:
import java.util.Scanner;
public class Test1 {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String inp = scanner.nextLine();
int temp1 = 0;
int temp2 = 0;
for (temp1 = 0; inp.charAt(temp1) != ' '; temp1++);
temp1--;
String a = inp.substring(0, temp1);
temp1 += 2;
for (temp2 = temp1; inp.charAt(temp2) != ' '; temp2++);
int n = temp2 - temp1;
String b = inp.substring(temp1, temp1 + n);
int A = Integer.parseInt(a);
int B = Integer.parseInt(b);
if (A >= B) {
A -= B;
} else {
A = 0;
}
String re = String.format("%07d", A);
System.out.println(re);
}
}
Referring to steps of FIG. 3, at step 206 of the method of the present disclosure, the one or more hardware processors 104 perform a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally and readability refined source code. More specifically, the first level of functional correctness and readability refinement includes determining a current functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes. This is performed by a logic refinement module (LRM) comprised in the memory 102 and invoked for execution. The LRM aims to ensure logical correctness of the resultant translated Java code. On similar lines to Huang et al. (e.g., refer “Jie Huang et al. 2023. Large Language Models Cannot Self-Correct Reasoning Yet. (2023). arXiv:2310.01798 [cs.CL]”), the system 100 and the method do not rely on the intrinsic capabilities of LLMs for the refinement, but instead use external feedback for the same. The system 100 compiles and executes the Java code generated by the prior module (TM). If the resultant Java code compiles and generates the expected output for all the test cases, the LRM process is bypassed for that sample. The step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) no error found in each second source code, or (ii) one or more errors found in each second source code. In case of failure, the system 100 gathers the error messages (from compiler or runtime system) and categorizes one or more errors into four types: (i) Compilation Errors (CE), (ii) Runtime Errors (RE), (iii) Functional Errors (FE) (code executes but does not provide expected output) and (iv) Non-Termination Errors (NTE), to generate logic feedback.
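The categorization of execution outcomes into the four error types may be sketched, by way of a non-limiting illustration, as follows (the boolean flags are hypothetical stand-ins for signals gathered from the compiler and runtime system):

```java
public class LogicFeedback {
    enum ErrorType { NONE, CE, RE, FE, NTE }

    // Classify one transpiled sample from its observed execution outcome.
    // NONE means the sample compiled and matched all expected outputs,
    // in which case the LRM is bypassed for that sample.
    static ErrorType classify(boolean compiled, boolean timedOut,
                              boolean threwException, boolean outputMatched) {
        if (!compiled) return ErrorType.CE;       // Compilation Error
        if (timedOut) return ErrorType.NTE;       // Non-Termination Error
        if (threwException) return ErrorType.RE;  // Runtime Error
        if (!outputMatched) return ErrorType.FE;  // Functional Error
        return ErrorType.NONE;                    // logically correct
    }

    public static void main(String[] args) {
        // Code executes but gives wrong output: a Functional Error.
        System.out.println(classify(true, false, false, false));
        // Code does not compile: a Compilation Error.
        System.out.println(classify(false, false, false, false));
    }
}
```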
The system 100 illustrates the above-mentioned errors by way of examples:
Compilation Error:
import java.io.*;
import java.util.*;
public class ABC_098_B {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int N = sc.nextInt();
String S = sc.next();
List x1List = new ArrayList<>();
List y1List = new ArrayList<>();
for (int i = 0; i < N; i++) {
char chr = S.charAt(i);
x1List.add(chr);
List str = new ArrayList<>();
for (int j = i + 1; j < N; j++) {
str.add(S.charAt(j));
}
y1List.addAll(str);
int cnt = 0;
int cntmax = 0;
for (char x : x1List) {
for (char y : y1List) {
if (x == y) {
cnt++;
}
}
cntmax = Math.max(cntmax, cnt);
cnt = 0;
}
}
int ans = cntmax;
System.out.println(ans);
}
}
The code has the following COMPILATION error:
ts_base_io_ins_new2Sample.java:41: error: cannot find symbol
int ans = cntmax;
^
symbol: variable cntmax
location: class ts_base_io_ins_new2Sample
1 error
Runtime Error:
import java.util.Scanner;
public class Main {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
int N = scanner.nextInt();
scanner.nextLine();
String INP = scanner.nextLine();
int[] A = new int[N+1];
int[] SA = new int[N+1];
int P1 = 1;
int P2;
for (int I = 1; I <= N; I++) {
P2 = P1;
while (INP.charAt(P2-1) != ' ') {
P2++;
}
A[I] = Integer.parseInt(INP.substring(P1-1, P2-1));
if (I > A[I]) {
SA[I - A[I]]++;
}
P1 = P2 + 1;
}
long CNT = 0;
for (int I = 1; I <= N - 2; I++) {
if (I + A[I] <= N) {
CNT += SA[I + A[I]];
}
}
System.out.println(CNT);
}
}
Sample Input 1
6
2 3 3 1 3 1
Sample Output 1
3
The code results in a RUNTIME error for the following sample test cases:
Sample test case 1 -
Input -
6
2 3 3 1 3 1
Error Message -
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 11
at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
at java.base/java.lang.String.charAt(String.java:693)
at ts_base_io_ins_new4Sample.main(ts_base_io_ins_new4Sample.java:19)
Non-Termination Error:
import java.util.Scanner;
public class ABC_106_C {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String[] S = new String[100];
int N;
long K;
for (int i = 0; i < S.length; i++) {
S[i] = Character.toString((char) scanner.nextInt());
}
N = scanner.nextInt();
K = scanner.nextLong();
for (int i = 0; i < K; i++) {
if (S[i].length() < N) {
break;
}
if (!S[i].equals("1")) {
System.out.println(S[i]);
return;
}
}
System.out.println("1");
}
}
Sample Input 1
1214
4
Sample Output 1
2
Functional Error:
Code -
import java.util.Scanner;
public class NomuraB {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String t = scanner.nextLine();
scanner.close();
String out = t.replace("?", "D");
System.out.println(out.trim());
}
}
Sample Input -
PD?D??P
Sample Output -
PDPDPDP
The code is not providing expected outputs for the following sample test cases:
Sample test case 1 -
Input -
PD?D??P
Expected Output -
PDPDPDP
Code Execution Output -
PDDDDDP
The system 100 considers CE to be more severe than RE and NTE, which in turn are more severe than FE. If the logic feedback indicates any of the above errors, the system 100 proceeds to LRM using the feedback as the guideline along with the test cases as part of the refinement prompt (Prompt 2). Note that this prompt does not have the source COBOL code in the context. The input to LRM is a logically incorrect Java code with logic feedback, and the output is the Java code repaired by the LLM, currently in a non-iterative setting.
Once the set of refined second source codes is obtained, the system 100 ranks the set of refined second source codes to obtain a set of ranked source codes. Out of these K samples, the erroneous ones are passed to LRM1 to rectify the detected errors discussed above. The logically refined K samples (JavaL) are again compiled, executed, and ranked based on their logical correctness. In an embodiment, the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors. For instance, the logically correct ones get the highest rank, followed by the ones with FE, followed by RE or NTE, followed by CE. The readability scores (RS) are used as tiebreakers. For example, consider that there are N<=K samples with no errors (logically correct); if N>1, the sample from N with the greatest RS gets the highest rank.
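The ranking of the K refined samples by error severity, with readability scores as tiebreakers, may be sketched as follows (the Sample record and its fields are hypothetical stand-ins; higher readability is assumed to be better):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SampleRanking {
    // One refined Java sample with its observed error type and readability score.
    record Sample(String name, String errorType, double readability) {}

    // Lower severity ranks higher: logically correct < FE < RE/NTE < CE.
    static int severity(String errorType) {
        switch (errorType) {
            case "NONE": return 0;
            case "FE":   return 1;
            case "RE":
            case "NTE":  return 2;
            case "CE":   return 3;
            default: throw new IllegalArgumentException(errorType);
        }
    }

    // Rank the K samples: error severity first, readability as tiebreaker.
    static List<Sample> rank(List<Sample> samples) {
        List<Sample> ranked = new ArrayList<>(samples);
        ranked.sort(Comparator
            .comparingInt((Sample s) -> severity(s.errorType()))
            .thenComparing(Comparator.comparingDouble(Sample::readability).reversed()));
        return ranked;
    }

    public static void main(String[] args) {
        List<Sample> ranked = rank(List.of(
            new Sample("s1", "CE", 0.9),
            new Sample("s2", "NONE", 0.6),
            new Sample("s3", "NONE", 0.8),
            new Sample("s4", "FE", 0.7)));
        // s3 (correct, more readable) outranks s2; the CE sample ranks last.
        ranked.forEach(s -> System.out.println(s.name()));
    }
}
```

The top-ranked element of this ordering corresponds to JavaL*, which is forwarded to the RRM.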
The top ranked source code (JavaL*) is illustrated below by way of example:
import java.util.Scanner;
public class Test1 {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String inp = scanner.nextLine();
int temp1 = 0;
int temp2 = 0;
for (temp1 = 0; temp1 < inp.length(); temp1++) {
if (inp.charAt(temp1) == ' ') {
break;
}
}
temp1--;
String a = inp.substring(0, temp1);
temp1 += 2;
for (temp2 = temp1; temp2 < inp.length(); temp2++) {
if (inp.charAt(temp2) == ' ') {
break;
}
}
int n = temp2 - temp1;
String b = inp.substring(temp1, temp1 + n);
int A = Integer.parseInt(a);
int B = Integer.parseInt(b);
if (A >= B) {
A -= B;
} else {
A = 0;
}
String re = String.format("%07d", A);
System.out.println(re);
}
}
Referring to steps of FIG. 3, once the set of ranked source codes is obtained, the one or more hardware processors 104 apply one or more large language model (LLM)-based task-specific readability criteria on a first ranked source code amongst the set of ranked source codes to obtain one or more specific portions of the first ranked source code. The one or more specific portions indicate one or more problematic aspects of readability. In an embodiment, the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment. The system 100 and the method of the present disclosure illustrate exemplary definitions of the above-mentioned one or more large language model (LLM)-based task-specific readability criteria by way of examples:
Readability: How easy code is for the programmers to understand and maintain.
Abstraction: The code should contain optimal abstractions based on the target language. The code which can be represented in shorter form needs to be abstracted out.
Modularization: The code should be properly divided into functions to minimize repetitions.
Unnecessary code: Unnecessary code is code that is not required to implement the overall logic. Remove unnecessary code and variables to improve readability.
Logical segmentation: The code should be separated by blank lines into logical sub-units.
Complexity: The code should not be convoluted, and the logic should be implemented in a simpler way.
Comments: The code should have proper comments for different parts of the code explaining what is being done. It should be easily understandable by programmers.
Naming convention: Variable names and function names should be meaningful. The terms in the function comments should be similar to the terms used in the function body. Full-word identifier names are better for more readability.
Testability: Easy to test code is more readable.
Alignment: Proper code indentation is essential for readability.
It is to be understood by a person having ordinary skill in the art or person skilled in the art that the above definitions for the one or more large language model (LLM)-based task-specific readability criteria shall not be construed as limiting the scope of the present disclosure and may vary as per the implementation of the present disclosure. Since readability can often be ambiguous, certain definitions related to readability criteria applicable for code translation are included in this component. Some of these criteria, such as abstraction, modularization, logical segmentation, and unnecessary code, are specifically designed to address readability concerns of COBOL-to-Java translation, whereas others have been picked up from prior literature as indicated.
The system 100 then performs the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functionally corrected and refined source code. The above step of applying the one or more large language model (LLM)-based task-specific readability criteria and performing the readability refinement of the one or more specific portions are better understood by way of the following description: The system 100 employs a Readability Refinement Module (RRM) that is invoked from the memory 102 for execution. The RRM when executed generates LLM-based readability feedback/criteria for the transpiled Java code and rectifies the same in case of negative feedback. To provide feedback for the transpiled Java code, the system 100 and method of the present disclosure do not rely on the LLM’s inherent understanding of readability and use the aforementioned task-specific readability criteria. The system 100 includes the definitions of the criteria in the feedback prompt (e.g., also refer Github page as known in the literature) and ensures that the feedback is both detailed (mentions specific portions of the code along with the understanding of the corresponding problematic aspects of readability, if any) and comprehensive in nature. The LLM leverages the insights gained from the readability feedback, along with sample I/O, to enhance the readability of the generated Java code by making it more refactored, modularized, and aligned with Java coding standards (Prompt 3). Note that the source COBOL code is not in the context. Below illustrates an exemplary Prompt 3 used by the system 100:
Prompt 3: Readability Refinement Prompt
{READABILITY_DEFINITIONS}. Can you give suggestions to improve readability of the following Java code based on abstraction, modularization, unnecessary code, logical segmentation, complexity, comments, naming convention, testability, and alignment of the following code. Give suggestions for each feature separately first. Do not fix the code, just provide suggestions.
JAVA CODE: {Java}
SAMPLE TEST CASES: {test_cases}
GUIDELINES: {readability_feedback/criteria}
Now fix the code based on the above guidelines. {OUTPUT_FORMAT}
Below illustrates the functionally corrected and refined source code by way of example:
import java.util.Scanner;
public class Test1 {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String input = scanner.nextLine();
String[] parts = extractParts(input);
int a = parseInteger(parts[0]);
int b = parseInteger(parts[1]);
int result = performSubtraction(a, b);
String formattedResult = formatResult(result);
System.out.println(formattedResult);
}
// Function to extract the two parts from the input string
private static String[] extractParts(String input) {
String[] parts = input.split(" ");
return parts;
}
// Function to parse an integer from a string
private static int parseInteger(String str) {
return Integer.parseInt(str);
}
// Function to perform the subtraction
private static int performSubtraction(int a, int b) {
return (a >= b) ? (a - b) : 0;
}
// Function to format the result with leading zeros
private static String formatResult(int result) {
return String.format("%07d", result);
}
}
The one or more portions and their readability feedback in terms of response by the system 100 are provided below by way of examples:
Example 1: JAVA CODE:
import java.util.Scanner;
public class ABC_099_B {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
int ablen = 2;
int cur = 1;
String INP = scanner.nextLine();
int i = 0;
int j;
int len;
int[] ab11 = new int[2];
long div;
long s;
long t;
long ans;
for (int k = 0; k < ablen; k++) {
j = cur;
while (j < INP.length() && INP.charAt(j-1) != ' ') {
j++;
}
len = j - cur;
ab11[i] = Integer.parseInt(INP.substring(cur-1, cur-1+len));
cur = j + 1;
i++;
}
div = ab11[1] - ab11[0];
s = 0;
i = 1;
for (int k = 0; k < div; k++) {
s += i;
i++;
}
t = s - ab11[1];
ans = t;
System.out.println(ans);
}
}
RESPONSE (also referred to as readability feedback):
Abstraction:
- Consider abstracting out the logic for parsing the input string and calculating the result into separate methods.
- Identify repetitive patterns and consider encapsulating them into functions or classes.
Modularization:
- Break down the code into smaller, reusable functions with specific responsibilities.
- Consider separating input processing, calculation, and output into separate methods.
Unnecessary code:
- Remove unused variables like `len`.
- Consider removing the `ablen` variable if it's always going to be `2`.
Logical segmentation:
- Add blank lines between logically separate sections of code such as input processing, calculation, and output.
Complexity:
- Simplify the calculation logic, if possible, to make it easier to understand.
- Consider breaking down complex calculations into smaller, more manageable parts.
Comments:
- Add comments to describe the purpose of each section of code, especially where the logic might not be immediately clear.
- Consider adding comments to explain the significance of certain variables or calculations.
Naming convention:
- Use meaningful variable names that reflect their purpose.
- Consider renaming variables like `ablen`, `ab11`, `div`, `s`, `t`, and `ans` to more descriptive names.
Testability:
- Consider refactoring the code to make it more testable by breaking it into smaller, isolated functions.
- Separate input/output from the core logic to facilitate easier testing.
Alignment:
- Ensure consistent indentation throughout the code for better readability.
- Align variable declarations and assignments to improve code organization.
Example 2: JAVA CODE:
import java.util.Scanner;
public class AtCoder {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
String input = scanner.nextLine();
int temp1 = 0;
int temp2 = 0;
int N = 0;
for (temp1 = 0; temp1 < input.length() && input.charAt(temp1) != ' '; temp1++);
temp1--;
int A = Integer.parseInt(input.substring(0, temp1));
temp1 += 2;
for (temp2 = temp1; temp2 < input.length() && input.charAt(temp2) != ' '; temp2++);
N = temp2 - temp1;
int B = Integer.parseInt(input.substring(temp1, temp1 + N));
int C = A - B - B;
if (A > B + B) {
System.out.println(C);
} else {
System.out.println("0");
}
}
}
RESPONSE (also referred to as readability feedback):
Suggestions for improving readability of the provided Java code:
1. **Abstraction**:
- Extract the logic for parsing input and performing calculations into separate methods.
- Use meaningful method names that convey the purpose of each abstraction.
2. **Modularization**:
- Break down the main logic into smaller, more focused functions.
- Avoid having a single large `main` method.
3. **Unnecessary code**:
- Remove unused variables such as `N`.
- Simplify unnecessary calculations or iterations if possible.
4. **Logical segmentation**:
- Add blank lines between different logical segments of the code to improve readability.
- Group related statements together.
5. **Complexity**:
- Simplify the logic for readability.
- Avoid unnecessary complexity in calculations or conditions.
6. **Comments**:
- Add comments to explain the purpose of each section of the code, especially complex calculations or conditions.
- Use clear and concise comments to aid understanding.
7. **Naming convention**:
- Use meaningful variable names that reflect their purpose.
- Follow Java naming conventions (camelCase for variables, PascalCase for class names).
8. **Testability**:
- Ensure that the code is structured in a way that facilitates easy testing.
- Consider separating concerns to improve testability.
9. **Alignment**:
- Ensure consistent and proper indentation throughout the code.
- Use standard indentation conventions (usually 4 spaces in Java).
These suggestions aim to make the code easier to understand, maintain, and test, by improving its structure, clarity, and simplicity.
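The readability feedback above can be illustrated by refactoring the Example 2 code along those lines; the following is a sketch only, and the class and method names (`RefactoredExample2`, `parseInput`, `remainingAmount`) are hypothetical, not part of the disclosure:

```java
// Illustrative refactoring of the Example 2 code per the readability feedback;
// class and method names are hypothetical.
public class RefactoredExample2 {
    public static void main(String[] args) {
        // The original reads "A B" from System.in; a fixed line is used here.
        int[] ab = parseInput("5 2");
        System.out.println(remainingAmount(ab[0], ab[1])); // prints 1
    }

    // Abstraction + unnecessary code: the manual character scan and the
    // unused variable N are replaced by a simple split.
    static int[] parseInput(String line) {
        String[] parts = line.trim().split("\\s+");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // Same logic as the original: A - B - B when positive, otherwise 0.
    static int remainingAmount(int a, int b) {
        return (a > b + b) ? (a - b - b) : 0;
    }
}
```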
The system 100 or the RRM performs a second level of functional correctness checking and functional refinement of the obtained functionally corrected and refined code (JavaLR/JavaL*). Firstly, the obtained functionally corrected and refined code (JavaLR/JavaL*) is identified as the final source code (JavaF) based on an associated output of the obtained functionally corrected and refined code (JavaLR/JavaL*) and the sample test cases. Else, the functional correctness of the obtained functionally corrected and refined code (JavaLR/JavaL*) is determined to (a) identify the first ranked source code as the final source code (JavaF), or (b) perform the functional refinement of the obtained functionally corrected and refined code (JavaLR) based on the sample test cases to obtain the final source code (JavaF). In other words, the Java code with the highest rank (JavaL*) is forwarded to the RRM. As logical correctness gets priority over readability, the Java code generated as an output of the RRM (JavaLR) is again checked for logical correctness. If it is logically incorrect and JavaL* is logically correct, the system 100 sticks to JavaL* as the final output. Otherwise, if both JavaL* and JavaLR are logically incorrect, the system 100 feeds JavaLR to LRM2 to ensure logical correctness of the resulting Java code (JavaLRL) (refer FIG. 2).
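The output-selection policy described above (logical correctness prioritized over readability) can be sketched as follows; this is a minimal sketch assuming boolean flags that stand for "passes all sample test cases", and the class and method names are illustrative:

```java
// Sketch of the output-selection policy: a readability-refined candidate (JavaLR)
// is kept only if it does not regress on the sample test cases; otherwise the
// logically correct candidate (JavaL*) is retained, or JavaLR is sent to LRM2.
public class OutputSelector {
    static String selectFinal(boolean javaLStarCorrect, boolean javaLRCorrect) {
        if (javaLRCorrect) {
            return "JavaLR";   // readable and logically correct
        }
        if (javaLStarCorrect) {
            return "JavaL*";   // fall back to the logically correct candidate
        }
        return "JavaLRL";      // both fail: feed JavaLR to LRM2
    }

    public static void main(String[] args) {
        System.out.println(selectFinal(true, false)); // prints JavaL*
    }
}
```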
Below illustrates the final source code (JavaF) by way of example:
import java.util.Scanner;
public class Test1 {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        int a = scanner.nextInt();
        int b = scanner.nextInt();
        int result = performSubtraction(a, b);
        System.out.println(result);
    }

    private static int performSubtraction(int a, int b) {
        return (a >= b) ? (a - b) : 0;
    }
}
Below, Table 1 illustrates Java code generated by the syntax-directed approach (complete code on the Github page); LLM-based direct translation; and the method of the present disclosure: Incorrect vs. Correct Logic; Non-Modularized vs. Modularized vs. Over-modularized; Non-Abstracted vs. Abstracted code. Problem: There are 999 towers of heights 1, (1+2), ..., (1+2+...+999), at intervals of 1 meter; a meters for tower 1 and b meters for tower 999 are not covered with snow. Find the total snow cover depth, assuming at least 1-meter-deep snow between two adjacent towers.
Table 1
First source code (COBOL) | Target Java code with syntax-directed approach (prior art) | Direct LLM-based translation (prior art) | Transpiled source code (Java code) by method of the present disclosure
......
ACCEPT INP.
PERFORM ablen TIMES
PERFORM VARYING j FROM cur BY 1
UNTIL INP(j:1) = SPACE
END-PERFORM
COMPUTE len = j - cur
MOVE INP(cur:len) TO ab11(i)
COMPUTE cur = j + 1
ADD 1 TO i
END-PERFORM.
COMPUTE div = ab11(2) - ab11(1).
MOVE 0 TO s.
MOVE 1 TO i.
PERFORM div TIMES
ADD i TO s
ADD 1 TO i
END-PERFORM.
COMPUTE t = s - ab11(2).
MOVE t TO ans.
DISPLAY ans.
STOP RUN. .....
public class Abc_099_b extends Program {
public int `getAblen()` {
return super.getDisplayInt(0,2,false,false,false);
}
public String `getAblenAsString()` {
return super.toString(0,2);
}
public void `setAblen(int val)` {
super.setDisplayInt(0,2,val,false,false,false);
}
public void `setAblen(String val)` {
super.valueOf(0,2,val,true);
}
.....
public static void main(String[] args) {
__processCmdLineArgs(args);
Abc_099_b instance_ = new Abc_099_b();
instance_.execute(null);
System.exit(instance_.__getReturnCode());
}
public void execute(ProgramEnv __env) {
initialize(this);
doCobolGotoStart();
}
`Paragraph main=new Paragraph(this)` {
public CobolMethod run() {
.....
}};
public Abc_099_b() {
super(new CobolBytes(133));
}
public void `initialize(Program p)` {
.....
}
} import java.util.Scanner;
public class ABC_099_B {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
int ablen = 2;
||int cur = 1;||
``String INP = scanner.nextLine();``
||int i = 1;||
``...
for (int k = 0; k < ablen; k++) {
j = cur;``
@while (@||INP.charAt(j-1) != ' '||@) {
j++;
}
len = j - cur;
ab11[i] = Integer.parseInt(INP.substring
(cur-1, cur-1+len));@
``cur = j + 1;
i++;
}
div = ab11[1] - ab11[0];
...
for (int k = 0; k < div; k++) {
s += i;
i++;
}
t = s - ab11[1];
ans = t;``
System.out.println(ans);
}
} import java.util.Scanner;
public class ABC_099_B {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
int[] ab = |readInput|(scanner);
long ans = |calculateAnswer|(ab);
System.out.println(ans);
}
private static int[] |readInput|(Scanner scanner) {
int[] ab = new int[2];
String input = scanner.nextLine();
~String[] parts = ~@@input.split(" ")@@~;
ab[0] = Integer.parseInt(parts[0]);
ab[1] = Integer.parseInt(parts[1]);~
return ab;
}
private static long |calculateAnswer|(int[] ab) {
int a = ab[0];
int b = ab[1];
long div = b - a;
long s = |calculateSum|(div);
long t = s - b;
return t;
}
private static long |calculateSum|(long n) {
return (n * (n + 1)) / 2;
}
}
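The modularized translation above replaces the COBOL PERFORM loop that accumulates 1 + 2 + ... + div with the closed-form sum n(n+1)/2 in `calculateSum`. A minimal sketch checking this equivalence (the class name is illustrative):

```java
// Check that the closed-form sum used in the transpiled calculateSum matches
// the COBOL-style PERFORM ... TIMES accumulation loop.
public class SumEquivalenceCheck {
    // Loop version, mirroring the COBOL accumulation.
    static long sumByLoop(long n) {
        long s = 0;
        for (long i = 1; i <= n; i++) s += i;
        return s;
    }

    // Closed form used in the transpiled code.
    static long sumClosedForm(long n) {
        return (n * (n + 1)) / 2;
    }

    public static void main(String[] args) {
        for (long n = 0; n <= 1000; n++) {
            if (sumByLoop(n) != sumClosedForm(n)) {
                throw new AssertionError("mismatch at n = " + n);
            }
        }
        System.out.println("equivalent up to n=1000");
    }
}
```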
EXPERIMENTATION AND RESULTS
Dataset
As mentioned above, the system 100 sourced the dataset from CodeNet (e.g., refer “Ruchir Puri et al. 2021. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).”), which is derived from the code submission platforms AIZU and AtCoder. It has 4053 problem statements with multiple submissions in 55 programming languages. This is the only public dataset having COBOL samples with test cases. The system and method of the present disclosure investigated the closed transformer-based pre-trained large language model’s exposure to CodeNet as a part of its pre-training data. It is found that although the closed transformer-based pre-trained large language model can generate some statistics for CodeNet, it cannot complete partially provided problem statements or COBOL/Java codes in the dataset. This suggests that it might have seen some web description of CodeNet, but not the actual data in terms of problem statements and codes. The system 100 and the method of the present disclosure select the problem statements that have at least one accepted COBOL submission and randomly select a submission for each problem, along with the provided I/O test cases, resulting in 322 COBOL code samples. This set is consistent across all settings in Table 2. The system and method of the present disclosure do not treat the Java submissions for a problem as ground truth, as the metric is independent of the same. As the system 100 and method of the present disclosure work in a zero-shot setting, all the COBOL codes are used as test samples. The number of COBOL code lines (ranging from 11 to 358) serves as the criterion of code difficulty. The following has been the distribution of codes over difficulty: (i) Very Easy: 73 (ii) Easy: 58 (iii) Medium: 69 (iv) Hard: 57 (v) Extra Hard: 65.
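The difficulty binning by line count described above can be sketched as follows. Equal-frequency (quantile) binning into five levels is an assumption for illustration; the disclosure only states that the COBOL line count (11-358) serves as the difficulty criterion, and the class and method names are hypothetical:

```java
import java.util.Arrays;

// Sketch of splitting COBOL samples into five difficulty levels by line count,
// using equal-frequency (quantile) binning. The exact binning scheme is an
// assumption, not stated in the disclosure.
public class DifficultyBins {
    // Assign each sample, given its line count, to one of `bins` quantile bins.
    static int[] assignBins(int[] lineCounts, int bins) {
        int n = lineCounts.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort sample indices by ascending line count.
        Arrays.sort(order, (a, b) -> Integer.compare(lineCounts[a], lineCounts[b]));
        int[] bin = new int[n];
        for (int rank = 0; rank < n; rank++) {
            bin[order[rank]] = Math.min(bins - 1, rank * bins / n);
        }
        return bin;
    }

    public static void main(String[] args) {
        int[] lines = {11, 20, 45, 90, 150, 200, 250, 300, 330, 358};
        System.out.println(Arrays.toString(assignBins(lines, 5)));
        // prints [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
    }
}
```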
Metrics
As the system 100 and method of the present disclosure aim to produce logically correct and readable Java programs, the following metrics have been used for evaluation.
Execution accuracy (EA): It represents the percentage of translated Java code samples that execute correctly, i.e., without any errors, and yield the expected output for all the test cases provided in the dataset. The system 100 and method of the present disclosure used OpenJDK v11.0.20.1 to execute the Java codes.
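The EA metric can be computed as in the following sketch; the boolean inputs (whether a translated sample executes without error and passes all its test cases) are assumed to come from the execution harness, and the class name is illustrative:

```java
// Sketch of computing execution accuracy (EA): the percentage of translated
// samples that run without error and match the expected output on every test case.
public class ExecutionAccuracy {
    static double executionAccuracy(boolean[] passedAllTests) {
        int correct = 0;
        for (boolean passed : passedAllTests) {
            if (passed) correct++;
        }
        return 100.0 * correct / passedAllTests.length;
    }

    public static void main(String[] args) {
        boolean[] results = {true, true, false, true};
        System.out.println(executionAccuracy(results)); // prints 75.0
    }
}
```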
Readability Score (RS): The system 100 and method of the present disclosure used the approach in Scalabrino et al. (e.g., refer “Simone Scalabrino et al. 2018. A comprehensive model for code readability. Journal of Software: Evolution and Process 30, 6 (2018), e1958.”) for RS computation (Range 0-1). They trained a classifier using logistic regression on code features defined by Buse and Weimer (e.g., refer “Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software engineering 36, 4 (2009), 546–558.”), Dorn (e.g., refer “Jonathan Dorn. 2012. A general software readability model. (2012).”), and Posnett et al. (e.g., refer “Daryl Posnett et al. 2011. A simpler model of software readability. In Proceedings of the 8th working conference on mining software repositories. 73–82.”), including spatial and textual features (Section 4.2).
LLM-based readability score: As RS does not consider some of the criteria important for COBOL-to-Java translation, viz. abstraction, modularization, logical segmentation, and redundancy, the system 100 and method of the present disclosure use the closed transformer-based pre-trained large language model (prompt illustrated on the Github page) to evaluate the resultant Java code for these criteria.
Large Language Models (LLMs)
The system 100 and method of the present disclosure used two base LLMs pre-trained on natural language as well as code, viz. an open-source pre-trained language model (e.g., refer “Ziyang Luo et al. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]”) and a closed transformer-based large language model. Due to computational constraints, a 4-bit quantized version of WizardCoder with 15B parameters has been used by the system 100. For temperature sampling, the system 100 used temperature 0.7 for more diverse output; for the refinement modules, temperature 0.2 was used for more deterministic output. The system 100 used a MIG A100 GPU with 20 GiB memory for WizardCoder inference.
Baselines
All Prompts for the baselines are illustrated on the Github page (e.g., from literature).
Refinement of Translation: This baseline is designed following Pan et al. to demonstrate that the method of the present disclosure (refinement of translated Java code) results in better Java code than performing refinement of the translation task, as mentioned earlier. Same as the method of the present disclosure, for this baseline, the results have been computed with the best performing configuration TM_(BASE+I/O+INS_def) and K=5 temperature samples. The system 100 and method performed two variants of refinement modules, viz. one-step refinement for logical correctness and readability (TM_(BASE+I/O+INS_def) → (LRM+RRM)_trans) and aspect-specific sequential refinement (TM_(BASE+I/O+INS_def) → LRM1_trans → RRM_trans → LRM2_trans). Here, trans denotes refinement of the translation task with the COBOL code in the context.
One-Step Refinement of translated Java code: The method of the present disclosure (TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java → LRM2_java) claims to produce better Java code by using sequential aspect-specific refinements. To validate this, the system 100 and the method of the present disclosure used this baseline with only single-step refinement of translated Java code for both logical correctness and readability (TM_(BASE+I/O+INS_def) → (LRM+RRM)_java). Here, java denotes refinement of the resultant Java code with no COBOL code in the context.
Variations in Temperature Sampling: To demonstrate the effectiveness of temperature sampling, the system 100 executed the method of the present disclosure with distinct values of K (1, 5, 13). The present disclosure also studied a variation of the method of the present disclosure where the highest-ranking sample is selected to be forwarded to the LRM right after TM itself (K=5→1), as opposed to passing all 5 samples (output of TM) to LRM1 and then ranking and selecting the best sample.
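The ranking step over the K temperature samples can be sketched as follows; ranking by the number of sample test cases passed is an assumption consistent with the execution-guided feedback described earlier, and the class, method names, and candidate scores are illustrative:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of selecting the highest-ranked of K temperature samples, where the
// rank of a candidate translation is the number of sample test cases it passes.
public class SampleRanker {
    static class Candidate {
        final String code;
        final int testsPassed;
        Candidate(String code, int testsPassed) {
            this.code = code;
            this.testsPassed = testsPassed;
        }
    }

    // Return the candidate that passes the most sample test cases.
    static Candidate selectBest(Candidate[] candidates) {
        return Arrays.stream(candidates)
                .max(Comparator.comparingInt(c -> c.testsPassed))
                .get();
    }

    public static void main(String[] args) {
        Candidate[] k = {
                new Candidate("candidate-0", 2),
                new Candidate("candidate-1", 5),
                new Candidate("candidate-2", 3),
        };
        System.out.println(selectBest(k).code); // prints candidate-1
    }
}
```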
Direct translation of COBOL to logically correct and readable Java: The system 100 and method of the present disclosure treated one of their own ablations, TM_(BASE+I/O+INS_def), as a baseline to showcase that the method of the present disclosure outperforms direct COBOL to Java translation.
Results and Discussion
The system 100 and method of the present disclosure address key research questions (RQ) based on results in Table 2.
RQ1: Does Aspect-wise Sequential Refinement of translated Java code perform better than baselines? The present disclosure observed improvement in EA with the method of the present disclosure over the following baselines: (i) both the variants of Refinement of Translation, (ii) One-step refinement of translated Java code, and (iii) Direct translation of COBOL to logically correct and readable Java without refinement. For baselines (i) and (ii), the system 100 observed a lower EA for samples with higher code lengths. For some samples, with the addition of COBOL code in the context, the refinement prompt exceeded the token limit of the LLMs (closed transformer-based pre-trained large language model = 4096; WizardCoder = 2048). This demonstrates another advantage of the method of the present disclosure: it needs no contextual COBOL code for refinement. For the closed transformer-based pre-trained large language model, the system 100 observed substantial improvements in EA over these baselines (>5%), whereas for WizardCoder the system 100 observed marginal improvements (~2%). As opposed to EA, variations with one-step refinement baselines have a higher RS than the method of the present disclosure. This is because the workflow prioritizes logical correctness over readability.
RQ2: Do readability enhancements impact code logic? The system 100 observed that addressing readability with basic (TM_(BASE+I/O+INS)) and elaborate instructions (TM_(BASE+I/O+INS_def)) shows improvement in EA over BASE+I/O for WizardCoder but not for the closed transformer-based pre-trained large language model. Hence, the system 100 executes the method of the present disclosure with both best-performing configurations of TM: TM_(BASE+I/O+INS_def) for WizardCoder and TM_(BASE+I/O) for the closed transformer-based pre-trained large language model. Moreover, RRM improves not only RS but also EA for both LLMs (TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java over TM_(BASE+I/O+INS_def) → LRM1_java), showcasing the positive effect of readability enhancements on code logic.
RQ3: How does incorporation of Chain of Thought (CoT) Prompting impact performance? For TM, the system 100 includes CoT prompting (e.g., refer “Jia Li et al. 2023. Structured Chain-of-Thought Prompting for Code Generation. arXiv:2305.06599 [cs.SE]”, and “Jason Wei et al. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]”), which has been explored for code generation. The system 100 asks the LLMs to generate a high-level plan (CoT) of the COBOL code prior to generating the corresponding Java translation in the same prompt (prompts and an example CoT on the Github page). The system 100 also includes few-shot examples from a left-out set with manually crafted CoT. The present disclosure compared CoT-based scores with the best-performing configurations of TM to check the effect of inclusion of CoT prompting in these configurations (TM_(BASE+I/O+INS_def+CoT) and TM_(BASE+I/O+CoT)) with no temperature sampling (K=1). With the inclusion of CoT, an average decrease in EA is seen, whereas only a marginal increase in RS, for both the LLMs. This can be due to the few-shot samples causing the prompt to exceed the context limit. Given the higher significance placed on EA, the system 100 excluded CoT variations of TM from the method of the present disclosure.
RQ4: Does temperature sampling improve EA and RS? For both LLMs, the higher EA and RS of the method of the present disclosure as compared to the baseline with K=1 indicate the positive effect of temperature sampling. The same is indicated by an increase in EA and RS with an increase in temperature samples (K=13 from 5) for both the LLMs (Github page Figure 2). Temperature sampling helps WizardCoder more than the closed transformer-based pre-trained large language model. With the method of the present disclosure, for a sample, the system 100 performed a maximum of 13 inference calls to the LLMs (5 for temperature sampling for TM + 5 for LRM1 + 2 for RRM using LLM-based feedback + 1 for LRM2). The system 100 performed temperature sampling with K=13, i.e., the same number of inference calls as that of the method (indicated by * in Table 2). More specifically, Table 2 illustrates the results; Read Score (Range 0-1); W - WizardCoder and C - closed transformer-based pre-trained large language model; K - no. of temperature samples; Bold and Underlined: overall best for K = 5; Bold: best in the block; † - best sample selected after TM; * - same number (13) of total inference calls.
Table 2
Workflow/Method | K | Execution Accuracy % (W) | Execution Accuracy % (C) | Read Score (W) | Read Score (C)
Baseline TM_(BASE+I/O+INS_def) → (LRM+RRM)_trans 5 50.31 76.71 0.643 0.643
TM_(BASE+I/O+INS_def) → LRM1_trans → RRM_trans → LRM2_trans* 5 51.86 77.95 0.601 0.572
TM_(BASE+I/O+INS_def) → (LRM+RRM)_java 5 51.86 75.78 0.673 0.663
TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java → LRM2_java 1 28.88 70.50 0.569 0.582
5→1† 51.86 77.02 0.606 0.603
Transpilation module ablation TM_BASE 5 19.88 36.96 0.605 0.587
TM_(BASE+INS) 5 21.43 45.34 0.408 0.412
TM_(BASE+INS_def) 5 18.94 36.65 0.649 0.262
TM_(BASE+I/O) 1 22.05 47.83 0.303 0.459
5 41.93 66.77 0.43 0.519
TM_(BASE+I/O)* 13 64.29 72.67 0.585 0.624
TM_(BASE+I/O+INS) 5 45.03 63.66 0.6 0.536
TM_(BASE+I/O+INS_def) 1 25.47 44.72 0.525 0.471
TM_(BASE+I/O+INS_def)* 5 49.07 60.25 0.614 0.539
TM_(BASE+I/O+CoT) 13 59.32 68.94 0.624 0.577
TM_(BASE+I/O+INS_def+CoT) 1 9.63 48.45 0.525 0.467
Refinement modules ablation TM_(BASE+I/O) → LRM1_java 5 50.93 75.47 0.559 0.495
TM_(BASE+I/O) → LRM1_java → RRM_java 5 51.24 77.95 0.603 0.547
TM_(BASE+I/O) → LRM1_java → RRM_java → LRM2_java* 5 51.24 79.81 0.571 0.547
TM_(BASE+I/O+INS_def) → LRM1_java 5 52.8 76.4 0.602 0.569
TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java 5 53.42 78.88 0.645 0.609
TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java → LRM2_java (w/o I/O)* 5 51.24 79.5 0.652 0.594
Method of the present disclosure TM_(BASE+I/O+INS_def) → LRM1_java → RRM_java → LRM2_java* 5 53.42 81.99 0.617 0.61
For the closed transformer-based pre-trained large language model, the workflow with the TM_(BASE+I/O+INS_def) configuration yielded better EA and RS than temperature sampling with the same number of inference calls. However, for WizardCoder, the results are reversed.
RQ5: Does a combination of LLMs perform better? With the higher performance of temperature sampling for WizardCoder and of refinement for the closed transformer-based pre-trained large language model, the system 100 executed the workflow with WizardCoder for TM and the closed transformer-based pre-trained large language model for all three refinements. This led to cost savings in terms of a lesser number of API calls to the closed transformer-based pre-trained large language model. The system 100 found the value of K to be 25, where WizardCoder reaches stagnancy as far as EA is concerned. With this workflow, the system 100 yielded the best performance of EA = 86.34% and RS = 0.646.
RQ6: Does inclusion of I/O improve EA and RS? Inclusion of I/O of test cases for TM, LRM, and RRM consistently yielded better EA for both LLMs. However, it does not consistently result in an increase in RS.
Qualitative Analysis
Contribution of Modules for Logical Error Correction.
FIGS. 4A through 4C show the contribution of each module towards correction of the four types of logical errors for the closed transformer-based pre-trained large language model (a similar illustration for WizardCoder is on the Github page). More specifically, FIGS. 4A through 4C, with reference to FIGS. 1 through 3, depict the contribution of various components of the system 100, such as a transpilation module, a first logic refinement module, a readability refinement module, and a second logic refinement module, towards correction of logical errors, in accordance with an embodiment of the present disclosure. With the closed transformer-based pre-trained large language model, after TM with the best configuration, the system 100 observed a higher proportion of wrong-output and runtime errors than compilation and time-limit errors. For WizardCoder, the number of compilation errors (22.05%) after TM is much greater than that in the case of the closed transformer-based pre-trained large language model. LRM1 is highly effective for the closed transformer-based pre-trained large language model and corrects a significant chunk of erroneous codes (28.26%), even eliminating all compilation errors. The system 100 associates this fixing capability with the detailed error messages being passed as a part of the logic feedback, along with sample I/O pairs. For WizardCoder, some errors from all categories are fixed, but the increase in logically correct samples (5.60%) is significantly lesser than for the closed transformer-based pre-trained large language model. For the closed transformer-based pre-trained large language model, RRM fixes some instances of wrong-output and runtime errors while also converting some of them to compilation errors. For WizardCoder also, although a few errors are fixed, this module results in a net transformation of other errors to compilation errors. The attempt of the LLMs to make the code more readable can lead to an overall increase in code size owing to the addition of comments and modularization.
For some samples, this also leads to the context length limit being exceeded, which ultimately can result in compilation errors for the resultant Java programs. Thus, if JavaLR results in higher-severity errors than JavaL*, the system 100 retains JavaL* as the output. This prevents the propagation of errors from low to higher severity. LRM2 fixes some instances of wrong-output and runtime errors to boost the EA for the closed transformer-based pre-trained large language model further. However, it is not as effective for WizardCoder, resulting in further transformation of other errors (R - 1.86%, N - 1.24%, F - 7.14%) to compilation errors. At the output of each module, for both the LLMs, the system 100 observed that there is generally a lesser number of time-limit errors as compared to other error types. The closed transformer-based pre-trained large language model is better than WizardCoder at refinement, given that it fixes a significant amount and variety of errors through sequential refinements. The example demonstrated in Table 1 shows a sample where TM gives logically incorrect Java code. LRM1 further improves this code (Github page Table 2). Further, RRM completely fixes this code, which then provides the expected outputs for the given I/Os.
Bug Analysis: To analyze failure cases, the system 100 analyzed 25 samples, uniformly distributed across difficulty. The system 100 categorized bugs in failed Java codes as: (i) S - Syntactic and semantic differences between COBOL and Java, (ii) L - Dependency and logic (missing imports or incorrect logic replication), (iii) M - Model generation capability, and (iv) D - Data (incorrect data types, input parsing, or output formatting issues). As a single program can have multiple bugs, there is no direct one-to-one mapping between the types of bugs and the error types discussed above. Thus, each bug type can lead to different error types. Table 3A shows the distribution of the 25 samples over the bug types (including no bugs) analyzed at the output of each module for the closed transformer-based pre-trained large language model. Table 3B shows the effect of difficulty level: EA (%) and RS for the closed transformer-based pre-trained large language model. One sample may belong to more than one bug type. The closed transformer-based pre-trained large language model is able to fix D bugs after the refinements, but WizardCoder is not. One major cause of S bugs is that indices in COBOL start at 1 as opposed to 0 in Java, leading to buggy Java where indices are shifted by 1. The system 100 observed that LRM1 is effective in repairing such bugs for both LLMs. Major L bugs are missing import statements for libraries like Scanner. For both LLMs, the refinement modules fix considerable bugs of this type. The specific error location that is provided as part of the logic feedback aids in pinpointing and resolving bugs of types S and L. The system 100 observed M bugs for both LLMs for the cases where the response exceeds the context limit or does not adhere to the required format (code enclosed within the expected delimiter tags). This is observed more in the case of WizardCoder, where LRM2 causes a drastic increase in M bugs.
Table 3A
Bug TM LRM1 RRM LRM2
D 7 4 4 2
S 6 1 1 1
L 8 5 2 1
M 0 0 1 2
C 11 17 19 20
Table 3B
Bin EA RS
1 95.89 0.696
2 93.10 0.624
3 84.06 0.649
4 78.95 0.574
5 56.92 0.490
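The most common S bug noted above, COBOL's 1-based indexing versus Java's 0-based indexing, can be illustrated with a small sketch; the string-handling example and names are hypothetical:

```java
// Illustration of the index-shift (S) bug: COBOL's INP(cur:len) takes len
// characters starting at 1-based position cur, while Java's substring is 0-based.
public class IndexShiftExample {
    // A literal transcription that forgets the base difference: off by one.
    static String cobolStyleSubstringBuggy(String inp, int cur, int len) {
        return inp.substring(cur, cur + len); // starts one character too late
    }

    // Correct translation: shift the 1-based COBOL position down by one.
    static String cobolStyleSubstringFixed(String inp, int cur, int len) {
        return inp.substring(cur - 1, cur - 1 + len);
    }

    public static void main(String[] args) {
        String inp = "12 34";
        System.out.println(cobolStyleSubstringFixed(inp, 1, 2)); // prints 12
        System.out.println(cobolStyleSubstringBuggy(inp, 1, 2)); // prints "2 "
    }
}
```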
Code Difficulty. Table 3B shows that, for both LLMs, EA and RS decrease with increasing levels of difficulty.
Readability: In addition to the criteria captured in RS, the resultant Java codes of the method of the present disclosure are scored (scale: 0-5) for the above-discussed 25 samples on the missing criteria, viz. abstraction, modularization, presence of unnecessary code, and logical segmentation. The average abstraction scores indicate good performance in producing abstract code with both the closed transformer-based pre-trained large language model and WizardCoder (Table 4).
Table 4: Average scores and correlations for readability criteria; C_M, W_M, C_C, W_C - manual and closed transformer-based pre-trained large language model evaluation scores for closed transformer-based pre-trained large language model and WizardCoder translations. Co_C, Co_W - correlation between closed transformer-based pre-trained large language model scores and manual annotation for closed transformer-based pre-trained large language model and WizardCoder translations.
Table 4
C_C C_M W_C W_M Co_C Co_W
Abstraction 3.68 4.8 3.96 4.3 0.36 0.21
Modularization 3.04 4.64 3.18 4.4 0.39 0.43
Logical segmentation 3.32 4.96 3.36 4.76 0.21 0.47
Redundant code 4.06 4.2 4.32 4.12 0.45 0.5
With the closed transformer-based pre-trained large language model, for 24 out of 25 samples, the resultant Java codes used libraries like Scanner for taking input and Arrays for sorting (functions like Arrays.sort()) wherever necessary, or imported Math functions. However, for one sample, it fails to import Scanner, uses hard-coded input, and does not convert the array input and its processing (which has a longer, complex COBOL code) to abstract Java (example on Github page Table 8). For WizardCoder, the Java codes do not import Scanner for 14 out of 25 samples, directly using Scanner or hard-coded input. For 15 resultant Java codes, the closed transformer-based pre-trained large language model and WizardCoder create functions optimally, yielding high scores for modularization, but for the remaining samples, there is some scope for modularization. 12 and 17 samples for the closed transformer-based pre-trained large language model and WizardCoder, respectively, do not have any unnecessary code (considered after eliminating the extra text).
For both LLMs, the system 100 observed that the resultant Java codes are well indented and follow good logical segmentation; however, they do not follow meaningful naming conventions but instead reuse source code variable names.
The system 100 further used a closed transformer-based pre-trained large language model based evaluator to get scores for the above criteria (Github page prompt 16) for the same set of samples. The correlation of these scores with manual annotations (Table 4) for the abstraction criterion for Java codes generated with WizardCoder is lower than for the closed transformer-based pre-trained large language model. This is because, for a few samples, WizardCoder generates extra text appended to the code (Github page Table 7), confusing the closed transformer-based pre-trained large language model evaluator. The correlation for modularization is high for both the models. The lower correlation for logical segmentation is the result of the closed transformer-based pre-trained large language model evaluator scores not reflecting the good logical segmentation of resultant Java codes and suggesting room for improvement without mentioning explicit reasons. Similarly, the closed transformer-based pre-trained large language model evaluator tends to give a low score for unnecessary code even with an explanation mentioning that no redundant code exists. It also considers System.exit(0) as unnecessary code. Overall, the low correlation between manual annotation and closed transformer-based pre-trained large language model evaluator scores indicates scope for improvement in LLM-based evaluation.
Embodiments of the present disclosure provide an LLM-based system and method for automation of COBOL to Java translation. The method not only ensures the logical correctness of translated Java programs but also ensures readability. The present disclosure exploits Java's high-resource nature and performs refinement over translated Java code with compiler, execution, and LLM-based feedback for logic and readability, without having the low-resource COBOL code in the context. Using a dataset derived from CodeNet, the system 100 empirically demonstrated the positive effect of introducing I/O test pairs, temperature sampling, and sequential logic and readability refinements of the translated Java, consistently across two distinct LLMs, viz. the closed transformer-based pre-trained large language model and WizardCoder. The system 100 and method of the present disclosure further observed complementary benefits of these LLMs in terms of reducing errors and fixing bugs with temperature sampling and refinement, leading to a method using both the LLMs, which yields the best execution accuracy of 86.34% for COBOL to Java translation with a readability score of 0.646.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
1. A processor implemented method, comprising:
receiving, via one or more hardware processors, a first source code in a first language;
transpiling, by using a temperature sampling via the one or more hardware processors, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases, wherein each second source code in the set of second source codes is translated to a second language; and
performing a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally corrected and readability refined source code, wherein performing the first level of functional correctness and readability refinement comprises:
determining a current functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes;
ranking the set of refined second source codes to obtain a set of ranked source codes;
applying one or more large language model (LLM)-based task-specific readability criteria on a first ranked source code amongst the set of ranked source codes to obtain one or more specific portions of the first ranked source code, wherein the one or more specific portions indicate one or more problematic aspects of readability; and
performing the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functionally corrected and refined source code.
2. The processor implemented method as claimed in claim 1, wherein the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment.
3. The processor implemented method as claimed in claim 1, wherein the step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) an absence of error in each second source code, or (ii) one or more errors present in each second source code.
4. The processor implemented method as claimed in claim 3, wherein the one or more errors comprise at least one of one or more Compilation Errors (CE), one or more Runtime Errors (RE), one or more Function Errors (FE), and one or more Non-Termination Errors (NTE).
5. The processor implemented method as claimed in claim 3, wherein the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors.
6. The processor implemented method as claimed in claim 1, comprising:
performing a second level of functional correctness and functional refinement of the obtained functionally corrected and refined code by performing one of:
(i) identifying the obtained functionally corrected and refined code as a final source code based on an associated output of the obtained functionally corrected and refined code and the sample test cases; and
(ii) determining the functional correctness of the obtained functionally corrected and refined code to (a) identify the first ranked source code as the final source code, or (b) perform the functional refinement of the obtained functionally corrected and refined code based on the sample test cases to obtain the final source code.
7. The processor implemented method as claimed in claim 1, wherein the first source code is a low-resource procedural source code, and each second source code in the set of second source codes is a high-resource object-oriented programming language code.
8. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a first source code in a first language;
transpile, by using a temperature sampling, the first source code to a set of second source codes based on at least one selected prompt configuration and one or more sample test cases, wherein each second source code in the set of second source codes is translated to a second language; and
perform a first level of functional correctness and readability refinement of the set of second source codes to obtain a functionally corrected and readability refined source code, wherein performing the first level of functional correctness and readability refinement comprises:
determining a current functional correctness of the set of second source codes and refining the set of second source codes based on the determined functional correctness to obtain a set of refined second source codes;
ranking the set of refined second source codes to obtain a set of ranked source codes;
applying one or more large language model (LLM)-based task-specific readability criteria on a first ranked source code amongst the set of ranked source codes to obtain one or more specific portions of the first ranked source code, wherein the one or more specific portions indicate one or more problematic aspects of readability; and
performing the readability refinement of the one or more specific portions using the one or more sample test cases to obtain the functionally corrected and refined source code.
9. The system as claimed in claim 8, wherein the one or more large language model (LLM)-based task-specific readability criteria comprise at least one of an overall readability, an abstraction, a modularization, a dead code, a logical segmentation, a complexity, comments, a naming convention, a testability, and an alignment.
10. The system as claimed in claim 8, wherein the step of determining the functional correctness of the set of second source codes and refining the set of second source codes is performed to obtain an output that is indicative of (i) an absence of error in each second source code, or (ii) one or more errors present in each second source code.
11. The system as claimed in claim 10, wherein the one or more errors comprise at least one of one or more Compilation Errors (CE), one or more Runtime Errors (RE), one or more Function Errors (FE), and one or more Non-Termination Errors (NTE).
12. The system as claimed in claim 10, wherein the step of ranking the set of refined second source codes is based on an associated level of functional correctness followed by a type of the one or more errors.
13. The system as claimed in claim 8, wherein the one or more hardware processors are further configured by the instructions to:
perform a second level of functional correctness and functional refinement of the obtained functionally corrected and refined code by performing one of:
(i) identifying the obtained functionally corrected and refined code as a final source code based on an associated output of the obtained functionally corrected and refined code and the sample test cases; and
(ii) determining the functional correctness of the obtained functionally corrected and refined code to (a) identify the first ranked source code as the final source code, or (b) perform the functional refinement of the obtained functionally corrected and refined code based on the sample test cases to obtain the final source code.
14. The system as claimed in claim 8, wherein the first source code is a low-resource procedural source code, and each second source code in the set of second source codes is a high-resource object-oriented programming language code.
| # | Name | Date |
|---|---|---|
| 1 | 202421015859-STATEMENT OF UNDERTAKING (FORM 3) [06-03-2024(online)].pdf | 2024-03-06 |
| 2 | 202421015859-REQUEST FOR EXAMINATION (FORM-18) [06-03-2024(online)].pdf | 2024-03-06 |
| 3 | 202421015859-FORM 18 [06-03-2024(online)].pdf | 2024-03-06 |
| 4 | 202421015859-FORM 1 [06-03-2024(online)].pdf | 2024-03-06 |
| 5 | 202421015859-FIGURE OF ABSTRACT [06-03-2024(online)].pdf | 2024-03-06 |
| 6 | 202421015859-DRAWINGS [06-03-2024(online)].pdf | 2024-03-06 |
| 7 | 202421015859-DECLARATION OF INVENTORSHIP (FORM 5) [06-03-2024(online)].pdf | 2024-03-06 |
| 8 | 202421015859-COMPLETE SPECIFICATION [06-03-2024(online)].pdf | 2024-03-06 |
| 9 | Abstract1.jpg | 2024-04-05 |
| 10 | 202421015859-FORM-26 [20-05-2024(online)].pdf | 2024-05-20 |
| 11 | 202421015859-Proof of Right [29-08-2024(online)].pdf | 2024-08-29 |