Abstract: A system and method for mining test case repositories to identify the relevant test cases associated with a code change is described. The system integrates semantic analysis of source code and test descriptions with historical execution data and deterministic artifacts to enable targeted, explainable, and efficient test execution tailored to the actual scope and impact of code changes. The system comprises a code repository extractor module, a test case repository extractor module, a code understanding engine module, a test mapping engine module, a deterministic artifact parser module, a relationship correlation engine module, a change impact analyzer module, and a selective test executor module. Using pre-trained machine learning models, large language models (LLMs), and deterministic systems, the method creates a dependency map between code and test artifacts. The system intelligently identifies a subset of test cases affected by any code change, eliminating redundant or unrelated test executions and optimizing software regression testing.
Description:
FIELD OF THE INVENTION
The present invention relates to the field of software testing. More particularly, it pertains to systems and methods for mining test case repositories to identify relevant test cases based on code changes.
BACKGROUND OF THE INVENTION
Software development workflows commonly involve frequent code changes, even at a minor level. In traditional testing environments, such changes often result in the execution of large-scale test suites. However, a significant number of these tests may not be related to the actual code modifications. Executing such extensive test suites unnecessarily leads to increased consumption of computing resources, longer feedback cycles for developers, and reduced overall productivity. These issues are particularly pronounced in modern Continuous Integration (CI) and Continuous Delivery (CD) pipelines, where rapid iteration and immediate validation are critical.
In order to address this inefficiency, existing systems typically use static dependency analysis or rely on manually maintained code-test annotations to predict test case relevance. These approaches, however, fall short in several key areas. Static analysis cannot capture dynamic or runtime behaviors that emerge during actual execution, and manual annotations are prone to becoming outdated or inaccurate as the codebase evolves. Moreover, such methods often fail to identify implicit dependencies and cross-cutting relationships between code and test logic, which limits the accuracy of test case selection.
Prior Art:
For instance, US20210390036A1 discloses systems and methods for test impact analysis based on identifying changed code blocks using code fingerprinting techniques, such as CRC-based hashing, and correlating those changes with prior test coverage data. While this approach introduces automation in selecting relevant test cases, it is primarily dependent on historical test coverage and lacks semantic understanding of source code or test behavior. It does not utilize large language models (LLMs) or semantic graph representations, nor does it incorporate deterministic runtime artifacts such as logs or configuration data to validate impact predictions. In contrast, the present invention combines LLM-driven semantic inference with deterministic evidence to construct a version-controlled code-test dependency graph, enabling explainable and dynamic test case selection.
US10417119B2 describes a dynamic testing system that analyzes source code changes and identifies associated test cases by mapping test case history, test outcomes, and previous modifications. Although it implements a learning mechanism to improve mapping accuracy over time, the methodology primarily relies on tracking historical co-modification patterns and manually annotated test mappings. It does not support deep semantic analysis of source code, nor does it employ artificial intelligence models to infer implicit relationships between code and test cases. Furthermore, the system lacks integration with runtime deterministic artifacts such as execution logs or test coverage reports to validate or reinforce mapping predictions. The present invention overcomes these limitations by employing code-oriented LLMs, deterministic signal correlation, and graph-based reasoning to construct a robust, explainable, and adaptive test selection system.
WO2024103370A1 relates to a machine learning-based approach for selecting test cases affected by code changes, using data-driven classifiers and optimization algorithms to improve test selection accuracy. While it acknowledges the use of learning-based models, it does not describe the use of pre-trained language models specifically fine-tuned for code understanding, nor does it propose a hybrid reasoning framework that combines semantic, syntactic, and deterministic signals within a graph-based architecture. Additionally, the system does not describe constructing or maintaining a version-controlled graph of code-test relationships, nor does it incorporate runtime artifacts as empirical validation sources. In contrast, the present invention fuses deep semantic modeling with deterministic evidence to enable traceable and high-confidence test selection, and continuously evolves its internal graph through feedback from test outcomes.
DEFINITIONS
The expression “system” used hereinafter in this specification refers to an ecosystem comprising, but not limited to, a system with a user, input and output devices, a processing unit, a plurality of mobile devices, a mobile device-based application to identify dependencies and relationships between code and test artifacts, a visualization platform, and output; and extends to computing systems such as mobiles, laptops, computers, PCs, etc.
The expression “input unit” used hereinafter in this specification refers to, but is not limited to, mobiles, laptops, computers, PCs, keyboards, mice, pen drives, or other drives.
The expression “output unit” used hereinafter in this specification refers to, but is not limited to, an onboard output device, a user interface (UI), a display kit, a local display, a screen, a dashboard, or a visualization platform enabling the user to visualize, observe or analyse any data or scores provided by the system.
The expression “processing unit” refers to, but is not limited to, a processor of at least one computing device that optimizes the system.
The expression “large language model (LLM)” used hereinafter in this specification refers to a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
The expression “Code-Test Relationship Graph (CTRG)” used hereinafter in this specification refers to, but is not limited to, a dynamic and version-controlled graph-based data structure in which nodes represent source code modules, test cases, logs, and configurations, and edges represent both inferred (semantic/syntactic) and validated (deterministic) connections.
The expression “Minimal Impacted Test Set (MITS)” used hereinafter in this specification refers to, but is not limited to, a prioritized subset of test cases identified as being most likely affected by a given code change, based on graph traversal, confidence scores, and runtime evidence.
OBJECTS OF THE INVENTION
The primary object of the present invention is to provide a system and method for intelligently identifying test cases relevant to a given code change using large language models and deterministic runtime artifacts.
Another object of the invention is to construct a dynamic Code-Test Relationship Graph (CTRG) that captures semantic, syntactic, and deterministic dependencies between code, tests, and artifacts.
Yet another object of the invention is to validate inferred code-test relationships using execution logs, stack traces, test coverage reports, and configuration data.
A further object of the invention is to determine a Minimal Impacted Test Set (MITS) through graph traversal based on change impact, reachability, and confidence scores.
An additional object of the invention is to selectively execute impacted test cases through Continuous Integration (CI) tool integration and update the CTRG with new runtime evidence.
A still further object of the invention is to continuously improve test recommendations through feedback-driven graph updates and learning.
SUMMARY
Before the present invention is described, it is to be understood that the present invention is not limited to specific methodologies and materials described, as these may vary as per the person skilled in the art. It is also to be understood that the terminology used in the description is for the purpose of describing the particular embodiments only and is not intended to limit the scope of the present invention.
The present invention describes a system and method for mining software test case repositories and source code control systems to identify the most relevant test cases associated with a code change. The system integrates semantic analysis of source code and test descriptions with historical execution data and deterministic artifacts, such as runtime logs, test coverage reports, and configuration metadata. The system uses a multi-modal analysis approach that fuses deep learning, graph-based reasoning, and deterministic signal correlation.
According to an aspect of the present invention, the system comprises an input unit, a processing unit, and an output unit, wherein the processing unit comprises a code repository extractor module, a test case repository extractor module, a code understanding engine module, a test mapping engine module, a deterministic artifact parser module, a relationship correlation engine module, a change impact analyzer module, and a selective test executor module, such that each module works together to build an intelligent, dynamic mapping of test case relevance. The method integrates data from version control, test case repositories, execution logs, and documentation. Using pre-trained machine learning models, large language models (LLMs), and deterministic systems, the method creates a dependency map between code and test artifacts. The system intelligently identifies a subset of test cases affected by any code change, eliminating redundant or unrelated test executions and significantly optimizing software regression testing and CI/CD pipelines.
BRIEF DESCRIPTION OF DRAWINGS
A complete understanding of the present invention may be made by reference to the following detailed description, which is to be taken in conjunction with the accompanying drawing. The accompanying drawing, which is incorporated into and constitutes a part of the specification, illustrates one or more embodiments of the present invention and, together with the detailed description, serves to explain the principles and implementations of the invention.
FIG. 1 illustrates a flowchart of the workflow of the present invention.
DETAILED DESCRIPTION OF THE INVENTION:
Before the present invention is described, it is to be understood that this invention is not limited to methodologies described, as these may vary as per the person skilled in the art. It is also to be understood that the terminology used in the description is for the purpose of describing the particular embodiments only and is not intended to limit the scope of the present invention. Throughout this specification, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps. The use of the expression “at least” or “at least one” suggests the use of one or more elements or ingredients or quantities, as the use may be in the embodiment of the invention to achieve one or more of the desired objects or results. Various embodiments of the present invention are described below. It is, however, noted that the present invention is not limited to these embodiments, but rather the intention is that modifications that are apparent are also included.
The present invention describes a system and method for mining software test case repositories and source code control systems to identify the most relevant test cases associated with a code change. It addresses the existing challenges by providing an intelligent and adaptive system that dynamically identifies the most relevant test cases impacted by a given code change. The system integrates semantic analysis of source code and test descriptions with historical execution data and deterministic artifacts, such as runtime logs, test coverage reports, and configuration metadata. By leveraging these diverse inputs, it enables targeted, explainable, and efficient test execution tailored to the actual scope and impact of code changes within fast-paced Continuous Integration (CI) and Continuous Delivery (CD) environments. The system uses a multi-modal analysis approach that fuses deep learning, graph-based reasoning, and deterministic signal correlation. Unlike traditional test mapping tools that rely solely on static analysis or human-maintained annotations, this invention utilizes pre-trained code-oriented large language models (LLMs) to semantically understand both the source code and test case descriptions, extracting implicit and cross-layer dependencies between them. These relationships are further validated and strengthened through deterministic artifacts—such as logs, coverage data, and runtime stack traces—collected from real-world executions and CI pipelines.
According to the embodiment of the present invention, the system comprises an input unit, a processing unit, and an output unit, wherein the processing unit further comprises a Code Repository Extractor module (CRE), a Test Case Repository Extractor module (TCRE), a Code Understanding Engine module (CUE), a Test Mapping Engine module (TME), a Deterministic Artifact Parser module (DAP), a Relationship Correlation Engine module (RCE), a Change Impact Analyzer module (CIA), and a Selective Test Executor module (STE). Each module works together to build an intelligent, dynamic mapping of test case relevance.
According to the embodiment of the present invention, the code repository extractor module (CRE) connects to version control systems such as Git, SVN, or any SCM (source control management) system and extracts current branch code snapshots, commit history, file-level and function-level diffs, and change metadata (author, timestamp, message). A diff (short for difference) is a representation of the changes between two versions of a file or set of files. The module performs static parsing to tokenize and abstract the source code into its constituent modules, classes, methods, and components. The test case repository extractor module (TCRE) extracts test case definitions (unit, integration, E2E), associated metadata (tags, labels, author, priority), historical test run data, and linked issues or user stories. The repository may reside in different testing tools or frameworks that serve different roles and operate in different ecosystems, such as JUnit, PyTest, Postman Collections, Cucumber Gherkin specs, etc. The extractor normalizes them into a common representation format.
According to the embodiment of the present invention, the code understanding engine module (CUE) uses pre-trained transformer-based models (e.g., CodeBERT, GraphCodeBERT) to identify code constructs and logical boundaries, build an abstract syntax tree (AST), annotate code with functional roles (controller, model, utility), and detect cross-cutting concerns (logging, security, etc.). It also identifies reusable components and shared modules. The test mapping engine module (TME) uses fine-tuned LLMs (code-oriented or specialized open-source models) to link specific test cases to code blocks or functions, understand natural language descriptions in test case titles, match expected behaviors to method implementations, and handle implicit relationships (e.g., shared global states, fixtures). It builds a many-to-many relationship graph between code blocks and test cases.
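A minimal, non-limiting sketch of the construct identification and test-to-code matching described above is given below; the `ast`-based signature listing stands in for the transformer-based CUE, and the token-overlap `match_score` is a deliberately crude placeholder for the fine-tuned-LLM similarity the TME performs:

```python
import ast

def function_signatures(source):
    """List (name, argument_names) for each function in a Python module --
    a minimal stand-in for the CUE's identification of code constructs
    and logical boundaries from the AST."""
    tree = ast.parse(source)
    return [
        (node.name, [a.arg for a in node.args.args])
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    ]

def match_score(test_title, func_name):
    """Token-overlap similarity between a natural-language test title and a
    function name: the fraction of the function's name tokens that appear
    in the title. A crude placeholder for the TME's LLM-based matching."""
    title_tokens = set(test_title.lower().replace("_", " ").split())
    name_tokens = set(func_name.lower().split("_"))
    return len(title_tokens & name_tokens) / max(len(name_tokens), 1)
```

A real embodiment would replace `match_score` with embedding similarity from a code-oriented model; the overlap score merely illustrates the many-to-many linking step.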
According to the embodiment of the present invention, to add explainability and certainty to the correlation graph, the deterministic artifact parser module (DAP) reviews application logs from CI pipelines; stack traces and exception logs; test coverage reports (e.g., JaCoCo, Istanbul); and configuration and orchestration metadata (Docker, Kubernetes, Terraform). Using these deterministic trails, the parser module reinforces LLM predictions with hard evidence, for example: "this log line confirms method X was tested by case Y." The relationship correlation engine module (RCE) is the central integrator. It fuses all inputs to generate a code-test-case relationship graph (CTRG). In the code-test-case relationship graph, the nodes are code modules, test cases, logs, and configurations; the edges are semantic, syntactic, statistical, and deterministic relationships; and the weighting is based on log evidence, LLM confidence, frequency of co-modification, and test outcome correlations. The graph is stored in a graph database (e.g., Neo4j) with versioning support.
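The fusion of inferred and deterministic edges into the CTRG may be illustrated by the following sketch, which uses a plain dictionary in place of a graph database; the weighting constants `llm_weight` and `det_weight` are illustrative assumptions, not values prescribed by the specification:

```python
def build_ctrg(semantic_edges, coverage_map, llm_weight=0.4, det_weight=0.6):
    """Fuse LLM-inferred links with deterministic coverage evidence into a
    simple CTRG: {(code_node, test_node): weight}.

    semantic_edges: {(code_node, test_node): llm_confidence in [0, 1]}
    coverage_map:   {test_node: [code nodes the coverage report proves it hit]}
    The RCE described above would additionally weigh co-modification
    frequency and test-outcome correlations; those inputs are omitted here.
    """
    graph = {}
    # inferred (semantic/syntactic) edges, scaled by model confidence
    for (code, test), conf in semantic_edges.items():
        graph[(code, test)] = llm_weight * conf
    # deterministic edges from coverage reports reinforce or create links
    for test, covered in coverage_map.items():
        for code in covered:
            graph[(code, test)] = graph.get((code, test), 0.0) + det_weight
    return graph
```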
According to the embodiment of the present invention, the change impact analyzer module (CIA) monitors commits or change requests and performs fine-grained diff extraction, path tracing in the code-test-case relationship graph from changed nodes, propagation of changes via dependencies and scoping of impacted test cases using reachability and confidence scores. The module produces a “Minimal Impacted Test Set (MITS)” for the change. The selective test executor module (STE) integrates with CI tools (e.g., Jenkins, GitHub Actions) to pull the Minimal Impacted Test Set, execute only the identified test cases, track pass/fail and test coverage and update the code-test-case relationship graph based on new evidence.
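The graph traversal by which the CIA scopes the Minimal Impacted Test Set may be sketched as below; the `test:` node prefix, the multiplicative confidence propagation, and the threshold `theta` are assumptions made for this example only:

```python
from collections import deque

def minimal_impacted_test_set(adjacency, changed, theta=0.5):
    """Traverse a weighted dependency graph from changed code nodes and
    collect reachable test nodes whose accumulated confidence stays >= theta.

    adjacency: {node: [(neighbor, edge_weight)]}; test-case nodes are
    assumed to carry a "test:" prefix purely for this sketch.
    Returns test nodes sorted by descending impact confidence (the MITS).
    """
    impacted = {}
    best_seen = {}
    queue = deque((c, 1.0) for c in changed)
    while queue:
        node, conf = queue.popleft()
        if best_seen.get(node, 0.0) >= conf:
            continue  # already reached with equal or higher confidence
        best_seen[node] = conf
        for nbr, weight in adjacency.get(node, []):
            nconf = conf * weight  # confidence decays along the path
            if nconf < theta:
                continue  # below the reachability/confidence cutoff
            if nbr.startswith("test:"):
                impacted[nbr] = max(impacted.get(nbr, 0.0), nconf)
            else:
                queue.append((nbr, nconf))
    return sorted(impacted, key=impacted.get, reverse=True)
```

The STE would then hand this ranked list to the CI tool for selective execution.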
According to the embodiment of the present invention, the system constructs a dynamic Code-Test Relationship Graph (CTRG) in which nodes represent code entities, test cases, logs, and configuration items, while edges encode both inferred (semantic/syntactic) and proven (deterministic) connections. When a code change is introduced, the system uses a hybrid traversal algorithm that combines LLM-derived confidence scores with deterministic edge weights to identify only those test cases that are most likely impacted. This dual-layered validation mechanism, wherein predictions from generative models are cross-checked with empirical signals, makes the solution uniquely robust and explainable. The resulting Minimal Impacted Test Set (MITS) enables highly targeted test execution, reducing redundancy and dramatically improving efficiency in continuous integration (CI) workflows. Additionally, the system continuously evolves by incorporating feedback from every test run, making the relationship graph self-healing and increasingly accurate over time. The novel features of the invention are the fusion of LLM-based relationship inference with deterministic logs in a version-controlled graph; the hybrid reasoning engine that quantifies impact using both semantic and empirical signals; and the continuous learning loop that adapts mappings based on runtime behavior, enabling a scalable, intelligent, and adaptive regression testing system for modern software pipelines.
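One possible hybrid edge weighting, combining an LLM confidence score with accumulated deterministic evidence, is sketched below; the saturating evidence term and the mixing factor `alpha` are illustrative assumptions, not values prescribed by the specification:

```python
def hybrid_edge_weight(llm_conf, det_hits, alpha=0.5):
    """Combine an LLM-derived confidence (0..1) with deterministic evidence
    (a count of runtime artifacts -- log lines, coverage entries, stack
    traces -- confirming the link) into a single edge weight.

    The evidence term 1 - 2**(-det_hits) saturates toward 1 as more
    independent artifacts confirm the relationship, so an edge backed by
    hard evidence dominates one supported only by model inference.
    """
    det_signal = 1.0 - 2.0 ** (-det_hits)
    return alpha * llm_conf + (1.0 - alpha) * det_signal
```

Under this rule an unvalidated prediction can never exceed `alpha`, which operationalizes the dual-layered validation described above: generative predictions alone cap out below fully evidenced edges.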
According to the embodiment of the present invention, the method for mining software test case repositories and source code control systems to identify the most relevant test cases associated with a code change as described in FIG. 1 comprises the steps of:
• ingesting test case repositories and source code systems;
• using pre-trained models to extract component-level understanding of the codebase;
• using LLMs to discover relationships and dependencies between modules;
• mapping test cases to specific blocks of code using a combination of static analysis and LLM-driven interpretation;
• using deterministic systems like execution logs and config artifacts to reinforce or cross-validate these relationships;
• generating an intelligent test case mapping graph;
• consulting this graph upon a code change to determine the minimal set of tests required.
Advantages:
• Efficiency: Avoids running thousands of unrelated test cases.
• Accuracy: Higher confidence in test relevance.
• Explainability: Uses logs and LLM reasoning for justification.
• Cost-saving: Reduces cloud usage and developer wait times.
• Adaptability: Learns and updates as the codebase evolves.
Example:
Deep test case correlation and selection:
Input:
• C = {c1, c2, ..., cn} Code components
• T = {t1, t2, ..., tm} Test cases
• L = {l1, ..., lp} Logs and artifacts
• ΔC = Change set (subset of C)
Output:
• T’ ⊆ T Relevant tests for ΔC
Steps:
1. Code Understanding:
o For each ci ∈ C, derive its AST, role, and signature using CUE.
2. Test Mapping:
o For each tj ∈ T, extract its described behavior and execution path.
o Run similarity match between tj and ci using TME.
3. Artifact Reinforcement:
o Parse L to identify actual runtime links between tj and ci.
4. Graph Formation:
o Build CTRG = (Nodes: C ∪ T ∪ L, Edges: relationships with weights).
5. Change Propagation:
o For each Δci ∈ ΔC, traverse CTRG to find reachable tj with edge weight ≥ θ.
6. Test Selection:
o Aggregate these test cases into T’.
o Rank them by impact likelihood, execution cost, and priority.
7. Feedback Loop:
o After test execution, update weights based on outcomes and logs.
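Step 7's feedback loop could, for example, follow a simple reinforcement-and-decay rule such as the sketch below; the learning rate `lr`, the decay factor, and the specific update form are illustrative assumptions rather than the claimed method:

```python
def update_edge_weights(graph, executed, failed, lr=0.2):
    """Feedback loop (step 7): after a test run, adjust CTRG edge weights.

    graph:    {(code_node, test_node): weight in [0, 1]}
    executed: set of test nodes that ran in this cycle
    failed:   subset of executed tests that failed after the change

    A failure following a change is strong evidence the dependency is real,
    so those edges move toward 1; a pass is only weak evidence of
    independence, so those edges decay gently.
    """
    updated = {}
    for (code, test), w in graph.items():
        if test in failed:
            w = w + lr * (1.0 - w)        # reinforce confirmed dependency
        elif test in executed:
            w = w * (1.0 - lr * 0.25)     # gentle decay on a passing run
        updated[(code, test)] = w
    return updated
```

Repeated over many runs, this drives the graph toward the self-healing, increasingly accurate mapping described in the detailed description.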
Claims:
We claim,
1. A system and method for mining test case repositories to identify the relevant test cases associated with a code change
characterized in that
the system integrates semantic analysis of source code and test descriptions with historical execution data and deterministic artifacts to enable targeted, explainable, and efficient test execution tailored to the actual scope and impact of code changes within the Continuous Integration and Continuous Delivery environments;
the system comprises an input unit, a processing unit, and an output unit, wherein the processing unit comprises a code repository extractor module, a test case repository extractor module, a code understanding engine module, a test mapping engine module, a deterministic artifact parser module, a relationship correlation engine module, a change impact analyzer module, and a selective test executor module, such that each module works together to build an intelligent, dynamic mapping of test case relevance;
the method for mining test case repositories comprises the steps of:
• ingesting test case repositories and source code systems;
• using pre-trained models to extract component-level understanding of the codebase;
• using LLMs to discover relationships and dependencies between modules;
• mapping test cases to specific blocks of code using a combination of static analysis and LLM-driven interpretation;
• using deterministic systems like execution logs and config artifacts to reinforce or cross-validate these relationships;
• generating an intelligent test case mapping graph;
• consulting this graph upon a code change to determine the minimal set of tests required.
2. The system and method as claimed in claim 1, wherein the code repository extractor module connects to version control systems and extracts current branch code snapshots, commit history, file-level and function-level diffs, and change metadata, and the module performs static parsing to tokenize and abstract the source code into its constituent modules, classes, methods, and components.
3. The system and method as claimed in claim 1, wherein the test case repository extractor module extracts test case definitions, associated metadata, historical test run data, and linked issues or user stories, such that the repository may be in different testing tools or frameworks, and the extractor normalizes them into a common representation format.
4. The system and method as claimed in claim 1, wherein the code understanding engine module uses pre-trained transformer-based models to identify code constructs and logical boundaries, build an abstract syntax tree, annotate code with functional roles, and detect cross-cutting concerns, and it also identifies reusable components and shared modules.
5. The system and method as claimed in claim 1, wherein the test mapping engine module uses fine-tuned LLMs from code-oriented or specialized open-source models to link specific test cases to code blocks or functions, understand natural language descriptions in test case titles, match expected behaviors to method implementations and handle implicit relationships and it builds a many-to-many relationship graph between code blocks and test cases.
6. The system and method as claimed in claim 1, wherein to add explainability and certainty to the correlation graph, the deterministic artifact parser module reviews application logs from CI pipelines; stack traces and exception logs; test coverage reports and configuration and orchestration metadata and using deterministic trails, the parser module reinforces LLM predictions with hard evidence.
7. The system and method as claimed in claim 1, wherein the relationship correlation engine module is the central integrator that fuses all inputs to generate a code-test-case relationship graph in which the nodes are code modules, test cases, logs, configurations, the edges are semantic, syntactic, statistical, deterministic relationships and the weighting is based on log evidence, LLM confidence, frequency of co-modification, test outcome correlations and the graph is stored in a graph database with versioning support.
8. The system and method as claimed in claim 1, wherein the change impact analyzer module monitors commits or change requests and performs fine-grained diff extraction, path tracing in the code-test-case relationship graph from changed nodes, propagation of changes via dependencies and scoping of impacted test cases using reachability and confidence scores and the module produces a minimal impacted test set for the change.
9. The system and method as claimed in claim 1, wherein the selective test executor module integrates with CI tools to pull the minimal impacted test set, execute only the identified test cases, track pass/fail and test coverage and update the code-test-case relationship graph based on new evidence.
10. The system and method as claimed in claim 1, wherein the system constructs the dynamic Code-Test Relationship Graph where nodes represent code entities, test cases, logs, and configuration items, while edges encode both inferred and proven connections, and when a code change is introduced, the system uses the hybrid traversal algorithm that combines LLM-derived confidence scores with deterministic edge weights to identify only those test cases that are most likely impacted, and this dual-layered validation mechanism, wherein predictions from generative models are cross-checked with empirical signals, makes the solution uniquely robust and explainable.
| # | Name | Date |
|---|---|---|
| 1 | 202521066228-STATEMENT OF UNDERTAKING (FORM 3) [11-07-2025(online)].pdf | 2025-07-11 |
| 2 | 202521066228-POWER OF AUTHORITY [11-07-2025(online)].pdf | 2025-07-11 |
| 3 | 202521066228-FORM 1 [11-07-2025(online)].pdf | 2025-07-11 |
| 4 | 202521066228-FIGURE OF ABSTRACT [11-07-2025(online)].pdf | 2025-07-11 |
| 5 | 202521066228-DRAWINGS [11-07-2025(online)].pdf | 2025-07-11 |
| 6 | 202521066228-DECLARATION OF INVENTORSHIP (FORM 5) [11-07-2025(online)].pdf | 2025-07-11 |
| 7 | 202521066228-COMPLETE SPECIFICATION [11-07-2025(online)].pdf | 2025-07-11 |
| 8 | Abstract.jpg | 2025-07-30 |
| 9 | 202521066228-FORM-9 [26-09-2025(online)].pdf | 2025-09-26 |
| 10 | 202521066228-FORM 18 [01-10-2025(online)].pdf | 2025-10-01 |