
System and Method for LLM-Orchestrated Contextual Deduplication of Tools Across Distributed Modular Control Plane (MCP) Servers

Abstract: The present invention discloses a system (10) and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers. The system comprises an input unit—tool ingestion and normalization module (1); a processing unit (2) featuring a behavioral analysis engine (21), output signature extractor (22), dual-pass deduplication engine (23) with contextual (23.1) and deterministic (23.2) paths, an orchestration DAG framework (24), a canonical tool registry (CTR) (25), and user interfaces (26); and an output unit with a fault tolerance and recovery module (3). The method involves ingesting and normalizing tools, extracting semantic behavior and output hashes using LLMs, and detecting duplicates through combined vector similarity and deterministic comparisons. Tools are stored in the CTR and made accessible via APIs and dashboards.


Patent Information

Application #
Filing Date
17 July 2025
Publication Number
40/2025
Publication Type
INA
Invention Field
BIO-MEDICAL ENGINEERING

Applicants

Persistent Systems
Bhageerath, 402, Senapati Bapat Rd, Shivaji Cooperative Housing Society, Gokhale Nagar, Pune - 411016, Maharashtra, India

Inventors

1. Mr. Nitish Shrivastava
10764 Farallone Dr, Cupertino, CA 95014-4453, United States

Specification

Description:
FIELD OF INVENTION
The present invention relates to the fields of distributed computing, computer-based application orchestration, and artificial intelligence. More particularly, it pertains to a system and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers, thereby identifying, deduplicating, and managing tools across such servers.

BACKGROUND
Software infrastructures, especially Modular Control Plane (MCP) servers, are widely used to manage and orchestrate the execution of automation tools, scripts, and services across distributed environments. These systems enable teams to independently develop and deploy tools for tasks such as data processing, system configuration, monitoring, and more. While some metadata such as descriptions, version tags, or help outputs may be available, most MCP systems do not extract behavioural characteristics or output patterns in a structured, machine-readable way. This leads to poor contextual understanding of tool behaviour, making deduplication and optimization difficult.
Existing LLM-based systems used for workflow orchestration or intelligent automation are not designed to evaluate whether different tools perform the same job, especially when the tools differ in structure or output formatting. Existing approaches rely heavily on shallow comparisons; these techniques fail to capture deeper contextual meaning or behavioural similarities between tools. These limitations highlight the need for a structured, intelligent deduplication method that compares tools based on behaviour, inputs, and outputs regardless of naming or formatting differences.
Prior Art:
US20170344579A1 relates to data deduplication; wherein upon addition or modification of a data unit in a data storage device, a Context Triggered Piecewise Hash (CTPH) key may be generated for an added or modified data unit. CTPH key of the added or modified data unit may be compared with a group CTPH key for each of a plurality of groups of data units stored in the data storage device to identify a group whose group CTPH key is within a pre-defined edit distance from the CTPH key of the added or modified data unit. A duplicate of the added or modified data unit may be identified within the identified group.
While the prior art detects and eliminates near-duplicate data units in storage systems, it does not account for functional or behavioural equivalence between high-level software tools. None of the prior art addresses the need to intelligently deduplicate functionally equivalent tools across distributed Modular Control Plane (MCP) environments using a hybrid of large language models, semantic embeddings, and deterministic rule-based logic.
To overcome these drawbacks, there is a need for a novel, deterministic, tool orchestration system that can intelligently analyse, compare, and deduplicate similar tools across distributed Modular Control Plane (MCP) servers by combining semantic understanding from large language models with rule-based output verification, ensuring accuracy and scalability in tool management.

DEFINITIONS
The expression “system” used hereinafter in this specification refers to an ecosystem comprising, but not limited to, the system for LLM-orchestrated contextual deduplication of tools together with input and output devices, a processing unit, a plurality of mobile devices, and a mobile device-based application. It extends to computing systems such as mobile phones, laptops, computers, PCs, and other digital computing devices.
The expression “input unit” used hereinafter in this specification refers to, but is not limited to, mobile, laptops, computers, PCs, keyboards, mouse, pen drives or drives.
The expression “output unit” used hereinafter in this specification refers to, but is not limited to, an onboard output device, a user interface (UI), a display unit, a local display, a screen, a dashboard, or a visualization platform enabling the user to visualize the graphs provided as output by the system.
The expression “modular control plane” or “MCP” used hereinafter in this specification refers to a framework that provides the infrastructure and tools required to manage and orchestrate a wide range of workloads, from basic applications to complicated computational processes, and can be deployed on-premises, in the cloud, or at the edge. By underpinning software processes, such platforms help businesses streamline and adapt.
The expression “Large Language Models” or “LLMs” used hereinafter in this specification refers to systems that use natural language understanding to interpret and generate text. In this system, they help extract features and suggest metrics.
The expression “deduplication” used hereinafter in this specification refers to the process of removing identical files or blocks from databases and data storage.
The expression “orchestration” used hereinafter in this specification refers to the automated coordination and management of multiple tasks, systems, or services to execute a larger workflow or process.
The expression “directed acyclic graph” or “DAG” used hereinafter in this specification refers to a type of graph data structure where nodes are connected by directed edges (meaning each connection has a direction) and there are no cycles, so that one cannot follow a path of directed edges and return to the starting node.
The expression “canonical tool registry” or “CTR” used hereinafter in this specification refers to a centralized, authoritative list or repository of tools, libraries, or packages that are officially recognized and maintained for a particular ecosystem, organization, or project.
The expression “API” or “Application Programming Interface” used hereinafter in this specification refers to a set of rules and specifications that allows different software systems to communicate and interact with each other.

OBJECTS OF THE INVENTION
The primary object of the invention is to provide a system and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers.
Another object of the invention is to use dual-pass deduplication driven by LLM and rule engines.
Yet another object of the invention is to provide a vector-based semantic comparison of tool behaviour and output.
Yet another object of the invention is to provide a middle-layer architecture that auto-updates and retriggers only affected clusters upon new ingestion.
Yet another object of the invention is to provide a feedback loop that allows manual intervention to refine model and rule decisions.

SUMMARY
Before the present invention is described, it is to be understood that the present invention is not limited to specific methodologies and materials described, as these may vary as per the person skilled in the art. It is also to be understood that the terminology used in the description is for the purpose of describing the particular embodiments only and is not intended to limit the scope of the present invention.
The invention discloses a system and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers that uniquely solves the problem by employing a dual-path strategy that blends contextual inference and deterministic logic; wherein the system (10) comprises multiple interlinked modules where the tool ingestion and normalization module (1) serves as the input unit that acquires and standardizes tool data. The processing unit (2) includes several key components such as the behavioral analysis engine (21) powered by transformer-based LLMs, the output signature extractor (22), and the dual-pass deduplication engine (23) comprising a contextual match path (23.1) and deterministic match path (23.2). The orchestration framework (24) is implemented as a directed acyclic graph (DAG), ensuring modular workflow execution. A canonical tool registry (CTR) (25) stores unique tool records and embeddings. Usability is enhanced through visual and API interfaces (26), and the output unit (3) includes a fault tolerance and recovery module that safeguards against incomplete or failed LLM outputs.
The deduplication process begins with secure tool ingestion (1) from MCP servers, followed by normalization using textual transformations and optional LLM-assisted fuzzy matching. The behavioral analysis engine (21) extracts semantic representations of each tool's core functions and encodes them into vectors. Simultaneously, the output signature extractor (22) identifies unique outputs like file names, logs, or return codes and converts them into hashes. The dual-pass deduplication engine (23) then conducts semantic clustering using cosine similarity and deterministic matching via output hashes and rule-based logic. These results are fused to identify duplicates, variants, or uncertain cases flagged for manual review. The orchestration DAG (24) manages these stages in a modular and checkpointed pipeline, ensuring efficiency and reactivity. Final tool entries are stored in the canonical tool registry (25), while users interact with the results via dashboards and APIs (26). Fault tolerance and recovery module (3) guarantees accuracy by rerouting failed processes for reanalysis.
The system offers a high degree of accuracy and scalability due to its novel dual-pass deduplication strategy that blends LLM-driven contextual similarity with deterministic output analysis. Its vector-based semantic comparison enables deep, behavior-aware matching even for superficially similar tools. The orchestration DAG architecture facilitates efficient execution by retriggering only the affected clusters upon new ingestion, preserving computational resources. The human-in-the-loop feedback loop further enhances reliability by allowing manual intervention in ambiguous cases. These features make the system particularly valuable for environments like enterprise DevOps, AI toolchains, container orchestration, and security audit platforms, where tool sprawl and redundancy are common. Overall, it enables streamlined operations, better resource management, and faster resolution of tool inconsistencies.

BRIEF DESCRIPTION OF DRAWINGS
A complete understanding of the present invention may be made by reference to the following detailed description, which is to be taken in conjunction with the accompanying drawing. The accompanying drawing, which is incorporated into and constitutes a part of the specification, illustrates one or more embodiments of the present invention and, together with the detailed description, serves to explain the principles and implementations of the invention.
FIG.1. illustrates a schematic representation of the structural and functional components of the system.
FIG.2. illustrates a high-level stepwise method/workflow.
FIG.3. illustrates dual deduplication paths.
FIG.4. illustrates orchestration DAG (Workflow Engine).

DETAILED DESCRIPTION OF INVENTION
Before the present invention is described, it is to be understood that this invention is not limited to methodologies described, as these may vary as per the person skilled in the art. It is also to be understood that the terminology used in the description is for the purpose of describing the particular embodiments only and is not intended to limit the scope of the present invention. Throughout this specification, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps. The use of the expression “at least” or “at least one” suggests the use of one or more elements or ingredients or quantities, as the use may be in the embodiment of the invention to achieve one or more of the desired objects or results. Various embodiments of the present invention are described below. It is, however, noted that the present invention is not limited to these embodiments, but rather the intention is that modifications that are apparent are also included.
The present invention describes a system and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers. The system (10), as illustrated in FIG. 1, comprises an input unit referred to as a tool ingestion and normalization module (1), a processing unit (2) further comprising a behavioural analysis engine (21), an output signature extractor (22), a dual-pass deduplication engine (23), an orchestration framework implemented as a directed acyclic graph (DAG) (24), a canonical tool registry (CTR) (25), and visual or API interfaces (26), and an output unit with a fault tolerance and recovery module (3); where all the structural and functional components work in co-ordination to employ a method for workflow orchestration.
In an embodiment of the invention, the input unit, also referred to as the tool ingestion and normalization module (1), fetches tools from MCP server inventories through secure API calls; each tool is then normalized by applying transformations including lowercasing, punctuation removal, alias matching, and prefix-stripping. The normalization process is supplemented by a lightweight synonym registry and optional fuzzy-matching LLM assistance.
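By way of illustration only, the normalization transformations described above (lowercasing, punctuation removal, prefix-stripping, and alias matching against a synonym registry) may be sketched as follows; the registry contents, prefix list, and function name are illustrative assumptions, not part of the claimed implementation:

```python
import re

# Hypothetical synonym registry and prefix list; real contents would be
# supplied by the deployment and optionally refined by an LLM fuzzy matcher.
SYNONYM_REGISTRY = {"cp": "copy", "rm": "remove", "ls": "list"}
COMMON_PREFIXES = ("tool_", "util_", "mcp_")

def normalize_tool_name(raw_name: str) -> str:
    name = raw_name.lower().strip()                # lowercasing
    name = re.sub(r"[^\w\s-]", "", name)           # punctuation removal
    for prefix in COMMON_PREFIXES:                 # prefix-stripping
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    return SYNONYM_REGISTRY.get(name, name)        # alias matching

print(normalize_tool_name("Tool_CP!"))  # -> "copy"
```

In practice this deterministic pass would run first, with the optional LLM-assisted fuzzy matching reserved for names that survive it unresolved.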
In yet another embodiment of the invention, the behavioural analysis engine (21) possesses a transformer-based LLM model acting as the core of this module, which analyses tool descriptions, help documentation, and usage examples. A custom prompt is formulated to extract behavioral aspects such as the core purpose, which relates to what the tool does; inputs and expected parameters; and side-effects such as system changes, file writes, and API calls. The output is structured as a JSON schema and embedded into a vector using a semantic embedding model (e.g., text-embedding-3-large).
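The shape of the behavioral JSON and the subsequent embedding step may be sketched as below. The field names, the stand-in extractor, and the pseudo-embedding are assumptions for illustration: a production system would obtain the JSON from an LLM prompt and the vector from a real semantic embedding model such as text-embedding-3-large.

```python
import json

def extract_behavior(description: str) -> dict:
    # Stand-in for the LLM prompt: returns the assumed JSON schema shape
    # (core purpose, inputs, side-effects) rather than a real analysis.
    return {
        "core_purpose": description.split(".")[0],
        "inputs": [],
        "side_effects": [],
    }

def embed(behavior: dict, dim: int = 8) -> list:
    # Placeholder pseudo-embedding derived from the canonical JSON text,
    # standing in for a semantic embedding model call.
    text = json.dumps(behavior, sort_keys=True)
    return [(hash(text[i::dim]) % 1000) / 1000.0 for i in range(dim)]

behavior = extract_behavior("Copies files between hosts. Supports retries.")
vector = embed(behavior)
print(behavior["core_purpose"], len(vector))
```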
In the next embodiment of the invention, the output signature extractor module (22) operates by leveraging another LLM prompt specifically designed to identify various forms of output, such as files generated, console output patterns, and return codes or logs. These outputs are then processed and converted into deterministic hashes, which serve as unique signatures subsequently used for effective matching and conflict resolution during downstream processing.
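A minimal sketch of the deterministic hashing step follows; the output field names are illustrative, and canonicalizing the JSON (sorted keys, fixed separators) is an assumed detail that makes identical output descriptions always yield identical signatures:

```python
import hashlib
import json

def output_signature_hash(outputs: dict) -> str:
    # Canonicalize so key order and whitespace cannot change the hash.
    canonical = json.dumps(outputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

sig_a = output_signature_hash({"files": ["report.csv"], "return_codes": [0, 1]})
sig_b = output_signature_hash({"return_codes": [0, 1], "files": ["report.csv"]})
print(sig_a == sig_b)  # -> True: key order does not affect the signature
```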
In yet a next embodiment, the dual-pass deduplication engine (23) forms a core component of the system, integrating two complementary analytical perspectives: the contextual path (23.1), which evaluates tools based on cosine similarity applied to their behavior and output vectors, followed by clustering techniques such as HDBSCAN or KMeans to group semantically similar tools; and, in parallel, the deterministic path (23.2), which performs a hash-based comparison of output signatures and applies rule-based identity checks, such as matching flag structures and output patterns. If both paths concur, the tools are conclusively marked as duplicates. When only one path indicates similarity, the tools are flagged as variants. In cases where the paths disagree, the tools are escalated for manual review to ensure accuracy.
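The fusion of the two paths can be sketched as below. The 0.9 similarity threshold, the label names, and the treatment of the no-match case as a new unique tool are assumptions; in the described system, ambiguous outcomes are escalated for manual review rather than decided automatically.

```python
import math

def cosine(a, b):
    # Cosine similarity between two behavior vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fuse(vec_a, vec_b, hash_a, hash_b, threshold=0.9):
    contextual = cosine(vec_a, vec_b) >= threshold   # contextual path (23.1)
    deterministic = hash_a == hash_b                 # deterministic path (23.2)
    if contextual and deterministic:
        return "duplicate"   # both paths concur
    if contextual or deterministic:
        return "variant"     # only one path indicates similarity
    return "unique"          # neither path matches (assumed handling)

print(fuse([1.0, 0.0], [1.0, 0.0], "h1", "h1"))  # -> duplicate
print(fuse([1.0, 0.0], [1.0, 0.0], "h1", "h2"))  # -> variant
```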
In yet a next embodiment of the invention, the orchestration framework (24) is implemented as a directed acyclic graph (DAG) using workflow tools such as LangGraph or Prefect, where each stage of the pipeline, such as tool ingestion, normalization, behavior analysis, output hashing, deduplication, and registration, is represented as an individual node with checkpointed state within the graph, such that each node can be independently re-executed based on dependency impact. To optimize efficiency, smart checkpointing mechanisms are employed to avoid redundant processing. Additionally, upon ingestion of a new tool, only the impacted clusters in the semantic or deterministic space are retriggered, minimizing redundant computation and enabling reactive deduplication at scale, thereby enabling recursive deduplication and maintaining the accuracy and integrity of the system. The DAG supports partial re-execution, ensuring only affected downstream nodes are recomputed upon updates.
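The partial re-execution property can be illustrated with a toy dependency graph. The node names mirror the pipeline stages above, but the graph and traversal are an illustrative sketch, not the claimed implementation (which may rely on LangGraph or Prefect):

```python
from collections import deque

# Illustrative DAG: each key lists its downstream (dependent) stages.
EDGES = {
    "ingest": ["normalize"],
    "normalize": ["behavior_analysis", "output_hashing"],
    "behavior_analysis": ["dedup"],
    "output_hashing": ["dedup"],
    "dedup": ["register"],
    "register": [],
}

def affected_nodes(changed: str) -> set:
    # BFS over the DAG: every node reachable from the changed node is
    # stale and must be recomputed; everything else keeps its checkpoint.
    stale, queue = set(), deque([changed])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale

print(sorted(affected_nodes("behavior_analysis")))  # -> ['dedup', 'register']
```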
In yet a next embodiment of the invention, the canonical tool registry (CTR) (25) is a persistent store that maintains records of all unique tools; wherein each tool entry contains a canonical ID, a normalized name, a behavioral fingerprint, an output hash, and a comprehensive list of all matching or variant tools along with their corresponding MCP sources. These entries are indexed using embeddings, enabling real-time semantic search and efficient retrieval. Further, the visual and API interfaces (26) are configured to enhance usability and data accessibility, such that users can access deduplicated lists, comparison summaries, and detailed audit logs through the APIs (26). Additionally, the dashboard provides visual representations, including graphs that illustrate duplicate clusters and variant timelines, facilitating easier analysis and tracking.
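A CTR entry and an embedding-indexed lookup may be sketched as follows. The field names follow the description above; the in-memory list and brute-force nearest-neighbour search are stand-ins for a real vector store, and the sample data is hypothetical:

```python
import math
from dataclasses import dataclass, field

@dataclass
class CTREntry:
    canonical_id: str
    normalized_name: str
    behavioral_fingerprint: list          # embedding vector
    output_hash: str
    variants: list = field(default_factory=list)  # matching tools + MCP sources

def semantic_search(registry, query_vec):
    # Return the entry whose fingerprint is most cosine-similar to the query.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    return max(registry, key=lambda e: cos(e.behavioral_fingerprint, query_vec))

registry = [
    CTREntry("ctr-001", "copy", [1.0, 0.1], "aaa"),
    CTREntry("ctr-002", "compress", [0.1, 1.0], "bbb"),
]
print(semantic_search(registry, [0.9, 0.2]).canonical_id)  # -> ctr-001
```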
In yet a next embodiment, an output includes a fault tolerance and recovery module (3), such that if a large language model (LLM) fails or produces incomplete results, an independent engine steps in to process the tool. All such outcomes are logged along with a confidence score and are flagged for reprocessing at a later stage to ensure accuracy and completeness.
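The fallback behaviour described above may be sketched as below; the function names, the fixed 0.5 confidence score, and the log record shape are illustrative assumptions:

```python
def analyze_with_fallback(tool, llm_analyze, deterministic_analyze, log):
    # Try the LLM first; on failure or incomplete output, fall back to the
    # independent deterministic engine and flag the result for reprocessing.
    try:
        result = llm_analyze(tool)
        if result is None:                       # treat as incomplete output
            raise ValueError("incomplete LLM output")
        return result
    except Exception as exc:
        result = deterministic_analyze(tool)
        log.append({"tool": tool, "confidence": 0.5,
                    "reason": str(exc), "reprocess": True})
        return result

log = []
out = analyze_with_fallback(
    "tool-x",
    llm_analyze=lambda t: None,                               # simulated failure
    deterministic_analyze=lambda t: {"tool": t, "mode": "rules"},
    log=log,
)
print(out["mode"], log[0]["reprocess"])  # -> rules True
```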
In a preferred embodiment of the invention, a high-level stepwise method for LLM-orchestrated contextual deduplication of tools, as illustrated in FIG. 2, comprises the steps as follows:
- ingesting tools from MCP servers (1), thereby collecting tool data from modular control plane (MCP) servers, and
- standardizing the format and metadata of ingested tools by the tool ingestion and normalization module (1);
- using a large language model (LLM) to understand tool behavior patterns by the behavioral analyzer (LLM) (21),
- converting tool behavior into structured vector representations by behavior embedding using the behavior analyzer;
- detecting duplicates based on behavioral similarity by the contextual deduplication engine; and
- using a large language model (LLM) parallelly to identify unique output traits of each tool using the output signature extractor;
- generating unique hashes for each tool’s output for comparison, referred to as output hashing;
- detecting duplicates using exact output signatures by the deterministic deduplication engine (23);
- merging tool variants and resolving any inconsistencies using the fusion and conflict resolver (24);
- storing the final deduplicated and verified version of each tool using the canonical tool registry (CTR) (25);
- providing access to deduplicated tools via a user interface and APIs using the dashboard / API output (26).
In a next embodiment of the invention, the exemplified dual deduplication paths (23), as illustrated in FIG. 3, comprise the steps as follows:
- analyzing a tool (say Tool A) to extract behavior embedding and output hash,
- comparing behavior embedding by the contextual match engine with the existing tools.
- comparing output hash by the deterministic match engine with the known signatures.
- combining results from both engines by the fusion resolver to resolve duplicates or variants,
- classifying the tool as deduplicated or variant or flagging for manual inspection.
According to yet another preferred embodiment, the orchestration DAG (24), enabled by the workflow engine as illustrated in FIG. 4, comprises the steps as follows:
- initializing the tool processing workflow,
- standardizing tool metadata and structure, thereby normalizing the tool,
- analyzing behavior via LLM to understand tool behavior,
- extracting output signature via LLM thereby identifying unique output characteristics using a language model,
- generating an embedding that transforms behavior into vector format for comparison,
- generating output hash thereby creating a unique fingerprint of the tool's output,
- running contextual deduplication that compares tools based on behavior and output to find duplicates,
- updating canonical tool registry that saves the verified unique version of the tool,
- ending the workflow thereby marking completion of the deduplication and registration process.
According to an embodiment of the invention, the system offers several unique features such as the dual-pass deduplication configured for combining LLM-driven contextual analysis with deterministic rule-based filtering, that ensures high-precision identification of redundant or similar tool behaviors, enhancing clarity and reducing noise. The vector-based semantic comparison offered by the system enables nuanced differentiation of tools and outputs, even when superficial characteristics overlap. The intelligent middle-layer architecture allows for efficient, targeted updates by only retriggering affected clusters upon new data ingestion, ensuring minimal disruption and optimal performance. Additionally, the integrated feedback loop supports human-in-the-loop corrections, enabling ongoing refinement of both model logic and rule sets.
Furthermore, the aforementioned abilities make the system highly applicable across complex environments, where it can streamline enterprise DevOps infrastructures filled with overlapping scripts, optimize AI toolchains that use similar agent wrappers, manage container orchestration systems with redundant toolsets, and support security audit platforms in rationalizing tool inventories. In all cases, it improves operational efficiency, accuracy, and adaptability in dynamic and tool-rich ecosystems.
While considerable emphasis has been placed herein on the specific elements of the preferred embodiment, it will be appreciated that many alterations and modifications can be made in the preferred embodiment without departing from the principles of the invention. These and other changes in the preferred embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims: We claim,
1. A system and method for LLM-orchestrated contextual deduplication of tools across distributed modular control plane (MCP) servers;
wherein the system (10) comprises an ingestion and normalization module (1), a processing unit (2) further comprising a behavioural analysis engine (21), an output signature extractor (22), a dual-pass deduplication engine (23), an orchestration framework implemented as a directed acyclic graph (DAG) (24), a canonical tool registry (CTR) (25), and visual or API interfaces (26), and an output unit with a fault tolerance and recovery module (3); where all the structural and functional components work in co-ordination to employ a method for workflow orchestration;

characterized in that:
the method for LLM-orchestrated contextual deduplication of tools comprises the steps of:
- ingesting tools from MCP servers (1), thereby collecting tool data from modular control plane (MCP) servers, and
- standardizing the format and metadata of ingested tools by the tool ingestion and normalization module (1);
- using a large language model (LLM) to understand tool behavior patterns by the behavioral analyzer (LLM) (21),
- converting tool behavior into structured vector representations by behavior embedding using the behavior analyzer;
- detecting duplicates based on behavioral similarity by the contextual deduplication engine; and
- using a large language model (LLM) parallelly to identify unique output traits of each tool using the output signature extractor;
- generating unique hashes for each tool’s output for comparison, referred to as output hashing;
- detecting duplicates using exact output signatures by the deterministic deduplication engine (23);
- merging tool variants and resolving any inconsistencies using the fusion and conflict resolver (24);
- storing the final deduplicated and verified version of each tool using the canonical tool registry (CTR) (25);
- providing access to deduplicated tools via a user interface and APIs using the dashboard/API output (26).

2. The system as claimed in claim 1, wherein the tool ingestion and normalization module (1) is configured to fetch tools from MCP server inventories through secure API calls, normalize each tool by applying transformation including lowercasing, punctuation removal, alias matching, and prefix-stripping.

3. The system as claimed in claim 1, wherein the behaviour analysis engine (21) possesses a transformer-based LLM model acting as a core component that analyses tool descriptions, help documentation, and usage examples, wherein a custom prompt is formulated to extract behavioral aspects including a core purpose; inputs and expected parameters; and side-effects such as system changes, file writes, and API calls, thereby structuring the output as a JSON schema and embedding it into a vector using a semantic embedding model.

4. The system as claimed in claim 1, wherein the output signature extractor module (22) operates by leveraging an LLM prompt specifically designed to identify various forms of output, such as files generated, console output patterns, and return codes or logs, further processing and converting into deterministic hashes, which serve as unique signatures used for effective matching and conflict resolution.

5. The system and method as claimed in claim 1, wherein the dual-pass deduplication engine (23) forms a core component integrating two complementary analytical perspectives: the contextual path (23.1), which evaluates tools based on cosine similarity applied to their behavior and output vectors, followed by clustering techniques to group semantically similar tools; and the deterministic path (23.2), which performs a hash-based comparison of output signatures and applies rule-based identity checks, such as matching flag structures and output patterns;
and wherein the dual deduplication paths (23) follow the steps of;
- analyzing a tool (say Tool A) to extract behavior embedding and output hash,
- comparing behavior embedding by the contextual match engine with the existing tools.
- comparing output hash by the deterministic match engine with the known signatures.
- combining results from both engines by the fusion resolver to resolve duplicates or variants,
- classifying the tool as deduplicated or variant or flagging for manual inspection.

6. The system and method as claimed in claim 1, wherein the orchestration framework (24) is implemented as a directed acyclic graph (DAG) using various workflow tools enabled by a workflow engine comprising the steps of;
- initializing the tool processing workflow,
- standardizing tool metadata and structure, thereby normalizing the tool,
- analyzing behavior via LLM to understand tool behavior,
- extracting output signature via LLM thereby identifying unique output characteristics using a language model,
- generating an embedding that transforms behavior into vector format for comparison,
- generating output hash thereby creating a unique fingerprint of the tool's output,
- running contextual deduplication that compares tools based on behavior and output to find duplicates,
- updating canonical tool registry that saves the verified unique version of the tool,
- ending the workflow thereby marking completion of the deduplication and registration process.

7. The system as claimed in claim 1, wherein the canonical tool registry (CTR) (25) is a persistent store that maintains records of all unique tools; wherein each tool entry contains a canonical ID, a normalized name, a behavioral fingerprint, an output hash, and a comprehensive list of all matching or variant tools along with their corresponding MCP sources; such that the entries are indexed using embeddings, enabling real-time semantic search and efficient retrieval.

8. The system as claimed in claim 1, wherein the visual and API interfaces (26) are configured to enhance usability and data accessibility; such that the users can access deduplicated lists, comparison summaries, and detailed audit logs through APIs (26); and the dashboard provides visual representations, including graphs that illustrate duplicate clusters and variant timelines, facilitating easier analysis and tracking.

9. The system as claimed in claim 1, wherein the fault tolerance and recovery module (3) is configured such that, if a large language model (LLM) fails or produces incomplete results, an independent engine steps in to process the tool, and all such outcomes are logged along with a confidence score and flagged for reprocessing at a later stage to ensure accuracy and completeness.

Documents

Application Documents

# Name Date
1 202521068256-STATEMENT OF UNDERTAKING (FORM 3) [17-07-2025(online)].pdf 2025-07-17
2 202521068256-POWER OF AUTHORITY [17-07-2025(online)].pdf 2025-07-17
3 202521068256-FORM 1 [17-07-2025(online)].pdf 2025-07-17
4 202521068256-FIGURE OF ABSTRACT [17-07-2025(online)].pdf 2025-07-17
5 202521068256-DRAWINGS [17-07-2025(online)].pdf 2025-07-17
6 202521068256-DECLARATION OF INVENTORSHIP (FORM 5) [17-07-2025(online)].pdf 2025-07-17
7 202521068256-COMPLETE SPECIFICATION [17-07-2025(online)].pdf 2025-07-17
8 Abstract.jpg 2025-08-02
9 202521068256-FORM-9 [26-09-2025(online)].pdf 2025-09-26
10 202521068256-FORM 18 [01-10-2025(online)].pdf 2025-10-01