Abstract: A SYSTEM AND METHOD FOR PROCESSING UNSTRUCTURED DATA
A method for processing unstructured data comprises selecting (502) unstructured documents from connected repositories using document selection criteria. The method executes (504) parallel multi-modal processing on selected documents, simultaneously performing optical character recognition, visual element extraction, and semantic analysis using large language models to generate processed documents. Key-value pairs are created (508) from extracted entities and relationships. A knowledge graph is generated (510) by automatically transforming key-value pairs into structured interconnected representations without manual schema definition. Metadata is applied (512) through automatic tag inheritance from repositories and contextual tag generation. A decentralized approval workflow is initiated (514) that automatically identifies domain owners based on document content classification. The approved knowledge graph is published (516) as an adaptive data package to a searchable marketplace, becoming discoverable and accessible to authorized users. Figure 1
Description: FIELD OF THE INVENTION
[0001] The present invention relates generally to the field of information management systems. More specifically, it pertains to a system and method for automated transformation of unstructured documents into structured data.
BACKGROUND OF THE INVENTION:
[0002] Modern organizations are increasingly reliant on complex IT infrastructures composed of distributed systems, cloud environments, hybrid networks, and diverse endpoints. As these infrastructures grow in scale and complexity, so does the volume, velocity, and variety of operational data generated across monitoring tools, log files, alerts, and system notifications. A significant portion of this data exists in unstructured formats ranging from log messages and alert descriptions to incident reports and audit trails stored across disparate systems, repositories, and formats.
[0003] Enterprise data is mostly unstructured, often residing in PDFs, DOCX files, textual logs, and other free-form documents. This data contains critical insights necessary for maintaining uptime, diagnosing issues, and ensuring compliance. However, extracting actionable intelligence from this data is often manual, fragmented, and time-consuming. Traditional approaches require a series of disconnected tools including OCR systems, natural language processing engines, and rule-based alert correlation platforms, each demanding extensive configuration and domain expertise.
[0004] Current IT alert management systems typically rely on static rules, keyword matching, or simple pattern recognition that lack the ability to understand contextual relationships across events. This leads to high false positive rates, redundant tickets, and delayed responses. Furthermore, existing platforms seldom incorporate automated reasoning or entity relationship mapping from unstructured sources, preventing a holistic view of infrastructure events and their cascading impacts.
[0005] While some progress has been made with automated document transformation and analytics tools, these solutions often depend on pre-defined schemas or templates and are unable to dynamically adapt to evolving IT environments. This rigidity hinders their ability to process diverse alerts and incident descriptions that vary widely in language, structure, and origin.
[0006] The integration of human oversight remains another bottleneck. In critical environments, automation must be balanced with governance, auditability, and review capabilities. However, current systems lack seamless mechanisms to support human-in-the-loop interactions for reviewing, approving, or fine-tuning automated decisions in real time.
[0007] Knowledge graphs have emerged as a valuable method for modelling relationships between infrastructure components, incidents, and response actions. Yet existing graph generation tools typically require structured input data and manual mapping by data engineers. This creates a significant barrier to leveraging graph-based analytics for real-time alert correlation and decision-making in IT operations.
[0008] Therefore, there is a need for a document transformation and adaptive data package creation system which overcomes the challenges of existing technology. Further, there exists a need for advanced technology capable of processing large volumes of heterogeneous unstructured documents, extracting entities with contextual awareness, and providing adaptive knowledge graph generation, all within a unified platform that supports seamless human-in-the-loop interaction. Such technology would enable business teams to respond more efficiently to information needs, ultimately improving organizational knowledge management and minimizing operational inefficiencies.
OBJECTS OF THE INVENTION:
[0009] An object of the present invention is to provide a system for processing unstructured documents and assisting business teams in efficiently managing and transforming such documents into reusable adaptive data packages.
[0010] Another object of the present invention is to extract entities, relationships, and contextual information from unstructured documents using artificial intelligence techniques, including large language models and multi-modal processing.
[0011] Another object of the present invention is to automatically generate knowledge graphs by dynamically identifying entity relationships based on document context, extracted metadata, and semantic analysis without requiring predefined schemas.
[0012] Another object of the present invention is to present a unified interface that enables business users to review extracted information and selectively approve adaptive data packages through integrated governance workflows.
[0013] Another object of the invention is to reduce the manual effort and response time associated with document processing and adaptive data package creation by integrating extraction, transformation, governance, and marketplace publishing into a cohesive platform.
SUMMARY OF THE INVENTION:
[0014] The present invention provides a system and method for transforming unstructured documents into knowledge graph-based adaptive data packages through an integrated processing and publishing platform. The method includes selecting unstructured documents from one or more connected repositories using document selection criteria. The method further includes executing parallel multi-modal processing on the selected unstructured documents to generate processed documents. The parallel multi-modal processing simultaneously performs optical character recognition, visual element extraction, and semantic analysis by employing one or more large language models. Entities and relationships are extracted from the processed documents using the large language models to identify people, locations, dates, keywords, and domain-specific terms. The extracted entities and relationships between the entities are then used to create key-value pairs. The key-value pairs are automatically transformed into structured interconnected representations to generate a knowledge graph without the need for manual schema definition. The knowledge graph represents the relationships between the entities. Metadata is applied to the knowledge graph by automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis. A decentralized approval workflow is initiated to automatically identify domain owners based on document content classification. The knowledge graph is published, after approval, as an adaptive data package to a searchable marketplace, wherein the adaptive data package becomes discoverable and accessible to authorized users.
[0015] In one implementation, the parallel multi-modal processing includes an optical character recognition pipeline that extracts textual content while preserving document structure, formatting, and table layouts. The processing also includes a visual processing pipeline that identifies and interprets charts, graphs, diagrams, and images. Additionally, a semantic analysis pipeline is employed to identify entities and extract the relationships between the entities using natural language processing.
[0016] In one implementation, the method includes validating the selected documents for format compatibility, file integrity, and processing feasibility before executing the parallel multi-modal processing.
[0017] In one implementation, the generation of the knowledge graph involves performing entity resolution to identify and merge duplicate entities across document sections to create resolved entities. The method transforms the resolved entities and the relationships between the entities into the key-value pairs. The method further establishes weighted connections between the resolved entities with confidence scoring, creates dynamic schemas that automatically adapt to document content patterns, and maintains entity provenance to track a source and an extraction method for each graph element.
[0018] In one implementation, the decentralized approval workflow routes approval requests through parallel or sequential workflows based on governance requirements, and maintains immutable audit trails of all approval decisions.
[0019] In one implementation, publishing to the marketplace includes providing version control with complete change history, generating integration endpoints for business intelligence platforms, and tracking usage analytics. In one implementation, the method incorporates user feedback during processing to improve extraction accuracy through machine learning model refinement.
[0020] In one implementation, the entities and relationships between the entities are extracted using transformer-based large language models trained on domain-specific documents. In one implementation, the knowledge graph generated from the key-value pairs is stored in a graph database configured to support query operations and relationship traversal.
BRIEF DESCRIPTION OF DRAWINGS:
[0021] Fig. 1 illustrates system architecture for processing unstructured data, in accordance with an implementation of the present invention.
[0022] Fig. 2 illustrates a block diagram showing different components of a system for processing unstructured data, in accordance with an implementation of the present invention.
[0023] Fig. 3 illustrates a three-step workflow for processing unstructured data, in accordance with an implementation of the present invention.
[0024] Fig. 4 illustrates parallel processing architecture, in accordance with an implementation of the present invention.
[0025] Fig. 5 illustrates a flowchart showing a method of creating adaptive data packages from unstructured documents, in accordance with an implementation of the present invention.
DETAILED DESCRIPTION OF DRAWINGS:
[0026] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as not to unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0027] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words "enabling", "establishing", "attaching" and other forms thereof, are intended to be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. The terms "comprises," "comprising," "has," "having," "includes" and/or "including" as used herein, specify the presence of stated features, elements, and/or components and the like, but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof. The term "an embodiment" is to be read as "at least one embodiment." The term "another embodiment" is to be read as "at least one other embodiment." Although any system and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary system and methods are now described.
[0028] The disclosed embodiments are merely examples of the disclosure, which may be embodied in various forms. Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure is not intended to be limited to the embodiments described but is to be accorded the widest scope consistent with the principles and features described herein.
[0029] Fig. 1 illustrates a system (102) for processing unstructured data comprising multiple interconnected components, including a memory (104), a processor (106), and a communication module (108), which operate collectively to transform raw documents into analytics-ready adaptive data packages. An adaptive data package is a reusable, self-contained data asset that integrates real-time data with query and access control mechanisms, enabling authorized users to independently discover, understand, and utilize data effectively while maintaining compliance with organizational policies and standards. The adaptive data package can be used for data extraction or other data-specific operations. Adaptive data packages constitute a specialized implementation of a data product.
[0030] The system (102) includes a processing gateway (114) configured to coordinate data flow and processing operations. The communication module (108) connects through a network (112) to one or more repositories (110-1), (110-2), through (110-n). The repositories include, but are not limited to, AWS S3, Azure Blob Storage, Google Cloud Storage, or on-premises file systems for accessing unstructured documents. The system (102) may also interface with a non-transitory computer-readable storage medium (116) that stores program instructions for executing the data processing operations. The processing gateway (114) implements secure connections through authentication and encryption protocols.
[0031] Fig. 2 illustrates the system (102) architecture showing an interface (200), the processor (106), and the memory (104) containing the program instructions. The memory (104) stores program instructions for document selection (202), program instructions for parallel multi-modal processing (204), program instructions for entity and relationship extraction (206), program instructions for key-value pair creation (208), program instructions for knowledge graph generation (210), program instructions for metadata management (212), program instructions for approval workflow (214) and program instructions for marketplace publishing (216).
[0032] The program instructions for document selection (202) may cause the processor (106) to select unstructured documents from one or more connected repositories based on document selection criteria. The document selection criteria may include file type compatibility, date ranges, content categories, and user-defined parameters to identify relevant unstructured documents across multiple repository sources.
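By way of non-limiting illustration, the document selection criteria might be applied as a simple predicate filter over a repository listing. The Python sketch below is an assumption of this description; the SelectionCriteria fields and the (path, modified date, category) listing format are hypothetical, not disclosed implementation details:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set
import os

@dataclass
class SelectionCriteria:
    # Hypothetical fields mirroring the criteria of paragraph [0032].
    file_types: Set[str] = field(default_factory=lambda: {".pdf", ".docx"})
    date_from: Optional[date] = None
    date_to: Optional[date] = None
    categories: Set[str] = field(default_factory=set)

def select_documents(listing, criteria):
    """listing: iterable of (path, modified_date, category) tuples as a
    repository connector might return them; returns the matching paths."""
    selected = []
    for path, modified, category in listing:
        if os.path.splitext(path)[1].lower() not in criteria.file_types:
            continue  # file type compatibility
        if criteria.date_from and modified < criteria.date_from:
            continue  # outside the requested date range
        if criteria.date_to and modified > criteria.date_to:
            continue
        if criteria.categories and category not in criteria.categories:
            continue  # content category filter
        selected.append(path)
    return selected
```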
[0033] The program instructions for parallel multi-modal processing (204) may cause the processor (106) to simultaneously execute OCR, visual, and semantic analysis pipelines on the unstructured documents selected using the document selection criteria to generate processed documents. The parallel execution enables concurrent processing where optical character recognition extracts textual content, visual element extraction interprets graphical components, and semantic analysis using one or more large language models identifies contextual meanings within the documents.
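A minimal sketch of this concurrency pattern, assuming Python's standard concurrent.futures module; the three pipeline functions are hypothetical placeholders for the OCR, visual, and semantic stages rather than the disclosed implementations:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pipeline(doc):
    # Placeholder: a real implementation would run OCR with layout analysis.
    return {"text": f"extracted text of {doc}"}

def visual_pipeline(doc):
    # Placeholder: chart, diagram, and image interpretation.
    return {"visuals": []}

def semantic_pipeline(doc):
    # Placeholder: LLM-driven entity and relationship extraction.
    return {"entities": [], "relationships": []}

def process_document(doc):
    """Run the three pipelines concurrently on the same document
    and collect their outputs for downstream synchronization."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "ocr": pool.submit(ocr_pipeline, doc),
            "visual": pool.submit(visual_pipeline, doc),
            "semantic": pool.submit(semantic_pipeline, doc),
        }
        return {name: fut.result() for name, fut in futures.items()}
```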
[0034] The program instructions for entity and relationship extraction (206) may cause the processor (106) to process outputs from all pipelines to identify people, locations, dates, keywords, and domain-specific terms along with relationships between the entities from the processed documents. The extraction utilizes the one or more large language models to analyze the processed documents and identify both explicit and implicit relationships between the detected entities.
[0035] The program instructions for key-value pair creation (208) may cause the processor (106) to create key-value pairs from the extracted entities and the relationships between the entities, transforming them into structured formats suitable for graph construction. The transformation process converts unstructured entity-relationship data into normalized key-value pairs that serve as building blocks for subsequent knowledge graph generation.
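The normalization step might, for example, flatten each entity attribute and each relationship into one key-value pair. The sketch below assumes hypothetical entity and relationship dictionaries and an illustrative key syntax; neither is prescribed by this description:

```python
def to_key_value_pairs(entities, relationships):
    """Flatten extracted entities and relationships into normalized
    key-value pairs that serve as graph building blocks."""
    pairs = []
    for ent in entities:        # e.g. {"id": "e1", "type": "Person", "text": "Jane Doe"}
        pairs.append((f"entity:{ent['id']}:type", ent["type"]))
        pairs.append((f"entity:{ent['id']}:text", ent["text"]))
    for rel in relationships:   # e.g. {"source": "e1", "label": "works_for", "target": "e2"}
        pairs.append((f"rel:{rel['source']}:{rel['label']}", rel["target"]))
    return pairs

example = to_key_value_pairs(
    [{"id": "e1", "type": "Person", "text": "Jane Doe"},
     {"id": "e2", "type": "Organization", "text": "Company ABC"}],
    [{"source": "e1", "label": "works_for", "target": "e2"}],
)
```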
[0036] The program instructions for knowledge graph generation (210) may cause the processor (106) to automatically transform the key-value pairs into structured interconnected representations without requiring manual schema definition, wherein the knowledge graph represents interconnected relationships between the entities. The automatic transformation adapts dynamically to the content patterns within the key-value pairs, creating a flexible graph structure that accurately represents the relationships identified in the source documents.
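Continuing the illustrative key syntax from the preceding sketch, a schema-free graph can be assembled by deriving nodes, attributes, and labeled edges directly from the key-value pairs as they are encountered. The sketch assumes the open-source networkx library; a production system might use a dedicated graph database instead:

```python
import networkx as nx

def build_knowledge_graph(pairs):
    """Derive nodes, node attributes, and labeled edges directly from the
    key-value pairs; no schema is declared ahead of time."""
    graph = nx.DiGraph()
    for key, value in pairs:
        kind, ident, rest = key.split(":", 2)
        if kind == "entity":
            graph.add_node(ident)
            graph.nodes[ident][rest] = value  # attribute name comes from the data
        elif kind == "rel":
            graph.add_edge(ident, value, label=rest)
    return graph
```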
[0037] The program instructions for metadata management (212) may cause the processor (106) to apply comprehensive metadata to the knowledge graph including automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis. The metadata application enriches the knowledge graph with both source-derived attributes and system-generated classifications that enhance discoverability and usability.
[0038] The program instructions for approval workflow (214) may cause the processor (106) to initiate a decentralized approval workflow that automatically identifies domain owners based on document content classification. The workflow analyzes the knowledge graph content to determine appropriate approval authorities and routes approval requests according to organizational governance policies.
[0039] The program instructions for marketplace publishing (216) may cause the processor (106) to publish the knowledge graph, which is approved, as an adaptive data package to a searchable marketplace wherein the adaptive data package becomes discoverable and accessible to authorized users. The publishing process transforms the approved knowledge graph into a reusable adaptive data package with defined access controls, versioning, and integration endpoints for consumption by business intelligence and analytics platforms.
[0040] Fig. 3 illustrates an exemplary embodiment of the three-step workflow for unstructured adaptive data package creation, in accordance with an implementation of the present invention. Step 1, Source and File Selection (302), represents the initial phase where users interact with the system (102) to identify and select documents for processing. The connected repositories (304) interface provides access to multiple data sources including repository 1, repository 2, cloud storage, and local files, which are connected to the system through the network (112) as illustrated in Fig. 1. A file explorer (308) retrieves and displays available documents from the connected repositories (304), presenting them in a navigable interface that allows users to browse document hierarchies and preview file contents.
[0041] Selection controls (310) enable users to manage their document selection through Select All functionality for bulk processing, while filters (312) narrow the selection based on document type (which may include PDF, DOCX) and date range criteria. Further, search (314) capability queries document metadata and content to locate specific files. After the user has selected the documents to process from the available files in the repositories, the validation module (316) automatically examines each selected document, performing three critical checks: format verification to ensure compatibility with the processing pipelines, size validation to confirm processing feasibility, and integrity checking to detect corrupted or incomplete files. Subsequently, results (318) aggregate the validation outcomes, displaying "3 files ready" to indicate successful validation, after which the validated file list is passed to step 2 for further processing.
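By way of example only, the three validation checks might be approximated as follows; the supported formats, size limit, and magic-byte signatures are illustrative assumptions (DOCX files are ZIP containers, hence the PK header):

```python
import os

SUPPORTED = {".pdf": b"%PDF-", ".docx": b"PK\x03\x04"}
MAX_BYTES = 200 * 1024 * 1024  # assumed processing-feasibility limit

def validate(path):
    """Return (ok, reason) after format, size, and integrity checks."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:                   # format verification
        return False, "unsupported format"
    size = os.path.getsize(path)
    if size == 0 or size > MAX_BYTES:          # size validation
        return False, "size out of range"
    with open(path, "rb") as fh:               # integrity check via magic bytes
        head = fh.read(len(SUPPORTED[ext]))
    if head != SUPPORTED[ext]:
        return False, "corrupted or mismatched file header"
    return True, "ready"
```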
[0042] At step 2, detailing (320) receives the validated file list from step 1 and presents input forms (322) for users to enrich the adaptive data package with descriptive information. Further, name field (324) captures a user-defined identifier for the adaptive data package, while description field (326) allows detailed explanation of the content and intended use. Concurrently, the system analyzes the selected documents' content to power auto-suggestions module (328), which generates contextual recommendations such as "finance," "contracts," and "Q4-2024" based on preliminary entity extraction and keyword analysis performed on the validated documents.
[0043] Business Context module (330) captures structured metadata including target audience (finance team), update frequency (monthly), and related adaptive data packages (including sales data), establishing relationships with existing adaptive data packages in the marketplace. Category selector (332) enables classification into predefined taxonomies such as financial documents, legal contracts, or technical specifications, each of which will later determine the appropriate domain owners in the decentralized approval workflow. Enriched metadata is combined with the document processing instructions and forwarded to step 3 after capturing all detailing information.
[0044] At step 3, Preview and Approval (334) receives the enriched document package from step 2 along with the results of parallel processing executed by the pipelines. Preview panel (336) presents multiple synchronized views of the processed output: graph visualization (338) displays the knowledge graph structure generated by the knowledge graph generation module (210), while the entities panel (340) lists extracted entities including Company ABC, individual names, monetary values ($1.5M), and temporal references (Q4 2024) derived from the entity extraction performed by semantic pipeline.
[0045] Metadata tags module (342) displays both inherited tags from the connected repositories and newly generated contextual tags ("financial," "contract," "2024") produced by the metadata management module (212). Further, quality metrics (344) calculate and display confidence scores (92%) based on the agreement between the three processing pipelines, and completeness indicators (100%) showing the percentage of document content successfully processed. The quality metrics (344) are derived from the synchronization module's conflict resolution and cross-referencing operations.
[0046] Submission of the approval request through the approval button (346), after user review and confirmation, triggers the decentralized approval workflow module (214), which analyzes the document classification from step 2 to automatically identify appropriate domain owners. Further, the status panel (348) dynamically updates to show "Pending" and lists the identified approvers (Domain Owner 1, Domain Owner 2) along with their approval status and compliance requirements. The approval request, along with the preview data and metadata, is then routed through the decentralized governance workflow, leading to marketplace publication upon successful approval.
[0047] Fig. 4 illustrates a parallel processing architecture, in accordance with an implementation of the present invention. A document input module (402) receives documents for processing and provides the documents to a document splitter (404). The document splitter (404) with context markers (406) segments the documents while maintaining context and creates three synchronized copies of each document segment, simultaneously routing them to three parallel pipelines for concurrent processing.
[0048] OCR Pipeline (408) receives the first copy and processes it through a Text extraction engine (410), Structure Analyzer (412), Table Extractor (414), and Formatting Preserver (416). The OCR pipeline (408) produces Structured text output (418) with pages and tables preserved, which is then routed to the synchronization module (444).
[0049] Visual pipeline (420) receives the second copy from the document splitter (404) and processes it simultaneously through an Image Detector (422), Chart Interpreter (424), Diagram Analyzer (426), and Caption Generator (428). The visual pipeline generates Visual data output (430) containing images and graphs, which is also routed to the synchronization module (444).
[0050] Semantic pipeline (432) receives the third copy from the document splitter (404) and employs an entity recognizer (434), relationship extractor (436), context analyzer (438), and theme identifier (440) to process the document content concurrently with the other pipelines. The semantic pipeline (432) produces semantic annotations output (442) with entities and relationships between the entities, which is likewise provided to the synchronization module (444).
[0051] Synchronization module (444) receives and coordinates all three pipeline outputs, including the Structured text output (418) from the OCR pipeline, the Visual data output (430) from the visual pipeline, and the Semantic annotations output (442) from the semantic pipeline. The synchronization module (444) processes these through a Temporal Aligner (446) to ensure temporal consistency across all three pipeline outputs. A Conflict Resolver (448) reconciles any contradictions between pipeline outputs, and a Cross-reference Creator (450) establishes connections between textual, visual, and semantic elements. Coordinated processing results in a Unified output (452) that combines all processed information into a single, coherent representation of the document, which is then forwarded to subsequent processing modules for key-value pair creation and knowledge graph generation.
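A minimal sketch of the reconciliation performed by the Conflict Resolver (448), assuming each pipeline reports candidate values with confidence scores; the data shape is a hypothetical assumption of this description:

```python
def resolve_conflicts(candidates):
    """candidates: {field: [(value, confidence, pipeline_name), ...]}.
    Keep the highest-confidence value per field and record provenance,
    providing a cross-reference back to the contributing pipeline."""
    unified = {}
    for fld, observations in candidates.items():
        value, conf, source = max(observations, key=lambda obs: obs[1])
        unified[fld] = {"value": value, "confidence": conf, "source": source}
    return unified

unified = resolve_conflicts({
    "contract_value": [("$1.5M", 0.92, "ocr"), ("$1.5 million", 0.88, "semantic")],
})
```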
[0052] Fig. 5 illustrates a method (500) for creating adaptive data packages from unstructured documents comprising multiple coordinated steps. At step 502, the system (102) selects unstructured documents from connected repositories based on user-defined criteria. The unstructured documents can include, but are not limited to, portable document format files and word processing document files. At step 504, the system (102) executes parallel multi-modal processing, including but not limited to the OCR, visual, and semantic pipelines running simultaneously on the selected documents, to generate processed documents. At step 506, the system (102) extracts entities and relationships between the entities using large language models to identify people, locations, dates, keywords, and domain-specific terms. At step 508, the system (102) creates key-value pairs from the extracted entities and the relationships between the entities for structured representation.
[0053] At step 510, the system (102) generates a knowledge graph by transforming the key-value pairs into structured interconnected representations without manual schema definition. The knowledge graph represents the relationships between the entities. At step 512, the system (102) applies metadata to the knowledge graph, including automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis. At step 514, the system (102) initiates a decentralized approval workflow that automatically identifies domain owners based on document content classification. At step 516, the system (102) publishes the knowledge graph, after approval, as an adaptive data package to a searchable marketplace where the adaptive data package becomes discoverable and accessible to authorized users.
[0054] The knowledge graph generation process automatically transforms the key-value pairs into interconnected graph structures without requiring manual intervention. The system (102) performs entity resolution to identify and merge duplicate entities across document sections to create resolved entities, maintaining complete provenance tracking.
[0055] The entity resolution employs multiple techniques including fuzzy string matching for name variations, contextual similarity analysis using embedding vectors, and temporal proximity for date-related entities. The resolved entities maintain references to all original occurrences, preserving provenance while eliminating redundancy.
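For illustration, the fuzzy string matching component of entity resolution could be approximated with Python's standard difflib; the greedy single-pass clustering and the 0.85 threshold are assumptions, and a production system would add the embedding-based and temporal techniques described above:

```python
from difflib import SequenceMatcher

def resolve_entities(mentions, threshold=0.85):
    """Greedy clustering by fuzzy name similarity. Each resolved entity
    keeps every original occurrence, preserving provenance."""
    resolved = []
    for mention in mentions:  # mention: {"text": ..., "section": ...}
        name = mention["text"].lower().strip()
        for entity in resolved:
            if SequenceMatcher(None, name, entity["canonical"]).ratio() >= threshold:
                entity["occurrences"].append(mention)
                break
        else:
            resolved.append({"canonical": name, "occurrences": [mention]})
    return resolved
```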
[0056] The relationship extraction identifies connections between the entities using multiple evidence sources. The system (102) extracts direct relationships from explicit statements in documents. Indirect relationships are inferred through co-occurrence patterns and shared attributes. The system (102) assigns confidence scores to each relationship based on evidence strength and consistency across document sections.
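A worked sketch of confidence scoring that blends the evidence sources named above; the weights and saturation points are illustrative assumptions, not disclosed values:

```python
def relationship_confidence(direct_mentions, cooccurrences,
                            agreeing_sections, total_sections):
    """Blend direct statements, co-occurrence evidence, and cross-section
    consistency into a single 0..1 score."""
    direct = min(1.0, direct_mentions / 3.0)    # explicit statements saturate at 3
    indirect = min(1.0, cooccurrences / 10.0)   # co-occurrence patterns
    consistency = agreeing_sections / max(1, total_sections)
    return round(0.6 * direct + 0.2 * indirect + 0.2 * consistency, 3)
```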
[0057] The graph construction builds the knowledge graph using the resolved entities as nodes and the extracted relationships between the entities as edges. The system (102) implements dynamic schema generation that adapts to document content patterns rather than requiring predefined structures. Node properties are populated from entity attributes, while edge properties capture relationship types, strengths, and directional information.
[0058] The graph optimization process enhances the generated graph for analytical use and integration with business intelligence platforms. The graph optimization process includes community detection to identify entity clusters, centrality analysis to determine important entities, and path optimization for efficient traversal. The system (102) also performs graph embedding to enable machine learning applications on the knowledge graph generated from the key-value pairs and stored in a graph database configured to support query operations and relationship traversal.
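Community detection and centrality analysis of this kind are available off the shelf; a minimal sketch using the open-source networkx library (an assumption of this description, not a disclosed dependency) follows:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def optimize_graph(graph):
    """Identify entity clusters and rank important entities. Modularity
    clustering operates on an undirected view of the knowledge graph."""
    communities = [set(c) for c in greedy_modularity_communities(graph.to_undirected())]
    centrality = nx.degree_centrality(graph)
    important = sorted(centrality, key=centrality.get, reverse=True)[:10]
    return {"communities": communities, "important_entities": important}
```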
[0059] A quality assurance process validates the generated knowledge graph against quality metrics. The quality metrics include completeness, which represents the percentage of document content captured; consistency, which verifies the absence of logical contradictions; and connectivity, which measures the degree of entity interconnection. Knowledge graphs failing quality thresholds are flagged for review with specific remediation suggestions.
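Two of the three metrics lend themselves to direct computation; the sketch below (assuming networkx and a segment-level notion of coverage) computes completeness and connectivity, while consistency checking would require domain-specific contradiction rules and is omitted:

```python
import networkx as nx

def quality_metrics(graph, total_segments, captured_segments):
    """Completeness: share of document segments represented in the graph.
    Connectivity: share of nodes in the largest connected component."""
    completeness = captured_segments / max(1, total_segments)
    if graph.number_of_nodes():
        largest = max(nx.connected_components(graph.to_undirected()), key=len)
        connectivity = len(largest) / graph.number_of_nodes()
    else:
        connectivity = 0.0
    return {"completeness": completeness, "connectivity": connectivity}
```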
[0060] The decentralized approval workflow implements governance controls while maintaining workflow efficiency. The system (102) automatically determines approval requirements based on document content and organizational policies. A content classifier analyses document content to determine data sensitivity, regulatory requirements, and business criticality. Content classification uses multiple signals including keyword detection for sensitive terms, pattern matching for regulated data types (personally identifiable information, financial data, health information), and contextual analysis for business impact assessment.
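A minimal sketch of this signal-based classification, using regular-expression pattern matching for regulated data types and keyword detection for sensitive terms; the patterns, term list, and two-level sensitivity scale are illustrative assumptions:

```python
import re

PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "national_id": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "payment_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}
SENSITIVE_TERMS = {"confidential", "salary", "diagnosis"}

def classify(text):
    """Combine pattern matches and keyword hits into a sensitivity label."""
    signals = {name: bool(p.search(text)) for name, p in PATTERNS.items()}
    signals["sensitive_terms"] = any(t in text.lower() for t in SENSITIVE_TERMS)
    sensitivity = "restricted" if any(signals.values()) else "internal"
    return {"signals": signals, "sensitivity": sensitivity}
```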
[0061] Domain owner identification maps classified content to relevant approval authorities. The system (102) maintains a registry of domain owners organized by business function, data category, and geographic region. For documents spanning multiple domains, the system (102) determines, based on governance requirements, whether parallel approval across all domains simultaneously or sequential approval ordered by priority is required.
[0062] A workflow orchestration engine manages the decentralized approval process from initiation to completion. The workflow orchestration engine sends notifications through multiple channels (email, in-app, mobile), tracks approval status with configurable escalation rules for delayed approvals, and enforces service level agreements for approval response times. The workflow orchestration engine also handles approval delegation when primary approvers are unavailable, implementing the role-based access controls.
[0063] An audit trail creates immutable records of all decentralized approval activities. Each record includes a timestamp, approver identity, decision rationale, and any attached comments or conditions. The system (102) implements cryptographic signing to ensure non-repudiation and maintains compliance with regulatory retention requirements, supporting the compliance reporting capabilities.
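One way to approximate an immutable, signed audit trail is hash chaining with a keyed signature, sketched below; the sketch uses a symmetric HMAC for brevity, whereas the non-repudiation described above would in practice use asymmetric signatures with keys held per approver:

```python
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-a-managed-key"  # assumption: issued by a key service

def append_record(trail, approver, decision, rationale=""):
    """Append a tamper-evident record: each entry embeds the hash of its
    predecessor, so altering any record breaks every later hash."""
    prev_hash = trail[-1]["record_hash"] if trail else "0" * 64
    record = {
        "timestamp": time.time(),
        "approver": approver,
        "decision": decision,
        "rationale": rationale,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    record["signature"] = hmac.new(SIGNING_KEY, payload, "sha256").hexdigest()
    trail.append(record)
    return record
```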
[0064] Marketplace publishing enables approved adaptive data packages to become discoverable and accessible across the organization. Published adaptive data packages retain their full knowledge graph structure while providing multiple access methods for different user needs. The publishing process generates multiple representations of each adaptive data package. The native graph format preserves all relationships between the entities and properties for advanced analytics. Tabular extracts provide simplified views for traditional business intelligence tools. API endpoints enable programmatic access for application integration supporting real-time data synchronization with external platforms. Documentation automatically generated from metadata helps users understand and utilize the adaptive data package effectively.
[0065] The marketplace discovery interface implements search capabilities including full-text search, faceted filtering, and recommendations based on user activity patterns. Full-text search across all adaptive data package content and metadata enables quick location of relevant information. The marketplace discovery interface provides usage analytics showing which adaptive data packages are most valuable to the organization, with version control and complete change history maintained for all published adaptive data packages.
[0066] The system (102) is implemented using a microservices architecture that ensures scalability and maintainability. Each major component, including ingestion, processing, key-value pair creation, graph generation, approval, and publishing, operates as an independent service communicating through message queues. The microservices architecture enables horizontal scaling of processor-intensive operations while maintaining system coherence and supporting the three-step workflow that maintains state throughout the process.
[0067] Parallel processing pipelines utilize containerized execution environments that can be dynamically provisioned based on workload. GPU acceleration is employed for computer vision and large language model operations, including transformer-based large language models trained on domain-specific documents for extracting the entities and the relationships between the entities. The system (102) implements intelligent caching to avoid reprocessing common document patterns and incorporates continuous learning by incorporating user corrections and feedback to improve future processing accuracy.
[0068] Security is implemented through multiple layers including encryption at rest and in transit as specified in the authentication and encryption protocols, role-based access controls with fine-grained permissions, audit logging of all data access and modifications, and data loss prevention scanning before marketplace publication. The entire process from document selection to marketplace publishing is accessible to users without requiring technical expertise or programming knowledge, implementing the non-transitory computer-readable storage medium (116) with machine readable instructions that, when executed by the processor (106), cause the processor to perform all specified operations.
[0069] An advantage of the present invention is that the parallel multi-modal processing architecture disclosed in Fig. 4 enables simultaneous execution of OCR pipeline (408), visual pipeline (420), and semantic pipeline (432) to generate processed documents, significantly reducing processing time compared to sequential processing. The synchronization module (444) preserves contextual relationships between textual, visual, and semantic elements that would be lost in traditional sequential processing approaches, while the cross-reference creator (450) ensures accuracy enhancement when extracting entities and relationships between the entities.
[0070] Another advantage of the present invention is that the automated key-value pair creation and knowledge graph generation, as implemented through the program instructions for key-value pair creation (208) and knowledge graph generation (210) stored in memory (104), eliminates the need for manual schema definition and entity mapping. The system's transformation process occurs in two stages. The first stage creates key-value pairs from extracted entities and the relationships between those entities. The second stage transforms these key-value pairs into structured, interconnected representations. This two-stage transformation approach enables dynamic schema adaptation, allowing the knowledge graph to accurately represent diverse document content without forcing data into predefined structures. The dynamic schema adaptation proves particularly valuable for processing unstructured documents, including portable document format files and word processing document files with varying formats.
[0071] Yet another advantage of the present invention is that the three-step workflow illustrated in Fig. 3 democratizes adaptive data package creation by enabling users without technical expertise or programming knowledge to process complex documents. The intuitive interface (200) with features like auto-suggestions (328), validation (316), and preview panel (336) reduces training requirements while the workflow's state maintenance capability ensures users can navigate between steps without losing progress.
[0072] A further advantage of the present invention is that the decentralized governance model implemented through program instructions for decentralized approval workflow (214) ensures compliance without introducing friction. The automatic domain owner identification based on document content classification eliminates manual routing decisions, while the immutable audit trails and role-based access controls with escalation mechanisms provide complete transparency for regulatory requirements.
[0073] Another advantage of the present invention is that the marketplace publishing capability, executed through program instructions for marketplace publishing (216), transforms isolated document processing into an organizational knowledge platform. The marketplace published adaptive data packages comprising the knowledge graph generated from the key-value pairs become discoverable through full-text search, faceted filtering, and recommendations based on user activity patterns, while integration endpoints support real-time data synchronization with business intelligence platforms, eliminating redundant processing across the organization.
[0074] An additional advantage is that the system (102) incorporating memory (104), processor (106), and non-transitory computer-readable storage medium (116) provides a complete end-to-end solution from document ingestion through connected repositories (110-1) to (110-n) to knowledge graph generation from key-value pairs to marketplace publishing, all accessible through a single unified platform rather than requiring multiple disparate tools, thereby reducing complexity and total cost of ownership.
[0075] An implementation of the disclosure may be an article of manufacture in which a machine-readable medium (such as microelectronic memory) has stored thereon instructions which program one or more data processing components (generically referred to here as a "processor") to perform the operations described above. In other implementations, some of these operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmed data processing components and fixed hardwired circuit components.
[0076] A non-transitory computer-readable storage medium includes program instructions to implement various operations embodied by a computing device such as a server, desktop, or cloud-based processing system. The medium may also include, alone or in combination with the program instructions, data files, data structures, and the like. The medium and program instructions may be those specially designed and constructed for the purposes, or they may be of the kind well-known and available to those having skill in the computer software arts.
[0077] Examples of non-transitory computer-readable storage medium include magnetic media such as hard drives, optical media such as compact disc read-only memory disks, semiconductor memories such as flash memory and random access memory, and cloud-based storage systems. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
[0078] The described hardware devices may be configured to act as one or more software modules in order to perform the operations of the above-described implementations. The term "software" as used herein is intended to encompass such instructions stored in storage medium such as random access memory, a hard disk, optical disk, or so forth, and is also intended to encompass so-called "firmware" that is software stored on a read-only memory or so forth. Such software may be organized in various ways, and may include software components organized as libraries, internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth.
[0079] Any combination of the above features and functionalities may be used in accordance with one or more implementations. In the foregoing specification, implementations have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
[0080] An interface may be used to provide input or fetch output from the server. The interface may be implemented as a web-based graphical user interface providing a three-step workflow to create data products from unstructured documents, command line interface, or application programming interface. Further, representational state transfer interfaces may be used for programmatic interaction with the system.
[0081] A processor (106) may include one or more general purpose processors, graphics processing units for parallel computation, tensor processing units for machine learning operations, or specialized processors for specific tasks. Further, the processor may implement knowledge graph generation. The processor may be provisioned in on-premises data centres, cloud computing environments, or hybrid infrastructures combining both deployment models.
[0082] A memory (104) may include, but is not limited to, one or more non-transitory machine-readable storage devices such as solid-state drives, hard disk drives, optical storage media, or semiconductor memories including random access memory, read-only memory, and flash memory. In cloud deployments, memory may include object storage services, distributed file systems, or managed database services. The memory systems may implement redundancy, replication, and backup mechanisms to ensure data durability and availability of unstructured documents and processed data products.
[0083] The terms "or" and "and/or" as used herein are to be interpreted as inclusive or meaning any one or any combination. Therefore, "A, B or C" or "A, B and/or C" mean "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
[0084] The system (102) may be deployed in various architectural configurations including monolithic applications, microservices architectures, serverless computing models, or containerized deployments. Scalability may be achieved through horizontal scaling across multiple instances, vertical scaling of individual components, or elastic scaling based on processing demand. Load balancing mechanisms distribute processing across available resources to optimize performance and reliability.
[0085] Integration with external systems is facilitated through standardized protocols and interfaces. The system (102) supports batch processing for large-scale document ingestion, real-time processing for immediate adaptive data package creation, and hybrid modes combining both approaches. Event-driven architectures enable reactive processing based on document arrival or system events. Message queuing systems provide reliable communication between system components and external services.
[0086] Monitoring and observability features track system performance, processing metrics, and operational health. Logging frameworks capture detailed execution traces for debugging and audit purposes. Alerting mechanisms notify administrators of exceptional conditions or processing failures. Performance analytics provide insights into processing efficiency, resource utilization, and optimization opportunities.
[0087] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily configure and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
Claims: CLAIMS
I/We Claim:
1. A method for processing unstructured data, the method comprising:
selecting (502), using document selection criteria, unstructured documents from one or more connected repositories;
executing (504) parallel multi-modal processing on the unstructured documents selected using the document selection criteria for generating processed documents, wherein the multi-modal processing simultaneously performs optical character recognition, visual element extraction, and semantic analysis using one or more large language models;
extracting (506) entities including people, locations, dates, keywords, and domain-specific terms along with relationships between the entities from the processed documents using the one or more large language models;
creating (508) key-value pairs from the extracted entities and relationships between the entities;
generating (510) a knowledge graph by automatically transforming the key-value pairs into structured interconnected representations without requiring manual schema definition, wherein the knowledge graph represents interconnected relationships between the entities;
applying (512) metadata to the knowledge graph, including automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis;
initiating (514) a decentralized approval workflow that automatically identifies domain owners based on document content classification; and
publishing (516) the knowledge graph, after approval, as an adaptive data package to a searchable marketplace, wherein the adaptive data package becomes discoverable and accessible to authorized users.
2. The method as claimed in claim 1, wherein the parallel multi-modal processing comprises:
an optical character recognition pipeline that extracts textual content while preserving document structure, formatting, and table layouts;
a visual processing pipeline that identifies and interprets charts, graphs, diagrams, and images; and
a semantic analysis pipeline that identifies entities and extracts contextual relationships using natural language processing.
3. The method as claimed in claim 1, further comprising validating the selected documents for format compatibility, file integrity, and processing feasibility before executing the parallel multi-modal processing.
4. The method as claimed in claim 1, wherein generating the knowledge graph comprises:
performing entity resolution to create resolved entities for identifying and merging duplicate entities across document sections;
transforming the resolved entities and the relationships between the entities into key-value pairs;
establishing weighted connections between the resolved entities with confidence scoring;
creating dynamic schemas that automatically adapt to document content patterns; and
maintaining entity provenance to track a source and an extraction method for each graph element.
5. The method as claimed in claim 1, wherein the decentralized approval workflow comprises routing approval requests through parallel or sequential workflows based on governance requirements, and maintaining immutable audit trails of all approval decisions.
6. The method as claimed in claim 1, wherein publishing to the marketplace comprises providing version control with complete change history, generating integration endpoints for business intelligence platforms, and tracking usage analytics.
7. The method as claimed in claim 1, further comprising incorporating user feedback during processing to improve extraction accuracy through machine learning model refinement.
8. The method as claimed in claim 1, wherein the entities and relationships between the entities are extracted using transformer-based large language models trained on domain-specific documents.
9. The method as claimed in claim 1, wherein the knowledge graph generated from the key-value pairs is stored in a graph database supporting query operations and relationship traversal.
10. A system (102) for processing unstructured data, the system (102) comprising:
a processor (106);
a memory (104) storing program instructions which, when executed by the processor (106), cause the processor (106) to:
select, using document selection criteria, unstructured documents from one or more connected repositories;
execute parallel multi-modal processing on the unstructured documents selected using the document selection criteria to generate processed documents, wherein the multi-modal processing simultaneously performs optical character recognition, visual element extraction, and semantic analysis using one or more large language models;
extract entities including people, locations, dates, keywords, and domain-specific terms along with relationships between the entities from the processed documents using the one or more large language models;
create key-value pairs from the extracted entities and the relationships between the entities;
generate a knowledge graph by automatically transforming the key-value pairs into structured interconnected representations without requiring manual schema definition, wherein the knowledge graph represents the relationships between the entities;
apply metadata to the knowledge graph, including automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis;
initiate a decentralized approval workflow that automatically identifies domain owners based on document content classification; and
publish the knowledge graph, after approval, as an adaptive data package to a searchable marketplace, wherein the adaptive data package becomes discoverable and accessible to authorized users.
11. The system (102) as claimed in claim 10, wherein the instructions for executing the parallel multi-modal processing cause the system (102) to implement:
an optical character recognition pipeline configured to extract textual content while preserving document structure, formatting, and table layouts;
a visual processing pipeline configured to identify and interpret charts, graphs, diagrams, and images; and
a semantic analysis pipeline configured to identify entities and extract the relationships between the entities using natural language processing.
12. The system (102) as claimed in claim 10, wherein the instructions further cause the system to validate the selected documents for format compatibility, file integrity, and processing feasibility before executing the parallel multi-modal processing.
13. The system (102) as claimed in claim 10, wherein the instructions for generating the knowledge graph cause the system (102) to:
perform entity resolution to identify and merge duplicate entities across document sections to create resolved entities;
transform the resolved entities and the relationships between the entities into the key-value pairs;
establish weighted connections between the resolved entities with confidence scoring;
create dynamic schemas that automatically adapt to document content patterns; and
maintain entity provenance to track a source and an extraction method for each graph element.
14. The system (102) as claimed in claim 10, wherein the instructions for initiating the decentralized approval workflow cause the system to route approval requests through parallel or sequential workflows based on governance requirements, and maintain immutable audit trails of all approval decisions.
15. The system (102) as claimed in claim 10, wherein the instructions for publishing to the marketplace cause the system to provide version control with complete change history, generate integration endpoints for business intelligence platforms, and track usage analytics.
16. The system (102) as claimed in claim 10, wherein the instructions further cause the system to incorporate user feedback during processing to improve extraction accuracy through machine learning model refinement.
17. The system (102) as claimed in claim 10, wherein the one or more large language models comprise transformer-based large language models trained on domain-specific documents for extracting the entities and the relationships between the entities.
18. The system (102) as claimed in claim 10, wherein the system (102) comprising a graph database configured to store the knowledge graph generated from the key-value pairs and support query operations and relationship traversal.
19. A non-transitory computer-readable storage medium comprising machine readable instructions that, when executed, cause a processor (106) to:
select, using document selection criteria, unstructured documents from one or more connected repositories;
execute parallel multi-modal processing on the unstructured documents selected using the document selection criteria to generate processed documents, wherein the multi-modal processing simultaneously performs optical character recognition, visual element extraction, and semantic analysis using one or more large language models;
extract entities including people, locations, dates, keywords, and domain-specific terms along with relationships between the entities from the processed documents using the one or more large language models;
create key-value pairs from the extracted entities and the relationships between the entities;
generate a knowledge graph by automatically transforming the key-value pairs into structured interconnected representations without requiring manual schema definition, wherein the knowledge graph represents the relationships between the entities;
apply metadata to the knowledge graph, including automatically inheriting tags from the connected repositories and generating contextual tags based on content analysis;
initiate a decentralized approval workflow that automatically identifies domain owners based on document content classification; and
publish the knowledge graph, after approval, as an adaptive data package to a searchable marketplace, wherein the adaptive data package becomes discoverable and accessible to authorized users.
BALIP AMIT ABASAHEB [IN/PA-5184]