Abstract: Disclosed herein is a method (600) for image sensing, extracting information, and real-time analysis of an object. The method comprises capturing (602), by an image capturing device, one or more images of at least one object. The method comprises capturing (604) an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value. The method comprises detecting (606) one or more text by processing the input stream. Further, the method comprises recognising (608) one or more text by processing the detected one or more text. Furthermore, the method comprises generating (610) one or more insights related to the one or more images of the object by analysing the one or more recognised text.
Description:
FIELD OF THE INVENTION
[001] The present disclosure generally relates to high-resolution image processing, and in particular, to a method and a system for image sensing, extracting information, and real-time analysis of an object.
BACKGROUND
[002] Visual information extraction and analysis systems are technologies that automatically process, analyze, and interpret visual data such as images, videos, or visual signals to extract meaningful insights or information. Such systems typically combine various fields like computer vision, machine learning, and image processing to perform tasks that range from identifying objects to understanding complex visual patterns.
[003] These conventional systems face significant limitations when handling variations in image quality, object orientation, lighting conditions, and complex contextual information. These systems often rely heavily on high-resolution image processing, which is computationally expensive and impractical for real-time applications, especially in resource-constrained environments. Furthermore, these systems typically use fixed model architectures, lacking the flexibility to adapt to new or evolving object classes and diverse environmental conditions, which hampers their accuracy and applicability in dynamic business contexts.
[004] The primary drawbacks of the conventional system include:
[005] Inability to Handle Multi-Resolution and Multi-Depth Variations: Existing image capture systems struggle to maintain performance across varying object sizes, depths, and resolutions. They often depend on fixed-focus cameras or static resolution settings, leading to inconsistent image quality and suboptimal text extraction, particularly when objects are positioned at different depths or under low-light conditions.
[006] Computational Inefficiencies and Resource Constraints: Traditional image processing architectures employ single-resolution pipelines, consuming excessive computational resources and power when processing high-resolution images. This results in slower processing times and impractical use in real-time applications, especially in edge-computing scenarios.
[007] Limited Accuracy in Complex Text Extraction: Standard OCR models often struggle with intricate text patterns, such as dotted or overlapping text, high color contrasts, and non-standard orientations—common in retail packaging. These models are not fine-tuned to handle diverse textual representations of key attributes like manufacturing dates, expiry dates, and product codes, leading to frequent misclassifications and low extraction accuracy.
[008] Lack of Adaptability and Real-Time Learning: Existing solutions typically lack adaptive learning mechanisms, making them unable to refine their performance based on real-time feedback. This results in a decreased ability to adapt to new or evolving data scenarios, requiring extensive manual re-training.
[009] Limited Interactivity and Analytical Capabilities for Operational Intelligence: Current state-of-the-art systems lack integration with advanced language models to support natural language queries and provide contextually relevant responses based on visual and textual data. This restricts their ability to support business analytics and interactive data exploration.
[010] In view of the foregoing disadvantages, there is a compelling need for an innovative solution that can address the shortcomings of traditional systems and provide an intelligent character recognition and visual information extraction system and method, offering a versatile, adaptable, and efficient solution for real-world applications.
SUMMARY
[001] This summary is provided to introduce a selection of concepts in a simplified format that are further described in the detailed description of the invention. This summary is not intended to identify essential inventive concepts of the invention, nor is it intended to determine the scope of the invention.
[002] According to an embodiment of the present disclosure, a method for image sensing, extracting information, and real-time analysis of an object is disclosed. The method includes capturing, by an image capturing device, one or more images of at least one object. The method includes capturing an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value. The input stream includes a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate, and a high-resolution stream with a low frame rate. The method includes detecting one or more text by processing the input stream. Further, the method includes recognising one or more text by processing the detected one or more text. Furthermore, the method includes generating one or more insights related to the one or more images of the object by analysing the one or more recognised text.
[003] According to another embodiment, a system for image sensing, extracting information, and real-time analysis of an object is disclosed. The system comprises one or more processors and a memory coupled with the one or more processors. The one or more processors are configured to capture, by an image capturing device, one or more images of at least one object. The one or more processors are configured to capture an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value. The input stream comprises a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate, and a high-resolution stream with a low frame rate. The one or more processors are configured to detect one or more text by processing the input stream. Further, the one or more processors are configured to recognise one or more text by processing the detected one or more text. Furthermore, the one or more processors are configured to generate one or more insights related to the one or more images of the object by analysing the one or more recognised text.
[004] To further clarify the advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[005] These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
[006] Figure 1 illustrates an exemplary environment for implementing a system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention;
[007] Figure 2 illustrates system architecture for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention;
[008] Figure 3 illustrates a schematic block diagram of the operational flow of the modules of the system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention;
[009] Figure 4 illustrates a schematic architecture of the system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention;
[010] Figure 5 illustrates a hardware architecture of the system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention;
[011] Figure 6 illustrates a method for image sensing, extracting information, and real-time analysis of an object, in accordance with an exemplary embodiment of the present disclosure; and
[012] Figures 7A-7D illustrate exemplary representations of the system implementations for image sensing, extracting information, and real-time analysis of an object, in accordance with an exemplary embodiment of the present disclosure.
[013] Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not have necessarily been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
DETAILED DESCRIPTION
[014] It should be understood at the outset that although illustrative implementations of the embodiments of the present disclosure are illustrated below, the present invention may be implemented using any number of techniques, whether currently known or in existence. The present disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary design and implementation illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
[015] The term “some” as used herein is defined as “none, or one, or more than one, or all.” Accordingly, the terms “none,” “one,” “more than one,” “more than one, but not all” or “all” would all fall under the definition of “some.” The term “some embodiments” may refer to no embodiments, to one embodiment or to several embodiments or to all embodiments. Accordingly, the term “some embodiments” is defined as meaning “no embodiment, or one embodiment, or more than one embodiment, or all embodiments.”
[016] The terminology and structure employed herein is for describing, teaching, and illuminating some embodiments and their specific features and elements and does not limit, restrict, or reduce the spirit and scope of the claims or their equivalents.
[017] More specifically, any terms used herein such as but not limited to “includes,” “comprises,” “has,” “consists,” and grammatical variants thereof do NOT specify an exact limitation or restriction and certainly do NOT exclude the possible addition of one or more features or elements, unless otherwise stated, and furthermore must NOT be taken to exclude the possible removal of one or more of the listed features and elements, unless otherwise stated with the limiting language “MUST comprise” or “NEEDS TO include.”
[018] Whether or not a certain feature or element was limited to being used only once, either way, it may still be referred to as “one or more features” or “one or more elements” or “at least one feature” or “at least one element.” Furthermore, the use of the terms “one or more” or “at least one” feature or element does NOT preclude there being none of that feature or element, unless otherwise specified by limiting language such as “there NEEDS to be one or more . . .” or “one or more element is REQUIRED.”
[019] Hereinafter, it is understood that terms including “unit” or “module” at the end may refer to the unit for processing at least one function or operation and may be implemented in hardware, software, or a combination of hardware and software.
[020] Unless otherwise defined, all terms, and especially any technical and/or scientific terms, used herein may be taken to have the same meaning as commonly understood by one having ordinary skill in the art.
[021] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted to not unnecessarily obscure the embodiments herein. Also, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments. The term “or” as used herein, refers to a non-exclusive or unless otherwise indicated. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those skilled in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[022] As is traditional in the field, embodiments may be described and illustrated in terms of blocks that carry out a described function or functions. These blocks, which may be referred to herein as units or modules or the like, are physically implemented by analog or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits, or the like, and may optionally be driven by firmware and software. The circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like. The circuits constituting a block may be implemented by dedicated hardware, by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the invention. Likewise, the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the invention.
[023] The accompanying drawings are used to help easily understand various technical features and it should be understood that the embodiments presented herein are not limited by the accompanying drawings. As such, the present disclosure should be construed to extend to any alterations, equivalents, and substitutes in addition to those which are particularly set out in the accompanying drawings. Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are generally only used to distinguish one element from another.
[024] An object of the present disclosure is to provide a method and a system for image sensing, extracting information related to visual and textual information, and real-time analysis of an object.
[025] Another object of the present disclosure is to provide a hierarchical multi-resolution image capture framework with dynamic focus adjustment based on real-time depth estimation and region selection, ensuring consistent image quality and precise text recognition under varied conditions.
[026] Yet another object of the present disclosure is to provide an advanced Vision-and-Language Transformer (ViLT) that handles complex text patterns and multi-lingual content, achieving higher accuracy in text extraction even in visually challenging contexts.
[027] Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
[028] The method and system of the present disclosure provide a hierarchical multi-resolution pipeline that dynamically allocates computational resources and adjusts resolution levels based on the complexity of visual content, ensuring optimal performance and efficient resource usage even in low-power environments.
[029] The method and system of the present disclosure provide an incremental learning framework with a teacher-student model and Human-in-the-Loop capabilities, allowing for continuous model refinement and optimization, ensuring sustained high accuracy and adaptability over time.
[030] The method and system of the present disclosure integrate Large Language Models (LLMs) and Small Language Models (SLMs) to interpret user queries, deliver actionable insights, and facilitate seamless interaction between the system and users, thereby enhancing operational intelligence and decision-making capabilities.
[031] Figure 1 illustrates an exemplary environment for implementing a system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention. Referring to Figure 1, an electronic device 102 with an in-built camera (interchangeably referred to as the image capturing device) 106 communicates with a system 104 for image sensing, extracting information related to visual and textual information, and real-time analysis of an object 108. The system 104 may include software, hardware, a combination of software and hardware, an in-built application on an electronic device 102, or an application to be installed and operated on the electronic device 102 in communication with a network interface. The system 104 may also be available via a cloud-based server and may be accessed remotely from the electronic device 102.
[032] In the embodiment when the system 104 is located outside the electronic device 102, a network interface (not shown) may be configured to provide network connectivity and enable communication between the system 104 and the electronic device 102. The network connectivity may be provided via a wireless connection or a wired connection. For example, the network connectivity may be provided via cellular technology, such as 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G), pre-5G, 6th Generation (6G), Bluetooth, Local Area Network (LAN), Wi-Fi, cable, or any other wired/wireless communication technology.
[033] Figure 2 illustrates system architecture 200 for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention. The system 104 may be configured for sensing an image, extracting information related to visual and textual information and providing real-time analysis of an object. The system 104 may include at least one processor 202 (referred to as processor for sake of brevity) which is communicatively coupled to a memory 204, one or more modules 206, and a data unit 208.
[034] In an example, the processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 202 may be configured to fetch and execute computer-readable instructions and data stored in the memory 204. The processor 202 may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, or an AI-dedicated processor such as a neural processing unit (NPU). The processor 202 may control the processing of input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory, i.e., the memory 204. The predefined operating rule or artificial intelligence model is provided through training or learning. Further, the processor 202 may be operatively coupled to each of the memory 204 and an I/O interface. The processor 202 may be configured to process, execute, or perform a plurality of operations described herein.
[035] The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
[036] The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
[037] Here, being provided through learning means that, by applying a learning technique to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
[038] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs a layer operation through the calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
[039] The learning technique is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning techniques include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
[040] In an example, the memory 204 may include any non-transitory computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read-only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 204 is communicatively coupled with the processor 202 to store processing instructions for completing the process. Further, the memory 204 may include an operating system for performing one or more tasks of the system, as performed by a generic operating system in a computing domain. The memory 204 is operable to store instructions executable by the processor 202.
[041] In some embodiments, the one or more modules 206 may include a set of instructions that can be executed to cause the system 104 to perform any one or more of the methods disclosed. The system 104 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices. Further, while a single system 104 is illustrated, the term “system” shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.
[042] In an embodiment, the module(s) 206 may be implemented using one or more artificial intelligence (AI) modules that may include a plurality of neural network layers. Examples of neural networks include, but are not limited to, Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Restricted Boltzmann Machine (RBM). Further, ‘learning’ may be referred to in the disclosure as a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning techniques include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. At least one of a plurality of CNN, DNN, RNN, RBM models and the like may be implemented to thereby achieve execution of the present subject matter’s mechanism through an AI model. A function associated with an AI module may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor, such as a neural processing unit (NPU). One or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
[043] In some embodiments, the data unit 208 serves, amongst other things, as a repository for storing data processed, received, and generated by one or more of the modules 206. The modules 206, amongst other things, include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement data types. The modules 206 may also be implemented as, signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the modules 206 may be implemented in hardware, instructions executed by a processing unit, or by a combination thereof. The module(s) 206 enables the system 104 to perform the features/functions of the present disclosure, as discussed and explained in detail in conjunction with Figures 3-7 in the forthcoming paragraphs.
[044] Figure 3 illustrates a schematic block diagram of the operational flow of the modules of the system for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention.
[045] At operation 301, the system 104 is configured to capture one or more images of at least one object using an image capturing device. The image capturing device is a hierarchical multi-resolution image capturing device with dynamic focus adjustment to process the one or more images at varying resolutions and frame rates as per the pre-defined threshold value related to latency, accuracy, and quality of the one or more images.
[046] At operation 302, the system 104 is configured to capture an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value. The one or more operating parameters of one or more sensors of the image capturing device include changing environmental conditions and one or more specific operational requirements of a user. Further, the input stream comprises a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate and a high-resolution stream with a low frame rate.
[047] In an embodiment, the low-resolution stream is 720p with a high frame rate of 60-120 frames per second (FPS), the medium-resolution stream is 1280p with a moderate frame rate of 20-30 FPS, and the high-resolution stream is 4K with a low frame rate of 1-2 FPS.
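By way of illustration only, the three stream tiers described in the embodiment above may be represented as a simple configuration structure. The following Python sketch is not part of the disclosed system; the class name, field names, and the exact pixel dimensions assumed for the medium-resolution tier are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamProfile:
    """One tier of the hierarchical multi-resolution input stream."""
    name: str
    resolution: tuple  # (width, height) in pixels; assumed values for illustration
    fps_range: tuple   # (minimum FPS, maximum FPS)

# Illustrative tiers mirroring the values given in the embodiment above.
LOW_RES = StreamProfile("low", (1280, 720), (60, 120))       # 720p, high frame rate
MEDIUM_RES = StreamProfile("medium", (1280, 960), (20, 30))  # "1280p", moderate frame rate (assumed dimensions)
HIGH_RES = StreamProfile("high", (3840, 2160), (1, 2))       # 4K, low frame rate
```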
[048] At operation 303, the system 104 is configured to detect one or more text by processing the input stream.
[049] At operation 304, the system 104 is configured to recognise one or more text by processing the detected one or more text. At operation 305, the system is configured to use an Artificial Intelligence (AI) model to identify and localize one or more text regions in each frame of the input stream. The input stream is initially the low-resolution stream.
[050] At operation 306, to recognise one or more text, the system 104 may be configured to generate a confidence score based on the identification and the localization of one or more text regions.
[051] At operation 307, the system 104 is configured to determine whether the confidence score is above a threshold value or a specific value. For instance, the threshold value is in the range of 0.6-0.99 and a specific value is in the range of 0.7-0.99, which may be dynamically selected based on the environment and calibration data.
[052] Based on the determination, at operation 308, the system 104 is configured to dynamically switch, by a streaming handler, the input stream to one of the medium-resolution stream and the high-resolution stream if the confidence score is above the threshold value or the specific value, respectively.
[053] At operation 309, the system 104 is configured to determine one of: the dynamic switching of the input stream, and the confidence score of the input stream being below the threshold value. Based on the determination, the system is configured to detect the one or more text, using the AI-based module, by processing the input stream.
[054] At operation 310, the system 104 is configured to recognise the one or more text by processing the detected one or more text.
[055] At operation 311, the system 104 is configured to generate one or more insights related to the one or more images of the object by analysing the one or more recognised text.
[056] At operation 312, to analyse the one or more recognised text, the system 104 may be configured to extract meta-information from the one or more recognised text using one or more of an AI model, an Optical Character Recognition (OCR) module, and a Vision-and-Language Transformer (ViLT) module. In an alternate embodiment, the meta-information may also be extracted using Quick Character Recognition (QCR), which performs faster extraction of both printed and handwritten visual and textual content, making it suitable for digitizing complex handwritten records into legible, structured text for further analysis.
[057] At operation 313, the system 104 is configured to determine a semantic relationship between the one or more images of the object and the one or more recognised text based on the extracted meta-information.
[058] At operation 314, the system 104 is configured to generate the one or more insights related to the one or more images of the object based on the determined semantic relationship.
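The conversion of recognised text into domain-specific meta-information may be pictured, in greatly simplified form, as a key-value extraction step. The Python sketch below uses plain regular expressions purely for illustration; the disclosed system relies on AI, OCR, and ViLT models rather than hand-written patterns, and the field names and formats shown are assumptions.

```python
import re

# Illustrative patterns only; the disclosed system uses AI/OCR/ViLT models, not regexes.
PATTERNS = {
    "mrp": re.compile(r"MRP[:\s]*(?:Rs\.?\s*)?([\d.]+)", re.IGNORECASE),
    "expiry_date": re.compile(r"(?:EXP|Expiry)[:\s]*([0-9]{2}[/-][0-9]{2,4})", re.IGNORECASE),
    "batch_no": re.compile(r"(?:Batch\s*No\.?|B\.?\s*No\.?)[:\s]*([A-Z0-9-]+)", re.IGNORECASE),
}

def extract_meta_information(recognised_text: str) -> dict:
    """Convert recognised text into domain-specific key-value metadata."""
    meta = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(recognised_text)
        if match:
            meta[key] = match.group(1)
    return meta

# Example usage with hypothetical label text:
print(extract_meta_information("MRP: Rs. 45.00  Batch No: AB123  EXP: 08/2026"))
# {'mrp': '45.00', 'expiry_date': '08/2026', 'batch_no': 'AB123'}
```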
[059] For instance, the low-resolution stream is processed by a miniaturized edge AI model for text detection (TD), which identifies and localizes text regions in each frame, while also generating a confidence score for the detected text regions. Based on this confidence score, the streaming handler dynamically switches to the medium-resolution stream for subsequent processing. The miniaturized edge AI model for text recognition (TR) operates on the medium-resolution stream. It utilizes the localized text region output from the TD model on the low-resolution stream if the confidence score is above a predefined threshold. Otherwise, the TD is executed independently on the medium-resolution stream before invoking the TR. Thereafter, the output from the TR model is processed and the recognized text is converted into domain-specific metadata, formatted as key-value pairs. If the confidence score of the extracted key-value pairs meets or exceeds a specified threshold, the streaming handler switches to the high-resolution stream to capture high-quality images, thereby enabling further processing by higher-order AI models.
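For purposes of illustration only, the escalation logic described in the preceding paragraph may be sketched as follows. In this Python sketch the streams mapping, the detect_text, recognise_text, and extract_key_values callables, and the specific threshold values are all assumed placeholders for the edge TD/TR models and the streaming handler, not a definitive implementation.

```python
# Minimal sketch of the streaming handler's escalation logic, assuming placeholder
# callables for the edge text-detection (TD), text-recognition (TR), and
# key-value extraction models.

TD_THRESHOLD = 0.8  # illustrative value within the 0.6-0.99 range mentioned above
KV_THRESHOLD = 0.9  # illustrative value within the 0.7-0.99 range mentioned above

def process_object(streams, detect_text, recognise_text, extract_key_values):
    # 1. Run text detection on the low-resolution, high-frame-rate stream.
    low_frame = streams["low"].next_frame()
    regions, td_score = detect_text(low_frame)

    # 2. Switch to the medium-resolution stream for recognition.
    medium_frame = streams["medium"].next_frame()
    if td_score >= TD_THRESHOLD:
        # Reuse the localisation produced on the low-resolution stream.
        text = recognise_text(medium_frame, regions)
    else:
        # Otherwise re-run detection independently on the medium-resolution stream.
        regions, _ = detect_text(medium_frame)
        text = recognise_text(medium_frame, regions)

    # 3. Convert recognised text into domain-specific key-value metadata.
    key_values, kv_score = extract_key_values(text)

    # 4. If the metadata is confident enough, escalate to the high-resolution
    #    stream so that higher-order models can process a high-quality frame.
    if kv_score >= KV_THRESHOLD:
        return {"metadata": key_values, "high_res_frame": streams["high"].next_frame()}
    return {"metadata": key_values, "high_res_frame": None}
```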
[060] In an embodiment, the system 104 is configured to display the generated one or more insights related to the one or more images of the object and prompt a user for one of: validation and correction; cross-reference the extracted meta-data from the AI model and the ViLT model, on receipt of the validation by the user; and update the ViLT model based on the captured correction from the user, on receipt of the correction from the user.
[061] In an embodiment, the ViLT model is configured to be equipped with a specialized vision encoder fine-tuned, preferably for complex retail scenarios, making it highly effective at detecting and localizing text within images. The vision encoder is fine-tuned on large-scale synthetic datasets that include a wide variety of fonts, dotted texts, orientations, colour contrasts, and background complexities. The robust training ensures accurate identification of text regions and distinction between different visual patterns, making it particularly adept at handling difficult OCR scenarios, such as curved or stylized text, even in visually cluttered settings. Further, the method and system of the present disclosure enable an understanding of the semantic relationships between visual objects and text components. This facilitates more accurate extraction of structured meta-data, such as product names, prices, batch numbers, manufacturing dates, expiry dates, batch codes, etc. Additionally, the method and system employ a multi-agent architecture comprising Small Language Models (SLMs) that collaborate to interpret extracted data and generate comprehensive business insights. This allows the method and system to respond to user queries in real-time, providing contextually relevant information such as product categorization, inventory analysis, and compliance verification.
[062] In addition to printed text extraction, the method and system of the present disclosure are capable of interpreting handwritten content, expanding their utility to domains such as historical record digitization, legal documentation, and administrative processes. The method and system of the present disclosure further leverage neural networks trained specifically to handle variations in handwriting styles and languages, allowing the method and system of the present disclosure to convert handwritten records into readable and interpretable text. For example, handwritten land records in Tamil may be extracted, transcribed, and presented as legible printed text in Tamil or translated into English or any other language for broader accessibility and accurate interpretation. This feature ensures that complex handwritten information, regardless of the script or language, may be effectively digitized, translated, and structured for downstream applications, enabling seamless integration of manual records into modern digital workflows.
[063] In an embodiment, the method and system of the present disclosure employ a combination of ViLT and OCR to process and encode complex visual and textual elements found in food, pharma, cosmetics, retail packaging and similar domains. The OCR methodology is specifically designed to handle diverse textual and visual data patterns, including multi-lingual representations, high-contrast regions, and overlapping text fields, ensuring comprehensive and accurate data extraction. In an alternate embodiment, QCR may also be used to perform faster extraction of both printed and handwritten visual and textual content, making it suitable for digitizing complex handwritten records into legible, structured text for further analysis. After processing, the extracted information is encoded into a standardized QR code format, which can be easily printed and affixed to product packaging (a minimal encoding sketch is provided after the use cases below). The QR code serves as a unified repository of critical product and compliance information, enabling seamless accessibility across the entire value chain. It is designed to support multiple use cases such as:
[064] Warehouse and Supply Chain Operations: The QR code facilitates inbound logistics, inventory management, batch tracking, and order fulfilment by providing real-time access to product details such as batch numbers, expiry dates, and regulatory compliance data.
[065] Retail Applications: On the retail floor, the QR code enables efficient on-shelf placement management, pricing consistency by resolving multiple Maximum Retail Price (MRP) issues, and the timely promotion of near-expiry products to reduce wastage.
[066] Consumer Engagement: Customers can scan the QR code to retrieve comprehensive product details, including nutritional information, allergen warnings, FSSAI and CDSCO compliance data, and environmental or ethical claims like recyclability or vegetarian indicators.
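As referenced above, encoding the extracted meta-information into a printable QR code may be sketched as follows. This Python sketch uses the open-source qrcode library; the field names and values are hypothetical, and the JSON payload is an illustrative assumption rather than the standardized format of the disclosed system.

```python
import json
import qrcode  # open-source library: pip install qrcode[pil]

# Hypothetical meta-information extracted from a product label.
product_metadata = {
    "mrp": "45.00",
    "mfd": "2025-02-01",
    "expiry_date": "2026-08-01",
    "batch_no": "AB123",
    "fssai_licence": "placeholder-licence-number",
}

# Encode the key-value pairs as JSON and render a printable QR code image.
qr_image = qrcode.make(json.dumps(product_metadata))
qr_image.save("product_label_qr.png")
```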
[067] Thus, the OCR methodology has the following advantages:
[068] Improved Traceability: Seamlessly tracks products throughout their lifecycle, source to store, aiding in expiry management, inventory planning, recalls, inventory management, and operational efficiency.
[069] Enhanced Compliance: Simplifies adherence to GS1, FSSAI, and CDSCO standards by centralizing required data and making it easily accessible through a simple QR code.
[070] Data-Driven Decisions: Enables better product development, marketing, and sales strategies through readily accessible data.
[071] Increased Efficiency: Reduces manual effort and errors by automating data capture and dissemination.
[072] Better Consumer Experience: Provides accurate and up-to-date product information, enhancing transparency and trust.
[073] Figure 4 illustrates a schematic architecture 400 of the system 104 for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention. In an embodiment, the information extraction may be related to visual and textual information. As shown, Figure 4 showcases the underlying operational framework of the system architecture 400, integrating various components and their seamless interaction. The framework is designed to handle multi-resolution image streams from the camera 106 payload, dynamically switching between low, medium, and high-resolution frames 402 based on the complexity and confidence of the visual content being analysed. For instance, the system captures high-quality visual data through its RGB (red, green, blue) and depth sensors. In addition, Figure 4 illustrates a hybrid compute architecture, wherein an edge compute module 404 performs the initial stages of text detection, recognition, and key-value extraction using low and medium-resolution frames. The initial processing may be conducted by a series of lightweight AI models 406 optimized for real-time performance, enabling the system 104 to maintain low latency while delivering accurate preliminary results. Further, the on-premises or cloud module 408 may facilitate use of a Vision-Language Transformer (ViLT) and a Region Selector for higher-level reasoning and detailed text recognition. The hierarchical flow between the edge compute module 404 and the cloud module 408 ensures efficient utilization of computational resources by only escalating to higher-resolution frames and more complex models when necessary. For instance, initial processing is carried out using text detection and recognition modules, which extract key attributes like Maximum Retail Price (MRP), Manufacturing Date (MFD), Expiry Date, and Batch Number. This information is structured and stored locally for fast access and immediate responses.
[074] Further, Figure 4 showcases integration of the teacher-student model framework within the online incremental learning module 408. The framework allows the student model to be continuously supervised by the teacher model, with agent modules mediating and refining the student model’s outputs based on feedback loops. The agent module 410 actively monitors and resolves any inconsistencies between the predictions, triggering fine-tuning or updates when necessary. Through the iterative learning cycle, the system 104 achieves robust performance across a wide array of use cases allowing real-time information extraction related to visual and textual information and analysis.
[075] In an embodiment, the system 104 comprises an interactive interface 416 that enables user feedback to be incorporated into the system 104 in real-time. The human-in-the-loop feedback is captured through user validation or correction of the system’s 104 predictions. The feedback is looped back into the training database 418 and used to refine both the Vision-Language Transformer (ViLT) and edge-based models 410. By adopting an active learning mechanism, discrepancies between the edge and ViLT models are identified and addressed through incremental learning techniques. The updates ensure that the system 104 adapts dynamically to evolving data patterns and maintains high recognition accuracy over time.
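A minimal Python sketch of such a human-in-the-loop feedback loop is given below. The class and field names, the batching rule, and the fine_tune callable are illustrative assumptions; the disclosed system performs incremental learning through its teacher-student framework and training database rather than through this exact structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class FeedbackRecord:
    """One user interaction: a prediction that was validated or corrected."""
    image_id: str
    predicted_text: str
    corrected_text: str  # equal to predicted_text when the user simply validates

@dataclass
class FeedbackLoop:
    """Queues human-in-the-loop corrections and triggers incremental fine-tuning."""
    fine_tune: Callable[[List[FeedbackRecord]], None]  # stand-in for the incremental-learning step
    batch_size: int = 32
    pending: List[FeedbackRecord] = field(default_factory=list)

    def submit(self, record: FeedbackRecord) -> None:
        # Only disagreements (corrections) are accumulated as new training data.
        if record.corrected_text != record.predicted_text:
            self.pending.append(record)
        # Once enough corrections accumulate, refine the edge/ViLT models.
        if len(self.pending) >= self.batch_size:
            self.fine_tune(list(self.pending))
            self.pending.clear()
```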
[076] In addition, the interface 416, combined with advanced dashboards, serves as a comprehensive communication and control module, enabling the system 104 to not only capture, extract, and analyze visual and textual data but also to facilitate user interaction and system oversight. The interface 416, powered by Small Language Models (SLMs) for natural language understanding, offers intuitive dashboards that present real-time insights, operational feedback, and actionable instructions to users based on ongoing data extraction and analysis. The interface 416 dashboard extends beyond conventional data visualization by providing an interactive platform where users can validate or correct the outputs of the system 104. Through this interaction, the system 104 captures nuanced human inputs, which are subsequently incorporated into the active learning loop. The feedback loop not only improves the performance of text recognition models but also tailors the system's 104 behavior to better align with specific user expectations and evolving business requirements. The use of SLMs enables the system 104 to interpret user commands, respond contextually, and offer guidance on optimizing image capture and extraction settings, thus empowering users with a higher degree of control and adaptability. Additionally, the interface dashboard 416 employs predictive analytics to recommend system 104 adjustments or highlight potential anomalies in real-time. For instance, if a particular text extraction scenario consistently triggers low-confidence scores, the interface 416 may prompt the user with suggestions to modify parameters, adjust camera 106 positioning, or utilize different resolution bands for improved results. Thus, the interface 416 may evolve beyond a traditional user interface into an intelligent assistant that actively contributes to the overall accuracy, efficiency, and effectiveness of the system.
[077] In an embodiment, the interactive interface 416, coupled with natural language-powered instructions, provides a seamless bridge between the sophisticated AI model 408 and the end-user, making the system 104 not just a tool for visual and textual data extraction but a comprehensive, interactive solution that guides, instructs, and adapts based on user feedback and operational demands.
[078] Figure 5 illustrates a hardware architecture 500 of the system 104 for image sensing, extracting information, and real-time analysis of an object, according to an embodiment of the present invention.
[079] In an example embodiment, the hardware architecture 500 of the system 104 comprises a Depth Camera 106, a Processor 202, a High-Resolution Image Sensor 402-B, and a Liquid Lens Driver with an Image Signal Processor (ISP) 402-A. The Depth Camera 106 is connected to the Processor 202 via a USB-C interface, transmitting a combined Depth and RGB video stream for further processing. The High-Resolution Image Sensor 402-B, situated on the right side, communicates with the Processor 202 through another USB-C interface, delivering high-resolution image frames. The Image Sensor 402-B is equipped with a Liquid Lens, controlled by a Liquid Lens Driver and ISP module 402-A through an Inter-Integrated Circuit (I2C) connection.
[080] In an example embodiment, the liquid lens 402-A is connected to the processor 202 via a USB interface, and employs the depth estimation module to autonomously calibrate the focus. This is achieved through the transmission of a 10-bit value (ranging from 0 to 1023) to the liquid lens over a serial connection, enabling real-time focus variation. The non-linear relationship between the stepper values of the lens and the distance of objects from the camera is modelled and learned using a lightweight neural network. The region selection module is implemented as a pre-trained deep neural network, which processes the video streams to accurately predict regions of interest for focus adjustments. These regions of interest may include text regions, tables, barcode/QR code areas, logos, brand names, and other relevant visual elements.
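The depth-driven focus adjustment described above, in which a 10-bit value in the range 0 to 1023 is sent to the liquid lens over a serial connection, may be sketched as follows. The Python sketch uses the pyserial library; the command framing, port settings, and the predict_steps callable (a stand-in for the lightweight neural network that models the non-linear lens/distance relationship) are illustrative assumptions.

```python
import serial  # pyserial: pip install pyserial

def depth_to_focus_steps(depth_mm: float, predict_steps) -> int:
    """Map an estimated object distance to a 10-bit liquid-lens stepper value.

    `predict_steps` stands in for the lightweight neural network that learns the
    non-linear relationship between stepper values and object distance; here it
    is any callable returning a float.
    """
    steps = int(round(predict_steps(depth_mm)))
    return max(0, min(1023, steps))  # clamp to the 10-bit range 0-1023

def set_focus(port: str, depth_mm: float, predict_steps) -> None:
    """Send the computed focus value to the lens driver over a serial link."""
    value = depth_to_focus_steps(depth_mm, predict_steps)
    with serial.Serial(port, baudrate=115200, timeout=1) as link:
        # The framing below is assumed; the actual lens-driver protocol is not specified here.
        link.write(f"FOCUS {value}\n".encode("ascii"))
```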
[081] Further, the system generates high-voltage control signals to dynamically adjust the liquid lens’s focal length, thereby enabling real-time image enhancement and focus adjustments. The overall system architecture 500 allows for synchronized acquisition and processing of depth data and high-resolution images, making it suitable for advanced imaging applications. The architecture 500 enables the system to dynamically adapt focus and resolution settings based on the visual context, optimizing the accuracy and efficiency of visual and textual data extraction and processing.
[082] Figure 6 illustrates a method for image sensing, extracting information, and real-time analysis of an object, in accordance with an exemplary embodiment of the present disclosure.
[083] At step 602, the method 600 comprises capturing, by an image capturing device, one or more images of at least one object. The image capturing device is a hierarchical multi-resolution image capturing device with dynamic focus adjustment to process the one or more images at varying resolutions and frame rates as per the pre-defined threshold value related to latency, accuracy, and quality of the one or more images.
[084] At step 604, the method 600 comprises capturing an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value. The one or more operating parameters of one or more sensors of the image capturing device include changing environmental conditions and one or more specific operational requirements of a user. Further, the input stream comprises a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate, and a high-resolution stream with a low frame rate.
[085] At step 606, the method 600 comprises detecting one or more text by processing the input stream.
[086] At step 608, the method 600 comprises recognising one or more text by processing the detected one or more text. In an embodiment, to recognise the one or more text, the method 600 comprises using an Artificial Intelligence (AI) model to identify and localize one or more text regions in each frame of the input stream. The input stream is initially the low-resolution stream.
[087] The method 600 comprises generating a confidence score based on the identification and the localization of one or more text regions. The method 600 comprises determining whether the confidence score is above a threshold value or a specific value. Based on the determination, the method 600 comprises dynamically switching, by the streaming handler, the input stream to one of the medium-resolution stream and the high-resolution stream if the confidence score is above the threshold value or the specific value, respectively. For instance, the threshold value is in the range of 0.6-0.99 and the specific value is in the range of 0.7-0.99, which may be dynamically selected based on the environment and calibration data. The method 600 comprises determining one of: the dynamic switching of the input stream, and the confidence score of the input stream being below the threshold value. Based on the determination, the method 600 comprises detecting the one or more text, using the AI-based module, by processing the input stream. The method 600 comprises recognising the one or more text by processing the detected one or more text.
[088] At step 610, the method 600 comprises generating one or more insights related to the one or more images of the object by analysing the one or more recognised text. To analyse, the method 600 comprises extracting meta-information from the one or more recognised text using one or more of an AI model, an OCR module, and a ViLT module. In an alternate embodiment, the meta-information may also be extracted using QCR, which performs faster extraction of both printed and handwritten visual and textual content, making it suitable for digitizing complex handwritten records into legible, structured text for further analysis. The method 600 comprises determining a semantic relationship between the one or more images of the object and the one or more recognised text based on the extracted meta-information. The method 600 comprises generating the one or more insights related to the one or more images of the object based on the determined semantic relationship.
[089] Figures 7A-7D illustrate exemplary representations of the system for image sensing, extracting information, and real-time analysis of an object, in accordance with an exemplary embodiment of the present disclosure. Figures 7A-7D illustrate implementations of the system tailored for different use cases, enabling capturing, extracting, and analyzing textual and visual information from images.
[090] Figure 7A illustrates a use case scenario of the system 104 implemented on mobile devices (Android or iOS) and handheld devices. The system 104 integrates with built-in smartphone cameras, depth sensors, accelerometer sensors, gyro-sensors, and computing capabilities to allow real-time, high-resolution image streaming suitable for visual context extraction and recognition tasks. The system 104 of the present disclosure enables various industrial applications such as inventory management and retail automation.
[091] Figure 7B illustrates a use case scenario of the system 104 implemented on a handheld (HHD) device equipped with multi-resolution, dynamic focus cameras, time of flight (ToF) sensors, accelerometer sensors, gyro-sensors, and custom-tailored Graphic Processing Unit (GPU) compute for running miniaturized edge AI models, enabling it to perform extraction and analysis tasks in real-time. The device captures essential product information such as barcodes, expiry dates, and manufacturing dates, making it ideal for quick and efficient data capture in retail and warehousing environments.
[092] Figure 7C illustrates a use case scenario of the system 104 implemented as a Vision Cave model, which is a tabletop-height model integrating the system into a mechanical housing. This implementation of the system provides efficient product scanning by capturing comprehensive visual data in a controlled environment. Such an implementation may be suitable for use cases where a static scanning solution is required, such as quality assurance and inspection stations. Further, to ensure that the vision cave captures exceptionally high-quality images without any glare or glossiness, it may incorporate a dynamic lighting arrangement using cutting-edge components and design principles.
[093] For instance, the system 104 may use dynamic lighting with addressable RGB LEDs (e.g., WS2812B) controlled by an ESP32 microcontroller, which communicates with a Raspberry Pi for real-time adjustments. The system 104 may employ an RGB LED strip, which consists of individually controlled RGB LEDs that allow precise control over light color and brightness to dynamically adjust lighting conditions. An ESP32 microcontroller may manage the LED strip based on commands from the Raspberry Pi and create a dynamic lighting effect called the "train effect," where a light sequence moves across the LEDs to simulate changing light direction. The Raspberry Pi may send commands to the ESP32, such as "CAPTURE," to trigger the lighting effect, and may also process images and remove glare by combining multiple frames captured under varying lighting conditions. Further, a power supply may ensure stable operation by providing sufficient power to both the LED strip and the ESP32. For example, the Raspberry Pi may initiate the process by sending a "CAPTURE" command to the ESP32 microcontroller. The ESP32 may activate the "train effect," lighting up LEDs in a sequence that may create a dynamic shift in light direction across the subject being scanned. As the lighting shifts, the position of glare in the image also changes. The camera may capture multiple frames, each with different lighting angles and glare positions. The Raspberry Pi processes these captured frames, combining the best-lit portions of each captured image while minimizing glare. This results in a glare-free final stitched image with enhanced details and accuracy. This approach to dynamic lighting ensures enhanced image quality by dynamically adjusting lighting and processing multiple frames. The system's ability to optimize lighting conditions in real time ensures consistent performance across a variety of product surfaces and textures, making the system a versatile and advanced solution for scenarios where precise visual data is critical, such as quality assurance, inspection stations, and detailed product documentation.
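A minimal sketch of the capture-and-fuse flow on the Raspberry Pi side is shown below for illustration. The serial command framing, port settings, and the grab_frame callable are assumptions, and the per-pixel minimum is a deliberately simple stand-in for the frame-stitching step that keeps the best-lit portions of each frame.

```python
import numpy as np
import serial  # pyserial; assumed serial link between the Raspberry Pi and the ESP32

def capture_glare_free(esp32_port: str, grab_frame, n_frames: int = 8) -> np.ndarray:
    """Trigger the LED "train effect" and fuse frames captured under shifting light.

    `grab_frame` is a placeholder callable returning one frame as a uint8 NumPy
    array (e.g., from an OpenCV camera handle).
    """
    with serial.Serial(esp32_port, baudrate=115200, timeout=1) as link:
        link.write(b"CAPTURE\n")  # ESP32 starts sweeping the lighting sequence
        frames = [grab_frame() for _ in range(n_frames)]  # glare position differs per frame
    stack = np.stack(frames)
    # Specular highlights are bright and move between frames, so a per-pixel
    # minimum across the stack suppresses them (a crude stand-in for stitching).
    return stack.min(axis=0)
```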
[094] Figure 7D illustrates a use case scenario of the system 104 implemented for conveyor-based systems. For instance, the system may be positioned on all four sides of a conveyor belt. The configuration of such a system dynamically captures product details, such as MRP, manufacturing dates, expiration dates, and barcodes, as items move through the tunnel. Such a solution is ideal for logistics and distribution centres where high-speed data capture and accuracy are crucial.
[095] The present invention provides the following technical advantages:
[096] Hierarchical Multi-Resolution Image Capture and Adaptive Focus Mechanism: The method and system of the present disclosure provides a hierarchical multi-resolution image capture system utilizing dynamic focus adjustments and optimized resolution levels based on real-time depth estimation and region selection. The method and system of the present disclosure facilitates seamless integration with devices for effective visual information extraction. The invention’s dynamic focus mechanism, utilizing real-time content analysis, enables automatic transitions between low, medium, and high-resolution imaging to optimize visual information extraction and resource utilization under varying environmental conditions and object orientations.
[097] Integration of Vision-and-Language Transformers (ViLT) for Complex Contextual Analysis: The method and system of the present disclosure incorporates a Vision-and-Language Transformer (ViLT) model optimized for complex text patterns, multi-lingual content, and challenging visual contexts. The ViLT model is integrated with AI/ML and OCR modules to facilitate structured metadata extraction and real-time analysis of intricate visual data, including handwritten records and non-standard text formats. This integration ensures high-precision text recognition across diverse and visually cluttered scenarios.
[098] Real-Time Multi-Sensor Data Fusion for Enhanced Contextual Understanding: The method and system of the present disclosure utilizes a multi-sensor data fusion framework to integrate outputs from heterogeneous sensor modalities, including multi-resolution cameras, Time-of-Flight (ToF) sensors, and accelerometers. This configuration provides a unified approach to interpreting visual and environmental data, enabling accurate correlation and contextual understanding of multi-modal information for advanced operational intelligence.
[099] Incremental Learning Framework with Adaptive Model Refinement: The method and system of the present disclosure introduces an incremental learning framework based on a teacher-student model architecture, wherein a high-accuracy teacher model provides corrective feedback and real-time guidance to a streamlined student model deployed at the edge. This framework enables continuous learning and adaptation of recognition models in response to evolving data patterns, minimizing the need for manual re-training.
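A compact way to picture the teacher-student refinement is the standard knowledge-distillation objective sketched below in PyTorch; the temperature and weighting are illustrative assumptions, and the disclosure's actual feedback mechanism may differ:

```python
# Sketch: student loss = weighted mix of ground-truth loss and a distillation term
# that pulls the student toward the teacher's softened predictions.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Hard-label term against ground truth (or human-validated corrections).
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1.0 - alpha) * soft
```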
[0100] Enhanced Text Recognition and Visual Data Extraction Capabilities: The method and system of the present disclosure is configured to extract textual and visual information from challenging contexts, including dotted text, high-contrast regions, and overlapping text fields. The system’s architecture is optimized to handle complex visual patterns and non-standard text attributes, ensuring accurate extraction and classification of visual data for diverse real-world applications.
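One common preprocessing step for dot-matrix or dotted text, offered here as a hedged illustration rather than the disclosed recognition pipeline, is to morphologically close the thresholded image so that separated dots merge into continuous strokes before OCR; the kernel size and iteration count below are assumptions:

```python
# Sketch: join dotted/dot-matrix characters into solid glyphs before recognition.
import cv2
import numpy as np

def connect_dotted_text(gray_image: np.ndarray) -> np.ndarray:
    # Binarize assuming dark ink on a lighter background (Otsu picks the threshold).
    _, binary = cv2.threshold(
        gray_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    # A small elliptical kernel merges neighbouring dots into continuous strokes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    joined = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Return dark text on a light background, the orientation most OCR engines expect.
    return cv2.bitwise_not(joined)
```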
[0101] Human-in-the-Loop (HITL) Module for Real-Time Model Optimization: The method and system of the present disclosure integrates a Human-in-the-Loop module that enables active user interaction for validation and correction of output. User feedback is incorporated into the method’s and system’s learning pipeline for real-time optimization, thereby enhancing model accuracy and ensuring that the method’s and system’s performance is aligned with operational requirements and user expectations.
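The shape of the Human-in-the-Loop exchange can be sketched as follows; the field names, console prompt, and the idea of queuing corrections for refinement are assumptions used only to illustrate the validation/correction flow:

```python
# Sketch: present an extracted field, accept it as validation or capture a correction.
from dataclasses import dataclass

@dataclass
class FieldReview:
    field_name: str        # e.g. "expiry_date" (hypothetical field)
    predicted_value: str
    reviewed_value: str
    corrected: bool

def review_field(field_name: str, predicted_value: str) -> FieldReview:
    answer = input(
        f"{field_name}: '{predicted_value}' - press Enter to accept or type a correction: "
    )
    reviewed = answer.strip() or predicted_value
    return FieldReview(field_name, predicted_value, reviewed,
                       corrected=(reviewed != predicted_value))

# Corrected records would feed the learning pipeline for model refinement;
# accepted records simply confirm the prediction.
```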
[0102] Integration of Advanced Natural Language Processing (NLP) Models for Interactive Engagement: The method and system of the present disclosure includes natural language understanding capabilities through Large Language Models (LLMs) and Small Language Models (SLMs) to support interactive user queries and provide contextually relevant responses. This feature enables users to engage with the system via natural language commands, facilitating interactive data exploration and decision support.
[0103] Interactive Data Visualization and Control Interface: The method and system of the present disclosure features a customizable data visualization interface, enabling real-time monitoring, data exploration, and parameter adjustments. The interface supports interactive dashboards that provide actionable insights and system feedback, allowing stakeholders to effectively manage and interpret multi-modal data streams.
[0104] Seamless Integration with Enterprise Systems for Operational Interoperability: The method and system of the present disclosure is designed to integrate seamlessly with existing enterprise systems, such as Warehouse Management Systems (WMS) and Enterprise Resource Planning (ERP) platforms, enabling automated data synchronization and comprehensive information flow for enhanced operational efficiency and strategic decision-making.
[0105] Broad Applicability Across Various Industry/Vertical Domains: The method and system of the present disclosure is configured to address a wide range of applications across various sectors, such as retail analytics, logistics, quality assurance, and document digitization, including but not limited to the following use cases:
[0106] Healthcare and Medical Records Digitization: Automating the reading, extraction, and organization of handwritten medical notes, prescriptions, and medication labels. It captures complex and often illegible handwritten records from doctors, nurses, and pharmacists, ensuring accurate data entry into electronic health records (EHR) systems. This eliminates manual transcription errors, speeds up the digitization process, and enhances overall patient care by providing healthcare professionals with reliable, easily accessible medical information. Additionally, the method and system of the present disclosure may handle multilingual medical notes and labels, making it useful in diverse healthcare settings and improving compliance with regulatory standards for accurate record-keeping.
[0107] Financial Services and Banking: Streamlining the processing of financial documents, such as cheques and contracts, to ensure accurate data capture and reduce manual intervention.
[0108] Government (Census, Public Records, Tax Forms): Facilitating the digitization and analysis of large volumes of handwritten and printed data, improving data management for government entities.
[0109] Transportation and Fleet Management: Capturing data from shipping labels and maintenance records in real-time to enhance operational efficiency in logistics and fleet management.
[0110] Legal Services (Contracts, Forms, Case Records): Automating the processing of legal documents to improve workflow efficiency and accuracy, addressing significant pain points in the legal field.
[0111] Insurance (Claims Processing, Handwritten Reports): Automating the capture and processing of handwritten reports and claims forms to improve speed and accuracy in claims handling.
[0112] Education (Exam Papers, Handwritten Assignments, Forms): Digitizing and analyzing handwritten exam papers and assignments for efficient grading and data management in educational institutions.
[0113] Intelligent Character Recognition (ICR) for Handwritten Content Extraction: Incorporating advanced ICR capabilities that enable the capture, extraction, and analysis of content from handwritten records and conversion into properly readable and interpretable text. This functionality allows handwritten records, such as land documents written in any language, to be transcribed and presented in a legible format. For instance, land records written in Tamil can be extracted and tabulated meaningfully in Tamil or translated into English or other languages, enhancing usability and interpretation.
[0114] Broad Applicability Across Various Horizontal Business Processes Across Industries: The method and system of the present disclosure have broad applicability across various business processes, offering automation, speed, and accuracy. These applications span industries, transforming workflows and reducing manual errors.
[0115] Product Inwarding and Inventory Management: The method and system of the present disclosure quickly captures product details such as manufacturing dates, lot numbers, and FSSAI license numbers for food products (or the equivalent license numbers for products regulated by other agencies), preventing errors and ensuring compliance in inventory management.
[0116] Vendor Contract Administration and Compliance: Automates contract summarization, generates compliance checklists, and cross-verifies contract terms with product inwarding and payment claims, streamlining vendor management.
[0117] Invoice Processing and Payment Reconciliation: Captures and compares invoice data with purchase orders and payment claims, while automating bank statement reconciliation, reducing financial errors.
[0118] Purchase Order and Supply Chain Management: Verifies purchase orders and incoming goods, ensuring accuracy in supply chain operations, and reducing delays and errors.
[0119] E-Invoicing and Digital Tax Compliance: Automates e-invoice processing and tax compliance, ensuring accurate data capture and reducing risks in tax reporting.
[0120] Document Archiving and Retrieval: Digitizes and organizes handwritten and printed documents for easy storage and retrieval, enhancing record management.
[0121] Expense and Travel Reimbursement Automation: Processes handwritten receipts and expense forms, improving accuracy and reducing delays in employee reimbursements.
[0122] Maintenance and Service Logs Automation: Automates the capture of handwritten service logs, ensuring real-time updates and improving equipment maintenance records for maintenance of industrial equipment in manufacturing and consumer appliances under service warranty.
[0123] Claims Processing and Insurance Documentation: Speeds up the processing of handwritten claims forms and documents, enhancing efficiency in insurance claim handling.
[0124] The method and system of the present disclosure may be applicable to domains requiring high-precision image analysis, complex text recognition, and contextual data understanding, such as retail analytics, inventory management, logistics, and quality assurance. The method and system of the present disclosure may find specific use in sectors such as Fast-Moving Consumer Goods (FMCG), Home & Personal Care (HPC), Beauty & Wellness, and other industries where multi-modal data extraction and real-time decision-making are critical. By integrating Vision-and-Language Transformers (ViLT), incremental learning frameworks, and adaptive focus mechanisms, the method and system may enable sophisticated multi-resolution image processing and contextual interpretation, providing comprehensive visual information extraction and actionable insights for a wide range of industrial and commercial applications.
[0125] While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.
[0126] The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the orders of processes described herein may be changed and are not limited to the manner described herein.
[0127] Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
[0128] Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
Claims:
We Claim:
1. A method (600) for image sensing, extracting information, and real-time analysis of an object, the method comprising:
capturing (602), by an image capturing device, one or more images of at least one object;
capturing (604) an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device in response to capturing the one or more images and based on a pre-defined threshold value,
wherein the input stream comprises a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate, and a high-resolution stream with a low frame rate;
detecting (606) one or more text by processing the input stream;
recognising (608) one or more text by processing the detected one or more text; and
generating (610) one or more insights related to the one or more images of the object by analysing the one or more recognised text.
2. The method (600) as claimed in claim 1, wherein the image capturing device is a hierarchical multi-resolution image capturing device with dynamic focus adjustment to process the one or more images at varying resolutions and frame rates as per the pre-defined threshold value related to latency, accuracy, and quality of the one or more images.
3. The method (600) as claimed in claim 1, wherein the one or more operating parameters of the one or more sensors of the image capturing device include a changing environmental condition and one or more specific operational requirements of a user.
4. The method (600) as claimed in claim 1, wherein recognising the one or more text comprises:
using an Artificial Intelligence (AI) model to identify and localize one or more text regions in each frame of the input stream, wherein the input stream is initially the low-resolution stream;
generating a confidence score based on the identification and the localization of one or more text regions;
determining, whether the confidence score is above a threshold value or a specific value;
based on the determination, dynamically switching, by a streaming handler, the input stream to one of the medium-resolution stream and the high-resolution stream if the confidence score is above one of the threshold value and the specific value, respectively;
determining one of: the dynamic switching of the input stream and the confidence score of the input stream being below the threshold value;
based on the determination, detecting the one or more text, using the AI model, by processing the input stream; and
recognising the one or more text by processing the detected one or more text.
5. The method (600) as claimed in claim 1, wherein analysing the one or more recognised text comprises:
extracting meta-information from the one or more recognised text using one or more of an AI model, an Optical Character Recognition (OCR) module, and a Vision-and-Language Transformers (ViLT) module;
determining a semantic relationship between the one or more images of the object and the one or more recognised text based on the extracted meta-information; and
generating the one or more insights related to the one or more images of the object based on the determined semantic relationship.
6. The method (600) as claimed in claim 1, further comprising:
displaying the generated one or more insights related to the one or more images of the object and prompting a user for one of:
validation and correction;
cross-referencing the extracted meta-information from the AI model and the ViLT model, on receipt of the validation by the user; and
updating the ViLT model based on the correction captured from the user, on receipt of the correction from the user.
7. A system (104) for image sensing, extracting information, and real-time analysis of an object, the system comprising:
one or more processors (202);
a memory (204) coupled with the one or more processors (202), wherein the one or more processors (202) are configured to:
capture, by an image capturing device (102), one or more images of at least one object;
capture an input stream of the one or more images based on adjusting one or more operating parameters of one or more sensors of the image capturing device (102) in response to capturing the one or more images and based on a pre-defined threshold value,
wherein the input stream comprises a low-resolution stream with a high frame rate, a medium-resolution stream with a moderate frame rate, and a high-resolution stream with a low frame rate;
detect one or more text by processing the input stream;
recognise one or more text by processing the detected one or more text; and
generate one or more insights related to the one or more images of the object by analysing the one or more recognised text.
8. The system (104) as claimed in claim 7, wherein the image capturing device (102) is a hierarchical multi-resolution image capturing device (102) with dynamic focus adjustment to process the one or more images at varying resolutions and frame rates as per the pre-defined threshold value related to latency, accuracy, and quality of the one or more images.
9. The system (104) as claimed in claim 7, wherein the one or more operating parameters of the one or more sensors of the image capturing device (102) include a changing environmental condition and one or more specific operational requirements of a user.
10. The system (104) as claimed in claim 7, wherein to recognise the one or more text, the one or more processors (202) are configured to:
use an Artificial Intelligence (AI) model to identify and localize one or more text regions in each frame of the input stream, wherein the input stream is initially the low-resolution stream;
generate a confidence score based on the identification and the localization of one or more text regions;
determine, whether the confidence score is above a threshold value or a specific value;
based on the determination, dynamically switch, by a streaming handler, the input stream to one of the medium-resolution stream and the high-resolution stream if the confidence score is above one of the threshold value and the specific value, respectively;
determine one of: the dynamic switching of the input stream and the confidence score of the input stream being below the threshold value;
based on the determination, detect the one or more text, using the AI model, by processing the input stream; and
recognise the one or more text by processing the detected one or more text.
11. The system (104) as claimed in claim 7, wherein to analyse the one or more recognised text, the one or more processors (202) are configured to:
extract meta-information from the one or more recognised text using one or more of an AI model, an Optical Character Recognition (OCR) module, and a Vision-and-Language Transformers (ViLT) module;
determine a semantic relationship between the one or more images of the object and the one or more recognised text based on the extracted meta-information; and
generate the one or more insights related to the one or more images of the object based on the determined semantic relationship.
12. The system (104) as claimed in claim 7, wherein the one or more processors (202) are further configured to:
display the generated one or more insights related to the one or more images of the object and prompt a user for one of:
validation and correction;
cross-reference the extracted meta-information from the AI model and the ViLT model, on receipt of the validation by the user; and
update the ViLT model based on the correction captured from the user, on receipt of the correction from the user.