
System And Method For Virtual Interaction With Outdoor Imagery

Abstract: [0001] Disclosed is a method (200) for transforming a visual. The method (200) includes integrating, by way of processing circuitry (120) of an information processing apparatus (106), product data within the context of one or more scenarios to form a modified visual, where the integration is based on human feedback-loop interactions.

Patent Information

Application #:
Filing Date: 28 September 2023
Publication Number: 14/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

Sterlite Technologies Limited
Sterlite Technologies Limited, Capital Cyberscape,15th & 16th Floor Sector 59 Gurugram Haryana India 122102

Inventors

1. Spandan Mahapatra
1292 Octavia Ct., Marietta, GA, 30062
2. Sridhara Ramaiah
1106, Commons Dr, Milford, OH, 45150
3. Satyam Barnwal
15th & 16th Floor, Capital Cyberscape, Sector – 59, Gurugram, Haryana 122102, India
4. Damegunta Lakshmi Susreetha
15th & 16th Floor, Capital Cyberscape, Sector – 59, Gurugram, Haryana 122102, India
5. Sainath Navuluri
15th & 16th Floor, Capital Cyberscape, Sector – 59, Gurugram, Haryana 122102, India

Specification

TECHNICAL FIELD
[0001] The present disclosure relates to the field of computer vision and artificial intelligence (AI), and more specifically to a system and method for virtual interaction with outdoor imagery.
BACKGROUND
[0002] The goal is to create an accurate virtual model that iteratively incorporates continuous feedback on requirements, respects existing constraints, and preserves the context of outdoor spaces. The innovation continuously and precisely fine-tunes and develops an outdoor scenario on an iterative basis until the user is completely satisfied. The machine-learning-based fine-tuning and iterative approach is driven by multimodal input and is not performed manually by a human artist.
[0003] Customers shopping for outdoor products, whether in physical stores or online, often face challenges envisioning these items in their own spaces. Determining whether a product complements their garden, fits appropriately, and matches the existing décor can be difficult. We continuously fine-tune and iteratively build the virtual outdoor scene while incorporating products and materials from a product catalogue.
[0004] US Patent 11170569 outlines a system and method for creating virtual 3D models of indoor scenes from 2D images, facilitating modeling of indoor spaces, furniture, and objects. In another example, US patent 11367250 outlines a method for creating and interacting with virtual 3D indoor room models. It involves capturing indoor imagery, generating a virtual 3D model, providing a user interface, and enabling user interaction with the model through that interface. Some of the existing industry solutions offer reimagined scenarios based on textual prompts. Some solutions offer live augmented reality (AR) on devices such as AR-enabled phones.
[0005] The above solutions are mainly focused on 3D modelling for experience generation, lacking the ability to provide continuous iteration of the design, interaction with design attributes, or real product information. 3D modelling requires a 3D model of each object in the product catalogue, and it takes a lot of time and effort to create a 3D model of each product in the product catalogue. So, we propose the use of 2D photographs, which already exist in the product catalogue. If they do not exist, 2D photographs can easily be taken using the ubiquitous cameras in devices such as mobile phones.
[0006] Existing solutions do not allow for distinct objects, like a lawn and a chair from different images, to be presented as separate elements within a single unified view or scene. Current solutions lack the ability to continuously refine experiences while evolving designs without discarding previous versions. They do not support iterative fine-tuning based on human or autonomous feedback. Additionally, these solutions often demand substantial technical expertise and can be time-consuming. Mixed reality-based solutions lack automatic image regeneration and attribute-based interaction. They mainly involve changing object placements in a target image.
[0007] Our goal is to reimagine outdoor experiences by facilitating interactions with attributes and design-related information for a more dynamic approach.
[0008] Hence, there is a compelling need for a solution that enables the creation of new design experiences through human and machine feedback, autonomously updating and fine-tuning the original design, while allowing product catalogue-based searches and mapping based on human feedback or intent, and incorporating in-painting and replacement of design elements into previous versions of the design experience.
SUMMARY
[0009] In an aspect of the present disclosure, a method for transforming a visual is disclosed. The method has a step of integrating, by way of processing circuitry of an information processing apparatus, product data within a context of one or more scenarios to form a modified visual, where the integration is based on human feedback-loop interactions.
BRIEF DESCRIPTION OF DRAWINGS
[0010] Having thus described the disclosure in general terms, reference will now be made to the accompanying figures, where:
[0011] FIG. 1 illustrates a block diagram of a system for iterative fine tuning of designing an experience with multi modal input interactions.
[0012] FIG. 2A is a block diagram that illustrates an information processing apparatus of FIG. 1.
[0013] FIGs. 2B and 2C illustrate a flowchart of a method of iterative fine tuning for designing an experience with multi modal input interaction.
[0014] FIG. 3 illustrates a flowchart of a method that utilizes a combination of Generative Adversarial Network (GAN) and Reinforcement learning AI technique to continuously make the output design image realistic.
[0015] FIG. 4 illustrates a flowchart of a method for data preparation and vector embedding generation.
[0016] FIG. 5 illustrates a flowchart of a method for creating a vector index with a matching engine.
[0017] FIG. 6 illustrates a flowchart of a method for recommending products based on text input.
[0018] It should be noted that the accompanying figures are intended to present illustrations of exemplary aspects of the present disclosure. These figures are not intended to limit the scope of the present disclosure. It should also be noted that accompanying figures are not necessarily drawn to scale.
DEFINITIONS
[0019] The term “Vector Database” refers to a repository storing high-dimensional vector embeddings that encapsulate the semantic features of products in the catalogue. These embeddings enable efficient cross-modal comparisons and retrieval, forming the basis for accurate product recommendations. The purpose of the vector database is to provide a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience. In our use case, the Vector Database encapsulates the essence of each product, combining text and image information. This unified representation enables effective retrieval of semantically similar products, supporting the recommendation engine's accuracy and relevance.
[0020] The term “Vector embeddings” refers to multi-dimensional numerical representations that capture semantic relationships and the content of data points. In our context, these embeddings encode both text and image information, enabling seamless comparisons and retrieval across modalities. Vector embeddings are a way of representing data in a multi-dimensional space. They are used in machine learning and data science to represent data in a way that is easier to analyze. Vector embeddings are used in recommendation systems because they can quickly find similar objects for a given query.
[0021] The term “Matching Engine” refers to an efficient search mechanism that exploits Approximate Nearest Neighbors (ANN) algorithms. It indexes vector embeddings in the Vector Database, allowing rapid identification of products with high similarity to a given query. A matching engine is a software component that matches incoming data against a set of predefined rules or criteria. It is used in many applications such as fraud detection, recommendation systems, and search engines.
[0022] The term “Approximate Nearest Neighbors (ANN)” refers to a search algorithm designed for efficient retrieval of data points with similar characteristics from a large dataset. It employs techniques like space partitioning (e.g., trees) and hashing to organize data in a way that minimizes the search space. By sacrificing absolute accuracy for speed, ANN quickly identifies a subset of potential neighbors that are likely to be close in distance to the query point. This makes ANN well-suited for high-dimensional spaces where exhaustive search becomes impractical. The algorithm is commonly used in recommendation systems, image retrieval, and other applications that demand fast similarity-based searches.
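By way of a non-limiting illustration only, the following Python sketch shows an approximate nearest-neighbour lookup of the kind described above, using the open-source Annoy library; the embedding dimensionality, the random placeholder data, and the number of trees are assumptions made solely for the sketch.
import numpy as np
from annoy import AnnoyIndex  # approximate nearest neighbours via random-projection trees

dim = 64                                  # assumed embedding dimensionality
index = AnnoyIndex(dim, "angular")        # "angular" distance approximates cosine similarity
embeddings = np.random.rand(1000, dim)    # placeholder catalogue embeddings
for i, vec in enumerate(embeddings):
    index.add_item(i, vec.tolist())
index.build(10)                           # 10 trees; more trees improve recall at a memory cost

query = np.random.rand(dim)
neighbour_ids = index.get_nns_by_vector(query.tolist(), 5)  # top-5 approximate neighbours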
[0023] The term “Multimodal Embedding” refers to a machine-learning technique that projects diverse data modalities, such as text and images, into a shared semantic space. This enables efficient comparison and retrieval across modalities, creating a unified representation where similar inputs are close and dissimilar ones are distant. These embeddings allow for seamless interchangeability between modalities, enabling tasks like image search, content recommendation, and more. Utilizing APIs, users input text or images, receiving high-dimensional embeddings that capture semantic content and relationships. This technique enhances the performance of various applications that involve heterogeneous data analysis.
[0024] The term “Model Functionality” refers to the behaviour of the multimodal embedding model, which combines text (name) and image (image) inputs to generate a specific-dimensional embedding vector representing the product's semantic content. This vector is used for indexing and retrieval in the matching engine, allowing for efficient recommendations based on text inputs.
DETAILED DESCRIPTION
[0025] The detailed description of the appended drawings is intended as a description of the currently preferred aspects of the present disclosure and is not intended to represent the only form in which the present disclosure may be practiced. It is to be understood that the same or equivalent functions may be accomplished by different aspects that are intended to be encompassed within the spirit and scope of the present disclosure.
[0026] Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present technology. Similarly, although many of the features of the present technology are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present technology is set forth without any loss of generality to, and without imposing limitations upon, the present technology.
[0027] The system and method of this invention solve the problem of machine-driven, autonomous, iterative fine-tuning of experience design, subject to user-provided constraints, by seamlessly mashing up (adding noise and reducing noise) and integrating synthetic data (new scenarios designed by generative AI) with actual multimodal product-catalogue data (text, audio, video, and related formats).
[0028] Users can unleash their creative prowess as they effortlessly blend the real and virtual worlds, designing mesmerizing scenes that captivate and inspire. The System and Method’s intelligent recommendation engine utilizes machine learning to suggest design themes and products that align perfectly with their vision. Users can edit, compare, and save experiences to optimize results and drive informed decisions.
[0029] Some of the current features that the system and method of the present disclosure provide are:
CX (Consumer Experience) App on mobile devices
Generative models
Immersive and interactive experiences and personalized recommendations
Personalized shopping experience, user-friendly interface, and the ability to create wish lists
1. Create experiences, save experiences, browse experiences and related products, choose products, and see the impact on budget in real time.
2. Enable users to visualize outdoor products in their actual spaces.
3. Allow users to assess how the items would fit and match their existing environment, helping with purchasing decisions, and to interact with virtual objects in their surroundings as captured by the camera in real time.
The system and method of the present disclosure generate modified images based on textual prompts using a custom implementation of text-to-image generation models such as InstructPix2Pix and Stable Diffusion. The model pipeline involves these key steps:
Essential libraries and packages are imported to enable the functionality required for image manipulation and processing. The pipeline itself is designed to take advantage of the text-to-image generation model, which excels at generating images based on textual instructions.
[0030] The custom pipeline is initialized with crucial components such as a Variational Autoencoder (VAE), tokenizer, text encoder, UNet, scheduler, and an image processor. These components work together harmoniously to facilitate the image modification process.
[0031] Textual prompts are processed to obtain embeddings through tokenization and encoding. These text embeddings act as guides for generating modified images in line with the provided instructions.
Image latent representations are extracted from the input image. These latent representations serve as references, aiding in the manipulation of the image to align with the desired modifications.
[0032] The core of the pipeline involves a denoising process. Latent representations undergo iterative denoising over multiple time steps. Noise is gradually introduced and then removed, effectively modifying the image while adhering to the input instructions and reference latent representations. Once the denoising process is complete and the latent representations are refined, they are utilized to generate the final modified image. By employing the decoding process of the VAE, latent representations are transformed back into the image space. Further, the system and method of the present disclosure demonstrate how a custom pipeline, composed of carefully coordinated components, can seamlessly integrate text processing, latent manipulation, and image generation to yield modified images that adhere to textual prompts. This process showcases the capacity of machine learning models to be adapted and tailored to specific tasks, offering a coherent framework for creating meaningful image modifications based on provided instructions.
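For illustration only, and not as a limitation of the custom pipeline described above, a comparable instruction-guided editing flow can be driven through the open-source diffusers library as sketched below; the model identifier, prompt, file names, and guidance values are assumptions for the sketch.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# The pipeline object bundles the components referenced above (VAE, tokenizer,
# text encoder, UNet, scheduler, and image processor).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

input_image = Image.open("backyard.jpg").convert("RGB")  # assumed input outdoor image
prompt = "add two wooden chairs on the lawn"             # assumed textual instruction

edited = pipe(
    prompt,
    image=input_image,
    num_inference_steps=20,       # iterative denoising steps
    image_guidance_scale=1.5,     # how closely the output follows the input image
    guidance_scale=7.0,           # how closely the output follows the text prompt
).images[0]
edited.save("backyard_modified.jpg")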
[0033] FIG. 1 illustrates a block diagram of a system 100 for iterative fine tuning of designing an experience with multi modal input interactions for virtual interaction with outdoor imagery. The system 100 may be configured to facilitate in reimagining and continuously fine-tuning outdoor experiences. The system 100 may be further configured to integrate new scenarios with product data through autonomous processes and human feedback-loop interactions, making virtual interactions with outdoor imagery possible for a user. Specifically, the system 100 may be configured to rely on an input mechanism, incorporating elements such as an outdoor image, a product data store, and user-generated input such that the inputs undergo computational processes, culminating in intricate spatial arrangement of the products within the context of the outdoor image.
[0034] The system 100 may include a user device 102, a plurality of sensors 104, and an information processing apparatus 106. The user device 102, the plurality of sensors 104, and the information processing apparatus 106 may be configured to communicate with each other and other entities within the system 100 by way of a communication network 108 and/or through separate communication networks established therebetween.
[0035] The communication network 108 may include suitable logic, circuitry, and interfaces that may be configured to provide a plurality of network ports and a plurality of communication channels for transmission and reception of data related to operations of various entities in the system 100. Each network port may correspond to a virtual address (or a physical machine address) for transmission and reception of the communication data. For example, the virtual address may be an Internet Protocol version 4 (IPv4) address (or an IPv6 address) and the physical address may be a Media Access Control (MAC) address. The communication network 108 may be associated with an application layer for implementation of communication protocols based on one or more communication requests from the user device 102, the plurality of sensors 104, and the information processing apparatus 106. The communication data may be transmitted or received via the communication protocols. Examples of the communication protocols may include, but are not limited to, Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP), Domain Name System (DNS) protocol, Common Management Interface Protocol (CMIP), Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Long Term Evolution (LTE) communication protocols, or any combination thereof. In one embodiment, the communication data may be transmitted or received via at least one communication channel of a plurality of communication channels in the communication network 108. The communication channels may include, but are not limited to, a wireless channel, a wired channel, or a combination of wireless and wired channels. The wireless or wired channel may be associated with a data standard which may be defined by one of a Local Area Network (LAN), a Personal Area Network (PAN), a Wireless Local Area Network (WLAN), a Wireless Sensor Network (WSN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Metropolitan Area Network (MAN), a Satellite Network, the Internet, a Fiber Optic Network, a Coaxial Cable Network, an Infrared (IR) network, a Radio Frequency (RF) network, and a combination thereof. Embodiments of the present invention are intended to include or otherwise cover any type of communication channel, including known, related art, and/or later developed technologies.
[0036] The user device 102 may be adapted to facilitate a user to input data, receive data, and/or transmit data within the system 100. In some embodiments of the present disclosure, the user device 102 may be, but is not limited to, a desktop, a notebook, a laptop, a handheld computer, a touch sensitive device, a computing device, a smart phone, a smart watch, and the like. It will be apparent to a person of ordinary skill in the art that the user device 102 may be any device/apparatus that is capable of manipulation by the user, without deviating from the scope of the present disclosure. Although FIG. 1 illustrates that the system 100 includes a single user device (i.e., the user device 102), it will be apparent to a person skilled in the art that the scope of the present disclosure is not limited to it. In various other aspects, the system 100 may include multiple user devices without deviating from the scope of the present disclosure. In such a scenario, each user device is configured to perform one or more operations in a manner similar to the operations of the user device 102 as described herein. As illustrated, the user device 102 may have an interface 110, a processing unit 112, and a memory 114.
[0037] The interface 110 may include an input interface for receiving inputs from the user. Examples of the input interface may include, but are not limited to, a touch interface, a mouse, a keyboard, a motion recognition unit, a gesture recognition unit, a voice recognition unit, or the like. Aspects of the present disclosure are intended to include or otherwise cover any type of the input interface including known, related art, and/or later developed technologies. The interface 110 may further include an output interface for displaying (or presenting) an output to the user. Examples of the output interface may include, but are not limited to, a display device, a printer, a projection device, and/or a speaker. Examples of the interface 110 may include, but are not limited to, a digital display, an analog display, a touch screen display, a graphical user interface, a website, a webpage, a keyboard, a mouse, a light pen, an appearance of a desktop, and/or illuminated characters.
[0038] The processing unit 112 may include suitable logic, instructions, circuitry, and/or interfaces for executing various operations, such as one or more operations associated with the user device 102. In some aspects of the present disclosure, the processing unit 112 may be configured to control the one or more operations executed by the user device 102 in response to an input received at the user device 102 from the user. Examples of the processing unit 112 may include, but are not limited to, an Application-Specific Integrated Circuit (ASIC) processor, a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Field-Programmable Gate Array (FPGA), a Programmable Logic Control unit (PLC), and the like. Aspects of the present disclosure are intended to include or otherwise cover any type of the processing unit 112 including known, related art, and/or later developed technologies.
[0039] The memory 114 may be configured to store logic, instructions, circuitry, interfaces, and/or codes of the processing unit 112, data associated with the user device 102, and data associated with the system 100. Examples of the memory 114 may include, but are not limited to, a Read Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (FM), a Removable Storage Drive (RSD), a Hard Disk Drive (HDD), a Solid-State Memory (SSM), a Magnetic Storage Drive (MSD), a Programmable Read Only Memory (PROM), an Erasable PROM (EPROM), and/or an Electrically EPROM (EEPROM). Aspects of the present disclosure are intended to include or otherwise cover any type of the memory 114 including known, related art, and/or later developed technologies.
[0040] In some aspects of the present disclosure, the user device 102 may further have one or more computer executable applications configured to be executed by the processing unit 112. The one or more computer executable applications may have suitable logic, instructions, and/or codes for executing various operations associated with the system 100. The one or more computer executable applications may be stored in the memory 114. Examples of the one or more computer executable applications may be, but are not limited to, an audio application, a video application, a social media application, a navigation application, and the like. Preferably, the one or more computer executable applications may include an application 116. Specifically, one or more operations associated with the application 116 may be controlled by the information processing apparatus 106 that will be explained in detail in FIG. 2.
[0041] The user device 102 may further have a communication interface 118. The communication interface 118 may be configured to enable the user device 102 to communicate with the information processing apparatus 106 and other components of the system 100 over the communication network 108. Examples of the communication interface 118 may be, but are not limited to, a modem, a network interface such as an Ethernet Card, a communication port, and/or a Personal Computer Memory Card International Association (PCMCIA) slot and card, an antenna, a Radio Frequency (RF) transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a Coder Decoder (CODEC) Chipset, a Subscriber Identity Module (SIM) card, and a local buffer circuit. It will be apparent to a person of ordinary skill in the art that the communication interface 118 may include any device and/or apparatus capable of providing wireless and/or wired communications between the user device 102 and other components of the system 100 over the communication network 108.
[0042] The plurality of sensors 104 may be installed in a vicinity where the user is positioned to experience a scenario environment. In some aspects of the present disclosure, the plurality of sensors 104 may be configured to sense signals that represent one or more parameters associated with an environment of the user. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the parameters associated with the user that may facilitate to efficiently monitor the environment of the user, without deviating from the scope of the present disclosure. In some aspects of the present disclosure, the plurality of sensors 104 may be, but is not limited to, a proximity sensor, an ultrasonic sensor, an imaging sensor, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the plurality of sensors 104, including known, related, and later developed sensors, without deviating from the scope of the present disclosure.
[0043] The information processing apparatus 106 may be a network of computers, a framework, or a combination thereof, that may provide a generalized approach to create a server implementation. The information processing apparatus 106 may have, but is not limited to, personal computers, laptops, mini-computers, mainframe computers, any non-transient and tangible machine that can execute a machine-readable code, cloud-based servers, distributed server networks, or a network of computer systems. The information processing apparatus 106 may be realized through various web-based technologies such as, but not limited to, a Java web-framework, a .NET framework, a personal home page (PHP) framework, or any other web-application framework. The information processing apparatus 106 may include one or more processing circuitries of which processing circuitry 120 is shown and a data store 122.
[0044] The processing circuitry 120 may be configured to execute various operations associated with the system 100. The processing circuitry 120 may be configured to host and enable the application 116 running on (and/or installed on) the user devices 102 to execute one or more operations associated with the system 100 by communicating one or more commands and/or instructions over the communication network 108. Examples of the processing circuitry 120 may be, but are not limited to, an ASIC processor, a RISC processor, a CISC processor, a FPGA, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the processing circuitry 120 including known, related art, and/or later developed technologies.
[0045] The data store 122 may be configured to store the logic, instructions, circuitry, interfaces, and/or codes of the processing circuitry 120 for executing various operations. The data store 122 may be further configured to store therein data associated with the users registered with the system 100, and the like. In some aspects of the present disclosure, the data store 122 may be configured to receive and/or access unstructured data from the internet (such as shopping websites, social media websites, and the like). It will be apparent to a person having ordinary skill in the art that the data store 122 may be configured to store various types of data associated with the system 100, without deviating from the scope of the present disclosure. Examples of the data store 122 may include, but are not limited to, a Relational database, a NoSQL database, a Cloud database, an Object-oriented database, and the like. Further, the data store 122 may include associated memories that may be, but are not limited to, a ROM, a RAM, a flash memory, a removable storage drive, a HDD, a solid-state memory, a magnetic storage drive, a PROM, an EPROM, and/or an EEPROM. Aspects of the present disclosure are intended to include or otherwise cover any type of the data store 122 including known, related art, and/or later developed technologies. In some aspects of the present disclosure, a set of centralized or distributed network of peripheral memory devices may be interfaced with the information processing apparatus 106, as an example, on a cloud server.
[0046] FIG. 2A is a block diagram that illustrates the information processing apparatus 106 of FIG. 1. As discussed, the information processing apparatus 106 has the processing circuitry 120 and the data store 122. Further, the information processing apparatus 106 may have a network interface 200 and an input/output (I/O) interface 202. The processing circuitry 120, the data store 122, the network interface 200, and the I/O interface 202 may communicate with each other by way of a first communication bus 204. In some aspects of the present disclosure, the processing circuitry 120 may have a registration engine 206, a data collection engine 208, a visual generation engine 210, an embeddings generation engine 212, a recommendation engine 214, and a search engine 216. The registration engine 206, the data collection engine 208, the visual generation engine 210, the embeddings generation engine 212, the recommendation engine 214, and the search engine 216 may communicate with each other by way of a second communication bus 218. It will be apparent to a person having ordinary skill in the art that the information processing apparatus 106 is for illustrative purposes and not limited to any specific combination of hardware circuitry and/or software.
[0047] The network interface 200 may have suitable logic, circuitry, and interfaces that may be configured to establish and enable a communication between the information processing apparatus 106 and different components of the system 100, via the communication network 108. The network interface 200 may be implemented by use of various known technologies to support wired or wireless communication of the information processing apparatus 106 with the communication network 108. The network interface 200 may be, but is not limited to, an antenna, a RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC chipset, a SIM card, and a local buffer circuit.
[0048] The I/O interface 202 may have suitable logic, circuitry, interfaces, and/or code that may be configured to receive inputs and transmit server outputs (i.e., one or more outputs generated by the information processing apparatus 106) via a plurality of data ports in the information processing apparatus 106. The I/O interface 202 may have various input and output data ports for different I/O devices. Examples of such I/O devices may be, but are not limited to, a touch screen, a keyboard, a mouse, a joystick, a projector audio output, a microphone, an image-capture device, a liquid crystal display (LCD) screen and/or a speaker.
[0049] The processing circuitry 120 may be configured to execute various operations associated with the system 100. Specifically, the processing circuitry 120 may be configured to execute the one or more operations associated with the system 100 by communicating one or more commands and/or instructions over the communication network 108 to the user device 102, the plurality of sensors 104, and other entities in the system 100.
[0050] The processing circuitry 120 may be configured to perform one or more operations associated with the system 100 by way of the registration engine 206, the data collection engine 208, the visual generation engine 210, the embeddings generation engine 212, the recommendation engine 214, and the search engine 216. In some aspects of the present disclosure, the registration engine 206 may be configured to enable a user to register into the system 100 by providing registration data through a registration menu (not shown) of the application 116 displayed through the user device 102, respectively. The registration data may be, but is not limited to, a name, a demographics, a contact number, an address, and the like. Aspects of the present disclosure are intended to include or otherwise cover any type of the registration data. In some aspects of the present disclosure, the registration engine 206 may be further configured to enable the user to create a login identifier and a password that may enable the user to subsequently login into the system 100. The registration engine 206 may be configured to store the registration data associated with the user, the login and the password associated with the user in a Look Up Table (LUT) (not shown) provided in the data store 122.
[0051] The data collection engine 208 may be configured to receive an input query from the user by way of the application 116 installed on the user device 102. In some aspects of the present disclosure, the input query may include, but is not limited to, one or more scenarios, an image, product data (from one or more product catalogues or unstructured social media and web content), a text (e.g., “decorate my backyard with chairs for relaxing”), and the like. Specifically, the text may be provided in a free-flow language. In some aspects of the present disclosure, the data collection engine 208 may be further configured to receive the sensed signals from the plurality of sensors 104.
[0052] The visual generation engine 210 may be configured to transform a visual by integrating the product data within a context of one or more scenarios to form an output visual and/or image (having modified visuals). In some aspects of the present disclosure, the visual generation engine 210 may be configured to generate one or more scenarios by using a generative Artificial Intelligence (AI) technique. In some aspects of the present disclosure, the visual generation engine 210 may be configured to generate one or more synthetic scenarios (i.e., the one or more scenarios) by using the generative Artificial Intelligence (AI) technique. In some aspects of the present disclosure, the visual generation engine 210 may be configured to generate the one or more synthetic scenarios based on the sensed signals received from the plurality of sensors 104. In some other aspects of the present disclosure, the visual generation engine 210 may be configured to import the one or more scenarios from a list of scenarios stored in the data store 122 based on the input query. In some aspects of the present disclosure, the visual generation engine 210 may be configured to process the input query using one or more neural network architectures.
[0053] The visual generation engine 210 may be configured to generate a visual based on the input query provided by the user (i.e., the one or more scenarios, the image, the product data, the text). In some aspects of the present disclosure, the visual generation engine 210 may be configured to transform the inputs received via the input query using Variational Autoencoders (VAEs) to enhance a spatial arrangement of one or more products within the one or more scenarios. Specifically, to enhance a spatial arrangement of one or more products within the one or more scenarios using the Variational Autoencoders (VAEs), the visual generation engine 210 generates the visual. To generate the visual based on the text and the image, the visual generation engine 210 modifies the image received via the input query in accordance with the text input to generate the visual with a recommended product (provided by way of the recommendation engine 214) such that the recommended product is added to the image based on the text to generate the modified visual using generative neural networks while considering the one or more scenarios. In some aspects of the present disclosure, the generative neural networks may be, but are not limited to, Generative Adversarial Networks (GANs). In some aspects of the present disclosure, the recommended image may be provided by AI automatically in terms of colour, size, and numbers. For example, when the text input is “input chair in garden”, the visual generation engine 210 in communication with the recommendation engine 214 may recommend images with a varying number of products (i.e., chairs) depending on a size of the garden using the generative AI technique.
[0054] The visual generation engine 210 may be further configured to detect one or more products in the generated visual. In some aspects of the present disclosure, to detect one or more products in the generated visual, the visual generation engine 210 may be configured to utilize an object detection model such as, but not limited to, YOLO, SSD, RetinaNet, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the object detection model that may be capable of detecting one or more products from the generated visual, known to a person having ordinary skill in the art, without deviating from the scope of the present disclosure. Further, the visual generation engine 210 may be configured to extract and save one or more coordinates of the one or more products from the visual. In some aspects of the present disclosure, the visual generation engine 210 may be configured to generate masked inpainted images using the pinpointed coordinates, integrating the text, the one or more scenarios, and the product data of the input query. In other words, the visual generation engine 210 may be configured to generate a masked visual based on the extracted coordinates of the one or more products. The visual generation engine 210 may be further configured to remove any generic product from the generated visual by utilizing inpainting to generate an inpainted visual. Further, the visual generation engine 210 may be configured to crop required areas using the one or more extracted coordinates. In some aspects of the present disclosure, the visual generation engine 210 may be configured to pass one or more API calls to one or more data sources to retrieve data (specifically, one or more images of one or more products that match the product data provided by the user). In some aspects of the present disclosure, the one or more data sources may be, but are not limited to, a product catalogue, product searches, social media searches, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the one or more data sources, without deviating from the scope of the present disclosure.
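As a non-limiting sketch of the detection and masking steps described above (the file names, model weights, and placeholder data are assumptions), an off-the-shelf YOLO model may be used as follows to extract product coordinates and build a mask from the bounding boxes.
import numpy as np
import cv2
from ultralytics import YOLO   # one possible off-the-shelf object detection model

model = YOLO("yolov8n.pt")                      # assumed pre-trained weights
image = cv2.imread("generated_visual.jpg")      # the generated visual
results = model(image)[0]

mask = np.zeros(image.shape[:2], dtype=np.uint8)
coordinates = []
for box in results.boxes.xyxy.cpu().numpy():    # (x1, y1, x2, y2) per detected product
    x1, y1, x2, y2 = box.astype(int)
    coordinates.append((x1, y1, x2, y2))        # saved for later inpainting and compositing
    mask[y1:y2, x1:x2] = 255                    # masked region marks the generic product

cv2.imwrite("masked_visual.png", mask)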
[0055] The visual generation engine 210 may be further configured to generate a product image with the background removed. Specifically, the visual generation engine 210 may be further configured to remove the background from the data (specifically, one or more images of one or more products that match the product data provided by the user) by way of one or more background removal tools. In some aspects of the present disclosure, the one or more background removal tools may be, but are not limited to, REMBG, GIMP, removebg, ai-background-remove, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the one or more background removal tools, without deviating from the scope of the present disclosure. For example, the visual generation engine 210 may be configured to remove the background from the one or more images to extract the product image to make the one or more images amenable for being inpainted in the inpainted visual. The visual generation engine 210 may be configured to place the product image on the cropped visual and save it as a composite visual.
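A minimal, non-limiting sketch of the background-removal and compositing step, assuming the REMBG tool (the rembg Python package) and illustrative file names:
from PIL import Image
from rembg import remove   # one of the background-removal tools mentioned above

product = Image.open("catalogue_chair.jpg")          # assumed catalogue product image
product_cutout = remove(product)                     # RGBA image with background removed

cropped_region = Image.open("cropped_region.png").convert("RGBA")  # region cropped earlier
product_cutout = product_cutout.resize(cropped_region.size)
cropped_region.alpha_composite(product_cutout)       # place the product on the cropped visual
cropped_region.save("composite_visual.png")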
[0056] The visual generation engine 210 may be configured to iteratively fine-tune at least a portion of the modified visual by altering at least a part of the one or more products or the one or more synthetic scenarios. For example, the visual generation engine 210 may be configured to fine-tune the AI-based generation of scenario data (synthetic scenario) and the integration of the product data with the scenario data by placing the product image on the cropped visual and saving it as a composite visual. Further, the visual generation engine 210 may be configured to place all the composite visuals in the inpainted visual according to the one or more coordinates to generate a combined visual (interchangeably referred to as the combined image). In some aspects of the present disclosure, the visual generation engine 210 may be configured to utilize a reinforcement learning agent such that the reinforcement learning agent interacts with the aforementioned processing steps until a visual score meets an expected fidelity of the modified visual during training of the reinforcement learning agent.
[0057] Specifically, the visual generation engine 210 utilizes a combination of Generative Adversarial Networks (GANs) and Reinforcement Learning (RL) techniques to continuously improve the realism of the output design visual (i.e., the modified visual). In other words, the visual generation engine 210 may be configured to utilize Generative Adversarial Reinforcement Learning (GARL) or Reinforcement Learning in Generative Adversarial Networks (RLGAN), which leverage the strengths of both GANs and RL to achieve more realistic and fine-tuned image generation. Specifically, GANs provide a strong baseline for generating images, ensuring a certain level of realism. On the other hand, RL may be used to improve the GAN's output by rewarding the GAN when the GAN produces more realistic images. In some aspects of the present disclosure, an RL agent may be trained to provide rewards and/or penalties to the generator based on the realism of its output. Further, the RL agent receives feedback from human evaluators and also uses other criteria to assess visual quality. Through RL, the generator learns to produce visuals that not only fool the discriminator but also satisfy additional criteria for realism. The visual generation engine 210 may be further configured to continuously train the generator using a combination of GAN training and RL fine-tuning such that, over time, the generator becomes better at producing highly realistic visuals. In some aspects of the present disclosure, the visual generation engine 210 may be configured to adjust the feedback mechanisms and reward structures to adapt to changing requirements and user preferences. The visual generation engine 210 may be further configured to monitor the quality of generated visuals over time and assess the realism and quality of the output by generating evaluation metrics, user feedback, or other indicators. In other words, combining GANs and RL facilitates the continuous improvement of visual generation, with the generator learning to produce increasingly realistic designs over time in applications like image generation, where realism and quality are critical.
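The following minimal PyTorch sketch illustrates, without limitation, how an external reward signal (standing in for the RL feedback described above) may be combined with the adversarial objective of the generator; the generator, discriminator, reward_model, and reward_weight are assumed placeholders rather than the claimed implementation.
import torch
import torch.nn.functional as F

# generator, discriminator, and reward_model are assumed, pre-defined torch.nn.Module objects;
# reward_model stands in for the RL feedback (human ratings or a learned realism critic).
def generator_step(generator, discriminator, reward_model, optimizer, noise, reward_weight=0.1):
    optimizer.zero_grad()
    fake_images = generator(noise)
    # Standard adversarial objective: fool the discriminator.
    adv_logits = discriminator(fake_images)
    adversarial_loss = F.binary_cross_entropy_with_logits(
        adv_logits, torch.ones_like(adv_logits))
    # Reward term: a higher reward (more realistic per the RL feedback) lowers the loss.
    reward = reward_model(fake_images).mean()
    loss = adversarial_loss - reward_weight * reward
    loss.backward()
    optimizer.step()
    return loss.item()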
[0058] Specifically, the visual generation engine 210 may be configured to generate visuals with one or more products using available image generation models and to fine-tune the visual generation model with a training dataset. The visual generation engine 210 may be further configured to compare the image provided in the input query with the generated modified visual and fine-tune the generated modified visual when a score associated with the generated modified image is less than the minimum score (N). Specifically, the score may be defined as a visual difference, and the score may be determined as follows, where input_image and generated_modified_visual denote the input image and the generated modified visual, respectively:
import cv2
import numpy as np
# Pixel-wise difference between the input image and the generated modified visual
bitwise_xor_result = cv2.bitwise_xor(input_image, generated_modified_visual)
N = np.count_nonzero(bitwise_xor_result)
The visual generation engine 210 may be configured to fine-tune the generated modified visual until the minimum score (N) is less than or equal to the sum of pixels of the objects identified by the object detection model (such as YOLO, SSD, or RetinaNet).
[0059] In some aspects of the present disclosure, the visual generation engine 210 may be further configured to dynamically integrate the modified visual with real-time product data using generative models. Specifically, the visual generation engine 210 may be further configured to dynamically fetch the recommended product data from the data store 122 and integrate the modified visual with the real-time product data using generative models. In some aspects of the present disclosure, the visual generation engine 210 may be further configured to enable the user to engage in virtual interactions with the modified visual. Specifically, the visual generation engine 210 may be configured to enable the user to edit and/or change parameters of the modified visual. In some aspects of the present disclosure, the parameters may be, but are not limited to, a colour of the modified visual, a size of the modified visual, an orientation of the product in the modified visual, zoom-in, zoom-out, and the like. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the parameters associated with the modified visuals, without deviating from the scope of the present disclosure. In some aspects of the present disclosure, the visual generation engine 210 may be configured to enable the user to delete, modify, and/or replace the products in the modified visual. Specifically, the products imported onto the background and/or the products in the background may be deleted, modified, and/or replaced such that the entire previous image is not erased, which facilitates the user not only buying new products but also replacing existing products.
[0060] The embeddings generation engine 212 may be configured to implement a multimodal embedding model that may utilize multimodal fusion layers to fuse the image and text vectors through additional dense layers to create the multimodal embedding representation. In some aspects of the present disclosure, the multimodal fusion layers may have input layers for the image and text embeddings, hidden layers to learn cross-view interactions and correlations, output layer merging inputs into a joint M-dimensional multimodal embedding. In some aspects of the present disclosure, the multimodal fusion layers may be trained end-to-end with the encoders using the paired image-text triplets and loss.
[0061] The embeddings generation engine 212 may be configured to generate the multimodal embeddings using a multi-input neural network architecture. In some aspects of the present disclosure, the multi-input neural network architecture may have an image convolutional neural network (CNN) encoder based on the VGG architecture with 16 convolutional layers and 3 fully connected layers. Batch normalization and dropout regularization with a rate of 0.5 may be applied between layers. Further, the CNN maps images to specific-dimensional embeddings that capture semantic visual features. A text recurrent neural network (RNN) encoder (S430) uses a bidirectional LSTM architecture with 64 hidden units in each direction. The RNN encodes text sequences into specific-dimensional embeddings capturing semantic meaning. The multimodal fusion layers may be made up of 3 fully connected layers mapping the image and text embeddings to a joint specific-dimensional multimodal embedding space. In some aspects of the present disclosure, the network may be optimized using Adam or RMSprop with learning rate scheduling and early stopping. Batch normalization improves stability for fast convergence.
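A non-limiting PyTorch sketch of such a multi-input architecture is given below; the vocabulary size, embedding dimensions, and layer sizes are assumptions, and the VGG backbone and bidirectional LSTM stand in for the image and text encoders described above.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MultimodalEmbedder(nn.Module):
    """Minimal sketch of the multi-input network described above (sizes are assumptions)."""
    def __init__(self, vocab_size=10000, text_dim=128, joint_dim=256):
        super().__init__()
        self.image_encoder = vgg16(weights=None)                     # VGG-style CNN backbone
        self.image_encoder.classifier[6] = nn.Linear(4096, joint_dim)
        self.embedding = nn.Embedding(vocab_size, text_dim)
        self.text_rnn = nn.LSTM(text_dim, 64, batch_first=True, bidirectional=True)
        self.text_proj = nn.Linear(128, joint_dim)
        self.fusion = nn.Sequential(                                  # dense fusion layers
            nn.Linear(2 * joint_dim, joint_dim), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, image, token_ids):
        img_vec = self.image_encoder(image)                           # (B, joint_dim)
        _, (h_n, _) = self.text_rnn(self.embedding(token_ids))
        txt_vec = self.text_proj(torch.cat([h_n[-2], h_n[-1]], dim=1))  # both LSTM directions
        return self.fusion(torch.cat([img_vec, txt_vec], dim=1))      # joint multimodal embedding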
[0062] In some aspects of the present disclosure, the training of the image encoder and the text encoder may be performed on paired image-text data using loss functions like triplet loss to ensure proper similarity mappings. In some aspects of the present disclosure, the image encoder and the text encoder may be configured to learn individual embedding functions and shared aligned attention layers during training using positive and negative pairs such that all possible modality pairs are formed for loss computation.
[0063] The recommendation engine 214 may be configured to implement a multimodal embedding model that may be further configured to generate specific-dimension vectors based on the input (which includes a combination of image data and/or text data). Specifically, the multimodal embedding module may utilize convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to encode the images and the text into dense vector representations.
[0064] In some aspects of the present disclosure, the embedding vectors can be used for subsequent tasks like image classification and/or content moderation. Specifically, the recommendation engine 214 may be configured to extract one or more features from the raw input (which includes a combination of image data and/or text data) using respective encoders. For example, an image encoder (e.g., a CNN encoder) may be used for the image input and a text encoder (e.g., a sent2vec encoder) may be used for the text input. In some aspects of the present disclosure, the image encoder may have a convolutional base for feature extraction followed by pooling and fully connected layers to reduce the activations into a fixed-length vector. The image encoder CNN architecture may have convolutional layers with 3x3 kernels, ReLU activations, and He normal initialization for feature extraction, max pooling layers for spatial dimensionality reduction, a flattening layer to reshape 3D feature maps to 1D, fully connected layers with dropout regularization to reduce features and project into the embedding space, and an output layer with linear activation producing a fixed-length K-dimensional image embedding vector. In some aspects of the present disclosure, the image encoder (CNN) may be trained via backpropagation using an image triplet loss function to learn an embedding space where distance correlates to image similarity. Online triplet mining is used to generate hard triplets for effective training. Specifically, the image encoder (e.g., a CNN encoder) may be trained on the images of one or more products to learn visual features and mappings to the embedding space.
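For illustration only, a single triplet-loss training step for the image encoder may look like the following sketch, where image_encoder, optimizer, and the anchor/positive/negative batches are assumed placeholders.
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)   # distance-based triplet objective

def triplet_step(image_encoder, optimizer, anchor, positive, negative):
    # anchor/positive are images of the same product; negative is a different product.
    optimizer.zero_grad()
    loss = triplet_loss(image_encoder(anchor),
                        image_encoder(positive),
                        image_encoder(negative))
    loss.backward()
    optimizer.step()
    return loss.item()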
[0065] In some aspects of the present disclosure, the text encoder may be configured to process sequences of word embeddings with an RNN such as an LSTM or GRU. The text encoder RNN may have an embedding layer to map word tokens to dense vectors, LSTM/GRU recurrent layers to model semantic sequences, an attention mechanism to weigh important hidden states, and an output dense layer to aggregate state outputs into an L-dimensional text embedding vector. In some aspects of the present disclosure, the text encoder may be trained using a text triplet loss function for the embedding space mapping. Teacher forcing is used for stability during training. Specifically, the hidden state outputs are aggregated to produce a text vector summarizing the input sequence. The text encoder may be trained on a description of the products to learn textual features and mappings.
[0066] Further, the recommendation engine 214 may be configured to convert all input branches to JSON files. Specifically, to convert all input branches to JSON files, the first step is to convert the input modality mx into a modality-specific vector representation hx, where x ∈ {i, s, a, v, 3d}; here, i and s denote the image and text (sentence) modalities, respectively. In some aspects of the present disclosure, the recommendation engine 214 may be configured to resize the images to the desired dimensions while maintaining the aspect ratios of the images prior to converting the input modality mx into the modality-specific vector representation hx. Specifically, the resizing of the images may ensure that the images seamlessly fit into the outdoor imagery, providing a consistent and visually appealing user experience.
[0067] For example, hi and hs denote the encodings of the image and sentence modalities. The recommendation engine 214 may be configured to utilize a pre-trained convolutional neural network (CNN) as an encoder for the images. The encoder may be followed by modality-independent embedding functions Fx, giving modality-specific embeddings gx. Fx may have a series of three fully connected layers with “tanh” activations of size (1024 × 512 × 1408). The individual embedding functions gx may be followed by an aligned attention layer that has shared weights for each input modality pair hx. The output of the attention layer is the JSON representation for all input modalities hx. Further, the output representation may be a specific-dimensional vector for each input, i.e., fi and fs for image and sentence inputs, respectively.
[0068] In some aspects of the present disclosure, the data store 122 may be configured to store the multimodal product embeddings indexed using Multi-Index Hashing (MIH) for efficient approximate nearest neighbour search. Specifically, hash tables may be constructed using locality-sensitive hashing functions including p-stable LSH and cosine LSH. In some aspects of the present disclosure, the multimodal product embeddings may be stored in a quantized and compressed format to reduce storage overhead. Further, the product data store uses locality-sensitive hashing (LSH) to index the multimodal embeddings for fast approximate nearest neighbour search. Specifically, LSH involves using multiple hash functions that map similar vectors to the same buckets with high probability. Further, the multimodal product embeddings are indexed into hash tables using different hash functions like signed random projections, cosine hashes, and the like. In some aspects of the present disclosure, at search time, the query embedding may be hashed using the same functions to find candidate buckets. The contents may be scanned to retrieve approximate neighbors and ranked by a true distance metric. Specifically, multiple hash tables with different hash functions improve recall at the expense of more memory overhead. In some aspects of the present disclosure, the entries are stored and optimized for vector similarity computations using inverted indexes, column vectors, or trees. Updates involve re-hashing and rebuilding impacted indices. The data store 122 may be a vector database that serves as a repository storing high-dimensional vector embeddings that encapsulate the semantic features of products in the catalogue. Such embeddings enable efficient cross-modal comparisons and retrieval, forming the basis for accurate product recommendations. The vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience.
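A minimal, non-limiting NumPy sketch of signed-random-projection (cosine) LSH indexing of the kind described above is shown below; the embedding dimensionality, number of hash bits, number of tables, and placeholder catalogue are assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits, n_tables = 256, 16, 4                            # assumed sizes for the sketch
projections = rng.standard_normal((n_tables, n_bits, dim))    # one projection matrix per table

def hash_vector(vec):
    """Signed random projection: one n_bits-bit bucket key per hash table."""
    signs = (projections @ vec) > 0            # (n_tables, n_bits) sign pattern
    return [tuple(row) for row in signs]

# Index the catalogue embeddings into n_tables hash tables.
tables = [dict() for _ in range(n_tables)]
catalogue = rng.standard_normal((1000, dim))   # placeholder multimodal embeddings
for item_id, emb in enumerate(catalogue):
    for table, key in zip(tables, hash_vector(emb)):
        table.setdefault(key, []).append(item_id)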
[0069] The search engine 216 may be configured to enable retrieval of the recommendations. Specifically, to retrieve the recommendations, the search engine 216 may be configured to encode the input query into a multimodal embedding vector using the trained neural network technique. Further, the search engine 216 may be configured to perform an approximate cosine similarity search on the product data store using Multi-Index Hashing to identify the nearest neighbors within a radius ε = 0.1.
[0070] The search engine 216 may be configured to apply the trained image and text encoders to the input query to generate a multimodal embedding vector. The multimodal embedding vector may be hashed using the same LSH functions as the index to find approximate matching buckets in the product data store 122. In some aspects of the present disclosure, an exploration algorithm may expand the candidate buckets in best-first order using a priority queue based on the hash distance from the query. Further, a direct vector similarity may be computed for each candidate and the top k most similar are returned ranked by distance metric (e.g., cosine similarity). Specifically, a retrieval time may be reduced from linear scan by only computing direct similarity on a small fraction of the database vectors.
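Continuing the illustrative LSH sketch above (all names remain assumptions), query-time retrieval may gather candidate buckets and re-rank them by exact cosine similarity as follows.
import numpy as np

def retrieve_top_k(query_emb, tables, hash_vector, catalogue, k=5):
    """Gather LSH candidates for the query and re-rank them by exact cosine similarity."""
    candidates = set()
    for table, key in zip(tables, hash_vector(query_emb)):
        candidates.update(table.get(key, []))      # approximate matching buckets
    if not candidates:
        return []
    ids = np.fromiter(candidates, dtype=int)
    vecs = catalogue[ids]
    sims = (vecs @ query_emb) / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    order = np.argsort(-sims)[:k]                  # top-k most similar candidates
    return list(zip(ids[order].tolist(), sims[order].tolist()))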
[0071] The recommendation engine 214 may be configured to refine the raw search rankings using additional techniques to produce the final personalized product recommendations. In some aspects of the present disclosure, the techniques may be, but are not limited to, collaborative filtering, category spread constraints, and re-ranking based on user profile and history. The recommendation engine 214 may be configured to filter and refine the ranked search results before recommendation. In some aspects of the present disclosure, the recommendation engine 214 may be configured to (i) apply business rules, such as removing unavailable items and applying diversity filters, based on configured policies, (ii) combine the search rank with collaborative filtering scores for hybrid results by way of statistical rank fusion, (iii) perform personalized re-ranking incorporating user profile data, preferences, and purchase history, (iv) apply categorical spread and coverage filters to maintain variety across different product types, (v) recommend the final top-N products to the user, sorted by adjusted relevance score, and (vi) log the user interactions with the modified visuals to train improved multimodal embeddings and relevance models.
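The following compact sketch shows one way such a hybrid re-ranking could be realized, using reciprocal rank fusion as an assumed form of the "statistical rank fusion"; the RRF constant, the availability filter, and the per-category cap are illustrative assumptions, not values from the disclosure.

```python
# Hybrid re-ranking sketch: rank fusion + business-rule and category-spread filters.
def fuse_and_rerank(search_ranked, cf_ranked, catalogue, top_n=10,
                    k=60, max_per_category=3):
    # Reciprocal rank fusion over the search ranking and the collaborative
    # filtering ranking (both are ordered lists of item ids).
    scores = {}
    for ranking in (search_ranked, cf_ranked):
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)

    results, per_category = [], {}
    for item_id in sorted(scores, key=scores.get, reverse=True):
        item = catalogue[item_id]
        if not item.get("available", True):          # business rule: drop unavailable items
            continue
        cat = item.get("category", "other")
        if per_category.get(cat, 0) >= max_per_category:  # category spread constraint
            continue
        per_category[cat] = per_category.get(cat, 0) + 1
        results.append(item_id)
        if len(results) == top_n:                    # final top-N recommendations
            break
    return results
```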
[0072] FIGs. 2B and 2C illustrate a flowchart of a method 220 of iterative fine tuning for designing an experience with multi modal input interaction.
[0073] At step 222, the processing circuitry 120 may receive a user prompt.
[0074] At step 224, the processing circuitry 120 may receive an input image and the dimensions of some of the objects in the image such as the wall, patio etc.
[0075] At step 226, the processing circuitry 120 may check a size of the input image. In an aspect of the present disclosure, when the processing circuitry 120 determines that the size of the input image is less than Model Image Size Requirements, the method 220 may proceed to a step 230. On the other hand, when the processing circuitry 120 determines that the size of the input image is greater than Model Image Size Requirements, the method 220 may proceed to a step 228.
[0076] At step 228, the processing circuitry 120 may resize the input image such that the size of the input image is less than Model Image Size Requirements. Further, the method 220 proceeds to the step 230.
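A minimal sketch of steps 226-230 is shown below, assuming Pillow; the MAX_SIDE constant is a hypothetical stand-in for the Model Image Size Requirements, which the disclosure does not quantify.

```python
# Size check and resize for the input image (steps 226-230), as a sketch only.
from PIL import Image

MAX_SIDE = 1024  # assumed model input limit, not a value from the disclosure


def prepare_input(path: str) -> Image.Image:
    image = Image.open(path).convert("RGB")
    if max(image.size) > MAX_SIDE:
        # Downscale in place while preserving the aspect ratio.
        image.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)
    return image
```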
[0077] At step 230, the processing circuitry 120 may implement the visual generation engine 210 (i.e., an image generation model).
[0078] At step 232, the processing circuitry 120, by way of the visual generation engine 210, may generate a modified image as per the user's requirements given in the text prompt (as discussed in FIG. 3).
[0079] At step 234, the processing circuitry 120 may refine the image by way of a reinforcement learning agent.
[0080] At step 236, the processing circuitry 120 may determine a score associated with the generated image and compare it against an expected fidelity. In an aspect of the present disclosure, when the processing circuitry 120 determines that the image score meets the expected fidelity of the modified visual, the method 220 may proceed to a step 238. On the other hand, when the processing circuitry 120 determines that the score does not meet the expected fidelity of the modified visual, the method 220 may return to the step 230.
[0081] At step 238, the processing circuitry 120 may detect objects in the generated image and extract coordinates of the objects.
[0082] At step 240, the processing circuitry 120 may generate a masked image based on the detected co-ordinates of the objects.
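As a minimal sketch of steps 238-240, the code below draws a binary mask from bounding boxes produced by whatever object detector step 238 employs; the (left, top, right, bottom) box format is an assumption.

```python
# Build a masked image from detected object coordinates (step 240 sketch).
from PIL import Image, ImageDraw


def boxes_to_mask(image_size, boxes):
    """Return a mask that is white where objects were detected, black elsewhere."""
    mask = Image.new("L", image_size, 0)
    draw = ImageDraw.Draw(mask)
    for left, top, right, bottom in boxes:
        draw.rectangle([left, top, right, bottom], fill=255)
    return mask
```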
[0083] At step 241, the processing circuitry 120 may generate background-removed product images. In some aspects of the present disclosure, the processing circuitry 120, by way of the visual generation engine 210, may be configured to generate the background-removed product images from the catalogue. Specifically, the visual generation engine 210 may be configured to remove the background from the product images (specifically, one or more images of one or more products that match the input text provided by the user, which describes the desired changes to the outdoor image provided as input) by way of one or more background removal tools. Further, the processing circuitry 120, by way of the search engine 216, may be configured to perform an approximate cosine similarity search on the product data store using Multi-Index Hashing to identify the nearest neighbors within a radius ε = 0.1. This process is described further with reference to the method 300 and beyond.
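One possible realization of the "background removal tools" mentioned above is sketched below using the open-source rembg package; the disclosure does not name a specific tool, so this choice is an assumption.

```python
# Background removal sketch for product images (step 241), using rembg.
from PIL import Image
from rembg import remove


def remove_background(product_image_path: str) -> Image.Image:
    product = Image.open(product_image_path).convert("RGB")
    return remove(product)  # RGBA cut-out with a transparent background
```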
[0084] After the background is removed from the product image, the API may check the 'dimensionality' at step 245, i.e., whether the product will fit within the outdoor area dimensions specified as an input parameter. If the product does not fit within the specified outdoor area, a new product may be recommended and picked at step 243, and the flow returns to step 241 with a different product. If the product fits within the specified outdoor area dimensions, the API may check the 'orientation' at step 250, i.e., whether the product image is oriented at the same angle as the object it is replacing in the generated image. If so, the flow proceeds to step 242; otherwise, the 'Image Reorientation model' is called at step 249 to change the angular orientation of the product image.
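A hedged sketch of these two checks follows; the helper names, the footprint comparison, and the angle tolerance are hypothetical stand-ins for the API calls the disclosure refers to but does not specify.

```python
# Dimensionality and orientation checks (steps 245 and 250), sketch only.
def fits_area(product_w_m, product_d_m, area_w_m, area_d_m):
    """True when the product footprint fits inside the specified outdoor area."""
    return product_w_m <= area_w_m and product_d_m <= area_d_m


def needs_reorientation(product_angle_deg, target_angle_deg, tolerance_deg=5.0):
    """True when the product image must be re-rendered at the target angle."""
    diff = abs(product_angle_deg - target_angle_deg) % 360
    return min(diff, 360 - diff) > tolerance_deg
```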
[0085] At step 242, the processing circuitry 120 may be configured to inpaint the masked image to generate an inpainted image, and further crop the inpainted image using the coordinates from step 238. Further, the processing circuitry 120 may combine the cropped, inpainted images with 'background removed' product images from the catalogue.
[0086] At step 244, the processing circuitry 120 may overlay the inpainted masked image with 'background removed' product images from the catalogue, while maintaining image proportionality and starting coordinates. Further, the processing circuitry 120, by way of the visual generation engine 210, may refine the combined image using the reinforcement learning at step 251.
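As a minimal sketch of the overlay in step 244, the code below pastes an RGBA product cut-out (such as produced at step 241) into a previously detected box while preserving its aspect ratio; the proportional-scaling rule is an illustrative assumption.

```python
# Overlay a background-removed product onto the inpainted scene (step 244 sketch).
from PIL import Image


def overlay_product(scene: Image.Image, product_rgba: Image.Image, box) -> Image.Image:
    """Paste the product into the (left, top, right, bottom) box, keeping proportions."""
    left, top, right, bottom = box
    target_w, target_h = right - left, bottom - top
    scale = min(target_w / product_rgba.width, target_h / product_rgba.height)
    resized = product_rgba.resize(
        (max(1, int(product_rgba.width * scale)),
         max(1, int(product_rgba.height * scale))))
    out = scene.convert("RGBA")
    out.paste(resized, (left, top), mask=resized)  # alpha-composited paste
    return out.convert("RGB")
```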
[0087] At step 246, the processing circuitry 120 may check the score of the combined fine-tuned image. When the processing circuitry 120 determines that the score meets the requirement, the method 220 proceeds to a step 248. On the other hand, when the score does not meet the requirement, the method 220 returns to the step 244.
[0088] At step 248, the processing circuitry 120 generates and displays the output image.
[0089] FIG. 3 illustrates a flowchart of a method 300 that utilizes a combination of Generative Adversarial Network (GAN) and reinforcement learning AI techniques to continuously make the output image (i.e., the modified visuals) realistic. The output design image (i.e., the modified visuals) may be made realistic by stitching together the inpainted image and the target original constrained image (which has the user's required background and the current view).
[0090] At step 302, the processing circuitry 120 may resize the user input image and, utilizing the GAN, generate a modified image based on the text input from the user. Objects are overlaid on the input image based on the text prompt. The placement location of the objects in the image, as well as their perspective, dimensionality, orientation, proportionality, quantity, and quality, depends on the user's input text prompt. For example, a text prompt instructing that five chairs be 'placed in a circle' should result in exactly that; the output should not contain a random number of chairs placed at random locations within the image.
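The disclosure describes a GAN-based generator for this step. As a hedged stand-in, the sketch below uses a diffusion image-to-image pipeline from the diffusers library to show the same text-conditioned modification of a resized input image; the model checkpoint, strength, and guidance scale are illustrative assumptions, and the diffusion pipeline is explicitly a substitute technique, not the disclosed GAN.

```python
# Text-conditioned image modification (step 302 analogue), diffusion stand-in.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("backyard.jpg").convert("RGB").resize((512, 512))
prompt = "five garden chairs placed in a circle on the patio"
modified = pipe(prompt=prompt, image=init_image,
                strength=0.6, guidance_scale=7.5).images[0]
modified.save("modified_backyard.png")
```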
[0091] At step 304, the processing circuitry 120 may refine the image by way of a reinforcement learning agent.
[0092] At step 306, the processing circuitry 120 may detect objects in the image generated at step 302 and generate a masked image for them using their co-ordinates.
[0093] At step 308, the processing circuitry 120 may inpaint the image, and crop the inpainted image using co-ordinates from step 306.
[0094] At step 310, the processing circuitry 120 may combine the cropped, inpainted image with ‘background removed’ product images from the catalogue, while making sure proportionality is maintained.
[0095] At step 312, the processing circuitry 120 may refine the image using the reinforcement learning.
[0096] FIG. 4 illustrates a flowchart of a method 400 for data preparation and vector embedding generation.
[0097] At step 402, product data may be loaded from the catalogue, which describes each product and includes columns such as, but not limited to, a product name, an image URL, and a product description. In some aspects of the present disclosure, the data store 122 may have the product information in a form such that each row includes fields like product name, image, price of product, and product description.
[0098] At step 404, the vector embeddings of the product data may be generated by using a sentence transformer model. Aspects of the present disclosure are intended to include and/or otherwise cover any type of the sentence transformer model, without deviating from the scope of the present disclosure.
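A minimal sketch of step 404 is shown below, assuming the sentence-transformers package; the specific model name and the way product fields are concatenated are illustrative assumptions, since the disclosure covers any sentence transformer model.

```python
# Generate vector embeddings for catalogue rows (step 404 sketch).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
rows = [
    {"name": "Teak patio chair", "description": "Weather-resistant teak chair"},
    {"name": "Fire pit table", "description": "Propane fire pit with a tile top"},
]
texts = [f'{row["name"]}. {row["description"]}' for row in rows]
embeddings = model.encode(texts, normalize_embeddings=True)  # one vector per product
```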
[0099] At step 406, the vector index may be deployed in vector search service to store and retrieve vector embeddings for similarity search.
[0100] FIG. 5 illustrates a flowchart of a method 500 for creating a vector index with a matching engine.
[0101] At step 502, an embedding vector of each product (of specific Dimensions) may be received.
[0102] At step 504, an embedding index may be generated from the generated embedding vectors using an indexing technique such as Approximate Nearest Neighbors (ANN).
[0103] At step 506, the processing circuitry 120 may deploy the generated index and send a request payload to the deployed index with the query vector.
[0104] At step 508, the model may perform a nearest neighbor search based on the query vector in the vector space. Vertex AI may return the search results, which include information about the nearest neighbors to the query vector, such as the data point IDs of the nearest neighbors. Based on these data point IDs, the product information will be retrieved.
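The code below is a local, in-memory stand-in for steps 506-508, assuming normalized embeddings such as those produced in the earlier sketch; a managed service such as Vertex AI Vector Search would instead be queried through its own client, so this is conceptual only.

```python
# Local nearest-neighbor query returning data point IDs (steps 506-508 stand-in).
import numpy as np


def nearest_neighbors(query_vec, embeddings, datapoint_ids, k=5):
    # Cosine similarity reduces to a dot product on normalized vectors.
    sims = embeddings @ query_vec
    top = np.argsort(-sims)[:k]
    return [(datapoint_ids[i], float(sims[i])) for i in top]


# Product information is then looked up by the returned data point IDs, e.g.:
# products = {dp_id: row for dp_id, row in zip(datapoint_ids, rows)}
```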
[0105] FIG. 6 illustrates a flowchart of a method 600 for recommending products based on text input.
[0106] At step 602, the processing circuitry 120 may receive a text via the user prompt input (e.g., search query, description, related text).
[0107] At step 604, the processing circuitry 120 may generate vector embeddings for the user prompt using a sentence transformer model.
[0108] At step 606, the processing circuitry 120 may send a request payload to the deployed index in the vector search service with the user prompt vector embeddings.
[0109] At step 608, the processing circuitry 120 may fetch a list of data point IDs from the nearest neighbors' similarity search. Further, the product data may be retrieved using the data point IDs.
[0110] Thus, the system 100 and the methods 200, 300, 400, 500, and 600 provide an integrated technique which leads to a seamless design experience while meeting the constraints posed by the user.
[0111] The foregoing descriptions of specific aspects of the present technology have been presented for the purpose of illustration and description. They are not intended to be exhaustive or to limit the present technology to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The aspects were chosen and described in order to best explain the principles of the present technology and its practical application, to thereby enable others skilled in the art to best utilize the present technology and various aspects with various modifications as are suited to the particular use contemplated.
[0112] It is understood that various omissions and substitutions of equivalents are contemplated as circumstance may suggest or render expedient, but such are intended to cover the application or implementation without departing from the spirit or scope of the claims of the present technology.
[0113] While several possible aspects of the invention have been described above and illustrated in some cases, it should be interpreted and understood as to have been presented only by way of illustration and example, but not by limitation. Thus, the breadth and scope of a preferred aspect should not be limited by any of the above-described exemplary aspects.
CLAIMS:
1. A method (200) for transforming a visual comprising:
integrating, by way of processing circuitry (120) of an information processing apparatus (106), product data within a context of one or more scenarios to form a modified visual, where the integration is based on human feedback-loop interactions.

2. The method (200) of claim 1, where the integrating of the product data within the context of the one or more scenarios further comprises:
utilizing generative Artificial Intelligence (AI) techniques to generate, by way of processing circuitry (120), the one or more scenarios, where the one or more scenarios are one or more synthetic scenarios;
fusing, by way of processing circuitry (120), the product data within the generated one or more synthetic scenarios to form the modified visual;
facilitating, by way of processing circuitry (120), a virtual interaction with the modified visual; and
fine-tuning, by way of processing circuitry (120), the modified visual using generative AI techniques.

3. The method (200) of claim 1, where the integrating of the product data within the context of the one or more scenarios further comprises:
importing, by way of processing circuitry (120), the one or more scenarios from a data store (122);
fusing, by way of processing circuitry (120), the product data within the one or more scenarios to form the modified visual;
facilitating, by way of processing circuitry (120), a virtual interaction with the modified visual; and
fine-tuning, by way of processing circuitry (120), the modified visual using generative AI techniques.

4. The method (200) of claim 1, where the integrating of the product data within the context of the one or more scenarios further comprises:
processing, by way of processing circuitry (120), an input query, where the input query comprises one of a text, the one or more scenarios, product data from one or more product catalogues, unstructured social media and web content, user-generated input using neural network architectures, or a combination thereof;
transforming, by way of processing circuitry (120), data of the input query using Variational Autoencoders (VAEs) to enhance a spatial arrangement of the products within the one or more scenarios;
fusing, by way of processing circuitry (120), the product data into the context of the one or more scenarios using generative AI techniques to form the modified visual; and
dynamically integrating, by way of processing circuitry (120), the modified visual with real-time product data using generative AI techniques.

5. The method (200) of claim 1, further comprising iteratively fine-tuning, by way of processing circuitry (120), at least a portion of the visual by altering at least a part of one of, the one or more products, the one or more synthetic scenarios, or a combination thereof.

6. The method (200) of claim 1, further comprising enabling, by way of processing circuitry (120), a user to engage in a virtual interaction with the modified visual.

7. The method (200) of claim 1, further comprising dynamically integrating, by way of processing circuitry (120), the visual with real-time product data.

8. The method (200) of claim 1, further comprising automatically and iteratively refining, by way of processing circuitry (120), the modified visual based on generative AI techniques.

9. The method (200) of claim 1, further comprising collecting the one or more scenarios using one or more sensors (104).

10. The method (200) of claim 1, further comprising:
generating masked inpainted images using one or more pinpointed coordinates integrating the text of the input query, the one or more scenarios, and the product data; and
generating a product image with background removed.

11. The method (200) of claim 10, further comprising generating a combined image using the masked inpainted image with the product image with background removed.

12. The method (200) of claim 11, further comprising generating an output image (i.e., the modified visual) by combining the scenario data and the combined image using the pinpointed coordinates used to generate the masked inpainted image.

Documents

Application Documents

# Name Date
1 202311065195-STATEMENT OF UNDERTAKING (FORM 3) [28-09-2023(online)].pdf 2023-09-28
2 202311065195-PROVISIONAL SPECIFICATION [28-09-2023(online)].pdf 2023-09-28
3 202311065195-FORM 1 [28-09-2023(online)].pdf 2023-09-28
4 202311065195-DRAWINGS [28-09-2023(online)].pdf 2023-09-28
5 202311065195-DECLARATION OF INVENTORSHIP (FORM 5) [28-09-2023(online)].pdf 2023-09-28
6 202311065195-POA [27-11-2023(online)].pdf 2023-11-27
7 202311065195-FORM-26 [27-11-2023(online)].pdf 2023-11-27
8 202311065195-FORM 13 [27-11-2023(online)].pdf 2023-11-27
9 202311065195-DRAWING [27-11-2023(online)].pdf 2023-11-27
10 202311065195-COMPLETE SPECIFICATION [27-11-2023(online)].pdf 2023-11-27
11 202311065195-AMENDED DOCUMENTS [27-11-2023(online)].pdf 2023-11-27
12 202311065195-RELEVANT DOCUMENTS [06-12-2023(online)].pdf 2023-12-06
13 202311065195-POA [06-12-2023(online)].pdf 2023-12-06
14 202311065195-FORM 13 [06-12-2023(online)].pdf 2023-12-06
15 202311065195-AMMENDED DOCUMENTS [06-12-2023(online)].pdf 2023-12-06
16 202311065195-ENDORSEMENT BY INVENTORS [07-12-2023(online)].pdf 2023-12-07
17 202311065195-Proof of Right [26-12-2023(online)].pdf 2023-12-26
18 202311065195-Power of Attorney [13-02-2024(online)].pdf 2024-02-13
19 202311065195-Form 1 (Submitted on date of filing) [13-02-2024(online)].pdf 2024-02-13
20 202311065195-Covering Letter [13-02-2024(online)].pdf 2024-02-13