Specification
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application claims priority from Indian provisional patent application no. 202421009378, filed on February 12, 2024. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to the field of image processing, and, more particularly, to a method and system for performing zero-shot localized multi-object editing using multi-diffusion.
BACKGROUND
Recent developments in the field of diffusion models have demonstrated an exceptional capacity to generate high quality prompt-conditioned image edits. Conventional approaches have primarily relied on textual prompts for image editing. However, text-based editing of multiple fine-grained objects precisely at given locations within an image is a challenging task. This challenge primarily stems from the inherent complexity of controlling diffusion models to specify accurate spatial attributes of an image, such as scale and occlusion, during synthesis. Existing methods for textual image editing use a global prompt for editing images, making it difficult to edit a specific region while leaving other regions unaffected. This is thus an important problem to tackle, as real-life images often have multiple subjects, and each subject must be editable independently of the other subjects and the background while still retaining coherence in the composition of the image.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method is provided. The processor implemented method, comprising receiving, via one or more hardware processors, (i) at least one image from a plurality of images, (ii) a plurality of masks, and (iii) a set of foreground prompts corresponding to the at least one image from the plurality of images as input, wherein each of the plurality of images comprises one or more objects; performing, via the one or more hardware processors, an inversion step on the at least one image from the plurality of images using a pretrained diffusion model, wherein the inversion step comprises: (i) obtaining (a) a latent image based on a latent code corresponding to the at least one image from the plurality of images generated by a Vector Quantized Variational Autoencoder (VQ-VAE) comprised in the pretrained diffusion model and (b) a background prompt corresponding to the at least one image from the plurality of images generated using a text-embedding framework comprised in the pretrained diffusion model; (ii) performing noise regularization on the obtained latent image using an auto correlation loss and a KL divergence loss to obtain a subsequent latent image; and (iii) repeating steps (i) and (ii) for a predefined number of steps to obtain a list of latent images and a final latent inversion image; applying, via the one or more hardware processors, a multi-diffusion process on the final latent inversion image for performing zero-shot localized multi-object editing using (i) the plurality of masks, (ii) the list of latent images, and (iii) the background prompt and the set of foreground prompts corresponding to the at least one image from the plurality of images to obtain an edited image, wherein the zero-shot localized multi-object 
editing is performed on a plurality of mask-specific regions; and optimizing, via the one or more hardware processors, a cross-attention loss and a background preservation loss for preservation of one or more attributes and background of the edited image.
In another aspect, a system is provided. The system comprising a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive (i) at least one image from a plurality of images, (ii) a plurality of masks, and (iii) a set of foreground prompts corresponding to the at least one image from the plurality of images as input, wherein each of the plurality of images comprises one or more objects; perform an inversion step on the at least one image from the plurality of images using a pretrained diffusion model, wherein the inversion step comprises: (i) obtaining (a) a latent image based on a latent code corresponding to the at least one image from the plurality of images generated by a Vector Quantized Variational Autoencoder (VQ-VAE) comprised in the pretrained diffusion model and (b) a background prompt corresponding to the at least one image from the plurality of images generated using a text-embedding framework comprised in the pretrained diffusion model; (ii) performing noise regularization on the obtained latent image using an auto correlation loss and a KL divergence loss to obtain a subsequent latent image; and (iii) repeating steps (i) and (ii) for a predefined number of steps to obtain a list of latent images and a final latent inversion image; apply a multi-diffusion process on the final latent inversion image for performing zero-shot localized multi-object editing using (i) the plurality of masks, (ii) the list of latent images, and (iii) the background prompt and the set of foreground prompts corresponding to the at least one image from the plurality of images to obtain an edited image, wherein the zero-shot localized multi-object editing is performed on a plurality of mask-specific regions; and optimize a cross-attention loss and a background preservation loss 
for preservation of one or more attributes and background of the edited image.
In yet another aspect, a non-transitory computer readable medium is provided. The non-transitory computer readable medium is configured by instructions for receiving (i) at least one image from a plurality of images, (ii) a plurality of masks, and (iii) a set of foreground prompts corresponding to the at least one image from the plurality of images as input, wherein each of the plurality of images comprises one or more objects; performing an inversion step on the at least one image from the plurality of images using a pretrained diffusion model, wherein the inversion step comprises: (i) obtaining (a) a latent image based on a latent code corresponding to the at least one image from the plurality of images generated by a Vector Quantized Variational Autoencoder (VQ-VAE) comprised in the pretrained diffusion model and (b) a background prompt corresponding to the at least one image from the plurality of images generated using a text-embedding framework comprised in the pretrained diffusion model; (ii) performing noise regularization on the obtained latent image using an auto correlation loss and a KL divergence loss to obtain a subsequent latent image; and (iii) repeating steps (i) and (ii) for a predefined number of steps to obtain a list of latent images and a final latent inversion image; applying a multi-diffusion process on the final latent inversion image for performing zero-shot localized multi-object editing using (i) the plurality of masks, (ii) the list of latent images, and (iii) the background prompt and the set of foreground prompts corresponding to the at least one image from the plurality of images to obtain an edited image, wherein the zero-shot localized multi-object editing is performed on a plurality of mask-specific regions; and optimizing a cross-attention loss and a background preservation loss for preservation of one or more attributes and background of the edited image.
In accordance with an embodiment of the present disclosure, the cross-attention loss is optimized by updating a first set of cross-attention maps associated with the zero-shot localized multi-object editing to match with a second set of cross-attention maps associated with a reconstruction process of the at least one image from the plurality of images.
In accordance with an embodiment of the present disclosure, the background preservation loss is optimized by updating a first set of latent images associated with the zero-shot localized multi-object editing to match with a second set of latent images associated with a reconstruction process of the at least one image from the plurality of images.
In accordance with an embodiment of the present disclosure, the reconstruction process includes performing denoising on the final latent inversion image to obtain a reconstructed image of the at least one image from the plurality of images.
In accordance with an embodiment of the present disclosure, the one or more attributes and the background of the edited image are preserved to retain a structural consistency with the at least one image from the plurality of images.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary system for performing zero-shot localized multi-object editing using multi-diffusion, according to some embodiments of the present disclosure.
FIG. 2 illustrates an architectural overview of the system of FIG. 1 for performing zero-shot localized multi-object editing using multi-diffusion, in accordance with some embodiments of the present disclosure.
FIG. 3 illustrates an exemplary flow diagram illustrating a method for performing zero-shot localized multi-object editing using multi-diffusion, in accordance with some embodiments of the present disclosure.
FIG. 4 provides sample images for each edit type in the single-object benchmark, according to some embodiments of the present disclosure.
FIG. 5 provides sample images for each edit type in the multi-object benchmark, according to some embodiments of the present disclosure.
FIG. 6 depicts a few examples of all the compared state-of-the-art methods along with the method of the present disclosure producing visually faithful edits, according to some embodiments of the present disclosure.
FIG. 7 provides image representations illustrating qualitative results of all the compared state-of-the-art methods for multi-object edits on a few sample images along with the method of the present disclosure, according to some embodiments of the present disclosure.
FIG. 8 provides a visual representation of results of ablation on temperature scaling, according to some embodiments of the present disclosure.
FIG. 9 depicts image representations illustrating an impact of editing with a random latent compared to initiating the editing process via inversion, according to some embodiments of the present disclosure.
FIG. 10 depicts graphical representations illustrating results of user study for performing zero-shot localized multi-object editing using multi-diffusion, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following embodiments described herein.
Diffusion models have exhibited an outstanding ability to generate highly realistic images based on text prompts. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. The method of the present disclosure provides a framework for zero-shot localized multi object editing through a multi-diffusion process to overcome this challenge. This framework empowers users and computer systems to perform various operations on objects within an image, such as adding, replacing, or editing many objects in a complex scene in one pass. The method of the present disclosure leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the conventional methods.
Embodiments of the present disclosure provide a system and method for performing zero-shot localized multi-object editing using multi-diffusion. The method of the present disclosure is based on compositional generative models and inherits their generality without requiring training, making it a zero-shot solution. In the present disclosure, a pre-trained Stable Diffusion 2.0 model is used as the base generative model. The method of the present disclosure involves manipulation of the diffusion trajectory within specific regions of an image earmarked for editing. Prompts that exert a localized influence on these regions are employed while simultaneously incorporating a global prompt to guide the overall image reconstruction process, ensuring a coherent composition of foreground and background with minimal/imperceptible artifacts. To initiate the editing procedure, an inversion of the original image is used as a starting point. For achieving high-fidelity, human-like edits in images, (a) cross-attention matching and (b) background preservation are employed. These preserve the integrity of the edited image by guaranteeing that the edits are realistic and aligned with the original image. This, in turn, enhances the overall quality and perceptual authenticity of the final output. Additionally, a novel benchmark dataset is curated for multi-object editing. Experiments against existing state-of-the-art methods demonstrate the improved effectiveness of the present disclosure in terms of both image editing quality and inference speed. More specifically, the present disclosure describes the following:
A framework for zero-shot text-based localized multi-object editing based on multi-diffusion.
A framework facilitating multiple edits in a single iteration via enforcement of cross-attention and background preservation, resulting in high-fidelity and coherent image generation.
A new benchmark dataset for evaluating the multi-object editing performance of existing frameworks, termed localized multi-object editing (LoMOE)-Bench.
Referring now to the drawings, and more particularly to FIGS. 1 through 10, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary system for performing zero-shot localized multi-object editing using multi-diffusion, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes or is otherwise in communication with one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104, the memory 102, and the I/O interface(s) 106 may be coupled to a system bus 108 or a similar mechanism.
The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer and the like. Further, the I/O interface(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases.
The I/O interface(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the I/O interface(s) 106 may include one or more ports for connecting a number of computing systems with one another or to another server computer. Further, the I/O interface(s) 106 may include one or more ports for connecting a number of devices to one another or to another server.
The one or more hardware processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes a plurality of modules 102a and a repository 102b for storing data processed, received, and generated by one or more of the plurality of modules 102a. The plurality of modules 102a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
The plurality of modules 102a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 100. The plurality of modules 102a may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 102a can be implemented by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. Further, the memory 102 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
The repository 102b may include a database or a data engine. Further, the repository 102b amongst other things, may serve as a database or includes a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 102a. Although the repository 102b is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 102b can also be implemented external to the system 100, where the repository 102b may be stored within an external database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such external database may be periodically updated. For example, new data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 102b may be distributed between the system 100 and the external database.
FIG. 2 illustrates an architectural overview of the system of FIG. 1 for performing zero-shot localized multi-object editing using multi-diffusion, in accordance with some embodiments of the present disclosure. As shown in FIG. 2, the method of the present disclosure comprises three key steps: (a) inversion of the original image x_0 to obtain a latent code x_inv, which initiates the editing procedure and ensures a coherent and controlled edit, (b) applying the multi-diffusion process for localized multi-object editing to limit edits to mask-specific regions, and (c) attribute and background preservation via cross-attention and latent background preservation to retain structural consistency with the original image.
FIG. 3, with reference to FIGS. 1 and 2, illustrates an exemplary flow diagram illustrating a method for performing zero-shot localized multi-object editing using multi-diffusion, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
Referring to FIG. 3, in an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, the functional block diagram of FIG. 2, the flow diagram as depicted in FIG. 3, and one or more examples. Although steps of the method 200 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
With reference to the architectural overview of the system 100 depicted in FIG. 2 and referring to the steps of the method 200, at step 202 of the present disclosure, one or more hardware processors 104 are configured to receive (i) at least one image from a plurality of images, (ii) a plurality of masks, and (iii) a set of foreground prompts corresponding to the at least one image from the plurality of images as input, wherein each of the plurality of images comprises one or more objects. The present disclosure addresses a multi-object editing scenario where the objective is to simultaneously make local edits to several objects within one image. Formally, an input image x_0 ∈ X and N binary masks {M_1, …, M_N}, along with a corresponding plurality of foreground prompts {c_1, …, c_N}, are provided, where c_i ∈ C and C is the space of encoded text prompts. These are used to obtain an edited image x^* such that the editing process precisely manifests at the locations dictated by the masks, in accordance with the guidance provided by the plurality of foreground prompts.
Further, at step 204 of the present disclosure, the one or more hardware processors 104 are configured to perform an inversion step on the at least one image from the plurality of images using a pretrained diffusion model. In the present disclosure, a pretrained stable diffusion model, denoted as F, is used. The inversion step comprises obtaining (a) a latent image based on a latent code corresponding to the at least one image from the plurality of images generated by a vector quantized variational autoencoder (VQ-VAE) comprised in the pretrained diffusion model and (b) a background prompt corresponding to the at least one image from the plurality of images generated using a text-embedding framework comprised in the pretrained diffusion model. The VQ-VAE comprised in the pretrained diffusion model encodes the input image x_0 ∈ X = R^(512×512×3) into a latent code x_0 ∈ R^(64×64×4). Further, noise regularization is performed on the obtained latent image using an auto correlation loss and a KL divergence loss to obtain a subsequent latent image. The above steps are repeated for a predefined number of steps to obtain a list of latent images and a final latent inversion image.
The step 204 is better understood by way of the following description provided as exemplary explanation.
Given an image x_0 and its corresponding latent code x_0, inversion entails finding a latent x_inv which reconstructs x_0 upon sampling. In denoising diffusion probabilistic models (DDPM), the inversion step is defined by a forward diffusion process which involves Gaussian noise perturbation (ε_t ∼ N(0, I)) for a fixed number of timesteps t ∈ [T], governed by equations (1) and (2):
x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε_t (1)
x_inv = x_T (2)
where ᾱ_t represents a prefixed noise schedule. During training, a neural network ε_θ(x_t, t) learns to predict the noise ε_t added to a sample x_t. Additionally, this network can also be conditioned on text, images, or embeddings, denoted by ε_θ(x_t, t, c, ∅), where c is the encoded text condition and ∅ is the null condition. In the present disclosure, x_inv is obtained by providing c_0 corresponding to x_0 that is generated using a text-embedding framework such as Bootstrapping Language-Image Pre-training (BLIP). In conventional methods, it is observed that the inverted noise maps ε_θ(x_t, t, c, ∅) ∈ R^(64×64×4) generated by denoising diffusion implicit model (DDIM) inversion do not follow the statistical properties of uncorrelated, white Gaussian noise in most cases, causing poor editability. Thus, Gaussianity is softly enforced using a pairwise regularization loss L_pair and a divergence loss L_KL, weighted by λ. These losses ensure that there is (i) no correlation between any pair of random locations, and (ii) zero mean, unit variance at each spatial location, respectively. Mathematically, the pairwise regularization loss is given by equation (3):
L_pair = Σ_p (1/S_p²) Σ_{d=1}^{S_p−1} Σ_{x,y,c} η_{x,y,c}^p (η_{x−d,y,c}^p + η_{x,y−d,c}^p) (3)
where {η^0, η^1, · · · , η^p} denotes the noise maps of size S_p at the p-th pyramid level, d denotes an offset which helps propagate long-range information, and {x, y, c} denotes a spatial location. Here, p = 4 and η^0 = ε_θ ∈ R^(64×64×4) are set, where the subsequent noise maps are obtained via max-pooling. The divergence loss is given by equation (4):
L_KL = σ_{ε_θ}² + μ_{ε_θ}² − 1 − log(σ_{ε_θ}² + e) (4)
where μ_{ε_θ} and σ_{ε_θ}² denote the mean and variance of ε_θ, and e is a stabilization constant.
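As an illustration, the inversion and noise regularization of equations (1)-(4) can be sketched in a few lines of numpy. This is a minimal sketch, not the actual implementation: the function names, the toy ᾱ_t schedule, and the λ weight of 20 are assumptions, a random array stands in for the ε_θ output, and the offset sums of equation (3) are approximated with wrap-around shifts.

```python
import numpy as np

def forward_noising(x0, alpha_bars, rng):
    """Eqs. (1)-(2): noise the latent x0 over T steps; returns the list of
    noised latents and the final inversion x_inv = x_T."""
    latents = [np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
               for a in alpha_bars]
    return latents, latents[-1]

def kl_loss(eps, e=1e-6):
    """Eq. (4): drive the predicted noise toward zero mean, unit variance."""
    mu, var = eps.mean(), eps.var()
    return var + mu ** 2 - 1.0 - np.log(var + e)

def pair_loss(eps, levels=4):
    """Eq. (3): penalize correlation between offset spatial locations at
    several max-pooled pyramid levels (wrap-around shifts for simplicity)."""
    total, cur = 0.0, eps
    for _ in range(levels):
        s = cur.shape[0]
        acc = 0.0
        for d in range(1, s):
            acc += np.sum(cur * np.roll(cur, d, axis=0))  # eta(x,y)*eta(x-d,y)
            acc += np.sum(cur * np.roll(cur, d, axis=1))  # eta(x,y)*eta(x,y-d)
        total += acc / s ** 2
        cur = cur.reshape(s // 2, 2, s // 2, 2, -1).max(axis=(1, 3))
    return total

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 4))
lat_list, x_inv = forward_noising(x0, np.linspace(0.99, 0.01, 50), rng)
eps_pred = rng.standard_normal((64, 64, 4))            # stand-in for the ε_θ output
reg = pair_loss(eps_pred) + 20.0 * kl_loss(eps_pred)   # λ = 20 is an assumed weight
```

Note that, as expected from equation (4), the KL term is near zero for well-behaved standard-normal noise and grows when the variance drifts from one.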
At step 206 of the present disclosure, the one or more hardware processors 104 are configured to apply a multi-diffusion process on the final latent inversion image for performing zero-shot localized multi-object editing using (i) the plurality of masks, (ii) the list of latent images, and (iii) the background prompt and the set of foreground prompts corresponding to the at least one image from the plurality of images to obtain an edited image. The zero-shot localized multi-object editing is performed on a plurality of mask-specific regions and is restricted to them. In an embodiment, inversion provides a good starting point for the editing process, compared to starting from a random latent code. However, if the standard diffusion process is used for the editing process, then there is no control over local regions in the image using simple prompts. To tackle this, the multi-diffusion process for zero-shot localized multi-object editing is used.
The step 206 is better understood by way of the following description provided as exemplary explanation.
A diffusion model F typically operates as follows: given a latent code x_T and an encoded prompt c, it generates a sequence of latents {x_i}_{i=T−1}^0 during the backward diffusion process such that x_{t−1} = F(x_t | c), gradually denoising x_T over time. To obtain an edited image, the process starts from x_T = x_inv, which is further guided based on a target prompt. This approach applies prompt guidance on the complete image, making the output prone to unintentional edits. Thus, a localized prompting solution is provided, restricting the edits to a masked region.
To concurrently edit N regions corresponding to N masks, one approach is to use N+1 different diffusion processes {F(x_t^j | c_j)}_{j=0}^N, where x_t^j and c_j are the latent code and encoded prompt, respectively, for mask j. However, in the present disclosure, a single multi-diffusion process denoted by Φ is used for zero-shot conditional editing of the regions within all the given N masks. Given masks {M_1, …, M_N} and the background mask M_0 covering the region outside ∪_{i=1}^N M_i, with a corresponding set of encoded text prompts z = (c_0, c_1, …, c_N), the goal is to come up with a mapping function Φ: X × C^{N+1} → X, solving the following optimization problem as shown in equation (5):
Φ(y_t, z) = argmin_{y_{t−1}} L_md(y_{t−1} | y_t, z) (5)
The multi-diffusion process Φ starts with y_T and generates a sequence of latents {y_i}_{i=T−1}^0 given by y_{t−1} = Φ(y_t | z). The objective in equation (5) is designed to follow the denoising steps of F as closely as possible, enforced using the constraint L_md defined as shown in equation (6):
L_md(y_{t−1} | y_t, z) = Σ_{i=0}^N ‖M_i ⊙ [y_{t−1} − F(x_t^i | c_i)]‖² (6)
where ⊙ is the Hadamard product. The optimization problem in equation (5) has a closed-form solution given by equation (7):
Φ(y_t, z) = Σ_{i=0}^N (M_i / Σ_{j=0}^N M_j) ⊙ F(x_t^i | c_i) (7)
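The closed form in equation (7) amounts to a mask-weighted average of the per-region denoising results. The following toy numpy sketch (with assumed names, and constant arrays standing in for the outputs of F) illustrates it for one foreground mask and a background mask taken as its complement:

```python
import numpy as np

def multi_diffusion_step(region_denoised, masks):
    """Eq. (7): mask-normalized fusion of the per-region results F(x_t^i | c_i)."""
    num = sum(m * d for m, d in zip(masks, region_denoised))
    den = np.maximum(sum(masks), 1e-8)   # guard against uncovered pixels
    return num / den

# toy setup: one foreground mask M_1, background mask M_0 as its complement
H = W = 8
m1 = np.zeros((H, W, 1)); m1[2:6, 2:6] = 1.0
m0 = 1.0 - m1
bg = np.full((H, W, 4), 1.0)   # stand-in for F(x_t^0 | c_0)
fg = np.full((H, W, 4), 5.0)   # stand-in for F(x_t^1 | c_1)
y = multi_diffusion_step([bg, fg], [m0, m1])
```

Because the masks partition the latent here, the fused y attains zero L_md in equation (6), consistent with equation (7) solving the optimization in equation (5).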
Editing in the present disclosure is accomplished by running a backward process (termed edit) using Φ via a deterministic DDIM reverse process with y_T = x_inv and the target prompt (c). The DDIM reverse process, as shown in equation (8), is deterministic when σ_t = 0 ∀ t, where the family Q of inference distributions is parameterized by σ ∈ R_+^T.
y_{t−1} = √(ᾱ_{t−1})·((y_t − √(1 − ᾱ_t)·ε_θ(y_t, t, c, ∅)) / √(ᾱ_t)) + √(1 − ᾱ_{t−1})·ε_θ(y_t, t, c, ∅) (8)
where ᾱ_t is the noise scaling factor.
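The deterministic DDIM reverse step of equation (8) can be sketched as below; this is an illustrative numpy version in which `eps_pred` stands in for ε_θ(y_t, t, c, ∅) and the ᾱ values are passed in directly. If `eps_pred` equals the true noise used in equation (1), the inner term recovers the clean latent exactly.

```python
import numpy as np

def ddim_reverse_step(y_t, eps_pred, a_bar_t, a_bar_prev):
    """Eq. (8): deterministic DDIM reverse step (the σ_t = 0 case)."""
    # predicted clean latent: (y_t − √(1−ᾱ_t)·ε) / √(ᾱ_t)
    x0_hat = (y_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    # re-noise toward timestep t−1 with the same predicted noise
    return np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_pred

# sanity setup: build y_t via eq. (1) with ᾱ_t = 0.5, then step to ᾱ_{t−1} = 0.8
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4, 2))
eps = rng.standard_normal((4, 4, 2))
y_t = np.sqrt(0.5) * x0 + np.sqrt(1.0 - 0.5) * eps
y_prev = ddim_reverse_step(y_t, eps, 0.5, 0.8)
```

With the true noise supplied, `y_prev` lands exactly on the eq. (1) marginal at the earlier timestep, which is what makes the σ_t = 0 process invertible.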
Further, at step 208 of the present disclosure, the one or more hardware processors 104 are configured to optimize a cross-attention loss and a background preservation loss for preservation of one or more attributes and background of the edited image. In an embodiment, the one or more attributes and the background of the edited image are preserved to retain a structural consistency with the at least one image from the plurality of images. The cross-attention loss is optimized by updating a first set of cross-attention maps associated with the zero-shot localized multi-object editing to match with a second set of cross-attention maps associated with a reconstruction process of the at least one image from the plurality of images. The background preservation loss is optimized by updating a first set of latent images associated with the zero-shot localized multi-object editing to match with a second set of latent images associated with a reconstruction process of the at least one image from the plurality of images. The reconstruction process includes performing denoising on the final latent inversion image to obtain a reconstructed image of the at least one image from the plurality of images.
In other words, in addition to the edit process, a backward process (termed reconstruction) is also run using F with x_T = x_inv and a source prompt c_0. This provides a reconstruction x_0′ of the original latent code x_0. The deviation of x_0′ from x_0 is rectified by storing noise latents during the inversion process. During reconstruction, the list of latent images x_t′ (interchangeably referred to as latents) and a plurality of cross-attention maps Ā_t are saved for all timesteps t. These stored latents and cross-attention maps are used to define losses that guide the editing process.
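A minimal sketch of the two guidance terms of step 208 follows. The function name, the weights w_xa and w_bg, and the mean-squared form are assumptions; the stored reconstruction latents and cross-attention maps are passed in as plain arrays.

```python
import numpy as np

def edit_guidance_loss(attn_edit, attn_rec, y_t, x_t_rec, bg_mask,
                       w_xa=1.0, w_bg=1.0):
    """Cross-attention matching plus background preservation (a sketch).

    attn_edit / attn_rec : cross-attention maps from the edit branch and the
                           stored reconstruction branch
    y_t / x_t_rec        : current edit latent and stored reconstruction latent
    bg_mask              : 1 outside every object mask (region to preserve)
    """
    l_xa = np.mean((attn_edit - attn_rec) ** 2)        # match attention maps
    l_bg = np.mean((bg_mask * (y_t - x_t_rec)) ** 2)   # keep background latents
    return w_xa * l_xa + w_bg * l_bg
```

Gradients of this combined loss with respect to the edit latent would steer the edit toward preserving object attributes and the background; changes inside the object masks are deliberately not penalized by the background term.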
Bootstrapping: In the present disclosure, a bootstrap parameter T_b is used, allowing Φ(y_t | c_i) to focus on the region M_i early on in the process (until timestep T_b) and to consider the full context of the image later on. This improves the fidelity of the generated images when there are tight masks. It is introduced via a time dependency in y_t, given by equation (9) as:
y_t = M_i ⊙ y_t + (1 − M_i) ⊙ b_t, if t ≥ T_b; y_t otherwise (9)
where b_t denotes a background latent used during bootstrapping.
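The bootstrapping blend of equation (9) can be sketched as follows; the condition on t is inferred from the truncated text (t counts down from T, so "early" steps are the large-t ones), and the function and argument names are assumed.

```python
import numpy as np

def bootstrap_latent(y_t, mask_i, b_t, t, t_boot):
    """Eq. (9) sketch: early in denoising (large t, counting down from T),
    the region outside M_i is replaced with a background latent b_t so that
    the prompt c_i influences only its own region."""
    if t >= t_boot:
        return mask_i * y_t + (1.0 - mask_i) * b_t
    return y_t
```

Past the bootstrap timestep the latent passes through unchanged, letting the full image context re-enter the per-region diffusion.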