
Text Guided Image Editor

Abstract: A computer-implemented method comprising obtaining a base prompt and an edit prompt; converting the base and edit prompts to base and edit embeddings; repeating, for a plurality of iterations, the steps of: determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings; inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process. Finally, converting the edit latent to the edited image; and outputting the edited image.


Patent Information

Application #
Filing Date
27 March 2024
Publication Number
44/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

Fujitsu Limited
1-1 Kamikodanaka 4-chome, Nakahara-ku, Kawasaki-shi Kanagawa 211-8588, Japan.

Inventors

1. MALIK, Sameer
FUJITSU RESEARCH OF INDIA PRIVATE LIMITED, 6th Floor, Building No. 4, 77 Town Center, No. 36/2 Yamalur Village, Varthur Hobli, Old Airport Road Bangalore Bangalore KA 560037, India.

Specification

Description:
FIELD OF THE INVENTION
Embodiments of the present invention described herein relate to text guided image editing, and in particular to a computer-implemented method, a computer program, and an information processing apparatus.

BACKGROUND OF THE INVENTION
Text guided image editing refers to methods based on image generation models that make semantic changes to a given image based on textual instructions. Text guided image editing involves using textual descriptions to guide the modification or manipulation of images. This can be achieved through techniques like conditional image generation, image captioning, semantic image editing, interactive interfaces, and content-aware editing. It enables intuitive editing workflows, allows for complex instructions using natural language, and finds applications in graphic design, photo editing, content creation, and computer-aided design.

It is desirable to be able to control the image editing and to preserve the identity of the image's subject.

SUMMARY OF THE INVENTION
It is an aim of the present disclosure to at least partially address one or more of the challenges mentioned above. The invention is defined in the independent claims, to which reference should now be made. Further features are set out in the dependent claims.

According to an embodiment of a first aspect there is disclosed herein a computer-implemented method comprising obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image. The method further comprising converting the base and edit prompts to base and edit embeddings, respectively. The method further comprising repeating, for a plurality of iterations, the steps of: (i) determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings; (ii) inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and
(iii) inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process. The method further comprising converting the edit latent to the edited image; and finally, outputting the edited image.

In some embodiments, the weight is dependent on the time step such that: at earlier time steps, the base and edit embeddings mix less than at later time steps.

In some embodiments, the weight is further dependent on a time invariant parameter which controls how much the base and edit embeddings mix.

In some embodiments, the converting of the base and edit prompts to the base and edit embeddings involves converting the base and edit prompts to base and edit tokens and converting the base and edit tokens to base and edit embeddings.

In some embodiments, the new edit embeddings are further based on a mask vector and an index vector computed based on the base and edit tokens.

In some embodiments, the inputting of the base embeddings into the diffusion model in the base reverse process and the inputting of the new edit embeddings into the diffusion model in the edit reverse process overlap in time.

In some embodiments, the base prompt is derived from an image.

In some embodiments, the base prompt and/or the edit prompt are obtained from a user input.

In some embodiments, the inputting of the base embeddings into the diffusion model in the base reverse process and the inputting of the new edit embeddings into the diffusion model in the edit reverse process may occur simultaneously.

In some embodiments, the base prompt may comprise a textual description relating to the base image.

In some embodiments, the edit prompt may comprise a textual description relating to the edited image.

In some embodiments, the base image may comprise an image of a human face.

In some embodiments, the edit may comprise a change in facial expression of the human face.

In some embodiments, the base and edit embeddings may comprise vector representations of the base and edit prompts, respectively.

In some embodiments, the steps of obtaining the base and edit prompts and converting them to embeddings may only be performed once per edit.

In some embodiments, the weight may be dependent on the time step such that the base embeddings have a greater weight at earlier time steps than at later time steps.

In some embodiments, the converting of the base and edit prompts to the base and edit embeddings may involve converting the base and edit prompts to base and edit tokens via a tokenizer unit and converting the base and edit tokens to base and edit embeddings via a text encoder unit.

In some embodiments, the text encoder unit may be a contrastive language image pre-training (CLIP) text encoder.

In some embodiments, the edit latent may be converted to the edited image using a latent to image decoder.

In some embodiments, the mask vector and the index vector may be computed by a mask and alignment vector unit.

In some embodiments, the new edit embeddings may be determined by an embedding mixer unit which takes the base and edit embeddings, the time step and the weight as inputs and outputs the new edit embeddings.

In some embodiments, the cross-attention maps are generated by a cross-attention processor.

According to an embodiment of a second aspect there is disclosed herein a computer program which, when run on a computer, causes the computer to carry out a method comprising obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image. The method further comprising converting the base and edit prompts to base and edit embeddings, respectively. The method further comprising repeating, for a plurality of iterations, the steps of: (i) determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings; (ii) inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and (iii) inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process. The method further comprising converting the edit latent to the edited image; and finally, outputting the edited image.

According to an embodiment of a third aspect there is disclosed herein an information processing apparatus comprising a memory and a processor connected to the memory, wherein the processor is configured to perform a method, the method comprising: obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image. The method further comprising converting the base and edit prompts to base and edit embeddings, respectively. The method further comprising repeating, for a plurality of iterations, the steps of: (i) determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings; (ii) inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and (iii) inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process. The method further comprising converting the edit latent to the edited image; and finally, outputting the edited image.

In some embodiments, the memory and the processor are collectively configured to provide an embedding mixer arranged to perform the step of determining the new edit embeddings.

In some embodiments, the memory and the processor are collectively configured to provide the diffusion model.

In some embodiments, the memory and the processor are collectively configured to provide a cross-attention processor arranged to generate the cross-attention maps.

In some embodiments, the converting of the base and edit prompts to the base and edit embeddings involves converting the base and edit prompts to base and edit tokens and converting the base and edit tokens to base and edit embeddings; and wherein the memory and the processor are collectively configured to provide:
a tokenizer arranged to perform the step of converting the base and edit prompts to base and edit tokens; and
a text encoder arranged to perform the step of converting the base and edit tokens to base and edit embeddings.

In some embodiments, the new edit embeddings are further based on a mask vector and an index vector computed based on the base and edit tokens; and wherein the memory and the processor are collectively configured to provide a mask and alignment index vector module arranged to perform the step of computing a mask vector and an index vector based on the base and edit tokens.

Features relating to any aspect/embodiment may be applied to any other aspect/embodiment.

Brief Description of the Drawings
Embodiments of the invention will now be further described by way of example only and with reference to the accompanying drawings, wherein like reference numerals refer to like parts, and wherein:

Figure 1 is a diagram illustrating a stable diffusion model;

Figure 2 is a diagram illustrating a cross-attention mechanism;

Figure 3 is a diagram illustrating a comparative method;

Figure 4 is a diagram illustrating a comparative method;

Figure 5 is a diagram illustrating a comparative method;

Figure 6 is a diagram illustrating an embodiment of the invention;

Figure 7 is a diagram illustrating an embodiment of the invention;

Figure 8 shows examples illustrating a comparison of a comparative method (top row) and embodiments of the invention (bottom row);

Figure 9 shows examples illustrating a comparison of a comparative method (top row) and embodiments of the invention (bottom row);

Figure 10 is a diagram illustrating an embodiment of the invention;

Figure 11 is a diagram illustrating an embodiment of the invention;

Figure 12 is a flow chart illustrating an embodiment of the invention;

Figure 13 is a diagram illustrating an apparatus;

Figure 14 is a flow chart illustrating an embodiment of the invention; and

Figure 15 shows examples of base and edit images in accordance with embodiments of the invention.

DETAILED DESCRIPTION
Text guided image editing refers to methods based on image generation models that make semantic changes to a given image based on textual instructions. This disclosure aims to provide a text guided image editor with improved control over the edit process (e.g. if editing a facial expression to a smile, controlling the amount of smile) and better preservation of the target identity (e.g. if editing a facial expression, preserving the subject's facial characteristics). This disclosure also aims to provide a text guided image editor which does not require training for each new edit, unlike existing methods which require training on a per edit basis. This disclosure provides these advantages by providing an embedding mixer which controls the entanglement of a base prompt (the prompt for the original image, e.g. "photo of a man") and an edit prompt (the prompt for the edited image, e.g. "photo of a smiling man").

Figure 1 is a diagram illustrating a stable diffusion model 100. Diffusion models are state-of-the-art generative models that synthesize images from white Gaussian noise through progressive denoising with a T-step reverse diffusion process. In addition to high quality image generation, diffusion models have proven to be useful for text guided semantic image editing, e.g. changing facial expressions in an image. Several methods perform text guided image editing by finetuning pre-trained unconditional diffusion models using a contrastive language image pre-training (CLIP) based loss. However, these methods are computationally expensive due to the finetuning required. Stable diffusion is an open-source conditional text-to-image diffusion model that conditions the reverse diffusion process on CLIP embeddings E of the text prompt using the cross-attention mechanism. Referring to the Figure, a text prompt 102 (e.g. "photo of a man") is input into the CLIP text encoder 104 (via a tokenizer, not shown) which outputs text embeddings E 106. The text embeddings 106 are fed to the diffusion model 108 at every time step of the reverse process through the cross-attention mechanism 112. Specifically, the diffusion model in stable diffusion consists of 16 cross-attention layers. The i-th cross-attention layer at the t-th diffusion step takes the input features F_t^i and modifies them to F̂_t^i by conditioning on the text embeddings E. The latent code L_T ~ N(0, I) is progressively denoised to obtain the latent code L_0, which is then decoded by a latent to image decoder 114 to get the image 116.
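The progressive denoising loop described above can be sketched as follows. This is a minimal illustration only, not the stable diffusion implementation: `denoise_step` is a hypothetical stand-in for the cross-attention-conditioned U-Net step, and the shapes are assumptions.

```python
import numpy as np

def reverse_diffusion(denoise_step, text_emb, T=50, latent_shape=(4, 64, 64), seed=0):
    """Progressively denoise a latent L_T ~ N(0, I) down to L_0,
    conditioning every step on the text embedding (as in Figure 1)."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)  # L_T ~ N(0, I)
    for t in range(T, 0, -1):
        # each reverse step is conditioned on the text embedding E
        latent = denoise_step(latent, t, text_emb)
    return latent  # L_0, ready for the latent-to-image decoder

# toy denoiser standing in for the conditioned U-Net: shrink the latent each step
toy_step = lambda latent, t, emb: 0.95 * latent
L0 = reverse_diffusion(toy_step, text_emb=np.zeros(768))
```

In the real model, `denoise_step` would be the U-Net noise predictor wrapped in a DDIM or similar sampler update, and `L0` would be passed to the latent to image decoder 114.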

Figure 2 is a diagram illustrating a cross-attention mechanism 200. A cross-attention mechanism enables neural network models to selectively focus on relevant parts of one sequence based on information provided by another sequence. A cross-attention mechanism typically operates with three input sequences: a "query" sequence, a "key" sequence, and a "value" sequence. These sequences can represent various types of data, such as words in a sentence, tokens in a document, or pixels in an image.

Referring to the Figure, a base prompt 202 is input into the CLIP text encoder 204 (via a tokenizer, not shown) which outputs text embeddings E 206. The text embeddings 206 are used to obtain the key-value input sequence, while the input features F_i^t 208 are used to obtain the query input sequence. Both inputs 208, 206 are input into the cross-attention mechanism 210. Each element in the query sequence is associated with a "query vector" 212, while each element in the key-value sequence is associated with both a "key vector" 214 and a "value vector" 216. These vectors represent the semantic information of the input sequences. Specifically, the cross-attention mechanism 210 linearly projects the text embeddings with W_K and W_V to get the key and value sequences, respectively. The input features 208 are linearly projected with W_Q to get the query sequence. The attention weights are then computed using a similarity measure, such as dot product or scaled dot product, between the query vectors and key vectors.

The attention weights can be visualized as cross-attention maps 218. These maps typically depict a grid where each row corresponds to elements in the query sequence, and each column corresponds to elements in the key-value sequence. The intensity or colour of each cell in the grid represents the magnitude of attention assigned to the corresponding pair of elements. The attention weights are used to compute a weighted sum of the value vectors in the key-value sequence to produce the final output features F̂_i^t 220. This weighted sum represents the "attended" information from the value sequence that is relevant to each element in the query sequence. During training, the parameters of the cross-attention mechanism, including the W_Q, W_K and W_V weights, are learned through backpropagation using labelled data or other suitable training objectives. Cross-attention mechanisms allow models to effectively leverage contextual information from one sequence to enhance the processing of another sequence.
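The projections and attention maps described above can be written as a short sketch. This is a minimal NumPy illustration of scaled dot-product cross-attention, not stable diffusion's actual implementation; the shapes and weight matrices are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, text_emb, W_Q, W_K, W_V):
    """Scaled dot-product cross-attention as in Figure 2: queries come from
    the image features F, keys and values from the text embeddings E.
    Returns the attended features F-hat and the cross-attention maps M."""
    Q = features @ W_Q              # query vectors, one per image feature
    K = text_emb @ W_K              # key vectors, one per text token
    V = text_emb @ W_V              # value vectors, one per text token
    maps = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # rows sum to 1
    out = maps @ V                  # weighted sum of value vectors
    return out, maps
```

Each row of `maps` is the attention distribution of one query element (e.g. one spatial location) over the text tokens, which is exactly what the cross-attention maps 218 visualize.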

Figure 3 is a diagram illustrating a first comparative method 300 (comparative method 1). Comparative method 1 is a method for editing images using text prompts 302. The method finetunes an unconditional diffusion model for a desired edit by minimizing the CLIP based loss computed as the dissimilarity between CLIP text embedding of the edit prompt 302 and CLIP image embedding of the edited images. First, the base image 304 is inverted to the corresponding latent x_T 306 using the forward denoising diffusion implicit models (DDIM) process 308. Then the diffusion model 310 is finetuned so that the reverse DDIM process 312 applied on the latent x_T 306 gives the desired result (edited image 314). A problem with comparative method 1 is that it requires separate diffusion models to be trained for different types of edits which is time consuming and requires a lot of memory. The first comparative method may be referred to as DiffusionCLIP (Kim et al. 2022, “DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation”).

Figures 4 and 5 illustrate a second comparative method 400, 500 (comparative method 2). Comparative method 2 is a method for editing images using text prompts using cross-attention maps to control the editing. Unlike comparative method 1, comparative method 2 does not require training for individual edits, i.e. the model does not need to be finetuned for different edits.

As shown in Figure 4, in comparative method 2, two reverse processes, the base reverse process 402 for the base prompt and the edit reverse process 404 for the edit prompt, are conducted simultaneously. During the process, for the overlapping part of the text prompt, at every step t, the edit reverse process 404 uses the cross-attention maps generated by the cross-attention processor 406 from the base reverse process 402. For example, with base prompt "Photo of man" and edit prompt "Photo of smiling man", in the edit reverse process the cross-attention maps for the words "Photo", "of", "man" are taken from the base prompt, while the map for the word "smiling" is computed in the edit reverse process itself. In more detail, latent 408 is input into both the base reverse process 402 and the edit reverse process 404. The base reverse process 402 produces denoised base latent 410 using the base embeddings E_base. Simultaneously, the edit reverse process 404 produces denoised edit latent 412 using the edit embeddings E_edit. Cross-attention maps generated from the base reverse process 402 are input into the edit reverse process 404. The denoised base latent 410 is input into the latent to image decoder 414 to produce base image 416. The denoised edit latent 412 is input into the latent to image decoder 414 to produce edit image 418.
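The map-sharing rule of comparative method 2 can be sketched as follows. This is a simplified illustration under assumed shapes (maps as `(n_pixels, n_tokens)` arrays, tokens as strings), not the published Prompt-to-Prompt implementation:

```python
import numpy as np

def inject_base_maps(base_maps, edit_maps, base_tokens, edit_tokens):
    """For each edit-prompt token that also appears in the base prompt,
    reuse the cross-attention map from the base reverse process; maps for
    new words (e.g. "smiling") stay as computed in the edit reverse process.
    base_maps: (n_pixels, len(base_tokens)); edit_maps: (n_pixels, len(edit_tokens))."""
    merged = edit_maps.copy()
    base_index = {tok: k for k, tok in enumerate(base_tokens)}
    for j, tok in enumerate(edit_tokens):
        if tok in base_index:
            # overlapping word: take its map from the base reverse process
            merged[:, j] = base_maps[:, base_index[tok]]
    return merged
```

For the prompts in the example above, the columns for "Photo", "of" and "man" would be copied from the base maps, and only the "smiling" column would keep the map computed in the edit reverse process.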

Figure 5 illustrates the cross-attention mechanism for comparative method 2. The base cross-attention process 502 works in the same way as the cross-attention mechanism described in relation to Figure 2 above. The edit cross-attention process 504 works similarly to the base cross-attention process 502, except for the additional input of the base cross-attention maps 506 which are generated from the base reverse process and input into the edit reverse process.

A problem with comparative method 2 is that it can unintentionally change the subject's identity. Contextual encoding by the text encoder causes the edit instruction to affect other word embeddings from the edit prompt. For example, in the above case, the embedding of the word "man" is different in E_base and E_edit because the word "smiling" changes its embedding due to contextual encoding by the CLIP text encoder. This can negatively affect the subject's identity. A further problem with comparative method 2 is that it is difficult to control the edit strength. Comparative method 2 proposes to control the strength of the edit by scaling the cross-attention map corresponding to the edit phrase (e.g. scaling the cross-attention map of the word "smiling" in the above example). However, this is not very effective because the embeddings of other words in E_edit also encode the "smiling" attribute due to contextual encoding. The second comparative method may be referred to as Prompt-to-Prompt (Hertz et al. 2022, "Prompt-to-Prompt Image Editing with Cross-Attention Control").

Figure 14 is a flow chart illustrating method steps in a computer-implemented method 1400 for generating an edited image in accordance with some embodiments of the invention. At step 1402, a base prompt indicating a base image is obtained. The base prompt may be obtained from an existing image and/or a user input. The base prompt may comprise a textual description relating to the base image. At step 1404, an edit prompt indicating an edit to be made to the base image is obtained. The edit prompt may be obtained from a user input. The edit prompt may comprise a textual description relating to the edited image. At step 1406, the base and edit prompts are converted to base E_base and edit E_edit embeddings, respectively. At step 1408, new edit embeddings Ê_edit(t) are determined based on the base and edit embeddings, a time step t, and a weight w_e(t) dependent on the time step, wherein the weight controls mixing of the base and edit embeddings. For example, the weight may be dependent on the time step such that at earlier time steps, the base and edit embeddings mix less than at later time steps. The weight may be further dependent on a time invariant parameter which controls how much the base and edit embeddings mix. This time invariant parameter is referred to herein as hyperparameter σ_e. At step 1410, the base embeddings are input into a diffusion model in a base reverse process arranged to update a base latent relating to the base image. At step 1412, the new edit embeddings are input into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process. Steps 1410 and 1412 may overlap in time; for example, they may occur simultaneously. At step 1414, the edit latent is converted to the edited image. At step 1416, the edited image is output, e.g. to a user.
This could be in the form of displaying the edited image to the user, e.g. via a Graphical User Interface (GUI).

Embodiments of the invention may work on an image of any subject. For example, the base image may comprise an image of a human face. In such an example, the edit may comprise a change in facial expression of the human face. In other examples, the edit may comprise a change in the facial characteristics and/or age of the human face. Examples of editing images of human faces are shown in Figures 8 and 9 and described below. Examples 1500 of editing images of animals are shown in Figure 15 which shows an image of a bird 1502 which is edited to show an image of a flying bird 1504. Figure 15 also shows an image of a horse 1506 which is edited to show an image of a running horse 1508. Figure 15 also shows an image of a dog 1510 which is edited to show an image of a running dog 1512. As such, the edit may comprise a change in what the subject of the image is doing. An edit may comprise a change to the background of an image.

Figure 6 illustrates embodiments of the invention 600, in particular the t-th diffusion step of the method of the disclosure. The method is similar to the Prompt-to-Prompt method discussed in relation to Figures 4 and 5, but with the introduction of an embedding mixer module. The embedding mixer addresses the problem of the edit words (e.g. "smiling") affecting the embeddings of other words in the edit prompt (e.g. "man"). Referring to the Figure, method 600 comprises inputting a base prompt 602 (e.g. "photo of a man") and an edit prompt 604 (e.g. "photo of a smiling man") into the CLIP text encoder 606 (via a tokenizer, not shown) which outputs base embeddings E_base and edit embeddings E_edit. Base embeddings E_base and edit embeddings E_edit are input into the embedding mixer 608, along with time step t, as the embedding mixer output is time step dependent. The embedding mixer 608 outputs new, time step dependent, edit embeddings Ê_edit(t). Base embeddings E_base are also input into a diffusion model in a base reverse process 610 which acts on the base latent L_(t+1)^base to produce L_t^base (i.e. the original unedited image in latent space). Meanwhile, the new edit embeddings Ê_edit(t) are input into the diffusion model in an edit reverse process 612 which acts on the edit latent L_(t+1)^edit to produce L_t^edit (i.e. the edited image in latent space). The cross-attention processor 614 acts between the two reverse processes 610, 612 as described above in relation to Figures 2 and 5, which allows the process of obtaining the original unedited image to inform the process of obtaining the edited image. The cross-attention processor 614 takes cross-attention maps {M_base^(n,t)}_(n=1)^N from the base reverse process 610 and cross-attention maps {M_edit^(n,t)}_(n=1)^N from the edit reverse process 612, and outputs modified cross-attention maps {M̂_edit^(n,t)}_(n=1)^N which are then input back into the edit reverse process 612.

Figure 7 illustrates the embedding mixer 608 in more detail 700. Referring to the Figure, the base prompt, e.g. "photo of a man" 602, and the edit prompt, e.g. "photo of a smiling man" 604, are input into the tokenizer 702 which outputs base tokens t_b and edit tokens t_e, respectively. The tokens are then input into the text encoder 606 which outputs the base embeddings E_base 704 and the edit embeddings E_edit 706. The embedding of the edit word "smiling" is shown as 708. The embeddings 704 and 706 are input into the embedding mixer 608 as described above with reference to Figure 6. The mixing is done through a convex combination of the embeddings of the overlapping parts of the base and edit prompts using the weights w_e(t). These weights are designed to emphasize the base embeddings during the initial steps of the reverse diffusion process and the edit embeddings during the later part of the process. As the initial time steps of the reverse diffusion process affect the higher-level features of the image (e.g. facial structure if the image is of a face), this ensures that only the base embeddings affect these higher-level features, maintaining the identity of the subject. Later time steps of the reverse diffusion process affect the lower-level features (i.e. finer details) of the image (e.g. facial expression if the image is of a face). By including the edit embeddings later in the process, the edit embeddings only affect the finer details of the image and do not completely alter the subject of the image. The embeddings specifying the desired edit are then concatenated to the mixed embeddings. Since the base embeddings are not entangled with the edit embeddings, the problems associated with entanglement are mitigated. Because the context in the edit prompt is also important, the embeddings are mixed rather than replaced.
The new edit embeddings Ê_edit(t) may be expressed as:

Ê_edit(t) = m * w_e(t) * E_base[i] + m * (1 - w_e(t)) * E_edit + (1 - m) * E_edit

where i is an index vector that aligns matching tokens in t_b with t_e, computed by comparing t_b and t_e; m is a binary mask that takes the value 1 for tokens that are the same in t_b[i] and t_e; and w_e(t) = exp(-(t - T)^2 / (2σ_e^2)), where T is the total number of time steps in the reverse process and σ_e is the hyperparameter that controls embedding mixing, a lower value resulting in a stronger edit.
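The mixing rule above can be sketched directly in code. This is a minimal NumPy illustration of the formula under assumed array shapes (per-token embedding rows, m broadcast as a column mask), not the disclosed implementation:

```python
import numpy as np

def edit_weight(t, T, sigma_e):
    """w_e(t) = exp(-(t - T)^2 / (2 * sigma_e^2)): close to 1 early in the
    reverse process (t near T), decaying toward later steps so the edit
    embeddings take over only for the finer details."""
    return np.exp(-((t - T) ** 2) / (2.0 * sigma_e ** 2))

def mix_embeddings(E_base, E_edit, m, i, t, T, sigma_e):
    """Embedding mixer:
        E_hat_edit(t) = m*w_e(t)*E_base[i] + m*(1 - w_e(t))*E_edit
                        + (1 - m)*E_edit
    E_base: (n_base_tokens, d); E_edit: (n_edit_tokens, d);
    m: (n_edit_tokens, 1) binary mask; i: alignment index vector."""
    w = edit_weight(t, T, sigma_e)
    return m * w * E_base[i] + m * (1.0 - w) * E_edit + (1.0 - m) * E_edit
```

At t = T the weight is 1, so overlapping tokens (m = 1) use pure base embeddings while new edit words (m = 0) keep their edit embeddings; as t decreases toward 0 the overlapping tokens blend toward the edit embeddings, with σ_e setting how quickly.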

Figure 8 shows an example comparison 800 of Prompt-to-Prompt results 804-808 (top row) and the method of the disclosure 810-814 (bottom row). For both methods, an original image 802 created with the base prompt "photo of a man" has been edited by the edit prompts "photo of a smiling man", "photo of a crying man" and "photo of an old man". The result of the Prompt-to-Prompt method for the "photo of a smiling man" edit can be seen in image 804. The result of the Prompt-to-Prompt method for the "photo of a crying man" edit can be seen in image 806. The result of the Prompt-to-Prompt method for the "photo of an old man" edit can be seen in image 808. The result of the method of the disclosure for the "photo of a smiling man" edit can be seen in image 810. The result of the method of the disclosure for the "photo of a crying man" edit can be seen in image 812. The result of the method of the disclosure for the "photo of an old man" edit can be seen in image 814. For the method of the disclosure, all three of these results were generated using hyperparameter value σ_e=500. It can be seen from these results that the Prompt-to-Prompt method can change the identity of the subject. For example, the result for the "photo of a smiling man" 804 looks older than the base image 802. The embedding of "smiling" has had unintended effects on the age of the subject. Similarly, the result for the "photo of a crying man" 806 has a different facial structure to base image 802. The embedding of "crying" has had unintended effects on the facial structure of the subject. In contrast, the results of the method of the disclosure 810, 812, 814 are much more realistic, and it is clear that the person in the edited images 810, 812, 814 is the same as the person in the base image 802.

Figure 9 shows an example comparison 900 of Prompt-to-Prompt results 902-908 (top row) and the method of the disclosure 910-916 (bottom row). In the top row, the images 902-908 show the test of controllability in the Prompt-to-Prompt approach by varying weight of the cross-attention map corresponding to the word “old”. Image 902 has w_a=0.1, image 904 has w_a=0.3, image 906 has w_a=0.7 and image 908 has w_a=1.0. This approach was proposed in the Prompt-to-Prompt paper. As can be seen in the images 902-908, this is not a robust way to control the degree of the edit (e.g. control how old the man appears) and may fail for certain images as shown. In contrast, in the bottom row, the images 910-916 show the test of controllability for the method of the disclosure by varying the hyperparameter σ_e which controls the mixing of the embeddings. As can be seen in the images 910-916, the editing strength can be effectively controlled using this method, where lower hyperparameter values (e.g. 100) result in a stronger edit (e.g. an older looking man) and higher hyperparameter values (e.g. 700) result in a weaker edit (e.g. a younger looking man). The example hyperparameter values shown in Figure 9 are σ_e=100 for image 910, σ_e=300 for image 912, σ_e=500 for image 914, and σ_e=700 for image 916.

Figure 10 is a diagram 1000 illustrating embodiments of the invention in more detail. The edit prompt 604 and the base prompt 602 are input into the tokenizer 702, which outputs edit tokens t_e and base tokens t_b, respectively. The tokens are then input into the text encoder 606, which outputs the base embeddings E_base and the edit embeddings E_edit. The mask and alignment index vector 1002 is a component used to align the tokens of the base and edit prompts. It takes the tokens t_b, t_e as input and outputs m and i, where i is an index vector, computed by comparing t_b and t_e, that aligns identical tokens in t_b with t_e, and m is a binary mask that takes the value of 1 for identical tokens in t_b[i] and t_e (as defined above in relation to Figure 7). The embeddings E_base, E_edit are input into the embedding mixer 608, as described above with reference to Figures 6 and 7, along with the timestep t, m and i. As described above, the embedding mixer 608 outputs new, timestep-t-dependent, edit embeddings Ê_edit(t). The diffusion model 610, 612 and the cross-attention processor 614 have been described above in relation to Figure 6.
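A minimal sketch of how the mask m and index vector i might be computed from the two token sequences. Word-level tokens stand in for the real tokenizer output, and the greedy left-to-right matching and the function name mask_and_alignment are illustrative assumptions, not the exact procedure of the disclosure.

```python
def mask_and_alignment(t_b, t_e):
    """Compute the index vector i aligning tokens of t_b with t_e, and a
    binary mask m that is 1 where t_e[j] also occurs in t_b.

    Greedy left-to-right matching (an assumption): for each edit token,
    search the remaining base tokens for a match; unmatched edit tokens
    (new words such as "smiling") get m=0 and a placeholder index.
    """
    i, m = [], []
    next_b = 0  # next unmatched position in t_b
    for j, tok in enumerate(t_e):
        pos = next((k for k in range(next_b, len(t_b)) if t_b[k] == tok), None)
        if pos is None:
            i.append(min(j, len(t_b) - 1))  # placeholder index for new tokens
            m.append(0)
        else:
            i.append(pos)
            m.append(1)
            next_b = pos + 1
    return m, i
```

For the base prompt "photo of a man" and edit prompt "photo of a smiling man", this yields m=[1, 1, 1, 0, 1]: every edit token is matched in the base prompt except the inserted word "smiling".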

Figure 11 is a diagram 1100 illustrating embodiments of the invention in more detail. Figure 11 illustrates how the tokenizer 702, mask and alignment index vector 1002 and text encoder 606 steps (described above) are computed once at the start of the reverse process. The embedding mixer 608, diffusion model 610, 612 and cross-attention processor 614 steps are computed T times, starting from t=T (L_T^e = L_T^b ~ N(0,1)) and ending at t=0. In Figure 11, the matrices and tensors are annotated with their dimensions, e.g. t_b has dimensions 77 x 1.

Figure 12 is a flow chart 1200 illustrating embodiments of the invention. At step 1202, the method starts. At step 1204, the method checks if there is a real base image to be edited. Embodiments of the invention can work from a real image, working backwards to produce a suitable prompt which would give that real image, or a user may provide a prompt from which to generate the base image. If there is a real image, the method progresses to step 1206, where the real image is loaded, the latent L is computed with a null-text inversion method, t is set to t=T and the latents are set L_b^t=L_e^t=L. Alternatively, if there is no real image, the method progresses to step 1208, where the method sets t=T and the latents L_b^t=L_e^t~N(0,1). After step 1206 or step 1208, the method progresses to step 1210, where the base 602 and edit 604 prompts are loaded, along with the sampled initial latent vector. The sampled initial latent vector is the latent vector L_T, the starting point of the reverse diffusion process; it is the same for both the base and edit reverse processes. The edit prompt 604 may be provided by a user. At step 1212, both the base and edit prompts are tokenized by the tokenizer 702 to produce base tokens t_b and edit tokens t_e. At step 1214, the mask vector m and index vector i are computed by the mask and alignment index vector 1002. At step 1216, the base E_base and edit E_edit embeddings are computed by the text encoder 606. At step 1218, the modified edit embeddings Ê_edit are computed with the embedding mixer 608. At step 1220, the latents L_b^t and L_e^t are updated to L_b^(t-1) and L_e^(t-1) with the base and edit reverse processes of the diffusion model 610, 612 and the cross-attention processor 614. At step 1222, the method checks if timestep t>1 (i.e. checks whether this is the last timestep of the process). If t>1 (i.e. it is not the last timestep), the method proceeds to step 1224, which sets t←t-1 (i.e. changes t to the next timestep) and then repeats steps 1218-1222. If t≤1 (i.e. it is the last timestep), the method proceeds to step 1226, where the edit latent L_e^0 is decoded to get the edited image. At step 1228, the method ends.
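The control flow of Figure 12 (the random-initialisation branch, steps 1208-1226) can be sketched as below. All of the callables, their names, and their signatures are stand-in hooks assumed for illustration; only the loop structure mirrors the flow chart.

```python
def edit_image(base_prompt, edit_prompt, T, tokenizer, encoder,
               mixer, step, decoder, init_latent):
    """Control-flow sketch of Figure 12. The callables are hypothetical
    stand-ins for the components described in the text (tokenizer 702,
    text encoder 606, embedding mixer 608, diffusion reverse step,
    latent decoder); their signatures are assumptions."""
    # Steps 1212-1216: computed once, before the loop.
    t_b, t_e = tokenizer(base_prompt), tokenizer(edit_prompt)
    e_base, e_edit = encoder(t_b), encoder(t_e)
    # Step 1208: both reverse processes start from the same latent L_T.
    l_b = l_e = init_latent
    # Steps 1218-1224: iterate t = T, T-1, ..., 1.
    for t in range(T, 0, -1):
        e_edit_t = mixer(e_base, e_edit, t)          # step 1218
        l_b, attn = step(l_b, e_base, t)             # base reverse process
        l_e, _ = step(l_e, e_edit_t, t, attn=attn)   # edit process, reusing
                                                     # base cross-attention maps
    # Step 1226: decode the final edit latent L_e^0.
    return decoder(l_e)
```

With dummy hooks this runs T iterations, mixing the embeddings anew at each timestep and feeding the base process's cross-attention maps into the edit process, before decoding only the edit latent.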

The method disclosed herein provides an improved text guided image editing method by providing an embedding mixer to better control the editing process.

Text guided image editing may be used for any number of downstream tasks. Text guided image editing is advantageous as it provides a user-friendly and intuitive way to create or modify images without requiring expertise in graphic design or image editing software. Users can input text descriptions instead of manually manipulating image elements, making it accessible to a broader range of individuals. Text-to-image editing can be faster than traditional methods, especially for generating multiple variations of an image. Users can quickly describe their desired changes, and the software can generate or edit the image accordingly, saving time and effort. Text-to-image editing can automate repetitive tasks and streamline workflows by generating images based on textual descriptions. Additionally, it enables customization, as users can specify precise details or preferences in their text inputs.

Figure 13 is a block diagram of an information processing apparatus 1300 or a computing device 1300, such as a data storage server, which embodies the present invention, and which may be used to implement some or all of the operations of a method embodying the present invention, and perform some or all of the tasks of apparatus of an embodiment. The computing device 1300 may be used to implement any of the method steps described above, e.g. any of steps S. 1402 – S.1416, and/or any processes described above.

The computing device 1300 comprises a processor 1302 and memory 1304. Optionally, the computing device also includes a network interface 1306 for communication with other such computing devices. Optionally, the computing device also includes one or more input mechanisms such as keyboard and mouse 1308, and a display unit such as one or more monitors 1310. These elements may facilitate user interaction. The components are connectable to one another via a bus 1312.

The memory 1304 may include a computer readable medium, which term may refer to a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) configured to carry computer-executable instructions. Computer-executable instructions may include, for example, instructions and data accessible by and causing a computer (e.g., one or more processors) to perform one or more functions or operations. For example, the computer-executable instructions may include those instructions for implementing a method disclosed herein, or any method steps disclosed herein, e.g. any of steps S. 1402 – S. 1416, and/or any processes described above. Thus, the term “computer-readable storage medium” may also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the method steps of the present disclosure. The term “computer-readable storage medium” may accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. By way of example, and not limitation, such computer-readable media may include non-transitory computer-readable storage media, including Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory devices (e.g., solid state memory devices).
The processor 1302 is configured to control the computing device and execute processing operations, for example executing computer program code stored in the memory 1304 to implement any of the method steps described herein. The memory 1304 stores data being read and written by the processor 1302 and may store data, described above, and/or programs for executing any of the method steps and/or processes described above. As referred to herein, a processor may include one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. The processor may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In one or more embodiments, a processor is configured to execute instructions for performing the operations and steps discussed herein. The processor 1302 may be considered to comprise any of the modules described above. Any operations described as being implemented by a module may be implemented as a method by a computer, e.g. by the processor 1302.

The memory 1304 and the processor 1302 may be collectively configured to provide an embedding mixer 608 arranged to perform the step of determining the new edit embeddings (S. 1408). The memory 1304 and the processor 1302 may be collectively configured to provide the diffusion model. The memory 1304 and the processor 1302 may be collectively configured to provide a cross-attention processor 614 arranged to generate the cross-attention maps. The memory 1304 and the processor 1302 may be collectively configured to provide a tokenizer 702 arranged to convert the base and edit prompts to base and edit tokens and a text encoder 606 arranged to convert the base and edit tokens to base and edit embeddings. The memory 1304 and the processor 1302 may be collectively configured to provide a mask and alignment index vector module 1002 arranged to compute a mask vector and an index vector based on the base and edit tokens.

The display unit 1310 may display a representation of data stored by the computing device, enabling a user to interact with the apparatus 1300 by e.g. drag and drop or selection interaction, and/or any other output described above, and may also display a cursor and dialog boxes and screens enabling interaction between a user and the programs and data stored on the computing device. The input mechanisms 1308 may enable a user to input data and instructions to the computing device, such as enabling a user to input any user input described above.

The network interface (network I/F) 1306 may be connected to a network, such as the Internet, and is connectable to other such computing devices via the network. The network I/F 1306 may control data input/output from/to other apparatus via the network. Other peripheral devices, such as a microphone, speakers, printer, power supply unit, fan, case, scanner, trackball, etc., may be included in the computing device.

Methods embodying the present invention may be carried out on a computing device/apparatus 1300 such as that illustrated in Figure 13. Such a computing device need not have every component illustrated in Figure 13 and may be composed of a subset of those components. For example, the apparatus 1300 may comprise the processor 1302 and the memory 1304 connected to the processor 1302. Or the apparatus 1300 may comprise the processor 1302, the memory 1304 connected to the processor 1302, and the display 1310. A method embodying the present invention may be carried out by a single computing device in communication with one or more data storage servers via a network. The computing device may itself be a data storage server storing at least a portion of the data.

A method embodying the present invention may be carried out by a plurality of computing devices operating in cooperation with one another. One or more of the plurality of computing devices may be a data storage server storing at least a portion of the data.

The invention may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention may be implemented as a computer program or computer program product, i.e., a computer program tangibly embodied in a non-transitory information carrier, e.g., in a machine-readable storage device, or in a propagated signal, for execution by, or to control the operation of, one or more hardware modules.

A computer program may be in the form of a stand-alone program, a computer program portion or more than one computer program and may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a data processing environment. A computer program may be deployed to be executed on one module or on multiple modules at one site or distributed across multiple sites and interconnected by a communication network.

Method steps of the invention may be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Apparatus of the invention may be implemented as programmed hardware or as special purpose logic circuitry, including e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions coupled to one or more memory devices for storing instructions and data.

The above-described embodiments of the present invention may advantageously be used independently of any other of the embodiments or in any feasible combination with one or more others of the embodiments. While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatuses described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.
Claims:
1. A computer-implemented method for image editing, the method comprising:
obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image;
converting the base and edit prompts to base and edit embeddings, respectively;
repeating, for a plurality of iterations, the steps of:
determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings;
inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and
inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process;
converting the edit latent to the edited image; and
outputting the edited image.

2. The computer-implemented method of claim 1, wherein the weight is dependent on the time step such that: at earlier time steps, the base and edit embeddings mix less than at later time steps.

3. The computer-implemented method of claim 1 or claim 2, wherein the weight is further dependent on a time invariant parameter which controls how much the base and edit embeddings mix.

4. The computer-implemented method of any preceding claim, wherein the converting of the base and edit prompts to the base and edit embeddings involves converting the base and edit prompts to base and edit tokens and converting the base and edit tokens to base and edit embeddings.

5. The computer-implemented method of claim 4, wherein the new edit embeddings are further based on a mask vector and an index vector computed based on the base and edit tokens.

6. The computer-implemented method of any preceding claim, wherein the inputting of the base embeddings into the diffusion model in the base reverse process and the inputting of the new edit embeddings into the diffusion model in the edit reverse process overlap in time.

7. The computer-implemented method of any preceding claim, wherein the base prompt is derived from an image.

8. The computer-implemented method of any preceding claim, wherein the base prompt and/or the edit prompt are obtained from a user input.

9. A computer program which, when run on a computer, causes the computer to carry out a method comprising:
obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image;
converting the base and edit prompts to base and edit embeddings, respectively;
repeating, for a plurality of iterations, the steps of:
determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings;
inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and
inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process;
converting the edit latent to the edited image; and
outputting the edited image.

10. An information processing apparatus comprising a memory and a processor connected to the memory, wherein the processor is configured to perform a method, the method comprising:
obtaining a base prompt indicating a base image and an edit prompt indicating an edit to be made to the base image;
converting the base and edit prompts to base and edit embeddings, respectively;
repeating, for a plurality of iterations, the steps of:
determining new edit embeddings based on: the base and edit embeddings, a time step relating to the iteration, and a weight dependent on the time step wherein the weight controls mixing of the base and edit embeddings;
inputting the base embeddings into a diffusion model in a base reverse process arranged to update a base latent relating to the base image; and
inputting the new edit embeddings into the diffusion model in an edit reverse process arranged to update an edit latent relating to an edited image, wherein cross-attention maps generated from the diffusion model in the base reverse process are input into the diffusion model in the edit reverse process;
converting the edit latent to the edited image; and
outputting the edited image.

Documents

Application Documents

# Name Date
1 202411024596-STATEMENT OF UNDERTAKING (FORM 3) [27-03-2024(online)].pdf 2024-03-27
2 202411024596-POWER OF AUTHORITY [27-03-2024(online)].pdf 2024-03-27
3 202411024596-FORM 1 [27-03-2024(online)].pdf 2024-03-27
4 202411024596-DRAWINGS [27-03-2024(online)].pdf 2024-03-27
5 202411024596-DECLARATION OF INVENTORSHIP (FORM 5) [27-03-2024(online)].pdf 2024-03-27
6 202411024596-COMPLETE SPECIFICATION [27-03-2024(online)].pdf 2024-03-27
7 202411024596-Power of Attorney [21-06-2024(online)].pdf 2024-06-21
8 202411024596-Covering Letter [21-06-2024(online)].pdf 2024-06-21