Abstract: The present disclosure performs image in-painting with controlled text generation to overcome challenges that persist in traditional diffusion-based methods, especially when generating textual content within an image with complex font attributes. In the present disclosure, initially, an input image, a textual prompt, and a plurality of control parameters are given as input. Further, a character mask and a conditional mask are extracted based on the inputs. Finally, accurate customized textual images are generated based on the character mask and the conditional mask using a textual image generating diffusion model. The textual image generating diffusion model generates an intermediate image based on the input image and a random gaussian noise. This intermediate image is iteratively refined to generate a latent vector image, and the accurate customized textual image is generated from the latent vector image using a trained customized character map-guided consistency model. [To be published with FIG. 2]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention: METHOD AND SYSTEM FOR DIFFUSION MODELS BASED GENERATION OF CUSTOMIZED TEXTUAL IMAGES
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001]
The disclosure herein generally relates to the field of image processing and, more particularly, to a method and system for diffusion models based generation of customized textual images.
BACKGROUND
[002]
The domain of text-to-image synthesis has witnessed remarkable advancements, with diffusion models emerging as a pivotal paradigm in this domain. The generation of textual images has diverse applications in industries such as entertainment, advertising, education, and product packaging. Creating high-quality text images in diverse formats such as posters, book covers, etc., conventionally requires professional skills and iterative design processes, underscoring the significance of automated solutions. Traditional methods involving manual labor often yield unnatural artifacts due to complex background textures and lighting variations. Current efforts to enhance text rendering quality have turned to diffusion models, exemplified by pioneering frameworks.
[003]
Despite these successes, existing models predominantly focus on text encoders, lacking comprehensive control over the generation process. Current works, such as GlyphDraw and TextDiffuser, aim to enhance control by conditioning on the location and structures of Chinese characters and English characters, respectively. However, the limitation of not supporting multiple text bounding-box generation restricts the applicability of GlyphDraw to various text image scenarios, such as posters and book covers. TextDiffuser addresses the challenges in creating multiple text boxes within images, but still fails in the generation of dense and small text. Hence, there is a challenge in generating accurate textual images.
SUMMARY
[004]
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for diffusion models based generation of customized textual images is provided. The method includes receiving, via one or more hardware
processors,
a data comprising an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of control parameters for governing manipulation of font colour, font type, and background of the input image. Further, the method includes generating, via the one or more hardware processors, a character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique. Furthermore, the method includes generating, via the one or more hardware processors, a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer. Furthermore, the method includes generating, via the one or more hardware processors, a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer. Finally, the method includes generating, via the one or more hardware processors, a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model by: (i) initializing the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask, (ii) generating an intermediate image based on the input image and the random gaussian noise, (iii) iteratively refining the intermediate image for a predefined number of timesteps to generate a latent vector image, and (iv) generating the customized textual image from the latent vector image using a trained customized character map-guided consistency model.
[005]
In another aspect, a system for diffusion models based generation of customized textual images is provided. The system includes at least one memory storing programmed instructions, one or more Input/Output (I/O) interfaces, and one or more hardware processors operatively coupled to the at least one memory, wherein the one or more hardware processors are configured by the programmed instructions to receive
a data comprising an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of control parameters for governing manipulation of font colour, font type, and background of the input image. Further, the one or more hardware processors are configured by the programmed instructions to generate a character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique. Furthermore, the one or more hardware processors are configured by the programmed instructions to generate a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer. Furthermore, the one or more hardware processors are configured by the programmed instructions to generate a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer. Finally, the one or more hardware processors are configured by the programmed instructions to generate a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model by: (i) initializing the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask, (ii) generating an intermediate image based on the input image and the random gaussian noise, (iii) iteratively refining the intermediate image for a predefined number of timesteps to generate a latent vector image, and (iv) generating the customized textual image from the latent vector image using a trained customized character map-guided consistency model.
[006]
In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer program for diffusion models based generation of customized textual images is provided. The computer readable program, when executed on a computing device, causes the computing device to receive a data comprising an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of
control parameters for governing manipulation of font colour, font type, and
background of the input image. Further, the computer readable program, when executed on a computing device, causes the computing device to generate a character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to generate a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to generate a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer. Finally, the computer readable program, when executed on a computing device, causes the computing device to generate a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model by: (i) initializing the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask, (ii) generating an intermediate image based on the input image and the random gaussian noise, (iii) iteratively refining the intermediate image for a predefined number of timesteps to generate a latent vector image, and (iv) generating the customized textual image from the latent vector image using a trained customized character map-guided consistency model.
[007]
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008]
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[009]
FIG. 1A is a functional block diagram of a system for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
[0010]
FIG. 1B illustrates modules of the system for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
[0011]
FIG. 1C illustrates a conditional mask based textual image generating diffusion model, in accordance with some embodiments of the present disclosure.
[0012]
FIG. 2 illustrates a flow diagram for a processor implemented method for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
[0013]
FIG. 3 illustrates an example flow diagram for character mask and conditional mask generation for the processor implemented method for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
[0014]
FIG. 4 illustrates an example Modified Consistency Decoder architecture (customized character map-guided consistency model) for the processor implemented method for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[0015]
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are
described herein, modifications, adaptations, and other implementations are
possible without departing from the spirit and scope of the disclosed embodiments.
[0016]
Recent breakthroughs in diffusion models offer distinct advantages over traditional generative adversarial network (GAN)-based approaches. Notably, diffusion models provide enhanced stability throughout the training phase, eliminating the need for intricate adversarial training processes. Moreover, these models provide meticulous control over the quality and diversity of generated content during the diffusion process. In contrast to GAN-centric methodologies, diffusion models leverage the semantic richness inherent in textual prompts. Despite the significant progress made in leveraging the semantic richness of textual prompts for image synthesis, challenges persist in traditional diffusion-based methods, especially when it comes to generating textual content within the image with complex font attributes. For example, the two shortcomings of the conventional approaches are: (1) TextDiffuser does not provide explicit control during textual image generation; only the spatial positioning control is available while generating images with certain regions allocated for generation of text; and (2) in cases where the provided layout consists of small-sized characters with respect to image dimensions, the model generates distorted characters, which are not visually clear.
[0017]
To overcome the challenges of the conventional approaches, embodiments herein provide a method and system for diffusion models based generation of customized textual images. The objective of the present disclosure is to ensure that the resulting merged image exhibits a high degree of harmonization and photorealism. To achieve this objective and to fill the gap in the generation of realistic images, two shortcomings of the conventional methods have been identified and rectified in the present disclosure using a trained customized character map-guided consistency model. The present disclosure generates images or performs image in-painting with controlled text generations. This control extends to font attributes such as type, size, color, and background, all of which are seamlessly integrated into a given reference image layout (in case of image in-painting) as shown in FIG. 1C. Initially, an input image, a textual prompt, and a plurality of control parameters are given as input to the system. Further, a character mask and a
conditional mask are extracted based on the inputs. Finally, accurate customized textual images are generated based on the character mask and the conditional mask using a textual image generating diffusion model. The textual image generating diffusion model generates an intermediate image based on the input image and a random gaussian noise. This intermediate image is iteratively refined to generate a latent vector image. The accurate customized textual image is generated from the latent vector image using a trained customized character map-guided consistency model.
[0018]
Referring now to the drawings, more particularly to FIG. 1A through FIG. 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and/or method.
[0019]
FIG. 1A is a functional block diagram of a system 100 for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure. The system 100 includes or is otherwise in communication with hardware processors 102, at least one memory such as a memory 104, and an Input/Output (I/O) interface 112. The hardware processors 102, memory 104, and the I/O interface 112 may be coupled by a system bus such as a system bus 108 or a similar mechanism. In an embodiment, the hardware processors 102 can be one or more hardware processors.
[0020]
The I/O interface 112 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a printer, and the like. Further, the I/O interface 112 may enable the system 100 to communicate with other devices, such as web servers and external databases.
[0021]
The I/O interface 112 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as
Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface 112
may include one or more ports for connecting several computing systems with one another or to another server computer. The I/O interface 112 may include one or more ports for connecting several devices to one another or to another server.
[0022]
The one or more hardware processors 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, node machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 102 are configured to fetch and execute computer-readable instructions stored in memory 104.
[0023]
The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, memory 104 includes a plurality of modules 106. Memory 104 also includes a data repository (or repository) 110 for storing data processed, received, and generated by the plurality of modules 106.
[0024]
The plurality of modules 106 includes programs or coded instructions that supplement applications or functions performed by the system 100 for diffusion models based generation of customized textual images. The plurality of modules 106, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules 106 may also be used as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 106 can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 102, or by a combination thereof. The plurality of modules 106 can include various sub-modules (not shown). The plurality of modules 106 may include computer-readable instructions that supplement applications or functions performed by the system 100
for diffusion models based generation of customized textual images. For example,
the plurality of modules includes a character mask generation module 122 (shown in FIG. 1B), a generation mask generation module 124 (shown in FIG. 1B), a conditional mask generation module 126 (shown in FIG. 1B) and a customized textual image generation module 128 (shown in FIG. 1B). The customized textual image generation module 128 includes a text diffuser initialization module 128A (shown in FIG. 1B), an intermediate image generation module 128B (shown in FIG. 1B), an intermediate image refining module 128C (shown in FIG. 1B) and a customized textual image generation module 128D (shown in FIG. 1B).
[0025]
FIG. 1B illustrates modules of the system 100 for diffusion models based generation of customized textual images, in accordance with some embodiments of the present disclosure.
[0026]
The data repository (or repository) 110 may include a plurality of abstracted pieces of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 106.
[0027]
Although the data repository 110 is shown internal to the system 100, it will be noted that, in alternate embodiments, the data repository 110 can also be implemented external to the system 100, where the data repository 110 may be stored within a database (repository 110) communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1A) and/or existing data may be modified and/or non-useful data may be deleted from the database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). The working of the components of the system 100 is explained with reference to the method steps depicted in FIG. 2.
[0028]
FIG. 2 is an exemplary flow diagram illustrating a method 200 for diffusion models based generation of customized textual images implemented by the system of FIG. 1A and 1B, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more data storage
devices or the memory 104 operatively coupled to the one or more hardware
processor(s) 102 and is configured to store instructions for execution of steps of the method 200 by the one or more hardware processors 102. The steps of method 200 of the present disclosure will now be explained with reference to the components or blocks of system 100 as depicted in FIG. 1A and 1B and the steps of flow diagram as depicted in FIG. 2. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. Method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof.
[0029]
Now referring to FIG. 2, at step 202 of the method 200, the one or more hardware processors 102 are configured by the programmed instructions to receive data including an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of control parameters for governing manipulation of font color, font type, and background. The textual prompt comprises a plurality of requirements pertaining to customization. For example, to generate an image of a bear standing near a sea beach alongside a signboard that says 'Hello World' in red and blue colors, the textual prompt is "A brown bear standing near a signboard that says 'Hello World' on a sea beach" and the JSON text is {"Hello": ["red", "Arial"], "World": ["blue", "Arial"]}.
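For illustration only, the received data may be organized as in the following Python sketch; the field names and the dictionary layout are assumptions made for readability and are not part of the disclosed method.

```python
# Illustrative sketch of the received data (field names are assumptions).
input_data = {
    "input_image": "bear_on_beach.png",        # reference image (used for in-painting)
    "textual_prompt": "A brown bear standing near a signboard that says "
                      "'Hello World' on a sea beach",
    "roi_mask": "signboard_roi_mask.png",      # mask representing the Region of Interest (RoI)
    "control_parameters": {                    # per-word font colour and font type
        "Hello": ["red", "Arial"],
        "World": ["blue", "Arial"],
    },
}
```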
[0030]
For example, given a textual prompt P_t and a mask M representing a Region of Interest (RoI) where required texts T_t need to be generated in accordance with prompt P_t, the task is to generate an image x. This generated image should incorporate the text F(T_t) within the designated region, where F represents a set of functions containing control parameters that govern the manipulation of font color, type, and background.
[0031]
For example, in the first stage of the two-stage pipeline of the present disclosure as shown in FIG. 3, two masks, namely, a character mask M and a conditional mask C, are obtained based on the plurality of control parameters provided for F, which is explained further in conjunction with step 204 through step 208.
[0032]
At step 204 of the method 200, the character mask generation module 122, when executed by the one or more hardware processors 102, is configured by the programmed instructions to generate the character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique. The character mask defines the spatial position of T_t, where a rectangular box is allotted for each character generation.
[0033] The steps for generating the character mask pertaining to the input image based on the plurality of control parameters using the character mask generation technique include the following. Initially, extraction of at least one text to be written on the RoI of the input image from the textual prompt using lexical filtering is performed. Further, a bounding box of the extracted text is predicted using a Layout Transformer. Finally, the character mask is generated based on the bounding box using a renderer, wherein character regions are marked with positive values and the non-character regions are marked with zeros.
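A minimal Python sketch of these three steps is given below; the `layout_model` and `renderer` objects are hypothetical placeholders for the Layout Transformer and the renderer, not a published API, and the quoting convention used for lexical filtering is an assumption.

```python
import re
import numpy as np

def generate_character_mask(prompt, canvas_hw, layout_model, renderer):
    """Sketch of the character mask generation technique (components are illustrative)."""
    height, width = canvas_hw

    # 1. Lexical filtering: extract the text to be written on the RoI (assumed to be quoted).
    texts = re.findall(r"'([^']+)'", prompt)                 # e.g. ["Hello World"]

    # 2. Predict bounding boxes for the extracted text with a Layout Transformer.
    word_boxes = layout_model.predict(texts, canvas=(width, height))

    # 3. Render one rectangle per character: character regions positive, background zero.
    mask = np.zeros((height, width), dtype=np.float32)
    for x0, y0, x1, y1 in renderer.character_boxes(texts, word_boxes):
        mask[y0:y1, x0:x1] = 1.0
    return mask, word_boxes
```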
[0034]
At step 206 of the method 200, the generation mask generation module 124, when executed by the one or more hardware processors 102 is configured by the programmed instructions to generate a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer.
[0035]
At step 208 of the method 200, the conditional mask generation module 126 when executed by the one or more hardware processors 102 is configured by the programmed instructions to
generate a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer. For example, the conditional mask C specifies the necessary attributes of T_t based on the functions in F, ensuring that the texts are rendered accordingly. This stage takes the textual prompt P_t as input, where T_t is specified in single quotes. After obtaining T_t, a Layout Transformer based architecture is used to predict the bounding box of T_t in the mask image of the desired dimension. Subsequently, the character mask M is created by using the bounding box information B. This information is also utilized by the rendering module for each character and combined with the control parameters defined by F to obtain the conditional mask, C = F(B, T_t).
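As a sketch of C = F(B, T_t), the renderer may draw each word inside its predicted bounding box with the requested font colour and font type, for example using PIL; the font-file naming, the canvas size, and the helper signature below are assumptions, not the disclosed implementation.

```python
from PIL import Image, ImageDraw, ImageFont

def render_conditional_mask(boxes, words, control_params, size=(512, 512)):
    """Sketch of C = F(B, T_t): draw each word inside its bounding box with the
    requested colour and font type (font-file naming is an assumption)."""
    canvas = Image.new("RGB", size, color="black")
    draw = ImageDraw.Draw(canvas)
    for (x0, y0, x1, y1), word in zip(boxes, words):
        colour, font_name = control_params[word]              # e.g. ["red", "Arial"]
        font = ImageFont.truetype(f"{font_name.lower()}.ttf", size=max(y1 - y0, 8))
        draw.text((x0, y0), word, fill=colour, font=font)
    return canvas
```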
[0036]
At step 210 of the method 200, the customized textual image generation module 128, when executed by the one or more hardware processors 102, is configured by the programmed instructions to generate a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model. At step 212A of the method 200, the text diffusion model initialization module 128A, when executed by the one or more hardware processors, is configured by the programmed instructions to initialize the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask. Further, at step 212B of the method 200, the intermediate image generation module 128B, when executed by the one or more hardware processors, is configured by the programmed instructions to generate an intermediate image based on the input image and the random gaussian noise. Furthermore, at step 212C of the method 200, the intermediate image refining module 128C, when executed by the one or more hardware processors, is configured by the programmed instructions to iteratively refine the intermediate image for a predefined number of timesteps to generate a latent vector image. Finally, at step 212D of the method 200, the customized textual image generation module 128D, when executed by the one or more hardware processors, is configured by the programmed instructions to generate the customized textual
image from the latent vector image using a trained customized character map
-guided consistency model shown in FIG. 4.
[0037]
The customized character map-guided consistency model comprises a trainable ControlNet architecture built over a pre-trained consistency decoder. A character map is used in the pre-trained consistency decoder to generate the customized textual image from the latent vector image. The customized character map-guided consistency model utilizes the character mask as a control parameter for optimal customized textual image generation, and the control parameter preserves the identity and style of input characters, generating realistic and diverse small characters within the latent space.
[0038]
For example, the conditional images from q_C may belong to diverse domains. To ensure coherence in generation, the self-attention map of x_t must encapsulate the essence of c_{i,t}, where c_{i,t} represents a sample from q_C at timestep t. Given that, in the generation task at T, x_T initializes from noise, a pronounced essence of c_{i,t} in x_T is desirable for creating the self-attention map during initialization. However, as the timestep progresses in the reverse denoising process, it becomes crucial to diminish the essence of c_{i,t} to avoid sharp boundaries between the rendered image and the conditional image, facilitating harmonious integration. To control character properties, including font types, color, and background, the forward propagated conditional image c_{i,t} from q_C at timestep t is introduced into the reconstructing image x_t. This is achieved using the weighting function f_t(x_t, c_{i,t}, C_{i,t}), where C_{i,t} is the binary mask representing the conditional image region. It is worth noting that, for enhanced harmonization, an additional threshold may be introduced: after a certain number of timesteps, no injection from q_C should occur. Nevertheless, defining an appropriate weighting function f_t can simulate the thresholded behavior without causing abrupt changes in the reverse denoising process.
w_{i,t} = f_t(x_t, c_{i,t}, C_{i,t})    ......(1)
x_t' = x_t ⊙ (1 − w_{i,t}) + c_{i,t} ⊙ w_{i,t}    ......(2)
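A minimal sketch of the blending in equations (1) and (2) is given below, assuming a linearly decaying choice for the weighting function f_t; the actual reweighting function of the disclosure may differ.

```python
import torch

def inject_conditional(x_t, c_it, C_it, t, T, w_max=0.8):
    """Sketch of equations (1)-(2): blend the forward-propagated conditional image
    c_it into the reconstructing image x_t inside the binary mask C_it.
    The linear decay over timesteps is an assumed choice of f_t."""
    w_it = C_it * w_max * (t / T)                 # eq. (1): weight shrinks as t goes from T to 0
    return x_t * (1.0 - w_it) + c_it * w_it       # eq. (2): harmonized update x_t'
```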
[0039]
The present disclosure employs Variational Autoencoder (VAE) networks to transform images into low-dimensional latent spaces, to enhance computational efficiency during training of the diffusion models. However, when images are compressed into lower-dimensional latent spaces, fine details, such as the small-sized characters, might not be adequately preserved. This can result in the generated images not accurately reproducing the original data, which could be problematic in applications where precise details are crucial. To address this issue, a novel decoder is proposed that can generate high-quality small characters from the latent representations learned by a stable diffusion model. The present disclosure introduces a Character Map-guided Consistency Model (CM) that capitalizes on the semantic information of characters, ensuring consistency between the latent and output spaces. The intuition is that the regions containing small characters pose challenges in reconstruction. By incorporating the character guidance map, initially utilized for text generation, into the ControlNet architecture, the decoder gains additional guidance. The CM decoder proves effective in preserving the identity and style of input characters, generating realistic and diverse small characters within the latent space.
[0040]
The consistency diffusion model consists of a decoder network f_θ that takes as input a noise tensor z_t sampled from a Gaussian distribution N(0, I), and outputs an image x_0 that corresponds to the starting point of the diffusion path trajectory. The model can generate images in one step by sampling a noise tensor z_T from the final timestep of the diffusion process and passing it through the decoder network f_θ. Alternatively, it can also generate images in multiple steps by sampling noise tensors from intermediate timesteps of the diffusion process and using a consistency model to refine the output at each step. This allows the model to trade off between speed and quality of generation.
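The one-step and multi-step generation modes described above can be sketched as follows; `f_theta` stands for the consistency decoder, and the re-noising schedule in the multi-step loop is an assumption rather than the disclosed procedure.

```python
import torch

def sample_one_step(f_theta, shape, T):
    """One-step generation: decode a noise tensor drawn at the final timestep T."""
    z_T = torch.randn(shape)
    return f_theta(z_T, T)

def sample_multi_step(f_theta, shape, timesteps):
    """Multi-step generation: repeatedly re-noise the current estimate to an
    intermediate timestep and refine it (schedule is illustrative)."""
    x0 = f_theta(torch.randn(shape), timesteps[0])
    for t in timesteps[1:]:
        z_t = x0 + t * torch.randn(shape)   # sample a noise tensor at intermediate timestep t
        x0 = f_theta(z_t, t)                # refine the output with the consistency decoder
    return x0
```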
[0041]
For example, given the pre-trained consistency model DALLE-3 decoder D_θ(., .), the parameters θ are frozen and a ControlNet model C_φ with trainable parameters φ is introduced, as shown in FIG. 4 (Modified Consistency Decoder architecture). The architecture takes the latent vector l_t as input for D_θ and the character mask M for C_φ. By adding the ControlNet architecture, a new consistency model D_{θ,φ}(., ., .) is defined and trained with the loss, as defined in equation (3), which ensures stable training.
L_C(φ) = E[ λ(t_n) d( D_{θ,φ}(l_{t_{n+1}}, t_{n+1}, M), D⁻_{θ,φ}(l_{t_n}, t_n, M) ) ]    ......(3)
Here, E[.] denotes the expectation over all random variables and d(x, y) is the l2 squared distance. {θ, φ}⁻ ← stopgrad({θ, φ}), and only φ is kept trainable in the process.
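The training objective of equation (3) may be sketched as below; `decoder` denotes the ControlNet-augmented consistency decoder D_{θ,φ}, `target_decoder` its stop-gradient copy, and the distance d(·,·) is taken as the l2 squared distance via `mse_loss`. The interface is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_loss(decoder, target_decoder, l_tn, l_tn1, t_n, t_n1, M, lam=1.0):
    """Sketch of equation (3): only the ControlNet parameters phi receive gradients."""
    pred = decoder(l_tn1, t_n1, M)              # D_{theta,phi}(l_{t_{n+1}}, t_{n+1}, M)
    with torch.no_grad():
        target = target_decoder(l_tn, t_n, M)   # D^-_{theta,phi}(l_{t_n}, t_n, M), stop-gradient copy
    return lam * F.mse_loss(pred, target)       # lambda(t_n) * d(., .) with d as l2 squared distance
```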
[0042]
The steps for iteratively refining the intermediate image for the predefined number of timesteps to generate the latent vector image include the following. Initially, a conditional image is generated based on the intermediate image by performing null-text inversion of the conditional mask using a Denoising Diffusion Probabilistic Models (DDPM) sampler, wherein the intermediate image is updated during each iteration. Further, a reconstructed image is generated by performing a reverse denoising process on the conditional image using the textual image generating diffusion model. Further, normalization is performed on the reconstructed image using a predetermined reweighting function. Finally, the intermediate image is updated by integrating the conditional image with the normalized reconstructed image, wherein the intermediate image obtained at the end of the predefined number of timesteps is considered as the latent vector image. The updated intermediate image is used in the next iteration.
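The refinement loop may be sketched as follows; the `pipeline` methods (`null_text_invert`, `reverse_denoise_step`, `reweight`, `weight`) are illustrative placeholders for the DDPM-based null-text inversion, the reverse denoising step, and the predetermined reweighting function, under stated assumptions.

```python
def refine_to_latent(x_t, conditional_mask, pipeline, num_timesteps):
    """Sketch of the iterative refinement that yields the latent vector image."""
    for t in reversed(range(num_timesteps)):
        # 1. Conditional image via null-text inversion of the conditional mask (DDPM sampler assumed).
        c_t = pipeline.null_text_invert(conditional_mask, timestep=t)
        # 2. Reverse denoising step of the textual image generating diffusion model.
        x_recon = pipeline.reverse_denoise_step(x_t, timestep=t)
        # 3. Normalisation with the predetermined reweighting function.
        x_recon = pipeline.reweight(x_recon, timestep=t)
        # 4. Integrate the conditional image with the normalised reconstruction.
        w_t = pipeline.weight(timestep=t)
        x_t = x_recon * (1.0 - w_t) + c_t * w_t
    return x_t   # latent vector image after the predefined number of timesteps
```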
[0043]
Experimentation: The publicly available train and test splits of the CTW-1500 dataset are used to train the consistency decoder and to evaluate the performance of the present disclosure. Additionally, a custom dataset, namely the SmallFontSize dataset, is created to showcase the effectiveness of the present disclosure on generating small-sized fonts. The custom dataset includes 200 examples of textual prompts for generating small-sized texts in the images along with spatial character maps.
[0044]
Implementation: The present disclosure utilizes the pre-trained TextDiffuser model and the SD (Stable Diffusion)-1.5 model. For the decoder, a publicly available pre-trained DALLE-3 consistency decoder is used. Since the original consistency decoder often distorts the small-sized characters in the decoded image, character map assistance is provided for its correction (Refer FIG. 4). The original
ControlNet architecture is used to modify the consistency model. The model is trained for 3500 steps over the CTW-1500 (Curve Text in the Wild) dataset, with an effective batch size of 96, using gradient accumulation. During inference, a generated image resolution of 512×512 is used, which is computed over a CFG (Classifier Free Guidance) scale of 7.5.
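For reference, the reported training and inference settings can be collected in a configuration sketch like the one below; the dictionary keys are assumptions, while the values reflect the setup stated above.

```python
# Illustrative configuration mirroring the reported setup (keys are assumptions).
config = {
    "base_models": ["TextDiffuser", "Stable Diffusion 1.5", "DALLE-3 consistency decoder"],
    "trainable_component": "ControlNet over the consistency decoder",
    "train_steps": 3500,
    "train_dataset": "CTW-1500",
    "effective_batch_size": 96,          # achieved via gradient accumulation
    "inference_resolution": (512, 512),
    "cfg_scale": 7.5,                    # Classifier Free Guidance
}
```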
[0045]
Results: To evaluate the reconstruction performance of the trained model, MSE (Mean Squared Error), PSNR (Peak Signal to Noise Ratio) and SSIM (Structural Similarity Index) metrics are used and evaluated over the CTW-1500 test set in Table I. A vanilla decoder enhancing model is also created, which worked by cropping the original image into smaller fragments and upscaling them. The Decoder Enhance model refined these fragments for characters and finally the images were merged back to the original resolution. Compared with the Controlnet-canny model, Text Diffuser and Decoder Enhance, it is evident that the CustomText decoder (the decoder of the present disclosure) performs best in all three metrics. Furthermore, to evaluate readability and to verify the quality of the reconstructed images of Table I, EasyOCR is used. Here, an exact match of individual words is taken to compute the results. Additionally, the comparison results of CustomText for generating small-sized texts in the images are shown in Table II over the SmallFontSize dataset using OCR performance and ClipScore. Although the ControlNet Consistency model (the present disclosure) outperforms other existing methods in the OCR results, the original TextDiffuser model performs better in terms of ClipScore by a small margin of 0.0015.
Table I (CTW-1500)

| Method | MSE↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Controlnet-canny | 0.033 | 17.832 | 0.656 |
| Text Diffuser | 0.031 | 17.82 | 0.6601 |
| Decoder Enhance | 0.027 | 18.17 | 0.6874 |
| Present disclosure | 0.019 | 21.33 | 0.712 |
Table II (SmallFontSize dataset)

| Method | Precision | Recall | F1 | ClipScore |
|---|---|---|---|---|
| StableDiffusion | 0.0936 | 0.1174 | 0.1041 | 0.512 |
| Controlnet-canny | 0.6332 | 0.6572 | 0.645 | 0.6321 |
| TextDiffuser | 0.792 | 0.7863 | 0.7891 | 0.6407 |
| Decoder Enhance | 0.7894 | 0.7911 | 0.7902 | 0.6407 |
| Present disclosure | 0.8131 | 0.815 | 0.814 | 0.6392 |
[0046]
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0047]
The embodiments of the present disclosure herein address the unresolved problem of generating high-quality images with customized fonts. The present disclosure can be used as a tool for incremental editing to obtain the best quality textual images such as posters, advertisements, and the like, as per the user's requirement.
[0048]
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means
like e.g., an application
-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs, GPUs and edge computing devices.
[0049]
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words "comprising," "having," "containing," and "including," and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the
appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0050]
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
WE CLAIM:
1. A processor-implemented method (200), the method comprising:
receiving (202), via one or more hardware processors, a data comprising an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of control parameters for governing manipulation of font colour, font type, and background of the input image;
generating (204), via the one or more hardware processors, a character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique;
generating (206), via the one or more hardware processors, a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer;
generating (208), via the one or more hardware processors, a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer; and
generating (210), via the one or more hardware processors, a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model by:
initializing the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask;
generating an intermediate image based on the input image and the random gaussian noise;
iteratively refining the intermediate image for a predefined number of timesteps to generate a latent vector image; and
generating the customized textual image from the latent vector image using a trained customized character map-guided consistency model.
2. The method as claimed in claim 1, wherein generating the character mask
pertaining to the input image based on the plurality of control parameters
using the character mask generation technique comprises:
extracting at least one text to be written on the RoI of the input image from the textual prompt using lexical filtering;
predicting the bounding box of the extracted text using a Layout Transformer; and
generating the character mask based on the bounding box using a renderer, wherein character regions are marked with positive values and the non-character regions are marked with zeros.
3. The method as claimed in claim 1, wherein the textual prompt comprises a plurality of requirements pertaining to customization.
4. The method as claimed in claim 1, wherein iteratively refining the intermediate image for the predefined number of timesteps to generate the latent vector image comprises:
generating a conditional image based on the intermediate image by performing null-text inversion of the conditional mask using Denoising Diffusion Probabilistic Models (DDPM) sampler, wherein the intermediate image is updated during each iteration;
generating a reconstructed image by performing reverse denoising process on the conditional image using the textual image generating diffusion model;
performing normalization on the reconstructed image using a predetermined reweighting function; and
updating the intermediate image by integrating the conditional image with the normalized reconstructed image, wherein the intermediate image obtained at the end of predefined number of timesteps is considered as the latent vector image.
5. The method as claimed in claim 1, wherein the customized character map-guided consistency model comprises a trainable controlnet architecture built over pre-trained consistency decoder, wherein a character map is used in the pre-trained consistency decoder to generate customized textual image from the latent vector image, wherein the customized character map-guided consistency model utilizes the character mask as control parameter for optimal customized textual image generation, and wherein the control parameter preserves identity and style of input characters, generating realistic and diverse small characters within the latent space.
6. A system (100) comprising:
at least one memory (104) storing programmed instructions; one or more Input /Output (I/O) interfaces (112); and one or more hardware processors (102) operatively coupled to the at least one memory (104), wherein the one or more hardware processors (102) are configured by the programmed instructions to:
receive a data comprising an input image, a textual prompt, a mask representing a Region of Interest (RoI) in the input image, a plurality of control parameters for governing manipulation of font colour, font type, and background of the input image;
generate a character mask pertaining to the input image based on the plurality of control parameters using a character mask generation technique;
generate a generation mask comprising a plurality of character regions and a plurality of non-character regions based on the character mask, wherein the plurality of character regions are marked as one and the plurality of non-character regions are marked as zero, wherein a bounding box is generated on each of the plurality of character regions using a renderer;
generate a conditional mask pertaining to the input image based on the generated bounding box and the plurality of control parameters using the renderer; and
generate a customized textual image based on the character mask and the conditional mask associated with the input image using a textual image generating diffusion model by:
initializing the textual image generating diffusion model with a random gaussian noise, the character mask, and the generation mask;
generating an intermediate image based on the input image and the random gaussian noise;
iteratively refining the intermediate image for a predefined number of timesteps to generate a latent vector image; and
generating the customized textual image from the latent vector image using a trained customized character map-guided consistency model.
7. The system of claim 6, wherein generating the character mask pertaining to
the input image based on the plurality of control parameters using the
character mask generation technique comprises:
extracting at least one text to be written on the RoI of the input image from the textual prompt using lexical filtering;
predicting the bounding box of the extracted text using a Layout Transformer; and
generating the character mask based on the bounding box using a renderer, wherein character regions are marked with positive values and the non-character regions are marked with zeros.
8. The system of claim 6, wherein the textual prompt comprises a plurality of requirements pertaining to customization.
9. The system of claim 6, wherein iteratively refining the intermediate image for the predefined number of timesteps to generate the latent vector image comprises:
generating a conditional image based on the intermediate image by performing null-text inversion of the conditional mask using Denoising
Diffusion Probabilistic Models (DDPM) sampler, wherein the intermediate image is updated during each iteration;
generating a reconstructed image by performing reverse denoising process on the conditional image using the textual image generating diffusion model;
performing normalization on the reconstructed image using a predetermined reweighting function; and
updating the intermediate image by integrating the conditional image with the normalized reconstructed image, wherein the intermediate image obtained at the end of predefined number of timesteps is considered as the latent vector image.
10. The system of claim 6, wherein the customized character map-guided consistency model comprises a trainable controlnet architecture built over pre-trained consistency decoder, wherein a character map is used in the pre-trained consistency decoder to generate customized textual image from the latent vector image, wherein the customized character map-guided consistency model utilizes the character mask as control parameter for optimal customized textual image generation, and wherein the control parameter preserves identity and style of input characters, generating realistic and diverse small characters within the latent space.